Compare commits

53 Commits

SHA1 Message Date
cf501abe10 flat layer structure 2023-10-09 10:49:26 +02:00
1827a75684 update md file 2023-10-09 10:44:33 +02:00
900c4c3402 make style 2023-10-09 10:41:18 +02:00
9300c5b5d2 update md file 2023-10-09 10:38:35 +02:00
27771500bc remove tip 2023-10-09 10:27:49 +02:00
50a5da9565 change repo_id 2023-10-08 13:23:27 +02:00
1f5685c579 rename 4 2023-10-07 12:43:29 +02:00
463cd9453d fix docstring 2023-10-07 11:39:39 +02:00
2a8c211c6a fix table 2023-10-07 11:19:26 +02:00
d9c6f53741 make fixup 2023-10-07 11:15:52 +02:00
d88de83c06 rename 3 2023-10-07 11:13:49 +02:00
2ec3d42d2c rename 2 2023-10-07 11:13:49 +02:00
c0eff767e9 rename 1 2023-10-07 11:13:49 +02:00
1476e11c1d fix doctest and copies 2023-10-07 11:13:49 +02:00
e79d0d5cfe Add _reorder_cache 2023-10-07 11:13:49 +02:00
ae1bd7cbae conversion scripts 2023-10-07 11:13:49 +02:00
82918cb13e conversion scripts 2023-10-07 11:13:49 +02:00
55bc246851 conversion scripts 2023-10-07 11:13:49 +02:00
9fc9b4d5ec style 2023-10-07 11:13:49 +02:00
30e73a86ad use present_key_value_states instead of next_decoder_cache 2023-10-07 11:13:49 +02:00
ff40971409 fix attn mask 2023-10-07 11:13:49 +02:00
7d60abfe4f UTM5 Atten 2023-10-07 11:13:48 +02:00
bed6ff888e Apply suggestions from code review (Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>) 2023-10-07 11:13:48 +02:00
c7c90a2e33 remove "returned when being computed by the model" 2023-10-07 11:13:48 +02:00
cb69f03f6a style 2023-10-07 11:13:48 +02:00
9c1c6f75dc no more Kosmos2Tokenizer 2023-10-07 11:13:48 +02:00
1c449b728e Update docs/source/en/model_doc/kosmos-2.md (Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>) 2023-10-07 11:13:48 +02:00
35b6cdb958 fix docstring 2023-10-07 11:13:48 +02:00
d5498d851c fix docstring 2023-10-07 11:13:48 +02:00
768a137d8b revert the change in _decode 2023-10-07 11:13:48 +02:00
054bcbd586 [skip ci] fix 2023-10-07 11:13:48 +02:00
a3a55750ac fix 2023-10-07 11:13:48 +02:00
9b57870294 fix 2023-10-07 11:13:48 +02:00
edd600f6b4 fix 2023-10-07 11:13:48 +02:00
e30d4537b5 update readme 2023-10-07 11:13:48 +02:00
b591255866 address review comment - 011 2023-10-07 11:13:48 +02:00
fd8c75290e address review comment - 010 2023-10-07 11:13:48 +02:00
1afbea0202 address review comment - 009 2023-10-07 11:13:48 +02:00
b12eebb13e address review comment - 008 2023-10-07 11:13:48 +02:00
7c922fddd0 address review comment - 007 2023-10-07 11:13:48 +02:00
7310702493 address review comment - 006 2023-10-07 11:13:48 +02:00
304bd5e7df address review comment - 005 2023-10-07 11:13:48 +02:00
6cca7b451e address review comment - 004 2023-10-07 11:13:48 +02:00
02a9842c31 fix 2023-10-07 11:13:48 +02:00
26194d6918 Apply suggestions from code review (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>) 2023-10-07 11:13:48 +02:00
5025290277 style 2023-10-07 11:13:48 +02:00
78867ebec5 address review comment - 003 2023-10-07 11:13:47 +02:00
b0309ce167 address review comment - 002 2023-10-07 11:13:47 +02:00
d642e1e914 address review comment - 001 2023-10-07 11:13:47 +02:00
600e234507 update 2023-10-07 11:13:47 +02:00
632e3f2459 update 2023-10-07 11:13:47 +02:00
d258f6194d update 2023-10-07 11:13:47 +02:00
00ad3b0b80 Add KOSMOS-2 model 2023-10-07 11:13:47 +02:00
32 changed files with 5207 additions and 0 deletions

View File

@ -383,6 +383,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi.
1. **[Jukebox](https://huggingface.co/docs/transformers/model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever.
1. **[KOSMOS-2](https://huggingface.co/docs/transformers/main/model_doc/kosmos-2)** (from Microsoft Research Asia) released with the paper [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei.
1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou.
1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (from Microsoft Research Asia) released with the paper [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei.

View File

@ -359,6 +359,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi.
1. **[Jukebox](https://huggingface.co/docs/transformers/model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever.
1. **[KOSMOS-2](https://huggingface.co/docs/transformers/main/model_doc/kosmos-2)** (from Microsoft Research Asia) released with the paper [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei.
1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou.
1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (from Microsoft Research Asia) released with the paper [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei.

View File

@ -331,6 +331,7 @@ conda install -c huggingface transformers
1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (Salesforce से) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. द्वाराअनुसंधान पत्र [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) के साथ जारी किया गया
1. **[Jukebox](https://huggingface.co/docs/transformers/model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever.
1. **[KOSMOS-2](https://huggingface.co/docs/transformers/main/model_doc/kosmos-2)** (from Microsoft Research Asia) released with the paper [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei.
1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou.
1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (माइक्रोसॉफ्ट रिसर्च एशिया से) साथ देने वाला पेपर [लेआउटएलएमवी3: यूनिफाइड टेक्स्ट और इमेज मास्किंग के साथ दस्तावेज़ एआई के लिए पूर्व-प्रशिक्षण](https://arxiv.org/abs/2204.08387) युपन हुआंग, टेंगचाओ लव, लेई कुई, युटोंग लू, फुरु वेई द्वारा पोस्ट किया गया।

View File

@ -393,6 +393,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (Salesforce から) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. から公開された研究論文 [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500)
1. **[Jukebox](https://huggingface.co/docs/transformers/model_doc/jukebox)** (OpenAI から) Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever から公開された研究論文: [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf)
1. **[KOSMOS-2](https://huggingface.co/docs/transformers/main/model_doc/kosmos-2)** (from Microsoft Research Asia) released with the paper [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei.
1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (Microsoft Research Asia から) Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou から公開された研究論文: [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318)
1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (Microsoft Research Asia から) Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou から公開された研究論文: [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740)
1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (Microsoft Research Asia から) Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei から公開された研究論文: [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387)

View File

@ -308,6 +308,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (Salesforce 에서 제공)은 Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi.의 [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500)논문과 함께 발표했습니다.
1. **[Jukebox](https://huggingface.co/docs/transformers/model_doc/jukebox)** (OpenAI 에서) Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever 의 [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) 논문과 함께 발표했습니다.
1. **[KOSMOS-2](https://huggingface.co/docs/transformers/main/model_doc/kosmos-2)** (from Microsoft Research Asia) released with the paper [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei.
1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (Microsoft Research Asia 에서) Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou 의 [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) 논문과 함께 발표했습니다.
1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (Microsoft Research Asia 에서) Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou 의 [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) 논문과 함께 발표했습니다.
1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (Microsoft Research Asia 에서) Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei 의 [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) 논문과 함께 발표했습니다.

View File

@ -332,6 +332,7 @@ conda install -c huggingface transformers
1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (来自 Salesforce) 伴随论文 [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) 由 Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi 发布。
1. **[Jukebox](https://huggingface.co/docs/transformers/model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever.
1. **[KOSMOS-2](https://huggingface.co/docs/transformers/main/model_doc/kosmos-2)** (from Microsoft Research Asia) released with the paper [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei.
1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (来自 Microsoft Research Asia) 伴随论文 [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) 由 Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou 发布。
1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (来自 Microsoft Research Asia) 伴随论文 [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) 由 Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou 发布。
1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (来自 Microsoft Research Asia) 伴随论文 [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) 由 Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei 发布。

View File

@ -344,6 +344,7 @@ conda install -c huggingface transformers
1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi.
1. **[Jukebox](https://huggingface.co/docs/transformers/model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever.
1. **[KOSMOS-2](https://huggingface.co/docs/transformers/main/model_doc/kosmos-2)** (from Microsoft Research Asia) released with the paper [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei.
1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou.
1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (from Microsoft Research Asia) released with the paper [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei.

View File

@ -358,6 +358,8 @@
  title: I-BERT
- local: model_doc/jukebox
  title: Jukebox
- local: model_doc/kosmos-2
  title: KOSMOS-2
- local: model_doc/led
  title: LED
- local: model_doc/llama

View File

@ -157,6 +157,7 @@ Flax), PyTorch, and/or TensorFlow.
| [Informer](model_doc/informer) | ✅ | ❌ | ❌ |
| [InstructBLIP](model_doc/instructblip) | ✅ | ❌ | ❌ |
| [Jukebox](model_doc/jukebox) | ✅ | ❌ | ❌ |
| [KOSMOS-2](model_doc/kosmos-2) | ✅ | ❌ | ❌ |
| [LayoutLM](model_doc/layoutlm) | ✅ | ✅ | ❌ |
| [LayoutLMv2](model_doc/layoutlmv2) | ✅ | ❌ | ❌ |
| [LayoutLMv3](model_doc/layoutlmv3) | ✅ | ✅ | ❌ |

View File

@ -0,0 +1,98 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# KOSMOS-2
## Overview
The KOSMOS-2 model was proposed in [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei.
KOSMOS-2 is a Transformer-based causal language model and is trained using the next-word prediction task on a web-scale
dataset of grounded image-text pairs [GRIT](https://huggingface.co/datasets/zzliang/GRIT). The spatial coordinates of
the bounding boxes in the dataset are converted to a sequence of location tokens, which are appended to their respective
text spans. The data format is similar to “hyperlinks” that connect the object regions in an image to their text span in
the corresponding caption.
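As a rough illustration of the location-token format described above, the sketch below encodes a normalized bounding box as two patch-index tokens and decodes it again. The 32x32 grid, the floor-based binning, and the cell-center decoding are assumptions inferred from the sample output in the Example section of this file. `box_to_location_tokens` and `location_tokens_to_box` are hypothetical helper names, not part of the `transformers` API.

```python
import math
import re

GRID = 32  # assumed number of patches per side, inferred from the sample output


def box_to_location_tokens(box, grid=GRID):
    """Encode a normalized (x1, y1, x2, y2) box as two patch-index tokens."""
    x1, y1, x2, y2 = box

    def cell(value):
        # Clamp so that a coordinate of exactly 1.0 stays inside the grid.
        return min(grid - 1, max(0, math.floor(value * grid)))

    upper_left = cell(y1) * grid + cell(x1)
    lower_right = cell(y2) * grid + cell(x2)
    return f"<object><patch_index_{upper_left:04d}><patch_index_{lower_right:04d}></object>"


def location_tokens_to_box(text, grid=GRID):
    """Decode the first two patch-index tokens in `text` to cell-center coordinates."""
    upper_left, lower_right = (int(i) for i in re.findall(r"<patch_index_(\d+)>", text)[:2])
    x1, y1 = (upper_left % grid + 0.5) / grid, (upper_left // grid + 0.5) / grid
    x2, y2 = (lower_right % grid + 0.5) / grid, (lower_right // grid + 0.5) / grid
    return (x1, y1, x2, y2)


# Box taken from the "a snowman" entity in the Example section.
tokens = box_to_location_tokens((0.390625, 0.046875, 0.984375, 0.828125))
print(tokens)
print(location_tokens_to_box(tokens))
```

Under these assumptions the round trip reproduces the snowman box from the example, which suggests (but does not prove) that the model's location vocabulary indexes a 32x32 grid in row-major order.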
The abstract from the paper is the following:
*We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifically, we represent refer expressions as links in Markdown, i.e., ``[text span](bounding boxes)'', where object descriptions are sequences of location tokens. Together with multimodal corpora, we construct large-scale data of grounded image-text pairs (called GrIT) to train the model. In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and performing in-context learning), Kosmos-2 integrates the grounding capability into downstream applications. We evaluate Kosmos-2 on a wide range of tasks, including (i) multimodal grounding, such as referring expression comprehension, and phrase grounding, (ii) multimodal referring, such as referring expression generation, (iii) perception-language tasks, and (iv) language understanding and generation. This work lays out the foundation for the development of Embodiment AI and sheds light on the big convergence of language, multimodal perception, action, and world modeling, which is a key step toward artificial general intelligence. Code and pretrained models are available at https://aka.ms/kosmos-2.*
## Example
```python
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, Kosmos2ForConditionalGeneration
>>> model = Kosmos2ForConditionalGeneration.from_pretrained("ydshieh/temp-testing-kosmos-2-rename-001")
>>> processor = AutoProcessor.from_pretrained("ydshieh/temp-testing-kosmos-2-rename-001")
>>> url = "https://huggingface.co/ydshieh/temp-testing-kosmos-2-rename-001/resolve/main/snowman.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> prompt = "<grounding> An image of"
>>> # set `add_eos_token=False` when doing generation
>>> inputs = processor(text=prompt, images=image, return_tensors="pt", add_eos_token=False)
>>> generated_ids = model.generate(
...     pixel_values=inputs["pixel_values"],
...     input_ids=inputs["input_ids"],
...     attention_mask=inputs["attention_mask"],
...     image_embeds=None,
...     image_embeds_position_mask=inputs["image_embeds_position_mask"],
...     use_cache=True,
...     max_new_tokens=64,
... )
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
>>> processed_text = processor.post_process_generation(generated_text, cleanup_and_extract=False)
>>> processed_text
<grounding> An image of<phrase> a snowman</phrase><object><patch_index_0044><patch_index_0863></object> warming himself by<phrase> a fire</phrase><object><patch_index_0005><patch_index_0911></object>.
>>> caption, entities = processor.post_process_generation(generated_text)
>>> caption
An image of a snowman warming himself by a fire.
>>> entities
[('a snowman', (12, 21), [(0.390625, 0.046875, 0.984375, 0.828125)]), ('a fire', (41, 47), [(0.171875, 0.015625, 0.484375, 0.890625)])]
```
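In the output above, each item in `entities` pairs a phrase with its character span in `caption` and a list of boxes. Assuming the four box values are normalized `(x_min, y_min, x_max, y_max)` fractions of the image size, a sketch for turning them into pixel coordinates might look like the following. `to_pixel_boxes` and the 640x640 image size are hypothetical and chosen only for illustration.

```python
def to_pixel_boxes(entities, width, height):
    """Scale normalized entity boxes to integer pixel coordinates."""
    result = []
    for phrase, span, boxes in entities:
        pixel_boxes = [
            (round(x1 * width), round(y1 * height), round(x2 * width), round(y2 * height))
            for x1, y1, x2, y2 in boxes
        ]
        result.append((phrase, span, pixel_boxes))
    return result


# Values copied from the example output above.
caption = "An image of a snowman warming himself by a fire."
entities = [
    ("a snowman", (12, 21), [(0.390625, 0.046875, 0.984375, 0.828125)]),
    ("a fire", (41, 47), [(0.171875, 0.015625, 0.484375, 0.890625)]),
]

# 640x640 is a hypothetical image size, not the size of the snowman image.
for phrase, (start, end), boxes in to_pixel_boxes(entities, width=640, height=640):
    # The span indexes into the cleaned caption, so slicing recovers the phrase.
    assert caption[start:end] == phrase
    print(phrase, boxes)
```

In practice the width and height would come from the loaded image (for a PIL image, `image.size`).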
This model was contributed by [Yih-Dar SHIEH](https://huggingface.co/ydshieh). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/kosmos-2).
## Kosmos2Config
[[autodoc]] Kosmos2Config
## Kosmos2ImageProcessor
[[autodoc]] Kosmos2ImageProcessor
    - preprocess
## Kosmos2Processor
[[autodoc]] Kosmos2Processor
    - __call__
## Kosmos2Model
[[autodoc]] Kosmos2Model
    - forward
## Kosmos2ForConditionalGeneration
[[autodoc]] Kosmos2ForConditionalGeneration
    - forward

View File

@ -387,6 +387,11 @@ _import_structure = {
        "JukeboxTokenizer",
        "JukeboxVQVAEConfig",
    ],
    "models.kosmos2": [
        "KOSMOS2_PRETRAINED_CONFIG_ARCHIVE_MAP",
        "Kosmos2Config",
        "Kosmos2Processor",
    ],
    "models.layoutlm": ["LAYOUTLM_PRETRAINED_CONFIG_ARCHIVE_MAP", "LayoutLMConfig", "LayoutLMTokenizer"],
    "models.layoutlmv2": [
        "LAYOUTLMV2_PRETRAINED_CONFIG_ARCHIVE_MAP",
@ -968,6 +973,7 @@ else:
    _import_structure["models.glpn"].extend(["GLPNFeatureExtractor", "GLPNImageProcessor"])
    _import_structure["models.idefics"].extend(["IdeficsImageProcessor"])
    _import_structure["models.imagegpt"].extend(["ImageGPTFeatureExtractor", "ImageGPTImageProcessor"])
    _import_structure["models.kosmos2"].append("Kosmos2ImageProcessor")
    _import_structure["models.layoutlmv2"].extend(["LayoutLMv2FeatureExtractor", "LayoutLMv2ImageProcessor"])
    _import_structure["models.layoutlmv3"].extend(["LayoutLMv3FeatureExtractor", "LayoutLMv3ImageProcessor"])
    _import_structure["models.levit"].extend(["LevitFeatureExtractor", "LevitImageProcessor"])
@ -2030,6 +2036,14 @@ else:
            "JukeboxVQVAE",
        ]
    )
    _import_structure["models.kosmos2"].extend(
        [
            "KOSMOS2_PRETRAINED_MODEL_ARCHIVE_LIST",
            "Kosmos2ForConditionalGeneration",
            "Kosmos2Model",
            "Kosmos2PreTrainedModel",
        ]
    )
    _import_structure["models.layoutlm"].extend(
        [
            "LAYOUTLM_PRETRAINED_MODEL_ARCHIVE_LIST",
@ -4514,6 +4528,11 @@ if TYPE_CHECKING:
        JukeboxTokenizer,
        JukeboxVQVAEConfig,
    )
    from .models.kosmos2 import (
        KOSMOS2_PRETRAINED_CONFIG_ARCHIVE_MAP,
        Kosmos2Config,
        Kosmos2Processor,
    )
    from .models.layoutlm import LAYOUTLM_PRETRAINED_CONFIG_ARCHIVE_MAP, LayoutLMConfig, LayoutLMTokenizer
    from .models.layoutlmv2 import (
        LAYOUTLMV2_PRETRAINED_CONFIG_ARCHIVE_MAP,
@ -5031,6 +5050,7 @@ if TYPE_CHECKING:
        from .models.glpn import GLPNFeatureExtractor, GLPNImageProcessor
        from .models.idefics import IdeficsImageProcessor
        from .models.imagegpt import ImageGPTFeatureExtractor, ImageGPTImageProcessor
        from .models.kosmos2 import Kosmos2ImageProcessor
        from .models.layoutlmv2 import LayoutLMv2FeatureExtractor, LayoutLMv2ImageProcessor
        from .models.layoutlmv3 import LayoutLMv3FeatureExtractor, LayoutLMv3ImageProcessor
        from .models.levit import LevitFeatureExtractor, LevitImageProcessor
@ -5919,6 +5939,12 @@ if TYPE_CHECKING:
            JukeboxPrior,
            JukeboxVQVAE,
        )
        from .models.kosmos2 import (
            KOSMOS2_PRETRAINED_MODEL_ARCHIVE_LIST,
            Kosmos2ForConditionalGeneration,
            Kosmos2Model,
            Kosmos2PreTrainedModel,
        )
        from .models.layoutlm import (
            LAYOUTLM_PRETRAINED_MODEL_ARCHIVE_LIST,
            LayoutLMForMaskedLM,

View File

@ -108,6 +108,7 @@ from . import (
informer,
instructblip,
jukebox,
kosmos2,
layoutlm,
layoutlmv2,
layoutlmv3,

View File

@ -116,6 +116,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("informer", "InformerConfig"),
("instructblip", "InstructBlipConfig"),
("jukebox", "JukeboxConfig"),
("kosmos-2", "Kosmos2Config"),
("layoutlm", "LayoutLMConfig"),
("layoutlmv2", "LayoutLMv2Config"),
("layoutlmv3", "LayoutLMv3Config"),
@ -327,6 +328,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("informer", "INFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("instructblip", "INSTRUCTBLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("jukebox", "JUKEBOX_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("kosmos-2", "KOSMOS2_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("layoutlm", "LAYOUTLM_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("layoutlmv2", "LAYOUTLMV2_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("layoutlmv3", "LAYOUTLMV3_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@ -539,6 +541,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("informer", "Informer"),
("instructblip", "InstructBLIP"),
("jukebox", "Jukebox"),
("kosmos-2", "KOSMOS-2"),
("layoutlm", "LayoutLM"),
("layoutlmv2", "LayoutLMv2"),
("layoutlmv3", "LayoutLMv3"),
@ -700,6 +703,7 @@ SPECIAL_MODEL_TYPE_TO_MODULE_NAME = OrderedDict(
("data2vec-text", "data2vec"),
("data2vec-vision", "data2vec"),
("donut-swin", "donut"),
("kosmos-2", "kosmos2"),
("maskformer-swin", "maskformer"),
("xclip", "x_clip"),
]
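The extra `SPECIAL_MODEL_TYPE_TO_MODULE_NAME` entry appears to be needed because the model type is spelled `kosmos-2` while the package directory is `kosmos2`. The sketch below is a simplified stand-in for the lookup, assuming the default rule is to replace hyphens with underscores. The dictionary is a local copy of the entries visible in this hunk, and the function is written here for illustration rather than imported from `transformers`.

```python
# Local copy of the entries shown in the hunk above (not imported from transformers).
SPECIAL_MODEL_TYPE_TO_MODULE_NAME = {
    "data2vec-text": "data2vec",
    "data2vec-vision": "data2vec",
    "donut-swin": "donut",
    "kosmos-2": "kosmos2",
    "maskformer-swin": "maskformer",
    "xclip": "x_clip",
}


def model_type_to_module_name(model_type):
    """Map a model type string to the name of its module directory."""
    if model_type in SPECIAL_MODEL_TYPE_TO_MODULE_NAME:
        return SPECIAL_MODEL_TYPE_TO_MODULE_NAME[model_type]
    # Assumed default rule: hyphens are not valid in module names.
    return model_type.replace("-", "_")


print(model_type_to_module_name("kosmos-2"))
print(model_type_to_module_name("layoutlmv3"))
print("kosmos-2".replace("-", "_"))  # what the default rule alone would produce
```

Without the special entry, the default rule would yield `kosmos_2`, which does not match the `kosmos2` package imported elsewhere in this diff.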

View File

@ -70,6 +70,7 @@ IMAGE_PROCESSOR_MAPPING_NAMES = OrderedDict(
("idefics", "IdeficsImageProcessor"),
("imagegpt", "ImageGPTImageProcessor"),
("instructblip", "BlipImageProcessor"),
("kosmos-2", "Kosmos2ImageProcessor"),
("layoutlmv2", "LayoutLMv2ImageProcessor"),
("layoutlmv3", "LayoutLMv3ImageProcessor"),
("levit", "LevitImageProcessor"),

View File

@ -112,6 +112,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("imagegpt", "ImageGPTModel"),
("informer", "InformerModel"),
("jukebox", "JukeboxModel"),
("kosmos-2", "Kosmos2Model"),
("layoutlm", "LayoutLMModel"),
("layoutlmv2", "LayoutLMv2Model"),
("layoutlmv3", "LayoutLMv3Model"),
@ -567,6 +568,7 @@ MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES = OrderedDict(
("blip-2", "Blip2ForConditionalGeneration"),
("git", "GitForCausalLM"),
("instructblip", "InstructBlipForConditionalGeneration"),
("kosmos-2", "Kosmos2ForConditionalGeneration"),
("pix2struct", "Pix2StructForConditionalGeneration"),
("vision-encoder-decoder", "VisionEncoderDecoderModel"),
]

View File

@ -59,6 +59,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
("hubert", "Wav2Vec2Processor"),
("idefics", "IdeficsProcessor"),
("instructblip", "InstructBlipProcessor"),
("kosmos-2", "Kosmos2Processor"),
("layoutlmv2", "LayoutLMv2Processor"),
("layoutlmv3", "LayoutLMv3Processor"),
("markuplm", "MarkupLMProcessor"),

View File

@ -181,6 +181,13 @@ else:
("idefics", (None, "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("instructblip", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
("jukebox", ("JukeboxTokenizer", None)),
(
    "kosmos-2",
    (
        "XLMRobertaTokenizer" if is_sentencepiece_available() else None,
        "XLMRobertaTokenizerFast" if is_tokenizers_available() else None,
    ),
),
("layoutlm", ("LayoutLMTokenizer", "LayoutLMTokenizerFast" if is_tokenizers_available() else None)),
("layoutlmv2", ("LayoutLMv2Tokenizer", "LayoutLMv2TokenizerFast" if is_tokenizers_available() else None)),
("layoutlmv3", ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast" if is_tokenizers_available() else None)),

View File

@ -0,0 +1,80 @@
# coding=utf-8
# Copyright 2023 Microsoft Research and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import (
OptionalDependencyNotAvailable,
_LazyModule,
is_torch_available,
is_vision_available,
)
_import_structure = {
"configuration_kosmos2": ["KOSMOS2_PRETRAINED_CONFIG_ARCHIVE_MAP", "Kosmos2Config"],
"processing_kosmos2": ["Kosmos2Processor"],
}
try:
if not is_vision_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["image_processing_kosmos2"] = ["Kosmos2ImageProcessor"]
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_kosmos2"] = [
"KOSMOS2_PRETRAINED_MODEL_ARCHIVE_LIST",
"Kosmos2ForConditionalGeneration",
"Kosmos2Model",
"Kosmos2PreTrainedModel",
]
if TYPE_CHECKING:
from .configuration_kosmos2 import KOSMOS2_PRETRAINED_CONFIG_ARCHIVE_MAP, Kosmos2Config
from .processing_kosmos2 import Kosmos2Processor
try:
if not is_vision_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .image_processing_kosmos2 import Kosmos2ImageProcessor
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .modeling_kosmos2 import (
KOSMOS2_PRETRAINED_MODEL_ARCHIVE_LIST,
Kosmos2ForConditionalGeneration,
Kosmos2Model,
Kosmos2PreTrainedModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
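
The `_import_structure` dictionary maps each submodule to the names it exports, so that heavy imports can be deferred until a name is first used. The sketch below illustrates that idea with the standard library only. `LazyModule` here is a hypothetical stand-in, not the `_LazyModule` class from `transformers`, and it resolves absolute module names where the real package uses relative ones.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Module proxy that imports a submodule only when one of its names is first accessed."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Invert {module: [names]} into {name: module} for quick lookup.
        self._name_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }

    def __getattr__(self, attr):
        # Only reached when normal attribute lookup fails.
        name_to_module = self.__dict__.get("_name_to_module", {})
        if attr not in name_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module = importlib.import_module(name_to_module[attr])
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later lookups bypass __getattr__
        return value


# Illustration with standard-library modules in place of the kosmos2 submodules.
lazy = LazyModule("demo", {"math": ["sqrt"], "json": ["dumps"]})
root = lazy.sqrt(9.0)  # "math" is imported here, on first access
```

Optional dependencies fit the same pattern: a submodule whose requirement is missing is simply never added to the structure, so its names are not offered.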

View File

@ -0,0 +1,314 @@
# coding=utf-8
# Copyright 2023 Microsoft Research and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" KOSMOS-2 model configuration"""
import copy
import os
from typing import Union
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
KOSMOS2_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"ydshieh/temp-testing-kosmos-2-rename-002": (
"https://huggingface.co/ydshieh/temp-testing-kosmos-2-rename-002/resolve/main/config.json"
),
# See all KOSMOS-2 models at https://huggingface.co/models?filter=kosmos-2
}
class Kosmos2TextConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`Kosmos2TextModel`]. It is used to instantiate a
KOSMOS-2 text decoder according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the text decoder of the KOSMOS-2
[ydshieh/temp-testing-kosmos-2-rename-002](https://huggingface.co/ydshieh/temp-testing-kosmos-2-rename-002)
architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 65037):
Vocabulary size of the Kosmos2 model. Defines the number of different tokens that can be represented by the
`input_ids` passed when calling [`Kosmos2Model`].
max_position_embeddings (`int`, *optional*, defaults to 2048):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
embed_dim (`int`, *optional*, defaults to 2048):
Dimensionality of the layers and the pooler layer.
layers (`int`, *optional*, defaults to 24):
Number of hidden layers in the Transformer encoder.
ffn_dim (`int`, *optional*, defaults to 8192):
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
attention_heads (`int`, *optional*, defaults to 32):
Number of attention heads for each attention layer in the Transformer encoder.
activation_function (`str` or `function`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"silu"` and `"gelu_new"` are supported.
dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (`float`, *optional*, defaults to 0.1):
The dropout ratio for the attention probabilities.
activation_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for activations inside the fully connected layer.
layerdrop (`float`, *optional*, defaults to 0.0):
The LayerDrop probability for the decoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
for more details.
layer_norm_eps (`float`, *optional*, defaults to 1e-5):
The epsilon used by the layer normalization layers.
init_std (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
scale_embedding (`bool`, *optional*, defaults to `True`):
Scale embeddings by dividing by sqrt(embed_dim).
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models).
"""
model_type = "kosmos_2_text_model"
keys_to_ignore_at_inference = ["past_key_values"]
attribute_map = {
"num_attention_heads": "attention_heads",
"hidden_size": "embed_dim",
"num_hidden_layers": "layers",
}
def __init__(
self,
vocab_size=65037,
max_position_embeddings=2048,
embed_dim=2048,
layers=24,
ffn_dim=8192,
attention_heads=32,
activation_function="gelu",
dropout=0.1,
attention_dropout=0.1,
activation_dropout=0.0,
layerdrop=0.0,
layer_norm_eps=1e-5,
init_std=0.02,
scale_embedding=True,
use_cache=True,
pad_token_id=1,
bos_token_id=0,
eos_token_id=2,
**kwargs,
):
super().__init__(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
**kwargs,
)
self.vocab_size = vocab_size
self.max_position_embeddings = max_position_embeddings
self.embed_dim = embed_dim
self.layers = layers
self.ffn_dim = ffn_dim
self.attention_heads = attention_heads
self.activation_function = activation_function
self.dropout = dropout
self.attention_dropout = attention_dropout
self.activation_dropout = activation_dropout
self.layerdrop = layerdrop
self.layer_norm_eps = layer_norm_eps
self.init_std = init_std
self.scale_embedding = scale_embedding
self.use_cache = use_cache
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
cls._set_token_in_kwargs(kwargs)
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
# get the text config dict if we are loading from Kosmos2Config
if config_dict.get("model_type") == "kosmos-2":
config_dict = config_dict["text_config"]
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
logger.warning(
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
)
return cls.from_dict(config_dict, **kwargs)
class Kosmos2VisionConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`Kosmos2VisionModel`]. It is used to instantiate a
KOSMOS-2 vision encoder according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the vision encoder of the KOSMOS-2
[ydshieh/temp-testing-kosmos-2-rename-002](https://huggingface.co/ydshieh/temp-testing-kosmos-2-rename-002)
architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
hidden_size (`int`, *optional*, defaults to 1024):
Dimensionality of the encoder layers and the pooler layer.
intermediate_size (`int`, *optional*, defaults to 4096):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
num_hidden_layers (`int`, *optional*, defaults to 24):
Number of hidden layers in the Transformer encoder.
num_attention_heads (`int`, *optional*, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder.
num_channels (`int`, *optional*, defaults to 3):
The number of input channels.
image_size (`int`, *optional*, defaults to 224):
The size (resolution) of each image.
patch_size (`int`, *optional*, defaults to 14):
The size (resolution) of each patch.
hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"selu"`, `"gelu_new"` and `"quick_gelu"` are supported.
layer_norm_eps (`float`, *optional*, defaults to 1e-5):
The epsilon used by the layer normalization layers.
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
initializer_factor (`float`, *optional*, defaults to 1):
A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
testing).
"""
model_type = "kosmos_2_vision_model"
def __init__(
self,
hidden_size=1024,
intermediate_size=4096,
num_hidden_layers=24,
num_attention_heads=16,
num_channels=3,
image_size=224,
patch_size=14,
hidden_act="quick_gelu",
layer_norm_eps=1e-5,
attention_dropout=0.0,
initializer_range=0.02,
initializer_factor=1.0,
**kwargs,
):
super().__init__(**kwargs)
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.num_channels = num_channels
self.patch_size = patch_size
self.image_size = image_size
self.initializer_range = initializer_range
self.initializer_factor = initializer_factor
self.attention_dropout = attention_dropout
self.layer_norm_eps = layer_norm_eps
self.hidden_act = hidden_act
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
cls._set_token_in_kwargs(kwargs)
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
# get the vision config dict if we are loading from Kosmos2Config
if config_dict.get("model_type") == "kosmos-2":
config_dict = config_dict["vision_config"]
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
logger.warning(
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
)
return cls.from_dict(config_dict, **kwargs)
class Kosmos2Config(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`Kosmos2Model`]. It is used to instantiate a
KOSMOS-2 model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a similar configuration to that of the KOSMOS-2
[ydshieh/temp-testing-kosmos-2-rename-002](https://huggingface.co/ydshieh/temp-testing-kosmos-2-rename-002)
architecture.
Args:
text_config (`dict`, *optional*):
Dictionary of configuration options used to initialize [`Kosmos2TextConfig`].
vision_config (`dict`, *optional*):
Dictionary of configuration options used to initialize [`Kosmos2VisionConfig`].
latent_query_num (`int`, *optional*, defaults to 64):
The number of latent query tokens that represent the image features used in the text decoder component.
kwargs (*optional*):
Dictionary of keyword arguments.
Example:
```python
>>> from transformers import Kosmos2Config, Kosmos2Model
>>> # Initializing a Kosmos-2 kosmos-2-patch14-224 style configuration
>>> configuration = Kosmos2Config()
>>> # Initializing a model (with random weights) from the kosmos-2-patch14-224 style configuration
>>> model = Kosmos2Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "kosmos-2"
is_composition = True
def __init__(
self,
text_config=None,
vision_config=None,
latent_query_num=64,
**kwargs,
):
super().__init__(**kwargs)
if text_config is None:
text_config = {}
logger.info("`text_config` is `None`. Initializing the `Kosmos2TextConfig` with default values.")
if vision_config is None:
vision_config = {}
logger.info("`vision_config` is `None`. Initializing the `Kosmos2VisionConfig` with default values.")
self.text_config = Kosmos2TextConfig(**text_config)
self.vision_config = Kosmos2VisionConfig(**vision_config)
self.latent_query_num = latent_query_num
def to_dict(self):
"""
Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`].
Returns:
`Dict[str, Any]`: Dictionary of all the attributes that make up this configuration instance.
"""
output = copy.deepcopy(self.__dict__)
output["text_config"] = self.text_config.to_dict()
output["vision_config"] = self.vision_config.to_dict()
output["model_type"] = self.__class__.model_type
return output
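
The composite configuration builds two sub-configurations from plain dictionaries and overrides `to_dict` so that the nested objects are serialized as dictionaries again. The sketch below reproduces that pattern without depending on `transformers`. `SubConfig` and `CompositeConfig` are hypothetical stand-ins, and the `__getattr__` alias lookup is only one possible way to get the effect of `attribute_map`, not necessarily how `PretrainedConfig` implements it.

```python
import copy
import json


class SubConfig:
    """Stand-in for a sub-configuration (hypothetical, not the transformers class)."""

    # Alias -> real attribute name, mirroring the idea of `attribute_map`.
    attribute_map = {"hidden_size": "embed_dim", "num_hidden_layers": "layers"}

    def __init__(self, embed_dim=2048, layers=24):
        self.embed_dim = embed_dim
        self.layers = layers

    def __getattr__(self, name):
        # Only called when normal lookup fails, so real attributes are unaffected.
        real = type(self).attribute_map.get(name)
        if real is None:
            raise AttributeError(name)
        return getattr(self, real)

    def to_dict(self):
        return copy.deepcopy(self.__dict__)


class CompositeConfig:
    """Stand-in for a composite configuration holding two sub-configurations."""

    def __init__(self, text_config=None, vision_config=None, latent_query_num=64):
        # A missing dictionary means "use the sub-configuration defaults".
        self.text_config = SubConfig(**(text_config or {}))
        self.vision_config = SubConfig(**(vision_config or {}))
        self.latent_query_num = latent_query_num

    def to_dict(self):
        output = copy.deepcopy(self.__dict__)
        # Replace the nested objects with plain dictionaries so the result is JSON-serializable.
        output["text_config"] = self.text_config.to_dict()
        output["vision_config"] = self.vision_config.to_dict()
        return output


config = CompositeConfig(text_config={"layers": 2})
serialized = json.dumps(config.to_dict(), sort_keys=True)
```

Because `to_dict` returns only plain dictionaries and numbers, the result can be written to a `config.json` file and later passed back as `text_config` and `vision_config` keyword arguments.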

View File

@ -0,0 +1,84 @@
import argparse
import re
from fairseq.checkpoint_utils import load_checkpoint_to_cpu
from transformers import Kosmos2Config, Kosmos2ForConditionalGeneration
KEYS_TO_IGNORE = [
# this buffer in the original code is only used to send weights to the desired device
"gpt_model.decoder.embed_positions._float_tensor",
# this weight is never used in the forward pass of the original KOSMOS-2
"gpt_model.decoder.self_attn_sope.scale",
]
def rename_vision_key(key):
key = re.sub(r"img_model\.visual\.", "vision_model.", key)
key = re.sub(r"\.class_embedding$", ".embeddings.class_embedding", key)
key = re.sub(r"\.positional_embedding$", ".embeddings.position_embedding.weight", key)
key = re.sub(r"\.conv1\.weight$", ".embeddings.patch_embedding.weight", key)
key = re.sub(r"\.ln_pre\.", ".pre_layrnorm.", key)
key = re.sub(r"\.transformer\.resblocks\.", ".encoder.layers.", key)
key = re.sub(r"\.ts_attn\.", ".self_attn.", key)
key = re.sub(r"\.ln_1\.", ".layer_norm1.", key)
key = re.sub(r"\.ln_2\.", ".layer_norm2.", key)
key = re.sub(r"\.c_fc\.", ".fc1.", key)
key = re.sub(r"\.c_proj\.", ".fc2.", key)
key = re.sub(r"\.ln_post\.", ".post_layernorm.", key)
return key
def rename_key(key):
# text decoder
key = re.sub(r"gpt_model\.decoder\.", "text_model.", key)
# text decoder: `embed_tokens`
key = re.sub(r"\.embed_tokens\.", ".model.embed_tokens.", key)
key = re.sub(r"\.layers\.", ".model.layers.", key)
key = re.sub(r"^text_model\.layer_norm\.", "text_model.model.layer_norm.", key)
key = re.sub(r"^text_model\.output_projection\.", "text_model.lm_head.", key)
key = re.sub(r"^img_connector\.", "image_to_text_projection.", key)
key = rename_vision_key(key)
return key
def convert_kosmos2_checkpoint_to_pytorch(checkpoint_path, pytorch_dump_folder_path):
state = load_checkpoint_to_cpu(checkpoint_path)
state["cfg"]
state_dict = state["model"]
state_dict_keys = list(state_dict.keys())
config = Kosmos2Config()
model = Kosmos2ForConditionalGeneration(config)
# convert (by renaming keys)
converted_state_dict = {}
for key in state_dict_keys:
if key in KEYS_TO_IGNORE:
continue
renamed_key = rename_key(key)
converted_state_dict[renamed_key] = state_dict[key]
# all HF model keys should be in the renamed keys from the original checkpoint
assert set(model.state_dict().keys()) == set(converted_state_dict.keys())
# check weight loading
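model.load_state_dict(converted_state_dict, strict=True)
# save the converted model so that `pytorch_dump_folder_path` is actually used
model.save_pretrained(pytorch_dump_folder_path)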
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--kosmos2_checkpoint_path", default=None, type=str, required=True, help="Path to the official PyTorch dump."
)
parser.add_argument(
"--pytorch_dump_folder_path", default=None, type=str, required=True, help="Path to the output PyTorch model."
)
args = parser.parse_args()
convert_kosmos2_checkpoint_to_pytorch(args.kosmos2_checkpoint_path, args.pytorch_dump_folder_path)
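
The conversion works purely by renaming state-dict keys with an ordered chain of regular-expression substitutions. The sketch below applies a subset of those rules in table form to a few keys. The example keys are hypothetical, chosen only to exercise the rules, and `RENAME_RULES` and `rename` are illustrative names rather than part of the script.

```python
import re

# Subset of the substitution rules from the conversion script, applied in order.
RENAME_RULES = [
    (r"gpt_model\.decoder\.", "text_model."),
    (r"\.embed_tokens\.", ".model.embed_tokens."),
    (r"\.layers\.", ".model.layers."),
    (r"img_model\.visual\.", "vision_model."),
    (r"\.transformer\.resblocks\.", ".encoder.layers."),
    (r"\.ln_1\.", ".layer_norm1."),
]


def rename(key: str) -> str:
    """Apply every rule in sequence; later rules see the output of earlier ones."""
    for pattern, replacement in RENAME_RULES:
        key = re.sub(pattern, replacement, key)
    return key


# Hypothetical checkpoint keys, not taken from a real checkpoint.
examples = [
    "gpt_model.decoder.embed_tokens.weight",
    "gpt_model.decoder.layers.0.self_attn.k_proj.weight",
    "img_model.visual.transformer.resblocks.3.ln_1.weight",
]
renamed = {key: rename(key) for key in examples}
```

Rule order matters: the `.layers.` rule runs before the vision rules, so the `.encoder.layers.` segment produced for vision keys is not rewritten a second time.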

View File

@ -0,0 +1,312 @@
# coding=utf-8
# Copyright 2023 Microsoft Research and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Image processor class for KOSMOS-2."""
from typing import Dict, List, Optional, Union
import numpy as np
from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
from ...image_transforms import (
convert_to_rgb,
get_resize_output_image_size,
resize,
to_channel_dimension_format,
)
from ...image_utils import (
OPENAI_CLIP_MEAN,
OPENAI_CLIP_STD,
ChannelDimension,
ImageInput,
PILImageResampling,
infer_channel_dimension_format,
is_scaled_image,
make_list_of_images,
to_numpy_array,
valid_images,
)
from ...utils import TensorType, is_vision_available, logging
logger = logging.get_logger(__name__)
if is_vision_available():
import PIL
class Kosmos2ImageProcessor(BaseImageProcessor):
r"""
Constructs a KOSMOS-2 image processor.
Args:
do_resize (`bool`, *optional*, defaults to `True`):
Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by
`do_resize` in the `preprocess` method.
size (`Dict[str, int]`, *optional*, defaults to `{"shortest_edge": 224}`):
Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with
the longest edge resized to keep the input aspect ratio. Can be overridden by `size` in the `preprocess`
method.
resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
Resampling filter to use if resizing the image. Can be overridden by `resample` in the `preprocess` method.
do_center_crop (`bool`, *optional*, defaults to `True`):
Whether to center crop the image to the specified `crop_size`. Can be overridden by `do_center_crop` in the
`preprocess` method.
crop_size (`Dict[str, int]`, *optional*, defaults to 224):
Size of the output image after applying `center_crop`. Can be overridden by `crop_size` in the `preprocess`
method.
do_rescale (`bool`, *optional*, defaults to `True`):
Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by `do_rescale` in
the `preprocess` method.
rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
Scale factor to use if rescaling the image. Can be overridden by `rescale_factor` in the `preprocess`
method.
do_normalize (`bool`, *optional*, defaults to `True`):
Whether to normalize the image. Can be overridden by `do_normalize` in the `preprocess` method.
image_mean (`float` or `List[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`):
Mean to use if normalizing the image. This is a float or list of floats the length of the number of
channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
image_std (`float` or `List[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`):
Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
do_convert_rgb (`bool`, *optional*, defaults to `True`):
Whether to convert the image to RGB.
"""
model_input_names = ["pixel_values"]
def __init__(
self,
do_resize: bool = True,
size: Dict[str, int] = None,
resample: PILImageResampling = PILImageResampling.BICUBIC,
do_center_crop: bool = True,
crop_size: Dict[str, int] = None,
do_rescale: bool = True,
rescale_factor: Union[int, float] = 1 / 255,
do_normalize: bool = True,
image_mean: Optional[Union[float, List[float]]] = None,
image_std: Optional[Union[float, List[float]]] = None,
do_convert_rgb: bool = True,
**kwargs,
) -> None:
super().__init__(**kwargs)
size = size if size is not None else {"shortest_edge": 224}
size = get_size_dict(size)
crop_size = crop_size if crop_size is not None else {"height": 224, "width": 224}
crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size")
self.do_resize = do_resize
self.size = size
self.resample = resample
self.do_center_crop = do_center_crop
self.crop_size = crop_size
self.do_rescale = do_rescale
self.rescale_factor = rescale_factor
self.do_normalize = do_normalize
self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
self.do_convert_rgb = do_convert_rgb
def resize(
self,
image: np.ndarray,
size: Dict[str, int],
resample: PILImageResampling = PILImageResampling.BICUBIC,
data_format: Optional[Union[str, ChannelDimension]] = None,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
**kwargs,
) -> np.ndarray:
"""
Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge
resized to keep the input aspect ratio.
Args:
image (`np.ndarray`):
Image to resize.
size (`Dict[str, int]`):
Size of the output image.
resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
Resampling filter to use when resizing the image.
data_format (`str` or `ChannelDimension`, *optional*):
The channel dimension format of the image. If not provided, it will be the same as the input image.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format of the input image. If not provided, it will be inferred.
"""
size = get_size_dict(size)
if "shortest_edge" not in size:
raise ValueError(f"The `size` parameter must contain the key `shortest_edge`. Got {size.keys()}")
output_size = get_resize_output_image_size(
image, size=size["shortest_edge"], input_data_format=input_data_format
)
return resize(
image,
size=output_size,
resample=resample,
data_format=data_format,
input_data_format=input_data_format,
**kwargs,
)
def preprocess(
self,
images: ImageInput,
do_resize: bool = None,
size: Dict[str, int] = None,
resample: PILImageResampling = None,
do_center_crop: bool = None,
crop_size: Dict[str, int] = None,
do_rescale: bool = None,
rescale_factor: float = None,
do_normalize: bool = None,
image_mean: Optional[Union[float, List[float]]] = None,
image_std: Optional[Union[float, List[float]]] = None,
do_convert_rgb: bool = None,
return_tensors: Optional[Union[str, TensorType]] = None,
data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
**kwargs,
) -> BatchFeature:
"""
Preprocess an image or batch of images.
Args:
images (`ImageInput`):
Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
passing in images with pixel values between 0 and 1, set `do_rescale=False`.
do_resize (`bool`, *optional*, defaults to `self.do_resize`):
Whether to resize the image.
size (`Dict[str, int]`, *optional*, defaults to `self.size`):
Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with
the longest edge resized to keep the input aspect ratio.
resample (`int`, *optional*, defaults to `self.resample`):
Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
has an effect if `do_resize` is set to `True`.
do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`):
Whether to center crop the image.
crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`):
Size of the center crop. Only has an effect if `do_center_crop` is set to `True`.
do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
Whether to rescale the image.
rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
Rescale factor to rescale the image by if `do_rescale` is set to `True`.
do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
Whether to normalize the image.
image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
`True`.
do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
Whether to convert the image to RGB.
return_tensors (`str` or `TensorType`, *optional*):
The type of tensors to return. Can be one of:
- Unset: Return a list of `np.ndarray`.
- `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
- `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
- `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
- `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
The channel dimension format for the output image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the input image. If unset, the channel dimension format is inferred
from the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
"""
do_resize = do_resize if do_resize is not None else self.do_resize
size = size if size is not None else self.size
size = get_size_dict(size, param_name="size")
resample = resample if resample is not None else self.resample
do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop
crop_size = crop_size if crop_size is not None else self.crop_size
crop_size = get_size_dict(crop_size, param_name="crop_size", default_to_square=True)
do_rescale = do_rescale if do_rescale is not None else self.do_rescale
rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
do_normalize = do_normalize if do_normalize is not None else self.do_normalize
image_mean = image_mean if image_mean is not None else self.image_mean
image_std = image_std if image_std is not None else self.image_std
do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
images = make_list_of_images(images)
if not valid_images(images):
raise ValueError(
"Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
"torch.Tensor, tf.Tensor or jax.ndarray."
)
if do_resize and size is None:
raise ValueError("Size must be specified if do_resize is True.")
if do_center_crop and crop_size is None:
raise ValueError("Crop size must be specified if do_center_crop is True.")
if do_rescale and rescale_factor is None:
raise ValueError("Rescale factor must be specified if do_rescale is True.")
if do_normalize and (image_mean is None or image_std is None):
raise ValueError("Image mean and std must be specified if do_normalize is True.")
# PIL RGBA images are converted to RGB
if do_convert_rgb:
images = [convert_to_rgb(image) for image in images]
# All transformations expect numpy arrays.
images = [to_numpy_array(image) for image in images]
if is_scaled_image(images[0]) and do_rescale:
logger.warning_once(
"It looks like you are trying to rescale already rescaled images. If the input"
" images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
)
if input_data_format is None:
# We assume that all images have the same channel dimension format.
input_data_format = infer_channel_dimension_format(images[0])
if do_resize:
images = [
self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format)
for image in images
]
if do_center_crop:
images = [
self.center_crop(image=image, size=crop_size, input_data_format=input_data_format) for image in images
]
if do_rescale:
images = [
self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
for image in images
]
if do_normalize:
images = [
self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
for image in images
]
images = [
to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
]
data = {"pixel_values": images}
return BatchFeature(data=data, tensor_type=return_tensors)
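
With the default settings, preprocessing resizes the shortest edge to 224, center crops to 224 x 224, rescales by 1/255, and normalizes with a per-channel mean and standard deviation. The arithmetic behind those steps can be sketched in plain Python as below. The helper names are hypothetical and this is not the library implementation. In particular, it assumes the resized image is at least as large as the crop and truncates the long edge to an integer.

```python
def shortest_edge_size(height, width, shortest_edge):
    """Return (height, width) after scaling the shorter side to `shortest_edge`."""
    short, long = (width, height) if width <= height else (height, width)
    new_short, new_long = shortest_edge, int(shortest_edge * long / short)
    return (new_long, new_short) if width <= height else (new_short, new_long)


def center_crop_box(height, width, crop_height, crop_width):
    """Return (top, left, bottom, right) of a centered crop window."""
    top = (height - crop_height) // 2
    left = (width - crop_width) // 2
    return top, left, top + crop_height, left + crop_width


def normalize_pixel(value, mean, std, rescale_factor=1 / 255):
    """Rescale a 0-255 value to 0-1, then standardize it with `mean` and `std`."""
    return (value * rescale_factor - mean) / std


# A 480 x 640 (height x width) image: resize, then crop to 224 x 224.
resized = shortest_edge_size(480, 640, 224)
box = center_crop_box(resized[0], resized[1], 224, 224)
```

Here the resize keeps the aspect ratio, so only the longer side is trimmed by the crop.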

File diff suppressed because it is too large

View File

@ -0,0 +1,696 @@
# coding=utf-8
# Copyright 2023 Microsoft Research and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Processor class for KOSMOS-2."""
import copy
import math
import re
from typing import List, Optional, Tuple, Union
import numpy as np
from ...image_processing_utils import BatchFeature
from ...image_utils import ImageInput, is_batched
from ...processing_utils import ProcessorMixin
from ...tokenization_utils import AddedToken
from ...tokenization_utils_base import PaddingStrategy, TextInput, TruncationStrategy
from ...utils import TensorType, is_torch_available
BboxInput = Union[
List[Tuple[int, int]],
List[Tuple[float, float, float, float]],
List[List[Tuple[int, int]]],
List[List[Tuple[float, float, float, float]]],
]
class Kosmos2Processor(ProcessorMixin):
r"""
Constructs a KOSMOS-2 processor which wraps a KOSMOS-2 image processor and a KOSMOS-2 tokenizer into a single
processor.
[`Kosmos2Processor`] offers all the functionalities of [`Kosmos2ImageProcessor`] and some functionalities of
[`XLMRobertaTokenizerFast`]. See the docstring of [`~Kosmos2Processor.__call__`] and [`~Kosmos2Processor.decode`]
for more information.
Args:
image_processor (`Kosmos2ImageProcessor`):
An instance of [`Kosmos2ImageProcessor`]. The image processor is a required input.
tokenizer (`XLMRobertaTokenizerFast`):
An instance of [`XLMRobertaTokenizerFast`]. The tokenizer is a required input.
num_patch_index_tokens (`int`, *optional*, defaults to 1024):
The number of tokens that represent patch indices.
"""
attributes = ["image_processor", "tokenizer"]
image_processor_class = "Kosmos2ImageProcessor"
tokenizer_class = ("XLMRobertaTokenizer", "XLMRobertaTokenizerFast")
def __init__(self, image_processor, tokenizer, num_patch_index_tokens=1024):
tokenizer.return_token_type_ids = False
self.eod_token = "</doc>"
self.boi_token = "<image>"
self.eoi_token = "</image>"
self.eoc_token = "</chunk>"
self.eol_token = "</line>"
self.bop_token = "<phrase>"
self.eop_token = "</phrase>"
self.boo_token = "<object>"
self.eoo_token = "</object>"
self.dom_token = "</delimiter_of_multi_objects/>"
self.grd_token = "<grounding>"
self.tag_tokens = [
self.eod_token,
self.boi_token,
self.eoi_token,
self.eoc_token,
self.eol_token,
self.bop_token,
self.eop_token,
self.boo_token,
self.eoo_token,
self.dom_token,
self.grd_token,
]
self.num_patch_index_tokens = num_patch_index_tokens
patch_index_tokens = [f"<patch_index_{str(x).zfill(4)}>" for x in range(self.num_patch_index_tokens)]
tokens_to_add = []
for token in self.tag_tokens + patch_index_tokens:
tokens_to_add.append(AddedToken(token, lstrip=True, rstrip=False))
tokenizer.add_tokens(tokens_to_add)
super().__init__(image_processor, tokenizer)
def __call__(
self,
images: ImageInput = None,
text: Union[TextInput, List[TextInput]] = None,
bboxes: BboxInput = None,
num_image_tokens: Optional[int] = 64,
first_image_token_id: Optional[int] = None,
add_special_tokens: bool = True,
add_eos_token: bool = True,
padding: Union[bool, str, PaddingStrategy] = False,
truncation: Union[bool, str, TruncationStrategy] = None,
max_length: Optional[int] = None,
pad_to_multiple_of: Optional[int] = None,
return_attention_mask: Optional[bool] = None,
return_length: bool = False,
verbose: bool = True,
return_tensors: Optional[Union[str, TensorType]] = None,
**kwargs,
) -> BatchFeature:
"""
This method uses [`Kosmos2ImageProcessor.__call__`] method to prepare image(s) for the model, and
[`XLMRobertaTokenizerFast.__call__`] to prepare text for the model.
Please refer to the docstring of the above two methods for more information.
The rest of this documentation shows the arguments specific to `Kosmos2Processor`.
Args:
bboxes (`Union[List[Tuple[int]], List[Tuple[float]], List[List[Tuple[int]]], List[List[Tuple[float]]]]`, *optional*):
The bounding boxes associated with `text`.
num_image_tokens (`int`, defaults to 64):
The number of (consecutive) places that are used to mark the placeholders to store image information.
This should be the same as `latent_query_num` in the instance of `Kosmos2Config` you are using.
first_image_token_id (`int`, *optional*):
The token id that will be used for the first place of the subsequence that is reserved to store image
information. If unset, will default to `self.tokenizer.unk_token_id + 1`.
add_eos_token (`bool`, defaults to `True`):
Whether to include the `EOS` token id in the encoding when `add_special_tokens=True`.
"""
if images is None and text is None:
raise ValueError("You have to specify either images or text.")
encoding = BatchFeature()
if images is not None:
image_encoding = self.image_processor(images, return_tensors=return_tensors)
encoding.update(image_encoding)
if text is not None:
text = self.preprocess_text(text, images, bboxes, num_image_tokens=num_image_tokens)
if add_special_tokens and not add_eos_token:
if isinstance(text, str):
text = f"{self.tokenizer.bos_token}{text}"
elif isinstance(text, list):
text = [f"{self.tokenizer.bos_token}{s}" for s in text]
text_encoding = self.tokenizer(
text=text,
add_special_tokens=(add_special_tokens and add_eos_token),
# keep the requested padding strategy for text-only inputs; with images, padding is applied manually below
padding=padding if images is None else False,
truncation=truncation,
max_length=max_length,
pad_to_multiple_of=pad_to_multiple_of,
return_attention_mask=return_attention_mask,
verbose=verbose,
return_tensors=return_tensors if images is None else None,
**kwargs,
)
encoding.update(text_encoding)
if text is not None and images is not None:
# Use the id of the first token after <unk>
if first_image_token_id is None:
first_image_token_id = self.tokenizer.unk_token_id + 1
# To see if we need one more `0` (for `<s>`) at the beginning of `image_embeds_position_mask`.
with_bos = add_special_tokens
# The first (actual) `<image>` token is always at the 1st or 2nd place (after `<s>` if any). Here we look
# for the second `<image>` token (which indicates the first image token).
start_index = int(with_bos) + 1
# Add `image_embeds_position_mask`: the leading and trailing `0` are for `boi` and `eoi` tokens. The `1` indicates
# the places of image tokens.
image_token_ids = list(range(first_image_token_id, first_image_token_id + num_image_tokens))
base_image_embeds_position_mask = [0] + [1] * num_image_tokens + [0]
# loop over `encoding["input_ids"]`
input_ids = []
image_embeds_position_mask = []
all_input_ids = encoding["input_ids"]
# not batched -> (changed to) batch of size 1
if isinstance(text, str):
all_input_ids = [all_input_ids]
encoding["attention_mask"] = [encoding["attention_mask"]]
for text_ids in all_input_ids:
# change the ids for the fake `<image>` tokens in `input_ids`
text_ids = text_ids[:start_index] + image_token_ids + text_ids[start_index + num_image_tokens :]
input_ids.append(text_ids)
mask = copy.copy(base_image_embeds_position_mask)
if with_bos:
# for `<s>`
mask = [0] + mask
# trailing part (which are not related to the image)
mask += [0] * (len(text_ids) - len(mask))
image_embeds_position_mask.append(mask)
if isinstance(text, list):
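# sort by length (not by index) so the shortest example comes first and the longest last
sorted_length = sorted([(idx, len(x)) for idx, x in enumerate(text_encoding.input_ids)], key=lambda x: x[-1])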
_, min_len_not_padded = sorted_length[0]
idx, _ = sorted_length[-1]
text_encoding = self.tokenizer(
text=[text[idx]],
add_special_tokens=(add_special_tokens and add_eos_token),
padding=padding,
truncation=truncation,
max_length=max_length,
pad_to_multiple_of=pad_to_multiple_of,
verbose=verbose,
return_tensors=None,
**kwargs,
)
max_len_padded = len(text_encoding.input_ids[0])
if min_len_not_padded != max_len_padded:
if self.tokenizer.padding_side == "right":
input_ids = [x + [self.tokenizer.pad_token_id] * (max_len_padded - len(x)) for x in input_ids]
image_embeds_position_mask = [
x + [0] * (max_len_padded - len(x)) for x in image_embeds_position_mask
]
if "attention_mask" in encoding:
encoding["attention_mask"] = [
x + [0] * (max_len_padded - len(x)) for x in encoding["attention_mask"]
]
elif self.tokenizer.padding_side == "left":
input_ids = [[self.tokenizer.pad_token_id] * (max_len_padded - len(x)) + x for x in input_ids]
image_embeds_position_mask = [
[0] * (max_len_padded - len(x)) + x for x in image_embeds_position_mask
]
if "attention_mask" in encoding:
encoding["attention_mask"] = [
[0] * (max_len_padded - len(x)) + x for x in encoding["attention_mask"]
]
# un-batch if necessary
if isinstance(text, str) and return_tensors is None:
input_ids = input_ids[0]
encoding["attention_mask"] = encoding["attention_mask"][0]
image_embeds_position_mask = image_embeds_position_mask[0]
# to the target tensor type
if return_tensors == "pt":
if not is_torch_available():
raise RuntimeError("return_tensors set to 'pt' but PyTorch can't be imported")
import torch
input_ids = torch.from_numpy(np.array(input_ids))
image_embeds_position_mask = torch.from_numpy(np.array(image_embeds_position_mask))
encoding["attention_mask"] = torch.from_numpy(np.array(encoding["attention_mask"]))
elif return_tensors is not None:
raise ValueError("return_tensors should be one of 'None' or 'pt'")
encoding["input_ids"] = input_ids
encoding["image_embeds_position_mask"] = image_embeds_position_mask
return encoding
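The layout of `image_embeds_position_mask` built above can be illustrated in isolation. This is a minimal sketch, not the processor's code: the helper name `build_image_embeds_position_mask` and its arguments are hypothetical, and it assumes the image block sits right after the optional `<s>` token, as in `__call__`.

```python
from typing import List


def build_image_embeds_position_mask(seq_len: int, num_image_tokens: int, with_bos: bool = True) -> List[int]:
    """Hypothetical helper: 1 marks an image-embedding slot, 0 marks everything else."""
    # `<image>` (0), then the image slots (1), then `</image>` (0)
    mask = [0] + [1] * num_image_tokens + [0]
    if with_bos:
        # one extra leading 0 for `<s>`
        mask = [0] + mask
    if seq_len < len(mask):
        raise ValueError("`seq_len` is too short to hold the image block.")
    # the remaining text tokens are not image slots
    return mask + [0] * (seq_len - len(mask))


print(build_image_embeds_position_mask(seq_len=10, num_image_tokens=4))
# [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
```

Padding positions added later (left or right) are also filled with `0`, so the number of `1` entries always equals `num_image_tokens`.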
def preprocess_text(
self,
texts: Union[TextInput, List[TextInput]],
images: ImageInput = None,
bboxes: BboxInput = None,
num_image_tokens: Optional[int] = 64,
) -> Union[str, List[str]]:
"""Add image and bounding box information to `texts` as image and patch index tokens.
Args:
texts (`Union[TextInput, List[TextInput]]`): The texts to be processed.
images (`ImageInput`, *optional*): The images associated to `texts`.
bboxes (`Union[List[Tuple[int]], List[Tuple[float]], List[List[Tuple[int]]], List[List[Tuple[float]]]]`, *optional*):
The bounding boxes associated with `texts`.
num_image_tokens (`int`, *optional*, defaults to 64):
The number of image tokens (used as latent queries). This should correspond to the `latent_query_num`
attribute in `Kosmos2Config`.
Returns:
`Union[TextInput, List[TextInput]]`: The processed texts with image and patch index tokens.
"""
# These are fake `<image>` tokens enclosed between (the actual) `<image>` token and `</image>`.
img_tokens = ["<image>"] * num_image_tokens
img_info = " ".join(["<image>"] + img_tokens + ["</image>"])
def check_bboxes_for_single_text(bboxes):
"""
Check `bboxes` for a single text example. It could be
- `None`: no bounding box associated to a text.
- A list with each element being the bounding boxes associated to one `<phrase> ... </phrase>` pair
found in a text. This could be:
- `None`: no bounding box associated to a `<phrase> ... </phrase>` pair.
- A tuple of 2 integers: A single bounding box specified by patch indices.
- A tuple of 4 floating-point numbers: A single bounding box specified by (normalized) coordinates.
- A list containing the above 2 tuple types: Multiple bounding boxes for a
`<phrase> ... </phrase>` pair.
"""
if bboxes is None:
return
elif not isinstance(bboxes, list):
raise ValueError("`bboxes` (for a single text example) should be `None` or a list.")
# `bbox` is the bounding boxes for a single <phrase> </phrase> pair
for bbox in bboxes:
if bbox is None:
continue
elif not isinstance(bbox, list):
bbox = [bbox]
for elt in bbox:
if not isinstance(elt, tuple) or not (
(len(elt) == 2 and all(isinstance(x, int) for x in elt))
or (len(elt) == 4 and all(isinstance(x, float) for x in elt))
):
raise ValueError(
"Each element in `bboxes` (for a single text example) should be `None`, a tuple containing "
"2 integers or 4 float point numbers, or a list containing such tuples. Also "
"make sure the arguments `texts` and `bboxes` passed to `preprocess_text` are both in "
"batches or both for a single example."
)
def preprocess_single(text, image, bboxes):
text = text.strip()
if image is not None:
# Add `<image> ... (fake) image tokens ... </image>`
text = f"{img_info} {text}"
# Add `<object> <patch_index_xxxx> <patch_index_yyyy> </object>` after `<phrase> phrase text </phrase>`
text = self._insert_patch_index_tokens(text, bboxes)
text = self._add_remove_spaces_around_tag_tokens(text)
return text
# make batch to simplify processing logic
batched = True
if isinstance(texts, str):
batched = False
texts = [texts]
if images is None:
images = [None] * len(texts)
elif not is_batched(images):
images = [images]
if len(texts) != len(images):
raise ValueError(
f"The number of examples in `texts` and `images` should be the same. Got {len(texts)} v.s. {len(images)} instead."
)
if not batched:
check_bboxes_for_single_text(bboxes)
bboxes = [bboxes]
elif bboxes is not None:
if not isinstance(bboxes, list):
raise ValueError("`bboxes` should be `None` or a list (as a batch) when `texts` is passed as a batch.")
for x in bboxes:
check_bboxes_for_single_text(x)
else:
bboxes = [None] * len(texts)
if len(bboxes) != len(texts):
raise ValueError(
f"The number of examples in `texts` and `bboxes` should be the same. Got {len(texts)} v.s. {len(bboxes)} instead."
)
result = [preprocess_single(text, image, bbox) for text, image, bbox in zip(texts, images, bboxes)]
# un-batch if necessary
if not batched:
result = result[0]
return result
# Copied from transformers.models.blip.processing_blip.BlipProcessor.batch_decode with BertTokenizerFast->PreTrainedTokenizer
def batch_decode(self, *args, **kwargs):
"""
This method forwards all its arguments to PreTrainedTokenizer's [`~PreTrainedTokenizer.batch_decode`]. Please
refer to the docstring of this method for more information.
"""
return self.tokenizer.batch_decode(*args, **kwargs)
# Copied from transformers.models.blip.processing_blip.BlipProcessor.decode with BertTokenizerFast->PreTrainedTokenizer
def decode(self, *args, **kwargs):
"""
This method forwards all its arguments to PreTrainedTokenizer's [`~PreTrainedTokenizer.decode`]. Please refer
to the docstring of this method for more information.
"""
return self.tokenizer.decode(*args, **kwargs)
def post_process_generation(self, text, cleanup_and_extract=True):
caption = text.split("</image>")[-1]
if cleanup_and_extract:
return clean_text_and_extract_entities_with_bboxes(caption)
return caption
@property
# Copied from transformers.models.blip.processing_blip.BlipProcessor.model_input_names
def model_input_names(self):
tokenizer_input_names = self.tokenizer.model_input_names
image_processor_input_names = self.image_processor.model_input_names
return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
def _insert_patch_index_tokens(self, text: str, bboxes: Union[List[Tuple[int]], List[Tuple[float]]]) -> str:
if bboxes is None or len(bboxes) == 0:
return text
matched_phrases = list(re.finditer(r"<phrase>.+?</phrase>", string=text))
if len(matched_phrases) != len(bboxes):
raise ValueError(
f"The number of elements in `bboxes` should be the same as the number of `<phrase> ... </phrase>` pairs in `text`. Got {len(matched_phrases)} v.s. {len(bboxes)} instead."
)
# insert the object's patch index tokens after
# the found `<phrase> ... </phrase>` pairs.
curr_pos = 0
buffer = []
for matched, bbox in zip(matched_phrases, bboxes):
_, end = matched.span()
buffer.append(text[curr_pos:end])
curr_pos = end
# A phrase without bbox
if bbox is None:
continue
# A phrase with a single bbox
if isinstance(bbox, tuple):
bbox = [bbox]
patch_index_strings = []
# A phrase could have multiple bboxes
assert all(box is not None for box in bbox)
for box in bbox:
patch_index_1, patch_index_2 = self._convert_bbox_to_patch_index_tokens(box)
patch_index_strings.append(f"{patch_index_1} {patch_index_2}")
# `bbox` being an empty list
if len(patch_index_strings) == 0:
continue
position_str = " </delimiter_of_multi_objects/> ".join(patch_index_strings)
buffer.append(f"<object> {position_str} </object>")
# remaining
if curr_pos < len(text):
buffer.append(text[curr_pos:])
text = "".join(buffer)
return text
def _convert_bbox_to_patch_index_tokens(
self, bbox: Union[Tuple[int, int], Tuple[float, float, float, float]]
) -> Tuple[str, str]:
# already computed patch indices
if len(bbox) == 2:
idx_1, idx_2 = bbox
# bbox specified with (normalized) coordinates
else:
# derive `num_patches_per_side` from `self.num_patch_index_tokens` (assumes a square grid)
num_patches_per_side = int(math.sqrt(self.num_patch_index_tokens))
idx_1, idx_2 = coordinate_to_patch_index(bbox, num_patches_per_side)
token_1 = f"<patch_index_{str(idx_1).zfill(4)}>"
token_2 = f"<patch_index_{str(idx_2).zfill(4)}>"
return token_1, token_2
def _add_remove_spaces_around_tag_tokens(self, text):
"""
Remove spaces before tag tokens (e.g. `<x>`). Also ensure a space after a tag token, if it is not followed by
another tag token (this is not technically necessary, but good for a standard/consistent format). This avoids
the inconsistency of tokenization results between kosmos-2 slow and fast tokenizers.
"""
tag_tokens = set(
self.tag_tokens + [f"<patch_index_{str(x).zfill(4)}>" for x in range(self.num_patch_index_tokens)]
)
pattern = "|".join(tag_tokens)
splits = re.split(rf"({pattern})", text)
# Don't keep the leading and trailing space if any
splits = [split for idx, split in enumerate(splits) if not (idx in [0, len(splits) - 1] and split == "")]
output = ""
prev_str_in_targets = False
for split in splits:
if split in tag_tokens:
prev_str_in_targets = True
output = output.rstrip() + split
else:
# we don't need to ensure a space before a normal token that is after a tag token. But having it
# keeps a standard format, which is good anyway.
if prev_str_in_targets and not split.startswith(" "):
output += " " + split
else:
output += split
prev_str_in_targets = False
return output
def coordinate_to_patch_index(bbox: Tuple[float, float, float, float], num_patches_per_side: int) -> Tuple[int, int]:
"""Convert a bounding box to a pair of patch indices.
Args:
bbox (`Tuple[float, float, float, float]`):
The 4 coordinates of the bounding box, with the format being (x1, y1, x2, y2) specifying the upper-left and
lower-right corners of the box. It should have x2 > x1 and y2 > y1.
num_patches_per_side (`int`): the number of patches along each side.
Returns:
`Tuple[int, int]`: A pair of patch indices representing the upper-left patch and lower-right patch.
"""
(x1, y1, x2, y2) = bbox
if not (x2 > x1 and y2 > y1):
raise ValueError("The coordinates in `bbox` should be `(x1, y1, x2, y2)` with `x2 > x1` and `y2 > y1`.")
ul_x = math.floor(x1 * num_patches_per_side)
ul_y = math.floor(y1 * num_patches_per_side)
lr_x = math.ceil(x2 * num_patches_per_side - 1)
lr_y = math.ceil(y2 * num_patches_per_side - 1)
ul_idx = ul_y * num_patches_per_side + ul_x
lr_idx = lr_y * num_patches_per_side + lr_x
return ul_idx, lr_idx
# copied from https://github.com/microsoft/unilm/blob/97e4923e97d3ee10b57e97013556e3fd0d207a9b/kosmos-2/demo/decode_string.py#L35C1-L75C38
# (with format modifications)
def patch_index_to_coordinate(ul_idx: int, lr_idx: int, num_patches_per_side: int):
"""
Given a grid of length `num_patches_per_side` and the indices of the upper-left and lower-right corners of a
bounding box, returns the normalized coordinates of the bounding box, in the form (x1, y1, x2, y2).
Args:
ul_idx (`int`): the index of the grid cell that corresponds to the upper-left corner of the bounding box.
lr_idx (`int`): the index of the grid cell that corresponds to the lower-right corner of the bounding box.
num_patches_per_side (`int`): the number of patches along each side.
Returns:
`Tuple[float]`: the normalized coordinates of the bounding box, in the form (x1, y1, x2, y2).
"""
# Compute the size of each cell in the grid
cell_size = 1.0 / num_patches_per_side
# Compute the x and y indices of the upper-left and lower-right corners of the bounding box
ul_x = ul_idx % num_patches_per_side
ul_y = ul_idx // num_patches_per_side
lr_x = lr_idx % num_patches_per_side
lr_y = lr_idx // num_patches_per_side
# Compute the normalized coordinates of the bounding box
# A box lying within a single row or column of cells (including a single cell) spans whole cells
if ul_x == lr_x or ul_y == lr_y:
x1 = ul_x * cell_size
y1 = ul_y * cell_size
x2 = lr_x * cell_size + cell_size
y2 = lr_y * cell_size + cell_size
else:
x1 = ul_x * cell_size + cell_size / 2
y1 = ul_y * cell_size + cell_size / 2
x2 = lr_x * cell_size + cell_size / 2
y2 = lr_y * cell_size + cell_size / 2
return x1, y1, x2, y2
# copied from https://github.com/microsoft/unilm/blob/97e4923e97d3ee10b57e97013556e3fd0d207a9b/kosmos-2/demo/decode_string.py#L4-L33
# (with format modifications)
def extract_entities_with_patch_indices(text):
"""Extract entities contained in `text`. The bounding bboxes is given in the form of patch indices.
This functioin is only intended to be used within `clean_text_and_extract_entities_with_bboxes` where further
processing happens, including converting to normalized coordinates and whitespace character cleaning up.
Examples:
```python
>>> text = "<grounding> An image of<phrase> a snowman</phrase><object><patch_index_0044><patch_index_0863></object> warming himself by<phrase> a fire</phrase><object><patch_index_0005><patch_index_0911></object>."
>>> entities = extract_entities_with_patch_indices(text)
>>> entities
[(' a snowman', (31, 41), [(44, 863)]), (' a fire', (130, 137), [(5, 911)])]
```"""
# The regular expression pattern for matching the required formats
pattern = r"(?:(<phrase>([^<]+)</phrase>))?<object>((?:<patch_index_\d+><patch_index_\d+></delimiter_of_multi_objects/>)*<patch_index_\d+><patch_index_\d+>)</object>"
# Find all matches in the given string
matches = re.finditer(pattern, text)
# Initialize an empty list to store the valid patch_index combinations
entities_with_patch_indices = []
for match in matches:
# span of a `phrase` that is between <phrase> and </phrase>
span = match.span(2)
phrase_tag, phrase, match_content = match.groups()
if not phrase_tag:
phrase = None
# We take the starting position of `<object>`
span = (match.span(0)[0], match.span(0)[0])
# Split the match_content by the delimiter to get individual patch_index pairs
patch_index_pairs = match_content.split("</delimiter_of_multi_objects/>")
entity_bboxes = []
for pair in patch_index_pairs:
# Extract the xxxx and yyyy values from the patch_index pair
x = re.search(r"<patch_index_(\d+)>", pair)
y = re.search(r"<patch_index_(\d+)>", pair[1:])
if x and y:
entity_bboxes.append((int(x.group(1)), int(y.group(1))))
if phrase:
entities_with_patch_indices.append((phrase, span, entity_bboxes))
else:
for bbox in entity_bboxes:
# fake entity name
entity = f"<patch_index_{bbox[0]}><patch_index_{bbox[1]}>"
entities_with_patch_indices.append((entity, span, [bbox]))
return entities_with_patch_indices
def remove_special_fields(text):
return re.sub("<.*?>", "", text)
def adjust_entity_positions(entity, text):
entity_name, (start, end) = entity
adjusted_start = len(remove_special_fields(text[:start]))
adjusted_end = len(remove_special_fields(text[:end]))
adjusted_entity = (entity_name, (adjusted_start, adjusted_end))
return adjusted_entity
# copied from https://github.com/microsoft/unilm/blob/97e4923e97d3ee10b57e97013556e3fd0d207a9b/kosmos-2/demo/decode_string.py#L77-L87
# (with format modifications)
def clean_text_and_extract_entities_with_bboxes(text, num_patches_per_side=32):
"""Remove the tag tokens from `text`, extract entities in it with some cleaning up of white characters.
Examples:
```python
>>> text = "<grounding> An image of<phrase> a snowman</phrase><object><patch_index_0044><patch_index_0863></object> warming himself by<phrase> a fire</phrase><object><patch_index_0005><patch_index_0911></object>."
>>> clean_text, entities = clean_text_and_extract_entities_with_bboxes(text)
>>> clean_text
'An image of a snowman warming himself by a fire.'
>>> entities
[('a snowman', (12, 21), [(0.390625, 0.046875, 0.984375, 0.828125)]), ('a fire', (41, 47), [(0.171875, 0.015625, 0.484375, 0.890625)])]
```"""
processed_text = remove_special_fields(text)
entities_with_patch_indices = extract_entities_with_patch_indices(text)
entities = []
for item in entities_with_patch_indices:
entity, bboxes = item[0:2], item[2]
adjusted_entity = adjust_entity_positions(entity, text)
bboxes_in_coords = [patch_index_to_coordinate(bbox[0], bbox[1], num_patches_per_side) for bbox in bboxes]
entities.append(adjusted_entity + (bboxes_in_coords,))
def cleanup_spaces(text, entities):
new_text = text.strip()
leading_spaces = len(text) - len(text.lstrip())
new_entities = []
for entity_name, (start, end), bboxes in entities:
entity_name_leading_spaces = len(entity_name) - len(entity_name.lstrip())
entity_name_trailing_spaces = len(entity_name) - len(entity_name.rstrip())
start = start - leading_spaces + entity_name_leading_spaces
end = end - leading_spaces - entity_name_trailing_spaces
entity_name = entity_name.strip()
new_entities.append((entity_name, (start, end), bboxes))
return new_text, new_entities
return cleanup_spaces(processed_text, entities)
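
The mapping between normalized boxes and patch indices used in this file can be tried out on its own. The sketch below re-implements the arithmetic of `coordinate_to_patch_index` and `patch_index_to_coordinate` under the assumption of a 32 x 32 grid (1024 patch index tokens); the names `to_patch_indices` and `to_coordinates` are hypothetical and are not part of the library.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]


def to_patch_indices(bbox: Box, num_patches_per_side: int = 32) -> Tuple[int, int]:
    """Hypothetical re-implementation: (x1, y1, x2, y2) -> (upper-left idx, lower-right idx)."""
    x1, y1, x2, y2 = bbox
    if not (x2 > x1 and y2 > y1):
        raise ValueError("Expected (x1, y1, x2, y2) with x2 > x1 and y2 > y1.")
    ul_x = math.floor(x1 * num_patches_per_side)
    ul_y = math.floor(y1 * num_patches_per_side)
    lr_x = math.ceil(x2 * num_patches_per_side - 1)
    lr_y = math.ceil(y2 * num_patches_per_side - 1)
    # row-major index of each corner cell
    return ul_y * num_patches_per_side + ul_x, lr_y * num_patches_per_side + lr_x


def to_coordinates(ul_idx: int, lr_idx: int, num_patches_per_side: int = 32) -> Box:
    """Hypothetical re-implementation: patch indices -> normalized (x1, y1, x2, y2)."""
    cell = 1.0 / num_patches_per_side
    ul_x, ul_y = ul_idx % num_patches_per_side, ul_idx // num_patches_per_side
    lr_x, lr_y = lr_idx % num_patches_per_side, lr_idx // num_patches_per_side
    if ul_x == lr_x or ul_y == lr_y:
        # box within one row or column of cells: use the outer cell edges
        return ul_x * cell, ul_y * cell, (lr_x + 1) * cell, (lr_y + 1) * cell
    # general case: use the centers of the two corner cells
    return (ul_x + 0.5) * cell, (ul_y + 0.5) * cell, (lr_x + 0.5) * cell, (lr_y + 0.5) * cell


indices = to_patch_indices((0.1, 0.2, 0.5, 0.6))
print(indices)                   # (195, 623)
print(to_coordinates(*indices))  # (0.109375, 0.203125, 0.484375, 0.609375)
```

The round trip is lossy: coordinates are quantized to the grid, so the recovered box only approximates the original one.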

View File

@ -4247,6 +4247,30 @@ class JukeboxVQVAE(metaclass=DummyObject):
requires_backends(self, ["torch"])
KOSMOS2_PRETRAINED_MODEL_ARCHIVE_LIST = None
class Kosmos2ForConditionalGeneration(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class Kosmos2Model(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class Kosmos2PreTrainedModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
LAYOUTLM_PRETRAINED_MODEL_ARCHIVE_LIST = None

View File

@ -254,6 +254,13 @@ class ImageGPTImageProcessor(metaclass=DummyObject):
requires_backends(self, ["vision"])
class Kosmos2ImageProcessor(metaclass=DummyObject):
_backends = ["vision"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["vision"])
class LayoutLMv2FeatureExtractor(metaclass=DummyObject):
_backends = ["vision"]

View File

@ -0,0 +1,121 @@
# coding=utf-8
# Copyright 2023 Microsoft Research and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
from transformers.testing_utils import require_torch, require_vision
from transformers.utils import is_vision_available
from ...test_image_processing_common import ImageProcessingTestMixin, prepare_image_inputs
if is_vision_available():
from transformers import Kosmos2ImageProcessor
class Kosmos2ImageProcessingTester(unittest.TestCase):
def __init__(
self,
parent,
batch_size=7,
num_channels=3,
image_size=18,
min_resolution=30,
max_resolution=400,
do_resize=True,
size=None,
do_center_crop=True,
crop_size=None,
do_normalize=True,
image_mean=[0.48145466, 0.4578275, 0.40821073],
image_std=[0.26862954, 0.26130258, 0.27577711],
do_convert_rgb=True,
):
size = size if size is not None else {"shortest_edge": 20}
crop_size = crop_size if crop_size is not None else {"height": 18, "width": 18}
self.parent = parent
self.batch_size = batch_size
self.num_channels = num_channels
self.image_size = image_size
self.min_resolution = min_resolution
self.max_resolution = max_resolution
self.do_resize = do_resize
self.size = size
self.do_center_crop = do_center_crop
self.crop_size = crop_size
self.do_normalize = do_normalize
self.image_mean = image_mean
self.image_std = image_std
self.do_convert_rgb = do_convert_rgb
def prepare_image_processor_dict(self):
return {
"do_resize": self.do_resize,
"size": self.size,
"do_center_crop": self.do_center_crop,
"crop_size": self.crop_size,
"do_normalize": self.do_normalize,
"image_mean": self.image_mean,
"image_std": self.image_std,
"do_convert_rgb": self.do_convert_rgb,
}
def expected_output_image_shape(self, images):
return self.num_channels, self.crop_size["height"], self.crop_size["width"]
def prepare_image_inputs(self, equal_resolution=False, numpify=False, torchify=False):
return prepare_image_inputs(
batch_size=self.batch_size,
num_channels=self.num_channels,
min_resolution=self.min_resolution,
max_resolution=self.max_resolution,
equal_resolution=equal_resolution,
numpify=numpify,
torchify=torchify,
)
@require_torch
@require_vision
class Kosmos2ImageProcessingTest(ImageProcessingTestMixin, unittest.TestCase):
image_processing_class = Kosmos2ImageProcessor if is_vision_available() else None
def setUp(self):
self.image_processor_tester = Kosmos2ImageProcessingTester(self)
@property
def image_processor_dict(self):
return self.image_processor_tester.prepare_image_processor_dict()
def test_image_processor_properties(self):
image_processing = self.image_processing_class(**self.image_processor_dict)
self.assertTrue(hasattr(image_processing, "do_resize"))
self.assertTrue(hasattr(image_processing, "size"))
self.assertTrue(hasattr(image_processing, "do_center_crop"))
self.assertTrue(hasattr(image_processing, "center_crop"))
self.assertTrue(hasattr(image_processing, "do_normalize"))
self.assertTrue(hasattr(image_processing, "image_mean"))
self.assertTrue(hasattr(image_processing, "image_std"))
self.assertTrue(hasattr(image_processing, "do_convert_rgb"))
def test_image_processor_from_dict_with_kwargs(self):
image_processor = self.image_processing_class.from_dict(self.image_processor_dict)
self.assertEqual(image_processor.size, {"shortest_edge": 20})
self.assertEqual(image_processor.crop_size, {"height": 18, "width": 18})
image_processor = self.image_processing_class.from_dict(self.image_processor_dict, size=42, crop_size=84)
self.assertEqual(image_processor.size, {"height": 42, "width": 42})
self.assertEqual(image_processor.crop_size, {"height": 84, "width": 84})

View File

@ -0,0 +1,721 @@
# coding=utf-8
# Copyright 2023 Microsoft Research and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Testing suite for the PyTorch KOSMOS-2 model. """
import copy
import inspect
import os
import tempfile
import unittest
import numpy as np
import requests
from transformers import AutoModelForVision2Seq, AutoProcessor, Kosmos2Config
from transformers.models.kosmos2.configuration_kosmos2 import Kosmos2TextConfig, Kosmos2VisionConfig
from transformers.testing_utils import require_torch, require_vision, slow, torch_device
from transformers.utils import is_torch_available, is_vision_available
from ...test_configuration_common import ConfigTester
from ...test_modeling_common import (
ModelTesterMixin,
_config_zero_init,
floats_tensor,
ids_tensor,
random_attention_mask,
)
if is_torch_available():
import torch
from transformers import Kosmos2ForConditionalGeneration, Kosmos2Model
from transformers.models.kosmos2.modeling_kosmos2 import KOSMOS2_PRETRAINED_MODEL_ARCHIVE_LIST
if is_vision_available():
from PIL import Image
class Kosmos2VisionModelTester:
def __init__(
self,
parent,
batch_size=12,
image_size=32,
patch_size=4,
num_channels=3,
is_training=True,
hidden_size=32,
num_hidden_layers=2,
num_attention_heads=4,
intermediate_size=37,
dropout=0.1,
attention_dropout=0.1,
initializer_range=1e-10,
scope=None,
):
self.parent = parent
self.batch_size = batch_size
self.image_size = image_size
self.patch_size = patch_size
self.num_channels = num_channels
self.is_training = is_training
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.dropout = dropout
self.attention_dropout = attention_dropout
self.initializer_range = initializer_range
self.scope = scope
# in ViT, the seq length equals the number of patches + 1 (we add 1 for the [CLS] token)
num_patches = (image_size // patch_size) ** 2
self.seq_length = num_patches + 1
def prepare_config_and_inputs(self):
pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])
config = self.get_config()
return config, pixel_values
def get_config(self):
return Kosmos2VisionConfig(
image_size=self.image_size,
patch_size=self.patch_size,
num_channels=self.num_channels,
hidden_size=self.hidden_size,
num_hidden_layers=self.num_hidden_layers,
num_attention_heads=self.num_attention_heads,
intermediate_size=self.intermediate_size,
dropout=self.dropout,
attention_dropout=self.attention_dropout,
initializer_range=self.initializer_range,
)
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
config, pixel_values = config_and_inputs
inputs_dict = {"pixel_values": pixel_values}
return config, inputs_dict
class Kosmos2TextModelTester:
def __init__(
self,
parent,
batch_size=12,
seq_length=7,
is_training=True,
use_input_mask=True,
use_labels=True,
vocab_size=99,
hidden_size=32,
num_hidden_layers=2,
num_attention_heads=4,
intermediate_size=37,
dropout=0.1,
attention_dropout=0.1,
max_position_embeddings=512,
scope=None,
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.is_training = is_training
self.use_input_mask = use_input_mask
self.use_labels = use_labels
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.dropout = dropout
self.attention_dropout = attention_dropout
self.max_position_embeddings = max_position_embeddings
self.scope = scope
def prepare_config_and_inputs(self):
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
input_mask = None
if self.use_input_mask:
input_mask = random_attention_mask([self.batch_size, self.seq_length])
if input_mask is not None:
batch_size, seq_length = input_mask.shape
rnd_start_indices = np.random.randint(1, seq_length - 1, size=(batch_size,))
for batch_idx, start_index in enumerate(rnd_start_indices):
input_mask[batch_idx, :start_index] = 1
input_mask[batch_idx, start_index:] = 0
config = self.get_config()
return config, input_ids, input_mask
def get_config(self):
return Kosmos2TextConfig(
vocab_size=self.vocab_size,
embed_dim=self.hidden_size,
layers=self.num_hidden_layers,
attention_heads=self.num_attention_heads,
ffn_dim=self.intermediate_size,
dropout=self.dropout,
attention_dropout=self.attention_dropout,
max_position_embeddings=self.max_position_embeddings,
)
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
config, input_ids, input_mask = config_and_inputs
inputs_dict = {"input_ids": input_ids, "attention_mask": input_mask}
return config, inputs_dict
class Kosmos2ModelTester:
def __init__(self, parent, text_kwargs=None, vision_kwargs=None, latent_query_num=3, is_training=True):
if text_kwargs is None:
text_kwargs = {}
if vision_kwargs is None:
vision_kwargs = {}
self.parent = parent
self.text_model_tester = Kosmos2TextModelTester(parent, **text_kwargs)
self.vision_model_tester = Kosmos2VisionModelTester(parent, **vision_kwargs)
self.latent_query_num = latent_query_num
self.is_training = is_training
def prepare_config_and_inputs(self):
text_config, input_ids, attention_mask = self.text_model_tester.prepare_config_and_inputs()
vision_config, pixel_values = self.vision_model_tester.prepare_config_and_inputs()
# build `image_embeds_position_mask`
image_embeds_position_mask = torch.zeros_like(input_ids)
image_embeds_position_mask[:, 1 : 1 + self.latent_query_num] = 1
config = self.get_config()
return config, input_ids, attention_mask, image_embeds_position_mask, pixel_values
def get_config(self):
return Kosmos2Config(
self.text_model_tester.get_config().to_dict(),
self.vision_model_tester.get_config().to_dict(),
latent_query_num=self.latent_query_num,
)
def create_and_check_model(self, config, input_ids, attention_mask, image_embeds_position_mask, pixel_values):
model = Kosmos2Model(config).to(torch_device).eval()
with torch.no_grad():
result = model(pixel_values, input_ids, image_embeds_position_mask, attention_mask)
self.parent.assertEqual(
result.last_hidden_state.shape,
(self.text_model_tester.batch_size, self.text_model_tester.seq_length, self.text_model_tester.hidden_size),
)
self.parent.assertEqual(
result.image_embeds.shape,
(self.text_model_tester.batch_size, self.latent_query_num, self.text_model_tester.hidden_size),
)
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
config, input_ids, attention_mask, image_embeds_position_mask, pixel_values = config_and_inputs
inputs_dict = {
"input_ids": input_ids,
"attention_mask": attention_mask,
"image_embeds_position_mask": image_embeds_position_mask,
"pixel_values": pixel_values,
}
return config, inputs_dict
@require_torch
class Kosmos2ModelTest(ModelTesterMixin, unittest.TestCase):
all_model_classes = (Kosmos2Model, Kosmos2ForConditionalGeneration) if is_torch_available() else ()
all_generative_model_classes = (Kosmos2ForConditionalGeneration,) if is_torch_available() else ()
fx_compatible = False
test_head_masking = False
test_pruning = False
test_resize_embeddings = False
test_attention_outputs = False
def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
inputs_dict = copy.deepcopy(inputs_dict)
if return_labels:
if model_class.__name__ == "Kosmos2ForConditionalGeneration":
inputs_dict["labels"] = torch.zeros(
(self.model_tester.text_model_tester.batch_size, self.model_tester.text_model_tester.seq_length),
dtype=torch.long,
device=torch_device,
)
return inputs_dict
def setUp(self):
self.model_tester = Kosmos2ModelTester(self)
self.config_tester = ConfigTester(self, config_class=Kosmos2Config, hidden_size=37)
# overwrite from test_modeling_common to skip `image_to_text_projection.latent_query`
def test_initialization(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
configs_no_init = _config_zero_init(config)
for model_class in self.all_model_classes:
model = model_class(config=configs_no_init)
for name, param in model.named_parameters():
if param.requires_grad:
if name == "image_to_text_projection.latent_query":
# The original code uses `nn.Parameter(torch.randn(...))`, for which this test won't pass.
continue
self.assertIn(
((param.data.mean() * 1e9).round() / 1e9).item(),
[0.0, 1.0],
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
)
def test_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_model(*config_and_inputs)
def test_forward_signature(self):
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
signature = inspect.signature(model.forward)
# signature.parameters is an OrderedDict => so arg_names order is deterministic
arg_names = [*signature.parameters.keys()]
expected_arg_names = ["pixel_values"]
self.assertListEqual(arg_names[:1], expected_arg_names)
# over... from common
def test_hidden_states_output(self):
def check_hidden_states_output(inputs_dict, config, model_class):
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
hidden_states = outputs.hidden_states
expected_num_layers = getattr(
self.model_tester,
"expected_num_hidden_layers",
self.model_tester.text_model_tester.num_hidden_layers + 1,
)
self.assertEqual(len(hidden_states), expected_num_layers)
seq_length = self.model_tester.text_model_tester.seq_length
self.assertListEqual(
list(hidden_states[0].shape[-2:]),
[seq_length, self.model_tester.text_model_tester.hidden_size],
)
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
inputs_dict["output_hidden_states"] = True
check_hidden_states_output(inputs_dict, config, model_class)
# check that output_hidden_states also works when set via the config
del inputs_dict["output_hidden_states"]
config.output_hidden_states = True
check_hidden_states_output(inputs_dict, config, model_class)
def test_tie_model_weights(self):
if not self.test_torchscript:
return
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
def check_same_values(layer_1, layer_2):
equal = True
for p1, p2 in zip(layer_1.weight, layer_2.weight):
if p1.data.ne(p2.data).sum() > 0:
equal = False
return equal
for model_class in self.all_model_classes:
config.torchscript = True
model_not_tied = model_class(config)
if model_not_tied.get_output_embeddings() is None:
continue
config_tied = copy.deepcopy(config)
config_tied.torchscript = False
model_tied = model_class(config_tied)
params_tied = list(model_tied.parameters())
# Check that the embedding layer and decoding layer are the same in size and in value
# self.assertTrue(check_same_values(embeddings, decoding))
# # Check that after modification, they remain the same.
# embeddings.weight.data.div_(2)
# # Check that the embedding layer and decoding layer are the same in size and in value
# self.assertTrue(embeddings.weight.shape, decoding.weight.shape)
# self.assertTrue(check_same_values(embeddings, decoding))
# # Check that after modification, they remain the same.
# decoding.weight.data.div_(4)
# # Check that the embedding layer and decoding layer are the same in size and in value
# self.assertTrue(embeddings.weight.shape, decoding.weight.shape)
# self.assertTrue(check_same_values(embeddings, decoding))
# Check that after resize they remain tied.
model_tied.resize_token_embeddings(config.text_config.vocab_size + 10)
params_tied_2 = list(model_tied.parameters())
self.assertEqual(len(params_tied_2), len(params_tied))
# decoding.weight.data.mul_(20)
# # Check that the embedding layer and decoding layer are the same in size and in value
# self.assertTrue(model.transformer.wte.weight.shape, model.lm_head.weight.shape)
# self.assertTrue(check_same_values(model.transformer.wte, model.lm_head))
@slow
def test_model_from_pretrained(self):
for model_name in KOSMOS2_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = Kosmos2Model.from_pretrained(model_name)
self.assertIsNotNone(model)
def _create_and_check_torchscript(self, config, inputs_dict):
if not self.test_torchscript:
return
configs_no_init = _config_zero_init(config) # To be sure we have no Nan
configs_no_init.torchscript = True
for model_class in self.all_model_classes:
model = model_class(config=configs_no_init)
model.to(torch_device)
model.eval()
inputs = self._prepare_for_class(inputs_dict, model_class)
main_input_name = model_class.main_input_name
try:
main_input = inputs[main_input_name]
model(main_input, inputs["input_ids"], inputs["image_embeds_position_mask"])
traced_model = torch.jit.trace(
model, (main_input, inputs["input_ids"], inputs["image_embeds_position_mask"])
)
except RuntimeError:
self.fail("Couldn't trace module.")
with tempfile.TemporaryDirectory() as tmp_dir_name:
pt_file_name = os.path.join(tmp_dir_name, "traced_model.pt")
try:
torch.jit.save(traced_model, pt_file_name)
except Exception:
self.fail("Couldn't save module.")
try:
loaded_model = torch.jit.load(pt_file_name)
except Exception:
self.fail("Couldn't load module.")
model.to(torch_device)
model.eval()
loaded_model.to(torch_device)
loaded_model.eval()
model_state_dict = model.state_dict()
loaded_model_state_dict = loaded_model.state_dict()
non_persistent_buffers = {}
for key in loaded_model_state_dict.keys():
if key not in model_state_dict.keys():
non_persistent_buffers[key] = loaded_model_state_dict[key]
loaded_model_state_dict = {
key: value for key, value in loaded_model_state_dict.items() if key not in non_persistent_buffers
}
self.assertEqual(set(model_state_dict.keys()), set(loaded_model_state_dict.keys()))
model_buffers = list(model.buffers())
for non_persistent_buffer in non_persistent_buffers.values():
found_buffer = False
for i, model_buffer in enumerate(model_buffers):
if torch.equal(non_persistent_buffer, model_buffer):
found_buffer = True
break
self.assertTrue(found_buffer)
model_buffers.pop(i)
models_equal = True
for layer_name, p1 in model_state_dict.items():
if layer_name in loaded_model_state_dict:
p2 = loaded_model_state_dict[layer_name]
if p1.data.ne(p2.data).sum() > 0:
models_equal = False
self.assertTrue(models_equal)
# Avoid a memory leak. Without this, each call increases RAM usage by ~20MB.
# (Even with this call, there is still a memory leak of ~0.04MB.)
self.clear_torch_jit_class_registry()
# We will verify our results on an image of cute cats
def prepare_img():
url = "https://huggingface.co/hf-internal-testing/Kosmos2-test-image/resolve/main/demo.jpg"
im = Image.open(requests.get(url, stream=True).raw)
return im
@require_vision
@require_torch
@slow
class Kosmos2ModelIntegrationTest(unittest.TestCase):
def run_example(self, prompt, image, model, processor):
inputs = processor(text=prompt, images=image, return_tensors="pt", add_eos_token=False, padding=True).to(
torch_device
)
generation_outputs = model.generate(
pixel_values=inputs["pixel_values"],
input_ids=inputs["input_ids"],
attention_mask=inputs["attention_mask"],
image_embeds=None,
image_embeds_position_mask=inputs["image_embeds_position_mask"],
use_cache=True,
max_new_tokens=128,
output_scores=True,
return_dict_in_generate=True,
)
scores = generation_outputs.scores
generated_ids = generation_outputs.sequences
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
# Specify `cleanup_and_extract=False` in order to see the raw model generation.
processed_text = [processor.post_process_generation(x, cleanup_and_extract=False) for x in generated_text]
# By default, the generated text is cleaned up and the entities are extracted.
final_text_with_entities = [processor.post_process_generation(x) for x in generated_text]
return scores, generated_ids, generated_text, processed_text, final_text_with_entities
def test_snowman_image_captioning(self):
url = "https://huggingface.co/ydshieh/temp-testing-kosmos-2-rename-002/resolve/main/snowman.png"
image = Image.open(requests.get(url, stream=True).raw)
image.save("new_image.jpg")
image = Image.open("new_image.jpg")
model = AutoModelForVision2Seq.from_pretrained("ydshieh/temp-testing-kosmos-2-rename-002").to(torch_device)
processor = AutoProcessor.from_pretrained("ydshieh/temp-testing-kosmos-2-rename-002")
prompt = "<grounding>An image of"
scores, generated_ids, generated_text, processed_text, final_text_with_entities = self.run_example(
prompt, image, model, processor
)
processed_text = processed_text[0]
final_text, entities = final_text_with_entities[0]
assert np.allclose(
torch.concat(scores[1:4])[:3, :3].to("cpu").numpy(),
np.array(
[
[-1.5672581195831299, -5.007406711578369, 4.36448860168457],
[-2.147017002105713, -4.966302871704102, 4.592559337615967],
[-0.9352350831031799, -4.688288688659668, 6.240612983703613],
]
),
atol=1e-5,
)
assert np.allclose(
torch.concat(scores[-3:])[-3:, -3:].to("cpu").numpy(),
np.array(
[
[2.9916205406188965, 2.481820583343506, 4.646594524383545],
[-2.8381078243255615, -2.9687185287475586, -2.6926779747009277],
[-2.8909168243408203, -3.2228589057922363, -1.7056822776794434],
]
),
atol=1e-5,
)
# fmt: off
assert generated_ids.to("cpu").numpy().tolist() == [
[
0, 64003, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54,
55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 64004, 64012, 712, 1648, 9, 64007, 10, 43867, 64008,
64009, 64057, 64876, 64010, 5950, 597, 32, 64007, 10, 646, 64008, 64009, 64018, 64924, 64010, 4, 2
]
]
# fmt: on
assert processed_text == (
"<grounding> An image of<phrase> a snowman</phrase><object><patch_index_0044><patch_index_0863></object> "
"warming himself by<phrase> a fire</phrase><object><patch_index_0005><patch_index_0911></object>."
)
assert final_text == "An image of a snowman warming himself by a fire."
assert entities == [
("a snowman", (12, 21), [(0.390625, 0.046875, 0.984375, 0.828125)]),
("a fire", (41, 47), [(0.171875, 0.015625, 0.484375, 0.890625)]),
]
prompt = "<grounding>Describe this image in detail:"
scores, generated_ids, generated_text, processed_text, final_text_with_entities = self.run_example(
prompt, image, model, processor
)
processed_text = processed_text[0]
final_text, entities = final_text_with_entities[0]
assert np.allclose(
torch.concat(scores[1:4])[:3, :3].to("cpu").numpy(),
np.array(
[
[-0.9093570113182068, -4.578373908996582, 5.96360969543457],
[2.452126979827881, -4.090598106384277, 8.738677024841309],
[-0.7624598741531372, -4.771658897399902, 6.576295852661133],
]
),
atol=1e-5,
)
assert np.allclose(
torch.concat(scores[-3:])[-3:, -3:].to("cpu").numpy(),
np.array(
[
[-1.673659086227417, -2.162452220916748, -1.95430588722229],
[-2.006824493408203, -2.2038745880126953, -1.24686861038208],
[-3.2783470153808594, -2.814181089401245, -1.390632152557373],
]
),
atol=1e-5,
)
# fmt: off
assert generated_ids.to("cpu").numpy().tolist() == [
[
0, 64003, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54,
55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 64004, 64012, 34645, 247, 38, 1648, 12, 3391, 55,
24, 1648, 1338, 10, 43867, 1280, 32, 64007, 10, 30879, 64008, 64009, 64018, 65020, 64010, 12, 5, 1842,
4, 71, 17, 1679, 64007, 10, 3958, 64008, 64009, 64061, 64263, 64010, 6, 64007, 15719, 64008, 64009,
64253, 64617, 64010, 6, 8, 64007, 9626, 64008, 64009, 64413, 64545, 64010, 6, 23, 64007, 10, 4363,
64008, 64009, 64623, 64885, 64010, 2255, 8, 64007, 10, 3486, 64008, 64009, 64809, 65036, 64010, 1560,
2255, 4, 24, 43867, 1684, 7, 27, 3774, 5, 10356, 9, 5, 646, 6, 8, 22, 1684, 7, 30, 10, 2007, 8, 16239,
4337, 4, 2
]
]
# fmt: on
assert processed_text == (
"<grounding> Describe this image in detail: The image features a snowman sitting by<phrase> a campfire"
"</phrase><object><patch_index_0005><patch_index_1007></object> in the snow. He is wearing<phrase> a hat"
"</phrase><object><patch_index_0048><patch_index_0250></object>,<phrase> scarf</phrase><object>"
"<patch_index_0240><patch_index_0604></object>, and<phrase> gloves</phrase><object><patch_index_0400>"
"<patch_index_0532></object>, with<phrase> a pot</phrase><object><patch_index_0610><patch_index_0872>"
"</object> nearby and<phrase> a cup</phrase><object><patch_index_0796><patch_index_1023></object> placed "
"nearby. The snowman appears to be enjoying the warmth of the fire, and it appears to have a warm and cozy "
"atmosphere."
)
assert final_text == (
"Describe this image in detail: The image features a snowman sitting by a campfire in the snow. He is "
"wearing a hat, scarf, and gloves, with a pot nearby and a cup placed nearby. The snowman appears to be "
"enjoying the warmth of the fire, and it appears to have a warm and cozy atmosphere."
)
assert entities == [
("a campfire", (71, 81), [(0.171875, 0.015625, 0.484375, 0.984375)]),
("a hat", (109, 114), [(0.515625, 0.046875, 0.828125, 0.234375)]),
("scarf", (116, 121), [(0.515625, 0.234375, 0.890625, 0.578125)]),
("gloves", (127, 133), [(0.515625, 0.390625, 0.640625, 0.515625)]),
("a pot", (140, 145), [(0.078125, 0.609375, 0.265625, 0.859375)]),
("a cup", (157, 162), [(0.890625, 0.765625, 0.984375, 0.984375)]),
]
def test_snowman_image_captioning_batch(self):
url = "https://huggingface.co/ydshieh/temp-testing-kosmos-2-rename-002/resolve/main/snowman.png"
image = Image.open(requests.get(url, stream=True).raw)
image.save("new_image.jpg")
image = Image.open("new_image.jpg")
model = AutoModelForVision2Seq.from_pretrained("ydshieh/temp-testing-kosmos-2-rename-002").to(torch_device)
prompt = ["<grounding>An image of", "<grounding>Describe this image in detail:"]
# left padding
processor = AutoProcessor.from_pretrained("ydshieh/temp-testing-kosmos-2-rename-002", padding_side="left")
scores, generated_ids, generated_text, processed_text, final_text_with_entities = self.run_example(
prompt, [image] * len(prompt), model, processor
)
all_final_text = [x[0] for x in final_text_with_entities]
all_entities = [x[1] for x in final_text_with_entities]
# left padding gives the same results as running without padding
EXPECTED_PROCESSED_TEXT_0 = (
"<grounding> An image of<phrase> a snowman</phrase><object><patch_index_0044><patch_index_0863></object> "
"warming himself by<phrase> a fire</phrase><object><patch_index_0005><patch_index_0911></object>."
)
EXPECTED_PROCESSED_TEXT_1 = (
"<grounding> Describe this image in detail: The image features a snowman sitting by<phrase> a campfire"
"</phrase><object><patch_index_0005><patch_index_1007></object> in the snow. He is wearing<phrase> a hat"
"</phrase><object><patch_index_0048><patch_index_0250></object>,<phrase> scarf</phrase><object>"
"<patch_index_0240><patch_index_0604></object>, and<phrase> gloves</phrase><object><patch_index_0400>"
"<patch_index_0532></object>, with<phrase> a pot</phrase><object><patch_index_0610><patch_index_0872>"
"</object> nearby and<phrase> a cup</phrase><object><patch_index_0796><patch_index_1023></object> placed "
"nearby. The snowman appears to be enjoying the warmth of the fire, and it appears to have a warm and cozy "
"atmosphere."
)
assert processed_text == [EXPECTED_PROCESSED_TEXT_0, EXPECTED_PROCESSED_TEXT_1]
EXPECTED_FINAL_TEXT_0 = "An image of a snowman warming himself by a fire."
EXPECTED_FINAL_TEXT_1 = (
"Describe this image in detail: The image features a snowman sitting by a campfire in the snow. He is "
"wearing a hat, scarf, and gloves, with a pot nearby and a cup placed nearby. The snowman appears to be "
"enjoying the warmth of the fire, and it appears to have a warm and cozy atmosphere."
)
assert all_final_text == [EXPECTED_FINAL_TEXT_0, EXPECTED_FINAL_TEXT_1]
EXPECTED_ENTITIES_0 = [
("a snowman", (12, 21), [(0.390625, 0.046875, 0.984375, 0.828125)]),
("a fire", (41, 47), [(0.171875, 0.015625, 0.484375, 0.890625)]),
]
EXPECTED_ENTITIES_1 = [
("a campfire", (71, 81), [(0.171875, 0.015625, 0.484375, 0.984375)]),
("a hat", (109, 114), [(0.515625, 0.046875, 0.828125, 0.234375)]),
("scarf", (116, 121), [(0.515625, 0.234375, 0.890625, 0.578125)]),
("gloves", (127, 133), [(0.515625, 0.390625, 0.640625, 0.515625)]),
("a pot", (140, 145), [(0.078125, 0.609375, 0.265625, 0.859375)]),
("a cup", (157, 162), [(0.890625, 0.765625, 0.984375, 0.984375)]),
]
assert all_entities == [EXPECTED_ENTITIES_0, EXPECTED_ENTITIES_1]
# right padding
processor = AutoProcessor.from_pretrained("ydshieh/temp-testing-kosmos-2-rename-002")
scores, generated_ids, generated_text, processed_text, final_text_with_entities = self.run_example(
prompt, [image] * len(prompt), model, processor
)
all_final_text = [x[0] for x in final_text_with_entities]
all_entities = [x[1] for x in final_text_with_entities]
# For right padding, only the non-padded sequences will give the same results as non-padding
assert processed_text[1] == EXPECTED_PROCESSED_TEXT_1
assert all_final_text[1] == EXPECTED_FINAL_TEXT_1
assert all_entities[1] == EXPECTED_ENTITIES_1
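
The expected entities in the integration tests above pair each `<patch_index_XXXX>` token with normalized box coordinates. The sketch below reproduces those numbers under an assumption inferred from the expected values only: a 32 x 32 patch grid, with each box corner placed at the center of its patch cell. `patch_pair_to_bbox` is a hypothetical helper written for illustration, not the processor's actual implementation, and it does not attempt any special handling the processor may apply (for example when both corners share a row or column).

```python
# Hypothetical helper for illustration only; not part of the transformers API.
GRID = 32  # assumed patches per image side (patch indices 0..1023)


def patch_pair_to_bbox(upper_left, lower_right, grid=GRID):
    """Map an (upper-left, lower-right) patch index pair to (x1, y1, x2, y2).

    Coordinates are normalized to [0, 1] and point at the center of each
    patch cell, which is what the expected values in the tests suggest.
    """
    cell = 1.0 / grid
    ul_row, ul_col = divmod(upper_left, grid)
    lr_row, lr_col = divmod(lower_right, grid)
    return (
        ul_col * cell + cell / 2,
        ul_row * cell + cell / 2,
        lr_col * cell + cell / 2,
        lr_row * cell + cell / 2,
    )


# Patch pairs taken from the expected processed text of the snowman test.
snowman_bbox = patch_pair_to_bbox(44, 863)
fire_bbox = patch_pair_to_bbox(5, 911)
```

With these inputs the helper returns the same tuples as the `entities` asserted in `test_snowman_image_captioning`; all values are exact binary fractions, so they can be compared with `==`.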


@ -0,0 +1,492 @@
# coding=utf-8
# Copyright 2023 Microsoft Research and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import shutil
import tempfile
import unittest
import numpy as np
import pytest
import requests
from transformers.testing_utils import (
get_tests_dir,
require_sentencepiece,
require_tokenizers,
require_torch,
require_vision,
)
from transformers.utils import is_vision_available
if is_vision_available():
from PIL import Image
from transformers import (
AutoProcessor,
Kosmos2ImageProcessor,
Kosmos2Processor,
PreTrainedTokenizerFast,
XLMRobertaTokenizer,
XLMRobertaTokenizerFast,
)
SAMPLE_VOCAB = get_tests_dir("fixtures/test_sentencepiece.model")
@require_sentencepiece
@require_tokenizers
@require_vision
class Kosmos2ProcessorTest(unittest.TestCase):
def setUp(self):
self.tmpdirname = tempfile.mkdtemp()
image_processor = Kosmos2ImageProcessor()
# We have a SentencePiece fixture for testing
slow_tokenizer = XLMRobertaTokenizer(SAMPLE_VOCAB)
fast_tokenizer = XLMRobertaTokenizerFast(__slow_tokenizer=slow_tokenizer)
processor = Kosmos2Processor(image_processor, fast_tokenizer)
processor.save_pretrained(self.tmpdirname)
def get_tokenizer(self, **kwargs):
return AutoProcessor.from_pretrained(self.tmpdirname, **kwargs).tokenizer
def get_image_processor(self, **kwargs):
return AutoProcessor.from_pretrained(self.tmpdirname, **kwargs).image_processor
def tearDown(self):
shutil.rmtree(self.tmpdirname)
def prepare_image_inputs(self):
"""Prepare a list containing a single random PIL image (400 px wide, 30 px high)."""
image_inputs = [np.random.randint(255, size=(3, 30, 400), dtype=np.uint8)]
image_inputs = [Image.fromarray(np.moveaxis(x, 0, -1)) for x in image_inputs]
return image_inputs
def test_save_load_pretrained_additional_features(self):
processor = Kosmos2Processor(tokenizer=self.get_tokenizer(), image_processor=self.get_image_processor())
processor.save_pretrained(self.tmpdirname)
tokenizer_add_kwargs = self.get_tokenizer(bos_token="(BOS)", eos_token="(EOS)")
image_processor_add_kwargs = self.get_image_processor(do_normalize=False, padding_value=1.0)
processor = Kosmos2Processor.from_pretrained(
self.tmpdirname, bos_token="(BOS)", eos_token="(EOS)", do_normalize=False, padding_value=1.0
)
self.assertEqual(processor.tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab())
self.assertIsInstance(processor.tokenizer, PreTrainedTokenizerFast)
self.assertEqual(processor.image_processor.to_json_string(), image_processor_add_kwargs.to_json_string())
self.assertIsInstance(processor.image_processor, Kosmos2ImageProcessor)
def test_image_processor(self):
image_processor = self.get_image_processor()
tokenizer = self.get_tokenizer()
processor = Kosmos2Processor(tokenizer=tokenizer, image_processor=image_processor)
image_input = self.prepare_image_inputs()
input_image_processor = image_processor(image_input, return_tensors="np")
input_processor = processor(images=image_input, return_tensors="np")
for key in input_image_processor.keys():
self.assertAlmostEqual(input_image_processor[key].sum(), input_processor[key].sum(), delta=1e-2)
def test_tokenizer(self):
image_processor = self.get_image_processor()
tokenizer = self.get_tokenizer()
processor = Kosmos2Processor(tokenizer=tokenizer, image_processor=image_processor)
input_str = "This is a test"
encoded_processor = processor(text=input_str)
encoded_tok = tokenizer(input_str, return_token_type_ids=False)
for key in encoded_tok.keys():
self.assertListEqual(encoded_tok[key], encoded_processor[key])
def test_processor(self):
image_processor = self.get_image_processor()
tokenizer = self.get_tokenizer()
processor = Kosmos2Processor(tokenizer=tokenizer, image_processor=image_processor)
input_str = "This is a test"
image_input = self.prepare_image_inputs()
inputs = processor(text=input_str, images=image_input)
self.assertListEqual(
list(inputs.keys()), ["pixel_values", "input_ids", "attention_mask", "image_embeds_position_mask"]
)
# test if it raises when no input is passed
with pytest.raises(ValueError):
processor()
def test_tokenizer_decode(self):
image_processor = self.get_image_processor()
tokenizer = self.get_tokenizer()
processor = Kosmos2Processor(tokenizer=tokenizer, image_processor=image_processor)
predicted_ids = [[1, 4, 5, 8, 1, 0, 8], [3, 4, 3, 1, 1, 8, 9]]
decoded_processor = processor.batch_decode(predicted_ids)
decoded_tok = tokenizer.batch_decode(predicted_ids)
self.assertListEqual(decoded_tok, decoded_processor)
def test_model_input_names(self):
image_processor = self.get_image_processor()
tokenizer = self.get_tokenizer()
processor = Kosmos2Processor(tokenizer=tokenizer, image_processor=image_processor)
input_str = "This is a test"
image_input = self.prepare_image_inputs()
# both image and text
inputs = processor(text=input_str, images=image_input)
self.assertListEqual(
list(inputs.keys()), ["pixel_values", "input_ids", "attention_mask", "image_embeds_position_mask"]
)
# only text
inputs = processor(text=input_str)
self.assertListEqual(list(inputs.keys()), ["input_ids", "attention_mask"])
# only image
inputs = processor(images=image_input)
self.assertListEqual(list(inputs.keys()), ["pixel_values"])
@require_torch
def test_full_processor(self):
url = "https://huggingface.co/ydshieh/temp-testing-kosmos-2-rename-002/resolve/main/two_dogs.jpg"
processor = Kosmos2Processor.from_pretrained("ydshieh/temp-testing-kosmos-2-rename-002")
# test with different input formats.
# fmt: off
texts = [
# no phrase
"<grounding> Two puppies sit in a field of grass.",
# 1 phrase
"<grounding> <phrase> Two puppies </phrase> sit in a field of grass.",
# 2 phrases
"<grounding> <phrase> Two puppies </phrase> sit in a field of <phrase> grass </phrase>.",
# 2 phrases: bboxes already specified for the 1st phrase
"<grounding> <phrase> Two puppies </phrase> <object> <patch_index_0079> <patch_index_1016> </delimiter_of_multi_objects/> <patch_index_0135> <patch_index_1008> </object> sit in a field of <phrase> grass </phrase>.",
]
# fmt: on
image = Image.open(requests.get(url, stream=True).raw)
# To match the official (microsoft) Kosmos-2 demo from which the expected values here are grabbed
image_path = os.path.join(self.tmpdirname, "image.jpg")
image.save(image_path)
image = Image.open(image_path)
# fmt: off
bboxes = [
[None, []],
[[None], [[]], [(79, 1016)], [[(79, 1016)]], [[(79, 1016), (135, 1008)]]],
[[[(79, 1016), (135, 1008)], None], [[(79, 1016), (135, 1008)], []], [[(79, 1016), (135, 1008)], (480, 1023)], [[(79, 1016), (135, 1008)], [(480, 1023)]]],
[[None, [(480, 1023)]]],
]
# fmt: on
batch_image = [image] * 4
batch_text = [texts[0], texts[1], texts[1], texts[2]]
batch_bboxes = [
None, # no phrase
[[]], # 1 phrase: no bbox
[(79, 1016)], # 1 phrase: 1 bbox
[[(79, 1016), (135, 1008)], (480, 1023)], # 2 phrases: 2 bboxes + 1 bbox
]
# fmt: off
expected_texts = [
# no phrase
"<grounding> Two puppies sit in a field of grass.",
# 1 phrase: without bbox
"<grounding><phrase> Two puppies</phrase> sit in a field of grass.",
# 1 phrase: with a single bbox
"<grounding><phrase> Two puppies</phrase><object><patch_index_0079><patch_index_1016></object> sit in a field of grass.", # noqa
# 1 phrase: with 2 bboxes
"<grounding><phrase> Two puppies</phrase><object><patch_index_0079><patch_index_1016></delimiter_of_multi_objects/><patch_index_0135><patch_index_1008></object> sit in a field of grass.", # noqa
# 2 phrases: one with 2 bboxes and another one without bbox
"<grounding><phrase> Two puppies</phrase><object><patch_index_0079><patch_index_1016></delimiter_of_multi_objects/><patch_index_0135><patch_index_1008></object> sit in a field of<phrase> grass</phrase> .", # noqa
# 2 phrases: one with 2 bboxes and another one with a single bbox
"<grounding><phrase> Two puppies</phrase><object><patch_index_0079><patch_index_1016></delimiter_of_multi_objects/><patch_index_0135><patch_index_1008></object> sit in a field of<phrase> grass</phrase><object><patch_index_0480><patch_index_1023></object> .", # noqa
]
# fmt: on
# fmt: off
expected_input_ids = [
[0, 64012, 1264, 17772, 1357, 12, 10, 770, 9, 4464, 4, 2],
[0, 64012, 64007, 1264, 17772, 64008, 1357, 12, 10, 770, 9, 4464, 4, 2],
[0, 64012, 64007, 1264, 17772, 64008, 64009, 64092, 65029, 64010, 1357, 12, 10, 770, 9, 4464, 4, 2],
[0, 64012, 64007, 1264, 17772, 64008, 64009, 64092, 65029, 64011, 64148, 65021, 64010, 1357, 12, 10, 770, 9, 4464, 4, 2],
[0, 64012, 64007, 1264, 17772, 64008, 64009, 64092, 65029, 64011, 64148, 65021, 64010, 1357, 12, 10, 770, 9, 64007, 4464, 64008, 106, 4, 2],
[0, 64012, 64007, 1264, 17772, 64008, 64009, 64092, 65029, 64011, 64148, 65021, 64010, 1357, 12, 10, 770, 9, 64007, 4464, 64008, 64009, 64493, 65036, 64010, 106, 4, 2],
]
# fmt: on
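Note on the expected values above: in these fixtures every `<patch_index_XXXX>` token lands at a fixed offset from its index. For example, `<patch_index_0079>` corresponds to id 64092 and `<patch_index_1016>` to id 65029, an offset of 64013 in both cases. The sketch below encodes that observation. The helper names and the constant `PATCH_INDEX_OFFSET` are hypothetical; the offset is inferred from this test data only and is not read from the Kosmos-2 tokenizer.

```python
# Hypothetical helpers: the offset is inferred from the expected values in this
# test, not taken from the tokenizer vocabulary.
PATCH_INDEX_OFFSET = 64013


def patch_index_token(index: int) -> str:
    """Format a patch index as it appears in the expected texts, e.g. 79 -> '<patch_index_0079>'."""
    return f"<patch_index_{index:04d}>"


def patch_index_token_id(index: int) -> int:
    """Map a patch index to the token id observed in `expected_input_ids`."""
    return PATCH_INDEX_OFFSET + index


def bbox_token_ids(bbox) -> list:
    """Token ids for one bbox given as a pair of patch indices."""
    return [patch_index_token_id(i) for i in bbox]
```

This makes it easier to check by eye that a bbox such as `(480, 1023)` is the source of the ids `64493, 65036` in the last expected sequence.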
EXPECTED_PIXEL_VALUES_1 = np.array(
[
[
[-0.6535852551460266, -0.6389868259429932, -0.6243883967399597],
[-0.6535852551460266, -0.6389868259429932, -0.6243883967399597],
[-0.6243883967399597, -0.6243883967399597, -0.5951915383338928],
],
[
[-0.20629698038101196, -0.19128920137882233, -0.19128920137882233],
[-0.20629698038101196, -0.19128920137882233, -0.17628143727779388],
[-0.2213047444820404, -0.20629698038101196, -0.16127367317676544],
],
[
[-0.5843556523323059, -0.5701355338096619, -0.5701355338096619],
[-0.5843556523323059, -0.5701355338096619, -0.5559154152870178],
[-0.5843556523323059, -0.5559154152870178, -0.5416953563690186],
],
]
)
EXPECTED_PIXEL_VALUES_2 = np.array(
[
[
[-0.4346088469028473, -0.47840413451194763, -0.7849710583686829],
[-0.5221993923187256, -0.5076009631156921, -0.755774199962616],
[-0.5221993923187256, -0.5076009631156921, -0.7411757707595825],
],
[
[-0.2813358008861542, -0.2963435649871826, -0.431413471698761],
[-0.26632803678512573, -0.2963435649871826, -0.4764367938041687],
[-0.2213047444820404, -0.2813358008861542, -0.49144455790519714],
],
[
[-0.5701355338096619, -0.641235888004303, -0.7549964189529419],
[-0.5843556523323059, -0.641235888004303, -0.7834365367889404],
[-0.5559154152870178, -0.641235888004303, -0.7834365367889404],
],
]
)
def check(texts, bboxes, expected_texts, expected_input_ids):
    processed_texts = processor.preprocess_text(images=None, texts=texts, bboxes=bboxes)
    assert processed_texts == expected_texts
    outputs = processor(images=None, text=texts, bboxes=bboxes)
    assert outputs.input_ids == expected_input_ids
# no phrase
check(texts[0], bboxes[0][0], expected_texts[0], expected_input_ids[0])
# no phrase
check(texts[0], bboxes[0][1], expected_texts[0], expected_input_ids[0])
# 1 phrase: no bbox
check(texts[1], bboxes[1][0], expected_texts[1], expected_input_ids[1])
# 1 phrase: no bbox
check(texts[1], bboxes[1][1], expected_texts[1], expected_input_ids[1])
# 1 phrase: 1 bbox
check(texts[1], bboxes[1][2], expected_texts[2], expected_input_ids[2])
# 1 phrase: 1 bbox
check(texts[1], bboxes[1][3], expected_texts[2], expected_input_ids[2])
# 1 phrase: 2 bboxes
check(texts[1], bboxes[1][4], expected_texts[3], expected_input_ids[3])
# bboxes must not contain `[None]`
with pytest.raises(ValueError):
    _ = processor.preprocess_text(images=None, texts=texts[1], bboxes=[[None]])
# 2 phrase: 2 bboxes + no bbox
check(texts[2], bboxes[2][0], expected_texts[4], expected_input_ids[4])
# 2 phrase: 2 bboxes + no bbox
check(texts[2], bboxes[2][1], expected_texts[4], expected_input_ids[4])
# 2 phrase: 2 bboxes + 1 bbox
check(texts[2], bboxes[2][2], expected_texts[5], expected_input_ids[5])
# 2 phrase: 2 bboxes + 1 bbox
check(texts[2], bboxes[2][3], expected_texts[5], expected_input_ids[5])
# 2 phrase: no box (as already specified in the text) + 1 bbox
check(texts[3], bboxes[3][0], expected_texts[5], expected_input_ids[5])
# bboxes must not contain `[None]`
with pytest.raises(ValueError):
    _ = processor.preprocess_text(images=None, texts=texts[2], bboxes=[[(79, 1016), (135, 1008)], [None]])
# test batch
outputs = processor.preprocess_text(
images=None,
texts=batch_text,
bboxes=batch_bboxes,
)
assert outputs == [expected_texts[0], expected_texts[1], expected_texts[2], expected_texts[5]]
outputs = processor(
images=None,
text=batch_text,
bboxes=batch_bboxes,
)
assert outputs.input_ids == [
expected_input_ids[0],
expected_input_ids[1],
expected_input_ids[2],
expected_input_ids[5],
]
# test batch with padding (without `return_tensors`)
outputs = processor(
images=None,
text=batch_text,
bboxes=batch_bboxes,
padding=True,
)
# padding on the right
assert outputs.input_ids[0] == expected_input_ids[0] + [1] * (
len(expected_input_ids[5]) - len(expected_input_ids[0])
)
assert outputs.attention_mask[0] == [1] * len(expected_input_ids[0]) + [0] * (
len(expected_input_ids[5]) - len(expected_input_ids[0])
)
# no padding for the longest sequence
assert outputs.input_ids[-1] == expected_input_ids[5]
assert outputs.attention_mask[-1] == [1] * len(expected_input_ids[5])
# test batch with padding (with `return_tensors`)
outputs = processor(
images=None,
text=batch_text,
bboxes=batch_bboxes,
return_tensors="pt",
padding=True,
)
# padding on the right
assert outputs.input_ids.numpy().tolist()[0] == expected_input_ids[0] + [1] * (
len(expected_input_ids[5]) - len(expected_input_ids[0])
)
assert outputs.attention_mask.numpy().tolist()[0] == [1] * len(expected_input_ids[0]) + [0] * (
len(expected_input_ids[5]) - len(expected_input_ids[0])
)
# no padding for the longest sequence
assert outputs.input_ids.numpy().tolist()[-1] == expected_input_ids[5]
assert outputs.attention_mask.numpy().tolist()[-1] == [1] * len(expected_input_ids[5])
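The padding assertions above all follow one pattern: the shorter sequence is filled with the pad id up to the length of the longest sequence, and the attention mask is 0 over the padded positions. A minimal sketch of that expectation follows. `pad_to_length` is a hypothetical helper, and `pad_id=1` is taken from the `[1] * (...)` terms in the assertions rather than from the tokenizer config. The `side="left"` branch mirrors the left-padding case exercised later in this test.

```python
def pad_to_length(ids, target_len, pad_id=1, side="right"):
    """Pad `ids` to `target_len`; return (padded_ids, attention_mask).

    Hypothetical helper mirroring the expected values in the test.
    """
    n_pad = target_len - len(ids)
    if n_pad < 0:
        raise ValueError("sequence is longer than target_len")
    pad, mask_pad = [pad_id] * n_pad, [0] * n_pad
    mask = [1] * len(ids)
    if side == "right":
        return ids + pad, mask + mask_pad
    if side == "left":
        return pad + ids, mask_pad + mask
    raise ValueError(f"unknown padding side: {side!r}")
```

With such a helper, each pair of `input_ids` / `attention_mask` assertions could be written against a single call, which keeps the two expected values from drifting apart.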
# test with image
num_image_tokens = 64
# (`image` type is not checked in `preprocess_text`. It works as long as it is not `None`.)
outputs = processor.preprocess_text(
images=image, texts=texts[0], bboxes=None, num_image_tokens=num_image_tokens
)
assert outputs == "".join(["<image>"] + ["<image>"] * num_image_tokens + ["</image>"] + [expected_texts[0]])
outputs = processor(images=image, text=texts[0], bboxes=None)
assert outputs.pixel_values[0].shape == (3, 224, 224)
assert (
outputs.input_ids
== [0, 64003] + list(range(4, 4 + num_image_tokens)) + [64004] + expected_input_ids[0][1:]
)
assert outputs.image_embeds_position_mask == [0] * 2 + [1] * num_image_tokens + [0] + [0] * (
len(expected_input_ids[0]) - 1
)
assert np.allclose(outputs.pixel_values[0][:3, :3, :3], EXPECTED_PIXEL_VALUES_1, atol=1e-9)
assert np.allclose(outputs.pixel_values[0][:3, -3:, -3:], EXPECTED_PIXEL_VALUES_2, atol=1e-9)
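The image assertions above build their expected values the same way each time: `BOS`, an opening image token (id 64003), `num_image_tokens` placeholder ids starting at 4, a closing image token (id 64004), then the text ids without their own `BOS`. The position mask is 1 only over the placeholder ids. A sketch of that construction, with a hypothetical helper name and parameter names (`boi_id`, `eoi_id`); the id values are copied from the assertions, not from the tokenizer:

```python
def expected_with_image(text_ids, num_image_tokens=64, bos_id=0, boi_id=64003, eoi_id=64004):
    """Return (input_ids, image_embeds_position_mask) as the test expects them.

    Hypothetical helper; `text_ids` must start with BOS, which is dropped
    because BOS is already emitted before the image tokens.
    """
    image_ids = list(range(4, 4 + num_image_tokens))
    input_ids = [bos_id, boi_id] + image_ids + [eoi_id] + text_ids[1:]
    position_mask = [0, 0] + [1] * num_image_tokens + [0] + [0] * (len(text_ids) - 1)
    return input_ids, position_mask
```

Both returned lists have length `2 + num_image_tokens + len(text_ids)`, which matches the attention-mask length asserted for the longest sequence in the batched checks.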
# test with image in batch (right padding)
outputs = processor(
images=batch_image,
text=batch_text,
bboxes=batch_bboxes,
return_tensors="pt",
padding=True,
)
assert outputs.pixel_values.shape == (4, 3, 224, 224)
assert np.allclose(
outputs.pixel_values[:, :3, :3, :3].numpy(), [EXPECTED_PIXEL_VALUES_1] * len(batch_image), atol=1e-9
)
assert np.allclose(
outputs.pixel_values[:, :3, -3:, -3:].numpy(), [EXPECTED_PIXEL_VALUES_2] * len(batch_image), atol=1e-9
)
# padding on the right: the `[1:]` below drops `BOS`, which is already included at the start of each (dynamically computed) expected value # noqa
assert outputs.input_ids.numpy().tolist()[0] == [0, 64003] + list(range(4, 4 + num_image_tokens)) + [
64004
] + expected_input_ids[0][1:] + [1] * (len(expected_input_ids[5]) - len(expected_input_ids[0]))
assert outputs.attention_mask.numpy().tolist()[0] == [1, 1] + [1] * num_image_tokens + [1] + [1] * len(
expected_input_ids[0][1:]
) + [0] * (len(expected_input_ids[5]) - len(expected_input_ids[0]))
assert (
outputs.input_ids.numpy().tolist()[-1]
== [0, 64003] + list(range(4, 4 + num_image_tokens)) + [64004] + expected_input_ids[5][1:]
)
assert outputs.attention_mask.numpy().tolist()[-1] == [1] * (2 + num_image_tokens + len(expected_input_ids[5]))
assert outputs.image_embeds_position_mask.numpy().tolist() == [
[0, 0] + [1] * num_image_tokens + [0] + [0] * (len(expected_input_ids[5]) - 1)
] * len(batch_image)
processor = Kosmos2Processor.from_pretrained("ydshieh/temp-testing-kosmos-2-rename-002", padding_side="left")
# test with image in batch (left padding)
outputs = processor(
images=batch_image,
text=batch_text,
bboxes=batch_bboxes,
return_tensors="pt",
padding=True,
)
# padding on the left: the `[1:]` below drops `BOS`, which is already included at the start of each (dynamically computed) expected value # noqa
assert (
outputs.input_ids.numpy().tolist()[0]
== [1] * (len(expected_input_ids[5]) - len(expected_input_ids[0]))
+ [0, 64003]
+ list(range(4, 4 + num_image_tokens))
+ [64004]
+ expected_input_ids[0][1:]
)
assert outputs.attention_mask.numpy().tolist()[0] == [0] * (
len(expected_input_ids[5]) - len(expected_input_ids[0])
) + [1, 1] + [1] * num_image_tokens + [1] + [1] * len(expected_input_ids[0][1:])
assert outputs.image_embeds_position_mask.numpy().tolist()[0] == [0] * (
len(expected_input_ids[5]) - len(expected_input_ids[0])
) + [0, 0] + [1] * num_image_tokens + [0] + [0] * len(expected_input_ids[0][1:])
# no padding for the longest sequence
assert (
outputs.input_ids.numpy().tolist()[-1]
== [0, 64003] + list(range(4, 4 + num_image_tokens)) + [64004] + expected_input_ids[5][1:]
)
assert outputs.attention_mask.numpy().tolist()[-1] == [1] * (2 + num_image_tokens + len(expected_input_ids[5]))
assert outputs.image_embeds_position_mask.numpy().tolist()[-1] == [0, 0] + [1] * num_image_tokens + [0] + [
0
] * (len(expected_input_ids[5]) - 1)


@@ -73,6 +73,10 @@ PRIVATE_MODELS = [
"MaskFormerSwinPreTrainedModel",
"BridgeTowerTextModel",
"BridgeTowerVisionModel",
"Kosmos2LMWrapper",
"Kosmos2TextModel",
"Kosmos2TextForCausalLM",
"Kosmos2VisionModel",
]
# Update this list for models that are not tested with a comment explaining the reason it should not be.


@@ -617,6 +617,7 @@ src/transformers/models/instructblip/processing_instructblip.py
src/transformers/models/jukebox/configuration_jukebox.py
src/transformers/models/jukebox/convert_jukebox.py
src/transformers/models/jukebox/modeling_jukebox.py
src/transformers/models/kosmos2/convert_kosmos2_original_pytorch_checkpoint_to_pytorch.py
src/transformers/models/led/configuration_led.py
src/transformers/models/led/modeling_led.py
src/transformers/models/led/modeling_tf_led.py


@@ -2,3 +2,4 @@ docs/source/en/generation_strategies.md
docs/source/en/model_doc/ctrl.md
docs/source/en/task_summary.md
src/transformers/models/ctrl/modeling_ctrl.py
src/transformers/models/kosmos2/modeling_kosmos2.py