Compare commits

..

16 Commits

96 changed files with 6032 additions and 274 deletions

View File

@@ -32,7 +32,7 @@
To export a 🤗 Transformers model to ONNX, first install an extra dependency:
```bash
pip install optimum-onnx
pip install optimum[exporters]
```
To see all the available arguments, refer to the [🤗 Optimum documentation](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli), or view the help on the command line:
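The help invocation itself falls outside this hunk; presumably it is the standard Optimum CLI help command:
```bash
optimum-cli export onnx --help
```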
@@ -111,3 +111,60 @@ optimum-cli export onnx --model keras-io/transformers-qa distilbert_base_cased_s
### Exporting a model for an unsupported architecture
If you would like to contribute by adding support for a model that cannot currently be exported, first check whether it is supported in [`optimum.exporters.onnx`](https://huggingface.co/docs/optimum/exporters/onnx/overview); if it is not, you can [contribute to 🤗 Optimum](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute) directly.
### Exporting a model with `transformers.onnx`
<Tip warning={true}>
`transformers.onnx` is no longer supported. Please export models with 🤗 Optimum as described above. This section will be removed in future releases.
</Tip>
To export a 🤗 Transformers model to ONNX with `transformers.onnx`, install the extra dependencies:
```bash
pip install transformers[onnx]
```
Use the `transformers.onnx` package as a Python module to export a checkpoint using a ready-made configuration:
```bash
python -m transformers.onnx --model=distilbert/distilbert-base-uncased onnx/
```
This exports an ONNX graph of the checkpoint specified by the `--model` argument. Pass any checkpoint on the 🤗 Hub or one stored locally.
The resulting `model.onnx` file can then be run on one of the many accelerators that support the ONNX standard. For example, load and run the model with ONNX Runtime as follows:
```python
>>> from transformers import AutoTokenizer
>>> from onnxruntime import InferenceSession
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
>>> session = InferenceSession("onnx/model.onnx")
>>> # ONNX Runtime expects NumPy arrays as input
>>> inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np")
>>> outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
```
The required output names (such as `["last_hidden_state"]`) can be obtained by looking at the ONNX configuration of each model. For example, for DistilBERT we have:
```python
>>> from transformers.models.distilbert import DistilBertConfig, DistilBertOnnxConfig
>>> config = DistilBertConfig()
>>> onnx_config = DistilBertOnnxConfig(config)
>>> print(list(onnx_config.outputs.keys()))
["last_hidden_state"]
```
The process is identical for TensorFlow checkpoints on the Hub. For example, export a pure TensorFlow checkpoint as follows:
```bash
python -m transformers.onnx --model=keras-io/transformers-qa onnx/
```
To export a model stored locally, save the model's weights and tokenizer in the same directory (for example `local-pt-checkpoint`), then export it to ONNX by pointing the `--model` argument of the `transformers.onnx` package to that directory:
```bash
python -m transformers.onnx --model=local-pt-checkpoint onnx/
```
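As an optional sanity check (not part of the original guide), the exported graph can be validated with the standalone `onnx` package, assuming it is installed:
```python
import onnx

# load the exported graph and verify it is structurally valid
model = onnx.load("onnx/model.onnx")
onnx.checker.check_model(model)  # raises if the graph is malformed

# the graph outputs match the names declared in the model's OnnxConfig
print([output.name for output in model.graph.output])
```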

View File

@@ -33,7 +33,7 @@ Export a Transformers model to ONNX with the Optimum CLI or the `optimum.onnxrun
Run the command below to install Optimum and the [exporters](https://huggingface.co/docs/optimum/exporters/overview) module.
```bash
pip install optimum-onnx
pip install optimum[exporters]
```
> [!TIP]

View File

@@ -0,0 +1,50 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Exporting 🤗 Transformers models to ONNX
🤗 Transformers provides a `transformers.onnx` package that lets you convert model checkpoints to an ONNX graph by leveraging configuration objects.
See the [guide](../serialization) for more details.
## ONNX Configurations
Three abstract classes are provided; which one to inherit from depends on the type of model architecture you want to export:
* Encoder-based models inherit from [`~onnx.config.OnnxConfig`]
* Decoder-based models inherit from [`~onnx.config.OnnxConfigWithPast`]
* Encoder-decoder models inherit from [`~onnx.config.OnnxSeq2SeqConfigWithPast`]
### OnnxConfig
[[autodoc]] onnx.config.OnnxConfig
### OnnxConfigWithPast
[[autodoc]] onnx.config.OnnxConfigWithPast
### OnnxSeq2SeqConfigWithPast
[[autodoc]] onnx.config.OnnxSeq2SeqConfigWithPast
## ONNX Features
Each ONNX configuration is associated with a set of _features_ that let you export models for different types of topologies or tasks.
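As an illustration (not part of the original page), the set of features for an architecture can be queried through the legacy feature registry, assuming `FeaturesManager` is restored along with these configurations:
```python
from transformers.onnx.features import FeaturesManager

# list the features (tasks) the BERT architecture can be exported with,
# e.g. "default", "masked-lm", "sequence-classification", ...
print(list(FeaturesManager.get_supported_features_for_model_type("bert").keys()))
```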

View File

@@ -47,7 +47,7 @@ Models exported to ONNX format can be used as follows
To export a 🤗 Transformers model to ONNX, first install the extra dependencies:
```bash
pip install optimum-onnx
pip install optimum[exporters]
```
To see all available arguments, refer to the [🤗 Optimum documentation](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli), or view the help on the command line:
@@ -128,3 +128,64 @@ Instead of the CLI, you can export a 🤗 Transformers model to ONNX programmatically
### Exporting a model for an unsupported architecture
To contribute support for a model that cannot currently be exported, first check whether it is supported in [`optimum.exporters.onnx`](https://huggingface.co/docs/optimum/exporters/onnx/overview); if not, [contribute to 🤗 Optimum](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute) directly.
### Exporting a model with `transformers.onnx`
<Tip warning={true}>
`transformers.onnx` is no longer maintained, so please export models with 🤗 Optimum as described above. This section will be removed in a future version.
</Tip>
To export a 🤗 Transformers model to ONNX, install the extra dependencies:
```bash
pip install transformers[onnx]
```
Use the `transformers.onnx` package as a Python module to export a checkpoint with a ready-made configuration as follows:
```bash
python -m transformers.onnx --model=distilbert/distilbert-base-uncased onnx/
```
This exports an ONNX graph of the checkpoint defined by the `--model` argument. You can pass any checkpoint on the 🤗 Hub or one stored locally. The exported `model.onnx` file can be run on one of the many accelerators that support the ONNX standard. For example, load and run the model with ONNX Runtime as follows:
```python
>>> from transformers import AutoTokenizer
>>> from onnxruntime import InferenceSession
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
>>> session = InferenceSession("onnx/model.onnx")
>>> # ONNX Runtime expects NumPy arrays as input
>>> inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np")
>>> outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
```
The required output names (e.g. `["last_hidden_state"]`) can be obtained by checking the ONNX configuration of each model. For example, for DistilBERT:
```python
>>> from transformers.models.distilbert import DistilBertConfig, DistilBertOnnxConfig
>>> config = DistilBertConfig()
>>> onnx_config = DistilBertOnnxConfig(config)
>>> print(list(onnx_config.outputs.keys()))
["last_hidden_state"]
```
The process is the same for exporting a pure TensorFlow checkpoint from the Hub programmatically:
```bash
python -m transformers.onnx --model=keras-io/transformers-qa onnx/
```
To export a locally stored model, save the model's weights and tokenizer files in the same directory (e.g. `local-pt-checkpoint`), then export to ONNX by pointing the `--model` argument of the `transformers.onnx` package to that directory:
```bash
python -m transformers.onnx --model=local-pt-checkpoint onnx/
```

View File

@@ -0,0 +1,45 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Exporting 🤗 Transformers models to ONNX[[exporting--transformers-models-to-onnx]]
🤗 Transformers provides a `transformers.onnx` package that lets you convert model checkpoints to an ONNX graph by leveraging configuration objects.
See [this guide](../serialization) for more details.
## ONNX configurations[[onnx-configurations]]
Three abstract classes are provided; which one to inherit from depends on the type of model architecture you want to export:
* Encoder-based models inherit from [`~onnx.config.OnnxConfig`]
* Decoder-based models inherit from [`~onnx.config.OnnxConfigWithPast`]
* Encoder-decoder models inherit from [`~onnx.config.OnnxSeq2SeqConfigWithPast`]
### OnnxConfig[[transformers.onnx.OnnxConfig]]
[[autodoc]] onnx.config.OnnxConfig
### OnnxConfigWithPast[[transformers.onnx.OnnxConfigWithPast]]
[[autodoc]] onnx.config.OnnxConfigWithPast
### OnnxSeq2SeqConfigWithPast[[OnnxSeq2SeqConfigWithPast]]
[[autodoc]] onnx.config.OnnxSeq2SeqConfigWithPast
## ONNX features[[onnx-features]]
Each ONNX configuration is associated with a set of _features_ that let you export models for different types of topologies or tasks.

View File

@@ -47,7 +47,7 @@ Models exported to ONNX format can be used as follows
To export a 🤗 Transformers model to ONNX, first install the extra dependencies:
```bash
pip install optimum-onnx
pip install optimum[exporters]
```
To see all available arguments, refer to the [🤗 Optimum documentation](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli), or view the help on the command line.
@@ -123,3 +123,59 @@ Instead of the CLI, you can use `optimum.onnxruntime` to export a model programmatically
### Exporting a model for an unsupported architecture [[exporting-a-model-for-an-unsupported-architecture]]
To contribute support for a model that cannot currently be exported, first check whether it is supported in [`optimum.exporters.onnx`](https://huggingface.co/docs/optimum/exporters/onnx/overview), and if it is not, [contribute to 🤗 Optimum](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute).
### Exporting a model with `transformers.onnx` [[exporting-a-model-with-transformersonnx]]
<Tip warning={true}>
`transformers.onnx` is no longer maintained. Export models with 🤗 Optimum as described above. This section will be removed in a future version.
</Tip>
To export a 🤗 Transformers model to ONNX, install the extra dependencies:
```bash
pip install transformers[onnx]
```
Use the `transformers.onnx` package as a Python module to export a checkpoint with a ready-made configuration:
```bash
python -m transformers.onnx --model=distilbert/distilbert-base-uncased onnx/
```
This exports an ONNX graph of the checkpoint defined by the `--model` argument. You can pass a checkpoint from the 🤗 Hub or one stored locally. The resulting `model.onnx` file can be run on one of the many accelerators that support the ONNX standard. For example, load and run the model with ONNX Runtime as follows:
```python
>>> from transformers import AutoTokenizer
>>> from onnxruntime import InferenceSession
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
>>> session = InferenceSession("onnx/model.onnx")
>>> # ONNX Runtime expects NumPy arrays as input
>>> inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np")
>>> outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
```
The required output names (e.g. `["last_hidden_state"]`) can be obtained by checking each model's ONNX configuration. For example, for DistilBERT:
```python
>>> from transformers.models.distilbert import DistilBertConfig, DistilBertOnnxConfig
>>> config = DistilBertConfig()
>>> onnx_config = DistilBertOnnxConfig(config)
>>> print(list(onnx_config.outputs.keys()))
["last_hidden_state"]
```
The same process applies to TensorFlow checkpoints on the Hub. For example, export a pure TensorFlow checkpoint as follows:
```bash
python -m transformers.onnx --model=keras-io/transformers-qa onnx/
```
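Note that task-specific checkpoints expose different graph outputs than the base model. As a hedged sketch, they can be inspected with ONNX Runtime after the export above:
```python
from onnxruntime import InferenceSession

session = InferenceSession("onnx/model.onnx")
# a question-answering head typically yields start/end logits
# rather than last_hidden_state
print([output.name for output in session.get_outputs()])
```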
To export a locally stored model, save the model's weight and tokenizer files in the same directory, then point the `--model` argument of the `transformers.onnx` package to that directory to export to ONNX:
```bash
python -m transformers.onnx --model=local-pt-checkpoint onnx/
```

View File

@@ -0,0 +1,45 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Exporting 🤗 Transformers models to ONNX
🤗 Transformers provides a `transformers.onnx` package that lets you convert model checkpoints to an ONNX graph by leveraging configuration objects.
For more details, see the [guide](../serialization) on exporting 🤗 Transformers models.
## ONNX Configurations
Three abstract classes are provided, depending on the type of model architecture you want to export:
* Encoder-based models inherit from [`~onnx.config.OnnxConfig`]
* Decoder-based models inherit from [`~onnx.config.OnnxConfigWithPast`]
* Encoder-decoder models inherit from [`~onnx.config.OnnxSeq2SeqConfigWithPast`]
### OnnxConfig
[[autodoc]] onnx.config.OnnxConfig
### OnnxConfigWithPast
[[autodoc]] onnx.config.OnnxConfigWithPast
### OnnxSeq2SeqConfigWithPast
[[autodoc]] onnx.config.OnnxSeq2SeqConfigWithPast
## ONNX Features
Each ONNX configuration is associated with a set of _features_ that let you export models for different types of topologies or tasks.

View File

@@ -47,7 +47,7 @@ rendered properly in your Markdown viewer.
To export a 🤗 Transformers model to ONNX, first install the extra dependencies:
```bash
pip install optimum-onnx
pip install optimum[exporters]
```
Refer to the [🤗 Optimum documentation](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) for all available arguments, or view the help on the command line:
@@ -117,3 +117,53 @@ optimum-cli export onnx --model local_path --task question-answering distilbert_
### Exporting a model for an unsupported architecture
If you want to add support for a model that cannot currently be exported, first check whether it is supported in [`optimum.exporters.onnx`](https://huggingface.co/docs/optimum/exporters/onnx/overview); if not, you can [contribute to 🤗 Optimum](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute) directly.
### Exporting a model with `transformers.onnx`
<Tip warning={true}>
`transformers.onnx` is no longer maintained. Please export models with 🤗 Optimum as described above. This section will be removed in a future release.
</Tip>
To export a 🤗 Transformers model to ONNX with `transformers.onnx`, install the extra dependencies:
```bash
pip install transformers[onnx]
```
Use the `transformers.onnx` package as a Python module to export a checkpoint with a ready-made configuration:
```bash
python -m transformers.onnx --model=distilbert/distilbert-base-uncased onnx/
```
This exports an ONNX graph of the checkpoint defined by the `--model` argument. Pass any checkpoint on the 🤗 Hub or one stored locally. The resulting `model.onnx` file can run on one of the many accelerators that support the ONNX standard. For example, load and run the model with ONNX Runtime as follows:
```python
>>> from transformers import AutoTokenizer
>>> from onnxruntime import InferenceSession
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
>>> session = InferenceSession("onnx/model.onnx")
>>> # ONNX Runtime expects NumPy arrays as input
>>> inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np")
>>> outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
```
The required output names (e.g. `["last_hidden_state"]`) can be obtained by checking each model's ONNX configuration. For DistilBERT, for example:
```python
>>> from transformers.models.distilbert import DistilBertConfig, DistilBertOnnxConfig
>>> config = DistilBertConfig()
>>> onnx_config = DistilBertOnnxConfig(config)
>>> print(list(onnx_config.outputs.keys()))
["last_hidden_state"]
```
To export a locally stored model, save the model's weights and tokenizer files in the same directory (e.g. `local-pt-checkpoint`), then export to ONNX by pointing the `--model` argument of the `transformers.onnx` package to that directory:
```bash
python -m transformers.onnx --model=local-pt-checkpoint onnx/
```

View File

@@ -129,6 +129,8 @@ _import_structure = {
],
"loss": [],
"modelcard": ["ModelCard"],
# Models
"onnx": [],
"pipelines": [
"AudioClassificationPipeline",
"AutomaticSpeechRecognitionPipeline",

View File

@@ -442,6 +442,75 @@ def normalize(
return image
def unnormalize(
image: Union[np.ndarray, "torch.Tensor"],
mean: Union[float, Collection[float]],
std: Union[float, Collection[float]],
data_format: Optional[ChannelDimension] = None,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
) -> np.ndarray:
"""
Inverse of `normalize`:
image = image * std + mean
Args:
image (`np.ndarray` or `torch.Tensor`):
The image to unnormalize.
mean (`float` or `Collection[float]`):
The mean to use for unnormalization.
std (`float` or `Collection[float]`):
The standard deviation to use for unnormalization.
data_format (`ChannelDimension`, *optional*):
The channel dimension format of the output image. If unset, will use the inferred format from the input.
input_data_format (`ChannelDimension`, *optional*):
The channel dimension format of the input image. If unset, will use the inferred format from the input.
Returns:
`np.ndarray`: The unnormalized image.
"""
is_torch_input = isinstance(image, torch.Tensor)
if is_torch_input:
image = image.detach().cpu().numpy()
elif not isinstance(image, np.ndarray):
raise TypeError("image must be a numpy array or a torch tensor")
if input_data_format is None:
input_data_format = infer_channel_dimension_format(image)
if not np.issubdtype(image.dtype, np.floating):
image = image.astype(np.float32)
channel_axis = get_channel_dimension_axis(image, input_data_format=input_data_format)
num_channels = image.shape[channel_axis]
if isinstance(mean, Collection):
if len(mean) != num_channels:
raise ValueError(f"mean must have {num_channels} elements if it is an iterable, got {len(mean)}")
else:
mean = [mean] * num_channels
mean = np.array(mean, dtype=image.dtype)
if isinstance(std, Collection):
if len(std) != num_channels:
raise ValueError(f"std must have {num_channels} elements if it is an iterable, got {len(std)}")
else:
std = [std] * num_channels
std = np.array(std, dtype=image.dtype)
if input_data_format == ChannelDimension.LAST:
image = image * std + mean
else:
shape = [1] * image.ndim
shape[channel_axis] = num_channels
mean = mean.reshape(shape)
std = std.reshape(shape)
image = image * std + mean
image = to_channel_dimension_format(image, data_format, input_data_format) if data_format is not None else image
return image
def center_crop(
image: np.ndarray,
size: tuple[int, int],

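As a round-trip sketch of the `unnormalize` helper added above, assuming it is importable from `transformers.image_transforms` next to `normalize`:
```python
import numpy as np

from transformers.image_transforms import normalize, unnormalize

# channels-last float image with per-channel statistics
image = np.random.rand(224, 224, 3).astype(np.float32)
mean, std = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

normalized = normalize(image, mean=mean, std=std)
restored = unnormalize(normalized, mean=mean, std=std)

# unnormalize inverts normalize up to floating-point error
assert np.allclose(restored, image, atol=1e-5)
```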
View File

@@ -15,7 +15,11 @@
# limitations under the License.
"""ALBERT model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
class AlbertConfig(PreTrainedConfig):
@@ -138,4 +142,21 @@ class AlbertConfig(PreTrainedConfig):
self.classifier_dropout_prob = classifier_dropout_prob
__all__ = ["AlbertConfig"]
# Copied from transformers.models.bert.configuration_bert.BertOnnxConfig with Roberta->Albert
class AlbertOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
("token_type_ids", dynamic_axis),
]
)
__all__ = ["AlbertConfig", "AlbertOnnxConfig"]
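A minimal usage sketch, mirroring the DistilBERT example in the docs above and assuming the class is exposed as in this diff:
```python
from transformers import AlbertConfig
from transformers.models.albert.configuration_albert import AlbertOnnxConfig

onnx_config = AlbertOnnxConfig(AlbertConfig())
print(list(onnx_config.inputs.keys()))
# ['input_ids', 'attention_mask', 'token_type_ids']
```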

View File

@@ -15,9 +15,15 @@
"""BART model configuration"""
import warnings
from collections import OrderedDict
from collections.abc import Mapping
from typing import Any
from ... import PreTrainedTokenizer
from ...configuration_utils import PreTrainedConfig
from ...utils import logging
from ...onnx import OnnxConfig, OnnxConfigWithPast, OnnxSeq2SeqConfigWithPast
from ...onnx.utils import compute_effective_axis_dimension
from ...utils import is_torch_available, logging
logger = logging.get_logger(__name__)
@@ -174,4 +180,223 @@ class BartConfig(PreTrainedConfig):
)
__all__ = ["BartConfig"]
class BartOnnxConfig(OnnxSeq2SeqConfigWithPast):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task in ["default", "seq2seq-lm"]:
common_inputs = OrderedDict(
[
("input_ids", {0: "batch", 1: "encoder_sequence"}),
("attention_mask", {0: "batch", 1: "encoder_sequence"}),
]
)
if self.use_past:
common_inputs["decoder_input_ids"] = {0: "batch"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "past_decoder_sequence + sequence"}
else:
common_inputs["decoder_input_ids"] = {0: "batch", 1: "decoder_sequence"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "decoder_sequence"}
if self.use_past:
self.fill_with_past_key_values_(common_inputs, direction="inputs")
elif self.task == "causal-lm":
# TODO: figure this case out.
common_inputs = OrderedDict(
[
("input_ids", {0: "batch", 1: "encoder_sequence"}),
("attention_mask", {0: "batch", 1: "encoder_sequence"}),
]
)
if self.use_past:
num_encoder_layers, _ = self.num_layers
for i in range(num_encoder_layers):
common_inputs[f"past_key_values.{i}.key"] = {0: "batch", 2: "past_sequence + sequence"}
common_inputs[f"past_key_values.{i}.value"] = {0: "batch", 2: "past_sequence + sequence"}
else:
common_inputs = OrderedDict(
[
("input_ids", {0: "batch", 1: "encoder_sequence"}),
("attention_mask", {0: "batch", 1: "encoder_sequence"}),
("decoder_input_ids", {0: "batch", 1: "decoder_sequence"}),
("decoder_attention_mask", {0: "batch", 1: "decoder_sequence"}),
]
)
return common_inputs
@property
def outputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task in ["default", "seq2seq-lm"]:
common_outputs = super().outputs
else:
common_outputs = super(OnnxConfigWithPast, self).outputs
if self.use_past:
num_encoder_layers, _ = self.num_layers
for i in range(num_encoder_layers):
common_outputs[f"present.{i}.key"] = {0: "batch", 2: "past_sequence + sequence"}
common_outputs[f"present.{i}.value"] = {0: "batch", 2: "past_sequence + sequence"}
return common_outputs
def _generate_dummy_inputs_for_default_and_seq2seq_lm(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
encoder_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size, seq_length, is_pair
)
# Generate decoder inputs
decoder_seq_length = seq_length if not self.use_past else 1
decoder_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size, decoder_seq_length, is_pair
)
decoder_inputs = {f"decoder_{name}": tensor for name, tensor in decoder_inputs.items()}
common_inputs = dict(**encoder_inputs, **decoder_inputs)
if self.use_past:
if not is_torch_available():
raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
else:
import torch
batch, encoder_seq_length = common_inputs["input_ids"].shape
decoder_seq_length = common_inputs["decoder_input_ids"].shape[1]
num_encoder_attention_heads, num_decoder_attention_heads = self.num_attention_heads
encoder_shape = (
batch,
num_encoder_attention_heads,
encoder_seq_length,
self._config.hidden_size // num_encoder_attention_heads,
)
decoder_past_length = decoder_seq_length + 3
decoder_shape = (
batch,
num_decoder_attention_heads,
decoder_past_length,
self._config.hidden_size // num_decoder_attention_heads,
)
common_inputs["decoder_attention_mask"] = torch.cat(
[common_inputs["decoder_attention_mask"], torch.ones(batch, decoder_past_length)], dim=1
)
common_inputs["past_key_values"] = []
# If the number of encoder and decoder layers are present in the model configuration, both are considered
num_encoder_layers, num_decoder_layers = self.num_layers
min_num_layers = min(num_encoder_layers, num_decoder_layers)
max_num_layers = max(num_encoder_layers, num_decoder_layers) - min_num_layers
remaining_side_name = "encoder" if num_encoder_layers > num_decoder_layers else "decoder"
for _ in range(min_num_layers):
common_inputs["past_key_values"].append(
(
torch.zeros(decoder_shape),
torch.zeros(decoder_shape),
torch.zeros(encoder_shape),
torch.zeros(encoder_shape),
)
)
# TODO: test this.
shape = encoder_shape if remaining_side_name == "encoder" else decoder_shape
for _ in range(min_num_layers, max_num_layers):
common_inputs["past_key_values"].append((torch.zeros(shape), torch.zeros(shape)))
return common_inputs
def _generate_dummy_inputs_for_causal_lm(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
common_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size, seq_length, is_pair
)
if self.use_past:
if not is_torch_available():
raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
else:
import torch
batch, seqlen = common_inputs["input_ids"].shape
# Not using the same length for past_key_values
past_key_values_length = seqlen + 2
num_encoder_layers, _ = self.num_layers
num_encoder_attention_heads, _ = self.num_attention_heads
past_shape = (
batch,
num_encoder_attention_heads,
past_key_values_length,
self._config.hidden_size // num_encoder_attention_heads,
)
mask_dtype = common_inputs["attention_mask"].dtype
common_inputs["attention_mask"] = torch.cat(
[common_inputs["attention_mask"], torch.ones(batch, past_key_values_length, dtype=mask_dtype)], dim=1
)
common_inputs["past_key_values"] = [
(torch.zeros(past_shape), torch.zeros(past_shape)) for _ in range(num_encoder_layers)
]
return common_inputs
def _generate_dummy_inputs_for_sequence_classification_and_question_answering(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
# Copied from OnnxConfig.generate_dummy_inputs
# Did not use super(OnnxConfigWithPast, self).generate_dummy_inputs for code clarity.
# If dynamic axis (-1) we forward with a fixed dimension of 2 samples to avoid optimizations made by ONNX
batch_size = compute_effective_axis_dimension(
batch_size, fixed_dimension=OnnxConfig.default_fixed_batch, num_token_to_add=0
)
# If dynamic axis (-1) we forward with a fixed dimension of 8 tokens to avoid optimizations made by ONNX
token_to_add = tokenizer.num_special_tokens_to_add(is_pair)
seq_length = compute_effective_axis_dimension(
seq_length, fixed_dimension=OnnxConfig.default_fixed_sequence, num_token_to_add=token_to_add
)
# Generate dummy inputs according to compute batch and sequence
dummy_input = [" ".join([tokenizer.unk_token]) * seq_length] * batch_size
common_inputs = dict(tokenizer(dummy_input, return_tensors="pt"))
return common_inputs
def generate_dummy_inputs(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
if self.task in ["default", "seq2seq-lm"]:
common_inputs = self._generate_dummy_inputs_for_default_and_seq2seq_lm(
tokenizer, batch_size=batch_size, seq_length=seq_length, is_pair=is_pair
)
elif self.task == "causal-lm":
common_inputs = self._generate_dummy_inputs_for_causal_lm(
tokenizer, batch_size=batch_size, seq_length=seq_length, is_pair=is_pair
)
else:
common_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size=batch_size, seq_length=seq_length, is_pair=is_pair
)
return common_inputs
def _flatten_past_key_values_(self, flattened_output, name, idx, t):
if self.task in ["default", "seq2seq-lm"]:
flattened_output = super()._flatten_past_key_values_(flattened_output, name, idx, t)
else:
flattened_output = super(OnnxSeq2SeqConfigWithPast, self)._flatten_past_key_values_(
flattened_output, name, idx, t
)
__all__ = ["BartConfig", "BartOnnxConfig"]
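A hedged sketch of the dummy-input machinery above; it assumes PyTorch is installed, network access for the tokenizer, and that the class is exposed as in this diff:
```python
from transformers import AutoTokenizer, BartConfig
from transformers.models.bart.configuration_bart import BartOnnxConfig

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
onnx_config = BartOnnxConfig(BartConfig(), task="default")

print(list(onnx_config.inputs.keys()))
# ['input_ids', 'attention_mask', 'decoder_input_ids', 'decoder_attention_mask']

# batch_size/seq_length of -1 fall back to the fixed dummy dimensions (2 and 8)
dummy_inputs = onnx_config.generate_dummy_inputs(tokenizer)
print({name: tuple(tensor.shape) for name, tensor in dummy_inputs.items()})
```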

View File

@@ -15,8 +15,13 @@
"""BEiT model configuration"""
import warnings
from collections import OrderedDict
from collections.abc import Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils.backbone_utils import BackboneConfigMixin, get_aligned_output_features_output_indices
@@ -204,4 +209,21 @@ class BeitConfig(BackboneConfigMixin, PreTrainedConfig):
self.reshape_hidden_states = reshape_hidden_states
__all__ = ["BeitConfig"]
# Copied from transformers.models.vit.configuration_vit.ViTOnnxConfig
class BeitOnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-4
__all__ = ["BeitConfig", "BeitOnnxConfig"]
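For illustration (same assumptions as the other sketches in this diff), the vision configuration exposes a single `pixel_values` input with four dynamic axes and a looser validation tolerance:
```python
from transformers import BeitConfig
from transformers.models.beit.configuration_beit import BeitOnnxConfig

onnx_config = BeitOnnxConfig(BeitConfig())
print(dict(onnx_config.inputs))         # {'pixel_values': {0: 'batch', ...}}
print(onnx_config.atol_for_validation)  # 0.0001
```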

View File

@@ -15,7 +15,11 @@
# limitations under the License.
"""BERT model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@@ -123,4 +127,20 @@ class BertConfig(PreTrainedConfig):
self.classifier_dropout = classifier_dropout
__all__ = ["BertConfig"]
class BertOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
("token_type_ids", dynamic_axis),
]
)
__all__ = ["BertConfig", "BertOnnxConfig"]

View File

@@ -14,7 +14,11 @@
# limitations under the License.
"""BigBird model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@@ -154,4 +158,19 @@ class BigBirdConfig(PreTrainedConfig):
self.classifier_dropout = classifier_dropout
__all__ = ["BigBirdConfig"]
class BigBirdOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
]
)
__all__ = ["BigBirdConfig", "BigBirdOnnxConfig"]

View File

@@ -14,8 +14,15 @@
# limitations under the License.
"""BigBirdPegasus model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from typing import Any
from ... import PreTrainedTokenizer
from ...configuration_utils import PreTrainedConfig
from ...utils import logging
from ...onnx import OnnxConfig, OnnxConfigWithPast, OnnxSeq2SeqConfigWithPast
from ...onnx.utils import compute_effective_axis_dimension
from ...utils import is_torch_available, logging
logger = logging.get_logger(__name__)
@@ -179,4 +186,224 @@ class BigBirdPegasusConfig(PreTrainedConfig):
)
__all__ = ["BigBirdPegasusConfig"]
# Copied from transformers.models.bart.configuration_bart.BartOnnxConfig with Bart->BigBirdPegasus
class BigBirdPegasusOnnxConfig(OnnxSeq2SeqConfigWithPast):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task in ["default", "seq2seq-lm"]:
common_inputs = OrderedDict(
[
("input_ids", {0: "batch", 1: "encoder_sequence"}),
("attention_mask", {0: "batch", 1: "encoder_sequence"}),
]
)
if self.use_past:
common_inputs["decoder_input_ids"] = {0: "batch"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "past_decoder_sequence + sequence"}
else:
common_inputs["decoder_input_ids"] = {0: "batch", 1: "decoder_sequence"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "decoder_sequence"}
if self.use_past:
self.fill_with_past_key_values_(common_inputs, direction="inputs")
elif self.task == "causal-lm":
# TODO: figure this case out.
common_inputs = OrderedDict(
[
("input_ids", {0: "batch", 1: "encoder_sequence"}),
("attention_mask", {0: "batch", 1: "encoder_sequence"}),
]
)
if self.use_past:
num_encoder_layers, _ = self.num_layers
for i in range(num_encoder_layers):
common_inputs[f"past_key_values.{i}.key"] = {0: "batch", 2: "past_sequence + sequence"}
common_inputs[f"past_key_values.{i}.value"] = {0: "batch", 2: "past_sequence + sequence"}
else:
common_inputs = OrderedDict(
[
("input_ids", {0: "batch", 1: "encoder_sequence"}),
("attention_mask", {0: "batch", 1: "encoder_sequence"}),
("decoder_input_ids", {0: "batch", 1: "decoder_sequence"}),
("decoder_attention_mask", {0: "batch", 1: "decoder_sequence"}),
]
)
return common_inputs
@property
def outputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task in ["default", "seq2seq-lm"]:
common_outputs = super().outputs
else:
common_outputs = super(OnnxConfigWithPast, self).outputs
if self.use_past:
num_encoder_layers, _ = self.num_layers
for i in range(num_encoder_layers):
common_outputs[f"present.{i}.key"] = {0: "batch", 2: "past_sequence + sequence"}
common_outputs[f"present.{i}.value"] = {0: "batch", 2: "past_sequence + sequence"}
return common_outputs
def _generate_dummy_inputs_for_default_and_seq2seq_lm(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
encoder_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size, seq_length, is_pair
)
# Generate decoder inputs
decoder_seq_length = seq_length if not self.use_past else 1
decoder_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size, decoder_seq_length, is_pair
)
decoder_inputs = {f"decoder_{name}": tensor for name, tensor in decoder_inputs.items()}
common_inputs = dict(**encoder_inputs, **decoder_inputs)
if self.use_past:
if not is_torch_available():
raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
else:
import torch
batch, encoder_seq_length = common_inputs["input_ids"].shape
decoder_seq_length = common_inputs["decoder_input_ids"].shape[1]
num_encoder_attention_heads, num_decoder_attention_heads = self.num_attention_heads
encoder_shape = (
batch,
num_encoder_attention_heads,
encoder_seq_length,
self._config.hidden_size // num_encoder_attention_heads,
)
decoder_past_length = decoder_seq_length + 3
decoder_shape = (
batch,
num_decoder_attention_heads,
decoder_past_length,
self._config.hidden_size // num_decoder_attention_heads,
)
common_inputs["decoder_attention_mask"] = torch.cat(
[common_inputs["decoder_attention_mask"], torch.ones(batch, decoder_past_length)], dim=1
)
common_inputs["past_key_values"] = []
# If the number of encoder and decoder layers are present in the model configuration, both are considered
num_encoder_layers, num_decoder_layers = self.num_layers
min_num_layers = min(num_encoder_layers, num_decoder_layers)
max_num_layers = max(num_encoder_layers, num_decoder_layers) - min_num_layers
remaining_side_name = "encoder" if num_encoder_layers > num_decoder_layers else "decoder"
for _ in range(min_num_layers):
common_inputs["past_key_values"].append(
(
torch.zeros(decoder_shape),
torch.zeros(decoder_shape),
torch.zeros(encoder_shape),
torch.zeros(encoder_shape),
)
)
# TODO: test this.
shape = encoder_shape if remaining_side_name == "encoder" else decoder_shape
for _ in range(min_num_layers, max_num_layers):
common_inputs["past_key_values"].append((torch.zeros(shape), torch.zeros(shape)))
return common_inputs
def _generate_dummy_inputs_for_causal_lm(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
common_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size, seq_length, is_pair
)
if self.use_past:
if not is_torch_available():
raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
else:
import torch
batch, seqlen = common_inputs["input_ids"].shape
# Not using the same length for past_key_values
past_key_values_length = seqlen + 2
num_encoder_layers, _ = self.num_layers
num_encoder_attention_heads, _ = self.num_attention_heads
past_shape = (
batch,
num_encoder_attention_heads,
past_key_values_length,
self._config.hidden_size // num_encoder_attention_heads,
)
mask_dtype = common_inputs["attention_mask"].dtype
common_inputs["attention_mask"] = torch.cat(
[common_inputs["attention_mask"], torch.ones(batch, past_key_values_length, dtype=mask_dtype)], dim=1
)
common_inputs["past_key_values"] = [
(torch.zeros(past_shape), torch.zeros(past_shape)) for _ in range(num_encoder_layers)
]
return common_inputs
def _generate_dummy_inputs_for_sequence_classification_and_question_answering(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
# Copied from OnnxConfig.generate_dummy_inputs
# Did not use super(OnnxConfigWithPast, self).generate_dummy_inputs for code clarity.
# If dynamic axis (-1) we forward with a fixed dimension of 2 samples to avoid optimizations made by ONNX
batch_size = compute_effective_axis_dimension(
batch_size, fixed_dimension=OnnxConfig.default_fixed_batch, num_token_to_add=0
)
# If dynamic axis (-1) we forward with a fixed dimension of 8 tokens to avoid optimizations made by ONNX
token_to_add = tokenizer.num_special_tokens_to_add(is_pair)
seq_length = compute_effective_axis_dimension(
seq_length, fixed_dimension=OnnxConfig.default_fixed_sequence, num_token_to_add=token_to_add
)
# Generate dummy inputs according to compute batch and sequence
dummy_input = [" ".join([tokenizer.unk_token]) * seq_length] * batch_size
common_inputs = dict(tokenizer(dummy_input, return_tensors="pt"))
return common_inputs
def generate_dummy_inputs(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
if self.task in ["default", "seq2seq-lm"]:
common_inputs = self._generate_dummy_inputs_for_default_and_seq2seq_lm(
tokenizer, batch_size=batch_size, seq_length=seq_length, is_pair=is_pair
)
elif self.task == "causal-lm":
common_inputs = self._generate_dummy_inputs_for_causal_lm(
tokenizer, batch_size=batch_size, seq_length=seq_length, is_pair=is_pair
)
else:
common_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size=batch_size, seq_length=seq_length, is_pair=is_pair
)
return common_inputs
def _flatten_past_key_values_(self, flattened_output, name, idx, t):
if self.task in ["default", "seq2seq-lm"]:
flattened_output = super()._flatten_past_key_values_(flattened_output, name, idx, t)
else:
flattened_output = super(OnnxSeq2SeqConfigWithPast, self)._flatten_past_key_values_(
flattened_output, name, idx, t
)
__all__ = ["BigBirdPegasusConfig", "BigBirdPegasusOnnxConfig"]

View File

@@ -14,7 +14,15 @@
# limitations under the License.
"""Blenderbot model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from typing import Any
from ... import PreTrainedTokenizer
from ...configuration_utils import PreTrainedConfig
from ...file_utils import is_torch_available
from ...onnx import OnnxConfig, OnnxConfigWithPast, OnnxSeq2SeqConfigWithPast
from ...onnx.utils import compute_effective_axis_dimension
from ...utils import logging
@@ -158,4 +166,227 @@ class BlenderbotConfig(PreTrainedConfig):
)
__all__ = ["BlenderbotConfig"]
class BlenderbotOnnxConfig(OnnxSeq2SeqConfigWithPast):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task in ["default", "seq2seq-lm"]:
common_inputs = OrderedDict(
[
("input_ids", {0: "batch", 1: "encoder_sequence"}),
("attention_mask", {0: "batch", 1: "encoder_sequence"}),
]
)
if self.use_past:
common_inputs["decoder_input_ids"] = {0: "batch"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "past_decoder_sequence + sequence"}
else:
common_inputs["decoder_input_ids"] = {0: "batch", 1: "decoder_sequence"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "decoder_sequence"}
if self.use_past:
self.fill_with_past_key_values_(common_inputs, direction="inputs")
elif self.task == "causal-lm":
common_inputs = OrderedDict(
[
("input_ids", {0: "batch", 1: "encoder_sequence"}),
("attention_mask", {0: "batch", 1: "encoder_sequence"}),
]
)
if self.use_past:
_, num_decoder_layers = self.num_layers
for i in range(num_decoder_layers):
common_inputs[f"past_key_values.{i}.key"] = {0: "batch", 2: "past_sequence + sequence"}
common_inputs[f"past_key_values.{i}.value"] = {0: "batch", 2: "past_sequence + sequence"}
else:
common_inputs = OrderedDict(
[
("input_ids", {0: "batch", 1: "encoder_sequence"}),
("attention_mask", {0: "batch", 1: "encoder_sequence"}),
("decoder_input_ids", {0: "batch", 1: "decoder_sequence"}),
("decoder_attention_mask", {0: "batch", 1: "decoder_sequence"}),
]
)
return common_inputs
@property
# Copied from transformers.models.bart.configuration_bart.BartOnnxConfig.outputs
def outputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task in ["default", "seq2seq-lm"]:
common_outputs = super().outputs
else:
common_outputs = super(OnnxConfigWithPast, self).outputs
if self.use_past:
num_encoder_layers, _ = self.num_layers
for i in range(num_encoder_layers):
common_outputs[f"present.{i}.key"] = {0: "batch", 2: "past_sequence + sequence"}
common_outputs[f"present.{i}.value"] = {0: "batch", 2: "past_sequence + sequence"}
return common_outputs
def _generate_dummy_inputs_for_default_and_seq2seq_lm(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
encoder_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size, seq_length, is_pair
)
# Generate decoder inputs
decoder_seq_length = seq_length if not self.use_past else 1
decoder_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size, decoder_seq_length, is_pair
)
decoder_inputs = {f"decoder_{name}": tensor for name, tensor in decoder_inputs.items()}
common_inputs = dict(**encoder_inputs, **decoder_inputs)
if self.use_past:
if not is_torch_available():
raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
else:
import torch
batch, encoder_seq_length = common_inputs["input_ids"].shape
decoder_seq_length = common_inputs["decoder_input_ids"].shape[1]
num_encoder_attention_heads, num_decoder_attention_heads = self.num_attention_heads
encoder_shape = (
batch,
num_encoder_attention_heads,
encoder_seq_length,
self._config.hidden_size // num_encoder_attention_heads,
)
decoder_past_length = decoder_seq_length
decoder_shape = (
batch,
num_decoder_attention_heads,
decoder_past_length,
self._config.hidden_size // num_decoder_attention_heads,
)
common_inputs["decoder_attention_mask"] = torch.cat(
[common_inputs["decoder_attention_mask"], torch.ones(batch, decoder_past_length)], dim=1
)
common_inputs["past_key_values"] = []
_, num_decoder_layers = self.num_layers
for _ in range(num_decoder_layers):
common_inputs["past_key_values"].append(
(
torch.zeros(decoder_shape),
torch.zeros(decoder_shape),
torch.zeros(encoder_shape),
torch.zeros(encoder_shape),
)
)
return common_inputs
def _generate_dummy_inputs_for_causal_lm(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
common_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size, seq_length, is_pair
)
if self.use_past:
if not is_torch_available():
raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
else:
import torch
batch, seqlen = common_inputs["input_ids"].shape
past_key_values_length = seqlen
_, num_decoder_layers = self.num_layers
num_encoder_attention_heads, _ = self.num_attention_heads
past_shape = (
batch,
num_encoder_attention_heads,
past_key_values_length,
self._config.hidden_size // num_encoder_attention_heads,
)
mask_dtype = common_inputs["attention_mask"].dtype
common_inputs["attention_mask"] = torch.cat(
[common_inputs["attention_mask"], torch.ones(batch, past_key_values_length, dtype=mask_dtype)], dim=1
)
common_inputs["past_key_values"] = [
(torch.zeros(past_shape), torch.zeros(past_shape)) for _ in range(num_decoder_layers)
]
return common_inputs
# Copied from transformers.models.bart.configuration_bart.BartOnnxConfig._generate_dummy_inputs_for_sequence_classification_and_question_answering
def _generate_dummy_inputs_for_sequence_classification_and_question_answering(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
# Copied from OnnxConfig.generate_dummy_inputs
# Did not use super(OnnxConfigWithPast, self).generate_dummy_inputs for code clarity.
# If dynamic axis (-1) we forward with a fixed dimension of 2 samples to avoid optimizations made by ONNX
batch_size = compute_effective_axis_dimension(
batch_size, fixed_dimension=OnnxConfig.default_fixed_batch, num_token_to_add=0
)
# If dynamic axis (-1) we forward with a fixed dimension of 8 tokens to avoid optimizations made by ONNX
token_to_add = tokenizer.num_special_tokens_to_add(is_pair)
seq_length = compute_effective_axis_dimension(
seq_length, fixed_dimension=OnnxConfig.default_fixed_sequence, num_token_to_add=token_to_add
)
# Generate dummy inputs according to compute batch and sequence
dummy_input = [" ".join([tokenizer.unk_token]) * seq_length] * batch_size
common_inputs = dict(tokenizer(dummy_input, return_tensors="pt"))
return common_inputs
# Copied from transformers.models.bart.configuration_bart.BartOnnxConfig.generate_dummy_inputs
def generate_dummy_inputs(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
if self.task in ["default", "seq2seq-lm"]:
common_inputs = self._generate_dummy_inputs_for_default_and_seq2seq_lm(
tokenizer, batch_size=batch_size, seq_length=seq_length, is_pair=is_pair
)
elif self.task == "causal-lm":
common_inputs = self._generate_dummy_inputs_for_causal_lm(
tokenizer, batch_size=batch_size, seq_length=seq_length, is_pair=is_pair
)
else:
common_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size=batch_size, seq_length=seq_length, is_pair=is_pair
)
return common_inputs
# Copied from transformers.models.bart.configuration_bart.BartOnnxConfig._flatten_past_key_values_
def _flatten_past_key_values_(self, flattened_output, name, idx, t):
if self.task in ["default", "seq2seq-lm"]:
flattened_output = super()._flatten_past_key_values_(flattened_output, name, idx, t)
else:
flattened_output = super(OnnxSeq2SeqConfigWithPast, self)._flatten_past_key_values_(
flattened_output, name, idx, t
)
def fill_with_past_key_values_(self, inputs_or_outputs: Mapping[str, Mapping[int, str]], direction: str):
if direction not in ["inputs", "outputs"]:
raise ValueError(f'direction must either be "inputs" or "outputs", but {direction} was given')
name = "past_key_values" if direction == "inputs" else "present"
_, num_decoder_layers = self.num_layers
encoder_sequence = "past_encoder_sequence"
decoder_sequence = "past_decoder_sequence" if direction == "inputs" else "past_decoder_sequence + sequence"
for i in range(num_decoder_layers):
inputs_or_outputs[f"{name}.{i}.decoder.key"] = {0: "batch", 2: decoder_sequence}
inputs_or_outputs[f"{name}.{i}.decoder.value"] = {0: "batch", 2: decoder_sequence}
inputs_or_outputs[f"{name}.{i}.encoder.key"] = {0: "batch", 2: encoder_sequence}
inputs_or_outputs[f"{name}.{i}.encoder.value"] = {0: "batch", 2: encoder_sequence}
__all__ = ["BlenderbotConfig", "BlenderbotOnnxConfig"]
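Because `fill_with_past_key_values_` is overridden above to iterate over decoder layers only, the past-key-value input names differ from the BART-style configs. A sketch, under the same assumptions as above:
```python
from transformers import BlenderbotConfig
from transformers.models.blenderbot.configuration_blenderbot import BlenderbotOnnxConfig

onnx_config = BlenderbotOnnxConfig(BlenderbotConfig(), task="default", use_past=True)
past_inputs = [name for name in onnx_config.inputs if name.startswith("past_key_values")]
print(past_inputs[:4])
# ['past_key_values.0.decoder.key', 'past_key_values.0.decoder.value',
#  'past_key_values.0.encoder.key', 'past_key_values.0.encoder.value']
```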

View File

@@ -14,7 +14,15 @@
# limitations under the License.
"""BlenderbotSmall model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from typing import Any
from ... import PreTrainedTokenizer
from ...configuration_utils import PreTrainedConfig
from ...file_utils import is_torch_available
from ...onnx import OnnxConfig, OnnxConfigWithPast, OnnxSeq2SeqConfigWithPast
from ...onnx.utils import compute_effective_axis_dimension
from ...utils import logging
@@ -156,4 +164,224 @@ class BlenderbotSmallConfig(PreTrainedConfig):
)
__all__ = ["BlenderbotSmallConfig"]
# Copied from transformers.models.bart.configuration_bart.BartOnnxConfig with Bart->BlenderbotSmall
class BlenderbotSmallOnnxConfig(OnnxSeq2SeqConfigWithPast):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task in ["default", "seq2seq-lm"]:
common_inputs = OrderedDict(
[
("input_ids", {0: "batch", 1: "encoder_sequence"}),
("attention_mask", {0: "batch", 1: "encoder_sequence"}),
]
)
if self.use_past:
common_inputs["decoder_input_ids"] = {0: "batch"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "past_decoder_sequence + sequence"}
else:
common_inputs["decoder_input_ids"] = {0: "batch", 1: "decoder_sequence"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "decoder_sequence"}
if self.use_past:
self.fill_with_past_key_values_(common_inputs, direction="inputs")
elif self.task == "causal-lm":
# TODO: figure this case out.
common_inputs = OrderedDict(
[
("input_ids", {0: "batch", 1: "encoder_sequence"}),
("attention_mask", {0: "batch", 1: "encoder_sequence"}),
]
)
if self.use_past:
num_encoder_layers, _ = self.num_layers
for i in range(num_encoder_layers):
common_inputs[f"past_key_values.{i}.key"] = {0: "batch", 2: "past_sequence + sequence"}
common_inputs[f"past_key_values.{i}.value"] = {0: "batch", 2: "past_sequence + sequence"}
else:
common_inputs = OrderedDict(
[
("input_ids", {0: "batch", 1: "encoder_sequence"}),
("attention_mask", {0: "batch", 1: "encoder_sequence"}),
("decoder_input_ids", {0: "batch", 1: "decoder_sequence"}),
("decoder_attention_mask", {0: "batch", 1: "decoder_sequence"}),
]
)
return common_inputs
@property
def outputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task in ["default", "seq2seq-lm"]:
common_outputs = super().outputs
else:
common_outputs = super(OnnxConfigWithPast, self).outputs
if self.use_past:
num_encoder_layers, _ = self.num_layers
for i in range(num_encoder_layers):
common_outputs[f"present.{i}.key"] = {0: "batch", 2: "past_sequence + sequence"}
common_outputs[f"present.{i}.value"] = {0: "batch", 2: "past_sequence + sequence"}
return common_outputs
def _generate_dummy_inputs_for_default_and_seq2seq_lm(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
encoder_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size, seq_length, is_pair
)
# Generate decoder inputs
decoder_seq_length = seq_length if not self.use_past else 1
decoder_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size, decoder_seq_length, is_pair
)
decoder_inputs = {f"decoder_{name}": tensor for name, tensor in decoder_inputs.items()}
common_inputs = dict(**encoder_inputs, **decoder_inputs)
if self.use_past:
if not is_torch_available():
raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
else:
import torch
batch, encoder_seq_length = common_inputs["input_ids"].shape
decoder_seq_length = common_inputs["decoder_input_ids"].shape[1]
num_encoder_attention_heads, num_decoder_attention_heads = self.num_attention_heads
encoder_shape = (
batch,
num_encoder_attention_heads,
encoder_seq_length,
self._config.hidden_size // num_encoder_attention_heads,
)
decoder_past_length = decoder_seq_length + 3
decoder_shape = (
batch,
num_decoder_attention_heads,
decoder_past_length,
self._config.hidden_size // num_decoder_attention_heads,
)
common_inputs["decoder_attention_mask"] = torch.cat(
[common_inputs["decoder_attention_mask"], torch.ones(batch, decoder_past_length)], dim=1
)
common_inputs["past_key_values"] = []
# If the number of encoder and decoder layers are present in the model configuration, both are considered
num_encoder_layers, num_decoder_layers = self.num_layers
min_num_layers = min(num_encoder_layers, num_decoder_layers)
max_num_layers = max(num_encoder_layers, num_decoder_layers) - min_num_layers
remaining_side_name = "encoder" if num_encoder_layers > num_decoder_layers else "decoder"
for _ in range(min_num_layers):
common_inputs["past_key_values"].append(
(
torch.zeros(decoder_shape),
torch.zeros(decoder_shape),
torch.zeros(encoder_shape),
torch.zeros(encoder_shape),
)
)
# TODO: test this.
shape = encoder_shape if remaining_side_name == "encoder" else decoder_shape
for _ in range(min_num_layers, max_num_layers):
common_inputs["past_key_values"].append((torch.zeros(shape), torch.zeros(shape)))
return common_inputs
def _generate_dummy_inputs_for_causal_lm(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
common_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size, seq_length, is_pair
)
if self.use_past:
if not is_torch_available():
raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
else:
import torch
batch, seqlen = common_inputs["input_ids"].shape
# Not using the same length for past_key_values
past_key_values_length = seqlen + 2
num_encoder_layers, _ = self.num_layers
num_encoder_attention_heads, _ = self.num_attention_heads
past_shape = (
batch,
num_encoder_attention_heads,
past_key_values_length,
self._config.hidden_size // num_encoder_attention_heads,
)
mask_dtype = common_inputs["attention_mask"].dtype
common_inputs["attention_mask"] = torch.cat(
[common_inputs["attention_mask"], torch.ones(batch, past_key_values_length, dtype=mask_dtype)], dim=1
)
common_inputs["past_key_values"] = [
(torch.zeros(past_shape), torch.zeros(past_shape)) for _ in range(num_encoder_layers)
]
return common_inputs
def _generate_dummy_inputs_for_sequence_classification_and_question_answering(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
# Copied from OnnxConfig.generate_dummy_inputs
# Did not use super(OnnxConfigWithPast, self).generate_dummy_inputs for code clarity.
# If dynamic axis (-1) we forward with a fixed dimension of 2 samples to avoid optimizations made by ONNX
batch_size = compute_effective_axis_dimension(
batch_size, fixed_dimension=OnnxConfig.default_fixed_batch, num_token_to_add=0
)
# If dynamic axis (-1) we forward with a fixed dimension of 8 tokens to avoid optimizations made by ONNX
token_to_add = tokenizer.num_special_tokens_to_add(is_pair)
seq_length = compute_effective_axis_dimension(
seq_length, fixed_dimension=OnnxConfig.default_fixed_sequence, num_token_to_add=token_to_add
)
# Generate dummy inputs according to compute batch and sequence
dummy_input = [" ".join([tokenizer.unk_token]) * seq_length] * batch_size
common_inputs = dict(tokenizer(dummy_input, return_tensors="pt"))
return common_inputs
def generate_dummy_inputs(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
if self.task in ["default", "seq2seq-lm"]:
common_inputs = self._generate_dummy_inputs_for_default_and_seq2seq_lm(
tokenizer, batch_size=batch_size, seq_length=seq_length, is_pair=is_pair
)
elif self.task == "causal-lm":
common_inputs = self._generate_dummy_inputs_for_causal_lm(
tokenizer, batch_size=batch_size, seq_length=seq_length, is_pair=is_pair
)
else:
common_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size=batch_size, seq_length=seq_length, is_pair=is_pair
)
return common_inputs
def _flatten_past_key_values_(self, flattened_output, name, idx, t):
if self.task in ["default", "seq2seq-lm"]:
flattened_output = super()._flatten_past_key_values_(flattened_output, name, idx, t)
else:
flattened_output = super(OnnxSeq2SeqConfigWithPast, self)._flatten_past_key_values_(
flattened_output, name, idx, t
)
__all__ = ["BlenderbotSmallConfig", "BlenderbotSmallOnnxConfig"]

View File

@@ -14,8 +14,19 @@
# limitations under the License.
"""Bloom configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from typing import TYPE_CHECKING, Any, Optional
from packaging import version
if TYPE_CHECKING:
from ... import PreTrainedTokenizer
from ...configuration_utils import PreTrainedConfig
from ...utils import logging
from ...onnx import OnnxConfigWithPast, PatchingSpec
from ...utils import is_torch_available, logging
logger = logging.get_logger(__name__)
@@ -131,4 +142,99 @@ class BloomConfig(PreTrainedConfig):
super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
__all__ = ["BloomConfig"]
class BloomOnnxConfig(OnnxConfigWithPast):
torch_onnx_minimum_version = version.parse("1.12")
def __init__(
self,
config: PreTrainedConfig,
task: str = "default",
patching_specs: Optional[list[PatchingSpec]] = None,
use_past: bool = False,
):
super().__init__(config, task=task, patching_specs=patching_specs, use_past=use_past)
if not getattr(self._config, "pad_token_id", None):
# TODO: how to do that better?
self._config.pad_token_id = 0
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
common_inputs = OrderedDict({"input_ids": {0: "batch", 1: "sequence"}})
if self.use_past:
# BLOOM stores values on dynamic axis 2. For more details see: https://github.com/huggingface/transformers/pull/18344
self.fill_with_past_key_values_(common_inputs, direction="inputs", inverted_values_shape=True)
common_inputs["attention_mask"] = {0: "batch", 1: "past_sequence + sequence"}
else:
common_inputs["attention_mask"] = {0: "batch", 1: "sequence"}
return common_inputs
@property
def num_layers(self) -> int:
return self._config.n_layer
@property
def num_attention_heads(self) -> int:
return self._config.n_head
@property
def atol_for_validation(self) -> float:
return 1e-3
def generate_dummy_inputs(
self,
tokenizer: "PreTrainedTokenizer",
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
common_inputs = super(OnnxConfigWithPast, self).generate_dummy_inputs(
tokenizer,
batch_size=batch_size,
seq_length=seq_length,
is_pair=is_pair,
)
        # We need to order the inputs in the way they appear in the model's forward()
ordered_inputs = OrderedDict({"input_ids": common_inputs["input_ids"]})
        # We also need to add the past key values
if self.use_past:
if not is_torch_available():
raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
else:
import torch
batch, seqlen = common_inputs["input_ids"].shape
# Not using the same length for past_key_values
past_key_values_length = seqlen + 2
head_dim = self._config.hidden_size // self.num_attention_heads
past_key_shape = (
batch * self.num_attention_heads,
head_dim,
past_key_values_length,
)
past_value_shape = (
batch * self.num_attention_heads,
past_key_values_length,
head_dim,
)
ordered_inputs["past_key_values"] = [
(torch.zeros(past_key_shape), torch.zeros(past_value_shape)) for _ in range(self.num_layers)
]
ordered_inputs["attention_mask"] = common_inputs["attention_mask"]
if self.use_past:
mask_dtype = ordered_inputs["attention_mask"].dtype
ordered_inputs["attention_mask"] = torch.cat(
[ordered_inputs["attention_mask"], torch.ones(batch, past_key_values_length, dtype=mask_dtype)], dim=1
)
return ordered_inputs
@property
def default_onnx_opset(self) -> int:
return 13
__all__ = ["BloomConfig", "BloomOnnxConfig"]

View File

@ -15,7 +15,11 @@
# limitations under the License.
"""CamemBERT configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -125,4 +129,19 @@ class CamembertConfig(PreTrainedConfig):
self.classifier_dropout = classifier_dropout
__all__ = ["CamembertConfig"]
class CamembertOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
]
)
__all__ = ["CamembertConfig", "CamembertOnnxConfig"]

View File

@ -14,7 +14,16 @@
# limitations under the License.
"""Chinese-CLIP model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from typing import TYPE_CHECKING, Any
if TYPE_CHECKING:
from ...processing_utils import ProcessorMixin
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -359,4 +368,52 @@ class ChineseCLIPConfig(PreTrainedConfig):
super().__init__(**kwargs)
__all__ = ["ChineseCLIPConfig", "ChineseCLIPTextConfig", "ChineseCLIPVisionConfig"]
class ChineseCLIPOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("input_ids", {0: "batch", 1: "sequence"}),
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
("attention_mask", {0: "batch", 1: "sequence"}),
]
)
@property
def outputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("logits_per_image", {0: "batch"}),
("logits_per_text", {0: "batch"}),
("text_embeds", {0: "batch"}),
("image_embeds", {0: "batch"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-4
def generate_dummy_inputs(
self,
processor: "ProcessorMixin",
batch_size: int = -1,
seq_length: int = -1,
) -> Mapping[str, Any]:
text_input_dict = super().generate_dummy_inputs(
processor.tokenizer,
batch_size=batch_size,
seq_length=seq_length,
)
image_input_dict = super().generate_dummy_inputs(
processor.image_processor,
batch_size=batch_size,
)
return {**text_input_dict, **image_input_dict}
@property
def default_onnx_opset(self) -> int:
return 14
__all__ = ["ChineseCLIPConfig", "ChineseCLIPOnnxConfig", "ChineseCLIPTextConfig", "ChineseCLIPVisionConfig"]

View File

@ -14,7 +14,16 @@
# limitations under the License.
"""CLIP model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from typing import TYPE_CHECKING, Any
if TYPE_CHECKING:
from ...processing_utils import ProcessorMixin
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -355,4 +364,52 @@ class CLIPConfig(PreTrainedConfig):
super().__init__(**kwargs)
__all__ = ["CLIPConfig", "CLIPTextConfig", "CLIPVisionConfig"]
class CLIPOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("input_ids", {0: "batch", 1: "sequence"}),
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
("attention_mask", {0: "batch", 1: "sequence"}),
]
)
@property
def outputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("logits_per_image", {0: "batch"}),
("logits_per_text", {0: "batch"}),
("text_embeds", {0: "batch"}),
("image_embeds", {0: "batch"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-4
def generate_dummy_inputs(
self,
processor: "ProcessorMixin",
batch_size: int = -1,
seq_length: int = -1,
) -> Mapping[str, Any]:
text_input_dict = super().generate_dummy_inputs(
processor.tokenizer,
batch_size=batch_size,
seq_length=seq_length,
)
image_input_dict = super().generate_dummy_inputs(
processor.image_processor,
batch_size=batch_size,
)
return {**text_input_dict, **image_input_dict}
@property
def default_onnx_opset(self) -> int:
return 14
__all__ = ["CLIPConfig", "CLIPOnnxConfig", "CLIPTextConfig", "CLIPVisionConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""CodeGen model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from typing import Any, Optional
from ... import PreTrainedTokenizer, is_torch_available
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfigWithPast, PatchingSpec
from ...utils import logging
@ -140,4 +146,85 @@ class CodeGenConfig(PreTrainedConfig):
)
__all__ = ["CodeGenConfig"]
# Copied from transformers.models.gpt2.configuration_gpt2.GPT2OnnxConfig with GPT2->CodeGen
class CodeGenOnnxConfig(OnnxConfigWithPast):
def __init__(
self,
config: PreTrainedConfig,
task: str = "default",
patching_specs: Optional[list[PatchingSpec]] = None,
use_past: bool = False,
):
super().__init__(config, task=task, patching_specs=patching_specs, use_past=use_past)
if not getattr(self._config, "pad_token_id", None):
# TODO: how to do that better?
self._config.pad_token_id = 0
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
common_inputs = OrderedDict({"input_ids": {0: "batch", 1: "sequence"}})
if self.use_past:
self.fill_with_past_key_values_(common_inputs, direction="inputs")
common_inputs["attention_mask"] = {0: "batch", 1: "past_sequence + sequence"}
else:
common_inputs["attention_mask"] = {0: "batch", 1: "sequence"}
return common_inputs
@property
def num_layers(self) -> int:
return self._config.n_layer
@property
def num_attention_heads(self) -> int:
return self._config.n_head
def generate_dummy_inputs(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
common_inputs = super(OnnxConfigWithPast, self).generate_dummy_inputs(
tokenizer, batch_size=batch_size, seq_length=seq_length, is_pair=is_pair
)
        # We need to order the inputs in the way they appear in the model's forward()
ordered_inputs = OrderedDict({"input_ids": common_inputs["input_ids"]})
        # We also need to add the past key values
if self.use_past:
if not is_torch_available():
raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
else:
import torch
batch, seqlen = common_inputs["input_ids"].shape
# Not using the same length for past_key_values
past_key_values_length = seqlen + 2
past_shape = (
batch,
self.num_attention_heads,
past_key_values_length,
self._config.hidden_size // self.num_attention_heads,
)
ordered_inputs["past_key_values"] = [
(torch.zeros(past_shape), torch.zeros(past_shape)) for _ in range(self.num_layers)
]
ordered_inputs["attention_mask"] = common_inputs["attention_mask"]
if self.use_past:
mask_dtype = ordered_inputs["attention_mask"].dtype
ordered_inputs["attention_mask"] = torch.cat(
[ordered_inputs["attention_mask"], torch.ones(batch, past_key_values_length, dtype=mask_dtype)], dim=1
)
return ordered_inputs
@property
def default_onnx_opset(self) -> int:
return 13
__all__ = ["CodeGenConfig", "CodeGenOnnxConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""Conditional DETR model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
from ...utils.backbone_utils import verify_backbone_config_arguments
from ..auto import CONFIG_MAPPING, AutoConfig
@ -241,4 +247,25 @@ class ConditionalDetrConfig(PreTrainedConfig):
super().__init__(is_encoder_decoder=is_encoder_decoder, **kwargs)
__all__ = ["ConditionalDetrConfig"]
class ConditionalDetrOnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
("pixel_mask", {0: "batch"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-5
@property
def default_onnx_opset(self) -> int:
return 12
__all__ = ["ConditionalDetrConfig", "ConditionalDetrOnnxConfig"]

View File

@ -14,7 +14,11 @@
# limitations under the License.
"""ConvBERT model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -136,4 +140,21 @@ class ConvBertConfig(PreTrainedConfig):
self.classifier_dropout = classifier_dropout
__all__ = ["ConvBertConfig"]
# Copied from transformers.models.bert.configuration_bert.BertOnnxConfig
class ConvBertOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
("token_type_ids", dynamic_axis),
]
)
__all__ = ["ConvBertConfig", "ConvBertOnnxConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""ConvNeXT model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
from ...utils.backbone_utils import BackboneConfigMixin, get_aligned_output_features_output_indices
@ -117,4 +123,20 @@ class ConvNextConfig(BackboneConfigMixin, PreTrainedConfig):
)
__all__ = ["ConvNextConfig"]
class ConvNextOnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-5
__all__ = ["ConvNextConfig", "ConvNextOnnxConfig"]

View File

@ -14,7 +14,11 @@
# limitations under the License.
"""Data2VecText configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -124,4 +128,19 @@ class Data2VecTextConfig(PreTrainedConfig):
self.classifier_dropout = classifier_dropout
__all__ = ["Data2VecTextConfig"]
class Data2VecTextOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
]
)
__all__ = ["Data2VecTextConfig", "Data2VecTextOnnxConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""Data2VecVision model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -168,4 +174,21 @@ class Data2VecVisionConfig(PreTrainedConfig):
self.semantic_loss_ignore_index = semantic_loss_ignore_index
__all__ = ["Data2VecVisionConfig"]
# Copied from transformers.models.vit.configuration_vit.ViTOnnxConfig
class Data2VecVisionOnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-4
__all__ = ["Data2VecVisionConfig", "Data2VecVisionOnnxConfig"]

View File

@ -14,10 +14,19 @@
# limitations under the License.
"""DeBERTa model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from typing import TYPE_CHECKING, Any, Union
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
if TYPE_CHECKING:
from ... import FeatureExtractionMixin, PreTrainedTokenizerBase
logger = logging.get_logger(__name__)
@ -150,4 +159,41 @@ class DebertaConfig(PreTrainedConfig):
self.legacy = legacy
__all__ = ["DebertaConfig"]
# Copied from transformers.models.deberta_v2.configuration_deberta_v2.DebertaV2OnnxConfig
class DebertaOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
if self._config.type_vocab_size > 0:
return OrderedDict(
[("input_ids", dynamic_axis), ("attention_mask", dynamic_axis), ("token_type_ids", dynamic_axis)]
)
else:
return OrderedDict([("input_ids", dynamic_axis), ("attention_mask", dynamic_axis)])
@property
def default_onnx_opset(self) -> int:
return 12
def generate_dummy_inputs(
self,
preprocessor: Union["PreTrainedTokenizerBase", "FeatureExtractionMixin"],
batch_size: int = -1,
seq_length: int = -1,
num_choices: int = -1,
is_pair: bool = False,
num_channels: int = 3,
image_width: int = 40,
image_height: int = 40,
tokenizer: "PreTrainedTokenizerBase" = None,
) -> Mapping[str, Any]:
dummy_inputs = super().generate_dummy_inputs(preprocessor=preprocessor)
if self._config.type_vocab_size == 0 and "token_type_ids" in dummy_inputs:
del dummy_inputs["token_type_ids"]
return dummy_inputs
__all__ = ["DebertaConfig", "DebertaOnnxConfig"]

View File

@ -14,10 +14,19 @@
# limitations under the License.
"""DeBERTa-v2 model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from typing import TYPE_CHECKING, Any, Union
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
if TYPE_CHECKING:
from ... import FeatureExtractionMixin, PreTrainedTokenizerBase
logger = logging.get_logger(__name__)
@ -150,4 +159,40 @@ class DebertaV2Config(PreTrainedConfig):
self.legacy = legacy
__all__ = ["DebertaV2Config"]
class DebertaV2OnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
if self._config.type_vocab_size > 0:
return OrderedDict(
[("input_ids", dynamic_axis), ("attention_mask", dynamic_axis), ("token_type_ids", dynamic_axis)]
)
else:
return OrderedDict([("input_ids", dynamic_axis), ("attention_mask", dynamic_axis)])
@property
def default_onnx_opset(self) -> int:
return 12
def generate_dummy_inputs(
self,
preprocessor: Union["PreTrainedTokenizerBase", "FeatureExtractionMixin"],
batch_size: int = -1,
seq_length: int = -1,
num_choices: int = -1,
is_pair: bool = False,
num_channels: int = 3,
image_width: int = 40,
image_height: int = 40,
tokenizer: "PreTrainedTokenizerBase" = None,
) -> Mapping[str, Any]:
dummy_inputs = super().generate_dummy_inputs(preprocessor=preprocessor)
if self._config.type_vocab_size == 0 and "token_type_ids" in dummy_inputs:
del dummy_inputs["token_type_ids"]
return dummy_inputs
__all__ = ["DebertaV2Config", "DebertaV2OnnxConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""DeiT model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -125,4 +131,20 @@ class DeiTConfig(PreTrainedConfig):
self.pooler_act = pooler_act
__all__ = ["DeiTConfig"]
class DeiTOnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-4
__all__ = ["DeiTConfig", "DeiTOnnxConfig"]

View File

@ -14,7 +14,11 @@
# limitations under the License.
"""MEGA configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ....configuration_utils import PreTrainedConfig
from ....onnx import OnnxConfig
from ....utils import logging
@ -221,4 +225,19 @@ class MegaConfig(PreTrainedConfig):
self.num_attention_heads = 1 # not used but required by Hugging Face
__all__ = ["MegaConfig"]
class MegaOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
]
)
__all__ = ["MegaConfig", "MegaOnnxConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""DETR model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
from ...utils.backbone_utils import verify_backbone_config_arguments
from ..auto import CONFIG_MAPPING, AutoConfig
@ -240,4 +246,25 @@ class DetrConfig(PreTrainedConfig):
super().__init__(is_encoder_decoder=is_encoder_decoder, **kwargs)
__all__ = ["DetrConfig"]
class DetrOnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
("pixel_mask", {0: "batch"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-5
@property
def default_onnx_opset(self) -> int:
return 12
__all__ = ["DetrConfig", "DetrOnnxConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""DINOv2 model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
from ...utils.backbone_utils import BackboneConfigMixin, get_aligned_output_features_output_indices
@ -154,4 +160,20 @@ class Dinov2Config(BackboneConfigMixin, PreTrainedConfig):
self.use_mask_token = use_mask_token
__all__ = ["Dinov2Config"]
class Dinov2OnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-4
__all__ = ["Dinov2Config", "Dinov2OnnxConfig"]

View File

@ -14,7 +14,11 @@
# limitations under the License.
"""DistilBERT model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -119,4 +123,19 @@ class DistilBertConfig(PreTrainedConfig):
super().__init__(**kwargs, pad_token_id=pad_token_id)
__all__ = ["DistilBertConfig"]
class DistilBertOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
]
)
__all__ = ["DistilBertConfig", "DistilBertOnnxConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""EfficientNet model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -144,4 +150,20 @@ class EfficientNetConfig(PreTrainedConfig):
self.num_hidden_layers = sum(num_block_repeats) * 4
__all__ = ["EfficientNetConfig"]
class EfficientNetOnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-5
__all__ = ["EfficientNetConfig", "EfficientNetOnnxConfig"]

View File

@ -15,7 +15,11 @@
# limitations under the License.
"""ELECTRA model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -156,4 +160,20 @@ class ElectraConfig(PreTrainedConfig):
self.classifier_dropout = classifier_dropout
__all__ = ["ElectraConfig"]
class ElectraOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
("token_type_ids", dynamic_axis),
]
)
__all__ = ["ElectraConfig", "ElectraOnnxConfig"]

View File

@ -15,7 +15,11 @@
# limitations under the License.
"""ERNIE model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -131,4 +135,21 @@ class ErnieConfig(PreTrainedConfig):
self.classifier_dropout = classifier_dropout
__all__ = ["ErnieConfig"]
class ErnieOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
("token_type_ids", dynamic_axis),
("task_type_ids", dynamic_axis),
]
)
__all__ = ["ErnieConfig", "ErnieOnnxConfig"]

View File

@ -14,7 +14,11 @@
# limitations under the License.
"""Flaubert configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -213,4 +217,19 @@ class FlaubertConfig(PreTrainedConfig):
super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, **kwargs)
__all__ = ["FlaubertConfig"]
class FlaubertOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
]
)
__all__ = ["FlaubertConfig", "FlaubertOnnxConfig"]

View File

@ -15,7 +15,13 @@
# limitations under the License.
"""OpenAI GPT-2 configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from typing import Any, Optional
from ... import PreTrainedTokenizer, is_torch_available
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfigWithPast, PatchingSpec
from ...utils import logging
@ -184,4 +190,84 @@ class GPT2Config(PreTrainedConfig):
super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
__all__ = ["GPT2Config"]
class GPT2OnnxConfig(OnnxConfigWithPast):
def __init__(
self,
config: PreTrainedConfig,
task: str = "default",
patching_specs: Optional[list[PatchingSpec]] = None,
use_past: bool = False,
):
super().__init__(config, task=task, patching_specs=patching_specs, use_past=use_past)
if not getattr(self._config, "pad_token_id", None):
# TODO: how to do that better?
self._config.pad_token_id = 0
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
common_inputs = OrderedDict({"input_ids": {0: "batch", 1: "sequence"}})
if self.use_past:
self.fill_with_past_key_values_(common_inputs, direction="inputs")
common_inputs["attention_mask"] = {0: "batch", 1: "past_sequence + sequence"}
else:
common_inputs["attention_mask"] = {0: "batch", 1: "sequence"}
return common_inputs
@property
def num_layers(self) -> int:
return self._config.n_layer
@property
def num_attention_heads(self) -> int:
return self._config.n_head
def generate_dummy_inputs(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
common_inputs = super(OnnxConfigWithPast, self).generate_dummy_inputs(
tokenizer, batch_size=batch_size, seq_length=seq_length, is_pair=is_pair
)
        # We need to order the inputs in the way they appear in the model's forward()
ordered_inputs = OrderedDict({"input_ids": common_inputs["input_ids"]})
        # We also need to add the past key values
if self.use_past:
if not is_torch_available():
raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
else:
import torch
batch, seqlen = common_inputs["input_ids"].shape
# Not using the same length for past_key_values
past_key_values_length = seqlen + 2
past_shape = (
batch,
self.num_attention_heads,
past_key_values_length,
self._config.hidden_size // self.num_attention_heads,
)
ordered_inputs["past_key_values"] = [
(torch.zeros(past_shape), torch.zeros(past_shape)) for _ in range(self.num_layers)
]
ordered_inputs["attention_mask"] = common_inputs["attention_mask"]
if self.use_past:
mask_dtype = ordered_inputs["attention_mask"].dtype
ordered_inputs["attention_mask"] = torch.cat(
[ordered_inputs["attention_mask"], torch.ones(batch, past_key_values_length, dtype=mask_dtype)], dim=1
)
return ordered_inputs
@property
def default_onnx_opset(self) -> int:
return 13
__all__ = ["GPT2Config", "GPT2OnnxConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""GPT Neo model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from typing import Any
from ... import PreTrainedTokenizer, is_torch_available
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfigWithPast
from ...utils import logging
@ -199,4 +205,71 @@ def custom_get_block_length_and_num_blocks(seq_length, window_size):
return largest_divisor, torch.div(seq_length, largest_divisor, rounding_mode="floor")
__all__ = ["GPTNeoConfig"]
class GPTNeoOnnxConfig(OnnxConfigWithPast):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
common_inputs = OrderedDict({"input_ids": {0: "batch", 1: "sequence"}})
if self.use_past:
self.fill_with_past_key_values_(common_inputs, direction="inputs")
common_inputs["attention_mask"] = {0: "batch", 1: "past_sequence + sequence"}
else:
common_inputs["attention_mask"] = {0: "batch", 1: "sequence"}
return common_inputs
@property
def num_attention_heads(self) -> int:
return self._config.num_heads
def generate_dummy_inputs(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
common_inputs = super(OnnxConfigWithPast, self).generate_dummy_inputs(
tokenizer,
batch_size=batch_size,
seq_length=seq_length,
is_pair=is_pair,
)
        # We need to order the inputs in the way they appear in the model's forward()
ordered_inputs = OrderedDict({"input_ids": common_inputs["input_ids"]})
        # We also need to add the past key values
if self.use_past:
if not is_torch_available():
raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
else:
import torch
batch, seqlen = common_inputs["input_ids"].shape
# Not using the same length for past_key_values
past_key_values_length = seqlen + 2
past_shape = (
batch,
self.num_attention_heads,
past_key_values_length,
self._config.hidden_size // self.num_attention_heads,
)
ordered_inputs["past_key_values"] = [
(torch.zeros(past_shape), torch.zeros(past_shape)) for _ in range(self.num_layers)
]
ordered_inputs["attention_mask"] = common_inputs["attention_mask"]
if self.use_past:
mask_dtype = ordered_inputs["attention_mask"].dtype
ordered_inputs["attention_mask"] = torch.cat(
[ordered_inputs["attention_mask"], torch.ones(batch, past_key_values_length, dtype=mask_dtype)], dim=1
)
return ordered_inputs
@property
def default_onnx_opset(self) -> int:
return 13
__all__ = ["GPTNeoConfig", "GPTNeoOnnxConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""GPT-J model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from typing import Any, Optional
from ... import PreTrainedTokenizer, is_torch_available
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfigWithPast, PatchingSpec
from ...utils import logging
@ -129,4 +135,85 @@ class GPTJConfig(PreTrainedConfig):
)
__all__ = ["GPTJConfig"]
# Copied from transformers.models.gpt2.configuration_gpt2.GPT2OnnxConfig
class GPTJOnnxConfig(OnnxConfigWithPast):
def __init__(
self,
config: PreTrainedConfig,
task: str = "default",
patching_specs: Optional[list[PatchingSpec]] = None,
use_past: bool = False,
):
super().__init__(config, task=task, patching_specs=patching_specs, use_past=use_past)
if not getattr(self._config, "pad_token_id", None):
# TODO: how to do that better?
self._config.pad_token_id = 0
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
common_inputs = OrderedDict({"input_ids": {0: "batch", 1: "sequence"}})
if self.use_past:
self.fill_with_past_key_values_(common_inputs, direction="inputs")
common_inputs["attention_mask"] = {0: "batch", 1: "past_sequence + sequence"}
else:
common_inputs["attention_mask"] = {0: "batch", 1: "sequence"}
return common_inputs
@property
def num_layers(self) -> int:
return self._config.n_layer
@property
def num_attention_heads(self) -> int:
return self._config.n_head
def generate_dummy_inputs(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
common_inputs = super(OnnxConfigWithPast, self).generate_dummy_inputs(
tokenizer, batch_size=batch_size, seq_length=seq_length, is_pair=is_pair
)
        # We need to order the inputs in the way they appear in the model's forward()
ordered_inputs = OrderedDict({"input_ids": common_inputs["input_ids"]})
        # We also need to add the past key values
if self.use_past:
if not is_torch_available():
raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
else:
import torch
batch, seqlen = common_inputs["input_ids"].shape
# Not using the same length for past_key_values
past_key_values_length = seqlen + 2
past_shape = (
batch,
self.num_attention_heads,
past_key_values_length,
self._config.hidden_size // self.num_attention_heads,
)
ordered_inputs["past_key_values"] = [
(torch.zeros(past_shape), torch.zeros(past_shape)) for _ in range(self.num_layers)
]
ordered_inputs["attention_mask"] = common_inputs["attention_mask"]
if self.use_past:
mask_dtype = ordered_inputs["attention_mask"].dtype
ordered_inputs["attention_mask"] = torch.cat(
[ordered_inputs["attention_mask"], torch.ones(batch, past_key_values_length, dtype=mask_dtype)], dim=1
)
return ordered_inputs
@property
def default_onnx_opset(self) -> int:
return 13
__all__ = ["GPTJConfig", "GPTJOnnxConfig"]

View File

@ -14,10 +14,19 @@
# limitations under the License.
"""GroupViT model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from typing import TYPE_CHECKING, Any
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
if TYPE_CHECKING:
from ...processing_utils import ProcessorMixin
logger = logging.get_logger(__name__)
@ -351,4 +360,52 @@ class GroupViTConfig(PreTrainedConfig):
super().__init__(**kwargs)
__all__ = ["GroupViTConfig", "GroupViTTextConfig", "GroupViTVisionConfig"]
class GroupViTOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("input_ids", {0: "batch", 1: "sequence"}),
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
("attention_mask", {0: "batch", 1: "sequence"}),
]
)
@property
def outputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("logits_per_image", {0: "batch"}),
("logits_per_text", {0: "batch"}),
("text_embeds", {0: "batch"}),
("image_embeds", {0: "batch"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-4
def generate_dummy_inputs(
self,
processor: "ProcessorMixin",
batch_size: int = -1,
seq_length: int = -1,
) -> Mapping[str, Any]:
text_input_dict = super().generate_dummy_inputs(
processor.tokenizer,
batch_size=batch_size,
seq_length=seq_length,
)
image_input_dict = super().generate_dummy_inputs(
processor.image_processor,
batch_size=batch_size,
)
return {**text_input_dict, **image_input_dict}
@property
def default_onnx_opset(self) -> int:
return 14
__all__ = ["GroupViTConfig", "GroupViTOnnxConfig", "GroupViTTextConfig", "GroupViTVisionConfig"]

View File

@ -16,7 +16,11 @@
# limitations under the License.
"""I-BERT configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -112,4 +116,19 @@ class IBertConfig(PreTrainedConfig):
self.force_dequant = force_dequant
__all__ = ["IBertConfig"]
class IBertOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
]
)
__all__ = ["IBertConfig", "IBertOnnxConfig"]

View File

@ -14,10 +14,18 @@
# limitations under the License.
"""OpenAI ImageGPT configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from typing import TYPE_CHECKING, Any
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
if TYPE_CHECKING:
from ... import FeatureExtractionMixin
logger = logging.get_logger(__name__)
@ -136,4 +144,54 @@ class ImageGPTConfig(PreTrainedConfig):
super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
__all__ = ["ImageGPTConfig"]
class ImageGPTOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("input_ids", {0: "batch", 1: "sequence"}),
]
)
def generate_dummy_inputs(
self,
preprocessor: "FeatureExtractionMixin",
batch_size: int = 1,
seq_length: int = -1,
is_pair: bool = False,
num_channels: int = 3,
image_width: int = 32,
image_height: int = 32,
) -> Mapping[str, Any]:
"""
Generate inputs to provide to the ONNX exporter.
Args:
preprocessor ([`PreTrainedTokenizerBase`] or [`FeatureExtractionMixin`]):
The preprocessor associated with this model configuration.
batch_size (`int`, *optional*, defaults to -1):
The batch size to export the model for (-1 means dynamic axis).
num_choices (`int`, *optional*, defaults to -1):
The number of candidate answers provided for multiple choice task (-1 means dynamic axis).
seq_length (`int`, *optional*, defaults to -1):
The sequence length to export the model for (-1 means dynamic axis).
is_pair (`bool`, *optional*, defaults to `False`):
Indicate if the input is a pair (sentence 1, sentence 2)
num_channels (`int`, *optional*, defaults to 3):
The number of channels of the generated images.
image_width (`int`, *optional*, defaults to 40):
The width of the generated images.
image_height (`int`, *optional*, defaults to 40):
The height of the generated images.
Returns:
Mapping[str, Tensor] holding the kwargs to provide to the model's forward function
"""
input_image = self._generate_dummy_images(batch_size, num_channels, image_height, image_width)
inputs = dict(preprocessor(images=input_image, return_tensors="pt"))
return inputs
__all__ = ["ImageGPTConfig", "ImageGPTOnnxConfig"]

View File

@ -14,8 +14,13 @@
# limitations under the License.
"""LayoutLM model configuration"""
from ... import PreTrainedConfig
from ...utils import logging
from collections import OrderedDict
from collections.abc import Mapping
from typing import Any, Optional
from ... import PreTrainedConfig, PreTrainedTokenizer
from ...onnx import OnnxConfig, PatchingSpec
from ...utils import is_torch_available, logging
logger = logging.get_logger(__name__)
@ -122,4 +127,64 @@ class LayoutLMConfig(PreTrainedConfig):
self.max_2d_position_embeddings = max_2d_position_embeddings
__all__ = ["LayoutLMConfig"]
class LayoutLMOnnxConfig(OnnxConfig):
def __init__(
self,
config: PreTrainedConfig,
task: str = "default",
patching_specs: Optional[list[PatchingSpec]] = None,
):
super().__init__(config, task=task, patching_specs=patching_specs)
self.max_2d_positions = config.max_2d_position_embeddings - 1
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("input_ids", {0: "batch", 1: "sequence"}),
("bbox", {0: "batch", 1: "sequence"}),
("attention_mask", {0: "batch", 1: "sequence"}),
("token_type_ids", {0: "batch", 1: "sequence"}),
]
)
def generate_dummy_inputs(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
"""
Generate inputs to provide to the ONNX exporter
Args:
tokenizer: The tokenizer associated with this model configuration
batch_size: The batch size (int) to export the model for (-1 means dynamic axis)
seq_length: The sequence length (int) to export the model for (-1 means dynamic axis)
is_pair: Indicate if the input is a pair (sentence 1, sentence 2)
Returns:
Mapping[str, Tensor] holding the kwargs to provide to the model's forward function
"""
input_dict = super().generate_dummy_inputs(
tokenizer,
batch_size=batch_size,
seq_length=seq_length,
is_pair=is_pair,
)
# Generate a dummy bbox
box = [48, 84, 73, 128]
if not is_torch_available():
raise ValueError("Cannot generate dummy inputs without PyTorch installed.")
import torch
batch_size, seq_length = input_dict["input_ids"].shape
input_dict["bbox"] = torch.tensor([*[box] * seq_length]).tile(batch_size, 1, 1)
return input_dict
__all__ = ["LayoutLMConfig", "LayoutLMOnnxConfig"]

View File

@ -14,10 +14,22 @@
# limitations under the License.
"""LayoutLMv3 model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from typing import TYPE_CHECKING, Any
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...onnx.utils import compute_effective_axis_dimension
from ...utils import logging
if TYPE_CHECKING:
from ...processing_utils import ProcessorMixin
logger = logging.get_logger(__name__)
@ -175,4 +187,104 @@ class LayoutLMv3Config(PreTrainedConfig):
self.classifier_dropout = classifier_dropout
__all__ = ["LayoutLMv3Config"]
class LayoutLMv3OnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.12")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
# The order of inputs is different for question answering and sequence classification
if self.task in ["question-answering", "sequence-classification"]:
return OrderedDict(
[
("input_ids", {0: "batch", 1: "sequence"}),
("attention_mask", {0: "batch", 1: "sequence"}),
("bbox", {0: "batch", 1: "sequence"}),
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
]
)
else:
return OrderedDict(
[
("input_ids", {0: "batch", 1: "sequence"}),
("bbox", {0: "batch", 1: "sequence"}),
("attention_mask", {0: "batch", 1: "sequence"}),
("pixel_values", {0: "batch", 1: "num_channels"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-5
@property
def default_onnx_opset(self) -> int:
return 12
def generate_dummy_inputs(
self,
processor: "ProcessorMixin",
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
num_channels: int = 3,
image_width: int = 40,
image_height: int = 40,
) -> Mapping[str, Any]:
"""
Generate inputs to provide to the ONNX exporter
Args:
processor ([`ProcessorMixin`]):
The processor associated with this model configuration.
batch_size (`int`, *optional*, defaults to -1):
The batch size to export the model for (-1 means dynamic axis).
seq_length (`int`, *optional*, defaults to -1):
The sequence length to export the model for (-1 means dynamic axis).
is_pair (`bool`, *optional*, defaults to `False`):
Indicate if the input is a pair (sentence 1, sentence 2).
num_channels (`int`, *optional*, defaults to 3):
The number of channels of the generated images.
image_width (`int`, *optional*, defaults to 40):
The width of the generated images.
image_height (`int`, *optional*, defaults to 40):
The height of the generated images.
Returns:
Mapping[str, Any]: holding the kwargs to provide to the model's forward function
"""
# A dummy image is used so OCR should not be applied
setattr(processor.image_processor, "apply_ocr", False)
# If dynamic axis (-1) we forward with a fixed dimension of 2 samples to avoid optimizations made by ONNX
batch_size = compute_effective_axis_dimension(
batch_size, fixed_dimension=OnnxConfig.default_fixed_batch, num_token_to_add=0
)
# If dynamic axis (-1) we forward with a fixed dimension of 8 tokens to avoid optimizations made by ONNX
token_to_add = processor.tokenizer.num_special_tokens_to_add(is_pair)
seq_length = compute_effective_axis_dimension(
seq_length, fixed_dimension=OnnxConfig.default_fixed_sequence, num_token_to_add=token_to_add
)
# Generate dummy inputs according to compute batch and sequence
dummy_text = [[" ".join([processor.tokenizer.unk_token]) * seq_length]] * batch_size
# Generate dummy bounding boxes
dummy_bboxes = [[[48, 84, 73, 128]]] * batch_size
# If dynamic axis (-1) we forward with a fixed dimension of 2 samples to avoid optimizations made by ONNX
# batch_size = compute_effective_axis_dimension(batch_size, fixed_dimension=OnnxConfig.default_fixed_batch)
dummy_image = self._generate_dummy_images(batch_size, num_channels, image_height, image_width)
inputs = dict(
processor(
dummy_image,
text=dummy_text,
boxes=dummy_bboxes,
return_tensors="pt",
)
)
return inputs
__all__ = ["LayoutLMv3Config", "LayoutLMv3OnnxConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""LeViT model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -118,4 +124,21 @@ class LevitConfig(PreTrainedConfig):
]
__all__ = ["LevitConfig"]
# Copied from transformers.models.vit.configuration_vit.ViTOnnxConfig
class LevitOnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-4
__all__ = ["LevitConfig", "LevitOnnxConfig"]

View File

@ -14,12 +14,20 @@
# limitations under the License.
"""Longformer configuration"""
from typing import Union
from collections import OrderedDict
from collections.abc import Mapping
from typing import TYPE_CHECKING, Any, Optional, Union
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
if TYPE_CHECKING:
from ...onnx.config import PatchingSpec
from ...tokenization_utils_base import PreTrainedTokenizerBase
logger = logging.get_logger(__name__)
@ -131,4 +139,71 @@ class LongformerConfig(PreTrainedConfig):
self.onnx_export = onnx_export
__all__ = ["LongformerConfig"]
class LongformerOnnxConfig(OnnxConfig):
def __init__(
self, config: "PreTrainedConfig", task: str = "default", patching_specs: "Optional[list[PatchingSpec]]" = None
):
super().__init__(config, task, patching_specs)
config.onnx_export = True
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
("global_attention_mask", dynamic_axis),
]
)
@property
def outputs(self) -> Mapping[str, Mapping[int, str]]:
outputs = super().outputs
if self.task == "default":
outputs["pooler_output"] = {0: "batch"}
return outputs
@property
def atol_for_validation(self) -> float:
"""
What absolute tolerance value to use during model conversion validation.
Returns:
Float absolute tolerance value.
"""
return 1e-4
@property
def default_onnx_opset(self) -> int:
# needs to be >= 14 to support tril operator
return max(super().default_onnx_opset, 14)
def generate_dummy_inputs(
self,
tokenizer: "PreTrainedTokenizerBase",
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
inputs = super().generate_dummy_inputs(
preprocessor=tokenizer,
batch_size=batch_size,
seq_length=seq_length,
is_pair=is_pair,
)
import torch
        # for some reason, replacing this code with inputs["global_attention_mask"] = torch.randint(2, inputs["input_ids"].shape, dtype=torch.int64)
# makes the export fail randomly
inputs["global_attention_mask"] = torch.zeros_like(inputs["input_ids"])
# make every second token global
inputs["global_attention_mask"][:, ::2] = 1
return inputs
__all__ = ["LongformerConfig", "LongformerOnnxConfig"]

View File

@ -14,7 +14,10 @@
# limitations under the License.
"""LongT5 model configuration"""
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxSeq2SeqConfigWithPast
from ...utils import logging
@ -149,4 +152,29 @@ class LongT5Config(PreTrainedConfig):
)
__all__ = ["LongT5Config"]
class LongT5OnnxConfig(OnnxSeq2SeqConfigWithPast):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
common_inputs = {
"input_ids": {0: "batch", 1: "encoder_sequence"},
"attention_mask": {0: "batch", 1: "encoder_sequence"},
}
if self.use_past:
common_inputs["attention_mask"][1] = "past_encoder_sequence + sequence"
common_inputs["decoder_input_ids"] = {0: "batch"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "past_decoder_sequence + sequence"}
else:
common_inputs["decoder_input_ids"] = {0: "batch", 1: "decoder_sequence"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "decoder_sequence"}
if self.use_past:
self.fill_with_past_key_values_(common_inputs, direction="inputs")
return common_inputs
@property
def default_onnx_opset(self) -> int:
return 13
__all__ = ["LongT5Config", "LongT5OnnxConfig"]

View File

@ -14,8 +14,15 @@
# limitations under the License.
"""M2M100 model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from typing import Any
from ... import PreTrainedTokenizer
from ...configuration_utils import PreTrainedConfig
from ...utils import logging
from ...onnx import OnnxConfig, OnnxSeq2SeqConfigWithPast
from ...onnx.utils import compute_effective_axis_dimension
from ...utils import is_torch_available, logging
logger = logging.get_logger(__name__)
@ -151,4 +158,125 @@ class M2M100Config(PreTrainedConfig):
)
__all__ = ["M2M100Config"]
class M2M100OnnxConfig(OnnxSeq2SeqConfigWithPast):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
common_inputs = OrderedDict(
[
("input_ids", {0: "batch", 1: "encoder_sequence"}),
("attention_mask", {0: "batch", 1: "encoder_sequence"}),
]
)
if self.use_past:
common_inputs["decoder_input_ids"] = {0: "batch"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "past_decoder_sequence + sequence"}
else:
common_inputs["decoder_input_ids"] = {0: "batch", 1: "decoder_sequence"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "decoder_sequence"}
if self.use_past:
self.fill_with_past_key_values_(common_inputs, direction="inputs")
return common_inputs
# Copied from BartOnnxConfig._generate_dummy_inputs_for_sequence_classification_and_question_answering
# A better name would be _generate_dummy_inputs_for_encoder_and_decoder because sequence classification and question
# answering are not supported for M2M100, but this name is preserved to be able to check that the copy matches what
# was done for BART so that it can be updated if need be.
def _generate_dummy_inputs_for_sequence_classification_and_question_answering(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
# Copied from OnnxConfig.generate_dummy_inputs
# Did not use super(OnnxConfigWithPast, self).generate_dummy_inputs for code clarity.
# If dynamic axis (-1) we forward with a fixed dimension of 2 samples to avoid optimizations made by ONNX
batch_size = compute_effective_axis_dimension(
batch_size, fixed_dimension=OnnxConfig.default_fixed_batch, num_token_to_add=0
)
# If dynamic axis (-1) we forward with a fixed dimension of 8 tokens to avoid optimizations made by ONNX
token_to_add = tokenizer.num_special_tokens_to_add(is_pair)
seq_length = compute_effective_axis_dimension(
seq_length, fixed_dimension=OnnxConfig.default_fixed_sequence, num_token_to_add=token_to_add
)
# Generate dummy inputs according to compute batch and sequence
dummy_input = [" ".join([tokenizer.unk_token]) * seq_length] * batch_size
common_inputs = dict(tokenizer(dummy_input, return_tensors="pt"))
return common_inputs
# Copied from transformers.models.bart.configuration_bart.BartOnnxConfig._generate_dummy_inputs_for_default_and_seq2seq_lm
def _generate_dummy_inputs_for_default_and_seq2seq_lm(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
encoder_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size, seq_length, is_pair
)
# Generate decoder inputs
decoder_seq_length = seq_length if not self.use_past else 1
decoder_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size, decoder_seq_length, is_pair
)
decoder_inputs = {f"decoder_{name}": tensor for name, tensor in decoder_inputs.items()}
common_inputs = dict(**encoder_inputs, **decoder_inputs)
if self.use_past:
if not is_torch_available():
raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
else:
import torch
batch, encoder_seq_length = common_inputs["input_ids"].shape
decoder_seq_length = common_inputs["decoder_input_ids"].shape[1]
num_encoder_attention_heads, num_decoder_attention_heads = self.num_attention_heads
encoder_shape = (
batch,
num_encoder_attention_heads,
encoder_seq_length,
self._config.hidden_size // num_encoder_attention_heads,
)
decoder_past_length = decoder_seq_length + 3
decoder_shape = (
batch,
num_decoder_attention_heads,
decoder_past_length,
self._config.hidden_size // num_decoder_attention_heads,
)
common_inputs["decoder_attention_mask"] = torch.cat(
[common_inputs["decoder_attention_mask"], torch.ones(batch, decoder_past_length)], dim=1
)
common_inputs["past_key_values"] = []
            # If both the encoder and decoder layer counts are present in the model configuration, both are considered
num_encoder_layers, num_decoder_layers = self.num_layers
min_num_layers = min(num_encoder_layers, num_decoder_layers)
max_num_layers = max(num_encoder_layers, num_decoder_layers) - min_num_layers
remaining_side_name = "encoder" if num_encoder_layers > num_decoder_layers else "decoder"
for _ in range(min_num_layers):
common_inputs["past_key_values"].append(
(
torch.zeros(decoder_shape),
torch.zeros(decoder_shape),
torch.zeros(encoder_shape),
torch.zeros(encoder_shape),
)
)
# TODO: test this.
shape = encoder_shape if remaining_side_name == "encoder" else decoder_shape
for _ in range(min_num_layers, max_num_layers):
common_inputs["past_key_values"].append((torch.zeros(shape), torch.zeros(shape)))
return common_inputs
generate_dummy_inputs = _generate_dummy_inputs_for_default_and_seq2seq_lm
__all__ = ["M2M100Config", "M2M100OnnxConfig"]

View File

@ -14,8 +14,15 @@
# limitations under the License.
"""Marian model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from typing import Any
from ... import PreTrainedTokenizer
from ...configuration_utils import PreTrainedConfig
from ...utils import logging
from ...onnx import OnnxConfig, OnnxConfigWithPast, OnnxSeq2SeqConfigWithPast
from ...onnx.utils import compute_effective_axis_dimension
from ...utils import is_torch_available, logging
logger = logging.get_logger(__name__)
@ -157,4 +164,243 @@ class MarianConfig(PreTrainedConfig):
)
__all__ = ["MarianConfig"]
class MarianOnnxConfig(OnnxSeq2SeqConfigWithPast):
@property
# Copied from transformers.models.bart.configuration_bart.BartOnnxConfig.inputs
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task in ["default", "seq2seq-lm"]:
common_inputs = OrderedDict(
[
("input_ids", {0: "batch", 1: "encoder_sequence"}),
("attention_mask", {0: "batch", 1: "encoder_sequence"}),
]
)
if self.use_past:
common_inputs["decoder_input_ids"] = {0: "batch"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "past_decoder_sequence + sequence"}
else:
common_inputs["decoder_input_ids"] = {0: "batch", 1: "decoder_sequence"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "decoder_sequence"}
if self.use_past:
self.fill_with_past_key_values_(common_inputs, direction="inputs")
elif self.task == "causal-lm":
# TODO: figure this case out.
common_inputs = OrderedDict(
[
("input_ids", {0: "batch", 1: "encoder_sequence"}),
("attention_mask", {0: "batch", 1: "encoder_sequence"}),
]
)
if self.use_past:
num_encoder_layers, _ = self.num_layers
for i in range(num_encoder_layers):
common_inputs[f"past_key_values.{i}.key"] = {0: "batch", 2: "past_sequence + sequence"}
common_inputs[f"past_key_values.{i}.value"] = {0: "batch", 2: "past_sequence + sequence"}
else:
common_inputs = OrderedDict(
[
("input_ids", {0: "batch", 1: "encoder_sequence"}),
("attention_mask", {0: "batch", 1: "encoder_sequence"}),
("decoder_input_ids", {0: "batch", 1: "decoder_sequence"}),
("decoder_attention_mask", {0: "batch", 1: "decoder_sequence"}),
]
)
return common_inputs
@property
# Copied from transformers.models.bart.configuration_bart.BartOnnxConfig.outputs
def outputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task in ["default", "seq2seq-lm"]:
common_outputs = super().outputs
else:
common_outputs = super(OnnxConfigWithPast, self).outputs
if self.use_past:
num_encoder_layers, _ = self.num_layers
for i in range(num_encoder_layers):
common_outputs[f"present.{i}.key"] = {0: "batch", 2: "past_sequence + sequence"}
common_outputs[f"present.{i}.value"] = {0: "batch", 2: "past_sequence + sequence"}
return common_outputs
def _generate_dummy_inputs_for_default_and_seq2seq_lm(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
encoder_inputs = self._generate_dummy_inputs_for_encoder_and_decoder(
tokenizer,
batch_size,
seq_length,
is_pair,
)
# Generate decoder inputs
decoder_seq_length = seq_length if not self.use_past else 1
decoder_inputs = self._generate_dummy_inputs_for_encoder_and_decoder(
tokenizer,
batch_size,
decoder_seq_length,
is_pair,
)
decoder_inputs = {f"decoder_{name}": tensor for name, tensor in decoder_inputs.items()}
common_inputs = dict(**encoder_inputs, **decoder_inputs)
if self.use_past:
if not is_torch_available():
raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
else:
import torch
batch, encoder_seq_length = common_inputs["input_ids"].shape
decoder_seq_length = common_inputs["decoder_input_ids"].shape[1]
num_encoder_attention_heads, num_decoder_attention_heads = self.num_attention_heads
encoder_shape = (
batch,
num_encoder_attention_heads,
encoder_seq_length,
self._config.hidden_size // num_encoder_attention_heads,
)
decoder_past_length = decoder_seq_length + 3
decoder_shape = (
batch,
num_decoder_attention_heads,
decoder_past_length,
self._config.hidden_size // num_decoder_attention_heads,
)
common_inputs["decoder_attention_mask"] = torch.cat(
[common_inputs["decoder_attention_mask"], torch.ones(batch, decoder_past_length)], dim=1
)
common_inputs["past_key_values"] = []
# If the numbers of encoder and decoder layers differ, fill the shared layers first and handle the remainder below
num_encoder_layers, num_decoder_layers = self.num_layers
min_num_layers = min(num_encoder_layers, num_decoder_layers)
max_num_layers = max(num_encoder_layers, num_decoder_layers) - min_num_layers
remaining_side_name = "encoder" if num_encoder_layers > num_decoder_layers else "decoder"
for _ in range(min_num_layers):
common_inputs["past_key_values"].append(
(
torch.zeros(decoder_shape),
torch.zeros(decoder_shape),
torch.zeros(encoder_shape),
torch.zeros(encoder_shape),
)
)
# TODO: test this.
shape = encoder_shape if remaining_side_name == "encoder" else decoder_shape
for _ in range(min_num_layers, max_num_layers):
common_inputs["past_key_values"].append((torch.zeros(shape), torch.zeros(shape)))
return common_inputs
def _generate_dummy_inputs_for_causal_lm(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
common_inputs = self._generate_dummy_inputs_for_encoder_and_decoder(
tokenizer,
batch_size,
seq_length,
is_pair,
)
if self.use_past:
if not is_torch_available():
raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
else:
import torch
batch, seqlen = common_inputs["input_ids"].shape
# Not using the same length for past_key_values
past_key_values_length = seqlen + 2
num_encoder_layers, _ = self.num_layers
num_encoder_attention_heads, _ = self.num_attention_heads
past_shape = (
batch,
num_encoder_attention_heads,
past_key_values_length,
self._config.hidden_size // num_encoder_attention_heads,
)
mask_dtype = common_inputs["attention_mask"].dtype
common_inputs["attention_mask"] = torch.cat(
[common_inputs["attention_mask"], torch.ones(batch, past_key_values_length, dtype=mask_dtype)], dim=1
)
common_inputs["past_key_values"] = [
(torch.zeros(past_shape), torch.zeros(past_shape)) for _ in range(num_encoder_layers)
]
return common_inputs
# Copied from BartOnnxConfig._generate_dummy_inputs_for_sequence_classification_and_question_answering
# We renamed this function because Marian models do not have a sequence classification or question answering head
def _generate_dummy_inputs_for_encoder_and_decoder(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
# Copied from OnnxConfig.generate_dummy_inputs
# Did not use super(OnnxConfigWithPast, self).generate_dummy_inputs for code clarity.
# If dynamic axis (-1) we forward with a fixed dimension of 2 samples to avoid optimizations made by ONNX
batch_size = compute_effective_axis_dimension(
batch_size, fixed_dimension=OnnxConfig.default_fixed_batch, num_token_to_add=0
)
# If dynamic axis (-1) we forward with a fixed dimension of 8 tokens to avoid optimizations made by ONNX
token_to_add = tokenizer.num_special_tokens_to_add(is_pair)
seq_length = compute_effective_axis_dimension(
seq_length, fixed_dimension=OnnxConfig.default_fixed_sequence, num_token_to_add=token_to_add
)
# Generate dummy inputs according to the computed batch and sequence
dummy_input = [" ".join([tokenizer.unk_token] * seq_length)] * batch_size
common_inputs = dict(tokenizer(dummy_input, return_tensors="pt"))
return common_inputs
def generate_dummy_inputs(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
if self.task in ["default", "seq2seq-lm"]:
common_inputs = self._generate_dummy_inputs_for_default_and_seq2seq_lm(
tokenizer,
batch_size=batch_size,
seq_length=seq_length,
is_pair=is_pair,
)
else:
common_inputs = self._generate_dummy_inputs_for_causal_lm(
tokenizer,
batch_size=batch_size,
seq_length=seq_length,
is_pair=is_pair,
)
return common_inputs
# Copied from transformers.models.bart.configuration_bart.BartOnnxConfig._flatten_past_key_values_
def _flatten_past_key_values_(self, flattened_output, name, idx, t):
if self.task in ["default", "seq2seq-lm"]:
flattened_output = super()._flatten_past_key_values_(flattened_output, name, idx, t)
else:
flattened_output = super(OnnxSeq2SeqConfigWithPast, self)._flatten_past_key_values_(
flattened_output, name, idx, t
)
@property
def atol_for_validation(self) -> float:
return 1e-4
__all__ = ["MarianConfig", "MarianOnnxConfig"]

View File

@ -14,8 +14,15 @@
# limitations under the License.
"""MBART model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from typing import Any
from ... import PreTrainedTokenizer
from ...configuration_utils import PreTrainedConfig
from ...utils import logging
from ...onnx import OnnxConfig, OnnxConfigWithPast, OnnxSeq2SeqConfigWithPast
from ...onnx.utils import compute_effective_axis_dimension
from ...utils import is_torch_available, logging
logger = logging.get_logger(__name__)
@ -157,4 +164,224 @@ class MBartConfig(PreTrainedConfig):
)
__all__ = ["MBartConfig"]
# Copied from transformers.models.bart.configuration_bart.BartOnnxConfig with Bart->MBart
class MBartOnnxConfig(OnnxSeq2SeqConfigWithPast):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task in ["default", "seq2seq-lm"]:
common_inputs = OrderedDict(
[
("input_ids", {0: "batch", 1: "encoder_sequence"}),
("attention_mask", {0: "batch", 1: "encoder_sequence"}),
]
)
if self.use_past:
common_inputs["decoder_input_ids"] = {0: "batch"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "past_decoder_sequence + sequence"}
else:
common_inputs["decoder_input_ids"] = {0: "batch", 1: "decoder_sequence"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "decoder_sequence"}
if self.use_past:
self.fill_with_past_key_values_(common_inputs, direction="inputs")
elif self.task == "causal-lm":
# TODO: figure this case out.
common_inputs = OrderedDict(
[
("input_ids", {0: "batch", 1: "encoder_sequence"}),
("attention_mask", {0: "batch", 1: "encoder_sequence"}),
]
)
if self.use_past:
num_encoder_layers, _ = self.num_layers
for i in range(num_encoder_layers):
common_inputs[f"past_key_values.{i}.key"] = {0: "batch", 2: "past_sequence + sequence"}
common_inputs[f"past_key_values.{i}.value"] = {0: "batch", 2: "past_sequence + sequence"}
else:
common_inputs = OrderedDict(
[
("input_ids", {0: "batch", 1: "encoder_sequence"}),
("attention_mask", {0: "batch", 1: "encoder_sequence"}),
("decoder_input_ids", {0: "batch", 1: "decoder_sequence"}),
("decoder_attention_mask", {0: "batch", 1: "decoder_sequence"}),
]
)
return common_inputs
@property
def outputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task in ["default", "seq2seq-lm"]:
common_outputs = super().outputs
else:
common_outputs = super(OnnxConfigWithPast, self).outputs
if self.use_past:
num_encoder_layers, _ = self.num_layers
for i in range(num_encoder_layers):
common_outputs[f"present.{i}.key"] = {0: "batch", 2: "past_sequence + sequence"}
common_outputs[f"present.{i}.value"] = {0: "batch", 2: "past_sequence + sequence"}
return common_outputs
def _generate_dummy_inputs_for_default_and_seq2seq_lm(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
encoder_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size, seq_length, is_pair
)
# Generate decoder inputs
decoder_seq_length = seq_length if not self.use_past else 1
decoder_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size, decoder_seq_length, is_pair
)
decoder_inputs = {f"decoder_{name}": tensor for name, tensor in decoder_inputs.items()}
common_inputs = dict(**encoder_inputs, **decoder_inputs)
if self.use_past:
if not is_torch_available():
raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
else:
import torch
batch, encoder_seq_length = common_inputs["input_ids"].shape
decoder_seq_length = common_inputs["decoder_input_ids"].shape[1]
num_encoder_attention_heads, num_decoder_attention_heads = self.num_attention_heads
encoder_shape = (
batch,
num_encoder_attention_heads,
encoder_seq_length,
self._config.hidden_size // num_encoder_attention_heads,
)
decoder_past_length = decoder_seq_length + 3
decoder_shape = (
batch,
num_decoder_attention_heads,
decoder_past_length,
self._config.hidden_size // num_decoder_attention_heads,
)
common_inputs["decoder_attention_mask"] = torch.cat(
[common_inputs["decoder_attention_mask"], torch.ones(batch, decoder_past_length)], dim=1
)
common_inputs["past_key_values"] = []
# If the numbers of encoder and decoder layers differ, fill the shared layers first and handle the remainder below
num_encoder_layers, num_decoder_layers = self.num_layers
min_num_layers = min(num_encoder_layers, num_decoder_layers)
max_num_layers = max(num_encoder_layers, num_decoder_layers) - min_num_layers
remaining_side_name = "encoder" if num_encoder_layers > num_decoder_layers else "decoder"
for _ in range(min_num_layers):
common_inputs["past_key_values"].append(
(
torch.zeros(decoder_shape),
torch.zeros(decoder_shape),
torch.zeros(encoder_shape),
torch.zeros(encoder_shape),
)
)
# TODO: test this.
shape = encoder_shape if remaining_side_name == "encoder" else decoder_shape
for _ in range(min_num_layers, max_num_layers):
common_inputs["past_key_values"].append((torch.zeros(shape), torch.zeros(shape)))
return common_inputs
def _generate_dummy_inputs_for_causal_lm(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
common_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size, seq_length, is_pair
)
if self.use_past:
if not is_torch_available():
raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
else:
import torch
batch, seqlen = common_inputs["input_ids"].shape
# Not using the same length for past_key_values
past_key_values_length = seqlen + 2
num_encoder_layers, _ = self.num_layers
num_encoder_attention_heads, _ = self.num_attention_heads
past_shape = (
batch,
num_encoder_attention_heads,
past_key_values_length,
self._config.hidden_size // num_encoder_attention_heads,
)
mask_dtype = common_inputs["attention_mask"].dtype
common_inputs["attention_mask"] = torch.cat(
[common_inputs["attention_mask"], torch.ones(batch, past_key_values_length, dtype=mask_dtype)], dim=1
)
common_inputs["past_key_values"] = [
(torch.zeros(past_shape), torch.zeros(past_shape)) for _ in range(num_encoder_layers)
]
return common_inputs
def _generate_dummy_inputs_for_sequence_classification_and_question_answering(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
# Copied from OnnxConfig.generate_dummy_inputs
# Did not use super(OnnxConfigWithPast, self).generate_dummy_inputs for code clarity.
# If dynamic axis (-1) we forward with a fixed dimension of 2 samples to avoid optimizations made by ONNX
batch_size = compute_effective_axis_dimension(
batch_size, fixed_dimension=OnnxConfig.default_fixed_batch, num_token_to_add=0
)
# If dynamic axis (-1) we forward with a fixed dimension of 8 tokens to avoid optimizations made by ONNX
token_to_add = tokenizer.num_special_tokens_to_add(is_pair)
seq_length = compute_effective_axis_dimension(
seq_length, fixed_dimension=OnnxConfig.default_fixed_sequence, num_token_to_add=token_to_add
)
# Generate dummy inputs according to the computed batch and sequence
dummy_input = [" ".join([tokenizer.unk_token] * seq_length)] * batch_size
common_inputs = dict(tokenizer(dummy_input, return_tensors="pt"))
return common_inputs
def generate_dummy_inputs(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
if self.task in ["default", "seq2seq-lm"]:
common_inputs = self._generate_dummy_inputs_for_default_and_seq2seq_lm(
tokenizer, batch_size=batch_size, seq_length=seq_length, is_pair=is_pair
)
elif self.task == "causal-lm":
common_inputs = self._generate_dummy_inputs_for_causal_lm(
tokenizer, batch_size=batch_size, seq_length=seq_length, is_pair=is_pair
)
else:
common_inputs = self._generate_dummy_inputs_for_sequence_classification_and_question_answering(
tokenizer, batch_size=batch_size, seq_length=seq_length, is_pair=is_pair
)
return common_inputs
def _flatten_past_key_values_(self, flattened_output, name, idx, t):
if self.task in ["default", "seq2seq-lm"]:
flattened_output = super()._flatten_past_key_values_(flattened_output, name, idx, t)
else:
flattened_output = super(OnnxSeq2SeqConfigWithPast, self)._flatten_past_key_values_(
flattened_output, name, idx, t
)
__all__ = ["MBartConfig", "MBartOnnxConfig"]

View File

@ -4,6 +4,7 @@
# the file from the modular. If any change should be done, please apply the change to the
# modular_metaclip_2.py file directly. One of our CI enforces this.
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
from ...configuration_utils import PreTrainedConfig
from ...utils import logging

View File

@ -14,7 +14,11 @@
# limitations under the License.
"""MobileBERT model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -160,4 +164,21 @@ class MobileBertConfig(PreTrainedConfig):
self.classifier_dropout = classifier_dropout
__all__ = ["MobileBertConfig"]
# Copied from transformers.models.bert.configuration_bert.BertOnnxConfig with Bert->MobileBert
class MobileBertOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
("token_type_ids", dynamic_axis),
]
)
__all__ = ["MobileBertConfig", "MobileBertOnnxConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""MobileNetV1 model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -98,4 +104,23 @@ class MobileNetV1Config(PreTrainedConfig):
self.layer_norm_eps = layer_norm_eps
__all__ = ["MobileNetV1Config"]
class MobileNetV1OnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict([("pixel_values", {0: "batch"})])
@property
def outputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "image-classification":
return OrderedDict([("logits", {0: "batch"})])
else:
return OrderedDict([("last_hidden_state", {0: "batch"}), ("pooler_output", {0: "batch"})])
@property
def atol_for_validation(self) -> float:
return 1e-4
__all__ = ["MobileNetV1Config", "MobileNetV1OnnxConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""MobileNetV2 model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -126,4 +132,23 @@ class MobileNetV2Config(PreTrainedConfig):
self.semantic_loss_ignore_index = semantic_loss_ignore_index
__all__ = ["MobileNetV2Config"]
class MobileNetV2OnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict([("pixel_values", {0: "batch"})])
@property
def outputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "image-classification":
return OrderedDict([("logits", {0: "batch"})])
else:
return OrderedDict([("last_hidden_state", {0: "batch"}), ("pooler_output", {0: "batch"})])
@property
def atol_for_validation(self) -> float:
return 1e-4
__all__ = ["MobileNetV2Config", "MobileNetV2OnnxConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""MobileViT model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -144,4 +150,23 @@ class MobileViTConfig(PreTrainedConfig):
self.semantic_loss_ignore_index = semantic_loss_ignore_index
__all__ = ["MobileViTConfig"]
class MobileViTOnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict([("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"})])
@property
def outputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "image-classification":
return OrderedDict([("logits", {0: "batch"})])
else:
return OrderedDict([("last_hidden_state", {0: "batch"}), ("pooler_output", {0: "batch"})])
@property
def atol_for_validation(self) -> float:
return 1e-4
__all__ = ["MobileViTConfig", "MobileViTOnnxConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""MobileViTV2 model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -140,4 +146,23 @@ class MobileViTV2Config(PreTrainedConfig):
self.semantic_loss_ignore_index = semantic_loss_ignore_index
__all__ = ["MobileViTV2Config"]
class MobileViTV2OnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict([("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"})])
@property
def outputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "image-classification":
return OrderedDict([("logits", {0: "batch"})])
else:
return OrderedDict([("last_hidden_state", {0: "batch"}), ("pooler_output", {0: "batch"})])
@property
def atol_for_validation(self) -> float:
return 1e-4
__all__ = ["MobileViTV2Config", "MobileViTV2OnnxConfig"]

View File

@ -14,7 +14,10 @@
# limitations under the License.
"""mT5 model configuration"""
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxSeq2SeqConfigWithPast
from ...utils import logging
@ -145,4 +148,35 @@ class MT5Config(PreTrainedConfig):
)
__all__ = ["MT5Config"]
class MT5OnnxConfig(OnnxSeq2SeqConfigWithPast):
@property
# Copied from transformers.models.t5.configuration_t5.T5OnnxConfig.inputs
def inputs(self) -> Mapping[str, Mapping[int, str]]:
common_inputs = {
"input_ids": {0: "batch", 1: "encoder_sequence"},
"attention_mask": {0: "batch", 1: "encoder_sequence"},
}
if self.use_past:
common_inputs["attention_mask"][1] = "past_encoder_sequence + sequence"
common_inputs["decoder_input_ids"] = {0: "batch"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "past_decoder_sequence + sequence"}
else:
common_inputs["decoder_input_ids"] = {0: "batch", 1: "decoder_sequence"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "decoder_sequence"}
if self.use_past:
self.fill_with_past_key_values_(common_inputs, direction="inputs")
return common_inputs
@property
# Copied from transformers.models.t5.configuration_t5.T5OnnxConfig.default_onnx_opset
def default_onnx_opset(self) -> int:
return 13
@property
def atol_for_validation(self) -> float:
return 5e-4
__all__ = ["MT5Config", "MT5OnnxConfig"]

View File

@ -14,7 +14,16 @@
# limitations under the License.
"""OWL-ViT model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from typing import TYPE_CHECKING, Any
if TYPE_CHECKING:
from ...processing_utils import ProcessorMixin
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -265,4 +274,52 @@ class OwlViTConfig(PreTrainedConfig):
super().__init__(**kwargs)
__all__ = ["OwlViTConfig", "OwlViTTextConfig", "OwlViTVisionConfig"]
class OwlViTOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("input_ids", {0: "batch", 1: "sequence"}),
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
("attention_mask", {0: "batch", 1: "sequence"}),
]
)
@property
def outputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("logits_per_image", {0: "batch"}),
("logits_per_text", {0: "batch"}),
("text_embeds", {0: "batch"}),
("image_embeds", {0: "batch"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-4
def generate_dummy_inputs(
self,
processor: "ProcessorMixin",
batch_size: int = -1,
seq_length: int = -1,
) -> Mapping[str, Any]:
text_input_dict = super().generate_dummy_inputs(
processor.tokenizer,
batch_size=batch_size,
seq_length=seq_length,
)
image_input_dict = super().generate_dummy_inputs(
processor.image_processor,
batch_size=batch_size,
)
return {**text_input_dict, **image_input_dict}
@property
def default_onnx_opset(self) -> int:
return 14
__all__ = ["OwlViTConfig", "OwlViTOnnxConfig", "OwlViTTextConfig", "OwlViTVisionConfig"]

View File

@ -14,7 +14,15 @@
# limitations under the License.
"""Perceiver model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from typing import Any, Union
from ...configuration_utils import PreTrainedConfig
from ...feature_extraction_utils import FeatureExtractionMixin
from ...onnx import OnnxConfig
from ...onnx.utils import compute_effective_axis_dimension
from ...tokenization_utils_base import PreTrainedTokenizerBase
from ...utils import logging
@ -174,4 +182,63 @@ class PerceiverConfig(PreTrainedConfig):
self._label_trainable_num_channels = _label_trainable_num_channels
__all__ = ["PerceiverConfig"]
class PerceiverOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("inputs", dynamic_axis),
("attention_mask", dynamic_axis),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-4
def generate_dummy_inputs(
self,
preprocessor: Union["PreTrainedTokenizerBase", "FeatureExtractionMixin"],
batch_size: int = -1,
seq_length: int = -1,
num_choices: int = -1,
is_pair: bool = False,
num_channels: int = 3,
image_width: int = 40,
image_height: int = 40,
) -> Mapping[str, Any]:
# copied from `transformers.onnx.config.OnnxConfig` and slightly altered/simplified
if isinstance(preprocessor, PreTrainedTokenizerBase):
# If dynamic axis (-1) we forward with a fixed dimension of 2 samples to avoid optimizations made by ONNX
batch_size = compute_effective_axis_dimension(
batch_size, fixed_dimension=OnnxConfig.default_fixed_batch, num_token_to_add=0
)
# If dynamic axis (-1) we forward with a fixed dimension of 8 tokens to avoid optimizations made by ONNX
token_to_add = preprocessor.num_special_tokens_to_add(is_pair)
seq_length = compute_effective_axis_dimension(
seq_length, fixed_dimension=OnnxConfig.default_fixed_sequence, num_token_to_add=token_to_add
)
# Generate dummy inputs according to the computed batch and sequence
dummy_input = [" ".join(["a"] * seq_length)] * batch_size
inputs = dict(preprocessor(dummy_input, return_tensors="pt"))
inputs["inputs"] = inputs.pop("input_ids")
return inputs
elif isinstance(preprocessor, FeatureExtractionMixin) and preprocessor.model_input_names[0] == "pixel_values":
# If dynamic axis (-1) we forward with a fixed dimension of 2 samples to avoid optimizations made by ONNX
batch_size = compute_effective_axis_dimension(batch_size, fixed_dimension=OnnxConfig.default_fixed_batch)
dummy_input = self._generate_dummy_images(batch_size, num_channels, image_height, image_width)
inputs = dict(preprocessor(images=dummy_input, return_tensors="pt"))
inputs["inputs"] = inputs.pop("pixel_values")
return inputs
else:
raise ValueError(
"Unable to generate dummy inputs for the model. Please provide a tokenizer or a preprocessor."
)
__all__ = ["PerceiverConfig", "PerceiverOnnxConfig"]

View File

@ -14,7 +14,11 @@
# limitations under the License.
"""PLBART model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfigWithPast
from ...utils import logging
@ -161,4 +165,33 @@ class PLBartConfig(PreTrainedConfig):
)
class PLBartOnnxConfig(OnnxConfigWithPast):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("input_ids", {0: "batch", 1: "sequence"}),
("attention_mask", {0: "batch", 1: "sequence"}),
]
)
@property
def outputs(self) -> Mapping[str, Mapping[int, str]]:
if self.use_past:
return OrderedDict(
[
("last_hidden_state", {0: "batch", 1: "sequence"}),
("past_keys", {0: "batch", 2: "sequence"}),
("encoder_last_hidden_state", {0: "batch", 1: "sequence"}),
]
)
else:
return OrderedDict(
[
("last_hidden_state", {0: "batch", 1: "sequence"}),
("encoder_last_hidden_state", {0: "batch", 1: "sequence"}),
]
)
__all__ = ["PLBartConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""PoolFormer model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -123,4 +129,20 @@ class PoolFormerConfig(PreTrainedConfig):
super().__init__(**kwargs)
__all__ = ["PoolFormerConfig"]
class PoolFormerOnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
]
)
@property
def atol_for_validation(self) -> float:
return 2e-3
__all__ = ["PoolFormerConfig", "PoolFormerOnnxConfig"]

View File

@ -16,9 +16,13 @@
# limitations under the License.
"""Pvt model configuration"""
from collections import OrderedDict
from collections.abc import Callable, Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -135,4 +139,24 @@ class PvtConfig(PreTrainedConfig):
self.qkv_bias = qkv_bias
__all__ = ["PvtConfig"]
class PvtOnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-4
@property
def default_onnx_opset(self) -> int:
return 12
__all__ = ["PvtConfig", "PvtOnnxConfig"]

View File

@ -14,7 +14,11 @@
# limitations under the License.
"""RemBERT model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -135,4 +139,24 @@ class RemBertConfig(PreTrainedConfig):
self.tie_word_embeddings = False
__all__ = ["RemBertConfig"]
class RemBertOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
("token_type_ids", dynamic_axis),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-4
__all__ = ["RemBertConfig", "RemBertOnnxConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""ResNet model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
from ...utils.backbone_utils import BackboneConfigMixin, get_aligned_output_features_output_indices
@ -111,4 +117,20 @@ class ResNetConfig(BackboneConfigMixin, PreTrainedConfig):
)
__all__ = ["ResNetConfig"]
class ResNetOnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-3
__all__ = ["ResNetConfig", "ResNetOnnxConfig"]

View File

@ -15,7 +15,11 @@
# limitations under the License.
"""RoBERTa configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -125,4 +129,19 @@ class RobertaConfig(PreTrainedConfig):
self.classifier_dropout = classifier_dropout
__all__ = ["RobertaConfig"]
class RobertaOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
]
)
__all__ = ["RobertaConfig", "RobertaOnnxConfig"]

View File

@ -15,7 +15,11 @@
# limitations under the License.
"""RoBERTa-PreLayerNorm configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -126,4 +130,20 @@ class RobertaPreLayerNormConfig(PreTrainedConfig):
self.classifier_dropout = classifier_dropout
__all__ = ["RobertaPreLayerNormConfig"]
# Copied from transformers.models.roberta.configuration_roberta.RobertaOnnxConfig with Roberta->RobertaPreLayerNorm
class RobertaPreLayerNormOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
]
)
__all__ = ["RobertaPreLayerNormConfig", "RobertaPreLayerNormOnnxConfig"]

View File

@ -14,7 +14,11 @@
# limitations under the License.
"""RoFormer model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -126,4 +130,21 @@ class RoFormerConfig(PreTrainedConfig):
self.use_cache = use_cache
__all__ = ["RoFormerConfig"]
class RoFormerOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
("token_type_ids", dynamic_axis),
]
)
__all__ = ["RoFormerConfig", "RoFormerOnnxConfig"]

View File

@ -15,8 +15,13 @@
"""SegFormer model configuration"""
import warnings
from collections import OrderedDict
from collections.abc import Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -143,4 +148,24 @@ class SegformerConfig(PreTrainedConfig):
self.semantic_loss_ignore_index = semantic_loss_ignore_index
__all__ = ["SegformerConfig"]
class SegformerOnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-4
@property
def default_onnx_opset(self) -> int:
return 12
__all__ = ["SegformerConfig", "SegformerOnnxConfig"]

View File

@ -14,7 +14,11 @@
# limitations under the License.
"""SqueezeBERT model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -143,4 +147,21 @@ class SqueezeBertConfig(PreTrainedConfig):
self.output_groups = output_groups
__all__ = ["SqueezeBertConfig"]
# Copied from transformers.models.bert.configuration_bert.BertOnnxConfig with Bert->SqueezeBert
class SqueezeBertOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
("token_type_ids", dynamic_axis),
]
)
__all__ = ["SqueezeBertConfig", "SqueezeBertOnnxConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""SwiftFormer model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -123,4 +129,20 @@ class SwiftFormerConfig(PreTrainedConfig):
self.batch_norm_eps = batch_norm_eps
__all__ = ["SwiftFormerConfig"]
class SwiftFormerOnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-4
__all__ = ["SwiftFormerConfig", "SwiftFormerOnnxConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""Swin Transformer model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
from ...utils.backbone_utils import BackboneConfigMixin, get_aligned_output_features_output_indices
@ -154,4 +160,20 @@ class SwinConfig(BackboneConfigMixin, PreTrainedConfig):
)
__all__ = ["SwinConfig"]
class SwinOnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-4
__all__ = ["SwinConfig", "SwinOnnxConfig"]

View File

@ -14,7 +14,10 @@
# limitations under the License.
"""T5 model configuration"""
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxSeq2SeqConfigWithPast
from ...utils import logging
@ -140,4 +143,29 @@ class T5Config(PreTrainedConfig):
)
__all__ = ["T5Config"]
class T5OnnxConfig(OnnxSeq2SeqConfigWithPast):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
common_inputs = {
"input_ids": {0: "batch", 1: "encoder_sequence"},
"attention_mask": {0: "batch", 1: "encoder_sequence"},
}
if self.use_past:
common_inputs["attention_mask"][1] = "past_encoder_sequence + sequence"
common_inputs["decoder_input_ids"] = {0: "batch"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "past_decoder_sequence + sequence"}
else:
common_inputs["decoder_input_ids"] = {0: "batch", 1: "decoder_sequence"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "decoder_sequence"}
if self.use_past:
self.fill_with_past_key_values_(common_inputs, direction="inputs")
return common_inputs
@property
def default_onnx_opset(self) -> int:
return 13
__all__ = ["T5Config", "T5OnnxConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""Table Transformer model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
from ...utils.backbone_utils import verify_backbone_config_arguments
from ..auto import CONFIG_MAPPING, AutoConfig
@ -241,4 +247,26 @@ class TableTransformerConfig(PreTrainedConfig):
super().__init__(is_encoder_decoder=is_encoder_decoder, **kwargs)
__all__ = ["TableTransformerConfig"]
# Copied from transformers.models.detr.configuration_detr.DetrOnnxConfig
class TableTransformerOnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
("pixel_mask", {0: "batch"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-5
@property
def default_onnx_opset(self) -> int:
return 12
__all__ = ["TableTransformerConfig", "TableTransformerOnnxConfig"]

View File

@ -14,7 +14,10 @@
# limitations under the License.
"""UMT5 model configuration"""
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxSeq2SeqConfigWithPast
from ...utils import logging
@ -144,4 +147,35 @@ class UMT5Config(PreTrainedConfig):
)
__all__ = ["UMT5Config"]
class UMT5OnnxConfig(OnnxSeq2SeqConfigWithPast):
@property
# Copied from transformers.models.t5.configuration_t5.T5OnnxConfig.inputs
def inputs(self) -> Mapping[str, Mapping[int, str]]:
common_inputs = {
"input_ids": {0: "batch", 1: "encoder_sequence"},
"attention_mask": {0: "batch", 1: "encoder_sequence"},
}
if self.use_past:
common_inputs["attention_mask"][1] = "past_encoder_sequence + sequence"
common_inputs["decoder_input_ids"] = {0: "batch"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "past_decoder_sequence + sequence"}
else:
common_inputs["decoder_input_ids"] = {0: "batch", 1: "decoder_sequence"}
common_inputs["decoder_attention_mask"] = {0: "batch", 1: "decoder_sequence"}
if self.use_past:
self.fill_with_past_key_values_(common_inputs, direction="inputs")
return common_inputs
@property
# Copied from transformers.models.t5.configuration_t5.T5OnnxConfig.default_onnx_opset
def default_onnx_opset(self) -> int:
return 13
@property
def atol_for_validation(self) -> float:
return 5e-4
__all__ = ["UMT5Config", "UMT5OnnxConfig"]

View File

@ -14,11 +14,21 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from collections import OrderedDict
from collections.abc import Mapping
from typing import TYPE_CHECKING, Any
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
from ..auto.configuration_auto import AutoConfig
if TYPE_CHECKING:
from ... import PreTrainedTokenizerBase
logger = logging.get_logger(__name__)
@ -108,4 +118,100 @@ class VisionEncoderDecoderConfig(PreTrainedConfig):
return cls(encoder=encoder_config.to_dict(), decoder=decoder_config.to_dict(), **kwargs)
__all__ = ["VisionEncoderDecoderConfig"]
class VisionEncoderDecoderEncoderOnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-4
@property
def outputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict({"last_hidden_state": {0: "batch", 1: "encoder_sequence"}})
class VisionEncoderDecoderDecoderOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
common_inputs = OrderedDict()
common_inputs["input_ids"] = {0: "batch", 1: "past_decoder_sequence + sequence"}
common_inputs["attention_mask"] = {0: "batch", 1: "past_decoder_sequence + sequence"}
common_inputs["encoder_hidden_states"] = {0: "batch", 1: "encoder_sequence"}
return common_inputs
def generate_dummy_inputs(
self,
tokenizer: "PreTrainedTokenizerBase",
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
import torch
common_inputs = OrderedDict()
dummy_input = super().generate_dummy_inputs(
tokenizer,
batch_size=batch_size,
seq_length=seq_length,
is_pair=is_pair,
)
batch, encoder_sequence = dummy_input["input_ids"].shape
encoder_hidden_states_shape = (batch, encoder_sequence, self._config.encoder_hidden_size)
common_inputs["input_ids"] = dummy_input.pop("input_ids")
common_inputs["attention_mask"] = dummy_input.pop("attention_mask")
common_inputs["encoder_hidden_states"] = torch.zeros(encoder_hidden_states_shape)
return common_inputs
class VisionEncoderDecoderOnnxConfig(OnnxConfig):
@property
def inputs(self) -> None:
pass
def get_encoder_config(self, encoder_config: PreTrainedConfig) -> OnnxConfig:
r"""
Returns ONNX encoder config for `VisionEncoderDecoder` model.
Args:
encoder_config (`PreTrainedConfig`):
The encoder model's configuration to use when exporting to ONNX.
Returns:
[`VisionEncoderDecoderEncoderOnnxConfig`]: An instance of the ONNX configuration object
"""
return VisionEncoderDecoderEncoderOnnxConfig(encoder_config)
def get_decoder_config(
self, encoder_config: PreTrainedConfig, decoder_config: PreTrainedConfig, feature: str = "default"
) -> OnnxConfig:
r"""
Returns ONNX decoder config for `VisionEncoderDecoder` model.
Args:
encoder_config (`PreTrainedConfig`):
The encoder model's configuration to use when exporting to ONNX.
decoder_config (`PreTrainedConfig`):
The decoder model's configuration to use when exporting to ONNX
feature (`str`, *optional*):
The type of feature to export the model with.
Returns:
[`VisionEncoderDecoderDecoderOnnxConfig`]: An instance of the ONNX configuration object.
"""
decoder_config.encoder_hidden_size = encoder_config.hidden_size
return VisionEncoderDecoderDecoderOnnxConfig(decoder_config, feature)
__all__ = ["VisionEncoderDecoderConfig", "VisionEncoderDecoderOnnxConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""ViT model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -124,4 +130,20 @@ class ViTConfig(PreTrainedConfig):
self.pooler_act = pooler_act
__all__ = ["ViTConfig"]
class ViTOnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-4
__all__ = ["ViTConfig", "ViTOnnxConfig"]

View File

@ -14,10 +14,19 @@
# limitations under the License.
"""Whisper model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from typing import TYPE_CHECKING, Any, Union
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig, OnnxSeq2SeqConfigWithPast
from ...utils import logging
if TYPE_CHECKING:
from ...feature_extraction_utils import FeatureExtractionMixin
from ...tokenization_utils_base import PreTrainedTokenizerBase
logger = logging.get_logger(__name__)
@ -276,4 +285,64 @@ class WhisperConfig(PreTrainedConfig):
)
__all__ = ["WhisperConfig"]
class WhisperOnnxConfig(OnnxSeq2SeqConfigWithPast):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
common_inputs = OrderedDict(
[
("input_features", {0: "batch", 1: "feature_size", 2: "encoder_sequence"}),
]
)
if self.use_past:
common_inputs["decoder_input_ids"] = {0: "batch"}
else:
common_inputs["decoder_input_ids"] = {0: "batch", 1: "decoder_sequence"}
if self.use_past:
self.fill_with_past_key_values_(common_inputs, direction="inputs")
return common_inputs
def generate_dummy_inputs(
self,
preprocessor: Union["PreTrainedTokenizerBase", "FeatureExtractionMixin"],
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
sampling_rate: int = 22050,
time_duration: float = 5.0,
frequency: int = 220,
) -> Mapping[str, Any]:
dummy_inputs = OrderedDict()
encoder_inputs = OnnxConfig.generate_dummy_inputs(
self,
preprocessor=preprocessor.feature_extractor,
batch_size=batch_size,
sampling_rate=sampling_rate,
time_duration=time_duration,
frequency=frequency,
)
encoder_sequence_length = encoder_inputs["input_features"].shape[2]
seq_length = encoder_sequence_length // 2 if self.use_past else seq_length
decoder_inputs = super().generate_dummy_inputs(
preprocessor.tokenizer,
batch_size,
seq_length,
is_pair,
)
dummy_inputs["input_features"] = encoder_inputs.pop("input_features")
dummy_inputs["decoder_input_ids"] = decoder_inputs.pop("decoder_input_ids")
if "past_key_values" in decoder_inputs:
dummy_inputs["past_key_values"] = decoder_inputs.pop("past_key_values")
return dummy_inputs
@property
def atol_for_validation(self) -> float:
return 1e-3
__all__ = ["WhisperConfig", "WhisperOnnxConfig"]

View File

@ -14,7 +14,11 @@
# limitations under the License.
"""XLM configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -217,4 +221,21 @@ class XLMConfig(PreTrainedConfig):
super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, **kwargs)
__all__ = ["XLMConfig"]
# Copied from transformers.models.bert.configuration_bert.BertOnnxConfig
class XLMOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
("token_type_ids", dynamic_axis),
]
)
__all__ = ["XLMConfig", "XLMOnnxConfig"]

View File

@ -15,7 +15,11 @@
# limitations under the License.
"""XLM-RoBERTa configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -126,4 +130,20 @@ class XLMRobertaConfig(PreTrainedConfig):
self.classifier_dropout = classifier_dropout
__all__ = ["XLMRobertaConfig"]
# Copied from transformers.models.roberta.configuration_roberta.RobertaOnnxConfig with Roberta->XLMRoberta
class XLMRobertaOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
]
)
__all__ = ["XLMRobertaConfig", "XLMRobertaOnnxConfig"]

View File

@ -14,7 +14,11 @@
# limitations under the License.
"""XLM_ROBERTa_XL configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -122,4 +126,20 @@ class XLMRobertaXLConfig(PreTrainedConfig):
self.classifier_dropout = classifier_dropout
__all__ = ["XLMRobertaXLConfig"]
# Copied from transformers.models.roberta.configuration_roberta.RobertaOnnxConfig with Roberta->XLMRobertaXL
class XLMRobertaXLOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
]
)
__all__ = ["XLMRobertaXLConfig", "XLMRobertaXLOnnxConfig"]

View File

@ -15,7 +15,11 @@
# limitations under the License.
"""X-MOD configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -154,4 +158,20 @@ class XmodConfig(PreTrainedConfig):
self.default_language = default_language
__all__ = ["XmodConfig"]
# Copied from transformers.models.roberta.configuration_roberta.RobertaOnnxConfig with Roberta->Xmod
class XmodOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
if self.task == "multiple-choice":
dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
else:
dynamic_axis = {0: "batch", 1: "sequence"}
return OrderedDict(
[
("input_ids", dynamic_axis),
("attention_mask", dynamic_axis),
]
)
__all__ = ["XmodConfig", "XmodOnnxConfig"]

View File

@ -14,7 +14,13 @@
# limitations under the License.
"""YOLOS model configuration"""
from collections import OrderedDict
from collections.abc import Mapping
from packaging import version
from ...configuration_utils import PreTrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging
@ -149,4 +155,24 @@ class YolosConfig(PreTrainedConfig):
self.eos_coefficient = eos_coefficient
__all__ = ["YolosConfig"]
class YolosOnnxConfig(OnnxConfig):
torch_onnx_minimum_version = version.parse("1.11")
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
]
)
@property
def atol_for_validation(self) -> float:
return 1e-4
@property
def default_onnx_opset(self) -> int:
return 12
__all__ = ["YolosConfig", "YolosOnnxConfig"]

View File

@ -0,0 +1,45 @@
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ..utils import _LazyModule
_import_structure = {
"config": [
"EXTERNAL_DATA_FORMAT_SIZE_LIMIT",
"OnnxConfig",
"OnnxConfigWithPast",
"OnnxSeq2SeqConfigWithPast",
"PatchingSpec",
],
"utils": ["ParameterFormat", "compute_serialized_parameters_size"],
}
if TYPE_CHECKING:
from .config import (
EXTERNAL_DATA_FORMAT_SIZE_LIMIT,
OnnxConfig,
OnnxConfigWithPast,
OnnxSeq2SeqConfigWithPast,
PatchingSpec,
)
from .utils import ParameterFormat, compute_serialized_parameters_size
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
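With the `_LazyModule` indirection, importing the package is cheap and the submodules above only load on first attribute access; a quick check:

```python
from transformers.onnx import OnnxConfig  # triggers the lazy load of .config

print(OnnxConfig.default_fixed_batch, OnnxConfig.default_fixed_sequence)  # 2 8
```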

View File

@ -0,0 +1,748 @@
# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import copy
import dataclasses
import warnings
from abc import ABC, abstractmethod
from collections import OrderedDict
from collections.abc import Callable, Iterable, Mapping
from typing import TYPE_CHECKING, Any, Optional, Union
import numpy as np
from packaging import version
from ..utils import is_torch_available, is_vision_available, logging
from .utils import ParameterFormat, compute_effective_axis_dimension, compute_serialized_parameters_size
if TYPE_CHECKING:
from ..configuration_utils import PreTrainedConfig
from ..feature_extraction_utils import FeatureExtractionMixin
from ..image_processing_utils import ImageProcessingMixin
from ..tokenization_utils_base import PreTrainedTokenizerBase
if is_vision_available():
from PIL import Image
logger = logging.get_logger(__name__)
DEFAULT_ONNX_OPSET = 11
# 2 GiB
EXTERNAL_DATA_FORMAT_SIZE_LIMIT = 2 * 1024 * 1024 * 1024
@dataclasses.dataclass
class PatchingSpec:
"""
Data class that holds patching specifications.
Args:
o: Module / object where the op to patch is located
name: Name of the op to monkey patch
custom_op: Custom op that patches the original op
orig_op: Original op that is being patched
op_wrapper: Wrapper (optional) that wraps both the original and custom ops.
It is useful for ops that are class or static methods for instance.
"""
o: Any
name: str
custom_op: Callable
orig_op: Optional[Callable] = None
op_wrapper: Optional[Callable] = None
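# Illustrative use (not part of this module): patch an op for the duration of an export, then restore it.
#   spec = PatchingSpec(o=math, name="cos", custom_op=lambda x: 1.0)
#   spec = dataclasses.replace(spec, orig_op=getattr(spec.o, spec.name))
#   setattr(spec.o, spec.name, spec.custom_op)  # apply the patch
#   setattr(spec.o, spec.name, spec.orig_op)    # undo the patch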
class OnnxConfig(ABC):
"""
Base class for ONNX exportable model describing metadata on how to export the model through the ONNX format.
"""
default_fixed_batch = 2
default_fixed_sequence = 8
default_fixed_num_choices = 4
torch_onnx_minimum_version = version.parse("1.8")
_tasks_to_common_outputs = {
"causal-lm": OrderedDict({"logits": {0: "batch", 1: "sequence"}}),
"default": OrderedDict({"last_hidden_state": {0: "batch", 1: "sequence"}}),
"image-classification": OrderedDict({"logits": {0: "batch", 1: "sequence"}}),
"image-segmentation": OrderedDict(
{
"logits": {0: "batch", 1: "sequence"},
"pred_boxes": {0: "batch", 1: "sequence"},
"pred_masks": {0: "batch", 1: "sequence"},
}
),
"masked-im": OrderedDict({"logits": {0: "batch", 1: "sequence"}}),
"masked-lm": OrderedDict({"logits": {0: "batch", 1: "sequence"}}),
"multiple-choice": OrderedDict({"logits": {0: "batch"}}),
"object-detection": OrderedDict(
{
"logits": {0: "batch", 1: "sequence"},
"pred_boxes": {0: "batch", 1: "sequence"},
}
),
"question-answering": OrderedDict(
{
"start_logits": {0: "batch", 1: "sequence"},
"end_logits": {0: "batch", 1: "sequence"},
}
),
"semantic-segmentation": OrderedDict({"logits": {0: "batch", 1: "num_labels", 2: "height", 3: "width"}}),
"seq2seq-lm": OrderedDict({"logits": {0: "batch", 1: "decoder_sequence"}}),
"sequence-classification": OrderedDict({"logits": {0: "batch"}}),
"token-classification": OrderedDict({"logits": {0: "batch", 1: "sequence"}}),
"vision2seq-lm": OrderedDict({"logits": {0: "batch", 1: "sequence"}}),
"speech2seq-lm": OrderedDict({"logits": {0: "batch", 1: "sequence"}}),
}
def __init__(
self, config: "PreTrainedConfig", task: str = "default", patching_specs: Optional[list[PatchingSpec]] = None
):
self._config = config
if task not in self._tasks_to_common_outputs:
raise ValueError(
f"{task} is not a supported task, supported tasks: {self._tasks_to_common_outputs.keys()}"
)
self.task = task
self._patching_specs = []
for spec in patching_specs if patching_specs is not None else []:
final_spec = spec
if spec.orig_op is None:
final_spec = dataclasses.replace(spec, orig_op=getattr(spec.o, spec.name))
self._patching_specs.append(final_spec)
@classmethod
def from_model_config(cls, config: "PreTrainedConfig", task: str = "default") -> "OnnxConfig":
"""
Instantiate an OnnxConfig for a specific model
Args:
config: The model's configuration to use when exporting to ONNX
Returns:
OnnxConfig for this model
"""
return cls(config, task=task)
@property
@abstractmethod
def inputs(self) -> Mapping[str, Mapping[int, str]]:
"""
Mapping containing the axis definition of the input tensors to provide to the model
Returns:
For each input: its name, mapped to the axis positions within the tensor and their symbolic names
"""
raise NotImplementedError()
@property
def outputs(self) -> Mapping[str, Mapping[int, str]]:
"""
Mapping containing the axis definition of the output tensors to provide to the model
Returns:
For each output: its name mapped to the axis position within the tensor and the axis's symbolic name
"""
common_outputs = self._tasks_to_common_outputs[self.task]
return copy.deepcopy(common_outputs)
@property
def values_override(self) -> Optional[Mapping[str, Any]]:
"""
Dictionary of keys to override in the model's config before exporting
Returns:
Dictionary with the keys (and their corresponding values) to override
"""
if hasattr(self._config, "use_cache"):
return {"use_cache": False}
return None
@property
def default_batch_size(self) -> int:
"""
The default batch size to use if no other indication is given
Returns:
Integer > 0
"""
# Using 2 avoids ONNX making assumptions about single-sample batches
return OnnxConfig.default_fixed_batch
@property
def default_sequence_length(self) -> int:
"""
The default sequence length to use if no other indication is given
Returns:
Integer > 0
"""
return OnnxConfig.default_fixed_sequence
@property
def default_num_choices(self) -> int:
"""
The default number of choices to use if no other indication is given
Returns:
Integer > 0
"""
return OnnxConfig.default_fixed_num_choices
@property
def default_onnx_opset(self) -> int:
"""
Which ONNX opset to use when exporting the model
Returns:
Integer ONNX Opset version
"""
return DEFAULT_ONNX_OPSET
@property
def atol_for_validation(self) -> float:
"""
What absolute tolerance value to use during model conversion validation.
Returns:
Float absolute tolerance value.
"""
return 1e-5
@property
def is_torch_support_available(self) -> bool:
"""
Whether the installed PyTorch version meets the minimum required to export the model.
Returns:
`bool`: Whether the installed version of PyTorch is compatible with the model.
"""
if is_torch_available():
from transformers.utils import get_torch_version
return version.parse(get_torch_version()) >= self.torch_onnx_minimum_version
else:
return False
@staticmethod
def use_external_data_format(num_parameters: int) -> bool:
"""
Flag indicating if the model requires using the external data format
Args:
num_parameters: Number of parameters in the model
Returns:
True if model.num_parameters() * size_of(float32) >= 2 GB, False otherwise
"""
return (
compute_serialized_parameters_size(num_parameters, ParameterFormat.Float)
>= EXTERNAL_DATA_FORMAT_SIZE_LIMIT
)
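# Quick sanity check (illustrative): a hypothetical 700M-parameter float32 model
# serializes to about 700e6 * 4 bytes ≈ 2.8 GB, above the 2 GB protobuf limit:
#   OnnxConfig.use_external_data_format(700_000_000)  # -> True
#   OnnxConfig.use_external_data_format(100_000_000)  # -> False (~0.4 GB)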
def _generate_dummy_images(
self, batch_size: int = 2, num_channels: int = 3, image_height: int = 40, image_width: int = 40
):
images = []
for _ in range(batch_size):
data = np.random.rand(image_height, image_width, num_channels) * 255
images.append(Image.fromarray(data.astype("uint8")).convert("RGB"))
return images
def _generate_dummy_audio(
self, batch_size: int = 2, sampling_rate: int = 22050, time_duration: float = 5.0, frequency: int = 220
):
audio_data = []
for _ in range(batch_size):
# time variable
t = np.linspace(0, time_duration, int(time_duration * sampling_rate), endpoint=False)
# generate pure sine wave at `frequency` Hz
audio_data.append(0.5 * np.sin(2 * np.pi * frequency * t))
return audio_data
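# With the defaults above (illustrative): 5.0 s at 22050 Hz gives each batch item
# int(5.0 * 22050) = 110250 samples of a 220 Hz sine wave with amplitude 0.5.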
def generate_dummy_inputs(
self,
preprocessor: Union["PreTrainedTokenizerBase", "FeatureExtractionMixin", "ImageProcessingMixin"],
batch_size: int = -1,
seq_length: int = -1,
num_choices: int = -1,
is_pair: bool = False,
num_channels: int = 3,
image_width: int = 40,
image_height: int = 40,
sampling_rate: int = 22050,
time_duration: float = 5.0,
frequency: int = 220,
tokenizer: Optional["PreTrainedTokenizerBase"] = None,
) -> Mapping[str, Any]:
"""
Generate inputs to provide to the ONNX exporter
Args:
preprocessor: ([`PreTrainedTokenizerBase`], [`FeatureExtractionMixin`], or [`ImageProcessingMixin`]):
The preprocessor associated with this model configuration.
batch_size (`int`, *optional*, defaults to -1):
The batch size to export the model for (-1 means dynamic axis).
num_choices (`int`, *optional*, defaults to -1):
The number of candidate answers provided for multiple choice task (-1 means dynamic axis).
seq_length (`int`, *optional*, defaults to -1):
The sequence length to export the model for (-1 means dynamic axis).
is_pair (`bool`, *optional*, defaults to `False`):
Indicate if the input is a pair (sentence 1, sentence 2)
num_channels (`int`, *optional*, defaults to 3):
The number of channels of the generated images.
image_width (`int`, *optional*, defaults to 40):
The width of the generated images.
image_height (`int`, *optional*, defaults to 40):
The height of the generated images.
sampling_rate (`int`, *optional*, defaults to 22050):
The sampling rate for audio data generation.
time_duration (`float`, *optional*, defaults to 5.0):
Total seconds of sampling for audio data generation.
frequency (`int`, *optional*, defaults to 220):
The desired natural frequency of generated audio.
Returns:
Mapping[str, Tensor] holding the kwargs to provide to the model's forward function
"""
from ..feature_extraction_utils import FeatureExtractionMixin
from ..image_processing_utils import ImageProcessingMixin
from ..tokenization_utils_base import PreTrainedTokenizerBase
if isinstance(preprocessor, PreTrainedTokenizerBase) and tokenizer is not None:
raise ValueError("You cannot provide both a tokenizer and a preprocessor to generate dummy inputs.")
if tokenizer is not None:
warnings.warn(
"The `tokenizer` argument is deprecated and will be removed in version 5 of Transformers. Use"
" `preprocessor` instead.",
FutureWarning,
)
logger.warning("Overwriting the `preprocessor` argument with `tokenizer` to generate dummy inputs.")
preprocessor = tokenizer
if isinstance(preprocessor, PreTrainedTokenizerBase):
# If dynamic axis (-1) we forward with a fixed dimension of 2 samples to avoid optimizations made by ONNX
batch_size = compute_effective_axis_dimension(
batch_size, fixed_dimension=OnnxConfig.default_fixed_batch, num_token_to_add=0
)
# If dynamic axis (-1) we forward with a fixed dimension of 8 tokens to avoid optimizations made by ONNX
token_to_add = preprocessor.num_special_tokens_to_add(is_pair)
seq_length = compute_effective_axis_dimension(
seq_length, fixed_dimension=OnnxConfig.default_fixed_sequence, num_token_to_add=token_to_add
)
# Generate dummy inputs according to compute batch and sequence
input_token = (
preprocessor.unk_token
if (preprocessor.unk_token is not None and len(preprocessor.unk_token) > 0)
else "0"
)
dummy_input = [" ".join([input_token] * seq_length)] * batch_size  # seq_length space-separated tokens per sample
if self.task == "multiple-choice":
# If dynamic axis (-1) we forward with a fixed dimension of 4 candidate answers to avoid optimizations
# made by ONNX
num_choices = compute_effective_axis_dimension(
num_choices, fixed_dimension=OnnxConfig.default_fixed_num_choices, num_token_to_add=0
)
dummy_input = dummy_input * num_choices
# The shape of the tokenized inputs values is [batch_size * num_choices, seq_length]
tokenized_input = preprocessor(dummy_input, text_pair=dummy_input)
# Unflatten the tokenized inputs values expanding it to the shape [batch_size, num_choices, seq_length]
for k, v in tokenized_input.items():
tokenized_input[k] = [v[i : i + num_choices] for i in range(0, len(v), num_choices)]
return dict(tokenized_input.convert_to_tensors(tensor_type="pt"))
return dict(preprocessor(dummy_input, return_tensors="pt"))
elif isinstance(preprocessor, ImageProcessingMixin):
if preprocessor.model_input_names[0] != "pixel_values":
raise ValueError(
f"The `preprocessor` is an image processor ({preprocessor.__class__.__name__}) and expects"
f' `model_input_names[0]` to be "pixel_values", but got {preprocessor.model_input_names[0]}'
)
# If dynamic axis (-1) we forward with a fixed dimension of 2 samples to avoid optimizations made by ONNX
batch_size = compute_effective_axis_dimension(batch_size, fixed_dimension=OnnxConfig.default_fixed_batch)
dummy_input = self._generate_dummy_images(batch_size, num_channels, image_height, image_width)
return dict(preprocessor(images=dummy_input, return_tensors="pt"))
elif isinstance(preprocessor, FeatureExtractionMixin) and preprocessor.model_input_names[0] == "pixel_values":
# If dynamic axis (-1) we forward with a fixed dimension of 2 samples to avoid optimizations made by ONNX
batch_size = compute_effective_axis_dimension(batch_size, fixed_dimension=OnnxConfig.default_fixed_batch)
dummy_input = self._generate_dummy_images(batch_size, num_channels, image_height, image_width)
return dict(preprocessor(images=dummy_input, return_tensors="pt"))
elif (
isinstance(preprocessor, FeatureExtractionMixin) and preprocessor.model_input_names[0] == "input_features"
):
# If dynamic axis (-1) we forward with a fixed dimension of 2 samples to avoid optimizations made by ONNX
batch_size = compute_effective_axis_dimension(batch_size, fixed_dimension=OnnxConfig.default_fixed_batch)
dummy_input = self._generate_dummy_audio(batch_size, sampling_rate, time_duration, frequency)
return dict(preprocessor(dummy_input, return_tensors="pt"))
else:
raise ValueError(
"Unable to generate dummy inputs for the model. Please provide a tokenizer or a preprocessor."
)
def generate_dummy_inputs_onnxruntime(self, reference_model_inputs: Mapping[str, Any]) -> Mapping[str, Any]:
"""
Generate inputs for ONNX Runtime using the reference model inputs. Override this to run inference with seq2seq
models which have the encoder and decoder exported as separate ONNX files.
Args:
reference_model_inputs (`Mapping[str, Tensor]`):
Reference inputs for the model.
Returns:
`Mapping[str, Tensor]`: The mapping holding the kwargs to provide to the model's forward function
"""
return reference_model_inputs
def patch_ops(self):
for spec in self._patching_specs:
custom_op = spec.custom_op if spec.op_wrapper is None else spec.op_wrapper(spec.custom_op)
setattr(spec.o, spec.name, custom_op)
def restore_ops(self):
for spec in self._patching_specs:
orig_op = spec.orig_op if spec.op_wrapper is None else spec.op_wrapper(spec.orig_op)
setattr(spec.o, spec.name, orig_op)
@classmethod
def flatten_output_collection_property(cls, name: str, field: Iterable[Any]) -> dict[str, Any]:
"""
Flatten any potentially nested structure, expanding the name of the field with the index of the element within the
structure.
Args:
name: The name of the nested structure
field: The structure to, potentially, be flattened
Returns:
(dict[str, Any]): Outputs with flattened structure and keys mapping to this new structure.
"""
from itertools import chain
return {f"{name}.{idx}": item for idx, item in enumerate(chain.from_iterable(field))}
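# Illustrative example of the flattening contract: nested elements are expanded
# into indexed keys, e.g.
#   flatten_output_collection_property("hidden", [("a", "b"), ("c",)])
#   -> {"hidden.0": "a", "hidden.1": "b", "hidden.2": "c"}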
class OnnxConfigWithPast(OnnxConfig, ABC):
def __init__(
self,
config: "PreTrainedConfig",
task: str = "default",
patching_specs: Optional[list[PatchingSpec]] = None,
use_past: bool = False,
):
super().__init__(config, task=task, patching_specs=patching_specs)
self.use_past = use_past
@classmethod
def with_past(cls, config: "PreTrainedConfig", task: str = "default") -> "OnnxConfigWithPast":
"""
Instantiate an OnnxConfig with the `use_past` attribute set to `True`
Args:
config: The underlying model's config to use when exporting to ONNX
Returns:
OnnxConfig with `.use_past = True`
"""
return cls(config, task=task, use_past=True)
@property
def outputs(self) -> Mapping[str, Mapping[int, str]]:
common_outputs = super().outputs
if self.use_past:
self.fill_with_past_key_values_(common_outputs, direction="outputs")
return common_outputs
@property
def values_override(self) -> Optional[Mapping[str, Any]]:
if hasattr(self._config, "use_cache"):
return {"use_cache": self.use_past}
return None
@property
def num_layers(self) -> int:
"""
The number of layers attribute retrieved from the model config. Override this for model configs where the
number of layers attribute is not called `num_layers`.
"""
if not hasattr(self._config, "num_layers"):
raise AttributeError(
"could not find the number of layers attribute in the model configuration, override the num_layers"
" property of the model OnnxConfig to solve this"
)
return self._config.num_layers
@property
def num_attention_heads(self) -> int:
"""
The number of attention heads attribute retrieved from the model config. Override this for model configs where
the number of attention heads attribute is not called `num_attention_heads`.
"""
if not hasattr(self._config, "num_attention_heads"):
raise AttributeError(
"could not find the number of attention heads attribute in the model configuration, override the"
" num_attention_heads property of the model OnnxConfig to solve this"
)
return self._config.num_attention_heads
def generate_dummy_inputs(
self,
tokenizer: "PreTrainedTokenizerBase",
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
# TODO: should we set seq_length = 1 when self.use_past = True?
common_inputs = super().generate_dummy_inputs(
tokenizer,
batch_size=batch_size,
seq_length=seq_length,
is_pair=is_pair,
)
if self.use_past:
if not is_torch_available():
raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
else:
import torch
batch, seqlen = common_inputs["input_ids"].shape
# Not using the same length for past_key_values
past_key_values_length = seqlen + 2
shape = (
batch,
self.num_attention_heads,
past_key_values_length,
self._config.hidden_size // self.num_attention_heads,
)
if "attention_mask" in common_inputs:
mask_dtype = common_inputs["attention_mask"].dtype
common_inputs["attention_mask"] = torch.cat(
[common_inputs["attention_mask"], torch.ones(batch, past_key_values_length, dtype=mask_dtype)],
dim=1,
)
common_inputs["past_key_values"] = []
for _ in range(self.num_layers):
common_inputs["past_key_values"].append((torch.zeros(shape), torch.zeros(shape)))
return common_inputs
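# Shape sketch (illustrative): for a hypothetical decoder with 12 layers, 12 heads
# and hidden_size=768, the cache built above holds 12 (key, value) pairs of zeros
# shaped (2, 12, seq_len + 2, 64), and the attention mask is widened by the same
# past length so it covers past and current positions alike.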
def fill_with_past_key_values_(
self, inputs_or_outputs: Mapping[str, Mapping[int, str]], direction: str, inverted_values_shape: bool = False
):
"""
Fill the inputs_or_outputs mapping with past_key_values dynamic axes, taking the direction into account.
Args:
inputs_or_outputs: The mapping to fill.
direction: either "inputs" or "outputs"; it specifies whether inputs_or_outputs is the input mapping or the
output mapping. This is important for axes naming.
inverted_values_shape:
If `True`, store values on dynamic axis 1, else on axis 2.
"""
if direction not in ["inputs", "outputs"]:
raise ValueError(f'direction must either be "inputs" or "outputs", but {direction} was given')
name = "past_key_values" if direction == "inputs" else "present"
for i in range(self.num_layers):
inputs_or_outputs[f"{name}.{i}.key"] = {0: "batch", 2: "past_sequence + sequence"}
if inverted_values_shape:
inputs_or_outputs[f"{name}.{i}.value"] = {0: "batch", 1: "past_sequence + sequence"}
else:
inputs_or_outputs[f"{name}.{i}.value"] = {0: "batch", 2: "past_sequence + sequence"}
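# For example (illustrative), with num_layers=2 and direction="inputs" this registers:
#   past_key_values.0.key   -> {0: "batch", 2: "past_sequence + sequence"}
#   past_key_values.0.value -> {0: "batch", 2: "past_sequence + sequence"}
#   past_key_values.1.key / past_key_values.1.value -> likewise
# With direction="outputs" the same axes appear under the "present.*" names.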
def _flatten_past_key_values_(self, flattened_output, name, idx, t):
flattened_output[f"{name}.{idx}.key"] = t[0]
flattened_output[f"{name}.{idx}.value"] = t[1]
def flatten_output_collection_property(self, name: str, field: Iterable[Any]) -> dict[str, Any]:
flattened_output = {}
if name in ["present", "past_key_values"]:
for idx, t in enumerate(field):
self._flatten_past_key_values_(flattened_output, name, idx, t)
else:
flattened_output = super().flatten_output_collection_property(name, field)
return flattened_output
class OnnxSeq2SeqConfigWithPast(OnnxConfigWithPast):
@property
def outputs(self) -> Mapping[str, Mapping[int, str]]:
common_outputs = super(OnnxConfigWithPast, self).outputs
# Renaming the outputs axes properly.
for name, axes_names in common_outputs.items():
sequence_name = "encoder_sequence" if "encoder" in name else "decoder_sequence"
for axis_idx, name in axes_names.items():
if "sequence" in name:
axes_names[axis_idx] = sequence_name
# We reset the value as the order in common_outputs (OrderedDict) is lost otherwise
else:
axes_names[axis_idx] = name
if self.use_past:
self.fill_with_past_key_values_(common_outputs, direction="outputs")
return common_outputs
@property
def num_layers(self) -> tuple[int, ...]:
try:
num_layers = super().num_layers
num_layers = (num_layers, num_layers)
except AttributeError:
if hasattr(self._config, "encoder_layers") and hasattr(self._config, "decoder_layers"):
num_layers = (self._config.encoder_layers, self._config.decoder_layers)
else:
raise AttributeError(
"could not find the number of encoder and decoder layers attributes in the model configuration,"
" override the num_layers property of the model OnnxConfig to solve this"
)
return num_layers
@property
def num_attention_heads(self) -> tuple[int, ...]:
try:
num_attention_heads = super().num_attention_heads
num_attention_heads = (num_attention_heads, num_attention_heads)
except AttributeError:
if hasattr(self._config, "encoder_attention_heads") and hasattr(self._config, "decoder_attention_heads"):
num_attention_heads = (self._config.encoder_attention_heads, self._config.decoder_attention_heads)
else:
raise AttributeError(
"could not find the number of attention heads for the encoder and the decoder attributes in the"
" model configuration, override the num_attention_heads property of the model OnnxConfig to solve"
" this"
)
return num_attention_heads
def generate_dummy_inputs(
self,
tokenizer: Optional["PreTrainedTokenizerBase"],
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
) -> Mapping[str, Any]:
encoder_inputs = super(OnnxConfigWithPast, self).generate_dummy_inputs(
tokenizer,
batch_size=batch_size,
seq_length=seq_length,
is_pair=is_pair,
)
# Generate decoder inputs
decoder_seq_length = seq_length if not self.use_past else 1
decoder_inputs = super(OnnxConfigWithPast, self).generate_dummy_inputs(
tokenizer,
batch_size=batch_size,
seq_length=decoder_seq_length,
is_pair=is_pair,
)
decoder_inputs = {f"decoder_{name}": tensor for name, tensor in decoder_inputs.items()}
common_inputs = dict(**encoder_inputs, **decoder_inputs)
if self.use_past:
if not is_torch_available():
raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
else:
import torch
batch = common_inputs["input_ids"].shape[0]
encoder_seq_length = common_inputs["input_ids"].shape[1]
decoder_seq_length = common_inputs["decoder_input_ids"].shape[1]
num_encoder_attention_heads, num_decoder_attention_heads = self.num_attention_heads
encoder_shape = (
batch,
num_encoder_attention_heads,
encoder_seq_length,
self._config.hidden_size // num_encoder_attention_heads,
)
decoder_shape = (
batch,
num_decoder_attention_heads,
# Not using the same length for past_key_values
decoder_seq_length + 3,
self._config.hidden_size // num_decoder_attention_heads,
)
common_inputs["past_key_values"] = []
# If the numbers of encoder and decoder layers are present in the model configuration, both are taken into account
num_encoder_layers, num_decoder_layers = self.num_layers
min_num_layers = min(num_encoder_layers, num_decoder_layers)
max_num_layers = max(num_encoder_layers, num_decoder_layers) - min_num_layers
remaining_side_name = "encoder" if num_encoder_layers > num_decoder_layers else "decoder"
for _ in range(min_num_layers):
# For encoder-decoder models, past_key_values contains pre-computed values for both the encoder and the
# decoder layers, hence a tuple of 4 tensors instead of 2
common_inputs["past_key_values"].append(
(
torch.zeros(decoder_shape),
torch.zeros(decoder_shape),
torch.zeros(encoder_shape),
torch.zeros(encoder_shape),
)
)
# TODO: test this.
shape = encoder_shape if remaining_side_name == "encoder" else decoder_shape
for _ in range(min_num_layers, max_num_layers):
common_inputs["past_key_values"].append((torch.zeros(shape), torch.zeros(shape)))
return common_inputs
def fill_with_past_key_values_(self, inputs_or_outputs: Mapping[str, Mapping[int, str]], direction: str):
if direction not in ["inputs", "outputs"]:
raise ValueError(f'direction must either be "inputs" or "outputs", but {direction} was given')
name = "past_key_values" if direction == "inputs" else "present"
# If the numbers of encoder and decoder layers are present in the model configuration, both are taken into account
num_encoder_layers, num_decoder_layers = self.num_layers
min_num_layers = min(num_encoder_layers, num_decoder_layers)
max_num_layers = max(num_encoder_layers, num_decoder_layers) - min_num_layers
remaining_side_name = "encoder" if num_encoder_layers > num_decoder_layers else "decoder"
encoder_sequence = "past_encoder_sequence"
decoder_sequence = "past_decoder_sequence" if direction == "inputs" else "past_decoder_sequence + sequence"
for i in range(min_num_layers):
inputs_or_outputs[f"{name}.{i}.decoder.key"] = {0: "batch", 2: decoder_sequence}
inputs_or_outputs[f"{name}.{i}.decoder.value"] = {0: "batch", 2: decoder_sequence}
inputs_or_outputs[f"{name}.{i}.encoder.key"] = {0: "batch", 2: encoder_sequence}
inputs_or_outputs[f"{name}.{i}.encoder.value"] = {0: "batch", 2: encoder_sequence}
for i in range(min_num_layers, max_num_layers):
if remaining_side_name == "encoder":
axes_info = {0: "batch", 2: encoder_sequence}
else:
axes_info = {0: "batch", 2: decoder_sequence}
inputs_or_outputs[f"{name}.{i}.{remaining_side_name}.key"] = axes_info
inputs_or_outputs[f"{name}.{i}.{remaining_side_name}.value"] = axes_info
def _flatten_past_key_values_(self, flattened_output, name, idx, t):
flattened_output[f"{name}.{idx}.decoder.key"] = t[0]
flattened_output[f"{name}.{idx}.decoder.value"] = t[1]
flattened_output[f"{name}.{idx}.encoder.key"] = t[2]
flattened_output[f"{name}.{idx}.encoder.value"] = t[3]
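# Minimal sketch (assumption, not part of the original file) of how a model-specific
# config plugs into the base class, mirroring the pattern used by encoder-only models:
#
#   class MyModelOnnxConfig(OnnxConfig):
#       @property
#       def inputs(self) -> Mapping[str, Mapping[int, str]]:
#           return OrderedDict(
#               [
#                   ("input_ids", {0: "batch", 1: "sequence"}),
#                   ("attention_mask", {0: "batch", 1: "sequence"}),
#               ]
#           )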

View File

@ -0,0 +1,109 @@
# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from ctypes import c_float, sizeof
from enum import Enum
from typing import TYPE_CHECKING, Optional, Union
if TYPE_CHECKING:
from .. import AutoFeatureExtractor, AutoProcessor, AutoTokenizer # tests_ignore
class ParameterFormat(Enum):
Float = c_float
@property
def size(self) -> int:
"""
Number of bytes required for this data type
Returns:
Integer > 0
"""
return sizeof(self.value)
def compute_effective_axis_dimension(dimension: int, fixed_dimension: int, num_token_to_add: int = 0) -> int:
"""
Args:
dimension:
fixed_dimension:
num_token_to_add:
Returns:
"""
# < 0 is possible if using a dynamic axis
if dimension <= 0:
dimension = fixed_dimension
dimension -= num_token_to_add
return dimension
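# Worked example (illustrative): exporting with a dynamic sequence axis (-1) while the
# tokenizer adds 3 special tokens falls back to the fixed dimension and subtracts them:
#   compute_effective_axis_dimension(-1, fixed_dimension=8, num_token_to_add=3)  # -> 5
# so 5 dummy tokens plus 3 special tokens fill exactly 8 positions.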
def compute_serialized_parameters_size(num_parameters: int, dtype: ParameterFormat) -> int:
"""
Compute the size taken by all the parameters for the given storage format when serializing the model
Args:
num_parameters: Number of parameters to be saved
dtype: The data format in which each parameter will be saved
Returns:
Size (in bytes) taken to save all the parameters
"""
return num_parameters * dtype.size
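# For example (illustrative): ParameterFormat.Float.size is sizeof(c_float) == 4, so a
# hypothetical 110M-parameter model serializes to roughly
#   compute_serialized_parameters_size(110_000_000, ParameterFormat.Float)  # -> 440_000_000 bytes (~440 MB)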
def get_preprocessor(model_name: str) -> Optional[Union["AutoTokenizer", "AutoFeatureExtractor", "AutoProcessor"]]:
"""
Gets a preprocessor (tokenizer, feature extractor or processor) that is available for `model_name`.
Args:
model_name (`str`): Name of the model for which a preprocessor is loaded.
Returns:
`Optional[Union[AutoTokenizer, AutoFeatureExtractor, AutoProcessor]]`:
If a processor is found, it is returned. Otherwise, if a tokenizer or a feature extractor exists, it is
returned. If both a tokenizer and a feature extractor exist, an error is raised. The function returns
`None` if no preprocessor is found.
"""
# Avoid circular imports by only importing this here.
from .. import AutoFeatureExtractor, AutoProcessor, AutoTokenizer # tests_ignore
try:
return AutoProcessor.from_pretrained(model_name)
except (ValueError, OSError, KeyError):
tokenizer, feature_extractor = None, None
try:
tokenizer = AutoTokenizer.from_pretrained(model_name)
except (OSError, KeyError):
pass
try:
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
except (OSError, KeyError):
pass
if tokenizer is not None and feature_extractor is not None:
raise ValueError(
f"Couldn't auto-detect preprocessor for {model_name}. Found both a tokenizer and a feature extractor."
)
elif tokenizer is None and feature_extractor is None:
return None
elif tokenizer is not None:
return tokenizer
else:
return feature_extractor
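# Usage sketch (illustrative; the checkpoint name is only an example):
#   preprocessor = get_preprocessor("distilbert-base-uncased")
#   # typically a tokenizer for a text-only checkpoint; None if the checkpoint
#   # ships neither a tokenizer nor a feature extractor.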

View File

@ -209,6 +209,7 @@ if is_peft_available():
if is_accelerate_available():
from accelerate import Accelerator, skip_first_batches
from accelerate import __version__ as accelerate_version
from accelerate.state import AcceleratorState
from accelerate.utils import (
DataLoaderConfiguration,
@ -4966,12 +4967,7 @@ class Trainer:
# this would have been updated above, no need for it anymore
accelerator_config.pop("gradient_accumulation_kwargs")
args = {
"mixed_precision": self.args.mixed_precision,
"dataloader_config": dataloader_config,
"fsdp_plugin": self.args.fsdp_plugin,
"deepspeed_plugin": self.args.deepspeed_plugin,
}
args = {"deepspeed_plugin": self.args.deepspeed_plugin, "dataloader_config": dataloader_config}
# We defer compatibility checks to accelerator
if self.args.parallelism_config is not None:
@ -4985,7 +4981,7 @@ class Trainer:
if getattr(self.model, "tp_size", None) is not None and self.model.tp_size > 1:
self.is_tp_enabled = True
if self.args.parallelism_config is not None:
if is_accelerate_available("1.10.1"):
if version.parse(accelerate_version) > version.parse("1.10.1"):
if self.args.parallelism_config is not None:
from accelerate import ParallelismConfig
@ -4993,15 +4989,6 @@ class Trainer:
else:
raise ValueError("Requires accelerate>1.10.1 to use Tensor Parallelism.")
if is_accelerate_available("1.2.0"):
# if we don't have the correct version, we will instead rely on the env vars that were set in TrainingArguments
from accelerate.utils import TorchDynamoPlugin
dynamo_plugin = TorchDynamoPlugin(
backend=self.args.torch_compile_backend, mode=self.args.torch_compile_mode
)
args["dynamo_plugin"] = dynamo_plugin
# create accelerator object
self.accelerator = Accelerator(**args)
# some Trainer classes need to use `gather` instead of `gather_for_metrics`, thus we store a flag

View File

@ -462,8 +462,6 @@ class TrainingArguments:
fsdp json config file (e.g., `fsdp_config.json`) or an already loaded json file as `dict`.
A List of config and its options:
- fsdp_version (`int`, *optional*, defaults to `1`):
The version of FSDP to use. Defaults to 1.
- min_num_params (`int`, *optional*, defaults to `0`):
FSDP's minimum number of parameters for Default Auto Wrapping. (useful only when `fsdp` field is
passed).
@ -1536,28 +1534,6 @@ class TrainingArguments:
self.optim = OptimizerNames(self.optim)
# We need to get from the env as we are setting deepspeed mixed precison afterwards
self.mixed_precision = os.environ.get("ACCELERATE_MIXED_PRECISION", "no")
if self.fp16:
self.mixed_precision = "fp16"
elif self.bf16:
self.mixed_precision = "bf16"
if (self.torch_compile_mode is not None or self.torch_compile_backend is not None) and not self.torch_compile:
self.torch_compile = True
if self.torch_compile and self.torch_compile_backend is None:
if not self.use_cpu and is_torch_hpu_available():
self.torch_compile_backend = "hpu_backend"
else:
self.torch_compile_backend = "inductor"
if self.torch_compile:
# TODO: remove this once we've bumped the minimum accelerate version
if not is_accelerate_available("1.2.0"):
os.environ["ACCELERATE_DYNAMO_BACKEND"] = self.torch_compile_backend
if self.torch_compile_mode is not None:
os.environ["ACCELERATE_DYNAMO_MODE"] = self.torch_compile_mode
# We need to setup the accelerator config here *before* the first call to `self.device`
if is_accelerate_available():
if not isinstance(self.accelerator_config, AcceleratorConfig):
@ -1584,6 +1560,22 @@ class TrainingArguments:
if is_torch_available():
self.device
if (self.torch_compile_mode is not None or self.torch_compile_backend is not None) and not self.torch_compile:
self.torch_compile = True
if self.torch_compile and self.torch_compile_backend is None:
if not self.use_cpu and is_torch_hpu_available():
self.torch_compile_backend = "hpu_backend"
else:
self.torch_compile_backend = "inductor"
# accelerate integration for torch compile
if self.torch_compile:
# set env vars for accelerate
prefix = "ACCELERATE_DYNAMO_"
os.environ[prefix + "BACKEND"] = self.torch_compile_backend
if self.torch_compile_mode is not None:
os.environ[prefix + "MODE"] = self.torch_compile_mode
if is_torch_available() and self.torch_compile:
if is_torch_tf32_available():
if self.tf32 is None and not self.fp16 or self.bf16:
@ -1620,6 +1612,14 @@ class TrainingArguments:
torch.backends.cudnn.allow_tf32 = False
# no need to assert on else
# if training args is specified, it will override the one specified in the accelerate config
mixed_precision_dtype = os.environ.get("ACCELERATE_MIXED_PRECISION", "no")
if self.fp16:
mixed_precision_dtype = "fp16"
elif self.bf16:
mixed_precision_dtype = "bf16"
os.environ["ACCELERATE_MIXED_PRECISION"] = mixed_precision_dtype
if self.report_to == "all" or self.report_to == ["all"]:
# Import at runtime to avoid a circular import.
from .integrations import get_available_reporting_integrations
@ -1648,19 +1648,130 @@ class TrainingArguments:
if not isinstance(self.warmup_steps, int) or self.warmup_steps < 0:
raise ValueError("warmup_steps must be of type int and must be 0 or a positive integer.")
if self.fsdp is None:
self.fsdp = []
elif self.fsdp is True:
self.fsdp = [FSDPOption.FULL_SHARD]
elif isinstance(self.fsdp, str):
self.fsdp = [FSDPOption(s) for s in self.fsdp.split()]
if self.fsdp == [FSDPOption.OFFLOAD]:
raise ValueError(
"`--fsdp offload` can't work on its own. It needs to be added to `--fsdp full_shard` or "
'`--fsdp shard_grad_op`. For example, `--fsdp "full_shard offload"`.'
)
elif FSDPOption.FULL_SHARD in self.fsdp and FSDPOption.SHARD_GRAD_OP in self.fsdp:
raise ValueError("`--fsdp full_shard` is not compatible with `--fsdp shard_grad_op`.")
if self.gradient_checkpointing and (
FSDPOption.FULL_SHARD in self.fsdp or FSDPOption.HYBRID_SHARD in self.fsdp
):
logger.warning(
"When using FSDP full shard, instead of using `gradient_checkpointing` in TrainingArguments, please"
" use `activation_checkpointing` in `fsdp_config`. The former introduces a redundant AllGather"
" operation in backward pass. Reference: https://github.com/huggingface/transformers/issues/30404"
)
if self.fsdp_config is None:
self.fsdp_config = {}
if isinstance(self.fsdp_config, str):
if len(self.fsdp) == 0:
warnings.warn("`--fsdp_config` is useful only when `--fsdp` is specified.")
with open(self.fsdp_config, encoding="utf-8") as f:
self.fsdp_config = json.load(f)
if self.fsdp_config is not None and isinstance(self.fsdp_config, dict):
for k in list(self.fsdp_config.keys()):
if k.startswith("fsdp_"):
v = self.fsdp_config.pop(k)
self.fsdp_config[k[5:]] = v
self.fsdp_config["min_num_params"] = self.fsdp_config.get("min_num_params", 0)
# if fsdp_config["transformer_layer_cls_to_wrap"] is specified as a string, convert it to a list with a single object
if isinstance(self.fsdp_config.get("transformer_layer_cls_to_wrap", None), str):
self.fsdp_config["transformer_layer_cls_to_wrap"] = [self.fsdp_config["transformer_layer_cls_to_wrap"]]
if len(self.fsdp) == 0 and self.fsdp_config["min_num_params"] > 0:
warnings.warn("`min_num_params` is useful only when `--fsdp` is specified.")
if len(self.fsdp) == 0 and self.fsdp_config.get("transformer_layer_cls_to_wrap", None) is not None:
warnings.warn("`transformer_layer_cls_to_wrap` is useful only when `--fsdp` is specified.")
if (
len(self.fsdp) > 0
and self.fsdp_config["min_num_params"] > 0
and self.fsdp_config.get("transformer_layer_cls_to_wrap", None) is not None
):
raise ValueError("`min_num_params` and `transformer_layer_cls_to_wrap` are mutually exclusive.")
self.fsdp_config["xla"] = self.fsdp_config.get("xla", False)
self.fsdp_config["xla_fsdp_v2"] = self.fsdp_config.get("xla_fsdp_v2", False)
self.fsdp_config["xla_fsdp_grad_ckpt"] = self.fsdp_config.get("xla_fsdp_grad_ckpt", False)
if self.fsdp_config["xla"]:
if len(self.fsdp) > 0:
# store XLA fsdp configuration parameters into a dictionary
# Copy the config to avoid modifying the original config (which may be used for JSON serialization)
self.xla_fsdp_config = self.fsdp_config.get("xla_fsdp_settings", {}).copy()
# apply appropriate string to torch.dtype conversions for parameters
if "compute_dtype" in self.xla_fsdp_config:
self.xla_fsdp_config["compute_dtype"] = getattr(torch, self.xla_fsdp_config["compute_dtype"])
if "buffer_dtype" in self.xla_fsdp_config:
self.xla_fsdp_config["buffer_dtype"] = getattr(torch, self.xla_fsdp_config["buffer_dtype"])
else:
warnings.warn("XLA FSDP can be used only when `--fsdp` is specified.")
else:
if self.fsdp_config["xla_fsdp_grad_ckpt"]:
warnings.warn("`--xla_fsdp_grad_ckpt` is useful only when `--xla` is set to true.")
# accelerate integration for FSDP
if len(self.fsdp) > 0 and not self.fsdp_config["xla"]:
os.environ["ACCELERATE_USE_FSDP"] = "true"
from accelerate.utils.constants import (
FSDP_AUTO_WRAP_POLICY,
FSDP_SHARDING_STRATEGY,
)
prefix = "FSDP_"
for fsdp_option in self.fsdp:
if fsdp_option.upper() in FSDP_SHARDING_STRATEGY:
# set environment variable for FSDP sharding strategy
os.environ[f"{prefix}SHARDING_STRATEGY"] = str(
FSDP_SHARDING_STRATEGY.index(fsdp_option.upper()) + 1
)
elif fsdp_option == FSDPOption.OFFLOAD:
os.environ[f"{prefix}OFFLOAD_PARAMS"] = "true"
elif fsdp_option == FSDPOption.AUTO_WRAP:
os.environ[f"{prefix}AUTO_WRAP_POLICY"] = FSDP_AUTO_WRAP_POLICY[0]
if self.fsdp_config["min_num_params"] > 0:
os.environ[f"{prefix}MIN_NUM_PARAMS"] = str(self.fsdp_config["min_num_params"])
os.environ[f"{prefix}AUTO_WRAP_POLICY"] = FSDP_AUTO_WRAP_POLICY[1]
elif self.fsdp_config.get("transformer_layer_cls_to_wrap", None) is not None:
os.environ[f"{prefix}TRANSFORMER_CLS_TO_WRAP"] = ",".join(
self.fsdp_config["transformer_layer_cls_to_wrap"]
)
prefetch_policy = self.fsdp_config.get("backward_prefetch", "NO_PREFETCH")
os.environ[f"{prefix}BACKWARD_PREFETCH"] = prefetch_policy.upper()
os.environ[f"{prefix}FORWARD_PREFETCH"] = str(self.fsdp_config.get("forward_prefetch", "false")).lower()
sync_module_states = str(self.fsdp_config.get("sync_module_states", "true")).lower()
cpu_ram_efficient_loading = str(self.fsdp_config.get("cpu_ram_efficient_loading", "false")).lower()
if sync_module_states == "false" and cpu_ram_efficient_loading == "true":
# In this case, all the processes except the main process would have random weights leading
# to unexpected behaviour during training, thus throwing error here to prevent it.
raise ValueError('`sync_module_states` must be `"True"` if `cpu_ram_efficient_loading` is `"True"`')
os.environ[f"{prefix}SYNC_MODULE_STATES"] = sync_module_states
os.environ[f"{prefix}CPU_RAM_EFFICIENT_LOADING"] = cpu_ram_efficient_loading
os.environ[f"{prefix}USE_ORIG_PARAMS"] = str(self.fsdp_config.get("use_orig_params", "true")).lower()
if isinstance(self.debug, str):
self.debug = [DebugOption(s) for s in self.debug.split()]
elif self.debug is None:
self.debug = []
self.fsdp_plugin = None
fsdp_plugin_args = self._process_fsdp_args()
if fsdp_plugin_args is not None:
# Accelerate FSDP Plugin
from accelerate.utils import FullyShardedDataParallelPlugin
self.fsdp_plugin = FullyShardedDataParallelPlugin(**fsdp_plugin_args)
self.deepspeed_plugin = None
if self.deepspeed:
# - must be run very last in arg parsing, since it will use a lot of these settings.
@ -1679,13 +1790,15 @@ class TrainingArguments:
# Accelerate DeepSpeed Plugin
from accelerate.utils import DeepSpeedPlugin
os.environ["ACCELERATE_USE_DEEPSPEED"] = "true"
self.deepspeed_plugin = DeepSpeedPlugin(hf_ds_config=self.hf_deepspeed_config)
elif strtobool(os.environ.get("ACCELERATE_USE_DEEPSPEED", "false")):
# Accelerate DeepSpeed Plugin
from accelerate.utils import DeepSpeedPlugin
self.deepspeed_plugin = DeepSpeedPlugin()
self.deepspeed_plugin.set_mixed_precision(self.mixed_precision)
mixed_precision = os.environ.get("ACCELERATE_MIXED_PRECISION", "no")
self.deepspeed_plugin.set_mixed_precision(mixed_precision)
self.deepspeed_plugin.set_deepspeed_weakref()
if self.use_cpu:
@ -1766,6 +1879,8 @@ class TrainingArguments:
else:
AcceleratorState._reset_state(reset_partial_state=True)
self.distributed_state = None
if "ACCELERATE_USE_IPEX" not in os.environ:
os.environ["ACCELERATE_USE_IPEX"] = "false"
self._n_gpu = 1
if self.use_cpu or strtobool(os.environ.get("ACCELERATE_USE_CPU", "False")):
@ -2647,127 +2762,6 @@ class TrainingArguments:
self.data_seed = sampler_seed
return self
def _process_fsdp_args(self):
if self.fsdp is None:
self.fsdp = []
elif self.fsdp is True:
self.fsdp = [FSDPOption.FULL_SHARD]
elif isinstance(self.fsdp, str):
self.fsdp = [FSDPOption(s) for s in self.fsdp.split()]
if self.fsdp == [FSDPOption.OFFLOAD]:
raise ValueError(
"`--fsdp offload` can't work on its own. It needs to be added to `--fsdp full_shard` or "
'`--fsdp shard_grad_op`. For example, `--fsdp "full_shard offload"`.'
)
elif FSDPOption.FULL_SHARD in self.fsdp and FSDPOption.SHARD_GRAD_OP in self.fsdp:
raise ValueError("`--fsdp full_shard` is not compatible with `--fsdp shard_grad_op`.")
if self.gradient_checkpointing and (
FSDPOption.FULL_SHARD in self.fsdp or FSDPOption.HYBRID_SHARD in self.fsdp
):
logger.warning(
"When using FSDP full shard, instead of using `gradient_checkpointing` in TrainingArguments, please"
" use `activation_checkpointing` in `fsdp_config`. The former introduces a redundant AllGather"
" operation in backward pass. Reference: https://github.com/huggingface/transformers/issues/30404"
)
if self.fsdp_config is None:
self.fsdp_config = {}
if isinstance(self.fsdp_config, str):
if len(self.fsdp) == 0:
warnings.warn("`--fsdp_config` is useful only when `--fsdp` is specified.")
with open(self.fsdp_config, encoding="utf-8") as f:
self.fsdp_config = json.load(f)
if self.fsdp_config is not None and isinstance(self.fsdp_config, dict):
for k in list(self.fsdp_config.keys()):
if k.startswith("fsdp_"):
v = self.fsdp_config.pop(k)
self.fsdp_config[k[5:]] = v
self.fsdp_config["min_num_params"] = self.fsdp_config.get("min_num_params", 0)
# if fsdp_config["transformer_layer_cls_to_wrap"] is specified as a string, convert it to a list with a single object
if isinstance(self.fsdp_config.get("transformer_layer_cls_to_wrap", None), str):
self.fsdp_config["transformer_layer_cls_to_wrap"] = [self.fsdp_config["transformer_layer_cls_to_wrap"]]
if len(self.fsdp) == 0 and self.fsdp_config["min_num_params"] > 0:
warnings.warn("`min_num_params` is useful only when `--fsdp` is specified.")
if len(self.fsdp) == 0 and self.fsdp_config.get("transformer_layer_cls_to_wrap", None) is not None:
warnings.warn("`transformer_layer_cls_to_wrap` is useful only when `--fsdp` is specified.")
if (
len(self.fsdp) > 0
and self.fsdp_config["min_num_params"] > 0
and self.fsdp_config.get("transformer_layer_cls_to_wrap", None) is not None
):
raise ValueError("`min_num_params` and `transformer_layer_cls_to_wrap` are mutually exclusive.")
self.fsdp_config["xla"] = self.fsdp_config.get("xla", False)
self.fsdp_config["xla_fsdp_v2"] = self.fsdp_config.get("xla_fsdp_v2", False)
self.fsdp_config["xla_fsdp_grad_ckpt"] = self.fsdp_config.get("xla_fsdp_grad_ckpt", False)
if self.fsdp_config["xla"]:
if len(self.fsdp) > 0:
# store XLA fsdp configuration parameters into a dictionary
# Copy the config to avoid modifying the original config (which may be used for JSON serialization)
self.xla_fsdp_config = self.fsdp_config.get("xla_fsdp_settings", {}).copy()
# apply appropriate string to torch.dtype conversions for parameters
if "compute_dtype" in self.xla_fsdp_config:
self.xla_fsdp_config["compute_dtype"] = getattr(torch, self.xla_fsdp_config["compute_dtype"])
if "buffer_dtype" in self.xla_fsdp_config:
self.xla_fsdp_config["buffer_dtype"] = getattr(torch, self.xla_fsdp_config["buffer_dtype"])
else:
warnings.warn("XLA FSDP can be used only when `--fsdp` is specified.")
else:
if self.fsdp_config["xla_fsdp_grad_ckpt"]:
warnings.warn("`--xla_fsdp_grad_ckpt` is useful only when `--xla` is set to true.")
# accelerate integration for FSDP
fsdp_plugin_args = None
if len(self.fsdp) > 0 and not self.fsdp_config["xla"]:
from accelerate.utils.constants import (
FSDP_AUTO_WRAP_POLICY,
FSDP_SHARDING_STRATEGY,
)
fsdp_plugin_args = {}
for fsdp_option in self.fsdp:
if fsdp_option.upper() in FSDP_SHARDING_STRATEGY:
fsdp_plugin_args["sharding_strategy"] = fsdp_option
elif fsdp_option == FSDPOption.OFFLOAD:
fsdp_plugin_args["cpu_offload"] = True
elif fsdp_option == FSDPOption.AUTO_WRAP:
fsdp_plugin_args["auto_wrap_policy"] = FSDP_AUTO_WRAP_POLICY[0]
if self.fsdp_config["min_num_params"] > 0:
fsdp_plugin_args["min_num_params"] = self.fsdp_config["min_num_params"]
fsdp_plugin_args["auto_wrap_policy"] = FSDP_AUTO_WRAP_POLICY[1]
elif self.fsdp_config.get("transformer_layer_cls_to_wrap", None) is not None:
fsdp_plugin_args["transformer_cls_names_to_wrap"] = ",".join(
self.fsdp_config["transformer_layer_cls_to_wrap"]
)
fsdp_plugin_args["fsdp_version"] = self.fsdp_config.get("fsdp_version", 1)
prefetch_policy = self.fsdp_config.get("backward_prefetch", "NO_PREFETCH")
fsdp_plugin_args["backward_prefetch"] = prefetch_policy.upper()
fsdp_plugin_args["forward_prefetch"] = str(self.fsdp_config.get("forward_prefetch", "false")).lower()
sync_module_states = str(self.fsdp_config.get("sync_module_states", "true")).lower()
cpu_ram_efficient_loading = str(self.fsdp_config.get("cpu_ram_efficient_loading", "false")).lower()
if sync_module_states == "false" and cpu_ram_efficient_loading == "true":
# In this case, all the processes except the main process would have random weights leading
# to unexpected behaviour during training, thus throwing error here to prevent it.
raise ValueError('`sync_module_states` must be `"True"` if `cpu_ram_efficient_loading` is `"True"`')
# we need to set the env here as otherwise we get a warning in accelerate + we need to set it for transformers
fsdp_plugin_args["cpu_ram_efficient_loading"] = cpu_ram_efficient_loading
os.environ["FSDP_CPU_RAM_EFFICIENT_LOADING"] = cpu_ram_efficient_loading
fsdp_plugin_args["sync_module_states"] = sync_module_states
fsdp_plugin_args["use_orig_params"] = str(self.fsdp_config.get("use_orig_params", "true")).lower()
return fsdp_plugin_args
class ParallelMode(Enum):
NOT_PARALLEL = "not_parallel"

View File

@ -2161,7 +2161,7 @@ def create_import_structure_from_path(module_path):
{
'albert': {
frozenset(): {
'configuration_albert': {'AlbertConfig'}
'configuration_albert': {'AlbertConfig', 'AlbertOnnxConfig'}
},
frozenset({'tokenizers'}): {
'tokenization_albert_fast': {'AlbertTokenizerFast'}
@ -2337,7 +2337,7 @@ def spread_import_structure(nested_import_structure):
{
'albert': {
frozenset(): {
'configuration_albert': {'AlbertConfig'}
'configuration_albert': {'AlbertConfig', 'AlbertOnnxConfig'}
},
frozenset({'tokenizers'}): {
'tokenization_albert_fast': {'AlbertTokenizerFast'}
@ -2364,7 +2364,7 @@ def spread_import_structure(nested_import_structure):
'albert.tokenization_albert_fast': {'AlbertTokenizerFast'}
},
frozenset(): {
'albert.configuration_albert': {'AlbertConfig'},
'albert.configuration_albert': {'AlbertConfig', 'AlbertOnnxConfig'},
'align.processing_align': {'AlignProcessor'},
'align.configuration_align': {'AlignConfig', 'AlignTextConfig', 'AlignVisionConfig'},
'altclip.configuration_altclip': {'AltCLIPConfig', 'AltCLIPTextConfig', 'AltCLIPVisionConfig'},
@ -2465,7 +2465,7 @@ def define_import_structure(module_path: str, prefix: str | None = None) -> IMPO
'albert.tokenization_albert_fast': {'AlbertTokenizerFast'}
},
frozenset(): {
'albert.configuration_albert': {'AlbertConfig'},
'albert.configuration_albert': {'AlbertConfig', 'AlbertOnnxConfig'},
'align.processing_align': {'AlignProcessor'},
'align.configuration_align': {'AlignConfig', 'AlignTextConfig', 'AlignVisionConfig'},
'altclip.configuration_altclip': {'AltCLIPConfig', 'AltCLIPTextConfig', 'AltCLIPVisionConfig'},

View File

@ -0,0 +1,488 @@
import re
from typing import Optional, Union
import matplotlib.pyplot as plt
import numpy as np
import requests
from PIL import Image
from ..image_transforms import convert_to_rgb, to_pil_image, unnormalize
from ..image_utils import ChannelDimension, infer_channel_dimension_format
from ..models.auto import AutoConfig, AutoProcessor
# archs failing that should raise immediately for this util:
INCOMPATIBLE_MODELS = [
"bit",
"colpali",
"colqwen2",
"convnext",
"d_fine",
"data2vec",
"efficientloftr",
"efficientnet",
"fuyu",
"gemma3",
"glm4v",
"glpn",
"hgnet_v2",
"hiera",
"internvl",
"janus",
"layoutlmv3",
"levit",
"lightglue",
"llama4",
"mistral3",
"mllama",
"mobilevit",
"mobilevitv2",
"musicgen",
"musicgen_melody",
"oneformer",
"perceiver",
"perception_lm",
"phi4_multimodal",
"regnet",
"resnet",
"superglue",
"superpoint",
"swin2sr",
"timm_wrapper",
"tvp",
"udop",
"vitmatte",
"vitpose",
"vjepa2",
"whisper",
]
DEFAULT_IMAGE_URL = (
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/hf-logo-224x224.png"
)
def _looks_like_global(tile: np.ndarray, base: Image.Image, *, mae_tol: float = 0.05) -> bool:
"""
Heuristic to check if a tile is a downscaled version of the original image.
Uses mean absolute error with a strict threshold.
"""
base_r = base.convert("RGB").resize(tile.shape[:2][::-1], Image.BILINEAR)
base_np = np.asarray(base_r).astype(np.float32) / 255.0
tile_f32 = tile.astype(np.float32)
if tile_f32.max() > 1.5:
tile_f32 /= 255.0
mae = np.abs(tile_f32 - base_np).mean()
return mae < mae_tol
def _find_global_tile_index(tiles: np.ndarray, base: Image.Image) -> Optional[int]:
"""
Find which tile (if any) is the global/downscaled image.
Checks first and last tiles only, as models place global images at these positions.
Returns:
Index of global tile (0 or len-1), or None if not found
"""
if tiles.shape[0] <= 1:
return None
if _looks_like_global(tiles[0], base):
return 0
if _looks_like_global(tiles[-1], base):
return tiles.shape[0] - 1
return None
class ImageVisualizer:
def __init__(self, repo_id: str):
self.processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=False)
self.config = AutoConfig.from_pretrained(repo_id, trust_remote_code=False)
if hasattr(self.processor, "image_processor"):
image_processor = self.processor.image_processor
elif hasattr(self.processor, "image_mean"):
image_processor = self.processor # weak test, but works most of the time
else:
raise ValueError(f"No image processor found for {repo_id}.")
self.channel_means = getattr(image_processor, "image_mean", [0.485, 0.456, 0.406])
self.channel_stds = getattr(image_processor, "image_std", [0.229, 0.224, 0.225])
if hasattr(self.processor, "image_token"):
self.image_token_marker = self.processor.image_token
elif hasattr(self.processor, "image_token_id"):
self.image_token_marker = self.processor.decode(self.processor.image_token_id)
else:
self.image_token_marker = "<image>"
self.default_prompt = f"{self.image_token_marker} How does it look?"
self.vision_config = getattr(self.config, "vision_config", None)
self.patch_size = getattr(self.vision_config, "patch_size", getattr(image_processor, "patch_size", 14))
self.merge_size = getattr(image_processor, "merge_size", 1)
def _prepare_images_for_display(self, image_array: np.ndarray) -> np.ndarray:
"""
Convert unnormalized images to NHWC format for display, flattening any extra batch dimensions.
Args:
image_array: Array of shape [..., C, H, W] or [..., H, W, C]
Returns:
Array of shape [N, H, W, C] suitable for plotting
"""
input_format = infer_channel_dimension_format(image_array)
if input_format == ChannelDimension.FIRST:
if image_array.ndim == 3:
image_array = image_array[np.newaxis, ...]
elif image_array.ndim > 4:
batch_size = int(np.prod(image_array.shape[: image_array.ndim - 3]))
num_channels, height, width = image_array.shape[-3:]
image_array = image_array.reshape(batch_size, num_channels, height, width)
if image_array.ndim == 4:
image_array = np.transpose(image_array, (0, 2, 3, 1))
else:
if image_array.ndim == 3:
image_array = image_array[np.newaxis, ...]
elif image_array.ndim > 4:
batch_size = int(np.prod(image_array.shape[: image_array.ndim - 3]))
height, width, num_channels = image_array.shape[-3:]
image_array = image_array.reshape(batch_size, height, width, num_channels)
return image_array
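# Shape walk-through (illustrative): channels-first tiles shaped
# (batch, tiles, 3, H, W) = (1, 4, 3, 336, 336) are flattened to (4, 3, 336, 336)
# and then transposed to NHWC (4, 336, 336, 3) for matplotlib.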
def _display_single_image(
self,
image_array: np.ndarray,
show_patch_grid: bool,
figsize=(7, 7),
patch_grid_rows=None,
patch_grid_cols=None,
):
plt.figure(figsize=figsize)
plt.imshow(image_array)
plt.xticks([])
plt.yticks([])
if show_patch_grid:
height, width = image_array.shape[:2]
if patch_grid_rows is not None and patch_grid_cols is not None:
step_h = height / patch_grid_rows
step_w = width / patch_grid_cols
for i in range(1, patch_grid_cols):
plt.axvline(i * step_w, color="red", linewidth=0.5)
for i in range(1, patch_grid_rows):
plt.axhline(i * step_h, color="red", linewidth=0.5)
else:
step = max(1, min(height, width) // self.patch_size)
for x_pos in range(0, width, step):
plt.axvline(x_pos, color="red", linewidth=0.5)
for y_pos in range(0, height, step):
plt.axhline(y_pos, color="red", linewidth=0.5)
caption = f"{width}×{height} | mean={', '.join(f'{m:.3f}' for m in self.channel_means)} | std={', '.join(f'{s:.3f}' for s in self.channel_stds)}"
plt.tight_layout()
plt.figtext(0.5, -0.02, caption, ha="center", va="top", fontsize=12)
plt.show()
def _display_tiled_images(
self,
tiles_array: np.ndarray,
source_image: Image.Image,
rows: Optional[int] = None,
cols: Optional[int] = None,
aspect_ratio: float = 1.0,
add_grid: bool = True,
figsize=(7, 7),
global_tile: Optional[np.ndarray] = None,
):
"""
Display a grid of image tiles with optional global image display.
Args:
tiles_array: Array of tiles to display in grid format
source_image: Original source image
rows: Number of rows in the grid
cols: Number of columns in the grid
aspect_ratio: Aspect ratio for grid layout calculation
add_grid: Whether to add patch grid overlay
figsize: Figure size for matplotlib
global_tile: Optional global/downscaled image to display separately
"""
num_tiles = tiles_array.shape[0]
# Infer grid if not specified
grid_rows, grid_cols = rows, cols
if grid_rows is None or grid_cols is None:
if aspect_ratio >= 1:
guessed_cols = int(np.ceil(np.sqrt(num_tiles * aspect_ratio)))
guessed_rows = int(np.ceil(num_tiles / max(guessed_cols, 1)))
else:
guessed_rows = int(np.ceil(np.sqrt(num_tiles / max(aspect_ratio, 1e-8))))
guessed_cols = int(np.ceil(num_tiles / max(guessed_rows, 1)))
grid_rows = grid_rows if grid_rows is not None else guessed_rows
grid_cols = grid_cols if grid_cols is not None else guessed_cols
fig, axes = plt.subplots(grid_rows, grid_cols, figsize=figsize, squeeze=False)
tile_index = 0
for row_idx in range(grid_rows):
for col_idx in range(grid_cols):
ax = axes[row_idx, col_idx]
if tile_index < num_tiles:
tile_image = tiles_array[tile_index]
ax.imshow(tile_image)
ax.set_xticks([])
ax.set_yticks([])
if add_grid:
height, width = tile_image.shape[:2]
step = max(1, min(height, width) // self.patch_size)
for x_pos in range(0, width, step):
ax.axvline(x_pos, color="red", linewidth=0.5)
for y_pos in range(0, height, step):
ax.axhline(y_pos, color="red", linewidth=0.5)
else:
ax.axis("off")
tile_index += 1
unique = sorted({f"{t.shape[1]}×{t.shape[0]}" for t in tiles_array})
sizes = ", ".join(unique)
caption = f"{tiles_array.shape[0]} patches | {sizes} | mean={', '.join(f'{m:.3f}' for m in self.channel_means)} | std={', '.join(f'{s:.3f}' for s in self.channel_stds)}"
plt.tight_layout()
fig.text(0.5, 0.02, caption, ha="center", va="bottom", fontsize=12)
plt.show()
if global_tile is not None:
fig2, ax2 = plt.subplots(figsize=figsize)
ax2.imshow(global_tile)
ax2.set_xticks([])
ax2.set_yticks([])
ax2.set_aspect("equal", adjustable="box")
fig2.subplots_adjust(left=0, right=1, top=1, bottom=0)
h0, w0 = global_tile.shape[:2]
caption = f"Global: {w0}×{h0} | mean={', '.join(f'{m:.3f}' for m in self.channel_means)} | std={', '.join(f'{s:.3f}' for s in self.channel_stds)}"
fig2.text(0.5, 0.02, caption, ha="center", va="bottom", fontsize=12)
plt.show()
def default_message(self, full_output: bool = False) -> str:
"""
Build a single formatted prompt string using the processor's chat template.
Contains one image (HF logo) and one user text message.
If available, adds the generation prompt as well.
Falls back to a minimal '<image>' string if no template is available.
"""
# ensure this is a multimodal processor with image + tokenizer
if not (
hasattr(self.processor, "attributes")
and "image_processor" in self.processor.attributes
and "tokenizer" in self.processor.attributes
):
raise RuntimeError(
"Processor does not expose both 'image_processor' and 'tokenizer'; cannot build multimodal example."
)
conversation = [
{
"role": "user",
"content": [
{
"type": "image",
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/hf-logo-224x224.png",
},
{"type": "text", "text": "Please describe this image."},
],
}
]
try:
print("For a 224x224 RGB png image: \n")
decoded_message = self.processor.batch_decode(
self.processor.apply_chat_template(
conversation,
add_generation_prompt=True,
tokenize=True,
return_dict=False,
truncation=False,
),
skip_special_tokens=False,
)[0]
image_token_string = getattr(self.processor, "image_token", "<image>")
token_escaped = re.escape(image_token_string)
image_token_run_pattern = re.compile(rf"(?:{token_escaped})(?:\s*{token_escaped}){{2,}}")
def compress_image_token_run(match: re.Match) -> str:
n_tokens = match.group(0).count(image_token_string)
return f"{image_token_string}[...{n_tokens} tokens...]{image_token_string}"
if full_output:
return decoded_message
else:
return image_token_run_pattern.sub(compress_image_token_run, decoded_message)
except ValueError:
image_token_string = getattr(
self.processor,
"image_token",
getattr(getattr(self.processor, "tokenizer", None), "image_token", "<image>"),
)
return f"{image_token_string} {'Please describe this image.'}"
def visualize(
self,
images: Optional[Union[Image.Image, np.ndarray, str, list[Union[Image.Image, np.ndarray, str]]]] = None,
rows: Optional[int] = None,
cols: Optional[int] = None,
add_grid: bool = True,
figsize=(12, 12),
):
"""
Visualize the model-processed image(s). Only single images are supported.
If the processor returns multiple tiles, display them in a grid with optional patch grid overlay.
"""
if images is None:
images = Image.open(requests.get(DEFAULT_IMAGE_URL, stream=True).raw)
if not isinstance(images, list):
images = [images]
else:
if len(images) > 1:
raise ValueError(
"You passed a list of several images. Only single images are accepted by the visualizer."
)
pil_images = [convert_to_rgb(to_pil_image(x)) for x in images]
img_width, img_height = pil_images[0].size
aspect_ratio = img_width / max(img_height, 1)
processed_inputs = self.processor(images=pil_images, text=self.default_prompt, return_tensors="pt")
pixel_values = processed_inputs["pixel_values"]
grid_rows = None
grid_cols = None
patch_grid_rows = None
patch_grid_cols = None
if hasattr(self.processor, "image_processor") and hasattr(
self.processor.image_processor, "get_num_patches_from_image_size"
):
num_patches, grid_rows, grid_cols = self.processor.image_processor.get_num_patches_from_image_size(
img_width, img_height
)
        if pixel_values.ndim == 2 and "image_grid_thw" in processed_inputs:
            num_patches, flattened_size = pixel_values.shape
            grid_thw = processed_inputs["image_grid_thw"][0]
            temporal_frames, patch_grid_h, patch_grid_w = grid_thw.tolist()
            patch_size = getattr(self.processor.image_processor, "patch_size", 14)
            temporal_patch_size = getattr(self.processor.image_processor, "temporal_patch_size", 1)
            merge_size = getattr(self.processor.image_processor, "merge_size", 2)
            expected_size = temporal_patch_size * 3 * patch_size * patch_size
            if flattened_size == expected_size:
                pixel_values = pixel_values.reshape(num_patches, temporal_patch_size, 3, patch_size, patch_size)
                pixel_values = pixel_values[:, 0, :, :, :]
                super_grid_h = patch_grid_h // merge_size
                super_grid_w = patch_grid_w // merge_size
                pixel_values = pixel_values.reshape(
                    super_grid_h, super_grid_w, merge_size, merge_size, 3, patch_size, patch_size
                )
                pixel_values = pixel_values.permute(0, 2, 1, 3, 4, 5, 6).contiguous()
                pixel_values = pixel_values.reshape(
                    super_grid_h * merge_size, super_grid_w * merge_size, 3, patch_size, patch_size
                )
                pixel_values = pixel_values.permute(0, 3, 1, 4, 2).contiguous()
                pixel_values = pixel_values.reshape(patch_grid_h * patch_size, patch_grid_w * patch_size, 3)
                pixel_values = pixel_values.unsqueeze(0)
                patch_grid_rows = patch_grid_h
                patch_grid_cols = patch_grid_w
            else:
                raise ValueError(
                    f"Cannot reshape pixel_values: expected flattened size {expected_size} "
                    f"(temporal={temporal_patch_size} × channels=3 × patch={patch_size}×{patch_size}), "
                    f"but got {flattened_size}"
                )
        elif pixel_values.ndim == 5:
            batch_size, num_tiles, num_channels, height, width = pixel_values.shape
            pixel_values = pixel_values.view(batch_size * num_tiles, num_channels, height, width)
        elif pixel_values.ndim == 4:
            pass
        elif pixel_values.ndim == 3:
            pixel_values = pixel_values.unsqueeze(0)
        else:
            raise ValueError(f"Unexpected pixel_values shape: {pixel_values.shape}")
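        # Undo the processor's channel normalization so the tensors are viewable as ordinary images.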
        unnormalized = unnormalize(pixel_values, mean=self.channel_means, std=self.channel_stds)
        display_ready = self._prepare_images_for_display(unnormalized)
        if display_ready.shape[0] == 1:
            self._display_single_image(
                display_ready[0],
                show_patch_grid=add_grid,
                figsize=figsize,
                patch_grid_rows=patch_grid_rows,
                patch_grid_cols=patch_grid_cols,
            )
            return
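        # Some processors append a downscaled "global" view of the full image to the local
        # tiles; if present, pull it out so it is not mixed into the tile grid.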
        num_tiles = display_ready.shape[0]
        global_tile = None
        if grid_rows is not None and grid_cols is not None and grid_rows * grid_cols + 1 == num_tiles:
            global_tile = display_ready[-1]
            display_ready = display_ready[:-1]
            num_tiles = display_ready.shape[0]
            if rows is None:
                rows = grid_rows
            if cols is None:
                cols = grid_cols
        else:
            global_idx = _find_global_tile_index(display_ready, pil_images[0])
            if global_idx is not None:
                global_tile = display_ready[global_idx]
                if global_idx == 0:
                    display_ready = display_ready[1:]
                else:
                    display_ready = display_ready[:-1]
                num_tiles = display_ready.shape[0]
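        # No grid shape known: search over row counts for the (rows, cols) pair whose grid
        # aspect ratio best matches the aspect ratio of the original image.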
        if rows is None or cols is None:
            tile_h, tile_w = display_ready.shape[1:3]
            tile_aspect = tile_w / tile_h if tile_h > 0 else 1.0
            target_aspect = aspect_ratio / tile_aspect
            best_rows, best_cols = 1, num_tiles
            min_diff = float("inf")
            for r in range(1, num_tiles + 1):
                c = int(np.ceil(num_tiles / r))
                diff = abs((c / r) - target_aspect)
                if diff < min_diff:
                    min_diff = diff
                    best_rows, best_cols = r, c
            rows = best_rows
            cols = best_cols
        self._display_tiled_images(
            display_ready,
            pil_images[0],
            rows=rows,
            cols=cols,
            aspect_ratio=aspect_ratio,
            add_grid=add_grid,
            figsize=figsize,
            global_tile=global_tile,
        )

View File

@ -191,6 +191,7 @@ class TrainerIntegrationFSDP(TestCasePlus, TrainerIntegrationCommon):
         for k, v in trainer.args.fsdp_config.items():
             self.assertTrue(k in self.accelerate_fsdp_config)
             self.assertEqual(v, self.accelerate_fsdp_config[k])
+        self.assertEqual(os.environ.get("ACCELERATE_USE_FSDP", "false"), "true")

     @parameterized.expand(params, name_func=_parameterized_custom_name_func)
     def test_fsdp_config(self, sharding_strategy, dtype):
@ -211,9 +212,10 @@ class TrainerIntegrationFSDP(TestCasePlus, TrainerIntegrationCommon):
             self.assertEqual(trainer.args.fsdp[2], FSDPOption.AUTO_WRAP)
             for k, v in trainer.args.fsdp_config.items():
                 self.assertEqual(v, self.fsdp_config[k])
+            self.assertEqual(os.environ.get("ACCELERATE_USE_FSDP", "false"), "true")

     @parameterized.expand(params, name_func=_parameterized_custom_name_func)
-    def test_fsdp_plugin(self, sharding_strategy, dtype):
+    def test_fsdp_config_transformers_auto_wrap(self, sharding_strategy, dtype):
         output_dir = self.get_auto_remove_tmp_dir()
         fsdp_config = deepcopy(self.fsdp_config)
         del fsdp_config["min_num_params"]
@ -227,25 +229,27 @@ class TrainerIntegrationFSDP(TestCasePlus, TrainerIntegrationCommon):
"fsdp_config": fsdp_config,
}
kwargs[dtype] = True
prefix = "FSDP_"
with mockenv_context(**self.dist_env_1_gpu):
trainer = get_regression_trainer(**kwargs)
self.assertEqual(trainer.args.fsdp[0], sharding_strategy)
self.assertEqual(trainer.args.fsdp[1], FSDPOption.OFFLOAD)
self.assertEqual(trainer.args.fsdp[2], FSDPOption.AUTO_WRAP)
fsdp_sharding_strategy = str(FSDP_SHARDING_STRATEGY.index(sharding_strategy.upper()) + 1)
self.assertEqual(os.environ[f"{prefix}SHARDING_STRATEGY"], fsdp_sharding_strategy)
self.assertEqual(os.environ[f"{prefix}OFFLOAD_PARAMS"], "true")
self.assertEqual(os.environ[f"{prefix}AUTO_WRAP_POLICY"], "TRANSFORMER_BASED_WRAP")
self.assertEqual(
trainer.args.fsdp_plugin.sharding_strategy.value,
FSDP_SHARDING_STRATEGY.index(sharding_strategy.upper()) + 1,
os.environ[f"{prefix}TRANSFORMER_CLS_TO_WRAP"], ",".join(fsdp_config["transformer_layer_cls_to_wrap"])
)
self.assertEqual(trainer.args.fsdp_plugin.cpu_offload.offload_params, True)
self.assertEqual(os.environ[f"{prefix}BACKWARD_PREFETCH"], fsdp_config["backward_prefetch"])
self.assertEqual(os.environ[f"{prefix}FORWARD_PREFETCH"], fsdp_config["forward_prefetch"])
self.assertEqual(os.environ[f"{prefix}USE_ORIG_PARAMS"], fsdp_config["use_orig_params"])
self.assertEqual(os.environ[f"{prefix}SYNC_MODULE_STATES"], fsdp_config["sync_module_states"])
self.assertEqual(
trainer.args.fsdp_plugin.transformer_cls_names_to_wrap,
fsdp_config["transformer_layer_cls_to_wrap"],
)
self.assertEqual(trainer.args.fsdp_plugin.forward_prefetch, fsdp_config["forward_prefetch"])
self.assertEqual(trainer.args.fsdp_plugin.sync_module_states, fsdp_config["sync_module_states"])
self.assertEqual(
trainer.args.fsdp_plugin.cpu_ram_efficient_loading, fsdp_config["cpu_ram_efficient_loading"]
os.environ[f"{prefix}CPU_RAM_EFFICIENT_LOADING"], fsdp_config["cpu_ram_efficient_loading"]
)
self.assertEqual(os.environ.get("ACCELERATE_USE_FSDP", "false"), "true")
@parameterized.expand(params, name_func=_parameterized_custom_name_func)
@require_torch_multi_accelerator

View File

@ -132,7 +132,7 @@ class TestImportStructures(unittest.TestCase):
"models": {
frozenset(): {"dummy_config": {"DummyConfig"}},
"albert": {
frozenset(): {"configuration_albert": {"AlbertConfig"}},
frozenset(): {"configuration_albert": {"AlbertConfig", "AlbertOnnxConfig"}},
frozenset({"torch"}): {
"modeling_albert": {
"AlbertForMaskedLM",
@ -174,7 +174,7 @@ class TestImportStructures(unittest.TestCase):
             frozenset(): {
                 "dummy_non_model": {"DummyObject"},
                 "models.dummy_config": {"DummyConfig"},
-                "models.albert.configuration_albert": {"AlbertConfig"},
+                "models.albert.configuration_albert": {"AlbertConfig", "AlbertOnnxConfig"},
                 "models.llama.configuration_llama": {"LlamaConfig"},
                 "models.deprecated.transfo_xl.configuration_transfo_xl": {"TransfoXLConfig"},
                 "models.deprecated.transfo_xl.tokenization_transfo_xl": {"TransfoXLCorpus", "TransfoXLTokenizer"},

View File

@ -1045,6 +1045,7 @@ def ignore_undocumented(name: str) -> bool:
or name.endswith("Layer")
or name.endswith("Embeddings")
or name.endswith("Attention")
or name.endswith("OnnxConfig")
):
return True
# Submodules are not documented.

View File

@ -738,6 +738,7 @@ src/transformers/models/yoso/convert_yoso_pytorch_to_pytorch.py
 src/transformers/models/yoso/modeling_yoso.py
 src/transformers/models/zamba/configuration_zamba.py
 src/transformers/models/zamba/modeling_zamba.py
+src/transformers/onnx/config.py
 src/transformers/optimization.py
 src/transformers/pipelines/audio_classification.py
 src/transformers/pipelines/audio_utils.py