Compare commits

..

62 Commits

Author SHA1 Message Date
eebeb59a36 Fix accelerate tests command (#528) 2022-07-18 14:47:34 +02:00
be4b74f42f Release: v0.11.0 2022-07-18 08:27:58 -04:00
c93b3eb5d7 FSDP integration enhancements and fixes (#522)
* FSDP integration enhancements and fixes

* bug fixes

1. fix circular dependency
2. Add model print statement in FSDP example
3. minor fixes

* removing `always_wrap` as it is rarely useful

* removing comment

* resolving comments

* fsdp fp16 mp uses ShardedGradScaler

* fix import

* fix check

* add exception when class to wrap not found in model

* adding `FSDP_BACKWARD_PREFETCH`

* fix
2022-07-18 17:45:58 +05:30
3eea8ceee0 Warn user if no trackers are installed (#524) 2022-07-15 18:16:00 +02:00
7abc708be2 Fixup all example CI tests and properly fail (#517)
* Clean and make all tests pass
2022-07-15 18:15:45 +02:00
bb78b04cce fixing deepspeed multi-node launcher (#514)
* fixing deepspeed multi-node launcher

* minor fixes

* handling env variables for accelerate to correctly work

* resolving comments
2022-07-14 18:40:48 +05:30
7e6593756f Add special Parameters modules support (#519)
* Meta init/tensor_to_device logic for Int8 Parameters.

* add 8 bit support

* add special modules support

Co-authored-by: timdettmers <timdettmers@users.noreply.github.com>

* bad formatting

* bad formatting

* restoring the poor lines that were alone!

Co-authored-by: Tim Dettmers <tim.dettmers@gmail.com>
Co-authored-by: timdettmers <timdettmers@users.noreply.github.com>
2022-07-13 12:46:36 -04:00
960fd9d86a Don't unwrap in save_state() (#489) 2022-07-13 12:46:21 -04:00
70ca65a9a1 Fix a bug when reduce a tensor. (#513)
* return reduced result

* update doc for Accelerator.reduce

* update doc in Accelerator.reduce

* fix reduce behavior for gpu devices
2022-07-13 09:19:01 -04:00
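A minimal sketch of the fixed behavior described in the commit above (it assumes the `reduction="sum"` keyword; the metric value is illustrative): after this fix, `Accelerator.reduce` returns the reduced tensor on every process instead of returning nothing.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
# Each process holds a local value; reduce combines it across processes and,
# per this fix, the reduced tensor is returned to every process.
local_correct = torch.tensor(42.0, device=accelerator.device)
total_correct = accelerator.reduce(local_correct, reduction="sum")
print(total_correct.item())
```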
ea0d5368bd Add benchmarks (#506)
* Add benchmarks

* Oops! Forgot one file

* Update benchmarks/README.md

Co-authored-by: Zachary Mueller <muellerzr@gmail.com>

Co-authored-by: Zachary Mueller <muellerzr@gmail.com>
2022-07-12 15:16:45 -04:00
78357f44b3 Add gradient accumulation doc (#511)
* Gradient accumulation doc

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-07-12 17:36:45 +02:00
c7526e9483 Make gradient accumulation work with dispatched dataloaders (#510)
* Make grad accum work with dispatch dl

* Split print over multiple lines
2022-07-12 17:12:39 +02:00
f5ef120e77 Fix DispatchDataLoader length when split_batches=True (#509) 2022-07-12 10:35:35 -04:00
3c1f97c386 SageMaker enhancements to allow custom docker image, input channels referring to s3/remote data locations and metrics logging (#504)
* SageMaker DP and MP Support

* fix 😅

* removing SageMaker MP option

* adding support for custom image_uri, data location and metrics
2022-07-12 13:25:52 +05:30
a0514dd809 SageMaker DP Support (#494)
* SageMaker DP and MP Support

* fix 😅

* removing SageMaker MP option
2022-07-09 00:14:57 +05:30
b20f90ab17 Fix scheduler in gradient accumulation example (#500)
* Fix scheduler in gradient accumulation example

* Phrase better how the scheduler is stepped during grad accum

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-07-08 13:41:43 -04:00
cfb2a3e239 update dataloader wrappers to have total_batch_size attribute (#493)
* update dataloader wrappers to have `total_batch_size` attribute

* fix

* resolving comments

* Update src/accelerate/data_loader.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* quality

* add docstrings

* Update src/accelerate/data_loader.py

Co-authored-by: Zachary Mueller <muellerzr@gmail.com>

* docstrings iter 2 + quality + minor change in doc

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Zachary Mueller <muellerzr@gmail.com>
2022-07-08 21:16:31 +05:30
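A minimal sketch of the new attribute added in the commit above (assuming `total_batch_size` is the per-device batch size multiplied by the number of processes when `split_batches=False`):

```python
from torch.utils.data import DataLoader
from accelerate import Accelerator

accelerator = Accelerator()
dataloader = DataLoader(list(range(128)), batch_size=16)
dataloader = accelerator.prepare(dataloader)
# The prepared wrapper now exposes the effective batch size across all processes.
print(dataloader.total_batch_size)
```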
86ce737d7f Introduce automatic gradient accumulation wrapper + fix a few test issues (#484)
* Have accelerator handle gradient accumulation

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-07-05 15:49:36 -04:00
deffaba8d6 add use_distributed property (#487)
* add distributed property in accelerate_state

* ensure num_process > 1
2022-07-05 09:19:44 -04:00
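A minimal sketch of the property added above (assuming it lives on `AcceleratorState`, as the commit notes, and is `True` only when more than one process is launched):

```python
from accelerate import Accelerator
from accelerate.state import AcceleratorState

accelerator = Accelerator()
state = AcceleratorState()
# True only when num_processes > 1, independent of the specific backend.
if state.use_distributed:
    print(f"distributed run with {state.num_processes} processes")
else:
    print("single-process run")
```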
6ebddcd5e0 fixing fsdp autowrap functionality (#475)
* fixing fsdp autowrap functionality

* updating version requirements

* update version to latest torch stable version

* quality
2022-07-01 10:00:47 +05:30
4a7bc3bcb7 Use datasets 2.2.0 for now (#481) 2022-06-28 12:31:41 -04:00
1f96f3cf85 Rm gradient accumulation on TPU (#479)
* Rm gradient accumulation on TPU for now
2022-06-28 12:29:58 -04:00
bbca2700c7 Revert "Pin datasets for now (#477)" (#478)
This reverts commit a8eca60d57e8294e666b765b5331770aa0c58893.
2022-06-28 10:09:11 -04:00
a8eca60d57 Pin datasets for now (#477) 2022-06-28 09:47:39 -04:00
329209871f Some typos and cosmetic fixes (#472) 2022-06-27 05:40:07 -07:00
619ef04f09 Dev version 2022-06-24 16:41:09 -04:00
9d8ed50f7b Fix when TPU device check is ran (#469) 2022-06-24 12:07:38 -04:00
196856f357 Refactor Utility Documentation (#467)
* Add a utilities doc
2022-06-23 16:34:01 -04:00
3a5490b066 Add docbuilder to quality (#468) 2022-06-23 14:36:16 -04:00
24be733d84 Expose some is_*_available utils in docs (#466) 2022-06-23 10:34:45 -04:00
775bc790e7 Cleanup CI Warnings (#465)
* Fix named tuple warning

* Use torch AdamW instead of transformers

* Make regex string instead of literal
2022-06-23 10:06:19 -04:00
799fa935e9 Link CI slow runners to the commit (#464)
* Tweak trigger logic to link actions together
2022-06-23 08:56:01 -04:00
3ccbd9f7a0 Fix subtle bug in BF16 (#463)
* mixed precision bugfix

* Use is_tpu_available
2022-06-23 08:55:13 -04:00
f13c59f91e Include bf16 support for TPUs and CPUs, and a better check for if a CUDA device supports BF16 (#462)
* Support bf16 on TPU, CPU, and GPU in Accelerator directly
2022-06-22 17:53:42 -04:00
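With the change above, bf16 can be requested directly from the `Accelerator`; a minimal sketch (the device check against CUDA/TPU/CPU support happens internally):

```python
from accelerate import Accelerator

# Request bf16 mixed precision; per this commit the Accelerator verifies whether
# the current device (CUDA, TPU or CPU) actually supports bf16.
accelerator = Accelerator(mixed_precision="bf16")
print(accelerator.device)
```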
d39c57c11f Handle bfloat16 weights in disk offload without adding memory overhead (#460) (#461) 2022-06-22 09:13:23 -04:00
e2a968c66d Handle bfloat16 weights in disk offload (#460)
* Handle bfloat16 weights in disk offload

* Address review comments
2022-06-21 18:06:57 -04:00
dc243c0db1 Raise a clear warning if a user tries to modify the AcceleratorState (#458)
* Reinitialize warning
2022-06-21 16:42:35 -04:00
97f4c9de61 Right step point (#459) 2022-06-21 15:11:03 -04:00
73a596593e Better checks for if a TPU device exists (#456)
* Check if a TPU device actually exists
2022-06-21 12:12:00 -04:00
eeaba598f4 Offload and modules with unused submodules (#442)
* Offload and modules with unused submodules

* Renaming

* Update src/accelerate/hooks.py

Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>

* Address review comment

Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
2022-06-17 20:04:39 -04:00
3d92caa241 Release: v0.10.0 2022-06-15 13:58:22 -04:00
fa17f207b5 Fix docstring (#447) 2022-06-15 13:54:04 -04:00
873dcc63a4 Migrate HFDeepSpeedConfig from trfrs to accelerate (#432)
* Migrate HFDeepSpeedConfig from trfrs to accelerate

* update state.py to resolve comments

1. Adds static method to have a simple API for integrating deepspeed config in transformers trainer.

* reverting changes and addressing comments

* Marking DeepSpeed and FSDP as experimental in accelerate
2022-06-15 20:56:39 +05:30
40b6fe1784 Add psutil as dependency (#445) 2022-06-15 11:03:52 -04:00
29eef234c9 Revamp TPU internals to be more efficient + enable mixed precision types (#441)
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-06-14 17:41:20 -04:00
3f0876ac03 fix fsdp torch version dependency (#437) 2022-06-11 00:36:44 +05:30
450d51ce01 Create Gradient Accumulation Example (#431)
* Gradient accumulation example
2022-06-08 14:46:04 -04:00
1b2da6c6a5 init (#429) 2022-06-08 14:07:10 -04:00
1424a8e00d Introduce no_sync context wrapper + clean up some more warnings for DDP (#428) 2022-06-08 12:56:21 -04:00
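A minimal, self-contained sketch of the `no_sync` wrapper introduced above (model, data and accumulation interval are placeholders): gradients are accumulated locally on accumulation steps and only all-reduced on the step that updates the weights.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataset = TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
model, optimizer, dataloader = accelerator.prepare(model, optimizer, DataLoader(dataset, batch_size=8))

loss_fn = torch.nn.MSELoss()
for step, (x, y) in enumerate(dataloader):
    if (step + 1) % 4 != 0:
        # Accumulation step: skip the DDP gradient all-reduce for this backward pass.
        with accelerator.no_sync(model):
            accelerator.backward(loss_fn(model(x), y))
    else:
        # Sync step: gradients are all-reduced here, then we update the weights.
        accelerator.backward(loss_fn(model(x), y))
        optimizer.step()
        optimizer.zero_grad()
```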
b2afd4e8da updating tests to resolve runner failures wrt deepspeed revamp (#427)
* deepspeed revamp

* Update dataclasses.py

* Update deepspeed.py

* quality

* fixing code

* quality

* FIx imports

* saving 16bit model in zero stage 3

1. Saving 16bit model in zero stage 3
2. zero init in stage 3 support using HFDeepSpeedConfig

* quality

* adding test and fixing bugs

* update makefile for deepspeed tests

* Update test.yml

* adding `deepspeed` as requirement for tests

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* quality

* addressing comments

* add example and minor updates

1. Add example to show the usage of config file with revamped deepspeed support.
2. update required deepspeed version to 0.6.5
3. reverting `reinit` change as it is not required
4. raising Exception when using `clip_grad_value` with DeepSpeed/FSDP.

* Documentation and Zero-3 Inference Support

1. Changes to support ZeRO Stage-3 inference.
2. minor bug fixes.
3. Documentation.

* doc fix

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* addressing comments

* update doc to address comments and bug fixes

1. update tests and add new one testing autofill functionality of `prepare` method.
2. fix bug related to zero-3 init related to HFDeepSpeedConfig
3. Update documentation addressing comments.

* removing image and hosting it on `documentation-images` dataset

* check for hidden_size for zero_opt heuristics

* updating tests to resolve runner failures

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-06-07 16:21:26 +05:30
2130205626 Fix secrets in Docker workflow (#426)
* Fix secrets
2022-06-07 06:47:09 -04:00
1703b79a79 DeepSpeed Revamp (#405)
* deepspeed revamp

* Update dataclasses.py

* Update deepspeed.py

* quality

* fixing code

* quality

* FIx imports

* saving 16bit model in zero stage 3

1. Saving 16bit model in zero stage 3
2. zero init in stage 3 support using HFDeepSpeedConfig

* quality

* adding test and fixing bugs

* update makefile for deepspeed tests

* Update test.yml

* adding `deepspeed` as requirement for tests

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* quality

* addressing comments

* add example and minor updates

1. Add example to show the usage of config file with revamped deepspeed support.
2. update required deepspeed version to 0.6.5
3. reverting `reinit` change as it is not required
4. raising Exception when using `clip_grad_value` with DeepSpeed/FSDP.

* Documentation and Zero-3 Inference Support

1. Changes to support ZeRO Stage-3 inference.
2. minor bug fixes.
3. Documentation.

* doc fix

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* addressing comments

* update doc to address comments and bug fixes

1. update tests and add new one testing autofill functionality of `prepare` method.
2. fix bug related to zero-3 init related to HFDeepSpeedConfig
3. Update documentation addressing comments.

* removing image and hosting it on `documentation-images` dataset

* check for hidden_size for zero_opt heuristics

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-06-07 00:52:18 +05:30
05c641bc0c Introduce a Dependency Checker to trigger new Docker Builds on main (#424)
* Introduce warning + auto build

* Trigger only on merge to main
2022-06-06 07:30:39 -04:00
da78e296ba Enable slow tests nightly (#421) 2022-06-01 20:28:31 -04:00
9e0fff9291 Push out python 3.6 + fix all tests related to the upgrade (#420)
* Update Docker for py 3.7

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-06-01 16:49:27 -04:00
938b8f358d Speedup main CI (#419)
* Speed up workflow
2022-06-01 10:59:01 -04:00
d04e8e2baa Switch to evaluate for metrics (#417)
* Switch to evaluate for metrics

* Why the heck?

* Fix syntax error

* Install from github

* Is this the culprit?

* Upgrade Python

* Protobuf 💩

* Install from git not necessary now

* Sneaky last tensorboard

* Let's try this way

* Forgot to add all files :-/
2022-06-01 09:57:57 -04:00
8db128498c Create an issue template for Accelerate (#415) 2022-06-01 09:15:23 -04:00
114707449b Introduce post-merge runners (#416)
* Introduce post-merge runners
2022-05-31 15:11:29 -04:00
3b51d6e9ad Fix debug_launcher issues (#413)
* change to require_cpu only
2022-05-31 14:59:28 -04:00
174eb3af1d Use main egg (#414) 2022-05-31 14:58:38 -04:00
d176b552c9 Introduce nightly runners (#410)
* Introduce nightly builds
* Fixup docker images slightly
* Make device-count specific test use `torch.cuda.device_count()` rather than `Accelerator.num_processes` to avoid bug.
2022-05-31 14:14:02 -04:00
95 changed files with 5794 additions and 867 deletions

59
.github/ISSUE_TEMPLATE/bug-report.yml

@ -0,0 +1,59 @@
name: "\U0001F41B Bug Report"
description: Submit a bug report to help us improve Accelerate
labels: [ "bug" ]
body:
- type: textarea
id: system-info
attributes:
label: System Info
description: Please share your accelerate configuration with us. You can run the command `accelerate env` and copy-paste its outputs below
render: Shell
placeholder: accelerate version, OS, python version, numpy version, torch version, and accelerate's configuration
validations:
required: true
- type: checkboxes
id: information-scripts-examples
attributes:
label: Information
description: 'The problem arises when using:'
options:
- label: "The official example scripts"
- label: "My own modified scripts"
- type: checkboxes
id: information-tasks
attributes:
label: Tasks
description: "The tasks I am working on are:"
options:
- label: "One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)"
- label: "My own task or dataset (give details below)"
- type: textarea
id: reproduction
validations:
required: true
attributes:
label: Reproduction
description: |
Please provide a code sample that reproduces the problem you ran into. It can be a Colab link or just a code snippet.
If you have code snippets, error messages, stack traces please provide them here as well.
Important! Use code tags to correctly format your code. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
Do not use screenshots, as they are hard to read and (more importantly) don't allow others to copy-and-paste your code.
placeholder: |
Steps to reproduce the behavior:
1.
2.
3.
- type: textarea
id: expected-behavior
validations:
required: true
attributes:
label: Expected behavior
description: "A clear and concise description of what you would expect to happen."
render: Shell

View File

@ -1,7 +1,8 @@
name: Build Docker images (scheduled)
on:
repository_dispatch:
workflow_dispatch:
workflow_call:
schedule:
- cron: "0 1 * * *"

View File

@ -0,0 +1,45 @@
name: Trigger docker images and run slow tests
on:
push:
branches:
- main
workflow_dispatch:
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
jobs:
check-for-setup:
runs-on: ubuntu-latest
name: Check if setup was changed
outputs:
changed: ${{ steps.was_changed.outputs.changed }}
steps:
- uses: actions/checkout@v3
with:
fetch-depth: "2"
- name: Get changed files
id: changed-files
uses: tj-actions/changed-files@v22.2
- name: Was setup changed
id: was_changed
run: |
for file in ${{ steps.changed-files.outputs.all_changed_files }}; do
if [ `basename "${file}"` = "setup.py" ]; then
echo ::set-output name=changed::"1"
fi
done
build-docker-containers:
needs: check-for-setup
if: (github.event_name == 'push') && (needs.check-for-setup.outputs.changed == '1')
uses: ./.github/workflows/build-docker-images.yml
secrets: inherit
run-tests:
needs: build-docker-containers
if: always()
uses: ./.github/workflows/on-merge.yml

69
.github/workflows/nightly.yml

@ -0,0 +1,69 @@
name: Self-hosted runner (scheduled)
on:
workflow_dispatch:
schedule:
- cron: "0 2 * * *"
env:
RUN_SLOW: "yes"
jobs:
run_all_tests_single_gpu:
runs-on: [self-hosted, docker-gpu, multi-gpu]
env:
CUDA_VISIBLE_DEVICES: "0"
container:
image: huggingface/accelerate-gpu:latest
options: --gpus all --shm-size "16gb"
defaults:
run:
working-directory: accelerate/
shell: bash
steps:
- name: Update clone & pip install
run: |
source activate accelerate
git config --global --add safe.directory '*'
git fetch && git checkout ${{ github.sha }}
pip install -e . --no-deps
- name: Run test on GPUs
run: |
source activate accelerate
make test
- name: Run examples on GPUs
run: |
source activate accelerate
pip uninstall comet_ml -y
make test_examples
run_all_tests_multi_gpu:
runs-on: [self-hosted, docker-gpu, multi-gpu]
env:
CUDA_VISIBLE_DEVICES: "0,1"
container:
image: huggingface/accelerate-gpu:latest
options: --gpus all --shm-size "16gb"
defaults:
run:
working-directory: accelerate/
shell: bash
steps:
- name: Update clone
run: |
source activate accelerate
git config --global --add safe.directory '*'
git fetch && git checkout ${{ github.sha }}
pip install -e . --no-deps
- name: Run test on GPUs
run: |
source activate accelerate
make test
- name: Run examples on GPUs
run: |
source activate accelerate
pip uninstall comet_ml -y
make test_examples

66
.github/workflows/on-merge.yml

@ -0,0 +1,66 @@
name: Self-hosted runner tests (push to "main")
on:
workflow_call:
workflow_dispatch:
env:
TESTING_MOCKED_DATALOADERS: "1"
jobs:
run_all_tests_single_gpu:
runs-on: [self-hosted, docker-gpu, multi-gpu]
env:
CUDA_VISIBLE_DEVICES: "0"
container:
image: huggingface/accelerate-gpu:latest
options: --gpus all --shm-size "16gb"
defaults:
run:
working-directory: accelerate/
shell: bash
steps:
- name: Update clone & pip install
run: |
source activate accelerate
git config --global --add safe.directory '*'
git fetch && git checkout ${{ github.sha }}
pip install -e .[test,test_trackers]
- name: Run test on GPUs
run: |
source activate accelerate
make test
- name: Run examples on GPUs
run: |
source activate accelerate
pip uninstall comet_ml -y
make test_examples
run_all_tests_multi_gpu:
runs-on: [self-hosted, docker-gpu, multi-gpu]
container:
image: huggingface/accelerate-gpu:latest
options: --gpus all --shm-size "16gb"
defaults:
run:
working-directory: accelerate/
shell: bash
steps:
- name: Update clone
run: |
source activate accelerate
git config --global --add safe.directory '*'
git fetch && git checkout ${{ github.sha }}
pip install -e .[test,test_trackers]
- name: Run test on GPUs
run: |
source activate accelerate
make test
- name: Run examples on GPUs
run: |
source activate accelerate
pip uninstall comet_ml -y
make test_examples

View File

@ -7,10 +7,10 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.6
uses: actions/setup-python@v2
- name: Set up Python 3.7
uses: actions/setup-python@v3
with:
python-version: 3.6
python-version: 3.7
- name: Install Python dependencies
run: pip install -e .[quality]
- name: Run Quality check

View File

@ -2,29 +2,44 @@ name: Run Tests
on: [pull_request]
env:
HF_HOME: ~/hf_cache
TESTING_MOCKED_DATALOADERS: "1"
jobs:
test:
run-tests:
runs-on: ubuntu-latest
strategy:
matrix:
test-kind: [
test,
test_deepspeed,
test_example_differences,
test_checkpoint_step,
test_checkpoint_epoch,
test_rest
]
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.6
uses: actions/setup-python@v2
- uses: actions/checkout@v3
- name: Set up python 3.7
uses: actions/setup-python@v3
with:
python-version: 3.6
- name: Install Python dependencies
run: pip install setuptools==59.5.0; pip install -e .[test,test_trackers]
- name: Run Tests
run: make test_cpu
test_examples:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.6
uses: actions/setup-python@v2
python-version: 3.7
- name: Activate python cache
uses: actions/cache@v3
with:
python-version: 3.6
- name: Install Python dependencies
run: pip install setuptools==59.5.0; pip install -e .[test] tensorboard
path: |
${{ env.pythonLocation }}
${{ env.HF_HOME }}
key: ${{ env.pythonLocation }}-${{ matrix.test-kind }}-${{ hashFiles('setup.py') }}
- name: Install the library
run: |
pip install --upgrade pip
pip install -e .[test,test_trackers]
if [ ${{ matrix.test-kind }} = test_rest ]; then pip uninstall comet_ml -y; fi
- name: Run Tests
run: make test_examples
run: |
make ${{ matrix.test-kind }}

View File

@ -1,6 +1,6 @@
.PHONY: quality style test docs
check_dirs := tests src examples
check_dirs := tests src examples benchmarks
# Check that source code meets quality standards
@ -24,12 +24,24 @@ style:
python utils/style_doc.py src/accelerate docs/source --max_len 119
# Run tests for the library
test_cpu:
test:
python -m pytest -s -v ./tests/ --ignore=./tests/test_examples.py
test_cuda:
python -m pytest -s -v ./tests/ --ignore=./tests/test_examples.py --ignore=./tests/test_scheduler.py --ignore=./tests/test_cpu.py
python -m pytest -s -v ./tests/test_cpu.py ./tests/test_scheduler.py
test_deepspeed:
python -m pytest -s -v ./tests/deepspeed
test_examples:
python -m pytest -s -v ./tests/test_examples.py
# Broken down example tests for the CI runners
test_example_differences:
python -m pytest -s -v ./tests/test_examples.py::ExampleDifferenceTests
test_checkpoint_epoch:
python -m pytest -s -v ./tests/test_examples.py::FeatureExamplesTests -k "by_epoch"
test_checkpoint_step:
python -m pytest -s -v ./tests/test_examples.py::FeatureExamplesTests -k "by_step"
test_rest:
python -m pytest -s -v ./tests/test_examples.py::FeatureExamplesTests -k "not by_step and not by_epoch"

View File

@ -241,4 +241,5 @@ pip install accelerate
- multi-GPU on several nodes (machines)
- TPU
- FP16 with native AMP (apex on the roadmap)
- DeepSpeed support (experimental)
- DeepSpeed support (Experimental)
- PyTorch Fully Sharded Data Parallel (FSDP) support (Experimental)

46
benchmarks/README.md

@ -0,0 +1,46 @@
# Big model inference benchmarks
Running inference with Accelerate on big models.
## Setup
These benchmarks use the `transformers` library:
```bash
pip install transformers
```
To reproduce or test a new setup, run
```py
python inference_acc.py model_name
```
This script supports `gpt-j-6b`, `gpt-neox`, `opt` (30B version) and `T0pp` out of the box, but you can specify any valid checkpoint for `model_name`.
To force a different `torch_dtype` than the one in the config: `--torch_dtype xxx`.
If you get an error linked to disk offload, you need to add the option `--disk_offload`
## Results
On a setup with two Titan RTX GPUs (24GB of RAM each) and 32GB of CPU RAM, we get the following benchmarks (T0pp does not run in float16, which is why it's not included).
| Model | Model load time | Generation time | dtype | GPU 0 use | GPU 1 use | CPU use | Disk offload |
|:-----:|:---------------:|:---------------:|:-----:|:---------:|:---------:|:-------:|:------------:|
| GPT-J-6B | 8.7s | 0.05s per token | float16 | 11.7GB | 0GB | 0GB | no |
| GPT-J-6B | 12.4s | 0.06s per token | float32 | 21.9GB | 1.5GB | 0GB | no |
| GPT-Neo-X-20B | 30.9s | 0.08s per token | float16 | 21.5GB | 18GB | 0GB | no |
| GPT-Neo-X-20B | 78.2s | 10.72s per token | float32 | 20.3GB | 22.7 GB | 24.4GB | yes |
| T0pp (11B) | 29.4s | 0.05s per token | float32 | 21.1GB | 21.3GB | 0GB | no |
| OPT-30B | 34.5s | 2.37s per token | float16 | 20.7GB | 22.3GB | 14.1GB | no |
| OPT-30B | 112.3s | 33.9s per token | float32 | 20.2GB | 21.2GB | 23.5GB | yes |
Note on the results:
- using two GPUs instead of one does not slow down generation
- using CPU offload slows down a bit (see OPT-30b)
- using disk offload slows down a lot (need to implement prefetching)
You will also note that Accelerate does not use any more GPU and CPU RAM than necessary:
- peak GPU memory is exactly the size of the model put on a given GPU
- peak CPU memory is either the size of the biggest checkpoint shard or the part of the model offloaded on CPU, whichever is bigger.

View File

@ -0,0 +1,143 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import time
import torch
import transformers
from accelerate.utils import compute_module_sizes
from measures_util import end_measure, log_measures, start_measure
from transformers import AutoConfig, AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer
DEFAULT_MODELS = {
"gpt-j-6b": {"is_causal": True, "model": "sgugger/sharded-gpt-j-6B", "tokenizer": "EleutherAI/gpt-j-6B"},
"gpt-neox": {"is_causal": True, "model": "EleutherAI/gpt-neox-20b"},
"opt": {"is_causal": True, "model": "facebook/opt-30b"},
"T0pp": {"is_causal": False, "model": "bigscience/T0pp", "model_revision": "sharded"},
}
PROMPTS = [
"Hello, my name is",
"Are unicorns real? Unicorns are",
"For the first time in several years,",
"My name is Julien and I am",
"The goal of life is",
"Whenever I'm sad, I like to",
]
def parse_args():
parser = argparse.ArgumentParser(description="Run and time generations on a big model using Accelerate.")
parser.add_argument("model_name", type=str, default=None, help="The name of the model to try.")
parser.add_argument(
"--tokenizer_name", type=str, default=None, help="The name of the tokenizer (if different from the model."
)
parser.add_argument("--is_causal", type=bool, default=None, help="Whether or not the model is causal.")
parser.add_argument(
"--model_revision", type=str, default=None, help="The revision to use for the model checkpoint."
)
parser.add_argument("--torch_dtype", type=str, default=None, help="The dtype for the model.")
parser.add_argument("--disk_offload", action="store_true")
args = parser.parse_args()
# Sanitize args
if args.model_name in DEFAULT_MODELS:
defaults = DEFAULT_MODELS[args.model_name]
args.model_name = defaults["model"]
if args.tokenizer_name is None:
args.tokenizer_name = defaults.get("tokenizer", args.model_name)
if args.is_causal is None:
args.is_causal = defaults["is_causal"]
if args.model_revision is None:
args.model_revision = defaults.get("model_revision", "main")
if args.is_causal is None:
raise ValueError("Could not infer the default for `--is_causal`, pass either True or False for it.")
if args.tokenizer_name is None:
args.tokenizer_name = args.model_name
if args.model_revision is None:
args.model_revision = "main"
return args
def main():
transformers.utils.logging.set_verbosity_error()
args = parse_args()
if args.torch_dtype is None:
config = AutoConfig.from_pretrained(args.model_name)
torch_dtype = getattr(config, "torch_dtype", torch.float32)
else:
torch_dtype = getattr(torch, args.torch_dtype)
model_cls = AutoModelForCausalLM if args.is_causal else AutoModelForSeq2SeqLM
kwargs = {
"torch_dtype": torch_dtype,
"revision": args.model_revision,
}
if args.disk_offload:
kwargs["offload_folder"] = "tmp_offload"
kwargs["offload_state_dict"] = True
start_measures = start_measure()
model = model_cls.from_pretrained(args.model_name, device_map="auto", **kwargs)
end_measures = end_measure(start_measures)
log_measures(end_measures, "Model loading")
module_sizes = compute_module_sizes(model)
device_size = {v: 0 for v in model.hf_device_map.values()}
for module, device in model.hf_device_map.items():
device_size[device] += module_sizes[module]
message = "\n".join([f"- {device}: {size // 2**20}MiB" for device, size in device_size.items()])
print(f"\nTheoretical use:\n{message}")
tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name)
start_measures = start_measure()
generation_times = []
gen_tokens = []
texts_outs = []
for prompt in PROMPTS:
inputs = tokenizer(prompt, return_tensors="pt").to(0)
tokens = inputs["input_ids"][0].tolist()
before_generate = time.time()
outputs = model.generate(inputs["input_ids"])
after_generate = time.time()
outputs = outputs[0].tolist()
num_gen_tokens = len(outputs) if outputs[: len(tokens)] != tokens else len(outputs) - len(tokens)
generation_time = after_generate - before_generate
text_out = tokenizer.decode(outputs, skip_special_tokens=True)
texts_outs.append(text_out)
generation_times.append(generation_time)
gen_tokens.append(num_gen_tokens)
print(f"Prompt: {prompt}\nGeneration {text_out}\nIn {generation_time:.2f}s for {num_gen_tokens} tokens\n")
end_measures = end_measure(start_measures)
log_measures(end_measures, "Model generation")
generation_times_per_token = [gen / tok for gen, tok in zip(generation_times, gen_tokens)]
avg_gen = sum(generation_times_per_token) / len(generation_times)
print(f"Average time of generation per token: {avg_gen:.2f}s")
print(f"First generation (avg time per token): {generation_times_per_token[0]:.2f}s")
avg_gen = sum(generation_times_per_token[1:]) / (len(generation_times_per_token) - 1)
print(f"Average time of generation per token (excluding the first): {avg_gen:.2f}s")
if __name__ == "__main__":
main()

View File

@ -0,0 +1,86 @@
import gc
import threading
import time
import torch
import psutil
class PeakCPUMemory:
def __init__(self):
self.process = psutil.Process()
self.peak_monitoring = False
def peak_monitor(self):
self.cpu_memory_peak = -1
while True:
self.cpu_memory_peak = max(self.process.memory_info().rss, self.cpu_memory_peak)
# can't sleep or will not catch the peak right (this comment is here on purpose)
if not self.peak_monitoring:
break
def start(self):
self.peak_monitoring = True
self.thread = threading.Thread(target=self.peak_monitor)
self.thread.daemon = True
self.thread.start()
def stop(self):
self.peak_monitoring = False
self.thread.join()
return self.cpu_memory_peak
cpu_peak_tracker = PeakCPUMemory()
def start_measure():
# Time
measures = {"time": time.time()}
gc.collect()
torch.cuda.empty_cache()
# CPU mem
measures["cpu"] = psutil.Process().memory_info().rss
cpu_peak_tracker.start()
# GPU mem
for i in range(torch.cuda.device_count()):
measures[str(i)] = torch.cuda.memory_allocated(i)
torch.cuda.reset_peak_memory_stats()
return measures
def end_measure(start_measures):
# Time
measures = {"time": time.time() - start_measures["time"]}
gc.collect()
torch.cuda.empty_cache()
# CPU mem
measures["cpu"] = (psutil.Process().memory_info().rss - start_measures["cpu"]) / 2**20
measures["cpu-peak"] = (cpu_peak_tracker.stop() - start_measures["cpu"]) / 2**20
# GPU mem
for i in range(torch.cuda.device_count()):
measures[str(i)] = (torch.cuda.memory_allocated(i) - start_measures[str(i)]) / 2**20
measures[f"{i}-peak"] = (torch.cuda.max_memory_allocated(i) - start_measures[str(i)]) / 2**20
return measures
def log_measures(measures, description):
print(f"{description}:")
print(f"- Time: {measures['time']:.2f}s")
for i in range(torch.cuda.device_count()):
print(f"- GPU {i} allocated: {measures[str(i)]:.2f}MiB")
peak = measures[f"{i}-peak"]
print(f"- GPU {i} peak: {peak:.2f}MiB")
print(f"- CPU RAM allocated: {measures['cpu']:.2f}MiB")
print(f"- CPU RAM peak: {measures['cpu-peak']:.2f}MiB")

View File

@ -1,7 +1,7 @@
# Builds CPU-only Docker image of PyTorch
# Uses multi-staged approach to reduce size
# Stage 1
FROM python:3.6-slim as compile-image
FROM python:3.7-slim as compile-image
ARG DEBIAN_FRONTEND=noninteractive
@ -21,17 +21,14 @@ WORKDIR /workspace
RUN python3 -m pip install --upgrade --no-cache-dir pip
RUN python3 -m pip install --no-cache-dir \
jupyter \
torch --extra-index-url https://download.pytorch.org/whl/cpu \
git+https://github.com/huggingface/accelerate#egg=accelerate[dev]
git+https://github.com/huggingface/accelerate#egg=accelerate[test,test_trackers] \
--extra-index-url https://download.pytorch.org/whl/cpu
# Stage 2
FROM python:3.6-slim AS build-image
FROM python:3.7-slim AS build-image
COPY --from=compile-image /opt/venv /opt/venv
# Install apt libs
RUN apt-get update && \
apt-get install -y curl git wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists*
RUN useradd -ms /bin/bash user
USER user
# Make sure we use the virtualenv
ENV PATH="/opt/venv/bin:$PATH"

View File

@ -4,7 +4,7 @@
# Use base conda image to reduce time
FROM continuumio/miniconda3:latest AS compile-image
# Specify py version
ENV PYTHON_VERSION=3.6
ENV PYTHON_VERSION=3.7.3
# Install apt libs
RUN apt-get update && \
apt-get install -y curl git wget && \
@ -22,8 +22,8 @@ SHELL ["/bin/bash", "-c"]
# Activate the conda env and install torch + accelerate
RUN source activate accelerate && \
python3 -m pip install --no-cache-dir \
torch --extra-index-url https://download.pytorch.org/whl/cu113 \
git+https://github.com/huggingface/accelerate#egg=accelerate[dev]
git+https://github.com/huggingface/accelerate#egg=accelerate[test,test_trackers] \
--extra-index-url https://download.pytorch.org/whl/cu113
# Stage 2
FROM nvidia/cuda:11.2.2-cudnn8-devel-ubuntu20.04 AS build-image

View File

@ -9,6 +9,8 @@
- sections:
- local: big_modeling
title: Handling big models
- local: gradient_accumulation
title: Gradient accumulation
- local: sagemaker
title: Amazon SageMaker
title: Guides
@ -19,14 +21,18 @@
title: Notebook Launcher
- local: kwargs
title: Kwargs Handlers
- local: internal
title: Internals
- local: checkpoint
title: Checkpointing
- local: internal
title: Internals
- local: tracking
title: Experiment Tracking
- local: fsdp
title: Fully Sharded Data Parallel
- local: memory
title: Memory Utilities
- local: deepspeed
title: DeepSpeed
- local: utilities
title: General Utilities
title: API Reference

View File

@ -13,7 +13,7 @@ specific language governing permissions and limitations under the License.
# Accelerator
The [`Accelerator`] is the main class provided by 🤗 Accelerate. It serves at the main entrypoint for
the API. To quickly adapt your script to work on any kind of setup with 🤗 Accelerate juste:
the API. To quickly adapt your script to work on any kind of setup with 🤗 Accelerate just:
1. Initialize an [`Accelerator`] object (that we will call `accelerator` in the rest of this
page) as early as possible in your script.
@ -21,10 +21,10 @@ the API. To quickly adapt your script to work on any kind of setup with 🤗 Acc
3. (Optional but best practice) Remove all the `.cuda()` or `.to(device)` in your code and let the
`accelerator` handle device placement for you.
4. Replace the `loss.backward()` in your code by `accelerator.backward(loss)`.
5. (Optional, when using distributed evaluation) Gather your predictions and labelsbefore storing them or using them
for metric computation using [`~Accelerator.gather`].
5. (Optional, when using distributed evaluation) Gather your predictions and labels before storing them or using
them for metric computation using [`~Accelerator.gather`].
This is all what is needed in most cases. For more advanced case or a nicer experience here are the functions you
This is all that is needed in most cases. For more advanced cases or a nicer experience here are the functions you
should search for and replace by the corresponding methods of your `accelerator`:
- `print` statements should be replaced by [`~Accelerator.print`] to be only printed once per
@ -38,4 +38,27 @@ should search for and replace by the corresponding methods of your `accelerator`
- Use [`~Accelerator.clip_grad_norm_`] instead of `torch.nn.utils.clip_grad_norm_` and
[`~Accelerator.clip_grad_value_`] instead of `torch.nn.utils.clip_grad_value_`.
To perform gradient accumulation use [`~Accelerator.accumulate`] and specify a `gradient_accumulation_steps`.
This will also automatically ensure the gradients are synced or unsynced when on multi-device training, check if the step should
actually be performed, and auto-scale the loss:
```python
accelerator = Accelerator(gradient_accumulation_steps=2)
for input, label in training_dataloader:
with accelerator.accumulate(model):
predictions = model(input)
loss = loss_function(predictions, label)
accelerator.backward(loss)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```
<Tip warning={true}>
Using this with `dispatch_batches=True` (which is the default for iterable datasets) is currently not supported.
</Tip>
[[autodoc]] Accelerator
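As a minimal sketch of the five steps above applied to a bare training loop (model, data and loss names are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()                                   # step 1
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))), batch_size=8)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)  # step 2

loss_fn = torch.nn.CrossEntropyLoss()
for inputs, labels in dataloader:                             # step 3: no manual .to(device)
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    accelerator.backward(loss)                                # step 4
    optimizer.step()

predictions = model(inputs).argmax(dim=-1)
all_predictions = accelerator.gather(predictions)             # step 5: gather for evaluation
```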

View File

@ -27,7 +27,7 @@ In plain English, those steps are:
2. Load the model weights (in a dictionary usually called a state dict) from the disk
3. Load those weights inside the model
While this works very well for regularly sized models, this workflow has some clear limitation when we deal with a huge model: in step 1, we load a full version of the model in RAM, and spend some time randomly initializing the weights (which will be discarded in step 3). In step 2, we load another full version of the model in RAM, with the pretrained weights. If you're loading a model with 6 billions parameters, this needs you will need 24GB of RAM for each copy of the model, so 48GB in total (half of it to load the model in FP16).
While this works very well for regularly sized models, this workflow has some clear limitations when we deal with a huge model: in step 1, we load a full version of the model in RAM, and spend some time randomly initializing the weights (which will be discarded in step 3). In step 2, we load another full version of the model in RAM, with the pretrained weights. If you're loading a model with 6 billion parameters, this means you will need 24GB of RAM for each copy of the model, so 48GB in total (half of it to load the model in FP16).
<Tip warning={true}>
@ -37,7 +37,7 @@ This API is quite new and still in its experimental stage. While we strive to pr
## Instantiating an empty model
The first tool Accelerate introduces to help with big models is a context manager [`init_empty_weights`] that helps you initialize a model without using any RAM, so that step 1 can be done on models of any size. Here is how it works:
The first tool 🤗 Accelerate introduces to help with big models is a context manager [`init_empty_weights`] that helps you initialize a model without using any RAM, so that step 1 can be done on models of any size. Here is how it works:
```py
from accelerate import init_empty_weights
@ -65,7 +65,7 @@ You can't move a model initialized like this on CPU or another device directly,
It's possible your model is so big that even a single copy won't fit in RAM. That doesn't mean it can't be loaded: if you have one or several GPUs, this is more memory available to store your model. In this case, it's better if your checkpoint is split in several smaller files that we call checkpoint shards.
Accelerate will handle sharded checkpoints as long as you follow the following format: your checkpoint should be in a folder, with several files containing the partial state dicts, and there should be an index in the JSON format that contains a dictionary mapping parameter names to the file containing their weights. For instance we could have a folder containing:
🤗 Accelerate will handle sharded checkpoints as long as you follow the following format: your checkpoint should be in a folder, with several files containing the partial state dicts, and there should be an index in the JSON format that contains a dictionary mapping parameter names to the file containing their weights. For instance we could have a folder containing:
```bash
first_state_dict.bin
@ -88,7 +88,7 @@ and `first_state_dict.bin` containing the weights for `"linear1.weight"` and `"l
## Loading weights
The second tool Accelerate introduces is a function [`load_checkpoint_and_dispatch`], that will allow you to load a checkpoint inside your empty model. This supports full checkpoints (a single file containing the whole state dict) as well as sharded checkpoints. It will also automatically dispatch those weights across the devices you have available (GPUs, CPU RAM), so if you are loading a sharded checkpoint, the maximum RAM usage will be the size of the biggest shard.
The second tool 🤗 Accelerate introduces is a function [`load_checkpoint_and_dispatch`], that will allow you to load a checkpoint inside your empty model. This supports full checkpoints (a single file containing the whole state dict) as well as sharded checkpoints. It will also automatically dispatch those weights across the devices you have available (GPUs, CPU RAM), so if you are loading a sharded checkpoint, the maximum RAM usage will be the size of the biggest shard.
Here is how we can use this to load the [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B) model. You clone the sharded version of this model with:
@ -122,14 +122,14 @@ model = load_checkpoint_and_dispatch(
)
```
By passing `device_map="auto"`, we tell Accelerate to determine automatically where to put each layer of the model depending on the available resources:
By passing `device_map="auto"`, we tell 🤗 Accelerate to determine automatically where to put each layer of the model depending on the available resources:
- first we use the maximum space available on the GPU(s)
- if we still need space, we store the remaining weights on the CPU
- if there is not enough RAM, we store the remaining weights on the hard drive as memory-mapped tensors
`no_split_module_classes=["GPTJBlock"]` indicates that the modules that are `GPTJBlock` should not be split on different devices. You should set here all blocks that include a residual connection of some kind.
You can see the `device_map` that Accelerate picked by accessing the `hf_device_map` attribute of your model:
You can see the `device_map` that 🤗 Accelerate picked by accessing the `hf_device_map` attribute of your model:
```py
model.hf_device_map
@ -190,7 +190,7 @@ output = model.generate(inputs["input_ids"])
tokenizer.decode(output[0].tolist())
```
Behind the scenes, Accelerate added hooks to the model, so that:
Behind the scenes, 🤗 Accelerate added hooks to the model, so that:
- at each layer, the inputs are put on the right device (so even if your model is spread across several GPUs, it works)
- for the weights offloaded on the CPU, they are put on a GPU just before the forward pass, and cleaned up just after
- for the weights offloaded on the hard drive, they are loaded in RAM then put on a GPU just before the forward pass, and cleaned up just after
@ -207,7 +207,7 @@ This only supports inference of your model, not training. Most of the computatio
We are aware of the current limitations in the API:
- While this could theoretically work just one CPU with potential disk offload, you need at least one GPU to run this API. This will be fixed in further development.
- While this could theoretically work on just one CPU with potential disk offload, you need at least one GPU to run this API. This will be fixed in further development.
- [`infer_auto_device_map`] (or `device_map="auto"` in [`load_checkpoint_and_dispatch`]) tries to maximize GPU and CPU RAM it sees available when you execute it. While PyTorch is very good at managing GPU RAM efficiently (and giving it back when not needed), it's not entirely true with Python and CPU RAM. Therefore, an automatically computed device map might be too intense on the CPU. Move a few modules to the disk device if you get crashes due to lack of RAM.
- [`infer_auto_device_map`] (or `device_map="auto"` in [`load_checkpoint_and_dispatch`]) attributes devices sequentially (to avoid moving things back and forth) so if your first layer is bigger than the size of the GPU you have, it will end up with everything on the CPU/Disk.
- [`load_checkpoint_and_dispatch`] and [`load_checkpoint_in_model`] do not perform any check on the correctness of your state dict compared to your model at the moment (this will be fixed in a future version), so you may get some weird errors if trying to load a checkpoint with mismatched or missing keys.
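Putting the two tools together, a minimal sketch (the local checkpoint folder name is a placeholder; it assumes the sharded GPT-J checkpoint discussed above has been cloned there):

```python
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("EleutherAI/gpt-j-6B")
with init_empty_weights():
    # Instantiated on the meta device: no RAM is used for the weights yet.
    model = AutoModelForCausalLM.from_config(config)

model = load_checkpoint_and_dispatch(
    model,
    "sharded-gpt-j-6B",                      # placeholder: folder with shards + index
    device_map="auto",
    no_split_module_classes=["GPTJBlock"],   # keep residual blocks on a single device
)
```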

View File

@ -12,8 +12,8 @@ specific language governing permissions and limitations under the License.
# Checkpointing
When training a PyTorch model with Accelerate, you may often want to save and continue a state of training. Doing so requires
saving and loading the model, optimizer, RNG generators, and the GradScaler. Inside Accelerate are two convience functions to achieve this quickly:
When training a PyTorch model with 🤗 Accelerate, you may often want to save and continue a state of training. Doing so requires
saving and loading the model, optimizer, RNG generators, and the GradScaler. Inside 🤗 Accelerate are two convenience functions to achieve this quickly:
- Use [`~Accelerator.save_state`] for saving everything mentioned above to a folder location
- Use [`~Accelerator.load_state`] for loading everything stored from an earlier `save_state`
@ -57,4 +57,4 @@ for epoch in range(num_epochs):
# Restore previous state
accelerate.load_state("my/save/path")
```
```
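A minimal, self-contained sketch of the two convenience functions above (the save path is a placeholder):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(2, 2)
optimizer = torch.optim.Adam(model.parameters())
model, optimizer = accelerator.prepare(model, optimizer)

accelerator.save_state("my/save/path")   # saves model, optimizer, RNG states, GradScaler
accelerator.load_state("my/save/path")   # restores everything saved above
```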

508
docs/source/deepspeed.mdx

@ -0,0 +1,508 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# DeepSpeed
[DeepSpeed](https://github.com/microsoft/DeepSpeed) implements everything described in the [ZeRO paper](https://arxiv.org/abs/1910.02054). Currently it provides full support for:
1. Optimizer state partitioning (ZeRO stage 1)
2. Gradient partitioning (ZeRO stage 2)
3. Parameter partitioning (ZeRO stage 3)
4. Custom mixed precision training handling
5. A range of fast CUDA-extension-based optimizers
6. ZeRO-Offload to CPU and Disk/NVMe
ZeRO-Offload has its own dedicated paper: [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840). And NVMe-support is described in the paper [ZeRO-Infinity: Breaking the GPU
Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857).
DeepSpeed ZeRO-2 is primarily used only for training, as its features are of no use to inference.
DeepSpeed ZeRO-3 can be used for inference as well, since it allows huge models to be loaded on multiple GPUs, which
won't be possible on a single GPU.
🤗 Accelerate integrates [DeepSpeed](https://github.com/microsoft/DeepSpeed) via 2 options:
1. Integration of the DeepSpeed features via `deepspeed config file` specification in `accelerate config`. You just supply your custom config file or use our template. Most of
this document is focused on this feature. This supports all the core features of DeepSpeed and gives the user a lot of flexibility.
The user may have to change a few lines of code depending on the config.
2. Integration via `deepspeed_plugin`. This supports a subset of the DeepSpeed features and uses default options for the rest of the configurations.
The user need not change any code; this is good for those who are fine with most of the default settings of DeepSpeed.
## What is integrated?
Training:
1. DeepSpeed ZeRO training supports the full ZeRO stages 1, 2 and 3 as well as CPU/Disk offload of optimizer states, gradients and parameters.
Below is a short description of Data Parallelism using ZeRO - Zero Redundancy Optimizer along with diagram from this [blog post](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/)
![ZeRO Data Parallelism](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-zero.png)
(Source: [link](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/))
a. **Stage 1** : Shards optimizer states across data parallel workers/GPUs
b. **Stage 2** : Shards optimizer states + gradients across data parallel workers/GPUs
c. **Stage 3**: Shards optimizer states + gradients + model parameters across data parallel workers/GPUs
d. **Optimizer Offload**: Offloads the gradients + optimizer states to CPU/Disk building on top of ZeRO Stage 2
e. **Param Offload**: Offloads the model parameters to CPU/Disk building on top of ZeRO Stage 3
<u>Note</u>: With respect to Disk Offload, the disk should be an NVMe for decent speed, but it technically works on any disk
Inference:
1. DeepSpeed ZeRO Inference supports ZeRO stage 3 with ZeRO-Infinity. It uses the same ZeRO protocol as training, but
it doesn't use an optimizer and a lr scheduler and only stage 3 is relevant. For more details see:
[deepspeed-zero-inference](#deepspeed-zero-inference).
## How it works
**Pre-Requisites**: Install DeepSpeed version >=0.6.5. Please refer to the [DeepSpeed installation details](https://github.com/microsoft/DeepSpeed#installation)
for more information.
We will first look at the easy-to-use integration via `accelerate config`,
followed by the more flexible and feature-rich `deepspeed config file` integration.
### Accelerate DeepSpeed Plugin
On your machine(s) just run:
```bash
accelerate config
```
and answer the questions asked. It will ask whether you want to use a config file for DeepSpeed to which you should answer no. Then answer the following questions to generate a basic DeepSpeed config.
This will generate a config file that will be used automatically to properly set the
default options when doing
```bash
accelerate launch my_script.py --args_to_my_script
```
For instance, here is how you would run the NLP example `examples/nlp_example.py` (from the root of the repo) with DeepSpeed Plugin:
**ZeRO Stage-2 DeepSpeed Plugin Example**
```bash
compute_environment: LOCAL_MACHINE
deepspeed_config:
gradient_accumulation_steps: 1
gradient_clipping: 1.0
offload_optimizer_device: none
offload_param_device: none
zero3_init_flag: true
zero_stage: 2
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false
```
```bash
accelerate launch examples/nlp_example.py --mixed_precision fp16
```
**ZeRO Stage-3 with CPU Offload DeepSpeed Plugin Example**
```bash
compute_environment: LOCAL_MACHINE
deepspeed_config:
gradient_accumulation_steps: 1
gradient_clipping: 1.0
offload_optimizer_device: cpu
offload_param_device: cpu
zero3_init_flag: true
zero3_save_16bit_model: true
zero_stage: 3
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false
```
```bash
accelerate launch examples/nlp_example.py --mixed_precision fp16
```
Currently, `Accelerate` supports the following config options through the CLI:
```bash
`zero_stage`: [0] Disabled, [1] optimizer state partitioning, [2] optimizer+gradient state partitioning and [3] optimizer+gradient+parameter partitioning
`gradient_accumulation_steps`: Number of training steps to accumulate gradients before averaging and applying them.
`gradient_clipping`: Enable gradient clipping with value.
`offload_optimizer_device`: [none] Disable optimizer offloading, [cpu] offload optimizer to CPU, [nvme] offload optimizer to NVMe SSD. Only applicable with ZeRO >= Stage-2.
`offload_param_device`: [none] Disable parameter offloading, [cpu] offload parameters to CPU, [nvme] offload parameters to NVMe SSD. Only applicable with ZeRO Stage-3.
`zero3_init_flag`: Decides whether to enable `deepspeed.zero.Init` for constructing massive models. Only applicable with ZeRO Stage-3.
`zero3_save_16bit_model`: Decides whether to save 16-bit model weights when using ZeRO Stage-3.
`mixed_precision`: `no` for FP32 training, `fp16` for FP16 mixed-precision training and `bf16` for BF16 mixed-precision training.
```
To be able to tweak more options, you will need to use a DeepSpeed config file.
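The same plugin settings can also be constructed in code rather than through `accelerate config`; a hedged sketch, assuming the `DeepSpeedPlugin` dataclass accepts the fields listed above:

```python
from accelerate import Accelerator, DeepSpeedPlugin

# Field names mirror the CLI options above; adjust if your accelerate version differs.
deepspeed_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=1)
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=deepspeed_plugin)
```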
### DeepSpeed Config File
On your machine(s) just run:
```bash
accelerate config
```
and answer the questions asked. It will ask whether you want to use a config file for deepspeed to which you answer yes
and provide the path to the deepspeed config file.
This will generate a config file that will be used automatically to properly set the
default options when doing
```bash
accelerate launch my_script.py --args_to_my_script
```
For instance, here is how you would run the NLP example `examples/by_feature/deepspeed_with_config_support.py` (from the root of the repo) with DeepSpeed Config File:
**ZeRO Stage-2 DeepSpeed Config File Example**
```bash
compute_environment: LOCAL_MACHINE
deepspeed_config:
deepspeed_config_file: /home/ubuntu/accelerate/examples/configs/deepspeed_config_templates/zero_stage2_config.json
zero3_init_flag: true
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false
```
with the contents of `zero_stage2_config.json` being:
```json
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto",
"torch_adam": true,
"adam_w_mode": true
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": "auto",
"contiguous_gradients": true
},
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
```
```bash
accelerate launch examples/by_feature/deepspeed_with_config_support.py \
--config_name "gpt2-large" \
--tokenizer_name "gpt2-large" \
--dataset_name "wikitext" \
--dataset_config_name "wikitext-2-raw-v1" \
--block_size 128 \
--output_dir "./clm/clm_deepspeed_stage2_accelerate" \
--learning_rate 5e-4 \
--per_device_train_batch_size 24 \
--per_device_eval_batch_size 24 \
--num_train_epochs 3 \
--with_tracking \
--report_to "wandb"\
```
**ZeRO Stage-3 with CPU offload DeepSpeed Config File Example**
```bash
compute_environment: LOCAL_MACHINE
deepspeed_config:
deepspeed_config_file: /home/ubuntu/accelerate/examples/configs/deepspeed_config_templates/zero_stage3_offload_config.json
zero3_init_flag: true
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false
```
with the contents of `zero_stage3_offload_config.json` being:
```json
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"sub_group_size": 1e9,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": "auto"
},
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
```
```bash
accelerate launch examples/by_feature/deepspeed_with_config_support.py \
--config_name "gpt2-large" \
--tokenizer_name "gpt2-large" \
--dataset_name "wikitext" \
--dataset_config_name "wikitext-2-raw-v1" \
--block_size 128 \
--output_dir "./clm/clm_deepspeed_stage3_offload_accelerate" \
--learning_rate 5e-4 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 32 \
--num_train_epochs 3 \
--with_tracking \
--report_to "wandb"\
```
**Important code changes when using DeepSpeed Config File**
1. DeepSpeed Optimizers and Schedulers. For more information on these,
see the [DeepSpeed Optimizers](https://deepspeed.readthedocs.io/en/latest/optimizers.html) and [DeepSpeed Schedulers](https://deepspeed.readthedocs.io/en/latest/schedulers.html) documentation.
We will look at the changes needed in the code when using these.
a. DS Optim + DS Scheduler: The case when both `optimizer` and `scheduler` keys are present in the DeepSpeed config file.
In this situation, those will be used and the user has to use `accelerate.utils.DummyOptim` and `accelerate.utils.DummyScheduler` to replace the PyTorch/Custom optimizers and schedulers in their code.
Below is the snippet from `examples/by_feature/deepspeed_with_config_support.py` showing this:
```python
# Creates Dummy Optimizer if `optimizer` was specified in the config file else creates Adam Optimizer
optimizer_cls = (
torch.optim.AdamW
if accelerator.state.deepspeed_plugin is None
or "optimizer" not in accelerator.state.deepspeed_plugin.deepspeed_config
else DummyOptim
)
optimizer = optimizer_cls(optimizer_grouped_parameters, lr=args.learning_rate)
# Creates Dummy Scheduler if `scheduler` was specified in the config file else creates `args.lr_scheduler_type` Scheduler
if (
accelerator.state.deepspeed_plugin is None
or "scheduler" not in accelerator.state.deepspeed_plugin.deepspeed_config
):
lr_scheduler = get_scheduler(
name=args.lr_scheduler_type,
optimizer=optimizer,
num_warmup_steps=args.num_warmup_steps,
num_training_steps=args.max_train_steps,
)
else:
lr_scheduler = DummyScheduler(
optimizer, total_num_steps=args.max_train_steps, warmup_num_steps=args.num_warmup_steps
)
```
b. Custom Optim + Custom Scheduler: The case when both `optimizer` and `scheduler` keys are absent in the DeepSpeed config file.
In this situation, no code changes are needed from the user; this is the same situation as when using the integration via the DeepSpeed Plugin.
In the above example we can see that the code remains unchanged if the `optimizer` and `scheduler` keys are absent in the DeepSpeed config file.
c. Custom Optim + DS Scheduler: The case when only `scheduler` key is present in the DeepSpeed config file.
In this situation, the user has to use `accelerate.utils.DummyScheduler` to replace the PyTorch/Custom scheduler in their code (see the sketch after this list).
d. DS Optim + Custom Scheduler: The case when only `optimizer` key is present in the DeepSpeed config file.
This will result in an error because one can only use DS Scheduler when using DS Optim.
2. Notice the `auto` values in the above example DeepSpeed config files. These are automatically handled by the `prepare` method
based on the model, dataloaders, dummy optimizer and dummy scheduler provided to it.
Only the `auto` fields specified in the above examples are handled by the `prepare` method; the rest have to be specified explicitly by the user.
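To make case (c) above concrete, below is a minimal sketch (not taken from the example script) assuming `model`, `train_dataloader`, an `accelerator` object and an `args` namespace with the usual fields already exist; the DeepSpeed config file defines a `scheduler` entry but no `optimizer` entry, so a regular PyTorch optimizer is paired with `DummyScheduler`:
```python
import torch
from accelerate.utils import DummyScheduler

# Custom Optim + DS Scheduler: only the `scheduler` key is present in the DeepSpeed config file.
optimizer = torch.optim.AdamW(model.parameters(), lr=args.learning_rate)
lr_scheduler = DummyScheduler(
    optimizer, total_num_steps=args.max_train_steps, warmup_num_steps=args.num_warmup_steps
)
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)
```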
## Saving and loading
1. Saving and loading of models is unchanged for ZeRO Stage-1 and Stage-2.
2. Under ZeRO Stage-3, `state_dict` contains just the placeholders since the model weights are partitioned across multiple GPUs.
ZeRO Stage-3 has 2 options:
a. Saving the entire 16bit model weights to directly load later on using `model.load_state_dict(torch.load("pytorch_model.bin"))`.
For this, either set `zero_optimization.stage3_gather_16bit_weights_on_model_save` to True in the DeepSpeed Config file or set
`zero3_save_16bit_model` to True in the DeepSpeed Plugin.
**Note that this option requires consolidation of the weights on one GPU; it can be slow and memory demanding, so only use this feature when needed.**
Below is the snippet from `examples/by_feature/deepspeed_with_config_support.py` showing this:
```python
unwrapped_model = accelerator.unwrap_model(model)
# New Code #
# Saves the whole/unpartitioned fp16 model when in ZeRO Stage-3 to the output directory if
# `stage3_gather_16bit_weights_on_model_save` is True in DeepSpeed Config file or
# `zero3_save_16bit_model` is True in DeepSpeed Plugin.
# For Zero Stages 1 and 2, models are saved as usual in the output directory.
# The model name saved is `pytorch_model.bin`
unwrapped_model.save_pretrained(
args.output_dir,
is_main_process=accelerator.is_main_process,
save_function=accelerator.save,
state_dict=accelerator.get_state_dict(model),
)
```
b. To get 32bit weights, first save the model using `model.save_checkpoint()`.
Below is the snippet from `examples/by_feature/deepspeed_with_config_support.py` showing this:
```python
success = model.save_checkpoint(PATH, ckpt_id, checkpoint_state_dict)
status_msg = "checkpointing: PATH={}, ckpt_id={}".format(PATH, ckpt_id)
if success:
logging.info(f"Success {status_msg}")
else:
logging.warning(f"Failure {status_msg}")
```
This will create ZeRO model and optimizer partitions along with the `zero_to_fp32.py` script in the checkpoint directory.
One can use this script to do offline consolidation.
It requires no configuration files or GPUs. Here is an example of its usage:
```bash
$ cd /path/to/checkpoint_dir
$ ./zero_to_fp32.py . pytorch_model.bin
Processing zero checkpoint at global_step1
Detected checkpoint of type zero stage 3, world_size: 2
Saving fp32 state dict to pytorch_model.bin (total_numel=60506624)
```
To get the 32bit model for saving/inference, one can do the following:
```python
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
unwrapped_model = accelerator.unwrap_model(model)
fp32_model = load_state_dict_from_zero_checkpoint(unwrapped_model, checkpoint_dir)
```
If one is only interested in the `state_dict`, one can do the following:
```python
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)
```
Note that all these functions require ~2x the memory (general RAM) of the size of the final checkpoint.
## ZeRO Inference
DeepSpeed ZeRO Inference supports ZeRO stage 3 with ZeRO-Infinity.
It uses the same ZeRO protocol as training, but it doesn't use an optimizer or an LR scheduler, and only stage 3 is relevant.
With the 🤗 Accelerate integration, you just have to prepare the model and dataloader as shown below:
```python
model, eval_dataloader = accelerator.prepare(model, eval_dataloader)
```
## A few caveats to be aware of
1. The current integration doesn't support Pipeline Parallelism of DeepSpeed.
2. The current integration doesn't support `mpu`, limiting the tensor parallelism which is supported in Megatron-LM.
3. The current integration doesn't support multiple models for a given `accelerator` object.
## Internals
[[autodoc]] utils.DeepSpeedPlugin
[[autodoc]] utils.DummyOptim
[[autodoc]] utils.DummyScheduler
[[autodoc]] utils.DeepSpeedEngineWrapper
[[autodoc]] utils.DeepSpeedOptimizerWrapper
[[autodoc]] utils.DeepSpeedSchedulerWrapper
## Main DeepSpeed Resources
- [Project's github](https://github.com/microsoft/deepspeed)
- [Usage docs](https://www.deepspeed.ai/getting-started/)
- [API docs](https://deepspeed.readthedocs.io/en/latest/index.html)
- [Blog posts](https://www.microsoft.com/en-us/research/search/?q=deepspeed)
Papers:
- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)
- [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840)
- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
Finally, please remember that 🤗 `Accelerate` only integrates DeepSpeed; therefore, if you
have any problems or questions regarding DeepSpeed usage, please file an issue with [DeepSpeed GitHub](https://github.com/microsoft/DeepSpeed/issues).

View File

@ -18,7 +18,7 @@ To read more about it and the benefits, check out the [Fully Sharded Data Parall
We have integrated the latest PyTorch's Fully Sharded Data Parallel (FSDP) training feature.
All you need to do is enable it through the config.
## How it works out the box
## How it works out of the box
On your machine(s) just run:
@ -57,7 +57,7 @@ use_cpu: false
accelerate launch examples/nlp_example.py
```
Currently, `Accelerate` supports following config through the CLI:
Currently, `Accelerate` supports the following config through the CLI:
```bash
`Sharding Strategy`: [1] FULL_SHARD, [2] SHARD_GRAD_OP
@ -65,11 +65,11 @@ Currently, `Accelerate` supports following config through the CLI:
`Offload Params`: Decides Whether to offload parameters and gradients to CPU.
```
## Few caveats to be aware of
## A few caveats to be aware of
- PyTorch FSDP auto wraps sub-modules, flattens the parameters and shards the parameters in place.
Due to this, any optimizer created before model wrapping gets broken and occupies more memory.
Hence, it is highly recommended and efficient to prepare model before creating optimizer.
Hence, it is highly recommended and efficient to prepare the model before creating the optimizer.
`Accelerate` will automatically wrap the model and create an optimizer for you in case of single model with a warning message.
> FSDP Warning: When using FSDP, it is efficient and recommended to call prepare for the model before creating the optimizer
@ -91,14 +91,14 @@ optimizer = torch.optim.AdamW(params=model.parameters(), lr=lr)
```
- In case of a single model, if you have created optimizer with multiple parameter groups and called prepare with them together,
- In case of a single model, if you have created the optimizer with multiple parameter groups and called prepare with them together,
then the parameter groups will be lost and the following warning is displayed:
> FSDP Warning: When using FSDP, several parameter groups will be conflated into
> a single one due to nested module wrapping and parameter flattening.
This is because parameter groups created before wrapping will have no meaning post wrapping due parameter flattening of nested FSDP modules into 1D arrays (which can consume many layers).
For instance, below are the named parameters of FSDP model on GPU 0 (When using 2 GPUs. Around 55M (110M/2) params in 1D arrays as this will have the 1st shard of the parameters).
Here, if one has applied no weight decay for [bias, LayerNorm.weight] named parameters of unwrapped BERT model,
This is because parameter groups created before wrapping will have no meaning post wrapping due to parameter flattening of nested FSDP modules into 1D arrays (which can consume many layers).
For instance, below are the named parameters of an FSDP model on GPU 0 (When using 2 GPUs. Around 55M (110M/2) params in 1D arrays as this will have the 1st shard of the parameters).
Here, if one has applied no weight decay for [bias, LayerNorm.weight] the named parameters of an unwrapped BERT model,
it can't be applied to the below FSDP wrapped model as there are no named parameters with either of those strings and
the parameters of those layers are concatenated with parameters of various other layers.
```
@ -110,7 +110,7 @@ optimizer = torch.optim.AdamW(params=model.parameters(), lr=lr)
```
- In case of multiple models, it is necessary to prepare the models before creating optimizers else it will throw an error.
- In case of multiple models, it is necessary to prepare the models before creating optimizers or else it will throw an error.
- Mixed precision is currently not supported with FSDP.
For more control, users can leverage the `FullyShardedDataParallelPlugin` wherein they can specify `auto_wrap_policy`, `backward_prefetch` and `ignored_modules`.

View File

@ -0,0 +1,126 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Performing gradient accumulation with 🤗 Accelerate
Gradient accumulation is a technique where you can train on bigger batch sizes than
your machine would normally be able to fit into memory. This is done by accumulating gradients over
several batches, and only stepping the optimizer after a certain number of batches have been performed.
While technically standard gradient accumulation code would work fine in a distributed setup, it is not the most efficient
method for doing so and you may experience considerable slowdowns!
In this tutorial you will see how to quickly set up gradient accumulation and perform it with the utilities provided in 🤗 Accelerate,
which can amount to adding just one new line of code!
This example will use a very simplistic PyTorch training loop that performs gradient accumulation every two batches:
```python
device = "cuda"
model.to(device)
gradient_accumulation_steps = 2
for index, batch in enumerate(training_dataloader):
optimizer.zero_grad()
inputs, targets = batch
inputs = inputs.to(device)
targets = targets.to(device)
outputs = model(inputs)
loss = loss_function(outputs, targets)
loss = loss / gradient_accumulation_steps
loss.backward()
if (index + 1) % gradient_accumulation_steps == 0:
optimizer.step()
scheduler.step()
```
## Converting it to 🤗 Accelerate
First, the code shown earlier will be converted to utilize 🤗 Accelerate without the special gradient accumulation helper:
```diff
+ from accelerate import Accelerator
+ accelerator = Accelerator()
+ model, optimizer, training_dataloader, scheduler = accelerator.prepare(
+ model, optimizer, training_dataloader, scheduler
+ )
for index, batch in enumerate(training_dataloader):
optimizer.zero_grad()
inputs, targets = batch
- inputs = inputs.to(device)
- targets = targets.to(device)
outputs = model(inputs)
loss = loss_function(outputs, targets)
loss = loss / gradient_accumulation_steps
+ accelerator.backward(loss)
if (index+1) % gradient_accumulation_steps == 0:
optimizer.step()
scheduler.step()
```
<Tip warning={true}>
In its current state, this code is not going to perform gradient accumulation efficiently due to a process called gradient synchronization.
</Tip>
## Letting 🤗 Accelerate handle gradient accumulation
All that is left now is to let 🤗 Accelerate handle the gradient accumulation for us. To do so, you should pass a `gradient_accumulation_steps` parameter to [`Accelerator`], dictating the number
of steps to perform before each call to `step()` and how to automatically adjust the loss during the call to [`Accelerator.backward`]:
```diff
from accelerate import Accelerator
- accelerator = Accelerator()
+ accelerator = Accelerator(gradient_accumulation_steps=2)
```
From here you can use the [`Accelerator.accumulate`] context manager from inside your training loop to automatically perform the gradient accumulation for you!
You just wrap it around the entire training part of your code:
```diff
- for index, batch in enumerate(training_dataloader):
+ for batch in training_dataloader:
+ with accelerator.accumulate(model):
optimizer.zero_grad()
inputs, targets = batch
outputs = model(inputs)
```
and you can remove all the special checks for the step number and the loss adjustment:
```diff
- loss = loss / gradient_accumulation_steps
accelerator.backward(loss)
- if (index+1) % gradient_accumulation_steps == 0:
optimizer.step()
scheduler.step()
```
As you can see, the [`Accelerator`] is able to keep track of the batch number you are on, and it will automatically know whether to step the prepared optimizer and how to adjust the loss.
## The finished code
Below is the finished implementation for performing gradient accumulation with 🤗 Accelerate:
```python
for batch in training_dataloader:
with accelerator.accumulate(model):
optimizer.zero_grad()
inputs, targets = batch
outputs = model(inputs)
loss = loss_function(outputs, targets)
accelerator.backward(loss)
optimizer.step()
scheduler.step()
```
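For reference, here is the same finished loop with the [`Accelerator`] setup from the diffs above spelled out; this is only a sketch and assumes `model`, `optimizer`, `training_dataloader`, `scheduler` and `loss_function` are already defined:
```python
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=2)
model, optimizer, training_dataloader, scheduler = accelerator.prepare(
    model, optimizer, training_dataloader, scheduler
)

for batch in training_dataloader:
    # `accumulate` handles gradient synchronization and the loss adjustment for you.
    with accelerator.accumulate(model):
        optimizer.zero_grad()
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
```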

View File

@ -12,16 +12,16 @@ specific language governing permissions and limitations under the License.
# Accelerate
Run your *raw* PyTorch training script on any kind of device
Run your *raw* PyTorch training script on any kind of device.
## Features
- 🤗 Accelerate provides an easy API to make your scripts run with mixed precision and on any kind of distributed
setting (multi-GPUs, TPUs etc.) while still letting you write your own training loop. The same code can then runs
- 🤗 Accelerate provides an easy API to make your scripts run with mixed precision and in any kind of distributed
setting (multi-GPUs, TPUs etc.) while still letting you write your own training loop. The same code can then run
seamlessly on your local machine for debugging or your training environment.
- 🤗 Accelerate also provides a CLI tool that allows you to quickly configure and test your training environment then
launch the scripts.
- 🤗 Accelerate also provides a CLI tool that allows you to quickly configure and test your training environment and
then launch the scripts.
## Easy to integrate

View File

@ -57,7 +57,7 @@ pip install git+https://github.com/huggingface/accelerate
Note that this will install not the latest released version, but the bleeding edge `main` version, which you may want to use in case a bug has been fixed since the last official release and a new release hasn't been yet rolled out.
While we strive to keep `main` operational at all times, if you notice some issues, they usually get fixed within a few hours or a day and and you're more than welcome to help us detect any problems by opening an [Issue](https://github.com/huggingface/accelerate/issues) and this way, things will get fixed even sooner.
While we strive to keep `main` operational at all times, if you notice some issues, they usually get fixed within a few hours or a day and you're more than welcome to help us detect any problems by opening an [Issue](https://github.com/huggingface/accelerate/issues) and this way, things will get fixed even sooner.
Again, you can run:
@ -85,7 +85,7 @@ now this editable install will reside where you clone the folder to, e.g. `~/acc
Do note that you have to keep that `accelerate` folder around and not delete it to continue using the 🤗 Accelerate library.
Now, let's get to the real benefit of this installation approach. Say, you saw some new feature has been just committed into `main`. If you have already performed all the steps above, to update your accelerate repo to include all the latest commits, all you need to do is to `cd` into that cloned repository folder and update the clone to the latest version:
Now, let's get to the real benefit of this installation approach. Say, you saw some new feature just has been committed into `main`. If you have already performed all the steps above, to update your accelerate repo to include all the latest commits, all you need to do is to `cd` into that cloned repository folder and update the clone to the latest version:
```bash
cd ~/accelerate/

View File

@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
# Internals
## Gradient Accumulation states
[[autodoc]] state.GradientState
## Optimizer
[[autodoc]] optimizer.AcceleratedOptimizer
@ -22,7 +26,7 @@ The main work on your PyTorch `DataLoader` is done by the following function:
[[autodoc]] data_loader.prepare_data_loader
### BatchSamplerShard
### DataLoaderShard
[[autodoc]] data_loader.DataLoaderShard
@ -44,28 +48,6 @@ The main work on your PyTorch `DataLoader` is done by the following function:
[[autodoc]] state.AcceleratorState
### DistributedType
[[autodoc]] state.DistributedType
## Tracking
[[autodoc]] tracking.GeneralTracker
## Utilities
[[autodoc]] utils.extract_model_from_parallel
[[autodoc]] utils.gather
[[autodoc]] utils.send_to_device
[[autodoc]] utils.set_seed
[[autodoc]] utils.synchronize_rng_state
[[autodoc]] utils.synchronize_rng_states
[[autodoc]] utils.wait_for_everyone
[[autodoc]] utils.write_basic_config

View File

@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.
# Quick tour
Let's have a look at a look at 🤗 Accelerate main features and traps to avoid.
Let's have a look at the 🤗 Accelerate main features and traps to avoid.
## Main use
@ -54,7 +54,7 @@ model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
)
```
In particular, your training dataloader will be sharded accross all GPUs/TPU cores available so that each one sees a
In particular, your training dataloader will be sharded across all GPUs/TPU cores available so that each one sees a
different portion of the training dataset. Also, the random states of all processes will be synchronized at the
beginning of each iteration through your dataloader, to make sure the data is shuffled the same way (if you decided to
use `shuffle=True` or any kind of random sampler).
@ -118,7 +118,7 @@ method:
validation_dataloader = accelerator.prepare(validation_dataloader)
```
Like for your training dataloader, it will mean that (should you run your script on multiple devices) each device will
As for your training dataloader, it will mean that (should you run your script on multiple devices) each device will
only see part of the evaluation data. This means you will need to group your predictions together. This is very easy to
do with the [`~Accelerator.gather`] method.
@ -134,8 +134,8 @@ for inputs, targets in validation_dataloader:
<Tip warning={true}>
Like for the training dataloader, passing your validation dataloader through
[`~Accelerator.prepare`] may change its: if you run on X GPUs, it will have its length divided by X
As for the training dataloader, passing your validation dataloader through
[`~Accelerator.prepare`] may change it: if you run on X GPUs, it will have its length divided by X
(since your actual batch size will be multiplied by X), unless you set `split_batches=True`.
Any instruction using your training dataloader length (for instance if you need the number of total training steps
@ -159,7 +159,7 @@ PyTorch), they are fully compatible with 🤗 Accelerate. The only caveat here i
to determine all useful information, so `torch.distributed.launch` should be used with the flag `--use_env`.
🤗 Accelerate also provides a CLI tool that unifies all launcher, so you only have to remember one command. To use it,
just run
just run:
```bash
accelerate config
@ -175,7 +175,7 @@ on your machine and reply to the questions asked. This will save a *default_conf
You can also specify with the flag `--config_file` the location of the file you want to save.
Once this is done, you can test everything is going well on your setup by running
Once this is done, you can test everything is going well on your setup by running:
```bash
accelerate test
@ -235,14 +235,14 @@ step). This is why your first step of training will always be very long as build
optimizations takes some time.
The good news is that this compilation will be cached so the second step and all the following will be much faster. The
bas news is that it only applies if all of your steps do exactly the same operations, which implies:
bad news is that it only applies if all of your steps do exactly the same operations, which implies:
- having all tensors of the same length in all your lengths
- having static code (i.e., not a for loop of length that could change from step to step)
Having any of the things above change between two steps will trigger a new compilation which will, once again, take a
lot of time. In practice, that means you must take special care to have all your tensors in your inputs of the same
shape (so no dynamic padding for instance if you are in an NLP problem) and should not use layer with for loops that
shape (so no dynamic padding for instance if you are in an NLP problem) and should not use layers with for loops that
have different lengths depending on the inputs (such as an LSTM) or the training will be excruciatingly slow.
To introduce special behavior in your script for TPUs you can check the `distributed_type` of your
@ -257,10 +257,10 @@ else:
# go crazy and be dynamic
```
The [NLP example](https://github.com/huggingface/accelerate/blob/main/examples/nlp_example.py) shows an example in
The [NLP example](https://github.com/huggingface/accelerate/blob/main/examples/nlp_example.py) shows an example in a
situation with dynamic padding.
One last thing to pay close attnetion to: if your model has tied weights (such as language models which tie the weights
One last thing to pay close attention to: if your model has tied weights (such as language models which tie the weights
of the embedding matrix with the weights of the decoder), moving this model to the TPU (either yourself or after you
passed your model to [`~Accelerator.prepare`]) will break the tying. You will need to retie the weights
after. You can find an example of this in the [run_clm_no_trainer](https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm.py) script in
@ -317,8 +317,8 @@ following line in your code:
accelerator.wait_for_everyone()
```
This instruction will block all the processes that arrive them first until all the other processes have reached that
point (if you run your script on just one GPU or CPU, this wont' do anything).
This instruction will block all the processes that arrive first until all the other processes have reached that
point (if you run your script on just one GPU or CPU, this won't do anything).
### Saving/loading a model
@ -338,7 +338,7 @@ unwrapped_model = accelerator.unwrap_model(model)
accelerator.save(unwrapped_model.state_dict(), filename)
```
If your script contains a logic to load checkpoint, we also recommend you load your weights in the unwrapped model
If your script contains logic to load a checkpoint, we also recommend you load your weights in the unwrapped model
(this is only useful if you use the load function after making your model go through
[`~Accelerator.prepare`]). Here is an example:
@ -368,7 +368,7 @@ and `accelerator.clip_grad_value_` respectively.
### Mixed Precision training
If you are running your training in Mixed Precision with Accelerate, you will get the best result with your loss being
If you are running your training in Mixed Precision with 🤗 Accelerate, you will get the best result with your loss being
computed inside your model (like in Transformer models for instance). Every computation outside of the model will be
executed in full precision (which is generally what you want for loss computation, expecially if it involves a
softmax). However you might want to put your loss computation inside the *accelerator.autocast* context manager:
@ -438,14 +438,14 @@ The random number generator synchronization will by default synchronize:
- the main random number generator in PyTorch <=1.5.1
You can choose which random number generator(s) to synchronize with the `rng_types` argument of the main
[`Accelerator`]. In PyTorch >= 1.6, it is recommended to rely on local `generator` to avoid
[`Accelerator`]. In PyTorch >= 1.6, it is recommended to rely on a local `generator` to avoid
setting the same seed in the main random number generator in all processes.
<Tip warning={true}>
Synchronization the main torch (or CUDA or XLA) random number generator will affect any other potential random
artifacts you could have in your dataset (like random data augmentation) in the sense all processes will get the
same random numbers from the torch random modules (so will apply the same random data augmentation if it's
Synchronization of the main torch (or CUDA or XLA) random number generator will affect any other potential random
artifacts you could have in your dataset (like random data augmentation) in the sense that all processes will get
the same random numbers from the torch random modules (so will apply the same random data augmentation if it's
controlled by torch).
</Tip>
@ -457,4 +457,4 @@ The randomization part of your custom sampler, batch sampler or iterable dataset
</Tip>
See more details about the internal in the [Internals page](internal).
For more details about the internals, see the [Internals page](internal).

View File

@ -23,7 +23,7 @@ make it easier than ever to train Hugging Face Transformer models in [Amazon Sag
Before you can run your 🤗 Accelerate scripts on Amazon SageMaker you need to sign up for an AWS account. If you do not
have an AWS account yet learn more [here](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-set-up.html).
After you have your AWS Account you need to install the `sagemaker` sdk for 🤗 Accelerate with.
After you have your AWS Account you need to install the `sagemaker` sdk for 🤗 Accelerate with:
```bash
pip install "accelerate[sagemaker]" --upgrade
@ -31,7 +31,7 @@ pip install "accelerate[sagemaker]" --upgrade
🤗 Accelerate currently uses the 🤗 DLCs, with `transformers`, `datasets` and `tokenizers` pre-installed. 🤗
Accelerate is not in the DLC yet (will soon be added!) so to use it within Amazon SageMaker you need to create a
`requirements.txt` in the same directory where your training script is located and add it as dependency.
`requirements.txt` in the same directory where your training script is located and add it as dependency:
```
accelerate
@ -43,7 +43,7 @@ You should also add any other dependencies you have to this `requirements.txt`.
### Configure 🤗 Accelerate
You can configure the launch configuration for Amazon SageMaker the same as you do for non SageMaker training jobs with
the 🤗 Accelerate CLI.
the 🤗 Accelerate CLI:
```bash
accelerate config
@ -62,7 +62,7 @@ accelerate config
The training script is very similar to a training script you might run outside of SageMaker, but to save your model
after training you need to specify either `/opt/ml/model` or use `os.environ["SM_MODEL_DIR"]` as your save
directory. After training, artifacts in this directory are uploaded to S3.
directory. After training, artifacts in this directory are uploaded to S3:
```diff
@ -79,7 +79,7 @@ specify type as bool in your script and provide an explicit True or False value
### Launch Training
You can launch your training with 🤗 Accelerate CLI with
You can launch your training with 🤗 Accelerate CLI with:
```
accelerate launch path_to_script.py --args_to_the_script

View File

@ -13,7 +13,7 @@ specific language governing permissions and limitations under the License.
# Tracking
There are a large number of experiment tracking API's available, however getting them all to work with in a multi-processing environment can oftentimes be complex.
Accelerate provides a general tracking API that can be used to log useful items during your script through [`~Accelerator.log`]
🤗 Accelerate provides a general tracking API that can be used to log useful items during your script through [`~Accelerator.log`]
## Integrated Trackers

docs/source/utilities.mdx Normal file
View File

@ -0,0 +1,91 @@
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Helpful Utilities
Below are a variety of utility functions that 🤗 Accelerate provides, broken down by use-case.
## Data Classes
These are basic dataclasses used throughout 🤗 Accelerate and they can be passed in as parameters.
[[autodoc]] utils.DistributedType
[[autodoc]] utils.LoggerType
[[autodoc]] utils.PrecisionType
## Data Manipulation and Operations
These include data operations that mimic the same `torch` ops but can be used on distributed processes.
[[autodoc]] utils.broadcast
[[autodoc]] utils.concatenate
[[autodoc]] utils.gather
[[autodoc]] utils.pad_across_processes
[[autodoc]] utils.reduce
[[autodoc]] utils.send_to_device
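As a small sketch of how these can be used together (assuming the script is launched on several processes with `accelerate launch`):
```python
import torch

from accelerate import Accelerator
from accelerate.utils import gather, send_to_device

accelerator = Accelerator()
# Each process builds a tensor on its own device...
local_values = torch.tensor([accelerator.process_index], device=accelerator.device)
# ...and `gather` collects the values from every process onto each process.
all_values = gather(local_values)
# `send_to_device` recursively moves tensors (or nested containers of tensors) to a given device.
all_values_cpu = send_to_device(all_values, "cpu")
```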
## Environment Checks
These functionalities check the state of the current working environment including information about the operating system itself, what it can support, and if particular dependencies are installed.
[[autodoc]] utils.get_max_memory
[[autodoc]] utils.is_bf16_available
[[autodoc]] utils.is_torch_version
[[autodoc]] utils.is_tpu_available
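For instance, a minimal sketch combining two of these checks (assuming their default arguments):
```python
from accelerate.utils import is_bf16_available, is_tpu_available

# Pick a mixed precision mode based on what the current environment supports.
mixed_precision = "bf16" if is_bf16_available() else "fp16"
print(f"TPU available: {is_tpu_available()}, chosen mixed precision: {mixed_precision}")
```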
## Environment Configuration
[[autodoc]] utils.write_basic_config
When setting up 🤗 Accelerate for the first time, rather than running `accelerate config`, [`~utils.write_basic_config`] can be used as an alternative for quick configuration.
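A minimal sketch (assuming the default arguments are acceptable for your machine):
```python
from accelerate.utils import write_basic_config

# Writes a basic config file for the current machine so that `accelerate launch`
# can be used without going through the interactive `accelerate config` questionnaire.
write_basic_config()
```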
## Modeling
These utilities relate to interacting with PyTorch models
[[autodoc]] utils.extract_model_from_parallel
[[autodoc]] utils.get_max_layer_size
[[autodoc]] utils.offload_state_dict
## Parallel
These include general utilities that should be used when working in parallel.
[[autodoc]] utils.extract_model_from_parallel
[[autodoc]] utils.save
[[autodoc]] utils.wait_for_everyone
## Random
These utilities relate to setting and synchronizing all the random states.
[[autodoc]] utils.set_seed
[[autodoc]] utils.synchronize_rng_state
[[autodoc]] utils.synchronize_rng_states
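For instance (a minimal sketch):
```python
from accelerate.utils import set_seed

# Sets the seeds of the relevant random number generators in one call, for reproducible runs.
set_seed(42)
```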

View File

@ -23,7 +23,7 @@ The [nlp_example.py](./nlp_example.py) script is a simple example to train a Ber
Prior to running it you should install 🤗 Dataset and 🤗 Transformers:
```bash
pip install datasets transformers
pip install datasets evaluate transformers
```
The same script can be run in any of the following configurations:

View File

@ -42,6 +42,18 @@ These arguments should be added at the end of any method for starting the python
accelerate launch ./checkpointing.py --checkpointing_steps epoch output_dir "checkpointing_tutorial" --resume_from_checkpoint "checkpointing_tutorial/epoch_0"
```
### Cross Validation (`cross_validation.py`)
- Shows how to use `Accelerator.free_memory` and run cross validation efficiently with `datasets`.
- Arguments available:
- `num_folds`, the number of folds the training dataset should be split into.
These arguments should be added at the end of any method for starting the python script (such as `python`, `accelerate launch`, `python -m torch.distributed.launch`), such as:
```bash
accelerate launch ./cross_validation.py --num_folds 2
```
### Experiment Tracking (`tracking.py`)
- Shows how to use `Accelerate.init_trackers` and `Accelerator.log`
@ -55,14 +67,14 @@ These arguments should be added at the end of any method for starting the python
accelerate launch ./tracking.py --with_tracking
```
### Cross Validation (`cross_validation.py`)
### Gradient Accumulation (`gradient_accumulation.py`)
- Shows how to use `Accelerator.free_memory` and run cross validation efficiently with `datasets`.
- Shows how to use `Accelerator.no_sync` to prevent gradient averaging in a distributed setup.
- Arguments available:
- `num_folds`, the number of folds the training dataset should be split into.
- `gradient_accumulation_steps`, the number of steps to perform before the gradients are accumulated and the optimizer and scheduler are stepped + zero_grad
These arguments should be added at the end of any method for starting the python script (such as `python`, `accelerate launch`, `python -m torch.distributed.launch`), such as:
```bash
accelerate launch ./cross_validation.py --num_folds 2
```
accelerate launch ./gradient_accumulation.py --gradient_accumulation_steps 5
```

View File

@ -16,17 +16,13 @@ import argparse
import os
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
import evaluate
from accelerate import Accelerator, DistributedType
from datasets import load_dataset, load_metric
from transformers import (
AdamW,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
set_seed,
)
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
########################################################################
@ -116,7 +112,6 @@ def training_function(config, args):
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
@ -137,11 +132,11 @@ def training_function(config, args):
set_seed(seed)
train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
metric = load_metric("glue", "mrpc")
metric = evaluate.load("glue", "mrpc")
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
if batch_size > MAX_GPU_BATCH_SIZE and accelerator.distributed_type != DistributedType.TPU:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
@ -154,7 +149,7 @@ def training_function(config, args):
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
optimizer = AdamW(params=model.parameters(), lr=lr)
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
@ -296,7 +291,7 @@ def main():
help="If the training should continue from a checkpoint folder.",
)
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
training_function(config, args)

View File

@ -17,21 +17,17 @@ from typing import List
import numpy as np
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
import evaluate
from accelerate import Accelerator, DistributedType
from datasets import DatasetDict, load_dataset, load_metric
from datasets import DatasetDict, load_dataset
# New Code #
# We'll be using StratifiedKFold for this example
from sklearn.model_selection import StratifiedKFold
from transformers import (
AdamW,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
set_seed,
)
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
########################################################################
@ -129,7 +125,6 @@ def get_fold_dataloaders(
def training_function(config, args):
# New Code #
test_labels = None
test_predictions = []
# Download the dataset
datasets = load_dataset("glue", "mrpc")
@ -140,15 +135,14 @@ def training_function(config, args):
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
metric = load_metric("glue", "mrpc")
metric = evaluate.load("glue", "mrpc")
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
if batch_size > MAX_GPU_BATCH_SIZE and accelerator.distributed_type != DistributedType.TPU:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
@ -157,17 +151,15 @@ def training_function(config, args):
# New Code #
# Create our folds:
folds = kfold.split(np.zeros(datasets["train"].num_rows), datasets["train"]["label"])
test_references = []
# Iterate over them
for train_idxs, valid_idxs in folds:
for i, (train_idxs, valid_idxs) in enumerate(folds):
train_dataloader, eval_dataloader, test_dataloader = get_fold_dataloaders(
accelerator,
datasets,
train_idxs,
valid_idxs,
)
if test_labels is None:
test_labels = datasets["validation"]["label"]
# Instantiate the model (we build the model here so that the seed also control new weights initialization)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
@ -177,7 +169,7 @@ def training_function(config, args):
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
optimizer = AdamW(params=model.parameters(), lr=lr)
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
@ -236,19 +228,18 @@ def training_function(config, args):
predictions = outputs.logits
predictions, references = accelerator.gather((predictions, batch["labels"]))
fold_predictions.append(predictions.cpu())
metric.add_batch(
predictions=predictions.argmax(dim=-1),
references=references,
)
test_metric = metric.compute()
if i == 0:
# We need all of the test predictions
test_references.append(references.cpu())
# Use accelerator.print to print only on the main process.
test_predictions.append(torch.cat(fold_predictions, dim=0))
# We now need to release all our memory and get rid of the current model, optimizer, etc
accelerator.free_memory()
# New Code #
# Finally we check the accuracy of our folded results:
preds = torch.stack(test_predictions, dim=0).sum(dim=0).div(int(config["n_splits"])).argmax(dim=-1)
test_metric = metric.compute(predictions=preds, references=test_labels)
test_references = torch.cat(test_references, dim=0)
preds = torch.stack(test_predictions, dim=0).sum(dim=0).div(int(args.num_folds)).argmax(dim=-1)
test_metric = metric.compute(predictions=preds, references=test_references)
accelerator.print("Average test metrics from all folds:", test_metric)
@ -267,7 +258,7 @@ def main():
# New Code #
parser.add_argument("--num_folds", type=int, default=3, help="The number of splits to perform across the dataset")
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
training_function(config, args)

View File

@ -0,0 +1,736 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Fine-tuning the library models for causal language modeling (GPT, GPT-2, CTRL, ...)
on a text file or a dataset without using HuggingFace Trainer.
Here is the full list of checkpoints on the hub that can be fine-tuned by this script:
https://huggingface.co/models?filter=text-generation
"""
# You can also adapt this script on your own causal language modeling task. Pointers for this are left as comments.
import argparse
import json
import logging
import math
import os
import random
from itertools import chain
from pathlib import Path
import torch
from torch.utils.data import DataLoader
import datasets
import transformers
from accelerate import Accelerator, DistributedType
from accelerate.logging import get_logger
from accelerate.utils import DummyOptim, DummyScheduler, set_seed
from datasets import load_dataset
from huggingface_hub import Repository
from tqdm.auto import tqdm
from transformers import (
CONFIG_MAPPING,
MODEL_MAPPING,
AutoConfig,
AutoModelForCausalLM,
AutoTokenizer,
SchedulerType,
default_data_collator,
get_scheduler,
)
from transformers.utils import get_full_repo_name
from transformers.utils.versions import require_version
logger = get_logger(__name__)
require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/language-modeling/requirements.txt")
MODEL_CONFIG_CLASSES = list(MODEL_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
def parse_args():
parser = argparse.ArgumentParser(description="Finetune a transformers model on a causal language modeling task")
parser.add_argument(
"--dataset_name",
type=str,
default=None,
help="The name of the dataset to use (via the datasets library).",
)
parser.add_argument(
"--dataset_config_name",
type=str,
default=None,
help="The configuration name of the dataset to use (via the datasets library).",
)
parser.add_argument(
"--train_file", type=str, default=None, help="A csv or a json file containing the training data."
)
parser.add_argument(
"--validation_file", type=str, default=None, help="A csv or a json file containing the validation data."
)
parser.add_argument(
"--validation_split_percentage",
default=5,
help="The percentage of the train set used as validation set in case there's no validation split",
)
parser.add_argument(
"--model_name_or_path",
type=str,
help="Path to pretrained model or model identifier from huggingface.co/models.",
required=False,
)
parser.add_argument(
"--config_name",
type=str,
default=None,
help="Pretrained config name or path if not the same as model_name",
)
parser.add_argument(
"--tokenizer_name",
type=str,
default=None,
help="Pretrained tokenizer name or path if not the same as model_name",
)
parser.add_argument(
"--use_slow_tokenizer",
action="store_true",
help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
)
parser.add_argument(
"--per_device_train_batch_size",
type=int,
default=8,
help="Batch size (per device) for the training dataloader.",
)
parser.add_argument(
"--per_device_eval_batch_size",
type=int,
default=8,
help="Batch size (per device) for the evaluation dataloader.",
)
parser.add_argument(
"--learning_rate",
type=float,
default=5e-5,
help="Initial learning rate (after the potential warmup period) to use.",
)
parser.add_argument("--weight_decay", type=float, default=0.0, help="Weight decay to use.")
parser.add_argument("--num_train_epochs", type=int, default=3, help="Total number of training epochs to perform.")
parser.add_argument(
"--max_train_steps",
type=int,
default=None,
help="Total number of training steps to perform. If provided, overrides num_train_epochs.",
)
parser.add_argument(
"--gradient_accumulation_steps",
type=int,
default=1,
help="Number of updates steps to accumulate before performing a backward/update pass.",
)
parser.add_argument(
"--lr_scheduler_type",
type=SchedulerType,
default="linear",
help="The scheduler type to use.",
choices=["linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup"],
)
parser.add_argument(
"--num_warmup_steps", type=int, default=0, help="Number of steps for the warmup in the lr scheduler."
)
parser.add_argument("--output_dir", type=str, default=None, help="Where to store the final model.")
parser.add_argument("--seed", type=int, default=None, help="A seed for reproducible training.")
parser.add_argument(
"--model_type",
type=str,
default=None,
help="Model type to use if training from scratch.",
choices=MODEL_TYPES,
)
parser.add_argument(
"--block_size",
type=int,
default=None,
help=(
"Optional input sequence length after tokenization. The training dataset will be truncated in block of"
" this size for training. Default to the model max input length for single sentence inputs (take into"
" account special tokens)."
),
)
parser.add_argument(
"--preprocessing_num_workers",
type=int,
default=None,
help="The number of processes to use for the preprocessing.",
)
parser.add_argument(
"--overwrite_cache", type=bool, default=False, help="Overwrite the cached training and evaluation sets"
)
parser.add_argument(
"--no_keep_linebreaks", action="store_true", help="Do not keep line breaks when using TXT files."
)
parser.add_argument("--push_to_hub", action="store_true", help="Whether or not to push the model to the Hub.")
parser.add_argument(
"--hub_model_id", type=str, help="The name of the repository to keep in sync with the local `output_dir`."
)
parser.add_argument("--hub_token", type=str, help="The token to use to push to the Model Hub.")
parser.add_argument(
"--checkpointing_steps",
type=str,
default=None,
help="Whether the various states should be saved at the end of every n steps, or 'epoch' for each epoch.",
)
parser.add_argument(
"--resume_from_checkpoint",
type=str,
default=None,
help="If the training should continue from a checkpoint folder.",
)
# New Code #
# Whether to load the best model at the end of training
parser.add_argument(
"--load_best_model",
action="store_true",
help="Whether to load the best model at the end of training",
)
parser.add_argument(
"--with_tracking",
action="store_true",
help="Whether to enable experiment trackers for logging.",
)
parser.add_argument(
"--report_to",
type=str,
default="all",
help=(
'The integration to report the results and logs to. Supported platforms are `"tensorboard"`,'
' `"wandb"` and `"comet_ml"`. Use `"all"` (default) to report to all integrations.'
"Only applicable when `--with_tracking` is passed."
),
)
args = parser.parse_args()
# Sanity checks
if args.dataset_name is None and args.train_file is None and args.validation_file is None:
raise ValueError("Need either a dataset name or a training/validation file.")
else:
if args.train_file is not None:
extension = args.train_file.split(".")[-1]
assert extension in ["csv", "json", "txt"], "`train_file` should be a csv, json or txt file."
if args.validation_file is not None:
extension = args.validation_file.split(".")[-1]
assert extension in ["csv", "json", "txt"], "`validation_file` should be a csv, json or txt file."
if args.push_to_hub:
assert args.output_dir is not None, "Need an `output_dir` to create a repo when `--push_to_hub` is passed."
return args
# New Code #
def checkpoint_model(checkpoint_folder, ckpt_id, model, epoch, last_global_step, **kwargs):
"""Utility function for checkpointing model + optimizer dictionaries
The main purpose for this is to be able to resume training from that instant again
"""
checkpoint_state_dict = {
"epoch": epoch,
"last_global_step": last_global_step,
}
# Add extra kwargs too
checkpoint_state_dict.update(kwargs)
success = model.save_checkpoint(checkpoint_folder, ckpt_id, checkpoint_state_dict)
status_msg = f"checkpointing: checkpoint_folder={checkpoint_folder}, ckpt_id={ckpt_id}"
if success:
logging.info(f"Success {status_msg}")
else:
logging.warning(f"Failure {status_msg}")
return
# New Code #
def load_training_checkpoint(model, load_dir, tag=None, **kwargs):
"""Utility function for checkpointing model + optimizer dictionaries
The main purpose for this is to be able to resume training from that instant again
"""
_, checkpoint_state_dict = model.load_checkpoint(load_dir, tag=tag, **kwargs)
epoch = checkpoint_state_dict["epoch"]
last_global_step = checkpoint_state_dict["last_global_step"]
del checkpoint_state_dict
return (epoch, last_global_step)
# New Code #
def evaluate(args, model, eval_dataloader, accelerator, eval_dataset):
model.eval()
losses = []
for step, batch in enumerate(eval_dataloader):
with torch.no_grad():
outputs = model(**batch)
loss = outputs.loss
losses.append(accelerator.gather(loss.repeat(args.per_device_eval_batch_size)))
losses = torch.cat(losses)
losses = losses[: len(eval_dataset)]
try:
eval_loss = torch.mean(losses)
perplexity = math.exp(eval_loss)
except OverflowError:
perplexity = float("inf")
return perplexity, eval_loss
def main():
args = parse_args()
# Initialize the accelerator. We will let the accelerator handle device placement for us in this example.
# If we're using tracking, we also need to initialize it here and it will by default pick up all supported trackers
# in the environment
accelerator = (
Accelerator(log_with=args.report_to, logging_dir=args.output_dir) if args.with_tracking else Accelerator()
)
# Make one log on every process with the configuration for debugging.
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO,
)
logger.info(accelerator.state, main_process_only=False)
if accelerator.is_local_main_process:
datasets.utils.logging.set_verbosity_warning()
transformers.utils.logging.set_verbosity_info()
else:
datasets.utils.logging.set_verbosity_error()
transformers.utils.logging.set_verbosity_error()
# If passed along, set the training seed now.
if args.seed is not None:
set_seed(args.seed)
# Handle the repository creation
if accelerator.is_main_process:
if args.push_to_hub:
if args.hub_model_id is None:
repo_name = get_full_repo_name(Path(args.output_dir).name, token=args.hub_token)
else:
repo_name = args.hub_model_id
repo = Repository(args.output_dir, clone_from=repo_name)
with open(os.path.join(args.output_dir, ".gitignore"), "w+") as gitignore:
if "step_*" not in gitignore:
gitignore.write("step_*\n")
if "epoch_*" not in gitignore:
gitignore.write("epoch_*\n")
elif args.output_dir is not None:
os.makedirs(args.output_dir, exist_ok=True)
accelerator.wait_for_everyone()
# Get the datasets: you can either provide your own CSV/JSON/TXT training and evaluation files (see below)
# or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/
# (the dataset will be downloaded automatically from the datasets Hub).
#
# For CSV/JSON files, this script will use the column called 'text' or the first column if no column called
# 'text' is found. You can easily tweak this behavior (see below).
#
# In distributed training, the load_dataset function guarantee that only one local process can concurrently
# download the dataset.
if args.dataset_name is not None:
# Downloading and loading a dataset from the hub.
raw_datasets = load_dataset(args.dataset_name, args.dataset_config_name)
if "validation" not in raw_datasets.keys():
raw_datasets["validation"] = load_dataset(
args.dataset_name,
args.dataset_config_name,
split=f"train[:{args.validation_split_percentage}%]",
)
raw_datasets["train"] = load_dataset(
args.dataset_name,
args.dataset_config_name,
split=f"train[{args.validation_split_percentage}%:]",
)
else:
data_files = {}
dataset_args = {}
if args.train_file is not None:
data_files["train"] = args.train_file
if args.validation_file is not None:
data_files["validation"] = args.validation_file
extension = args.train_file.split(".")[-1]
if extension == "txt":
extension = "text"
dataset_args["keep_linebreaks"] = not args.no_keep_linebreaks
raw_datasets = load_dataset(extension, data_files=data_files, **dataset_args)
# If no validation data is there, validation_split_percentage will be used to divide the dataset.
if "validation" not in raw_datasets.keys():
raw_datasets["validation"] = load_dataset(
extension,
data_files=data_files,
split=f"train[:{args.validation_split_percentage}%]",
**dataset_args,
)
raw_datasets["train"] = load_dataset(
extension,
data_files=data_files,
split=f"train[{args.validation_split_percentage}%:]",
**dataset_args,
)
# See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at
# https://huggingface.co/docs/datasets/loading_datasets.html.
# Load pretrained model and tokenizer
#
# In distributed training, the .from_pretrained methods guarantee that only one local process can concurrently
# download model & vocab.
if args.config_name:
config = AutoConfig.from_pretrained(args.config_name)
elif args.model_name_or_path:
config = AutoConfig.from_pretrained(args.model_name_or_path)
else:
config = CONFIG_MAPPING[args.model_type]()
logger.warning("You are instantiating a new config instance from scratch.")
if args.tokenizer_name:
tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, use_fast=not args.use_slow_tokenizer)
elif args.model_name_or_path:
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, use_fast=not args.use_slow_tokenizer)
else:
raise ValueError(
"You are instantiating a new tokenizer from scratch. This is not supported by this script."
"You can do it from another script, save it, and load it from here, using --tokenizer_name."
)
if args.model_name_or_path:
model = AutoModelForCausalLM.from_pretrained(
args.model_name_or_path,
from_tf=bool(".ckpt" in args.model_name_or_path),
config=config,
)
else:
logger.info("Training new model from scratch")
model = AutoModelForCausalLM.from_config(config)
model.resize_token_embeddings(len(tokenizer))
# Preprocessing the datasets.
# First we tokenize all the texts.
column_names = raw_datasets["train"].column_names
text_column_name = "text" if "text" in column_names else column_names[0]
def tokenize_function(examples):
return tokenizer(examples[text_column_name])
with accelerator.main_process_first():
tokenized_datasets = raw_datasets.map(
tokenize_function,
batched=True,
num_proc=args.preprocessing_num_workers,
remove_columns=column_names,
load_from_cache_file=not args.overwrite_cache,
desc="Running tokenizer on dataset",
)
if args.block_size is None:
block_size = tokenizer.model_max_length
if block_size > 1024:
logger.warning(
f"The tokenizer picked seems to have a very large `model_max_length` ({tokenizer.model_max_length}). "
"Picking 1024 instead. You can change that default value by passing --block_size xxx."
)
block_size = 1024
else:
if args.block_size > tokenizer.model_max_length:
logger.warning(
f"The block_size passed ({args.block_size}) is larger than the maximum length for the model"
f"({tokenizer.model_max_length}). Using block_size={tokenizer.model_max_length}."
)
block_size = min(args.block_size, tokenizer.model_max_length)
# Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
def group_texts(examples):
# Concatenate all texts.
concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
total_length = len(concatenated_examples[list(examples.keys())[0]])
# We drop the small remainder; we could add padding instead if the model supported it. You can
# customize this part to your needs.
if total_length >= block_size:
total_length = (total_length // block_size) * block_size
# Split by chunks of max_len.
result = {
k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
for k, t in concatenated_examples.items()
}
result["labels"] = result["input_ids"].copy()
return result
# Note that with `batched=True`, this map processes 1,000 texts together, so group_texts throws away a remainder
# for each of those groups of 1,000 texts. You can adjust that batch_size here but a higher value might be slower
# to preprocess.
#
# To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
# https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map
with accelerator.main_process_first():
lm_datasets = tokenized_datasets.map(
group_texts,
batched=True,
num_proc=args.preprocessing_num_workers,
load_from_cache_file=not args.overwrite_cache,
desc=f"Grouping texts in chunks of {block_size}",
)
train_dataset = lm_datasets["train"]
eval_dataset = lm_datasets["validation"]
# Log a few random samples from the training set:
for index in random.sample(range(len(train_dataset)), 3):
logger.info(f"Sample {index} of the training set: {train_dataset[index]}.")
# DataLoaders creation:
train_dataloader = DataLoader(
train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=args.per_device_train_batch_size
)
eval_dataloader = DataLoader(
eval_dataset, collate_fn=default_data_collator, batch_size=args.per_device_eval_batch_size
)
# Optimizer
# Split weights in two groups, one with weight decay and the other not.
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
{
"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
"weight_decay": args.weight_decay,
},
{
"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
"weight_decay": 0.0,
},
]
# New Code #
# Creates a Dummy Optimizer if `optimizer` was specified in the config file, else creates an AdamW optimizer
optimizer_cls = (
torch.optim.AdamW
if accelerator.state.deepspeed_plugin is None
or "optimizer" not in accelerator.state.deepspeed_plugin.deepspeed_config
else DummyOptim
)
optimizer = optimizer_cls(optimizer_grouped_parameters, lr=args.learning_rate)
# On TPU, the tied weights in our model have been disconnected, so we need to restore the ties.
if accelerator.distributed_type == DistributedType.TPU:
model.tie_weights()
# Scheduler and math around the number of training steps.
# New Code #
# Get gradient accumulation steps from deepspeed config if available
if accelerator.state.deepspeed_plugin is not None:
args.gradient_accumulation_steps = accelerator.state.deepspeed_plugin.deepspeed_config[
"gradient_accumulation_steps"
]
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
if args.max_train_steps is None:
args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
else:
args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)
# New Code #
# Creates a Dummy Scheduler if `scheduler` was specified in the config file, else creates a scheduler of type `args.lr_scheduler_type`
if (
accelerator.state.deepspeed_plugin is None
or "scheduler" not in accelerator.state.deepspeed_plugin.deepspeed_config
):
lr_scheduler = get_scheduler(
name=args.lr_scheduler_type,
optimizer=optimizer,
num_warmup_steps=args.num_warmup_steps,
num_training_steps=args.max_train_steps,
)
else:
lr_scheduler = DummyScheduler(
optimizer, total_num_steps=args.max_train_steps, warmup_num_steps=args.num_warmup_steps
)
# Prepare everything with our `accelerator`.
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
# We need to recalculate our total training steps as the size of the training dataloader may have changed.
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
# Figure out how often we should save the Accelerator states
if hasattr(args.checkpointing_steps, "isdigit"):
checkpointing_steps = args.checkpointing_steps
if args.checkpointing_steps.isdigit():
checkpointing_steps = int(args.checkpointing_steps)
else:
checkpointing_steps = None
# We need to initialize the trackers we use, and also store our configuration.
# We initialize the trackers only on main process because `accelerator.log`
# only logs on main process and we don't want empty logs/runs on other processes.
if args.with_tracking:
if accelerator.is_main_process:
experiment_config = vars(args)
# TensorBoard cannot log Enums, need the raw value
experiment_config["lr_scheduler_type"] = experiment_config["lr_scheduler_type"].value
accelerator.init_trackers("clm_no_trainer", experiment_config)
# Train!
total_batch_size = args.per_device_train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
logger.info("***** Running training *****")
logger.info(f" Num examples = {len(train_dataset)}")
logger.info(f" Num Epochs = {args.num_train_epochs}")
logger.info(f" Instantaneous batch size per device = {args.per_device_train_batch_size}")
logger.info(f" Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}")
logger.info(f" Gradient Accumulation steps = {args.gradient_accumulation_steps}")
logger.info(f" Total optimization steps = {args.max_train_steps}")
# Only show the progress bar once on each machine.
progress_bar = tqdm(range(args.max_train_steps), disable=not accelerator.is_local_main_process)
completed_steps = 0
starting_epoch = 0
best_metric = None
best_metric_checkpoint = None
# Potentially load in the weights and states from a previous save
if args.resume_from_checkpoint:
# New Code #
# Loads the DeepSpeed checkpoint from the specified path
_, last_global_step = load_training_checkpoint(
model,
args.resume_from_checkpoint,
**{"load_optimizer_states": True, "load_lr_scheduler_states": True},
)
accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}")
resume_step = last_global_step
starting_epoch = resume_step // len(train_dataloader)
resume_step -= starting_epoch * len(train_dataloader)
for epoch in range(starting_epoch, args.num_train_epochs):
model.train()
if args.with_tracking:
total_loss = 0
for step, batch in enumerate(train_dataloader):
# We need to skip steps until we reach the resumed step
if args.resume_from_checkpoint and epoch == starting_epoch:
if resume_step is not None and step < resume_step:
completed_steps += 1
continue
outputs = model(**batch)
loss = outputs.loss
# We keep track of the loss at each epoch
if args.with_tracking:
total_loss += loss.detach().float()
loss = loss / args.gradient_accumulation_steps
accelerator.backward(loss)
if step % args.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1:
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
progress_bar.update(1)
completed_steps += 1
if isinstance(checkpointing_steps, int):
if completed_steps % checkpointing_steps == 0:
output_dir = f"step_{completed_steps }"
if args.output_dir is not None:
output_dir = os.path.join(args.output_dir, output_dir)
accelerator.save_state(output_dir)
if completed_steps >= args.max_train_steps:
break
perplexity, eval_loss = evaluate(args, model, eval_dataloader, accelerator, eval_dataset)
logger.info(f"epoch {epoch}: perplexity: {perplexity} eval_loss: {eval_loss}")
if args.with_tracking:
accelerator.log(
{
"perplexity": perplexity,
"eval_loss": eval_loss,
"train_loss": total_loss.item() / len(train_dataloader),
"epoch": epoch,
"step": completed_steps,
},
step=completed_steps,
)
# New Code #
# Save the DeepSpeed checkpoint to the specified path
checkpoint_model(args.output_dir, epoch, model, epoch, completed_steps)
# New Code #
# Tracks the best checkpoint and best metric
if best_metric is None or best_metric > perplexity:
best_metric = perplexity
best_metric_checkpoint = os.path.join(args.output_dir, str(epoch))
accelerator.print(f"New best metric: {best_metric} at epoch {epoch}")
accelerator.print(f"best_metric_checkpoint: {best_metric_checkpoint}")
# New Code #
# Loads the best checkpoint after the training is finished
if args.load_best_model:
_, last_global_step = load_training_checkpoint(
model,
"/".join(best_metric_checkpoint.split("/")[:-1]),
tag=best_metric_checkpoint.split("/")[-1],
**{"load_optimizer_states": True, "load_lr_scheduler_states": True},
)
# New Code #
# Evaluates using the best checkpoint
perplexity, eval_loss = evaluate(args, model, eval_dataloader, accelerator, eval_dataset)
logger.info(f"Best model metrics: perplexity: {perplexity} eval_loss: {eval_loss}")
if perplexity != best_metric:
raise AssertionError(
f"Best metric {best_metric} does not match the metric {perplexity} of the loaded best model."
)
if args.output_dir is not None:
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
# New Code #
# Saves the whole/unpartitioned fp16 model when in ZeRO Stage-3 to the output directory if
# `stage3_gather_16bit_weights_on_model_save` is True in DeepSpeed Config file or
# `zero3_save_16bit_model` is True in DeepSpeed Plugin.
# For Zero Stages 1 and 2, models are saved as usual in the output directory.
# The saved model file is named `pytorch_model.bin`.
unwrapped_model.save_pretrained(
args.output_dir,
is_main_process=accelerator.is_main_process,
save_function=accelerator.save,
state_dict=accelerator.get_state_dict(model),
)
if accelerator.is_main_process:
tokenizer.save_pretrained(args.output_dir)
if args.push_to_hub:
repo.push_to_hub(commit_message="End of training", auto_lfs_prune=True)
with open(os.path.join(args.output_dir, "all_results.json"), "w") as f:
json.dump({"perplexity": perplexity, "eval_loss": eval_loss.item()}, f)
if __name__ == "__main__":
main()
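
Note: `evaluate(...)`, `checkpoint_model(...)` and `load_training_checkpoint(...)` are helpers defined earlier in this file, outside the portion shown above. A minimal sketch of what the two checkpoint helpers amount to, assuming they thinly wrap the DeepSpeed engine's `save_checkpoint`/`load_checkpoint` (the names and the client-state fields below are illustrative, not a verbatim copy of the script):

def checkpoint_model(checkpoint_folder, ckpt_id, model, epoch, last_global_step, **kwargs):
    """Save a DeepSpeed checkpoint together with a small client-state dict."""
    client_state = {"epoch": epoch, "last_global_step": last_global_step}
    # `model` here is the DeepSpeed engine returned by `accelerator.prepare`, so it exposes `save_checkpoint`.
    model.save_checkpoint(checkpoint_folder, ckpt_id, client_state=client_state)

def load_training_checkpoint(model, load_dir, tag=None, **kwargs):
    """Load a DeepSpeed checkpoint and return (epoch, last_global_step) from its client state."""
    _, client_state = model.load_checkpoint(load_dir, tag=tag, **kwargs)
    return client_state["epoch"], client_state["last_global_step"]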

View File

@ -19,8 +19,9 @@ import os
import torch
from torch.utils.data import DataLoader
import evaluate
from accelerate import Accelerator, DistributedType
from datasets import load_dataset, load_metric
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
@ -111,15 +112,12 @@ def training_function(config, args):
# We need to initialize the trackers we use, and also store our configuration
if args.with_tracking:
if accelerator.is_main_process:
run = os.path.split(__file__)[-1].split(".")[0]
if args.logging_dir:
run = os.path.join(args.logging_dir, run)
accelerator.print(run)
accelerator.init_trackers(run, config)
experiment_config = vars(args)
accelerator.init_trackers("fsdp_glue_no_trainer", experiment_config)
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
datasets = load_dataset("glue", "mrpc")
metric = load_metric("glue", "mrpc")
metric = evaluate.load("glue", "mrpc")
def tokenize_function(examples):
# max_length=None => use the model max length (it's actually the default)
@ -139,7 +137,7 @@ def training_function(config, args):
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
if batch_size > MAX_GPU_BATCH_SIZE and accelerator.distributed_type != DistributedType.TPU:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
@ -164,6 +162,7 @@ def training_function(config, args):
# New Code #
# For FSDP feature, it is highly recommended and efficient to prepare the model before creating optimizer
model = accelerator.prepare(model)
accelerator.print(model)
# Instantiate optimizer
# New Code #
@ -282,7 +281,7 @@ def training_function(config, args):
predictions, references = accelerator.gather(
(predictions, batch["labels"])
) # If we are in a multiprocess environment, the last batch has duplicates
if accelerator.num_processes > 1:
if accelerator.use_distributed:
if step == len(eval_dataloader) - 1:
predictions = predictions[: len(eval_dataloader.dataset) - samples_seen]
references = references[: len(eval_dataloader.dataset) - samples_seen]
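For context, the `samples_seen` counter referenced in this hunk is maintained inside the evaluation loop of the example script; a sketch of the surrounding logic (not shown in the hunk, variable names assumed from the script):

samples_seen = 0
for step, batch in enumerate(eval_dataloader):
    with torch.no_grad():
        outputs = model(**batch)
    predictions = outputs.logits.argmax(dim=-1)
    predictions, references = accelerator.gather((predictions, batch["labels"]))
    if accelerator.use_distributed:
        # The distributed sampler pads the last batch with duplicated samples, so trim them
        # before feeding the metric; otherwise just count how many real samples we have seen.
        if step == len(eval_dataloader) - 1:
            predictions = predictions[: len(eval_dataloader.dataset) - samples_seen]
            references = references[: len(eval_dataloader.dataset) - samples_seen]
        else:
            samples_seen += references.shape[0]
    metric.add_batch(predictions=predictions, references=references)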

View File

@ -0,0 +1,210 @@
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
import evaluate
from accelerate import Accelerator, DistributedType
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
########################################################################
# This is a fully working simple example to use Accelerate
# and perform gradient accumulation
#
# This example trains a Bert base model on GLUE MRPC
# in any of the following settings (with the same script):
# - single CPU or single GPU
# - multi GPUS (using PyTorch distributed mode)
# - (multi) TPUs
# - fp16 (mixed-precision) or fp32 (normal precision)
#
# To run it in each of these various modes, follow the instructions
# in the readme for examples:
# https://github.com/huggingface/accelerate/tree/main/examples
#
########################################################################
MAX_GPU_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 32
def get_dataloaders(accelerator: Accelerator, batch_size: int = 16):
"""
Creates a set of `DataLoader`s for the `glue` dataset,
using "bert-base-cased" as the tokenizer.
Args:
accelerator (`Accelerator`):
An `Accelerator` object
batch_size (`int`, *optional*):
The batch size for the train and validation DataLoaders.
"""
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
datasets = load_dataset("glue", "mrpc")
def tokenize_function(examples):
# max_length=None => use the model max length (it's actually the default)
outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
return outputs
# Apply the method we just defined to all the examples in all the splits of the dataset
tokenized_datasets = datasets.map(
tokenize_function,
batched=True,
remove_columns=["idx", "sentence1", "sentence2"],
)
# We also rename the 'label' column to 'labels' which is the expected name for labels by the models of the
# transformers library
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
def collate_fn(examples):
# On TPU it's best to pad everything to the same length or training will be very slow.
if accelerator.distributed_type == DistributedType.TPU:
return tokenizer.pad(examples, padding="max_length", max_length=128, return_tensors="pt")
return tokenizer.pad(examples, padding="longest", return_tensors="pt")
# Instantiate dataloaders.
train_dataloader = DataLoader(
tokenized_datasets["train"], shuffle=True, collate_fn=collate_fn, batch_size=batch_size
)
eval_dataloader = DataLoader(
tokenized_datasets["validation"], shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
)
return train_dataloader, eval_dataloader
# For testing only
if os.environ.get("TESTING_MOCKED_DATALOADERS", None) == "1":
from accelerate.test_utils.training import mocked_dataloaders
get_dataloaders = mocked_dataloaders # noqa: F811
def training_function(config, args):
# New Code #
gradient_accumulation_steps = int(args.gradient_accumulation_steps)
# Initialize accelerator
accelerator = Accelerator(
cpu=args.cpu, mixed_precision=args.mixed_precision, gradient_accumulation_steps=gradient_accumulation_steps
)
if accelerator.distributed_type == DistributedType.TPU and gradient_accumulation_steps > 1:
raise NotImplementedError(
"Gradient accumulation on TPUs is currently not supported. Pass `gradient_accumulation_steps=1`"
)
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
seed = int(config["seed"])
batch_size = int(config["batch_size"])
metric = evaluate.load("glue", "mrpc")
set_seed(seed)
train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
# Instantiate the model (we build the model here so that the seed also controls the initialization of new weights)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
# We could avoid this line since the accelerator is set with `device_placement=True` (default value).
# Note that if you are placing tensors on devices manually, this line absolutely needs to be before the optimizer
# creation otherwise training will not work on TPU (`accelerate` will kindly throw an error to make us aware of that).
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr)
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
optimizer=optimizer,
num_warmup_steps=100,
num_training_steps=(len(train_dataloader) * num_epochs),
)
# Prepare everything
# There is no specific order to remember; we just need to unpack the objects in the same order we gave them to the
# prepare method.
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
# Now we train the model
for epoch in range(num_epochs):
model.train()
for step, batch in enumerate(train_dataloader):
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch.to(accelerator.device)
# New code #
# We use the new `accumulate` context manager to perform gradient accumulation
# We also do not currently support TPUs, nor do we advise using them, as bugs were found on the XLA side when running our tests.
with accelerator.accumulate(model):
output = model(**batch)
loss = output.loss
accelerator.backward(loss)
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
model.eval()
for step, batch in enumerate(eval_dataloader):
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch.to(accelerator.device)
with torch.no_grad():
outputs = model(**batch)
predictions = outputs.logits.argmax(dim=-1)
predictions, references = accelerator.gather((predictions, batch["labels"]))
metric.add_batch(
predictions=predictions,
references=references,
)
eval_metric = metric.compute()
# Use accelerator.print to print only on the main process.
accelerator.print(f"epoch {epoch}:", eval_metric)
def main():
parser = argparse.ArgumentParser(description="Simple example of training script.")
parser.add_argument(
"--mixed_precision",
type=str,
default="no",
choices=["no", "fp16", "bf16"],
help="Whether to use mixed precision. Choose"
"between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10."
"and an Nvidia Ampere GPU.",
)
# New Code #
parser.add_argument(
"--gradient_accumulation_steps",
type=int,
default=1,
help="The number of minibatches to be ran before gradients are accumulated.",
)
parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
training_function(config, args)
if __name__ == "__main__":
main()
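
A small, hedged addition to the training loop above: the new `Accelerator.sync_gradients` property (introduced in the accelerator.py changes further down) reports whether the current step actually synchronized gradients, which is handy for counting real optimizer steps or gating logging. A minimal sketch, assuming the same objects as in the script:

completed_steps = 0
for step, batch in enumerate(train_dataloader):
    with accelerator.accumulate(model):
        outputs = model(**batch)
        accelerator.backward(outputs.loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
    # True only on the steps where gradients were synchronized, i.e. once every
    # `gradient_accumulation_steps` batches and on the last batch of the dataloader.
    if accelerator.sync_gradients:
        completed_steps += 1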

View File

@ -15,20 +15,15 @@ import argparse
import os
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from accelerate import Accelerator, DistributedType
# New Code #
import evaluate
from accelerate import Accelerator, DistributedType
from accelerate.utils import find_executable_batch_size
from datasets import load_dataset, load_metric
from transformers import (
AdamW,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
set_seed,
)
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
########################################################################
@ -117,15 +112,14 @@ def training_function(config, args):
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
metric = load_metric("glue", "mrpc")
metric = evaluate.load("glue", "mrpc")
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
if batch_size > MAX_GPU_BATCH_SIZE and accelerator.distributed_type != DistributedType.TPU:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
@ -139,7 +133,7 @@ def training_function(config, args):
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
optimizer = AdamW(params=model.parameters(), lr=lr)
# New Code #
# We now can define an inner training loop function. It should take a batch size as the only parameter,
@ -218,7 +212,7 @@ def main():
)
parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
training_function(config, args)

View File

@ -16,17 +16,13 @@ import argparse
import os
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
import evaluate
from accelerate import Accelerator, DistributedType
from datasets import load_dataset, load_metric
from transformers import (
AdamW,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
set_seed,
)
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
########################################################################
@ -118,15 +114,14 @@ def training_function(config, args):
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
metric = load_metric("glue", "mrpc")
metric = evaluate.load("glue", "mrpc")
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
if batch_size > MAX_GPU_BATCH_SIZE and accelerator.distributed_type != DistributedType.TPU:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
@ -141,7 +136,7 @@ def training_function(config, args):
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
optimizer = AdamW(params=model.parameters(), lr=lr)
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
@ -183,7 +178,7 @@ def training_function(config, args):
predictions, references = accelerator.gather((predictions, batch["labels"]))
# New Code #
# First we check if it's a distributed system
if accelerator.num_processes > 1:
if accelerator.use_distributed:
# Then see if we're on the last batch of our eval dataloader
if step == len(eval_dataloader) - 1:
# Last batch needs to be truncated on distributed systems as it contains additional samples
@ -215,7 +210,7 @@ def main():
)
parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
training_function(config, args)

View File

@ -16,17 +16,13 @@ import argparse
import os
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
import evaluate
from accelerate import Accelerator, DistributedType
from datasets import load_dataset, load_metric
from transformers import (
AdamW,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
set_seed,
)
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
########################################################################
@ -126,17 +122,16 @@ def training_function(config, args):
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
set_seed(seed)
train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
metric = load_metric("glue", "mrpc")
metric = evaluate.load("glue", "mrpc")
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
if batch_size > MAX_GPU_BATCH_SIZE and accelerator.distributed_type != DistributedType.TPU:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
@ -149,7 +144,7 @@ def training_function(config, args):
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
optimizer = AdamW(params=model.parameters(), lr=lr)
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
@ -170,8 +165,6 @@ def training_function(config, args):
if args.with_tracking:
if accelerator.is_main_process:
run = os.path.split(__file__)[-1].split(".")[0]
if args.logging_dir:
run = os.path.join(args.logging_dir, run)
accelerator.init_trackers(run, config)
# Now we train the model
@ -259,7 +252,7 @@ def main():
help="Location on where to store experiment tracking logs`",
)
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
training_function(config, args)

View File

@ -242,7 +242,7 @@ def training_function(config, args):
outputs = model(inputs)
predictions = outputs.argmax(dim=-1)
predictions, references = accelerator.gather((predictions, batch["label"]))
if accelerator.num_processes > 1:
if accelerator.use_distributed:
if step == len(eval_dataloader) - 1:
predictions = predictions[: len(eval_dataloader) - samples_seen]
references = references[: len(eval_dataloader) - samples_seen]

View File

@ -16,17 +16,13 @@ import argparse
import os
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
import evaluate
from accelerate import Accelerator, DistributedType
from datasets import load_dataset, load_metric
from transformers import (
AdamW,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
set_seed,
)
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
########################################################################
@ -75,7 +71,6 @@ def training_function(config, args):
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
@ -89,7 +84,7 @@ def training_function(config, args):
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
datasets = load_dataset("glue", "mrpc")
metric = load_metric("glue", "mrpc")
metric = evaluate.load("glue", "mrpc")
def tokenize_function(examples):
# max_length=None => use the model max length (it's actually the default)
@ -109,7 +104,7 @@ def training_function(config, args):
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
if batch_size > MAX_GPU_BATCH_SIZE and accelerator.distributed_type != DistributedType.TPU:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
@ -138,7 +133,7 @@ def training_function(config, args):
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
optimizer = AdamW(params=model.parameters(), lr=lr)
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
@ -227,7 +222,7 @@ def training_function(config, args):
predictions, references = accelerator.gather(
(predictions, batch["labels"])
) # If we are in a multiprocess environment, the last batch has duplicates
if accelerator.num_processes > 1:
if accelerator.use_distributed:
if step == len(eval_dataloader) - 1:
predictions = predictions[: len(eval_dataloader.dataset) - samples_seen]
references = references[: len(eval_dataloader.dataset) - samples_seen]
@ -304,7 +299,7 @@ def main():
help="Location on where to store experiment tracking logs`",
)
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
training_function(config, args)

View File

@ -0,0 +1,43 @@
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto",
"torch_adam": true,
"adam_w_mode": true
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 1,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": "auto",
"contiguous_gradients": true
},
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
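
Because this template defines both an `optimizer` and a `scheduler`, a script using it has to hand placeholder objects to `accelerator.prepare` instead of real ones, exactly as the CLM example above does. A minimal hedged sketch (the model, dataloader and step counts are assumed to exist already):

from accelerate.utils import DummyOptim, DummyScheduler

# The placeholders carry the hyper-parameters that fill the "auto" fields of the config.
optimizer = DummyOptim(model.parameters(), lr=5e-5, weight_decay=0.0)
lr_scheduler = DummyScheduler(optimizer, total_num_steps=max_train_steps, warmup_num_steps=100)

model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)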

View File

@ -0,0 +1,43 @@
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto",
"torch_adam": true,
"adam_w_mode": true
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": "auto",
"contiguous_gradients": true
},
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}

View File

@ -0,0 +1,47 @@
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto",
"torch_adam": true,
"adam_w_mode": true
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": "auto",
"contiguous_gradients": true
},
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}

View File

@ -0,0 +1,44 @@
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"sub_group_size": 1e9,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": "auto"
},
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}

View File

@ -0,0 +1,52 @@
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"sub_group_size": 1e9,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": "auto"
},
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
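
The "auto" entries in these templates are not valid DeepSpeed values by themselves; accelerate replaces them with concrete numbers derived from the objects passed to `prepare` (see the `config_kwargs` handling in the accelerator.py hunks further down). A rough, purely illustrative sketch of that substitution, with a hypothetical filename:

import json

# Hypothetical filename for the template above.
with open("zero_stage3_offload_config.json") as f:
    ds_config = json.load(f)

# Normally filled in by `accelerator.prepare()`; spelled out here only to make the
# mapping from "auto" to concrete values explicit.
ds_config["train_micro_batch_size_per_gpu"] = 16
ds_config["train_batch_size"] = 16 * 1 * 8  # micro batch size * gradient accumulation steps * processes
ds_config["gradient_clipping"] = 1.0
ds_config["scheduler"]["params"]["warmup_num_steps"] = 100
ds_config["scheduler"]["params"]["total_num_steps"] = 1000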

View File

@ -15,17 +15,13 @@
import argparse
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
import evaluate
from accelerate import Accelerator, DistributedType
from datasets import load_dataset, load_metric
from transformers import (
AdamW,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
set_seed,
)
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
########################################################################
@ -102,15 +98,14 @@ def training_function(config, args):
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
metric = load_metric("glue", "mrpc")
metric = evaluate.load("glue", "mrpc")
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
if batch_size > MAX_GPU_BATCH_SIZE and accelerator.distributed_type != DistributedType.TPU:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
@ -125,7 +120,7 @@ def training_function(config, args):
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
optimizer = AdamW(params=model.parameters(), lr=lr)
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
@ -187,7 +182,7 @@ def main():
)
parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
training_function(config, args)

View File

@ -1 +1,3 @@
accelerate # used to be installed in Amazon SageMaker environment
accelerate # used to be installed in Amazon SageMaker environment
evaluate
datasets==2.3.2

View File

@ -16,19 +16,22 @@ from setuptools import setup
from setuptools import find_packages
extras = {}
extras["quality"] = ["black ~= 22.0", "isort >= 5.5.4", "flake8 >= 3.8.3"]
extras["quality"] = ["black ~= 22.0", "isort >= 5.5.4", "flake8 >= 3.8.3", "hf-doc-builder >= 0.3.0"]
extras["docs"] = []
extras["test"] = [
"psutil",
"pytest",
"pytest-xdist",
"pytest-subtests",
"datasets",
"datasets<=2.2.2",
"evaluate",
"transformers",
"scipy",
"sklearn"
"sklearn",
"parameterized",
"deepspeed",
]
extras["test_trackers"] = ["wandb", "comet-ml", "tensorflow>=2.6.2", "tensorboard"]
extras["test_trackers"] = ["wandb", "comet-ml", "tensorboard"]
extras["dev"] = extras["quality"] + extras["test"]
extras["sagemaker"] = [
@ -37,7 +40,7 @@ extras["sagemaker"] = [
setup(
name="accelerate",
version="0.10.0.dev0",
version="0.11.0",
description="Accelerate",
long_description=open("README.md", "r", encoding="utf-8").read(),
long_description_content_type="text/markdown",
@ -55,8 +58,8 @@ setup(
"accelerate-launch=accelerate.commands.launch:main",
]
},
python_requires=">=3.6.0",
install_requires=["numpy>=1.17", "packaging>=20.0", "pyyaml", "torch>=1.4.0"],
python_requires=">=3.7.0",
install_requires=["numpy>=1.17", "packaging>=20.0", "psutil", "pyyaml", "torch>=1.4.0"],
extras_require=extras,
classifiers=[
"Development Status :: 5 - Production/Stable",
@ -66,7 +69,6 @@ setup(
"License :: OSI Approved :: Apache Software License",
"Operating System :: OS Independent",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.6",
"Programming Language :: Python :: 3.7",
"Topic :: Scientific/Engineering :: Artificial Intelligence",
],

View File

@ -2,7 +2,7 @@
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
__version__ = "0.10.0.dev0"
__version__ = "0.11.0"
from .accelerator import Accelerator
from .big_modeling import cpu_offload, disk_offload, dispatch_model, init_empty_weights, load_checkpoint_and_dispatch

View File

@ -12,7 +12,9 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import contextlib
import gc
import math
import os
import sys
import warnings
@ -26,7 +28,7 @@ from .data_loader import prepare_data_loader
from .logging import get_logger
from .optimizer import AcceleratedOptimizer
from .scheduler import AcceleratedScheduler
from .state import AcceleratorState
from .state import AcceleratorState, GradientState
from .tracking import LOGGER_TYPE_TO_CLASS, GeneralTracker, filter_trackers
from .utils import (
DeepSpeedPlugin,
@ -39,12 +41,15 @@ from .utils import (
LoggerType,
PrecisionType,
RNGType,
compare_versions,
convert_outputs_to_fp32,
extract_model_from_parallel,
gather,
get_pretty_name,
is_bf16_available,
is_deepspeed_available,
is_torch_version,
is_tpu_available,
pad_across_processes,
reduce,
save,
@ -55,7 +60,16 @@ from .utils import (
if is_deepspeed_available():
import deepspeed
from .utils import DeepSpeedEngineWrapper, DeepSpeedOptimizerWrapper
from .utils import (
DeepSpeedEngineWrapper,
DeepSpeedOptimizerWrapper,
DeepSpeedSchedulerWrapper,
DummyOptim,
DummyScheduler,
)
if is_tpu_available(check_device=False):
import torch_xla.distributed.xla_multiprocessing as xmp
logger = get_logger(__name__)
@ -78,6 +92,9 @@ class Accelerator:
default to the value in the environment variable `MIXED_PRECISION`, which will use the default value in the
accelerate config of the current system or the flag passed with the `accelerate.launch` command. 'fp16'
requires pytorch 1.6 or higher. 'bf16' requires pytorch 1.10 or higher.
gradient_accumulation_steps (`int`, *optional*, default to 1):
The number of steps that should pass before gradients are accumulated. A number > 1 should be combined with
`Accelerator.accumulate`.
cpu (`bool`, *optional*):
Whether or not to force the script to execute on CPU. Will ignore GPU available if set to `True` and force
the execution on one process only.
@ -132,6 +149,7 @@ class Accelerator:
split_batches: bool = False,
fp16: bool = None,
mixed_precision: Union[PrecisionType, str] = None,
gradient_accumulation_steps: int = 1,
cpu: bool = False,
deepspeed_plugin: DeepSpeedPlugin = None,
fsdp_plugin: FullyShardedDataParallelPlugin = None,
@ -143,7 +161,10 @@ class Accelerator:
kwargs_handlers: Optional[List[KwargsHandler]] = None,
):
self.logging_dir = logging_dir
self.log_with = filter_trackers(log_with, self.logging_dir)
trackers = filter_trackers(log_with, self.logging_dir)
if len(trackers) < 1 and log_with is not None:
warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
self.log_with = trackers
if mixed_precision is not None:
mixed_precision = str(mixed_precision)
@ -163,6 +184,19 @@ class Accelerator:
deepspeed_plugin, DeepSpeedPlugin
), "`deepspeed_plugin` must be a DeepSpeedPlugin object."
os.environ["USE_DEEPSPEED"] = "true" # use DeepSpeed if plugin is provided
if deepspeed_plugin:
if not is_deepspeed_available():
raise ImportError("DeepSpeed is not installed => run `pip install deepspeed` or build it from source.")
if compare_versions("deepspeed", "<", "0.6.5"):
raise ImportError("DeepSpeed version must be >= 0.6.5. Please update DeepSpeed.")
mixed_precision = os.environ.get("MIXED_PRECISION", "no") if mixed_precision is None else mixed_precision
deepspeed_plugin.set_mixed_precision(mixed_precision)
deepspeed_plugin.set_deepspeed_weakref()
if os.environ.get("USE_FSDP", "false") == "true" or isinstance(fsdp_plugin, FullyShardedDataParallelPlugin):
if is_torch_version("<", "1.12.0"):
raise ValueError("FSDP requires PyTorch >= 1.12.0")
if fsdp_plugin is None: # init from env variables
fsdp_plugin = FullyShardedDataParallelPlugin() if os.environ.get("USE_FSDP", "false") == "true" else None
@ -171,10 +205,6 @@ class Accelerator:
raise TypeError("`fsdp_plugin` must be a FullyShardedDataParallelPlugin object.")
os.environ["USE_FSDP"] = "true" # use FSDP if plugin is provided
if os.environ.get("USE_FSDP", "false") == "true":
if is_torch_version("<", "1.12.0.dev20220418+cu113"):
raise ValueError("FSDP requires PyTorch >= 1.12.0.dev20220418+cu113")
# Kwargs handlers
self.ddp_handler = None
self.scaler_handler = None
@ -208,6 +238,13 @@ class Accelerator:
**kwargs,
)
if gradient_accumulation_steps > 1:
if self.state.distributed_type == DistributedType.TPU:
raise NotImplementedError(
"Gradient accumulation on TPU is not supported. Pass in `gradient_accumulation_steps=1`"
)
self.gradient_accumulation_steps = gradient_accumulation_steps
self.device_placement = device_placement
self.split_batches = split_batches
self.dispatch_batches = dispatch_batches
@ -220,20 +257,33 @@ class Accelerator:
# Mixed precision attributes
self.scaler = None
self.native_amp = False
err = "{mode} mixed precision requires {requirement}"
if self.state.mixed_precision == "fp16":
self.native_amp = is_torch_version(">=", "1.6")
if not self.native_amp:
raise ValueError("fp16 mixed precision requires PyTorch >= 1.6")
raise ValueError(err.format(mode="fp16", requirement="PyTorch >= 1.6"))
if not torch.cuda.is_available():
raise ValueError(err.format(mode="fp16", requirement="a GPU"))
kwargs = self.scaler_handler.to_kwargs() if self.scaler_handler is not None else {}
self.scaler = torch.cuda.amp.GradScaler(**kwargs)
elif self.state.mixed_precision == "bf16":
self.native_amp = is_torch_version(">=", "1.10")
if mixed_precision == "bf16" and not self.native_amp:
raise ValueError("bf16 mixed precision requires PyTorch >= 1.10")
if self.distributed_type == DistributedType.FSDP:
from torch.distributed.fsdp.sharded_grad_scaler import ShardedGradScaler
kwargs = self.scaler_handler.to_kwargs() if self.scaler_handler is not None else {}
self.scaler = torch.cuda.amp.GradScaler(**kwargs)
self.scaler = ShardedGradScaler(**kwargs)
else:
self.scaler = torch.cuda.amp.GradScaler(**kwargs)
elif self.state.mixed_precision == "bf16" and self.distributed_type != DistributedType.FSDP:
self.native_amp = is_bf16_available(True)
if mixed_precision == "bf16" and not self.native_amp and not is_tpu_available():
raise ValueError(err.format(mode="bf16", requirement="PyTorch >= 1.10 and a supported device."))
# Only on the GPU do we care about scaling the gradients
if torch.cuda.is_available():
kwargs = self.scaler_handler.to_kwargs() if self.scaler_handler is not None else {}
self.scaler = torch.cuda.amp.GradScaler(**kwargs)
# Start of internal step tracking
self.step = 0
self.gradient_state = GradientState()
# Internal references to the training objects
self._optimizers = []
@ -246,6 +296,10 @@ class Accelerator:
if self.rng_types is None:
self.rng_types = ["torch"] if is_torch_version("<=", "1.5.1") else ["generator"]
@property
def use_distributed(self):
return self.distributed_type != DistributedType.NO and self.num_processes > 1
@property
def distributed_type(self):
return self.state.distributed_type
@ -321,6 +375,56 @@ class Accelerator:
if is_main:
self.wait_for_everyone()
@contextmanager
def no_sync(self, model):
"""
A context manager to disable gradient synchronizations across DDP processes by calling
`torch.nn.parallel.DistributedDataParallel.no_sync`.
If `model` is not in DDP, this context manager does nothing
Args:
model (`torch.nn.Module`):
PyTorch Module that was prepared with `Accelerator.prepare`
"""
context = contextlib.nullcontext
if self.use_distributed:
context = getattr(model, "no_sync", context)
with context():
yield
def _do_sync(self):
"Sets the right `sync_gradients` context and either resets or increases `self.step`"
if self.gradient_state.end_of_dataloader:
self.step = 0
self.gradient_state._set_sync_gradients(True)
else:
self.step += 1
self.gradient_state._set_sync_gradients((self.step % self.gradient_accumulation_steps) == 0)
@property
def sync_gradients(self):
return self.gradient_state.sync_gradients
@contextmanager
def accumulate(self, model):
"""
A context manager that will lightly wrap around and perform gradient accumulation automatically
Args:
model (`torch.nn.Module`):
PyTorch Module that was prepared with `Accelerator.prepare`
"""
self._do_sync()
if self.sync_gradients:
context = contextlib.nullcontext
else:
context = self.no_sync
with context(model):
yield
def print(self, *args, **kwargs):
"""
Use in replacement of `print()` to only print once per server.
@ -470,6 +574,7 @@ class Accelerator:
# Check if the model is already a FSDP model due to `Manual Wrapping` and if so,
# don't wrap it again
if type(model) != FSDP:
self.state.fsdp_plugin.set_auto_wrap_policy(model)
fsdp_plugin = self.state.fsdp_plugin
model = FSDP(
model,
@ -477,6 +582,7 @@ class Accelerator:
cpu_offload=fsdp_plugin.cpu_offload,
auto_wrap_policy=fsdp_plugin.auto_wrap_policy,
backward_prefetch=fsdp_plugin.backward_prefetch,
mixed_precision=fsdp_plugin.mixed_precision_policy,
ignored_modules=fsdp_plugin.ignored_modules,
)
if not fsdp_plugin.cpu_offload.offload_params:
@ -487,19 +593,28 @@ class Accelerator:
if self.native_amp:
if self.mixed_precision == "fp16" and is_torch_version(">=", "1.10"):
model.forward = torch.cuda.amp.autocast(dtype=torch.float16)(model.forward)
elif self.mixed_precision == "bf16":
model.forward = torch.cuda.amp.autocast(dtype=torch.bfloat16)(model.forward)
elif self.mixed_precision == "bf16" and self.distributed_type != DistributedType.TPU:
device_type = "cuda" if torch.cuda.is_available() else "cpu"
model.forward = torch.autocast(device_type=device_type, dtype=torch.bfloat16)(model.forward)
else:
model.forward = torch.cuda.amp.autocast()(model.forward)
model.forward = convert_outputs_to_fp32(model.forward)
if self.distributed_type == DistributedType.TPU and self.state.fork_launched:
model = xmp.MpModelWrapper(model).to(self.device)
return model
def _prepare_deepspeed(self, *args):
deepspeed_plugin = self.state.deepspeed_plugin
self.deepspeed_config = deepspeed_plugin.deepspeed_config
result = [
self._prepare_one(obj, first_pass=True) if isinstance(obj, torch.utils.data.DataLoader) else obj
for obj in args
]
batch_sizes = [obj.batch_size for obj in args if hasattr(obj, "batch_size")]
if self.split_batches:
batch_sizes = [batch_size // self.num_processes for batch_size in batch_sizes]
if len(batch_sizes) == 0:
raise ValueError(
"You must specify a training or evaluation dataloader in `accelerate.prepare()` when using DeepSpeed."
@ -508,73 +623,141 @@ class Accelerator:
batch_size_per_device = min(batch_sizes) if deepspeed_plugin.is_train_batch_min else max(batch_sizes)
if len(batch_sizes) > 1:
logger.info(
f"Since you passed both train and evaluation dataloader, `is_train_batch_min` (here \
{deepspeed_plugin.is_train_batch_min} will decide the `train_batch_size` ({batch_size_per_device})."
"Since you passed both train and evaluation dataloader, `is_train_batch_min` (here "
f"{deepspeed_plugin.is_train_batch_min} will decide the `train_batch_size` ({batch_size_per_device})."
)
self.deepspeed_config["train_batch_size"] = (
batch_size_per_device * deepspeed_plugin.gradient_accumulation_steps * self.num_processes
)
result = [
self._prepare_one(obj, first_pass=True) if isinstance(obj, torch.utils.data.DataLoader) else obj
for obj in args
]
config_kwargs = {
"train_micro_batch_size_per_gpu": batch_size_per_device,
"train_batch_size": batch_size_per_device
* deepspeed_plugin.deepspeed_config["gradient_accumulation_steps"]
* self.num_processes,
"gradient_clipping": 1.0,
"zero_optimization.stage3_gather_16bit_weights_on_model_save": False,
}
model = None
optimizer = None
scheduler = None
for obj in result:
if isinstance(obj, torch.nn.Module):
model = obj
elif isinstance(obj, (torch.optim.Optimizer, dict)):
elif isinstance(obj, (torch.optim.Optimizer, DummyOptim)):
optimizer = obj
elif (isinstance(obj, (torch.optim.lr_scheduler._LRScheduler, DummyScheduler))) or (
type(obj).__name__ in deepspeed.runtime.lr_schedules.VALID_LR_SCHEDULES
):
scheduler = obj
if deepspeed_plugin.auto_opt_mapping:
is_adam = isinstance(optimizer, torch.optim.Adam)
is_adamw = isinstance(optimizer, torch.optim.AdamW)
if (is_adam or is_adamw) and deepspeed_plugin.offload_optimizer_device == "cpu":
defaults = optimizer.defaults
params = []
for group in optimizer.param_groups:
params.extend(group["params"])
optimizer = deepspeed.ops.adam.DeepSpeedCPUAdam(
params,
lr=defaults["lr"],
bias_correction=True,
betas=defaults["betas"],
eps=defaults["eps"],
weight_decay=defaults["weight_decay"],
amsgrad=defaults["amsgrad"],
adamw_mode=is_adamw,
if optimizer is not None:
if "optimizer" in deepspeed_plugin.deepspeed_config and not isinstance(optimizer, (DummyOptim)):
raise ValueError(
"You cannot specify an optimizer in the config file and in the code at the same time. "
"Please remove the optimizer from the config file or "
"create `accelerate.utils.DummyOptim` in the code."
)
elif "optimizer" not in deepspeed_plugin.deepspeed_config and isinstance(optimizer, (DummyOptim)):
raise ValueError(
"You cannot create a `DummyOptim` without specifying an optimizer in the config file."
)
if isinstance(optimizer, (torch.optim.Optimizer)):
deepspeed_plugin.deepspeed_config["zero_allow_untested_optimizer"] = True
if scheduler is not None:
if "scheduler" in deepspeed_plugin.deepspeed_config and not isinstance(scheduler, (DummyScheduler)):
raise ValueError(
"You cannot specify a scheduler in the config file and in the code at the same time. "
"Please remove the scheduler from the config file or "
"create `accelerate.utils.DummyScheduler` in the code."
)
elif "scheduler" not in deepspeed_plugin.deepspeed_config and isinstance(scheduler, (DummyScheduler)):
raise ValueError(
"You cannot create a `DummyScheduler` without specifying a scheduler in the config file."
)
if optimizer is not None and scheduler is not None:
if isinstance(optimizer, (DummyOptim)) and not isinstance(scheduler, (DummyScheduler)):
raise ValueError(
"You can only specify `accelerate.utils.DummyScheduler` in the code when using "
"`accelerate.utils.DummyOptim`."
)
# useful when only eval_dataloader is given into `accelerator.prepare()`
if model is not None:
engine = DeepSpeedEngineWrapper(
args=None,
model=model,
optimizer=optimizer,
config_params=self.deepspeed_config,
dist_init_required=False,
)
if hasattr(model, "config") and hasattr(model.config, "hidden_size"):
hidden_size = model.config.hidden_size
config_kwargs.update(
{
"zero_optimization.reduce_bucket_size": hidden_size * hidden_size,
"zero_optimization.stage3_prefetch_bucket_size": 0.9 * hidden_size * hidden_size,
"zero_optimization.stage3_param_persistence_threshold": 10 * hidden_size,
}
)
if isinstance(optimizer, (DummyOptim)):
config_kwargs.update(
{"optimizer.params.lr": optimizer.lr, "optimizer.params.weight_decay": optimizer.weight_decay}
)
if isinstance(scheduler, (DummyScheduler)):
config_kwargs.update(
{
"scheduler.params.warmup_min_lr": 0,
"scheduler.params.warmup_max_lr": scheduler.optimizer.lr,
"scheduler.params.warmup_num_steps": scheduler.warmup_num_steps,
}
)
if scheduler.total_num_steps is not None:
config_kwargs["scheduler.params.total_num_steps"] = (
math.ceil(scheduler.total_num_steps / self.num_processes)
if not self.split_batches
else scheduler.total_num_steps
)
deepspeed_plugin.deepspeed_config_process(must_match=False, **config_kwargs)
self.deepspeed_config = deepspeed_plugin.deepspeed_config
kwargs = dict(model=model, config_params=self.deepspeed_config)
if optimizer is not None:
if isinstance(optimizer, (DummyOptim)):
kwargs["model_parameters"] = optimizer.params
else:
kwargs["optimizer"] = optimizer
if scheduler is not None:
if type(scheduler).__name__ in deepspeed.runtime.lr_schedules.VALID_LR_SCHEDULES:
kwargs["lr_scheduler"] = scheduler
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
if optimizer is not None:
optimizer = DeepSpeedOptimizerWrapper(optimizer)
if scheduler is not None:
if lr_scheduler is None:
scheduler = AcceleratedScheduler(
scheduler,
optimizer,
step_with_optimizer=self.step_scheduler_with_optimizer,
split_batches=self.split_batches,
)
else:
scheduler = DeepSpeedSchedulerWrapper(lr_scheduler, optimizer)
for i in range(len(result)):
if isinstance(result[i], torch.nn.Module):
result[i] = engine
elif isinstance(result[i], torch.optim.Optimizer):
result[i] = DeepSpeedOptimizerWrapper(engine.optimizer, engine)
self.deepspeed_engine = engine # pointing for deepspeed_engine.backward()
elif isinstance(result[i], (torch.optim.Optimizer, DummyOptim)):
result[i] = optimizer
elif (isinstance(result[i], (torch.optim.lr_scheduler._LRScheduler, DummyScheduler))) or (
type(result[i]).__name__ in deepspeed.runtime.lr_schedules.VALID_LR_SCHEDULES
):
result[i] = scheduler
# pointing for deepspeed_engine_wrapped.backward()
self.deepspeed_engine_wrapped = DeepSpeedEngineWrapper(engine)
self._models.append(engine)
self._optimizers.append(engine.optimizer)
assert (
len(self._models) == 1
), "You can't use same `Accelerator()` instance with 2 models when using DeepSpeed"
if self.distributed_type == DistributedType.DEEPSPEED:
assert hasattr(
self, "deepspeed_engine"
), "You need to pass the model along the optimizer when using Deepspeed."
if optimizer is not None:
self._optimizers.append(optimizer)
if scheduler is not None:
self._schedulers.append(scheduler)
if len(self._models) > 1:
raise AssertionError(
"You can't use same `Accelerator()` instance with multiple models when using DeepSpeed"
)
return tuple(result)
def prepare_data_loader(self, data_loader):
@ -584,7 +767,7 @@ class Accelerator:
num_processes=self.num_processes,
process_index=self.process_index,
split_batches=self.split_batches,
put_on_device=self.device_placement,
put_on_device=self.device_placement if self.distributed_type != DistributedType.TPU else False,
rng_types=self.rng_types.copy(),
dispatch_batches=self.dispatch_batches,
)
@ -611,8 +794,9 @@ class Accelerator:
"""
Use `accelerator.backward(loss)` in lieu of `loss.backward()`.
"""
loss /= self.gradient_accumulation_steps
if self.distributed_type == DistributedType.DEEPSPEED:
self.deepspeed_engine.backward(loss, **kwargs)
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
elif self.scaler is not None:
self.scaler.scale(loss).backward(**kwargs)
else:
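As a usage sketch (not in the diff), the scaled `backward` above pairs with the gradient accumulation support released here; the model, optimizer and data below are toy stand-ins.

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
dataset = TensorDataset(torch.randn(32, 8), torch.randn(32, 1))
dataloader = DataLoader(dataset, batch_size=4)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    with accelerator.accumulate(model):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)  # in lieu of loss.backward()
        optimizer.step()
        optimizer.zero_grad()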
@ -643,11 +827,15 @@ class Accelerator:
Should be used in place of `torch.nn.utils.clip_grad_norm_`.
"""
if self.distributed_type == DistributedType.FSDP:
self.unscale_gradients()
parameters = [p for p in parameters]
for model in self._models:
if parameters == [p for p in model.parameters()]:
model.clip_grad_norm_(max_norm, norm_type)
return
elif self.distributed_type == DistributedType.DEEPSPEED:
# `accelerator.backward(loss)` is doing that automatically. Therefore, its implementation is not needed
return
self.unscale_gradients()
torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=norm_type)
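A short sketch (not from the diff) of the intended call site, so the FSDP and DeepSpeed branches above are exercised through the Accelerator instead of calling `torch.nn.utils.clip_grad_norm_` directly; the toy model and values are illustrative.

import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = accelerator.prepare(model, optimizer)

loss = model(torch.randn(2, 4, device=accelerator.device)).sum()
accelerator.backward(loss)
accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)  # instead of torch.nn.utils.clip_grad_norm_
optimizer.step()
optimizer.zero_grad()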
@ -655,6 +843,8 @@ class Accelerator:
"""
Should be used in place of `torch.nn.utils.clip_grad_value_`.
"""
if self.distributed_type in [DistributedType.DEEPSPEED, DistributedType.FSDP]:
raise Exception("DeepSpeed and FSDP do not support `clip_grad_value_`. Use `clip_grad_norm_` instead.")
self.unscale_gradients()
torch.nn.utils.clip_grad_value_(parameters, clip_value)
@ -676,17 +866,23 @@ class Accelerator:
"""
return gather(tensor)
def reduce(self, tensor: torch.Tensor, reduction="sum"):
def reduce(self, tensor, reduction="sum"):
"""
Reduce the values in *tensor* across all processes based on *reduction*.
Note:
All processes get the reduced value.
Args:
tensor (`torch.Tensor`):
tensor (`torch.Tensor`, or a nested tuple/list/dictionary of `torch.Tensor`):
The tensors to reduce across all processes.
reduction (`str`, *optional*, defaults to "sum"):
A reduction type, can be one of 'sum', 'mean', or 'none'. If 'none', will not perform any operation.
Returns:
`torch.Tensor`, or a nested tuple/list/dictionary of `torch.Tensor`: The reduced tensor(s).
"""
reduce(tensor, reduction)
return reduce(tensor, reduction)
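A sketch (not in the diff) of the fixed behaviour: every process now receives the reduced value instead of `None`.

import torch
from accelerate import Accelerator

accelerator = Accelerator()
local_metric = torch.tensor(float(accelerator.process_index), device=accelerator.device)
mean_metric = accelerator.reduce(local_metric, reduction="mean")  # identical on every process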
def pad_across_processes(self, tensor, dim=0, pad_index=0, pad_first=False):
"""
@ -794,7 +990,7 @@ class Accelerator:
output_dir = os.path.expanduser(output_dir)
os.makedirs(output_dir, exist_ok=True)
logger.info(f"Saving current state to {output_dir}")
weights = [self.get_state_dict(m) for m in self._models]
weights = [self.get_state_dict(m, unwrap=False) for m in self._models]
save_location = save_accelerator_state(
output_dir, weights, self._optimizers, self._schedulers, self.state.process_index, self.scaler
)
@ -837,7 +1033,7 @@ class Accelerator:
self._schedulers = []
self._optimizers = []
self._models = []
self.deepspeed_engine = None
self.deepspeed_engine_wrapped = None
gc.collect()
torch.cuda.empty_cache()
@ -873,16 +1069,24 @@ class Accelerator:
break
return (model_device, optimizer_device)
def get_state_dict(self, model):
def get_state_dict(self, model, unwrap=True):
is_zero_3 = False
if is_deepspeed_available():
if isinstance(model, DeepSpeedEngineWrapper) and self.distributed_type == DistributedType.DEEPSPEED:
is_zero_3 = self.state.deepspeed_plugin.zero_stage == 3
if self.distributed_type == DistributedType.DEEPSPEED:
is_zero_3 = self.deepspeed_config["zero_optimization"]["stage"] == 3
if is_zero_3:
state_dict = model._zero3_consolidated_16bit_state_dict()
if model.zero_gather_16bit_weights_on_model_save():
state_dict = model._zero3_consolidated_16bit_state_dict()
else:
raise ValueError(
"Cannot get 16bit model weights because `stage3_gather_16bit_weights_on_model_save` in DeepSpeed config is False. "
"To save the model weights in 16bit, set `stage3_gather_16bit_weights_on_model_save` to True in DeepSpeed config file or "
"set `zero3_save_16bit_model` to True when using `accelerate config`. "
"To save the full checkpoint, run `model.save_checkpoint(save_dir)` and use `zero_to_fp32.py` to recover weights."
)
else:
model = self.unwrap_model(model)
if unwrap:
model = self.unwrap_model(model)
state_dict = model.state_dict()
if state_dict is not None:
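A hedged sketch (not part of the diff) of how the ZeRO Stage-3 branch above is typically used when saving; it assumes `accelerator` and `model` came out of a DeepSpeed ZeRO-3 `prepare()` call with `stage3_gather_16bit_weights_on_model_save` enabled.

def save_zero3_weights(accelerator, model, path="pytorch_model.bin"):
    # Consolidates the sharded ZeRO-3 parameters into a single 16-bit state dict.
    state_dict = accelerator.get_state_dict(model)
    if accelerator.is_main_process:
        accelerator.save(state_dict, path)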
@ -925,8 +1129,10 @@ class Accelerator:
if self.native_amp:
if self.mixed_precision == "fp16" and is_torch_version(">=", "1.10"):
autocast_context = torch.cuda.amp.autocast(dtype=torch.float16)
elif self.mixed_precision == "bf16":
autocast_context = torch.cuda.amp.autocast(dtype=torch.bfloat16)
elif self.mixed_precision == "bf16" and is_bf16_available():
if self.distributed_type in [DistributedType.NO, DistributedType.MULTI_CPU, DistributedType.MULTI_GPU]:
device_type = "cpu" if not torch.cuda.is_available() else "cuda"
autocast_context = torch.autocast(dtype=torch.bfloat16, device_type=device_type)
else:
autocast_context = torch.cuda.amp.autocast()
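Sketch (not from the diff): the context selected above is what `accelerator.autocast()` yields, so bf16 autocast can now run on CPU as well as GPU; this assumes the host actually supports bf16.

import torch
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16")
model = accelerator.prepare(torch.nn.Linear(4, 4))
with accelerator.autocast():
    outputs = model(torch.randn(2, 4, device=accelerator.device))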

View File

@ -65,7 +65,9 @@ def init_empty_weights(include_buffers: bool = False):
def register_empty_parameter(module, name, param):
old_register_parameter(module, name, param)
if param is not None:
module._parameters[name] = nn.Parameter(module._parameters[name].to(torch.device("meta")))
param_cls = type(module._parameters[name])
kwargs = module._parameters[name].__dict__
module._parameters[name] = param_cls(module._parameters[name].to(torch.device("meta")), **kwargs)
def register_empty_buffer(module, name, buffer):
old_register_buffer(module, name, buffer)
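Sketch (not part of the diff): `init_empty_weights` is the public entry point for the patched registration above; with the change, parameters keep their concrete `Parameter` subclass (for example 8-bit parameters) while living on the `meta` device.

import torch.nn as nn
from accelerate import init_empty_weights

with init_empty_weights():
    # Instantiated without allocating any real memory for the weights.
    big_model = nn.Sequential(nn.Linear(10_000, 10_000), nn.Linear(10_000, 10_000))

print(next(big_model.parameters()).device)  # meta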
@ -88,6 +90,7 @@ def cpu_offload(
execution_device: Optional[torch.device] = None,
offload_buffers: bool = False,
state_dict: Optional[Dict[str, torch.Tensor]] = None,
preload_module_classes: Optional[List[str]] = None,
):
"""
Activates full CPU offload for a model. As a result, all parameters of the model will be offloaded and only one
@ -104,13 +107,23 @@ def cpu_offload(
Whether or not to offload the buffers with the model parameters.
state_dict (`Dict[str, torch.Tensor]`, *optional*):
The state dict of the model that will be kept on CPU.
preload_module_classes (`List[str]`, *optional*):
A list of classes whose instances should load all their weights (even in the submodules) at the beginning
of the forward. This should only be used for classes that have submodules which are registered but not
called directly during the forward, for instance if a `dense` linear layer is registered, but at forward,
`dense.weight` and `dense.bias` are used in some operations instead of calling `dense` directly.
"""
if execution_device is None:
execution_device = next(iter(model.parameters())).device
if state_dict is None:
state_dict = {n: p.to("cpu") for n, p in model.state_dict().items()}
attach_align_device_hook(
model, execution_device=execution_device, offload=True, offload_buffers=offload_buffers, weights_map=state_dict
model,
execution_device=execution_device,
offload=True,
offload_buffers=offload_buffers,
weights_map=state_dict,
preload_module_classes=preload_module_classes,
)
add_hook_to_module(model, AlignDevicesHook(io_same_device=True))
return model
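A sketch (not in the diff) of calling `cpu_offload` with the new `preload_module_classes` argument; the class name passed here is purely illustrative, since a plain `Linear` does not actually need preloading.

import torch
from accelerate import cpu_offload

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 8))
execution_device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
model = cpu_offload(
    model,
    execution_device=execution_device,
    offload_buffers=False,
    preload_module_classes=["Linear"],  # illustrative only
)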
@ -121,6 +134,7 @@ def disk_offload(
offload_dir: Union[str, os.PathLike],
execution_device: Optional[torch.device] = None,
offload_buffers: bool = False,
preload_module_classes: Optional[List[str]] = None,
):
"""
Activates full disk offload for a model. As a result, all parameters of the model will be offloaded as
@ -136,6 +150,11 @@ def disk_offload(
model's first parameter device.
offload_buffers (`bool`, *optional*, defaults to `False`):
Whether or not to offload the buffers with the model parameters.
preload_module_classes (`List[str]`, *optional*):
A list of classes whose instances should load all their weights (even in the submodules) at the beginning
of the forward. This should only be used for classes that have submodules which are registered but not
called directly during the forward, for instance if a `dense` linear layer is registered, but at forward,
`dense.weight` and `dense.bias` are used in some operations instead of calling `dense` directly.
"""
if not os.path.isdir(offload_dir) or not os.path.isfile(os.path.join(offload_dir, "index.json")):
offload_state_dict(offload_dir, model.state_dict())
@ -148,6 +167,7 @@ def disk_offload(
offload=True,
offload_buffers=offload_buffers,
weights_map=weights_map,
preload_module_classes=preload_module_classes,
)
add_hook_to_module(model, AlignDevicesHook(io_same_device=True))
return model
@ -160,6 +180,7 @@ def dispatch_model(
state_dict: Optional[Dict[str, torch.Tensor]] = None,
offload_dir: Union[str, os.PathLike] = None,
offload_buffers: bool = False,
preload_module_classes: Optional[List[str]] = None,
):
"""
Dispatches a model according to a given device map. Layers of the model might be spread across GPUs, offloaded on
@ -180,6 +201,11 @@ def dispatch_model(
The folder in which to offload the model weights (or where the model weights are already offloaded).
offload_buffers (`bool`, *optional*, defaults to `False`):
Whether or not to offload the buffers with the model parameters.
preload_module_classes (`List[str]`, *optional*):
A list of classes whose instances should load all their weights (even in the submodules) at the beginning
of the forward. This should only be used for classes that have submodules which are registered but not
called directly during the forward, for instance if a `dense` linear layer is registered, but at forward,
`dense.weight` and `dense.bias` are used in some operations instead of calling `dense` directly.
"""
# Error early if the device map is incomplete.
check_device_map(model, device_map)
@ -219,6 +245,7 @@ def dispatch_model(
offload=offload,
offload_buffers=offload_buffers,
weights_map=weights_map,
preload_module_classes=preload_module_classes,
)
model.hf_device_map = device_map
return model
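Sketch (not in the diff) of `dispatch_model` with an explicit `device_map`; it assumes at least one CUDA device, and the submodule names and device assignments are hypothetical.

import torch
from accelerate import dispatch_model

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 8))
# Keys are the submodule names of the Sequential; the second layer's weights stay on CPU
# and are moved to GPU 0 on the fly by the attached hooks at forward time.
device_map = {"0": 0, "1": "cpu"}
model = dispatch_model(model, device_map=device_map)
print(model.hf_device_map)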
@ -234,6 +261,7 @@ def load_checkpoint_and_dispatch(
offload_buffers: bool = False,
dtype: Optional[Union[str, torch.dtype]] = None,
offload_state_dict: bool = False,
preload_module_classes: Optional[List[str]] = None,
):
"""
Loads a (potentially sharded) checkpoint inside a model, potentially sending weights to a given device as they are
@ -267,6 +295,11 @@ def load_checkpoint_and_dispatch(
offload_state_dict (`bool`, *optional*, defaults to `False`):
If `True`, will temporarily offload the CPU state dict on the hard drive to avoid getting out of CPU RAM if
the weight of the CPU state dict + the biggest shard does not fit.
preload_module_classes (`List[str]`, *optional*):
A list of classes whose instances should load all their weights (even in the submodules) at the beginning
of the forward. This should only be used for classes that have submodules which are registered but not
called directly during the forward, for instance if a `dense` linear layer is registered, but at forward,
`dense.weight` and `dense.bias` are used in some operations instead of calling `dense` directly.
"""
if device_map == "auto":
device_map = infer_auto_device_map(
@ -282,4 +315,10 @@ def load_checkpoint_and_dispatch(
)
if device_map is None:
return model
return dispatch_model(model, device_map=device_map, offload_dir=offload_folder, offload_buffers=offload_buffers)
return dispatch_model(
model,
device_map=device_map,
offload_dir=offload_folder,
offload_buffers=offload_buffers,
preload_module_classes=preload_module_classes,
)
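And a sketch (not in the diff) of the reformatted `load_checkpoint_and_dispatch` call; a tiny single-file checkpoint is written first so the example is self-contained, and the "auto" map assumes a GPU is available (real use would usually point at a sharded checkpoint folder with an `index.json`).

import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

# Write a tiny single-file checkpoint so the sketch can run end to end.
torch.save(torch.nn.Linear(8, 8).state_dict(), "tiny_checkpoint.bin")

with init_empty_weights():
    model = torch.nn.Linear(8, 8)

model = load_checkpoint_and_dispatch(
    model,
    "tiny_checkpoint.bin",
    device_map="auto",
    preload_module_classes=None,
)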

View File

@ -33,7 +33,7 @@ from .utils import (
)
if is_tpu_available():
if is_tpu_available(check_device=False):
import torch_xla.core.xla_model as xm
from .logging import get_logger
@ -116,7 +116,7 @@ def load_accelerator_state(input_dir, models, optimizers, schedulers, process_in
Args:
input_dir (`str` or `os.PathLike`):
The name of the folder to load all relevant weights and states.
model_stmodelsates (`List[torch.nn.Module]`):
models (`List[torch.nn.Module]`):
A list of model instances
optimizers (`List[torch.optim.Optimizer]`):
A list of optimizer instances

View File

@ -14,7 +14,13 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from ...utils import ComputeEnvironment, DistributedType, is_deepspeed_available
from ...utils import ComputeEnvironment, DistributedType, is_deepspeed_available, is_transformers_available
from ...utils.constants import (
DEEPSPEED_MULTINODE_LAUNCHERS,
FSDP_AUTO_WRAP_POLICY,
FSDP_BACKWARD_PREFETCH,
FSDP_SHARDING_STRATEGY,
)
from .config_args import ClusterConfig
from .config_utils import _ask_field, _convert_distributed_mode, _convert_yes_no_to_bool
@ -77,24 +83,117 @@ def get_cluster_input():
), "DeepSpeed is not installed => run `pip3 install deepspeed` or build it from source"
if distributed_type == DistributedType.DEEPSPEED:
deepspeed_config["zero_stage"] = _ask_field(
"What should be your DeepSpeed's ZeRO optimization stage (0, 1, 2, 3)? [2]: ",
lambda x: int(x),
default=2,
use_deepspeed_config = _ask_field(
"Do you want to specify a json file to a DeepSpeed config? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
if deepspeed_config["zero_stage"] >= 2:
deepspeed_config["offload_optimizer_device"] = _ask_field(
"Where to offload optimizer states? [NONE/cpu/nvme]: ",
if use_deepspeed_config:
deepspeed_config["deepspeed_config_file"] = _ask_field(
"Please enter the path to the json DeepSpeed config file: ",
lambda x: str(x),
default="none",
)
else:
deepspeed_config["zero_stage"] = _ask_field(
"What should be your DeepSpeed's ZeRO optimization stage (0, 1, 2, 3)? [2]: ",
lambda x: int(x),
default=2,
)
deepspeed_config["gradient_accumulation_steps"] = _ask_field(
"How many gradient accumulation steps you're passing in your script? [1]: ",
lambda x: int(x),
default=1,
if deepspeed_config["zero_stage"] >= 2:
deepspeed_config["offload_optimizer_device"] = _ask_field(
"Where to offload optimizer states? [none/cpu/nvme]: ",
lambda x: str(x),
default="none",
)
deepspeed_config["offload_param_device"] = _ask_field(
"Where to offload parameters? [none/cpu/nvme]: ",
lambda x: str(x),
default="none",
)
deepspeed_config["gradient_accumulation_steps"] = _ask_field(
"How many gradient accumulation steps you're passing in your script? [1]: ",
lambda x: int(x),
default=1,
)
use_gradient_clipping = _ask_field(
"Do you want to use gradient clipping? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
if use_gradient_clipping:
deepspeed_config["gradient_clipping"] = _ask_field(
"What is the gradient clipping value? [1.0]: ",
lambda x: float(x),
default=1.0,
)
if deepspeed_config["zero_stage"] == 3:
deepspeed_config["zero3_save_16bit_model"] = _ask_field(
"Do you want to save 16-bit model weights when using ZeRO Stage-3? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
deepspeed_config["zero3_init_flag"] = _ask_field(
"Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
if deepspeed_config["zero3_init_flag"]:
if not is_transformers_available():
raise Exception(
"When `zero3_init_flag` is set, it requires Transformers to be installed. "
"Please run `pip3 install transformers`."
)
if num_machines > 1:
launcher_query = "Which Type of launcher do you want to use "
for i, launcher in enumerate(DEEPSPEED_MULTINODE_LAUNCHERS):
launcher_query += f"[{i}] {launcher}, "
launcher_query = launcher_query[:-2] + ")? [0]: "
deepspeed_config["deepspeed_multinode_launcher"] = _ask_field(
launcher_query,
lambda x: DEEPSPEED_MULTINODE_LAUNCHERS[int(x)],
default=DEEPSPEED_MULTINODE_LAUNCHERS[0],
)
if deepspeed_config["deepspeed_multinode_launcher"] != DEEPSPEED_MULTINODE_LAUNCHERS[1]:
deepspeed_config["deepspeed_hostfile"] = _ask_field(
"DeepSpeed configures multi-node compute resources with hostfile. "
"Each row is of the format `hostname slots=[num_gpus]`, e.g., `localhost slots=2`; "
"for more information please refer official [documentation]"
"(https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node). "
"Please specify the location of hostfile: ",
lambda x: str(x),
)
is_exclusion_filter = _ask_field(
"Do you want to specify exclusion filter string? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
if is_exclusion_filter:
deepspeed_config["deepspeed_exclusion_filter"] = _ask_field(
"DeepSpeed exclusion filter string: ",
lambda x: str(x),
)
is_inclusion_filter = _ask_field(
"Do you want to specify inclusion filter string? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
if is_inclusion_filter:
deepspeed_config["deepspeed_inclusion_filter"] = _ask_field(
"DeepSpeed inclusion filter string: ",
lambda x: str(x),
)
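The questionnaire above maps onto `DeepSpeedPlugin`; as a hedged sketch (not part of the diff), the same choices can be made in code, with every value below purely illustrative and a DeepSpeed install plus a distributed launch assumed.

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Illustrative values mirroring the questionnaire keys above.
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,
    gradient_accumulation_steps=1,
    gradient_clipping=1.0,
    offload_optimizer_device="cpu",
    offload_param_device="cpu",
    zero3_init_flag=False,
    zero3_save_16bit_model=False,
)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)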
fsdp_config = {}
if distributed_type in [DistributedType.MULTI_GPU]:
@ -107,8 +206,12 @@ def get_cluster_input():
if use_fsdp:
distributed_type = DistributedType.FSDP
if distributed_type == DistributedType.FSDP:
sharding_strategy_query = "What should be your sharding strategy ("
for i, strategy in enumerate(FSDP_SHARDING_STRATEGY):
sharding_strategy_query += f"[{i+1}] {strategy}, "
sharding_strategy_query = sharding_strategy_query[:-2] + ")? [1]: "
fsdp_config["sharding_strategy"] = _ask_field(
"What should be your sharding strategy ([1] FULL_SHARD, [2] SHARD_GRAD_OP)? [1]: ",
sharding_strategy_query,
lambda x: int(x),
default=1,
)
@ -118,10 +221,34 @@ def get_cluster_input():
default=False,
error_message="Please enter yes or no.",
)
fsdp_config["min_num_params"] = _ask_field(
"What should be your FSDP's minimum number of parameters for Default Auto Wrapping Policy? [1e8]: ",
lambda x: int(x),
default=1e8,
fsdp_wrap_query = "What should be your auto wrap policy ("
for i, wrap_policy in enumerate(FSDP_AUTO_WRAP_POLICY):
fsdp_wrap_query += f"[{i}] {wrap_policy}, "
fsdp_wrap_query = fsdp_wrap_query[:-2] + ")? [0]: "
fsdp_config["fsdp_auto_wrap_policy"] = _ask_field(
fsdp_wrap_query,
lambda x: FSDP_AUTO_WRAP_POLICY[int(x)],
default=FSDP_AUTO_WRAP_POLICY[0],
)
if fsdp_config["fsdp_auto_wrap_policy"] == FSDP_AUTO_WRAP_POLICY[0]:
fsdp_config["transformer_layer_cls_to_wrap"] = _ask_field(
"What is the transformer layer class name (case-sensitive) to wrap ,e.g, `BertLayer`, `GPTJBlock`, `T5Block` ...? : ",
lambda x: str(x),
)
elif fsdp_config["fsdp_auto_wrap_policy"] == FSDP_AUTO_WRAP_POLICY[1]:
fsdp_config["min_num_params"] = _ask_field(
"What should be your FSDP's minimum number of parameters for Default Auto Wrapping Policy? [1e8]: ",
lambda x: int(x),
default=1e8,
)
fsdp_backward_prefetch_query = "What should be your FSDP's backward prefetch policy ("
for i, backward_prefetch_policy in enumerate(FSDP_BACKWARD_PREFETCH):
fsdp_backward_prefetch_query += f"[{i}] {backward_prefetch_policy}, "
fsdp_backward_prefetch_query = fsdp_backward_prefetch_query[:-2] + ")? [0]: "
fsdp_config["fsdp_backward_prefetch_policy"] = _ask_field(
fsdp_backward_prefetch_query,
lambda x: FSDP_BACKWARD_PREFETCH[int(x)],
default=FSDP_BACKWARD_PREFETCH[0],
)
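Like the DeepSpeed answers, the FSDP answers are forwarded as environment variables (see the launcher changes further down); a sketch (not in the diff) of what a launched process might see, with the policy names illustrative rather than authoritative.

import os

# Hypothetical values mirroring the questionnaire above; the launcher normally sets these.
os.environ["USE_FSDP"] = "true"
os.environ["FSDP_SHARDING_STRATEGY"] = "1"                      # 1 = FULL_SHARD
os.environ["FSDP_AUTO_WRAP_POLICY"] = "TRANSFORMER_BASED_WRAP"  # illustrative policy name
os.environ["FSDP_TRANSFORMER_CLS_TO_WRAP"] = "BertLayer"
os.environ["FSDP_BACKWARD_PREFETCH"] = "BACKWARD_PRE"           # illustrative policy name
os.environ["FSDP_OFFLOAD_PARAMS"] = "false"
os.environ["FSDP_MIN_NUM_PARAMS"] = "100000000"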
if distributed_type == DistributedType.TPU:
@ -155,11 +282,14 @@ def get_cluster_input():
num_processes = 1
if distributed_type != DistributedType.TPU:
mixed_precision = _ask_field(
"Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: ",
lambda x: str(x).lower(),
default="no",
)
if distributed_type == DistributedType.DEEPSPEED and use_deepspeed_config:
mixed_precision = "no"
else:
mixed_precision = _ask_field(
"Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: ",
lambda x: str(x).lower(),
default="no",
)
else:
mixed_precision = "no"

View File

@ -23,6 +23,7 @@ from typing import Optional, Union
import yaml
from ...utils import ComputeEnvironment, DistributedType, SageMakerDistributedType
from ...utils.constants import SAGEMAKER_PYTHON_VERSION, SAGEMAKER_PYTORCH_VERSION, SAGEMAKER_TRANSFORMERS_VERSION
hf_cache_home = os.path.expanduser(
@ -123,7 +124,10 @@ class BaseConfig:
if isinstance(self.compute_environment, str):
self.compute_environment = ComputeEnvironment(self.compute_environment)
if isinstance(self.distributed_type, str):
self.distributed_type = DistributedType(self.distributed_type)
if self.compute_environment == ComputeEnvironment.AMAZON_SAGEMAKER:
self.distributed_type = SageMakerDistributedType(self.distributed_type)
else:
self.distributed_type = DistributedType(self.distributed_type)
@dataclass
@ -152,9 +156,13 @@ class ClusterConfig(BaseConfig):
class SageMakerConfig(BaseConfig):
ec2_instance_type: str
iam_role_name: str
image_uri: str
profile: Optional[str] = None
region: str = "us-east-1"
num_machines: int = 1
base_job_name: str = f"accelerate-sagemaker-{num_machines}"
pytorch_version: str = "1.6"
transformers_version: str = "4.4"
pytorch_version: str = SAGEMAKER_PYTORCH_VERSION
transformers_version: str = SAGEMAKER_TRANSFORMERS_VERSION
py_version: str = SAGEMAKER_PYTHON_VERSION
sagemaker_inputs_file: str = None
sagemaker_metrics_file: str = None

View File

@ -16,10 +16,11 @@
import json
import os
from ...utils.constants import SAGEMAKER_PARALLEL_EC2_INSTANCES
from ...utils.dataclasses import ComputeEnvironment, SageMakerDistributedType
from ...utils.imports import is_boto3_available
from .config_args import SageMakerConfig
from .config_utils import _ask_field, _convert_sagemaker_distributed_mode
from .config_utils import _ask_field, _convert_sagemaker_distributed_mode, _convert_yes_no_to_bool
if is_boto3_available():
@ -119,24 +120,68 @@ def get_sagemaker_input():
print(f'Accelerate will create an iam role "{iam_role_name}" using the provided credentials')
_create_iam_role_for_sagemaker(iam_role_name)
is_custom_docker_image = _ask_field(
"Do you want to use custom Docker image? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
docker_image = None
if is_custom_docker_image:
docker_image = _ask_field("Enter your Docker image: ", lambda x: str(x).lower())
is_sagemaker_inputs_enabled = _ask_field(
"Do you want to provide SageMaker input channels with data locations? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
sagemaker_inputs_file = None
if is_sagemaker_inputs_enabled:
sagemaker_inputs_file = _ask_field(
"Enter the path to the SageMaker inputs TSV file with columns (channel_name, data_location): ",
lambda x: str(x).lower(),
)
is_sagemaker_metrics_enabled = _ask_field(
"Do you want to enable SageMaker metrics? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
sagemaker_metrics_file = None
if is_sagemaker_metrics_enabled:
sagemaker_metrics_file = _ask_field(
"Enter the path to the SageMaker metrics TSV file with columns (metric_name, metric_regex): ",
lambda x: str(x).lower(),
)
distributed_type = _ask_field(
"Which type of machine are you using? ([0] No distributed training, [1] data parallelism, [2] model parallelism): ",
"Which type of machine are you using? ([0] No distributed training, [1] data parallelism): ",
_convert_sagemaker_distributed_mode,
error_message="Please enter 0, 1 or 2",
error_message="Please enter 0 or 1",
)
# using the best two instances for single-gpu training or multi-gpu -> can turn into question to make it more diverse
ec2_instance_type = "ml.p3.2xlarge" if distributed_type == SageMakerDistributedType.NO else "ml.p3dn.24xlarge"
ec2_instance_query = "Which EC2 instance type you want to use for your training "
if distributed_type != SageMakerDistributedType.NO:
ec2_instance_query += "("
for i, instance_type in enumerate(SAGEMAKER_PARALLEL_EC2_INSTANCES):
ec2_instance_query += f"[{i}] {instance_type}, "
ec2_instance_query = ec2_instance_query[:-2] + ")? [0]: "
ec2_instance_type = _ask_field(ec2_instance_query, lambda x: SAGEMAKER_PARALLEL_EC2_INSTANCES[int(x)])
else:
ec2_instance_query += "? [ml.p3.2xlarge]:"
ec2_instance_type = _ask_field(ec2_instance_query, lambda x: str(x).lower(), default="ml.p3.2xlarge")
num_machines = 1
if (
distributed_type == SageMakerDistributedType.DATA_PARALLEL
or distributed_type == SageMakerDistributedType.MODEL_PARALLEL
):
raise NotImplementedError("Model or Data Parallelism is not implemented yet. We are working on it")
num_machines = _ask_field(
"How many machines do you want use? [2]: ",
"How many machines do you want use? [1]: ",
lambda x: int(x),
default=2,
default=1,
)
mixed_precision = _ask_field(
@ -146,12 +191,16 @@ def get_sagemaker_input():
)
return SageMakerConfig(
image_uri=docker_image,
compute_environment=ComputeEnvironment.AMAZON_SAGEMAKER,
distributed_type=distributed_type,
use_cpu=False,
ec2_instance_type=ec2_instance_type,
profile=aws_profile,
region=aws_region,
iam_role_name=iam_role_name,
mixed_precision=mixed_precision,
num_machines=num_machines,
sagemaker_inputs_file=sagemaker_inputs_file,
sagemaker_metrics_file=sagemaker_metrics_file,
)

View File

@ -31,9 +31,12 @@ from accelerate.utils import (
DistributedType,
PrecisionType,
PrepareForLaunch,
get_launch_prefix,
is_deepspeed_available,
is_sagemaker_available,
)
from accelerate.utils.versions import is_torch_version
from accelerate.utils.constants import DEEPSPEED_MULTINODE_LAUNCHERS
from accelerate.utils.dataclasses import SageMakerDistributedType
def launch_command_parser(subparsers=None):
@ -57,6 +60,80 @@ def launch_command_parser(subparsers=None):
action="store_true",
help="Whether to use deepspeed.",
)
parser.add_argument(
"--deepspeed_config_file",
default=None,
type=str,
help="DeepSpeed config file.",
)
parser.add_argument(
"--zero_stage",
default=None,
type=int,
help="DeepSpeed's ZeRO optimization stage (useful only when `use_deepspeed` flag is passed).",
)
parser.add_argument(
"--offload_optimizer_device",
default=None,
type=str,
help="Decides where (none|cpu|nvme) to offload optimizer states (useful only when `use_deepspeed` flag is passed).",
)
parser.add_argument(
"--offload_param_device",
default=None,
type=str,
help="Decides where (none|cpu|nvme) to offload parameters (useful only when `use_deepspeed` flag is passed).",
)
parser.add_argument(
"--gradient_accumulation_steps",
default=None,
type=int,
help="No of gradient_accumulation_steps used in your training script (useful only when `use_deepspeed` flag is passed).",
)
parser.add_argument(
"--gradient_clipping",
default=None,
type=float,
help="gradient clipping value used in your training script (useful only when `use_deepspeed` flag is passed).",
)
parser.add_argument(
"--zero3_init_flag",
default=None,
type=str,
help="Decides Whether (true|false) to enable `deepspeed.zero.Init` for constructing massive models. "
"Only applicable with DeepSpeed ZeRO Stage-3.",
)
parser.add_argument(
"--zero3_save_16bit_model",
default=None,
type=str,
help="Decides Whether (true|false) to save 16-bit model weights when using ZeRO Stage-3. "
"Only applicable with DeepSpeed ZeRO Stage-3.",
)
parser.add_argument(
"--deepspeed_hostfile",
default=None,
type=str,
help="DeepSpeed hostfile for configuring multi-node compute resources.",
)
parser.add_argument(
"--deepspeed_exclusion_filter",
default=None,
type=str,
help="DeepSpeed exclusion filter string when using mutli-node setup.",
)
parser.add_argument(
"--deepspeed_inclusion_filter",
default=None,
type=str,
help="DeepSpeed inclusion filter string when using mutli-node setup.",
)
parser.add_argument(
"--deepspeed_multinode_launcher",
default=None,
type=str,
help="DeepSpeed multi-node launcher to use.",
)
parser.add_argument(
"--use_fsdp",
default=False,
@ -81,6 +158,25 @@ def launch_command_parser(subparsers=None):
default=1,
help="FSDP's Sharding Strategy. (useful only when `use_fsdp` flag is passed).",
)
parser.add_argument(
"--fsdp_auto_wrap_policy",
type=str,
default=None,
help="FSDP's auto wrap policy. (useful only when `use_fsdp` flag is passed).",
)
parser.add_argument(
"--transformer_layer_cls_to_wrap",
default=None,
type=str,
help="Transformer layer class name (case-sensitive) to wrap ,e.g, `BertLayer`, `GPTJBlock`, `T5Block` .... "
"(useful only when `use_fsdp` flag is passed).",
)
parser.add_argument(
"--fsdp_backward_prefetch_policy",
default=None,
type=str,
help="FSDP's backward prefetch policy. (useful only when `use_fsdp` flag is passed).",
)
parser.add_argument(
"--tpu", default=False, action="store_true", help="Whether or not this should launch a TPU training."
)
@ -158,24 +254,6 @@ def launch_command_parser(subparsers=None):
"script."
),
)
parser.add_argument(
"--zero_stage",
default=None,
type=int,
help="DeepSpeed's ZeRO optimization stage (useful only when `use_deepspeed` flag is passed).",
)
parser.add_argument(
"--offload_optimizer_device",
default=None,
type=str,
help="Decides where (none|cpu|nvme) to offload optimizer states (useful only when `use_deepspeed` flag is passed).",
)
parser.add_argument(
"--gradient_accumulation_steps",
default=None,
type=int,
help="No of gradient_accumulation_steps used in your training script (useful only when `use_deepspeed` flag is passed).",
)
# Other arguments of the training scripts
parser.add_argument("training_script_args", nargs=argparse.REMAINDER, help="Arguments of the training script.")
@ -218,12 +296,7 @@ def simple_launcher(args):
def multi_gpu_launcher(args):
if is_torch_version(">=", "1.10.0"):
cmd = ["torchrun"]
elif is_torch_version(">=", "1.9.0"):
cmd = [sys.executable, "-m", "torch.distributed.run"]
else:
cmd = [sys.executable, "-m", "torch.distributed.launch", "--use_env"]
cmd = get_launch_prefix()
if args.num_machines > 1:
cmd.extend(
[
@ -268,9 +341,12 @@ def multi_gpu_launcher(args):
current_env["MIXED_PRECISION"] = str(mixed_precision)
if args.use_fsdp:
current_env["USE_FSDP"] = "true"
current_env["FSDP_AUTO_WRAP_POLICY"] = str(args.fsdp_auto_wrap_policy)
current_env["FSDP_TRANSFORMER_CLS_TO_WRAP"] = str(args.transformer_layer_cls_to_wrap)
current_env["FSDP_OFFLOAD_PARAMS"] = str(args.offload_params).lower()
current_env["FSDP_MIN_NUM_PARAMS"] = str(args.min_num_params)
current_env["FSDP_SHARDING_STRATEGY"] = str(args.sharding_strategy)
current_env["FSDP_BACKWARD_PREFETCH"] = str(args.fsdp_backward_prefetch_policy)
current_env["OMP_NUM_THREADS"] = str(args.num_cpu_threads_per_process)
process = subprocess.Popen(cmd, env=current_env)
process.wait()
@ -279,22 +355,46 @@ def multi_gpu_launcher(args):
def deepspeed_launcher(args):
if not is_deepspeed_available():
raise ImportError("DeepSpeed is not installed => run `pip3 install deepspeed` or build it from source.")
cmd = ["deepspeed", "--no_local_rank"]
if args.num_machines > 1:
cmd.extend(
[
"--num_gpus",
str(args.num_processes // args.num_machines),
"--num_nodes",
str(args.num_machines),
"--node_rank",
str(args.machine_rank),
"--master_addr",
args.main_process_ip,
"--master_port",
str(args.main_process_port),
]
)
if args.deepspeed_multinode_launcher == DEEPSPEED_MULTINODE_LAUNCHERS[1]:
cmd = get_launch_prefix()
cmd.extend(
[
"--nproc_per_node",
str(args.num_processes // args.num_machines),
"--nnodes",
str(args.num_machines),
"--node_rank",
str(args.machine_rank),
"--master_addr",
args.main_process_ip,
"--master_port",
str(args.main_process_port),
]
)
else:
cmd.extend(
["--hostfile", str(args.deepspeed_hostfile), "--launcher", str(args.deepspeed_multinode_launcher)]
)
if args.deepspeed_exclusion_filter is not None:
cmd.extend(
[
"--exclude",
str(args.deepspeed_exclusion_filter),
]
)
elif args.deepspeed_inclusion_filter is not None:
cmd.extend(
[
"--include",
str(args.deepspeed_inclusion_filter),
]
)
else:
cmd.extend(["--num_gpus", str(args.num_processes // args.num_machines)])
else:
cmd.extend(["--num_gpus", str(args.num_processes)])
@ -319,11 +419,24 @@ def deepspeed_launcher(args):
warnings.warn('--fp16 flag is deprecated. Use "--mixed_precision fp16" instead.', DeprecationWarning)
mixed_precision = "fp16"
current_env["PYTHONPATH"] = sys.executable
current_env["MIXED_PRECISION"] = str(mixed_precision)
current_env["USE_DEEPSPEED"] = "true"
current_env["DEEPSPEED_ZERO_STAGE"] = str(args.zero_stage)
current_env["GRADIENT_ACCUMULATION_STEPS"] = str(args.gradient_accumulation_steps)
current_env["GRADIENT_CLIPPING"] = str(args.gradient_clipping).lower()
current_env["DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE"] = str(args.offload_optimizer_device).lower()
current_env["DEEPSPEED_OFFLOAD_PARAM_DEVICE"] = str(args.offload_param_device).lower()
current_env["DEEPSPEED_ZERO3_INIT"] = str(args.zero3_init_flag).lower()
current_env["DEEPSPEED_ZERO3_SAVE_16BIT_MODEL"] = str(args.zero3_save_16bit_model).lower()
current_env["DEEPSPEED_CONFIG_FILE"] = str(args.deepspeed_config_file).lower()
if args.num_machines > 1 and args.deepspeed_multinode_launcher != DEEPSPEED_MULTINODE_LAUNCHERS[1]:
with open(".deepspeed_env", "a") as f:
for key, value in current_env.items():
if ";" in value or " " in value:
continue
f.write(f"{key}={value}\n")
process = subprocess.Popen(cmd, env=current_env)
process.wait()
@ -451,19 +564,56 @@ def sagemaker_launcher(sagemaker_config: SageMakerConfig, args):
mixed_precision = "fp16"
# Environment variables to be set for use during training job
environment = {"MIXED_PRECISION": str(mixed_precision)}
environment = {
"USE_SAGEMAKER": "true",
"MIXED_PRECISION": str(mixed_precision),
"SAGEMAKER_DISTRIBUTED_TYPE": sagemaker_config.distributed_type.value,
}
# configure distribution set up
distribution = None # TODO: not yet implemented
distribution = None
if sagemaker_config.distributed_type == SageMakerDistributedType.DATA_PARALLEL:
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}
# configure sagemaker inputs
sagemaker_inputs = None
if sagemaker_config.sagemaker_inputs_file is not None:
print(f"Loading SageMaker Inputs from {sagemaker_config.sagemaker_inputs_file} file")
sagemaker_inputs = {}
with open(sagemaker_config.sagemaker_inputs_file) as file:
for i, line in enumerate(file):
if i == 0:
continue
l = line.split("\t")
sagemaker_inputs[l[0]] = l[1].strip()
print(f"Loaded SageMaker Inputs: {sagemaker_inputs}")
# configure sagemaker metrics
sagemaker_metrics = None
if sagemaker_config.sagemaker_metrics_file is not None:
print(f"Loading SageMaker Metrics from {sagemaker_config.sagemaker_metrics_file} file")
sagemaker_metrics = []
with open(sagemaker_config.sagemaker_metrics_file) as file:
for i, line in enumerate(file):
if i == 0:
continue
l = line.split("\t")
metric_dict = {
"Name": l[0],
"Regex": l[1].strip(),
}
sagemaker_metrics.append(metric_dict)
print(f"Loaded SageMaker Metrics: {sagemaker_metrics}")
# configure session
print("Creating Estimator")
huggingface_estimator = HuggingFace(
image_uri=sagemaker_config.image_uri,
entry_point=entry_point,
source_dir=source_dir,
role=sagemaker_config.iam_role_name,
transformers_version="4.4",
pytorch_version="1.6",
py_version="py36",
transformers_version=sagemaker_config.transformers_version,
pytorch_version=sagemaker_config.pytorch_version,
py_version=sagemaker_config.py_version,
base_job_name=sagemaker_config.base_job_name,
instance_count=sagemaker_config.num_machines,
instance_type=sagemaker_config.ec2_instance_type,
@ -471,9 +621,10 @@ def sagemaker_launcher(sagemaker_config: SageMakerConfig, args):
distribution=distribution,
hyperparameters=hyperparameters,
environment=environment,
metric_definitions=sagemaker_metrics,
)
huggingface_estimator.fit()
huggingface_estimator.fit(inputs=sagemaker_inputs)
print(f"You can find your model data at: {huggingface_estimator.model_data}")

View File

@ -43,7 +43,7 @@ def test_command_parser(subparsers=None):
def test_command(args):
script_name = os.path.sep.join(__file__.split(os.path.sep)[:-2] + ["test_utils", "test_script.py"])
script_name = os.path.sep.join(__file__.split(os.path.sep)[:-2] + ["test_utils", "scripts", "test_script.py"])
test_args = f"""
--config_file={args.config_file} {script_name}

View File

@ -18,7 +18,8 @@ from typing import List, Optional, Union
import torch
from torch.utils.data import BatchSampler, DataLoader, IterableDataset
from .state import AcceleratorState, DistributedType, is_tpu_available
from .logging import get_logger
from .state import AcceleratorState, DistributedType, GradientState, is_tpu_available
from .utils import (
RNGType,
broadcast,
@ -34,9 +35,26 @@ from .utils import (
)
if is_tpu_available():
import torch_xla.core.xla_model as xm
if is_tpu_available(check_device=False):
import torch_xla.distributed.parallel_loader as xpl
class MpDeviceLoaderWrapper(xpl.MpDeviceLoader):
"""
Wrapper for the xpl.MpDeviceLoader class that knows the total batch size.
**Available attributes:**
- **total_batch_size** (`int`) -- Total batch size of the dataloader across all processes.
Equal to the original batch size when `split_batches=True`; otherwise the original batch size * the total
number of processes
"""
@property
def total_batch_size(self):
return self._loader.total_batch_size
logger = get_logger(__name__)
# kwargs of the DataLoader in min version 1.4.0.
_PYTORCH_DATALOADER_KWARGS = {
@ -287,6 +305,12 @@ class DataLoaderShard(DataLoader):
A random number generator to keep synchronized across processes.
kwargs:
All other keyword arguments to pass to the regular `DataLoader` initialization.
**Available attributes:**
- **total_batch_size** (`int`) -- Total batch size of the dataloader across all processes.
Equal to the original batch size when `split_batches=True`; otherwise the original batch size * the total
number of processes
"""
def __init__(self, dataset, device=None, rng_types=None, generator=None, **kwargs):
@ -294,34 +318,58 @@ class DataLoaderShard(DataLoader):
self.device = device
self.rng_types = rng_types
self.generator = generator
self.gradient_state = GradientState()
def __iter__(self):
if self.rng_types is not None:
synchronize_rng_states(self.rng_types, self.generator)
state = AcceleratorState()
for batch in super().__iter__():
if state.distributed_type == DistributedType.TPU:
xm.mark_step()
yield batch if self.device is None else send_to_device(batch, self.device)
self.gradient_state._set_end_of_dataloader(False)
dataloader_iter = super().__iter__()
# We iterate one batch ahead to check when we are at the end
try:
current_batch = next(dataloader_iter)
except StopIteration:
yield
while True:
try:
# But we still move it to the device so it is done before `StopIteration` is reached
if self.device is not None:
current_batch = send_to_device(current_batch, self.device)
next_batch = next(dataloader_iter)
yield current_batch
current_batch = next_batch
except StopIteration:
self.gradient_state._set_end_of_dataloader(True)
yield current_batch
break
@property
def total_batch_size(self):
return (
self.batch_sampler.batch_size
if self.batch_sampler.split_batches
else (self.batch_sampler.batch_size * self.batch_sampler.num_processes)
)
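Sketch (not part of the diff) showing the new `total_batch_size` attribute; it assumes the script is started through `accelerate launch` with more than one process, since the sharded batch sampler is only installed in that case.

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
dataloader = DataLoader(TensorDataset(torch.arange(64, dtype=torch.float32)), batch_size=8)
dataloader = accelerator.prepare(dataloader)
# With N processes and split_batches=False this prints 8 * N; with split_batches=True it stays 8.
print(dataloader.total_batch_size)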
class DataLoaderDispatcher(DataLoader):
"""
Args:
Subclass of a PyTorch `DataLoader` that will iterate and preprocess on process 0 only, then dispatch on each
process their part of the batch.
Args:
split_batches (`bool`, *optional*, defaults to `False`):
Whether the resulting `DataLoader` should split the batches of the original data loader across devices or
yield full batches (in which case it will yield batches starting at the `process_index`-th and advancing of
`num_processes` batches at each iteration).
`num_processes` batches at each iteration). Another way to see this is that the observed batch size will be
the same as the initial `dataloader` if this option is set to `True`, the batch size of the initial
`dataloader` multiplied by `num_processes` otherwise. Setting this option to `True` requires that the batch
size of the `dataloader` is a round multiple of `batch_size`.
Another way to see this is that the observed batch size will be the same as the initial `dataloader` if
this option is set to `True`, the batch size of the initial `dataloader` multiplied by `num_processes`
otherwise.
**Available attributes:**
Setting this option to `True` requires that the batch size of the `dataloader` is a round multiple of
`batch_size`.
- **total_batch_size** (`int`) -- Total batch size of the dataloader across all processes.
Equal to the original batch size when `split_batches=True`; otherwise the original batch size * the total
number of processes
"""
def __init__(self, dataset, split_batches: bool = False, **kwargs):
@ -341,85 +389,105 @@ class DataLoaderDispatcher(DataLoader):
if shuffle:
torch.utils.data.graph_settings.apply_shuffle_settings(dataset, shuffle=shuffle)
self.gradient_state = GradientState()
self.state = AcceleratorState()
def _fetch_batches(self, iterator):
batches, batch = None, None
# On process 0, we gather the batch to dispatch.
if self.state.process_index == 0:
try:
if self.split_batches:
# One batch of the main iterator is dispatched and split.
batch = next(iterator)
else:
# num_processes batches of the main iterator are concatenated then dispatched and split.
# We add the batches one by one so we have the remainder available when drop_last=False.
batches = []
for _ in range(self.state.num_processes):
batches.append(next(iterator))
batch = concatenate(batches, dim=0)
# In both cases, we need to get the structure of the batch that we will broadcast on other
# processes to initialize the tensors with the right shape.
# data_structure, stop_iteration
batch_info = [get_data_structure(batch), False]
except StopIteration:
batch_info = [None, True]
else:
batch_info = [None, self._stop_iteration]
# This is inplace, so after this instruction, every process has the same `batch_info` as process 0.
broadcast_object_list(batch_info)
self._stop_iteration = batch_info[1]
if self._stop_iteration:
# If drop_last is False and split_batches is False, we may have a remainder to take care of.
if not self.split_batches and not self.drop_last:
if self.state.process_index == 0 and len(batches) > 0:
batch = concatenate(batches, dim=0)
batch_info = [get_data_structure(batch), False]
else:
batch_info = [None, True]
broadcast_object_list(batch_info)
if batch_info[1]:
return batch, batch_info, True
else:
return batch, batch_info, True
return batch, batch_info, False
def __iter__(self):
state = AcceleratorState()
if state.process_index == 0:
self.gradient_state._set_end_of_dataloader(False)
main_iterator = None
if self.state.process_index == 0:
# We only iterate through the DataLoader on process 0.
main_iterator = super().__iter__()
stop_iteration = False
self._stop_iteration = False
first_batch = None
while not stop_iteration:
# On process 0, we gather the batch to dispatch.
if state.process_index == 0:
try:
if self.split_batches:
# One batch of the main iterator is dispatched and split.
batch = next(main_iterator)
else:
# num_processes batches of the main iterator are concatenated then dispatched and split.
# We add the batches one by one so we have the remainder available when drop_last=False.
batches = []
for _ in range(state.num_processes):
batches.append(next(main_iterator))
batch = concatenate(batches, dim=0)
# In both cases, we need to get the structure of the batch that we will broadcast on other
# processes to initialize the tensors with the right shape.
# data_structure, stop_iteration
batch_info = [get_data_structure(batch), False]
except StopIteration:
batch_info = [None, True]
else:
batch_info = [None, stop_iteration]
# This is inplace, so after this instruction, every process has the same `batch_info` as process 0.
broadcast_object_list(batch_info)
stop_iteration = batch_info[1]
if stop_iteration:
# If drop_last is False and split_batches is False, we may have a remainder to take care of.
if not self.split_batches and not self.drop_last:
if state.process_index == 0 and len(batches) > 0:
batch = concatenate(batches, dim=0)
batch_info = [get_data_structure(batch), False]
else:
batch_info = [None, True]
broadcast_object_list(batch_info)
if batch_info[1]:
continue
else:
continue
if state.process_index != 0:
batch, batch_info, skip = self._fetch_batches(main_iterator)
while True:
if skip:
continue
if self.state.process_index != 0:
# Initialize tensors on other processes than process 0.
batch = initialize_tensors(batch_info[0])
batch = send_to_device(batch, state.device)
batch = send_to_device(batch, self.state.device)
# Broadcast the batch before splitting it.
batch = broadcast(batch, from_process=0)
if not self.drop_last and first_batch is None:
# We keep at least num processes elements of the first batch to be able to complete the last batch
first_batch = slice_tensors(batch, slice(0, state.num_processes))
first_batch = slice_tensors(batch, slice(0, self.state.num_processes))
observed_batch_size = find_batch_size(batch)
batch_size = observed_batch_size // state.num_processes
batch_size = observed_batch_size // self.state.num_processes
if not self.drop_last and stop_iteration and observed_batch_size % state.num_processes != 0:
if not self.drop_last and self._stop_iteration and observed_batch_size % self.state.num_processes != 0:
# If the last batch is not complete, let's add the first batch to it.
batch = concatenate([batch, first_batch], dim=0)
batch_size += 1
data_slice = slice(state.process_index * batch_size, (state.process_index + 1) * batch_size)
if state.distributed_type == DistributedType.TPU:
xm.mark_step()
yield slice_tensors(batch, data_slice)
data_slice = slice(self.state.process_index * batch_size, (self.state.process_index + 1) * batch_size)
next_batch, next_batch_info, next_skip = self._fetch_batches(main_iterator)
if not self._stop_iteration:
yield slice_tensors(batch, data_slice)
batch, batch_info, skip = next_batch, next_batch_info, next_skip
else:
self.gradient_state._set_end_of_dataloader(True)
yield slice_tensors(batch, data_slice)
break
def __len__(self):
state = AcceleratorState()
whole_length = super().__len__()
if self.drop_last:
return whole_length // state.num_processes
if self.split_batches:
return whole_length
elif self.drop_last:
return whole_length // self.state.num_processes
else:
return math.ceil(whole_length / state.num_processes)
return math.ceil(whole_length / self.state.num_processes)
@property
def total_batch_size(self):
return (
self.dataset.batch_size if self.split_batches else (self.dataset.batch_size * self.dataset.num_processes)
)
def prepare_data_loader(
@ -565,15 +633,22 @@ def prepare_data_loader(
kwargs["batch_size"] = dataloader.batch_size // num_processes if split_batches else dataloader.batch_size
if dispatch_batches:
return DataLoaderDispatcher(
new_dataset, split_batches=split_batches, batch_sampler=new_batch_sampler, **kwargs
dataloader = DataLoaderDispatcher(
new_dataset,
split_batches=split_batches,
batch_sampler=new_batch_sampler,
**kwargs,
)
else:
dataloader = DataLoaderShard(
new_dataset,
device=device if put_on_device and state.distributed_type != DistributedType.TPU else None,
batch_sampler=new_batch_sampler,
rng_types=rng_types,
generator=generator,
**kwargs,
)
return DataLoaderShard(
new_dataset,
device=device if put_on_device else None,
batch_sampler=new_batch_sampler,
rng_types=rng_types,
generator=generator,
**kwargs,
)
if state.distributed_type == DistributedType.TPU:
return MpDeviceLoaderWrapper(dataloader, device)
return dataloader

View File

@ -13,7 +13,7 @@
# limitations under the License.
import functools
from typing import Dict, Mapping, Optional, Union
from typing import Dict, List, Mapping, Optional, Union
import torch
import torch.nn as nn
@ -270,7 +270,11 @@ class AlignDevicesHook(ModelHook):
set_module_tensor_to_device(module, name, device, value=self.weights_map.get(name, None))
def attach_execution_device_hook(module: torch.nn.Module, execution_device: Union[int, str, torch.device]):
def attach_execution_device_hook(
module: torch.nn.Module,
execution_device: Union[int, str, torch.device],
preload_module_classes: Optional[List[str]] = None,
):
"""
Recursively attaches `AlignDevicesHook` to all submodules of a given model to make sure they have the right
execution device
@ -280,10 +284,19 @@ def attach_execution_device_hook(module: torch.nn.Module, execution_device: Unio
The module where we want to attach the hooks.
execution_device (`int`, `str` or `torch.device`):
The device on which inputs and model weights should be placed before the forward pass.
preload_module_classes (`List[str]`, *optional*):
A list of classes whose instances should load all their weights (even in the submodules) at the beginning
of the forward. This should only be used for classes that have submodules which are registered but not
called directly during the forward, for instance if a `dense` linear layer is registered, but at forward,
`dense.weight` and `dense.bias` are used in some operations instead of calling `dense` directly.
"""
if not hasattr(module, "_hf_hook") and len(module.state_dict()) > 0:
add_hook_to_module(module, AlignDevicesHook(execution_device))
# Break the recursion if we get to a preload module.
if preload_module_classes is not None and module.__class__.__name__ in preload_module_classes:
return
for child in module.children():
attach_execution_device_hook(child, execution_device)
@ -295,6 +308,7 @@ def attach_align_device_hook(
weights_map: Optional[Mapping] = None,
offload_buffers: bool = False,
module_name: str = "",
preload_module_classes: Optional[List[str]] = None,
):
"""
Recursively attaches `AlignDevicesHook` to all submodules of a given model that have direct parameters and/or
@ -313,10 +327,19 @@ def attach_align_device_hook(
Whether or not to include the associated module's buffers when offloading.
module_name (`str`, *optional*, defaults to `""`):
The name of the module.
preload_module_classes (`List[str]`, *optional*):
A list of classes whose instances should load all their weights (even in the submodules) at the beginning
of the forward. This should only be used for classes that have submodules which are registered but not
called directly during the forward, for instance if a `dense` linear layer is registered, but at forward,
`dense.weight` and `dense.bias` are used in some operations instead of calling `dense` directly.
"""
# Attach the hook on this module if it has any direct tensor.
directs = named_module_tensors(module)
if len(list(directs)) > 0:
full_offload = (
offload and preload_module_classes is not None and module.__class__.__name__ in preload_module_classes
)
if len(list(directs)) > 0 or full_offload:
if weights_map is not None:
prefix = f"{module_name}." if len(module_name) > 0 else ""
prefixed_weights_map = PrefixedDataset(weights_map, prefix)
@ -327,9 +350,14 @@ def attach_align_device_hook(
offload=offload,
weights_map=prefixed_weights_map,
offload_buffers=offload_buffers,
place_submodules=full_offload,
)
add_hook_to_module(module, hook)
# We stop the recursion in case we hit the full offload.
if full_offload:
return
# Recurse on all children of the module.
for child_name, child in module.named_children():
child_name = f"{module_name}.{child_name}" if len(module_name) > 0 else child_name
@ -340,6 +368,7 @@ def attach_align_device_hook(
weights_map=weights_map,
offload_buffers=offload_buffers,
module_name=child_name,
preload_module_classes=preload_module_classes,
)
@ -362,6 +391,7 @@ def attach_align_device_hook_on_blocks(
weights_map: Mapping = None,
offload_buffers: bool = False,
module_name: str = "",
preload_module_classes: Optional[List[str]] = None,
):
"""
Attaches `AlignDevicesHook` to all blocks of a given model as needed.
@ -381,6 +411,11 @@ def attach_align_device_hook_on_blocks(
Whether or not to include the associated module's buffers when offloading.
module_name (`str`, *optional*, defaults to `""`):
The name of the module.
preload_module_classes (`List[str]`, *optional*):
A list of classes whose instances should load all their weights (even in the submodules) at the beginning
of the forward. This should only be used for classes that have submodules which are registered but not
called directly during the forward, for instance if a `dense` linear layer is registered, but at forward,
`dense.weight` and `dense.bias` are used in some operations instead of calling `dense` directly.
"""
# If one device and one offload, we've got one hook.
if not isinstance(execution_device, Mapping) and not isinstance(offload, dict):
@ -399,7 +434,7 @@ def attach_align_device_hook_on_blocks(
return
if not isinstance(execution_device, Mapping):
execution_device = {key: offload for key in offload.keys()}
execution_device = {key: execution_device for key in offload.keys()}
if not isinstance(offload, Mapping):
offload = {key: offload for key in execution_device.keys()}
@ -413,21 +448,21 @@ def attach_align_device_hook_on_blocks(
add_hook_to_module(module, hook)
attach_execution_device_hook(module, execution_device[module_name])
elif module_name in execution_device:
if weights_map is not None:
prefix = f"{module_name}." if len(module_name) > 0 else ""
prefixed_weights_map = PrefixedDataset(weights_map, prefix)
else:
prefixed_weights_map = None
hook = AlignDevicesHook(
attach_align_device_hook(
module,
execution_device=execution_device[module_name],
offload=True,
weights_map=prefixed_weights_map,
weights_map=weights_map,
offload_buffers=offload_buffers,
io_same_device=(module_name == ""),
place_submodules=True,
module_name=module_name,
preload_module_classes=preload_module_classes,
)
if not hasattr(module, "_hf_hook"):
hook = AlignDevicesHook(execution_device=execution_device[module_name], io_same_device=(module_name == ""))
add_hook_to_module(module, hook)
attach_execution_device_hook(
module, execution_device[module_name], preload_module_classes=preload_module_classes
)
attach_execution_device_hook(module, execution_device[module_name])
add_hook_to_module(module, hook)
elif module_name == "":
hook = AlignDevicesHook(io_same_device=True)
add_hook_to_module(module, hook)
@ -441,4 +476,5 @@ def attach_align_device_hook_on_blocks(
weights_map=weights_map,
offload_buffers=offload_buffers,
module_name=child_name,
preload_module_classes=preload_module_classes,
)

View File

@ -50,6 +50,13 @@ def notebook_launcher(function, args=(), num_processes=None, use_fp16=False, mix
else:
in_colab_or_kaggle = False
try:
mixed_precision = PrecisionType(mixed_precision.lower())
except ValueError:
raise ValueError(
f"Unknown mixed_precision mode: {args.mixed_precision.lower()}. Choose between {PrecisionType.list()}."
)
if in_colab_or_kaggle:
if os.environ.get("TPU_NAME", None) is not None:
# TPU launch
@ -72,7 +79,7 @@ def notebook_launcher(function, args=(), num_processes=None, use_fp16=False, mix
if torch.cuda.is_available():
print("Launching training on one GPU.")
else:
print("Launching training on CPU.")
print("Launching training on one CPU.")
function(*args)
else:
@ -105,13 +112,6 @@ def notebook_launcher(function, args=(), num_processes=None, use_fp16=False, mix
"function."
)
try:
mixed_precision = PrecisionType(mixed_precision.lower())
except ValueError:
raise ValueError(
f"Unknown mixed_precision mode: {args.mixed_precision.lower()}. Choose between {PrecisionType.list()}."
)
if use_fp16:
warnings.warn('use_fp16=True is deprecated. Use mixed_precision="fp16" instead.', DeprecationWarning)
mixed_precision = "fp16"

View File

@ -17,11 +17,11 @@ import warnings
import torch
from .state import AcceleratorState
from .state import AcceleratorState, GradientState
from .utils import DistributedType, honor_type, is_torch_version, is_tpu_available
if is_tpu_available():
if is_tpu_available(check_device=False):
import torch_xla.core.xla_model as xm
@ -39,6 +39,9 @@ class AcceleratedOptimizer(torch.optim.Optimizer):
"""
Internal wrapper around a torch optimizer.
Conditionally will perform `step` and `zero_grad` if gradients should be synchronized when performing gradient
accumulation.
Args:
optimizer (`torch.optim.optimizer.Optimizer`):
The optimizer to wrap.
@ -53,6 +56,7 @@ class AcceleratedOptimizer(torch.optim.Optimizer):
self.optimizer = optimizer
self.scaler = scaler
self.accelerator_state = AcceleratorState()
self.gradient_state = GradientState()
self.device_placement = device_placement
self._is_overflow = False
@ -101,37 +105,39 @@ class AcceleratedOptimizer(torch.optim.Optimizer):
return self.optimizer.state_dict()
def zero_grad(self, set_to_none=None):
if is_torch_version("<", "1.7.0"):
if set_to_none is not None:
raise ValueError(
"`set_to_none` for Optimizer.zero_grad` was introduced in PyTorch 1.7.0 and can't be used for "
f"earlier versions (found version {torch.__version__})."
)
self.optimizer.zero_grad()
else:
accept_arg = "set_to_none" in inspect.signature(self.optimizer.zero_grad).parameters
if accept_arg:
if set_to_none is None:
set_to_none = False
self.optimizer.zero_grad(set_to_none=set_to_none)
else:
if self.gradient_state.sync_gradients:
if is_torch_version("<", "1.7.0"):
if set_to_none is not None:
raise ValueError("`set_to_none` for Optimizer.zero_grad` is not supported by this optimizer.")
raise ValueError(
"`set_to_none` for Optimizer.zero_grad` was introduced in PyTorch 1.7.0 and can't be used for "
f"earlier versions (found version {torch.__version__})."
)
self.optimizer.zero_grad()
else:
accept_arg = "set_to_none" in inspect.signature(self.optimizer.zero_grad).parameters
if accept_arg:
if set_to_none is None:
set_to_none = False
self.optimizer.zero_grad(set_to_none=set_to_none)
else:
if set_to_none is not None:
raise ValueError("`set_to_none` for Optimizer.zero_grad` is not supported by this optimizer.")
self.optimizer.zero_grad()
def step(self, closure=None):
if self.accelerator_state.distributed_type == DistributedType.TPU:
optimizer_args = {"closure": closure} if closure is not None else {}
xm.optimizer_step(self.optimizer, optimizer_args=optimizer_args)
elif self.scaler is not None:
scale_before = self.scaler.get_scale()
self.scaler.step(self.optimizer, closure)
self.scaler.update()
scale_after = self.scaler.get_scale()
# If we reduced the loss scale, it means the optimizer step was skipped because of gradient overflow.
self._is_overflow = scale_after < scale_before
else:
self.optimizer.step(closure)
if self.gradient_state.sync_gradients:
if self.accelerator_state.distributed_type == DistributedType.TPU:
optimizer_args = {"closure": closure} if closure is not None else {}
xm.optimizer_step(self.optimizer, optimizer_args=optimizer_args)
elif self.scaler is not None:
scale_before = self.scaler.get_scale()
self.scaler.step(self.optimizer, closure)
self.scaler.update()
scale_after = self.scaler.get_scale()
# If we reduced the loss scale, it means the optimizer step was skipped because of gradient overflow.
self._is_overflow = scale_after < scale_before
else:
self.optimizer.step(closure)
def _switch_parameters(self, parameters_map):
for param_group in self.optimizer.param_groups:
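A compressed sketch of the gating logic added above, assuming a shared-state flag similar to the GradientState introduced in this release; illustrative only, not the library implementation.

import torch

class _GradientState:
    # simplified stand-in for accelerate.state.GradientState (all instances share state)
    _shared_state = {}

    def __init__(self):
        self.__dict__ = self._shared_state
        self.__dict__.setdefault("sync_gradients", True)

class _GatedOptimizer:
    # step/zero_grad become no-ops while gradients are still being accumulated
    def __init__(self, optimizer):
        self.optimizer = optimizer
        self.gradient_state = _GradientState()

    def step(self):
        if self.gradient_state.sync_gradients:
            self.optimizer.step()

    def zero_grad(self):
        if self.gradient_state.sync_gradients:
            self.optimizer.zero_grad()

opt = _GatedOptimizer(torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=0.1))
_GradientState().sync_gradients = False
opt.step()  # skipped: gradients are being accumulated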

View File

@ -12,16 +12,24 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# We ignore warnings about stepping the scheduler since we step it ourselves during gradient accumulation
import warnings
from .state import AcceleratorState
warnings.filterwarnings("ignore", category=UserWarning, module="torch.optim.lr_scheduler")
class AcceleratedScheduler:
"""
A wrapper around a learning rate scheduler that will only step when the optimizer(s) have a training step. Useful
to avoid making a scheduler step too fast when:
to avoid making a scheduler step too fast when gradients went overflow and there was no training step (in mixed
precision training)
- gradients overflowed and there was no training step (in mixed precision training)
- a step was skipped because of gradient accumulation
When performing gradient accumulation, the scheduler lengths should not be changed accordingly; Accelerate will
always step the scheduler to account for it.
Args:
scheduler (`torch.optim.lr_scheduler._LRScheduler`):
@ -52,7 +60,6 @@ class AcceleratedScheduler:
for opt in self.optimizers:
if opt.step_was_skipped:
return
if self.split_batches:
# Split batches -> the training dataloader batch size is not changed so one step per training step
self.scheduler.step(*args, **kwargs)
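A hedged sketch of the stepping rule described in the docstring above: with split_batches the scheduler advances once per training step, otherwise once per process so the total number of scheduler steps matches a single-process run. This mirrors the documented behaviour and is not the library implementation itself.

def step_scheduler(scheduler, split_batches, num_processes):
    if split_batches:
        # dataloader batch size is unchanged, so one scheduler step per training step
        scheduler.step()
    else:
        # each process consumes its own shard, so step once per process
        for _ in range(num_processes):
            scheduler.step()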

View File

@ -18,9 +18,10 @@ from distutils.util import strtobool
import torch
from .utils import DistributedType, is_ccl_available, is_deepspeed_available, is_tpu_available
from .utils.dataclasses import SageMakerDistributedType
if is_tpu_available():
if is_tpu_available(check_device=False):
import torch_xla.core.xla_model as xm
@ -52,6 +53,7 @@ class AcceleratorState:
Attributes:
- **device** (`torch.device`) -- The device to use.
- **sync_gradients** (`bool`) -- Whether to sync the gradients or not
- **distributed_type** (`~accelerate.state.DistributedType`) -- The type of distributed environment currently
in use.
- **num_processes** (`int`) -- The number of processes currently launched in parallel.
@ -75,22 +77,46 @@ class AcceleratorState:
self.__dict__ = self._shared_state
if parse_flag_from_env("USE_CPU"):
cpu = True
self._check_initialized(mixed_precision, cpu)
self.fork_launched = parse_flag_from_env("FORK_LAUNCHED", 0)
if not getattr(self, "initialized", False):
self.backend = None
self.deepspeed_plugin = None
mixed_precision = mixed_precision.lower() if mixed_precision else None
mixed_precision = (
parse_choice_from_env("MIXED_PRECISION", "no") if mixed_precision is None else mixed_precision.lower()
)
if not _from_accelerator:
raise ValueError(
"Please make sure to properly initialize your accelerator via `accelerator = Accelerator()` "
"before using any functionality from the `accelerate` library."
)
if (
os.environ.get("USE_SAGEMAKER", "false") == "true"
and os.environ.get("SAGEMAKER_DISTRIBUTED_TYPE") != SageMakerDistributedType.NO
and not cpu
):
if os.environ.get("SAGEMAKER_DISTRIBUTED_TYPE") == SageMakerDistributedType.DATA_PARALLEL:
self.distributed_type = DistributedType.MULTI_GPU
import smdistributed.dataparallel.torch.torch_smddp # noqa
if not torch.distributed.is_initialized():
torch.distributed.init_process_group(backend="smddp")
self.backend = "smddp"
self.num_processes = torch.distributed.get_world_size()
self.process_index = torch.distributed.get_rank()
self.local_process_index = int(os.environ.get("LOCAL_RANK", -1))
self.device = torch.device("cuda", self.local_process_index)
torch.cuda.set_device(self.device)
self.mixed_precision = mixed_precision
elif is_tpu_available() and not cpu:
self.distributed_type = DistributedType.TPU
self.num_processes = xm.xrt_world_size()
self.process_index = xm.get_ordinal()
self.local_process_index = xm.get_local_ordinal()
self.device = xm.xla_device()
self.mixed_precision = "no"
if mixed_precision == "bf16":
os.environ["XLA_USE_BF16"] = str(1)
self.mixed_precision = mixed_precision
elif os.environ.get("USE_DEEPSPEED", "false") == "true" and not cpu:
assert (
is_deepspeed_available()
@ -105,13 +131,6 @@ class AcceleratorState:
self.device = torch.device("cuda", self.local_process_index)
torch.cuda.set_device(self.device)
self.mixed_precision = "no" # deepspeed handles mixed_precision using deepspeed_config
mixed_precision = (
parse_choice_from_env("MIXED_PRECISION", "no") if mixed_precision is None else mixed_precision
)
if mixed_precision == "fp16":
deepspeed_plugin.deepspeed_config.update({"fp16": {"enabled": True}})
elif mixed_precision == "bf16":
deepspeed_plugin.deepspeed_config.update({"bfloat16": {"enabled": True}})
self.deepspeed_plugin = deepspeed_plugin
elif int(os.environ.get("LOCAL_RANK", -1)) != -1 and not cpu:
self.distributed_type = DistributedType.MULTI_GPU
@ -123,15 +142,11 @@ class AcceleratorState:
self.local_process_index = int(os.environ.get("LOCAL_RANK", -1))
self.device = torch.device("cuda", self.local_process_index)
torch.cuda.set_device(self.device)
self.mixed_precision = (
parse_choice_from_env("MIXED_PRECISION", "no") if mixed_precision is None else mixed_precision
)
self.mixed_precision = mixed_precision
if os.environ.get("USE_FSDP", "false") == "true":
self.distributed_type = DistributedType.FSDP
if self.mixed_precision != "no":
raise ValueError(
"Mixed precision is currently not supported for FSDP. Please set `mixed_precision` to `no`."
)
fsdp_plugin.set_mixed_precision(self.mixed_precision)
self.fsdp_plugin = fsdp_plugin
elif get_int_from_env(["PMI_SIZE", "OMPI_COMM_WORLD_SIZE", "MV2_COMM_WORLD_SIZE", "WORLD_SIZE"], 1) > 1:
self.distributed_type = DistributedType.MULTI_CPU
@ -169,15 +184,13 @@ class AcceleratorState:
self.process_index = torch.distributed.get_rank()
self.local_process_index = local_rank
self.device = torch.device("cpu")
self.mixed_precision = "no"
self.mixed_precision = mixed_precision
else:
self.distributed_type = DistributedType.NO
self.num_processes = 1
self.process_index = self.local_process_index = 0
self.device = torch.device("cuda" if torch.cuda.is_available() and not cpu else "cpu")
self.mixed_precision = (
parse_choice_from_env("MIXED_PRECISION", "no") if mixed_precision is None else mixed_precision
)
self.mixed_precision = mixed_precision
self.initialized = True
def __repr__(self):
@ -189,13 +202,61 @@ class AcceleratorState:
f"Process index: {self.process_index}\n"
f"Local process index: {self.local_process_index}\n"
f"Device: {self.device}\n"
f"Mixed precision type: {mixed_precision}\n"
)
if self.distributed_type == DistributedType.DEEPSPEED:
repr += f"ds_config: {self.deepspeed_plugin.deepspeed_config}\n"
else:
f"Mixed precision type: {mixed_precision}\n"
return repr
# For backward compatibility
@property
def use_fp16(self):
return self.mixed_precision != "no"
@staticmethod
def _reset_state():
"Resets `_shared_state`, is used internally and should not be called"
AcceleratorState._shared_state = {}
def _check_initialized(self, mixed_precision=None, cpu=None):
"Checks if a modification is trying to be made and the `AcceleratorState` has already been initialized"
if getattr(self, "initialized", False):
err = "AcceleratorState has already been initialized and cannot be changed, restart your runtime completely and pass `{flag}` to `Accelerator()`."
if cpu and self.device.type != "cpu":
raise ValueError(err.format(flag="cpu=True"))
if mixed_precision is not None and mixed_precision != self.mixed_precision:
raise ValueError(err.format(flag=f"mixed_precision='{mixed_precision}'"))
class GradientState:
"""
This is a variation of a [singleton class](https://en.wikipedia.org/wiki/Singleton_pattern) in the sense that all
instances of `GradientState` share the same state, which is initialized on the first instantiation.
This specific state revolves around whether gradients should be synced and if we have reached the end of a prepared
dataloader.
Attributes:
- **sync_gradients** (`bool`) -- Whether the gradients should be synced
- **end_of_dataloader** (`bool`) -- Whether we have reached the end of the current dataloader
"""
_shared_state = {}
def __init__(self):
self.__dict__ = self._shared_state
if not getattr(self, "initialized", False):
self.sync_gradients = True
self.end_of_dataloader = False
self.initialized = True
def __repr__(self):
return f"Sync Gradients: {self.sync_gradients}\n" f"At end of current dataloader: {self.end_of_dataloader}\n"
def _set_sync_gradients(self, sync_gradients):
"Private function that sets whether gradients should be synchronized. Users should not have to call this."
self.sync_gradients = sync_gradients
def _set_end_of_dataloader(self, end_of_dataloader):
"Private function that sets whether the end of the current dataloader has been reached. Users should not have to call this."
self.end_of_dataloader = end_of_dataloader
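A quick illustration of the shared-state behaviour documented above, using the class as added in this diff: every GradientState() instance reads and writes the same underlying dictionary.

from accelerate.state import GradientState

state_a = GradientState()
state_b = GradientState()
state_a._set_sync_gradients(False)
assert state_b.sync_gradients is False  # both instances observe the same flag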

View File

@ -2,5 +2,17 @@
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
from .testing import are_the_same_tensors, execute_subprocess_async, require_cuda, require_multi_gpu, require_tpu, slow
from .testing import (
are_the_same_tensors,
execute_subprocess_async,
require_cpu,
require_cuda,
require_multi_gpu,
require_single_gpu,
require_tpu,
slow,
)
from .training import RegressionDataset, RegressionModel
from .scripts import test_script, test_sync # isort:skip

View File

@ -21,7 +21,14 @@ from accelerate import Accelerator
from accelerate.data_loader import prepare_data_loader
from accelerate.state import AcceleratorState
from accelerate.test_utils import RegressionDataset, RegressionModel, are_the_same_tensors
from accelerate.utils import DistributedType, gather, is_torch_version, set_seed, synchronize_rng_states
from accelerate.utils import (
DistributedType,
gather,
is_bf16_available,
is_torch_version,
set_seed,
synchronize_rng_states,
)
def init_state_check():
@ -245,71 +252,77 @@ def training_check():
accelerator.print("Training yielded the same results on one CPU or distributed setup with batch split.")
# Mostly a test that FP16 doesn't crash as the operation inside the model is not converted to FP16
print("FP16 training check.")
accelerator = Accelerator(mixed_precision="fp16")
train_dl = DataLoader(train_set, batch_size=batch_size, shuffle=True, generator=generator)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
if torch.cuda.is_available():
# Mostly a test that FP16 doesn't crash as the operation inside the model is not converted to FP16
print("FP16 training check.")
AcceleratorState._reset_state()
accelerator = Accelerator(mixed_precision="fp16")
train_dl = DataLoader(train_set, batch_size=batch_size, shuffle=True, generator=generator)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
train_dl, model, optimizer = accelerator.prepare(train_dl, model, optimizer)
set_seed(42)
generator.manual_seed(42)
for _ in range(3):
for batch in train_dl:
model.zero_grad()
output = model(batch["x"])
loss = torch.nn.functional.mse_loss(output, batch["y"])
accelerator.backward(loss)
optimizer.step()
train_dl, model, optimizer = accelerator.prepare(train_dl, model, optimizer)
set_seed(42)
generator.manual_seed(42)
for _ in range(3):
for batch in train_dl:
model.zero_grad()
output = model(batch["x"])
loss = torch.nn.functional.mse_loss(output, batch["y"])
accelerator.backward(loss)
optimizer.step()
model = accelerator.unwrap_model(model).cpu()
assert torch.allclose(old_model.a, model.a), "Did not obtain the same model on CPU or distributed training."
assert torch.allclose(old_model.b, model.b), "Did not obtain the same model on CPU or distributed training."
model = accelerator.unwrap_model(model).cpu()
assert torch.allclose(old_model.a, model.a), "Did not obtain the same model on CPU or distributed training."
assert torch.allclose(old_model.b, model.b), "Did not obtain the same model on CPU or distributed training."
# TEST that previous fp16 flag still works
print("Legacy FP16 training check.")
accelerator = Accelerator(fp16=True)
train_dl = DataLoader(train_set, batch_size=batch_size, shuffle=True, generator=generator)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# TEST that previous fp16 flag still works
print("Legacy FP16 training check.")
AcceleratorState._reset_state()
accelerator = Accelerator(fp16=True)
train_dl = DataLoader(train_set, batch_size=batch_size, shuffle=True, generator=generator)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
train_dl, model, optimizer = accelerator.prepare(train_dl, model, optimizer)
set_seed(42)
generator.manual_seed(42)
for _ in range(3):
for batch in train_dl:
model.zero_grad()
output = model(batch["x"])
loss = torch.nn.functional.mse_loss(output, batch["y"])
accelerator.backward(loss)
optimizer.step()
train_dl, model, optimizer = accelerator.prepare(train_dl, model, optimizer)
set_seed(42)
generator.manual_seed(42)
for _ in range(3):
for batch in train_dl:
model.zero_grad()
output = model(batch["x"])
loss = torch.nn.functional.mse_loss(output, batch["y"])
accelerator.backward(loss)
optimizer.step()
model = accelerator.unwrap_model(model).cpu()
assert torch.allclose(old_model.a, model.a), "Did not obtain the same model on CPU or distributed training."
assert torch.allclose(old_model.b, model.b), "Did not obtain the same model on CPU or distributed training."
model = accelerator.unwrap_model(model).cpu()
assert torch.allclose(old_model.a, model.a), "Did not obtain the same model on CPU or distributed training."
assert torch.allclose(old_model.b, model.b), "Did not obtain the same model on CPU or distributed training."
# Mostly a test that BF16 doesn't crash as the operation inside the model is not converted to BF16
print("BF16 training check.")
accelerator = Accelerator(mixed_precision="bf16")
train_dl = DataLoader(train_set, batch_size=batch_size, shuffle=True, generator=generator)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# BF16 support is only for CPU + TPU, and some GPU
if is_bf16_available():
# Mostly a test that BF16 doesn't crash as the operation inside the model is not converted to BF16
print("BF16 training check.")
AcceleratorState._reset_state()
accelerator = Accelerator(mixed_precision="bf16")
train_dl = DataLoader(train_set, batch_size=batch_size, shuffle=True, generator=generator)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
train_dl, model, optimizer = accelerator.prepare(train_dl, model, optimizer)
set_seed(42)
generator.manual_seed(42)
for _ in range(3):
for batch in train_dl:
model.zero_grad()
output = model(batch["x"])
loss = torch.nn.functional.mse_loss(output, batch["y"])
accelerator.backward(loss)
optimizer.step()
train_dl, model, optimizer = accelerator.prepare(train_dl, model, optimizer)
set_seed(42)
generator.manual_seed(42)
for _ in range(3):
for batch in train_dl:
model.zero_grad()
output = model(batch["x"])
loss = torch.nn.functional.mse_loss(output, batch["y"])
accelerator.backward(loss)
optimizer.step()
model = accelerator.unwrap_model(model).cpu()
assert torch.allclose(old_model.a, model.a), "Did not obtain the same model on CPU or distributed training."
assert torch.allclose(old_model.b, model.b), "Did not obtain the same model on CPU or distributed training."
model = accelerator.unwrap_model(model).cpu()
assert torch.allclose(old_model.a, model.a), "Did not obtain the same model on CPU or distributed training."
assert torch.allclose(old_model.b, model.b), "Did not obtain the same model on CPU or distributed training."
def main():
@ -326,7 +339,8 @@ def main():
if state.local_process_index == 0:
print("\n**DataLoader integration test**")
dl_preparation_check()
central_dl_preparation_check()
if state.distributed_type != DistributedType.TPU:
central_dl_preparation_check()
# Trainings are not exactly the same in DeepSpeed and CPU mode
if state.distributed_type == DistributedType.DEEPSPEED:
@ -337,5 +351,10 @@ def main():
training_check()
def _mp_fn(index):
# For xla_spawn (TPUs)
main()
if __name__ == "__main__":
main()

View File

@ -0,0 +1,274 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from copy import deepcopy
import torch
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader
from accelerate import Accelerator
from accelerate.test_utils import RegressionDataset, RegressionModel
from accelerate.utils import DistributedType, set_seed
def check_model_parameters(model_a, model_b, did_step, iteration):
for param, grad_param in zip(model_a.parameters(), model_b.parameters()):
if not param.requires_grad:
continue
if not did_step:
# Grads should not be in sync
assert (
torch.allclose(param.grad, grad_param.grad) is False
), f"Gradients in sync when they should not be at iteration {iteration}:\nmodel_a grad ({param.grad}) == model_b grad ({grad_param.grad})"
else:
# Grads should be in sync
assert (
torch.allclose(param.grad, grad_param.grad) is True
), f"Gradients not in sync when they should be at iteration {iteration}:\nmodel_a grad ({param.grad}) != model_b grad ({grad_param.grad})"
def step_model(model, input, target, accelerator, do_backward=True):
model.train()
output = model(input)
loss = F.mse_loss(output, target.to(output.device))
if not do_backward:
loss /= accelerator.gradient_accumulation_steps
loss.backward()
else:
accelerator.backward(loss)
def get_training_setup(accelerator, sched=False):
"Returns everything needed to perform basic training"
set_seed(42)
model = RegressionModel()
ddp_model = deepcopy(model)
dset = RegressionDataset(length=80)
dataloader = DataLoader(dset, batch_size=16)
model.to(accelerator.device)
if sched:
opt = AdamW(params=model.parameters(), lr=1e-3)
ddp_opt = AdamW(params=ddp_model.parameters(), lr=1e-3)
sched = LambdaLR(opt, lr_lambda=lambda epoch: epoch**0.65)
ddp_sched = LambdaLR(ddp_opt, lr_lambda=lambda epoch: epoch**0.65)
# Make a copy of `model`
if sched:
ddp_model, ddp_opt, ddp_sched, dataloader = accelerator.prepare(ddp_model, ddp_opt, ddp_sched, dataloader)
else:
ddp_model, dataloader = accelerator.prepare(ddp_model, dataloader)
if sched:
return (model, opt, sched, dataloader, ddp_model, ddp_opt, ddp_sched)
return model, ddp_model, dataloader
def test_noop_sync(accelerator):
# Test when on a single CPU or GPU that the context manager does nothing
model, ddp_model, dataloader = get_training_setup(accelerator)
# Use a single batch
ddp_input, ddp_target = next(iter(dataloader)).values()
for iteration in range(3):
# Gather the distributed inputs and targs for the base model
input, target = accelerator.gather((ddp_input, ddp_target))
input, target = input.to(accelerator.device), target.to(accelerator.device)
# Perform our initial ground truth step in non "DDP"
step_model(model, input, target, accelerator)
# Do "gradient accumulation" (noop)
if iteration % 2 == 0:
# Accumulate grads locally
with accelerator.no_sync(ddp_model):
step_model(ddp_model, ddp_input, ddp_target, accelerator)
else:
# Sync grads
step_model(ddp_model, ddp_input, ddp_target, accelerator)
# Since `no_sync` is a noop, `ddp_model` and `model` grads should always be in sync
check_model_parameters(model, ddp_model, True, iteration)
for param, ddp_param in zip(model.parameters(), ddp_model.parameters()):
if not param.requires_grad:
continue
assert torch.allclose(
param.grad, ddp_param.grad
), f"Gradients not in sync when they should be:\nModel grad ({param.grad}) != DDP grad ({ddp_param.grad})"
# Shuffle ddp_input on each iteration
torch.manual_seed(1337 + iteration)
ddp_input = ddp_input[torch.randperm(len(ddp_input))]
def test_distributed_sync(accelerator):
# Test on distributed setup that context manager behaves properly
model, ddp_model, dataloader = get_training_setup(accelerator)
# Use a single batch
ddp_input, ddp_target = next(iter(dataloader)).values()
for iteration in range(3):
# Gather the distributed inputs and targs for the base model
input, target = accelerator.gather((ddp_input, ddp_target))
input, target = input.to(accelerator.device), target.to(accelerator.device)
# Perform our initial ground truth step in non "DDP"
step_model(model, input, target, accelerator)
# Do "gradient accumulation" (noop)
if iteration % 2 == 0:
# Accumulate grads locally
with accelerator.no_sync(ddp_model):
step_model(ddp_model, ddp_input, ddp_target, accelerator)
else:
# Sync grads
step_model(ddp_model, ddp_input, ddp_target, accelerator)
# DDP model and model should only be in sync when not (iteration % 2 == 0)
for param, ddp_param in zip(model.parameters(), ddp_model.parameters()):
if not param.requires_grad:
continue
if iteration % 2 == 0:
# Grads should not be in sync
assert (
torch.allclose(param.grad, ddp_param.grad) is False
), f"Gradients in sync when they should not be:\nModel grad ({param.grad}) == DDP grad ({ddp_param.grad})"
else:
# Grads should be in sync
assert (
torch.allclose(param.grad, ddp_param.grad) is True
), f"Gradients not in sync when they should be:\nModel grad ({param.grad}) != DDP grad ({ddp_param.grad})"
# Shuffle ddp_input on each iteration
torch.manual_seed(1337 + iteration)
ddp_input = ddp_input[torch.randperm(len(ddp_input))]
def test_gradient_accumulation(split_batches=False, dispatch_batches=False):
accelerator = Accelerator(
gradient_accumulation_steps=2, split_batches=split_batches, dispatch_batches=dispatch_batches
)
# Test that context manager behaves properly
model, ddp_model, dataloader = get_training_setup(accelerator)
for iteration, batch in enumerate(dataloader):
ddp_input, ddp_target = batch.values()
# Gather the distributed inputs and targs for the base model
input, target = accelerator.gather((ddp_input, ddp_target))
input, target = input.to(accelerator.device), target.to(accelerator.device)
# Perform our initial ground truth step in non "DDP"
step_model(model, input, target, accelerator, False)
# Do "gradient accumulation" (noop)
with accelerator.accumulate(ddp_model):
step_model(ddp_model, ddp_input, ddp_target, accelerator)
# DDP model and model should only be in sync when not (iteration % 2 == 0)
for param, ddp_param in zip(model.parameters(), ddp_model.parameters()):
if not param.requires_grad:
continue
if ((iteration + 1) % 2 == 0) or (iteration == len(dataloader) - 1):
# Grads should be in sync
assert (
torch.allclose(param.grad, ddp_param.grad) is True
), f"Gradients not in sync when they should be at iteration {iteration}:\nModel grad ({param.grad}) != DDP grad ({ddp_param.grad})"
else:
# Grads should not be in sync
assert (
torch.allclose(param.grad, ddp_param.grad) is False
), f"Gradients in sync when they should not be at iteration {iteration}:\nModel grad ({param.grad}) == DDP grad ({ddp_param.grad})"
# Shuffle ddp_input on each iteration
torch.manual_seed(1337 + iteration)
ddp_input = ddp_input[torch.randperm(len(ddp_input))]
def test_gradient_accumulation_with_opt_and_scheduler(split_batches=False, dispatch_batches=False):
accelerator = Accelerator(
gradient_accumulation_steps=2, split_batches=split_batches, dispatch_batches=dispatch_batches
)
# Test that context manager behaves properly
model, opt, sched, dataloader, ddp_model, ddp_opt, ddp_sched = get_training_setup(accelerator, True)
for iteration, batch in enumerate(dataloader):
ddp_input, ddp_target = batch.values()
# Gather the distributed inputs and targs for the base model
input, target = accelerator.gather((ddp_input, ddp_target))
input, target = input.to(accelerator.device), target.to(accelerator.device)
# Perform our initial ground truth step in non "DDP"
model.train()
ddp_model.train()
step_model(model, input, target, accelerator, False)
opt.step()
if split_batches:
sched.step()
else:
for _ in range(accelerator.num_processes):
sched.step()
opt.zero_grad()
# Perform gradient accumulation under wrapper
with accelerator.accumulate(ddp_model):
step_model(ddp_model, ddp_input, ddp_target, accelerator)
ddp_opt.step()
ddp_sched.step()
ddp_opt.zero_grad()
# Learning rates should be the same
assert (
opt.param_groups[0]["lr"] == ddp_opt.param_groups[0]["lr"]
), f'Learning rates found in each optimizer did not align\nopt: {opt.param_groups[0]["lr"]}\nDDP opt: {ddp_opt.param_groups[0]["lr"]}\n'
did_step = (((iteration + 1) % 2) == 0) or ((iteration + 1) == len(dataloader))
if accelerator.num_processes > 1:
check_model_parameters(model, ddp_model, did_step, iteration)
# Shuffle ddp_input on each iteration
torch.manual_seed(1337 + iteration)
def main():
accelerator = Accelerator()
state = accelerator.state
if state.distributed_type == DistributedType.NO:
if state.local_process_index == 0:
print("**Test NOOP `no_sync` context manager**")
test_noop_sync(accelerator)
if state.distributed_type in (DistributedType.MULTI_GPU, DistributedType.MULTI_CPU):
if state.local_process_index == 0:
print("**Test Distributed `no_sync` context manager**")
test_distributed_sync(accelerator)
if state.distributed_type == DistributedType.MULTI_GPU:
for split_batch in [True, False]:
for dispatch_batches in [True, False]:
if state.local_process_index == 0:
print(
"**Test `accumulate` gradient accumulation, ",
f"`split_batches={split_batch}` and `dispatch_batches={dispatch_batches}`**",
)
test_gradient_accumulation(split_batch, dispatch_batches)
if state.local_process_index == 0:
print(
"**Test `accumulate` gradient accumulation with optimizer and scheduler, ",
"`split_batches=False`, `dispatch_batches=False`**",
)
test_gradient_accumulation_with_opt_and_scheduler()
if state.distributed_type == DistributedType.MULTI_GPU:
for split_batch in [True, False]:
for dispatch_batches in [True, False]:
if not split_batch and not dispatch_batches:
continue
if state.local_process_index == 0:
print(
"**Test `accumulate` gradient accumulation with optimizer and scheduler, ",
f"`split_batches={split_batch}` and `dispatch_batches={dispatch_batches}`**",
)
test_gradient_accumulation_with_opt_and_scheduler(split_batch, dispatch_batches)
def _mp_fn(index):
# For xla_spawn (TPUs)
main()
if __name__ == "__main__":
main()
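For context, the user-facing pattern these checks exercise looks roughly like the following single-process sketch, built on the RegressionModel/RegressionDataset helpers imported above; it is illustrative, not part of the test script.

import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from accelerate.test_utils import RegressionDataset, RegressionModel

accelerator = Accelerator(gradient_accumulation_steps=2)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataloader = DataLoader(RegressionDataset(length=32), batch_size=8)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    # gradients are only synchronized and applied every `gradient_accumulation_steps` batches
    with accelerator.accumulate(model):
        output = model(batch["x"])
        loss = torch.nn.functional.mse_loss(output, batch["y"])
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()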

View File

@ -15,6 +15,7 @@
import asyncio
import os
import shutil
import subprocess
import sys
import tempfile
import unittest
@ -26,7 +27,14 @@ from unittest import mock
import torch
from ..state import AcceleratorState
from ..utils import gather, is_comet_ml_available, is_tensorflow_available, is_tpu_available, is_wandb_available
from ..utils import (
gather,
is_comet_ml_available,
is_deepspeed_available,
is_tensorboard_available,
is_tpu_available,
is_wandb_available,
)
def parse_flag_from_env(key, default=False):
@ -56,6 +64,13 @@ def slow(test_case):
return unittest.skipUnless(_run_slow_tests, "test is slow")(test_case)
def require_cpu(test_case):
"""
Decorator marking a test that must be run only on the CPU. These tests are skipped when a GPU is available.
"""
return unittest.skipUnless(not torch.cuda.is_available(), "test requires only a CPU")(test_case)
def require_cuda(test_case):
"""
Decorator marking a test that requires CUDA. These tests are skipped when there are no GPUs available.
@ -70,6 +85,14 @@ def require_tpu(test_case):
return unittest.skipUnless(is_tpu_available(), "test requires TPU")(test_case)
def require_single_gpu(test_case):
"""
Decorator marking a test that requires CUDA on a single GPU. These tests are skipped when there are no GPUs
available or when more than one GPU is available.
"""
return unittest.skipUnless(torch.cuda.device_count() == 1, "test requires a GPU")(test_case)
def require_multi_gpu(test_case):
"""
Decorator marking a test that requires a multi-GPU setup. These tests are skipped on a machine without multiple
@ -78,12 +101,19 @@ def require_multi_gpu(test_case):
return unittest.skipUnless(torch.cuda.device_count() > 1, "test requires multiple GPUs")(test_case)
def require_tensorflow(test_case):
def require_deepspeed(test_case):
"""
Decorator marking a test that requires TensorFlow installed. These tests are skipped when TensorFlow isn't
Decorator marking a test that requires DeepSpeed installed. These tests are skipped when DeepSpeed isn't installed
"""
return unittest.skipUnless(is_deepspeed_available(), "test requires DeepSpeed")(test_case)
def require_tensorboard(test_case):
"""
Decorator marking a test that requires tensorboard installed. These tests are skipped when tensorboard isn't
installed
"""
return unittest.skipUnless(is_tensorflow_available(), "test requires TensorFlow")(test_case)
return unittest.skipUnless(is_tensorboard_available(), "test requires Tensorboard")(test_case)
def require_wandb(test_case):
@ -100,6 +130,22 @@ def require_comet_ml(test_case):
return unittest.skipUnless(is_comet_ml_available(), "test requires comet_ml")(test_case)
_atleast_one_tracker_available = (
any([is_wandb_available(), is_tensorboard_available()]) and not is_comet_ml_available()
)
def require_trackers(test_case):
"""
Decorator marking that a test requires at least one tracking library installed. These tests are skipped when none
are installed
"""
return unittest.skipUnless(
_atleast_one_tracker_available,
"test requires at least one tracker to be available and for `comet_ml` to not be installed",
)(test_case)
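A hedged usage sketch of the new decorators, with imports taken from the test_utils export list shown earlier in this diff; the test class and method names are illustrative.

import unittest

import torch

from accelerate.test_utils import require_cpu, require_single_gpu

class DeviceTests(unittest.TestCase):
    @require_cpu
    def test_cpu_only_path(self):
        # skipped automatically whenever a GPU is visible
        self.assertFalse(torch.cuda.is_available())

    @require_single_gpu
    def test_single_gpu_path(self):
        # skipped unless exactly one GPU is visible
        self.assertEqual(torch.cuda.device_count(), 1)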
class TempDirTestCase(unittest.TestCase):
"""
A TestCase class that keeps a single `tempfile.TemporaryDirectory` open for the duration of the class, wipes its
@ -250,3 +296,24 @@ def execute_subprocess_async(cmd, env=None, stdin=None, timeout=180, quiet=False
)
return result
class SubprocessCallException(Exception):
pass
def run_command(command: List[str], return_stdout=False):
"""
Runs `command` with `subprocess.check_output` and will potentially return the `stdout`. Will also properly capture
if an error occurred while running `command`
"""
try:
output = subprocess.check_output(command, stderr=subprocess.STDOUT)
if return_stdout:
if hasattr(output, "decode"):
output = output.decode("utf-8")
return output
except subprocess.CalledProcessError as e:
raise SubprocessCallException(
f"Command `{' '.join(command)}` failed with the following error:\n\n{e.output.decode()}"
) from e
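A short, hedged example of how the helper above can be used; the command itself is illustrative.

import sys

stdout = run_command([sys.executable, "-c", "print('accelerate')"], return_stdout=True)
assert "accelerate" in stdout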

View File

@ -20,14 +20,15 @@ from .dataclasses import (
)
from .imports import (
is_apex_available,
is_bf16_available,
is_boto3_available,
is_ccl_available,
is_comet_ml_available,
is_deepspeed_available,
is_sagemaker_available,
is_tensorboard_available,
is_tensorflow_available,
is_tpu_available,
is_transformers_available,
is_wandb_available,
)
from .modeling import (
@ -48,6 +49,7 @@ from .offload import (
OffloadedWeightsLoader,
PrefixedDataset,
extract_submodules_state_dict,
load_offloaded_weight,
offload_state_dict,
offload_weight,
save_offload_index,
@ -77,9 +79,16 @@ from .versions import compare_versions, is_torch_version
if is_deepspeed_available():
from .deepspeed import DeepSpeedEngineWrapper, DeepSpeedOptimizerWrapper
from .deepspeed import (
DeepSpeedEngineWrapper,
DeepSpeedOptimizerWrapper,
DeepSpeedSchedulerWrapper,
DummyOptim,
DummyScheduler,
HfDeepSpeedConfig,
)
from .launch import PrepareForLaunch
from .launch import PrepareForLaunch, get_launch_prefix
from .memory import find_executable_batch_size
from .other import (
extract_model_from_parallel,

View File

@ -20,5 +20,13 @@ MODEL_NAME = "pytorch_model"
RNG_STATE_NAME = "random_states"
OPTIMIZER_NAME = "optimizer"
SCHEDULER_NAME = "scheduler"
SAGEMAKER_PYTORCH_VERSION = "1.10.2"
SAGEMAKER_PYTHON_VERSION = "py38"
SAGEMAKER_TRANSFORMERS_VERSION = "4.17.0"
SAGEMAKER_PARALLEL_EC2_INSTANCES = ["ml.p3.16xlarge", "ml.p3dn.24xlarge", "ml.p4dn.24xlarge"]
FSDP_SHARDING_STRATEGY = ["FULL_SHARD", "SHARD_GRAD_OP", "NO_SHARD"]
FSDP_AUTO_WRAP_POLICY = ["TRANSFORMER_BASED_WRAP", "SIZE_BASED_WRAP", "NO_WRAP"]
FSDP_BACKWARD_PREFETCH = ["BACKWARD_PRE", "BACKWARD_POST", "NO_PREFETCH"]
DEEPSPEED_MULTINODE_LAUNCHERS = ["pdsh", "standard", "openmpi", "mvapich"]
STR_OPERATION_TO_FUNC = {">": op.gt, ">=": op.ge, "==": op.eq, "!=": op.ne, "<=": op.le, "<": op.lt}
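For illustration, this is roughly how the FSDP_BACKWARD_PREFETCH list above gets mapped onto torch's 1-indexed BackwardPrefetch enum further down in this diff; a sketch, assuming a recent PyTorch with FSDP available.

import os

from torch.distributed.fsdp.fully_sharded_data_parallel import BackwardPrefetch

FSDP_BACKWARD_PREFETCH = ["BACKWARD_PRE", "BACKWARD_POST", "NO_PREFETCH"]

choice = os.environ.get("FSDP_BACKWARD_PREFETCH", "NO_PREFETCH")
backward_prefetch = None
if choice != "NO_PREFETCH":
    # BackwardPrefetch members are 1-indexed, so the list position maps directly
    backward_prefetch = BackwardPrefetch(FSDP_BACKWARD_PREFETCH.index(choice) + 1)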

View File

@ -21,12 +21,15 @@ import enum
import functools
import os
import typing
import warnings
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Callable, Iterable, Optional
from typing import Any, Callable, Iterable, Optional
import torch
from .constants import FSDP_AUTO_WRAP_POLICY, FSDP_BACKWARD_PREFETCH
class KwargsHandler:
"""
@ -178,6 +181,16 @@ class BaseEnum(enum.Enum, metaclass=EnumWithContains):
class LoggerType(BaseEnum):
"""Represents a type of supported experiment tracker
Values:
- **ALL** -- all available trackers in the environment that are supported
- **TENSORBOARD** -- TensorBoard as an experiment tracker
- **WANDB** -- wandb as an experiment tracker
- **COMETML** -- comet_ml as an experiment tracker
"""
ALL = "all"
TENSORBOARD = "tensorboard"
WANDB = "wandb"
@ -185,6 +198,15 @@ class LoggerType(BaseEnum):
class PrecisionType(BaseEnum):
"""Represents a type of precision used on floating point values
Values:
- **NO** -- using full precision (FP32)
- **FP16** -- using half precision
- **BF16** -- using brain floating point precision
"""
NO = "no"
FP16 = "fp16"
BF16 = "bf16"
@ -208,10 +230,20 @@ class TensorInformation:
@dataclass
class DeepSpeedPlugin:
"""
This plugin is used to integrate DeepSpeed.
"""
hf_ds_config: Any = field(
default=None,
metadata={
"help": "path to DeepSpeed config file or dict or an object of class `accelerate.utils.deepspeed.HfDeepSpeedConfig`."
},
)
gradient_accumulation_steps: int = field(
default=None, metadata={"help": "Number of steps to accumulate gradients before updating optimizer states"}
)
gradient_clipping: float = field(default=None, metadata={"help": "Enable gradient clipping with value"})
zero_stage: int = field(
default=None,
metadata={"help": "Possible options are 0,1,2,3; Default will be taken from environment variable"},
@ -220,37 +252,164 @@ class DeepSpeedPlugin:
default=True,
metadata={"help": "If both train & eval dataloaders are specified, this will decide the train_batch_size"},
)
auto_opt_mapping: bool = field(
default=True,
metadata={"help": "whether to map torch.adam to deepspeed optimizer version of adam based on config"},
offload_optimizer_device: bool = field(
default=None,
metadata={"help": "Possible options are none|cpu|nvme. Only applicable with ZeRO Stages 2 and 3."},
)
offload_param_device: bool = field(
default=None,
metadata={"help": "Possible options are none|cpu|nvme. Only applicable with ZeRO Stage 3."},
)
zero3_init_flag: bool = field(
default=None,
metadata={
"help": "Flag to indicate whether to enable `deepspeed.zero.Init` for constructing massive models."
"Only applicable with ZeRO Stage-3."
},
)
zero3_save_16bit_model: bool = field(
default=None,
metadata={"help": "Flag to indicate whether to save 16-bit model. Only applicable with ZeRO Stage-3."},
)
offload_optimizer_device: bool = field(default=None, metadata={"help": "Possible options are none|cpu|nvme"})
def __post_init__(self):
from .deepspeed import HfDeepSpeedConfig
if self.gradient_accumulation_steps is None:
self.gradient_accumulation_steps = int(os.environ.get("GRADIENT_ACCUMULATION_STEPS", 1))
if self.hf_ds_config is None:
self.hf_ds_config = os.environ.get("DEEPSPEED_CONFIG_FILE", "none")
if (
isinstance(self.hf_ds_config, dict)
or (isinstance(self.hf_ds_config, str) and self.hf_ds_config != "none")
or isinstance(self.hf_ds_config, HfDeepSpeedConfig)
):
if not isinstance(self.hf_ds_config, HfDeepSpeedConfig):
self.hf_ds_config = HfDeepSpeedConfig(self.hf_ds_config)
if "gradient_accumulation_steps" not in self.hf_ds_config.config:
self.hf_ds_config.config["gradient_accumulation_steps"] = 1
elif self.hf_ds_config.config["gradient_accumulation_steps"] == "auto":
raise ValueError("gradient_accumulation_steps cannot be set to 'auto' in the DeepSpeed config.")
if "zero_optimization" not in self.hf_ds_config.config:
raise ValueError("Please specify the ZeRO optimization config in the DeepSpeed config.")
else:
if self.gradient_accumulation_steps is None:
self.gradient_accumulation_steps = int(os.environ.get("GRADIENT_ACCUMULATION_STEPS", 1))
if self.zero_stage is None:
self.zero_stage = int(os.environ.get("DEEPSPEED_ZERO_STAGE", 2))
if self.gradient_clipping is None:
gradient_clipping = os.environ.get("GRADIENT_CLIPPING", "none")
if gradient_clipping != "none":
self.gradient_clipping = float(gradient_clipping)
if self.offload_optimizer_device is None:
self.offload_optimizer_device = os.environ.get("DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE", "none")
if self.zero_stage is None:
self.zero_stage = int(os.environ.get("DEEPSPEED_ZERO_STAGE", 2))
self.deepspeed_config = {
"train_batch_size": None,
"gradient_accumulation_steps": self.gradient_accumulation_steps,
"zero_optimization": {
"stage": self.zero_stage,
"offload_optimizer": {
"device": self.offload_optimizer_device,
if self.offload_optimizer_device is None:
self.offload_optimizer_device = os.environ.get("DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE", "none")
if self.offload_param_device is None:
self.offload_param_device = os.environ.get("DEEPSPEED_OFFLOAD_PARAM_DEVICE", "none")
if self.zero3_save_16bit_model is None:
self.zero3_save_16bit_model = os.environ.get("DEEPSPEED_ZERO3_SAVE_16BIT_MODEL", "false") == "true"
config = {
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": self.gradient_accumulation_steps,
"zero_optimization": {
"stage": self.zero_stage,
"offload_optimizer": {
"device": self.offload_optimizer_device,
},
"offload_param": {
"device": self.offload_param_device,
},
"stage3_gather_16bit_weights_on_model_save": self.zero3_save_16bit_model,
},
},
"steps_per_print": float("inf"), # this will stop deepspeed from logging @ stdout
"zero_allow_untested_optimizer": True,
}
}
if self.gradient_clipping:
config["gradient_clipping"] = self.gradient_clipping
self.hf_ds_config = HfDeepSpeedConfig(config)
self.deepspeed_config = self.hf_ds_config.config
self.deepspeed_config["steps_per_print"] = float("inf") # this will stop deepspeed from logging @ stdout
if self.zero3_init_flag is None:
self.zero3_init_flag = os.environ.get("DEEPSPEED_ZERO3_INIT", "false") == "true"
if self.zero3_init_flag and not self.hf_ds_config.is_zero3():
warnings.warn("DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.")
self.zero3_init_flag = False
def fill_match(self, ds_key_long, mismatches, must_match=True, **kwargs):
config, ds_key = self.hf_ds_config.find_config_node(ds_key_long)
if config is None:
return
if config.get(ds_key) == "auto":
if ds_key_long in kwargs:
config[ds_key] = kwargs[ds_key_long]
return
else:
raise ValueError(
f"`{ds_key_long}` not found in kwargs. "
f"Please specify `{ds_key_long}` without `auto`(set to correct value) in the DeepSpeed config file or "
"pass it in kwargs."
)
if not must_match:
return
ds_val = config.get(ds_key)
if ds_val is not None and ds_key_long in kwargs:
if ds_val != kwargs[ds_key_long]:
mismatches.append(f"- ds {ds_key_long}={ds_val} vs arg {ds_key_long}={kwargs[ds_key_long]}")
def deepspeed_config_process(self, prefix="", mismatches=None, config=None, must_match=True, **kwargs):
"""Process the DeepSpeed config with the values from the kwargs."""
mismatches = [] if mismatches is None else mismatches
if config is None:
config = self.deepspeed_config
for key, value in config.items():
if isinstance(value, dict):
self.deepspeed_config_process(
prefix=prefix + key + ".", mismatches=mismatches, config=value, must_match=must_match, **kwargs
)
else:
self.fill_match(prefix + key, mismatches, must_match=must_match, **kwargs)
if len(mismatches) > 0 and prefix == "":
mismatches_msg = "\n".join(mismatches)
raise ValueError(
"Please correct the following DeepSpeed config values that mismatch kwargs "
f" values:\n{mismatches_msg}\nThe easiest method is to set these DeepSpeed config values to 'auto'."
)
def set_mixed_precision(self, mixed_precision):
ds_config = self.deepspeed_config
if mixed_precision == "fp16" and "fp16" not in ds_config and "bf16" not in ds_config:
ds_config.update({"fp16": {"enabled": True}})
elif mixed_precision == "bf16" and "fp16" not in ds_config and "bf16" not in ds_config:
ds_config.update({"bf16": {"enabled": True}})
def set_deepspeed_weakref(self):
from .imports import is_transformers_available
if self.zero3_init_flag:
if not is_transformers_available():
raise Exception(
"When `zero3_init_flag` is set, it requires Transformers to be installed. "
"Please run `pip install transformers`."
)
ds_config = copy.deepcopy(self.deepspeed_config)
if "gradient_accumulation_steps" not in ds_config or ds_config["gradient_accumulation_steps"] == "auto":
ds_config["gradient_accumulation_steps"] = 1
if (
"train_micro_batch_size_per_gpu" not in ds_config
or ds_config["train_micro_batch_size_per_gpu"] == "auto"
):
ds_config["train_micro_batch_size_per_gpu"] = 1
if ds_config["train_batch_size"] == "auto":
del ds_config["train_batch_size"]
from transformers.deepspeed import HfDeepSpeedConfig
self.dschf = HfDeepSpeedConfig(ds_config) # keep this object alive # noqa
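A hedged end-to-end sketch of the `auto` handling implemented above; the config dict is minimal and illustrative, a real DeepSpeed config would carry more fields.

from accelerate.utils.dataclasses import DeepSpeedPlugin

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": 2,
    "zero_optimization": {"stage": 2},
}
plugin = DeepSpeedPlugin(hf_ds_config=ds_config)
# kwargs fill in any values that were left as "auto" and flag mismatches otherwise
plugin.deepspeed_config_process(train_micro_batch_size_per_gpu=16)
assert plugin.deepspeed_config["train_micro_batch_size_per_gpu"] == 16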
@dataclass
@ -261,22 +420,35 @@ class FullyShardedDataParallelPlugin:
sharding_strategy: "typing.Any" = field(
default=None,
metadata={"help": "Possible options are [1] FULL_SHARD, [2] SHARD_GRAD_OP"},
metadata={
"help": "FSDP Sharding Strategy of type `torch.distributed.fsdp.fully_sharded_data_parallel.ShardingStrategy`"
},
)
backward_prefetch: "typing.Any" = field(
default=None,
metadata={"help": "Possible options are [1] BACKWARD_PRE, [2] BACKWARD_POST"},
metadata={
"help": "FSDP Backward Prefetch of type `torch.distributed.fsdp.fully_sharded_data_parallel.BackwardPrefetch`"
},
)
auto_wrap_policy: "typing.Any" = field(
mixed_precision_policy: "typing.Any" = field(
default=None,
metadata={
"help": "A config to enable mixed precision training with FullyShardedDataParallel. "
"The 3 flags that are set are `param_dtype`, `reduce_dtype`, `buffer_dtype`. "
"Each flag expects `torch.dtype` as the value. "
"It is of type `torch.distributed.fsdp.fully_sharded_data_parallel.MixedPrecision`."
},
)
auto_wrap_policy: Optional[Callable] = field(
default=None,
metadata={"help": "A callable specifying a policy to recursively wrap layers with FSDP"},
)
cpu_offload: Optional[Callable] = field(
cpu_offload: "typing.Any" = field(
default=None,
metadata={"help": "Decides Whether to offload parameters and gradients to CPU."},
)
min_num_params: int = field(
default=None, metadata={"help": "FSDP's minimum number of parameters for Default Auto Wrapping."}
metadata={
"help": "Decides whether to offload parameters and gradients to CPU. "
"It is of type `torch.distributed.fsdp.fully_sharded_data_parallel.CPUOffload`."
},
)
ignored_modules: Optional[Iterable[torch.nn.Module]] = field(
default=None,
@ -284,8 +456,7 @@ class FullyShardedDataParallelPlugin:
)
def __post_init__(self):
from torch.distributed.fsdp.fully_sharded_data_parallel import CPUOffload, ShardingStrategy
from torch.distributed.fsdp.wrap import default_auto_wrap_policy
from torch.distributed.fsdp.fully_sharded_data_parallel import BackwardPrefetch, CPUOffload, ShardingStrategy
if self.sharding_strategy is None:
self.sharding_strategy = ShardingStrategy(int(os.environ.get("FSDP_SHARDING_STRATEGY", 1)))
@ -296,9 +467,63 @@ class FullyShardedDataParallelPlugin:
else:
self.cpu_offload = CPUOffload(offload_params=False)
if self.min_num_params is None:
self.min_num_params = int(os.environ.get("FSDP_MIN_NUM_PARAMS", 0))
if self.backward_prefetch is None:
prefetch_policy = os.environ.get("FSDP_BACKWARD_PREFETCH", FSDP_BACKWARD_PREFETCH[-1])
if prefetch_policy != FSDP_BACKWARD_PREFETCH[-1]:
self.backward_prefetch = BackwardPrefetch(FSDP_BACKWARD_PREFETCH.index(prefetch_policy) + 1)
@staticmethod
def get_module_class_from_name(module, name):
"""
Gets a class from a module by its name.
Args:
module (`torch.nn.Module`): The module to get the class from.
name (`str`): The name of the class.
"""
modules_children = list(module.children())
if module.__class__.__name__ == name:
return module.__class__
elif len(modules_children) == 0:
return
else:
for child_module in modules_children:
module_class = FullyShardedDataParallelPlugin.get_module_class_from_name(child_module, name)
if module_class is not None:
return module_class
def set_auto_wrap_policy(self, model):
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy, transformer_auto_wrap_policy
if self.auto_wrap_policy is None:
if self.min_num_params > 0:
self.auto_wrap_policy = functools.partial(default_auto_wrap_policy, min_num_params=self.min_num_params)
auto_wrap_policy = os.environ.get("FSDP_AUTO_WRAP_POLICY", FSDP_AUTO_WRAP_POLICY[-1])
if auto_wrap_policy == FSDP_AUTO_WRAP_POLICY[0]:
transformer_cls_to_wrap = os.environ.get("FSDP_TRANSFORMER_CLS_TO_WRAP", "")
transformer_cls_to_wrap = FullyShardedDataParallelPlugin.get_module_class_from_name(
model, transformer_cls_to_wrap
)
if transformer_cls_to_wrap is None:
raise Exception("Could not find the transformer layer class to wrap in the model.")
self.auto_wrap_policy = functools.partial(
transformer_auto_wrap_policy,
# Transformer layer class to wrap
transformer_layer_cls={transformer_cls_to_wrap},
)
elif auto_wrap_policy == FSDP_AUTO_WRAP_POLICY[1]:
min_num_params = int(os.environ.get("FSDP_MIN_NUM_PARAMS", 0))
if min_num_params > 0:
self.auto_wrap_policy = functools.partial(
size_based_auto_wrap_policy, min_num_params=min_num_params
)
def set_mixed_precision(self, mixed_precision):
if mixed_precision == "fp16":
dtype = torch.float16
elif mixed_precision == "bf16":
dtype = torch.bfloat16
else:
raise ValueError(f"Unknown mixed precision value: {mixed_precision}")
from torch.distributed.fsdp.fully_sharded_data_parallel import MixedPrecision
if self.mixed_precision_policy is None:
self.mixed_precision_policy = MixedPrecision(param_dtype=dtype, reduce_dtype=dtype, buffer_dtype=dtype)
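The static helper added above can be exercised on its own; a small sketch (importing the plugin class does not trigger its __post_init__, so no FSDP environment is needed just to call the helper).

import torch.nn as nn

from accelerate.utils.dataclasses import FullyShardedDataParallelPlugin

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4))
layer_cls = FullyShardedDataParallelPlugin.get_module_class_from_name(model, "Linear")
assert layer_cls is nn.Linear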

View File

@ -12,58 +12,157 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import io
import json
from copy import deepcopy
from ..optimizer import AcceleratedOptimizer
from .imports import is_apex_available, is_deepspeed_available
from ..scheduler import AcceleratedScheduler
if is_deepspeed_available():
from deepspeed import DeepSpeedEngine
if is_apex_available():
from apex import amp
class DeepSpeedEngineWrapper(DeepSpeedEngine):
class HfDeepSpeedConfig:
"""
Wrapper over deepspeed.DeepSpeedEngine object
This object contains a DeepSpeed configuration dictionary and can be quickly queried for things like zero stage.
A `weakref` of this object is stored in the module's globals to be able to access the config from areas where
things like the Trainer object is not available (e.g. `from_pretrained` and `_get_resized_embeddings`). Therefore
it's important that this object remains alive while the program is still running.
[`Trainer`] uses the `HfTrainerDeepSpeedConfig` subclass instead. That subclass has logic to sync the configuration
with values of [`TrainingArguments`] by replacing special placeholder values: `"auto"`. Without this special logic
the DeepSpeed configuration is not modified in any way.
Args:
config_file_or_dict (`Union[str, Dict]`): path to DeepSpeed config file or dict.
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
def __init__(self, config_file_or_dict):
# overwriting micro_steps for user's gradient_accumulation
self.micro_steps = -1
if isinstance(config_file_or_dict, dict):
# Don't modify user's data should they want to reuse it (e.g. in tests), because once we
# modified it, it will not be accepted here again, since `auto` values would have been overridden
config = deepcopy(config_file_or_dict)
elif isinstance(config_file_or_dict, str):
with io.open(config_file_or_dict, "r", encoding="utf-8") as f:
config = json.load(f)
else:
raise ValueError("expecting either a path to a DeepSpeed config file or a pre-populated dict")
self.config = config
def step(self, lr_kwargs=None):
"""DeepSpeedEngine.step() without `micro_steps` update & no profiling"""
if self.is_gradient_accumulation_boundary(): # it shouldn't matter whether we keep this line or not
if self.progressive_layer_drop:
self.progressive_layer_drop.update_state(self.global_steps)
# zero stage - this is done as early as possible, before model is created, to allow
# ``is_deepspeed_zero3_enabled`` query and getting to the early deepspeed config object
# during ``zero.Init()`` which needs to know the dtype, and some other hparams.
self._stage = self.get_value("zero_optimization.stage", -1)
self._take_model_step(lr_kwargs)
# offload
self._offload = False
if self.is_zero2() or self.is_zero3():
offload_devices_valid = set(["cpu", "nvme"])
offload_devices = set(
[
self.get_value("zero_optimization.offload_optimizer.device"),
self.get_value("zero_optimization.offload_param.device"),
]
)
if len(offload_devices & offload_devices_valid) > 0:
self._offload = True
def find_config_node(self, ds_key_long):
config = self.config
# find the config node of interest if it exists
nodes = ds_key_long.split(".")
ds_key = nodes.pop()
for node in nodes:
config = config.get(node)
if config is None:
return None, ds_key
return config, ds_key
def get_value(self, ds_key_long, default=None):
"""
Returns the set value or `default` if no value is set
"""
config, ds_key = self.find_config_node(ds_key_long)
if config is None:
return default
return config.get(ds_key, default)
def del_config_sub_tree(self, ds_key_long, must_exist=False):
"""
Deletes a sub-section of the config file if it's found.
Unless `must_exist` is `True` the section doesn't have to exist.
"""
config = self.config
# find the config node of interest if it exists
nodes = ds_key_long.split(".")
for node in nodes:
parent_config = config
config = config.get(node)
if config is None:
if must_exist:
raise ValueError(f"Can't find {ds_key_long} entry in the config: {self.config}")
else:
return
# if found remove it
if parent_config is not None:
parent_config.pop(node)
def is_true(self, ds_key_long):
"""
Returns `True`/`False` only if the value is set, always `False` otherwise. So use this method to ask the very
specific question of whether the value is set to `True` (and it's not set to `False` or isn't set).
"""
value = self.get_value(ds_key_long)
return False if value is None else bool(value)
def is_false(self, ds_key_long):
"""
Returns `True`/`False` only if the value is set, always `False` otherwise. So use this method to ask the very
specific question of whether the value is set to `False` (and it's not set to `True` or isn't set).
"""
value = self.get_value(ds_key_long)
return False if value is None else not bool(value)
def is_zero2(self):
return self._stage == 2
def is_zero3(self):
return self._stage == 3
def is_offload(self):
return self._offload
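A small usage sketch of the query helpers defined above; the config dict is illustrative.

ds_config = {
    "gradient_accumulation_steps": 2,
    "zero_optimization": {"stage": 3, "offload_param": {"device": "cpu"}},
}
hf_config = HfDeepSpeedConfig(ds_config)
assert hf_config.get_value("zero_optimization.stage") == 3
assert hf_config.is_zero3()
assert hf_config.is_offload()  # offload_param.device is "cpu"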
class DeepSpeedEngineWrapper:
"""
Internal wrapper for deepspeed.runtime.engine.DeepSpeedEngine. This is used to follow conventional training loop.
Args:
engine (deepspeed.runtime.engine.DeepSpeedEngine): deepspeed engine to wrap
"""
def __init__(self, engine):
self.engine = engine
def backward(self, loss):
"""DeepSpeedEngine.backward() with with no loss scaling; no profiling but with `micro_steps` update"""
# runs backpropagation and handles mixed precision
self.engine.backward(loss)
if self.zero_optimization():
self.optimizer.is_gradient_accumulation_boundary = self.is_gradient_accumulation_boundary()
self.optimizer.backward(loss)
elif self.amp_enabled():
# AMP requires delaying unscale when inside gradient accumulation boundaries
# https://nvidia.github.io/apex/advanced.html#gradient-accumulation-across-iterations
delay_unscale = not self.is_gradient_accumulation_boundary()
with amp.scale_loss(loss, self.optimizer, delay_unscale=delay_unscale) as scaled_loss:
scaled_loss.backward()
elif self.fp16_enabled():
self.optimizer.backward(loss)
else:
loss.backward()
if self.enable_backward_allreduce:
self.allreduce_gradients()
# this will ensure deepspeed gradient_accumulation matches user's accumulation
self.micro_steps += 1
# deepspeed `engine.step` performs following operations:
# gradient accumulation check
# gradient clipping
# optimizer step
# zero grad
# checking overflow
# lr_scheduler step
self.engine.step()
class DeepSpeedOptimizerWrapper(AcceleratedOptimizer):
@ -75,22 +174,79 @@ class DeepSpeedOptimizerWrapper(AcceleratedOptimizer):
The optimizer to wrap.
"""
def __init__(self, optimizer, model: DeepSpeedEngineWrapper):
def __init__(self, optimizer):
super().__init__(optimizer, device_placement=False, scaler=None)
self.model = model
def zero_grad(self, set_to_none=None):
pass # `model.step()` is doing that automatically. Therefore, its implementation is not needed
pass # `accelerator.backward(loss)` is doing that automatically. Therefore, its implementation is not needed
def step(self):
"""This will handle optimizer.step() & optimizer.zero_grad() with gradient_accumulation"""
self.model.step()
pass # `accelerator.backward(loss)` is doing that automatically. Therefore, its implementation is not needed
@property
def is_overflow(self):
def step_was_skipped(self):
"""Whether or not the optimizer step was done, or skipped because of gradient overflow."""
overflow = False
if hasattr(self.optimizer, "overflow"):
overflow = self.optimizer.overflow
return overflow
return self.optimizer.overflow
class DeepSpeedSchedulerWrapper(AcceleratedScheduler):
"""
Internal wrapper around a deepspeed scheduler.
Args:
scheduler (`torch.optim.lr_scheduler.LambdaLR`):
The scheduler to wrap.
optimizers (one or a list of `torch.optim.Optimizer`):
"""
def __init__(self, scheduler, optimizers):
super().__init__(scheduler, optimizers)
def step(self):
pass # `accelerator.backward(loss)` is doing that automatically. Therefore, its implementation is not needed
class DummyOptim:
"""
Dummy optimizer that presents model parameters or param groups; it is primarily used to follow the conventional
training loop when the optimizer config is specified in the DeepSpeed config file.
Args:
lr (float):
Learning rate.
params (iterable): iterable of parameters to optimize or dicts defining
parameter groups
weight_decay (float):
Weight decay.
**kwargs:
Other arguments.
"""
def __init__(self, params, lr=0.001, weight_decay=0, **kwargs):
self.params = params
self.lr = lr
self.weight_decay = weight_decay
self.kwargs = kwargs
class DummyScheduler:
"""
Dummy scheduler; it is primarily used to follow the conventional training loop when the scheduler config is
specified in the DeepSpeed config file.
Args:
optimizer (`torch.optim.optimizer.Optimizer`):
The optimizer to wrap.
total_num_steps (int):
Total number of steps.
warmup_num_steps (int):
Number of steps for warmup.
**kwargs:
Other arguments.
"""
def __init__(self, optimizer, total_num_steps=None, warmup_num_steps=0, **kwargs):
self.optimizer = optimizer
self.total_num_steps = total_num_steps
self.warmup_num_steps = warmup_num_steps
self.kwargs = kwargs
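A hedged usage sketch for the two dummies (DeepSpeed must be installed and the script launched via `accelerate launch`; the hyperparameters are placeholders): when the DeepSpeed config file defines both an `optimizer` and a `scheduler` block, the code-side objects are just these placeholders and `prepare` swaps in the real DeepSpeed ones, as the tests further down exercise.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.utils.deepspeed import DummyOptim, DummyScheduler

accelerator = Accelerator()  # assumes a DeepSpeed config with "optimizer" and "scheduler" sections
model = torch.nn.Linear(4, 2)
dataloader = DataLoader(TensorDataset(torch.randn(64, 4), torch.randn(64, 2)), batch_size=8)
optimizer = DummyOptim(params=model.parameters(), lr=5e-5, weight_decay=0.0)
lr_scheduler = DummyScheduler(optimizer, total_num_steps=1000, warmup_num_steps=100)
model, optimizer, dataloader, lr_scheduler = accelerator.prepare(model, optimizer, dataloader, lr_scheduler)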

View File

@ -15,6 +15,10 @@
import importlib
import sys
import torch
from .versions import is_torch_version
# The package importlib_metadata is in a different place, depending on the Python version.
if sys.version_info < (3, 8):
@ -47,7 +51,15 @@ def is_apex_available():
return importlib.util.find_spec("apex") is not None
def is_tpu_available():
def is_tpu_available(check_device=True):
"Checks if `torch_xla` is installed and potentially if a TPU is in the environment"
if _tpu_available and check_device:
try:
# Will raise a RuntimeError if no XLA configuration is found
_ = xm.xla_device()
return True
except RuntimeError:
return False
return _tpu_available
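The new `check_device` flag separates "`torch_xla` is importable" from "an XLA device is actually reachable". The modules below use it exactly like this minimal sketch:
from accelerate.utils.imports import is_tpu_available

# import torch_xla only when it is installed, without requiring a live TPU at import time
if is_tpu_available(check_device=False):
    import torch_xla.core.xla_model as xm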
@ -63,8 +75,19 @@ def is_deepspeed_available():
return False
def is_tensorflow_available():
return importlib.util.find_spec("tensorflow") is not None
def is_bf16_available(ignore_tpu=False):
"Checks if bf16 is supported, optionally ignoring the TPU"
if is_tpu_available():
return not ignore_tpu
if is_torch_version(">=", "1.10"):
if torch.cuda.is_available():
return torch.cuda.is_bf16_supported()
return True
return False
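A small sketch of how the helper could be used to pick a mixed-precision mode; the fp16 fallback is an assumption for illustration, not something this diff prescribes:
from accelerate import Accelerator
from accelerate.utils.imports import is_bf16_available

mixed_precision = "bf16" if is_bf16_available() else "fp16"
accelerator = Accelerator(mixed_precision=mixed_precision)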
def is_transformers_available():
return importlib.util.find_spec("transformers") is not None
def is_tensorboard_available():

View File

@ -13,12 +13,28 @@
# limitations under the License.
import os
import sys
import torch
from ..utils import is_torch_version
from .dataclasses import DistributedType
def get_launch_prefix():
"""
Grabs the correct launcher for starting a distributed command, e.g. `torchrun` or `python -m
torch.distributed.run`, depending on the installed torch version.
"""
if is_torch_version(">=", "1.10.0"):
cmd = ["torchrun"]
elif is_torch_version(">=", "1.9.0"):
cmd = [sys.executable, "-m", "torch.distributed.run"]
else:
cmd = [sys.executable, "-m", "torch.distributed.launch", "--use_env"]
return cmd
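For illustration, a hedged sketch of using the prefix to launch a script (`train.py` and the process count are placeholders; `tests/test_grad_sync.py` below builds its command the same way):
import subprocess
from accelerate.utils import get_launch_prefix

# resolves to ["torchrun", ...] on torch >= 1.10, with older fallbacks otherwise
cmd = get_launch_prefix() + ["--nproc_per_node=2", "train.py"]
subprocess.run(cmd, check=True)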
class PrepareForLaunch:
"""
Prepare a function that will be launched in a distributed setup.
@ -52,4 +68,5 @@ class PrepareForLaunch:
os.environ["LOCAL_RANK"] = str(index)
os.environ["RANK"] = str(index)
os.environ["FORK_LAUNCHED"] = str(1)
self.launcher(*args)

View File

@ -78,7 +78,7 @@ def dtype_byte_size(dtype: torch.dtype):
"""
if dtype == torch.bool:
return 1 / 8
bit_search = re.search("[^\d](\d+)$", str(dtype))
bit_search = re.search(r"[^\d](\d+)$", str(dtype))
if bit_search is None:
raise ValueError(f"`dtype` is not a valid dtype: {dtype}.")
bit_size = int(bit_search.groups()[0])
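Concretely, the raw string only silences the invalid-escape deprecation warning on newer Pythons; the match itself is unchanged:
import re
import torch

bit_search = re.search(r"[^\d](\d+)$", str(torch.float16))  # str(torch.float16) == "torch.float16"
assert int(bit_search.groups()[0]) / 8 == 2  # 2 bytes per fp16 element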

View File

@ -22,10 +22,18 @@ import torch
def offload_weight(weight, weight_name, offload_folder, index=None):
dtype = None
# Check the string instead of the dtype to be compatible with versions of PyTorch that don't have bfloat16.
if str(weight.dtype) == "torch.bfloat16":
# Need to reinterpret the underlying data as int16 since NumPy does not handle bfloat16s.
weight = weight.view(torch.int16)
dtype = "bfloat16"
array = weight.numpy()
tensor_file = os.path.join(offload_folder, f"{weight_name}.dat")
if index is not None:
index[weight_name] = {"dtype": str(array.dtype), "shape": list(array.shape)}
if dtype is None:
dtype = str(array.dtype)
index[weight_name] = {"dtype": dtype, "shape": list(array.shape)}
if array.ndim == 0:
array = array[None]
file_array = np.memmap(tensor_file, dtype=array.dtype, mode="w+", shape=array.shape)
@ -34,6 +42,28 @@ def offload_weight(weight, weight_name, offload_folder, index=None):
return index
def load_offloaded_weight(weight_file, weight_info):
shape = tuple(weight_info["shape"])
if shape == ():
# NumPy memory-mapped arrays can't have 0 dims, so it was saved as a 1d tensor
shape = (1,)
dtype = weight_info["dtype"]
if dtype == "bfloat16":
# NumPy does not support bfloat16 so this was saved as an int16
dtype = "int16"
weight = np.memmap(weight_file, dtype=dtype, shape=shape, mode="r")
if len(weight_info["shape"]) == 0:
weight = weight[0]
weight = torch.tensor(weight)
if weight_info["dtype"] == "bfloat16":
weight = weight.view(torch.bfloat16)
return weight
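A round-trip sketch of the new bfloat16 handling, mirroring `tests/test_offload.py` further down (assumes torch >= 1.10 so bfloat16 tensors are available):
import os
from tempfile import TemporaryDirectory

import torch
from accelerate.utils import load_offloaded_weight, offload_weight

weight = torch.randn(2, 3, dtype=torch.bfloat16)
with TemporaryDirectory() as tmp_dir:
    index = offload_weight(weight, "weight", tmp_dir, index={})
    # on disk the data is stored as int16; the index remembers the original "bfloat16" dtype
    reloaded = load_offloaded_weight(os.path.join(tmp_dir, "weight.dat"), index["weight"])
    assert torch.equal(weight, reloaded)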
def save_offload_index(index, offload_folder):
if index is None or len(index) == 0:
# Nothing to save
@ -129,12 +159,7 @@ class OffloadedWeightsLoader(Mapping):
return self.state_dict[key]
weight_info = self.index[key]
weight_file = os.path.join(self.save_folder, f"{key}.dat")
shape = tuple(weight_info["shape"])
if shape == ():
weight = np.memmap(weight_file, dtype=weight_info["dtype"], shape=(1,), mode="r")[0]
else:
weight = np.memmap(weight_file, dtype=weight_info["dtype"], shape=shape, mode="r")
return torch.tensor(weight)
return load_offloaded_weight(weight_file, weight_info)
def __iter__(self):
return iter(self.all_keys)

View File

@ -29,7 +29,7 @@ from .imports import is_tpu_available
from .versions import is_torch_version
if is_tpu_available():
if is_tpu_available(check_device=False):
import torch_xla.core.xla_model as xm
@ -217,7 +217,11 @@ def gather(tensor):
"""
if AcceleratorState().distributed_type == DistributedType.TPU:
return _tpu_gather(tensor, name="accelerate.utils.gather")
elif AcceleratorState().distributed_type in [DistributedType.DEEPSPEED, DistributedType.MULTI_GPU]:
elif AcceleratorState().distributed_type in [
DistributedType.DEEPSPEED,
DistributedType.MULTI_GPU,
DistributedType.FSDP,
]:
return _gpu_gather(tensor)
elif AcceleratorState().distributed_type == DistributedType.MULTI_CPU:
return _cpu_gather(tensor)
@ -430,7 +434,7 @@ def reduce(tensor, reduction="mean"):
xm.all_reduce("sum", cloned_tensor)
return cloned_tensor
elif state.distributed_type in [DistributedType.DEEPSPEED, DistributedType.MULTI_GPU]:
torch.distributed.reduce(cloned_tensor, ReduceOp.SUM)
torch.distributed.all_reduce(cloned_tensor, ReduceOp.SUM)
return cloned_tensor
else:
if reduction == "sum":

View File

@ -28,7 +28,7 @@ from .imports import is_deepspeed_available, is_tpu_available
if is_deepspeed_available():
from deepspeed import DeepSpeedEngine
if is_tpu_available():
if is_tpu_available(check_device=False):
import torch_xla.core.xla_model as xm

View File

@ -23,7 +23,7 @@ from .dataclasses import DistributedType, RNGType
from .imports import is_tpu_available
if is_tpu_available():
if is_tpu_available(check_device=False):
import torch_xla.core.xla_model as xm

View File

@ -0,0 +1,49 @@
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto",
"torch_adam": true,
"adam_w_mode": true
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": "auto",
"contiguous_gradients": true
},
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
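A hedged sketch of pointing Accelerate at a config file like this one; the path is a placeholder, and the `auto` entries are later filled in by `deepspeed_config_process`, which the new tests below exercise:
from accelerate import Accelerator
from accelerate.utils.dataclasses import DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(hf_ds_config="ds_config_zero2.json")
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=deepspeed_plugin)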

View File

@ -0,0 +1,56 @@
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto",
"torch_adam": true,
"adam_w_mode": true
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": "auto"
},
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}

View File

@ -0,0 +1,584 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import inspect
import io
import itertools
import json
import os
import tempfile
import unittest
from copy import deepcopy
from pathlib import Path
import torch
from torch.utils.data import DataLoader
from accelerate.accelerator import Accelerator
from accelerate.scheduler import AcceleratedScheduler
from accelerate.state import AcceleratorState
from accelerate.test_utils.testing import require_cuda, require_deepspeed
from accelerate.test_utils.training import RegressionDataset
from accelerate.utils.dataclasses import DeepSpeedPlugin
from accelerate.utils.deepspeed import (
DeepSpeedEngineWrapper,
DeepSpeedOptimizerWrapper,
DeepSpeedSchedulerWrapper,
DummyOptim,
DummyScheduler,
)
from parameterized import parameterized
from transformers import AutoModel, AutoModelForCausalLM, get_scheduler
from transformers.testing_utils import mockenv_context
from transformers.trainer_utils import set_seed
from transformers.utils import is_torch_bf16_available
set_seed(42)
T5_SMALL = "t5-small"
T5_TINY = "patrickvonplaten/t5-tiny-random"
GPT2_TINY = "sshleifer/tiny-gpt2"
ZERO2 = "zero2"
ZERO3 = "zero3"
FP16 = "fp16"
BF16 = "bf16"
CUSTOM_OPTIMIZER = "custom_optimizer"
CUSTOM_SCHEDULER = "custom_scheduler"
DS_OPTIMIZER = "deepspeed_optimizer"
DS_SCHEDULER = "deepspeed_scheduler"
stages = [ZERO2, ZERO3]
optims = [CUSTOM_OPTIMIZER, DS_OPTIMIZER]
schedulers = [CUSTOM_SCHEDULER, DS_SCHEDULER]
if is_torch_bf16_available():
dtypes = [FP16, BF16]
else:
dtypes = [FP16]
def parameterized_custom_name_func(func, param_num, param):
# customize the test name generator function: we want both params to appear in the sub-test
# name, because by default only the first param is shown
param_based_name = parameterized.to_safe_name("_".join(str(x) for x in param.args))
return f"{func.__name__}_{param_based_name}"
# Cartesian products of ZeRO stages with dtypes, and of optimizer with scheduler types, to test
params = list(itertools.product(stages, dtypes))
optim_scheduler_params = list(itertools.product(optims, schedulers))
@require_deepspeed
@require_cuda
class DeepSpeedConfigIntegration(unittest.TestCase):
def setUp(self):
super().setUp()
self._test_file_path = inspect.getfile(self.__class__)
path = Path(self._test_file_path).resolve()
self.test_file_dir_str = str(path.parents[0])
self.ds_config_file = dict(
zero2=f"{self.test_file_dir_str}/ds_config_zero2.json",
zero3=f"{self.test_file_dir_str}/ds_config_zero3.json",
)
# Use `self.get_config_dict(stage)` to access these configs so the originals are not modified
with io.open(self.ds_config_file[ZERO2], "r", encoding="utf-8") as f:
config_zero2 = json.load(f)
with io.open(self.ds_config_file[ZERO3], "r", encoding="utf-8") as f:
config_zero3 = json.load(f)
# The following setting slows things down, so don't enable it by default unless needed by a test.
# It's in the file as a demo for users since we want everything to work out of the box even if slower.
config_zero3["zero_optimization"]["stage3_gather_16bit_weights_on_model_save"] = False
self.ds_config_dict = dict(zero2=config_zero2, zero3=config_zero3)
self.dist_env = dict(
USE_DEEPSPEED="true",
MASTER_ADDR="localhost",
MASTER_PORT="10999",
RANK="0",
LOCAL_RANK="0",
WORLD_SIZE="1",
)
def get_config_dict(self, stage):
# As some tests modify the dict, always make a copy
return deepcopy(self.ds_config_dict[stage])
@parameterized.expand(stages, name_func=parameterized_custom_name_func)
def test_deepspeed_plugin(self, stage):
# Test zero3_init_flag will be set to False when ZeRO stage != 3
deepspeed_plugin = DeepSpeedPlugin(
gradient_accumulation_steps=1,
gradient_clipping=1.0,
zero_stage=2,
offload_optimizer_device="cpu",
offload_param_device="cpu",
zero3_save_16bit_model=True,
zero3_init_flag=True,
)
self.assertFalse(deepspeed_plugin.zero3_init_flag)
deepspeed_plugin.deepspeed_config = None
# Test zero3_init_flag will be set to True only when ZeRO stage == 3
deepspeed_plugin = DeepSpeedPlugin(
gradient_accumulation_steps=1,
gradient_clipping=1.0,
zero_stage=3,
offload_optimizer_device="cpu",
offload_param_device="cpu",
zero3_save_16bit_model=True,
zero3_init_flag=True,
)
self.assertTrue(deepspeed_plugin.zero3_init_flag)
deepspeed_plugin.deepspeed_config = None
# Test config files are loaded correctly
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config=self.ds_config_file[stage], zero3_init_flag=True)
if stage == ZERO2:
self.assertFalse(deepspeed_plugin.zero3_init_flag)
elif stage == ZERO3:
self.assertTrue(deepspeed_plugin.zero3_init_flag)
# Test `gradient_accumulation_steps` is set to 1 if unavailable in config file
with tempfile.TemporaryDirectory() as dirpath:
ds_config = self.get_config_dict(stage)
del ds_config["gradient_accumulation_steps"]
with open(os.path.join(dirpath, "ds_config.json"), "w") as out_file:
json.dump(ds_config, out_file)
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config=os.path.join(dirpath, "ds_config.json"))
self.assertEqual(deepspeed_plugin.deepspeed_config["gradient_accumulation_steps"], 1)
deepspeed_plugin.deepspeed_config = None
# Test `ValueError` is raised if `zero_optimization` is unavailable in config file
with tempfile.TemporaryDirectory() as dirpath:
ds_config = self.get_config_dict(stage)
del ds_config["zero_optimization"]
with open(os.path.join(dirpath, "ds_config.json"), "w") as out_file:
json.dump(ds_config, out_file)
with self.assertRaises(ValueError) as cm:
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config=os.path.join(dirpath, "ds_config.json"))
self.assertTrue(
"Please specify the ZeRO optimization config in the DeepSpeed config." in str(cm.exception)
)
deepspeed_plugin.deepspeed_config = None
# Test `deepspeed_config_process`
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config=self.ds_config_file[stage])
kwargs = {
"fp16.enabled": True,
"bf16.enabled": False,
"optimizer.params.lr": 5e-5,
"optimizer.params.weight_decay": 0.0,
"scheduler.params.warmup_min_lr": 0.0,
"scheduler.params.warmup_max_lr": 5e-5,
"scheduler.params.warmup_num_steps": 0,
"train_micro_batch_size_per_gpu": 16,
"gradient_clipping": 1.0,
"train_batch_size": 16,
"zero_optimization.reduce_bucket_size": 5e5,
"zero_optimization.stage3_prefetch_bucket_size": 5e5,
"zero_optimization.stage3_param_persistence_threshold": 5e5,
"zero_optimization.stage3_gather_16bit_weights_on_model_save": False,
}
deepspeed_plugin.deepspeed_config_process(**kwargs)
for ds_key_long, value in kwargs.items():
config, ds_key = deepspeed_plugin.hf_ds_config.find_config_node(ds_key_long)
if config.get(ds_key) is not None:
self.assertEqual(config.get(ds_key), value)
# Test mismatches
mismatches = {
"optimizer.params.lr": 1e-5,
"optimizer.params.weight_decay": 1e-5,
"gradient_accumulation_steps": 2,
}
with self.assertRaises(ValueError) as cm:
new_kwargs = deepcopy(kwargs)
new_kwargs.update(mismatches)
deepspeed_plugin.deepspeed_config_process(**new_kwargs)
for key in mismatches.keys():
self.assertTrue(
key in str(cm.exception),
f"{key} is not in the exception message:\n{cm.exception}",
)
# Test `ValueError` is raised if a config file field with an `auto` value is missing from `kwargs`
deepspeed_plugin.deepspeed_config["optimizer"]["params"]["lr"] = "auto"
with self.assertRaises(ValueError) as cm:
del kwargs["optimizer.params.lr"]
deepspeed_plugin.deepspeed_config_process(**kwargs)
self.assertTrue("`optimizer.params.lr` not found in kwargs." in str(cm.exception))
@parameterized.expand([FP16, BF16], name_func=parameterized_custom_name_func)
def test_accelerate_state_deepspeed(self, dtype):
state = AcceleratorState(_from_accelerator=True)
if state.initialized:
state.initialized = False
deepspeed_plugin = DeepSpeedPlugin(
gradient_accumulation_steps=1,
gradient_clipping=1.0,
zero_stage=ZERO2,
offload_optimizer_device="cpu",
offload_param_device="cpu",
zero3_save_16bit_model=True,
zero3_init_flag=True,
)
with mockenv_context(**self.dist_env):
state = Accelerator(mixed_precision=dtype, deepspeed_plugin=deepspeed_plugin).state
self.assertTrue(state.deepspeed_plugin.deepspeed_config[dtype]["enabled"])
state.initialized = False
def test_init_zero3(self):
deepspeed_plugin = DeepSpeedPlugin(
gradient_accumulation_steps=1,
gradient_clipping=1.0,
zero_stage=3,
offload_optimizer_device="cpu",
offload_param_device="cpu",
zero3_save_16bit_model=True,
zero3_init_flag=True,
)
with mockenv_context(**self.dist_env):
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
from transformers.deepspeed import is_deepspeed_zero3_enabled
self.assertTrue(is_deepspeed_zero3_enabled())
accelerator.state.initialized = False
@parameterized.expand(optim_scheduler_params, name_func=parameterized_custom_name_func)
def test_prepare_deepspeed(self, optim_type, scheduler_type):
# 1. Testing with one of the ZeRO Stages is enough to test the `_prepare_deepspeed` function.
# Here we test using ZeRO Stage 2 with FP16 enabled.
from deepspeed.runtime.engine import DeepSpeedEngine
kwargs = {
"fp16.enabled": True,
"bf16.enabled": False,
"optimizer.params.lr": 5e-5,
"optimizer.params.weight_decay": 0.0,
"scheduler.params.warmup_min_lr": 0.0,
"scheduler.params.warmup_max_lr": 5e-5,
"scheduler.params.warmup_num_steps": 0,
"train_micro_batch_size_per_gpu": 16,
"gradient_clipping": 1.0,
"train_batch_size": 16,
"zero_optimization.reduce_bucket_size": 5e5,
"zero_optimization.stage3_prefetch_bucket_size": 5e5,
"zero_optimization.stage3_param_persistence_threshold": 5e5,
"zero_optimization.stage3_gather_16bit_weights_on_model_save": False,
}
if optim_type == CUSTOM_OPTIMIZER and scheduler_type == CUSTOM_SCHEDULER:
# Test custom optimizer + custom scheduler
deepspeed_plugin = DeepSpeedPlugin(
gradient_accumulation_steps=1,
gradient_clipping=1.0,
zero_stage=2,
offload_optimizer_device="cpu",
offload_param_device="cpu",
zero3_save_16bit_model=False,
zero3_init_flag=False,
)
with mockenv_context(**self.dist_env):
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=deepspeed_plugin)
train_set = RegressionDataset(length=80)
eval_set = RegressionDataset(length=20)
train_dataloader = DataLoader(train_set, batch_size=16, shuffle=True)
eval_dataloader = DataLoader(eval_set, batch_size=32, shuffle=False)
model = AutoModel.from_pretrained(GPT2_TINY)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
lr_scheduler = get_scheduler(
name="linear",
optimizer=optimizer,
num_warmup_steps=0,
num_training_steps=1000,
)
dummy_optimizer = DummyOptim(params=model.parameters())
dummy_lr_scheduler = DummyScheduler(dummy_optimizer)
with self.assertRaises(ValueError) as cm:
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, dummy_optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
self.assertTrue(
"You cannot create a `DummyOptim` without specifying an optimizer in the config file."
in str(cm.exception)
)
with self.assertRaises(ValueError) as cm:
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, dummy_lr_scheduler
)
self.assertTrue(
"You cannot create a `DummyScheduler` without specifying a scheduler in the config file."
in str(cm.exception)
)
with self.assertRaises(ValueError) as cm:
model, optimizer, lr_scheduler = accelerator.prepare(model, optimizer, lr_scheduler)
self.assertTrue(
"You must specify a training or evaluation dataloader in `accelerate.prepare()` when using DeepSpeed."
in str(cm.exception)
)
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
self.assertTrue(accelerator.deepspeed_config["zero_allow_untested_optimizer"])
self.assertTrue(accelerator.deepspeed_config["train_batch_size"], 16)
self.assertEqual(type(model), DeepSpeedEngine)
self.assertEqual(type(optimizer), DeepSpeedOptimizerWrapper)
self.assertEqual(type(lr_scheduler), AcceleratedScheduler)
self.assertEqual(type(accelerator.deepspeed_engine_wrapped), DeepSpeedEngineWrapper)
elif optim_type == DS_OPTIMIZER and scheduler_type == DS_SCHEDULER:
# Test DeepSpeed optimizer + DeepSpeed scheduler
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config=self.ds_config_file[ZERO2])
with mockenv_context(**self.dist_env):
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
train_set = RegressionDataset(length=80)
eval_set = RegressionDataset(length=20)
train_dataloader = DataLoader(train_set, batch_size=10, shuffle=True)
eval_dataloader = DataLoader(eval_set, batch_size=5, shuffle=False)
model = AutoModel.from_pretrained(GPT2_TINY)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
lr_scheduler = get_scheduler(
name="linear",
optimizer=optimizer,
num_warmup_steps=0,
num_training_steps=1000,
)
dummy_optimizer = DummyOptim(params=model.parameters())
dummy_lr_scheduler = DummyScheduler(dummy_optimizer)
kwargs["train_batch_size"] = (
kwargs["train_micro_batch_size_per_gpu"]
* deepspeed_plugin.deepspeed_config["gradient_accumulation_steps"]
* accelerator.num_processes
)
accelerator.state.deepspeed_plugin.deepspeed_config_process(**kwargs)
with self.assertRaises(ValueError) as cm:
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, dummy_lr_scheduler
)
self.assertTrue(
"You cannot specify an optimizer in the config file and in the code at the same time"
in str(cm.exception)
)
with self.assertRaises(ValueError) as cm:
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, dummy_optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
self.assertTrue(
"You cannot specify a scheduler in the config file and in the code at the same time"
in str(cm.exception)
)
with self.assertRaises(ValueError) as cm:
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, dummy_optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
self.assertTrue(
"You cannot specify a scheduler in the config file and in the code at the same time"
in str(cm.exception)
)
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, dummy_optimizer, train_dataloader, eval_dataloader, dummy_lr_scheduler
)
self.assertTrue(type(model) == DeepSpeedEngine)
self.assertTrue(type(optimizer) == DeepSpeedOptimizerWrapper)
self.assertTrue(type(lr_scheduler) == DeepSpeedSchedulerWrapper)
self.assertTrue(type(accelerator.deepspeed_engine_wrapped) == DeepSpeedEngineWrapper)
elif optim_type == CUSTOM_OPTIMIZER and scheduler_type == DS_SCHEDULER:
# Test custom optimizer + DeepSpeed scheduler
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config=self.ds_config_file[ZERO2])
with mockenv_context(**self.dist_env):
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
train_set = RegressionDataset(length=80)
eval_set = RegressionDataset(length=20)
train_dataloader = DataLoader(train_set, batch_size=10, shuffle=True)
eval_dataloader = DataLoader(eval_set, batch_size=5, shuffle=False)
model = AutoModel.from_pretrained(GPT2_TINY)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
lr_scheduler = get_scheduler(
name="linear",
optimizer=optimizer,
num_warmup_steps=0,
num_training_steps=1000,
)
dummy_optimizer = DummyOptim(params=model.parameters())
dummy_lr_scheduler = DummyScheduler(dummy_optimizer)
kwargs["train_batch_size"] = (
kwargs["train_micro_batch_size_per_gpu"]
* deepspeed_plugin.deepspeed_config["gradient_accumulation_steps"]
* accelerator.num_processes
)
accelerator.state.deepspeed_plugin.deepspeed_config_process(**kwargs)
del accelerator.state.deepspeed_plugin.deepspeed_config["optimizer"]
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, dummy_lr_scheduler
)
self.assertTrue(type(model) == DeepSpeedEngine)
self.assertTrue(type(optimizer) == DeepSpeedOptimizerWrapper)
self.assertTrue(type(lr_scheduler) == DeepSpeedSchedulerWrapper)
self.assertTrue(type(accelerator.deepspeed_engine_wrapped) == DeepSpeedEngineWrapper)
elif optim_type == DS_OPTIMIZER and scheduler_type == CUSTOM_SCHEDULER:
# Test deepspeed optimizer + custom scheduler
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config=self.ds_config_file[ZERO2])
with mockenv_context(**self.dist_env):
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
train_set = RegressionDataset(length=80)
eval_set = RegressionDataset(length=20)
train_dataloader = DataLoader(train_set, batch_size=10, shuffle=True)
eval_dataloader = DataLoader(eval_set, batch_size=5, shuffle=False)
model = AutoModel.from_pretrained(GPT2_TINY)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
lr_scheduler = get_scheduler(
name="linear",
optimizer=optimizer,
num_warmup_steps=0,
num_training_steps=1000,
)
dummy_optimizer = DummyOptim(params=model.parameters())
dummy_lr_scheduler = DummyScheduler(dummy_optimizer)
kwargs["train_batch_size"] = (
kwargs["train_micro_batch_size_per_gpu"]
* deepspeed_plugin.deepspeed_config["gradient_accumulation_steps"]
* accelerator.num_processes
)
accelerator.state.deepspeed_plugin.deepspeed_config_process(**kwargs)
del accelerator.state.deepspeed_plugin.deepspeed_config["scheduler"]
with self.assertRaises(ValueError) as cm:
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, dummy_optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
self.assertTrue(
"You can only specify `accelerate.utils.DummyScheduler` in the code when using `accelerate.utils.DummyOptim`."
in str(cm.exception)
)
accelerator.state.initialized = False
def test_save_checkpoints(self):
deepspeed_plugin = DeepSpeedPlugin(
hf_ds_config=self.ds_config_file[ZERO3],
zero3_init_flag=True,
)
del deepspeed_plugin.deepspeed_config["bf16"]
kwargs = {
"fp16.enabled": True,
"bf16.enabled": False,
"optimizer.params.lr": 5e-5,
"optimizer.params.weight_decay": 0.0,
"scheduler.params.warmup_min_lr": 0.0,
"scheduler.params.warmup_max_lr": 5e-5,
"scheduler.params.warmup_num_steps": 0,
"train_micro_batch_size_per_gpu": 16,
"gradient_clipping": 1.0,
"train_batch_size": 16,
"zero_optimization.reduce_bucket_size": 5e5,
"zero_optimization.stage3_prefetch_bucket_size": 5e5,
"zero_optimization.stage3_param_persistence_threshold": 5e5,
"zero_optimization.stage3_gather_16bit_weights_on_model_save": False,
}
with mockenv_context(**self.dist_env):
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
kwargs["train_batch_size"] = (
kwargs["train_micro_batch_size_per_gpu"]
* deepspeed_plugin.deepspeed_config["gradient_accumulation_steps"]
* accelerator.num_processes
)
accelerator.state.deepspeed_plugin.deepspeed_config_process(**kwargs)
train_set = RegressionDataset(length=80)
eval_set = RegressionDataset(length=20)
train_dataloader = DataLoader(train_set, batch_size=16, shuffle=True)
eval_dataloader = DataLoader(eval_set, batch_size=32, shuffle=False)
model = AutoModelForCausalLM.from_pretrained("gpt2")
dummy_optimizer = DummyOptim(params=model.parameters())
dummy_lr_scheduler = DummyScheduler(dummy_optimizer)
model, _, train_dataloader, eval_dataloader, _ = accelerator.prepare(
model, dummy_optimizer, train_dataloader, eval_dataloader, dummy_lr_scheduler
)
with self.assertRaises(ValueError) as cm:
accelerator.get_state_dict(model)
msg = (
"Cannot get 16bit model weights because `stage3_gather_16bit_weights_on_model_save` in DeepSpeed config is False. "
"To save the model weights in 16bit, set `stage3_gather_16bit_weights_on_model_save` to True in DeepSpeed config file or "
"set `zero3_save_16bit_model` to True when using `accelerate config`. "
"To save the full checkpoint, run `model.save_checkpoint(save_dir)` and use `zero_to_fp32.py` to recover weights."
)
self.assertTrue(msg in str(cm.exception))
accelerator.state.initialized = False
def test_autofill_dsconfig(self):
deepspeed_plugin = DeepSpeedPlugin(
hf_ds_config=self.ds_config_file[ZERO3],
zero3_init_flag=True,
)
del deepspeed_plugin.deepspeed_config["bf16"]
del deepspeed_plugin.deepspeed_config["fp16"]
with mockenv_context(**self.dist_env):
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
train_set = RegressionDataset(length=80)
eval_set = RegressionDataset(length=20)
train_dataloader = DataLoader(train_set, batch_size=16, shuffle=True)
eval_dataloader = DataLoader(eval_set, batch_size=32, shuffle=False)
model = AutoModelForCausalLM.from_pretrained("gpt2")
dummy_optimizer = DummyOptim(params=model.parameters(), lr=5e-5, weight_decay=1e-4)
dummy_lr_scheduler = DummyScheduler(dummy_optimizer, warmup_num_steps=10, total_num_steps=1000)
hidden_size = model.config.hidden_size
model, _, train_dataloader, eval_dataloader, _ = accelerator.prepare(
model, dummy_optimizer, train_dataloader, eval_dataloader, dummy_lr_scheduler
)
self.assertEqual(accelerator.deepspeed_config["train_micro_batch_size_per_gpu"], 16)
self.assertEqual(accelerator.deepspeed_config["train_batch_size"], 16)
self.assertEqual(accelerator.deepspeed_config["optimizer"]["params"]["lr"], 5e-5)
self.assertEqual(accelerator.deepspeed_config["optimizer"]["params"]["weight_decay"], 1e-4)
self.assertEqual(accelerator.deepspeed_config["scheduler"]["params"]["warmup_min_lr"], 0.0)
self.assertEqual(accelerator.deepspeed_config["scheduler"]["params"]["warmup_max_lr"], 5e-5)
self.assertEqual(accelerator.deepspeed_config["scheduler"]["params"]["warmup_num_steps"], 10)
self.assertEqual(accelerator.deepspeed_config["gradient_clipping"], 1.0)
self.assertEqual(
accelerator.deepspeed_config["zero_optimization"]["reduce_bucket_size"], hidden_size * hidden_size
)
self.assertEqual(
accelerator.deepspeed_config["zero_optimization"]["stage3_prefetch_bucket_size"],
0.9 * hidden_size * hidden_size,
)
self.assertEqual(
accelerator.deepspeed_config["zero_optimization"]["stage3_param_persistence_threshold"],
10 * hidden_size,
)
self.assertFalse(
accelerator.deepspeed_config["zero_optimization"]["stage3_gather_16bit_weights_on_model_save"]
)
accelerator.state.initialized = False

View File

@ -56,6 +56,29 @@ class BiggerModelForTest(nn.Module):
return self.linear4(self.linear3(self.batchnorm(self.linear2(self.linear1(x)))))
# To test preload_module_classes
class ModuleWithUnusedSubModules(nn.Module):
def __init__(self, input_dim, output_dim):
super().__init__()
self.linear = nn.Linear(input_dim, output_dim)
def forward(self, x):
return x @ self.linear.weight.t() + self.linear.bias
class ModelWithUnusedSubModulesForTest(nn.Module):
def __init__(self):
super().__init__()
self.linear1 = ModuleWithUnusedSubModules(3, 4)
self.linear2 = ModuleWithUnusedSubModules(4, 5)
self.batchnorm = nn.BatchNorm1d(5)
self.linear3 = ModuleWithUnusedSubModules(5, 6)
self.linear4 = ModuleWithUnusedSubModules(6, 5)
def forward(self, x):
return self.linear4(self.linear3(self.batchnorm(self.linear2(self.linear1(x)))))
class BigModelingTester(unittest.TestCase):
def test_init_empty_weights(self):
# base use
@ -94,14 +117,45 @@ class BigModelingTester(unittest.TestCase):
cpu_offload(model, execution_device=device)
output = model(x)
self.assertTrue(torch.allclose(expected, output.cpu()))
self.assertTrue(
torch.allclose(expected, output.cpu(), 1e-4, 1e-5), msg=f"Expected: {expected}\nActual: {output.cpu()}"
)
# Clean up for next test.
remove_hook_from_submodules(model)
cpu_offload(model, execution_device=device, offload_buffers=True)
output = model(x)
self.assertTrue(torch.allclose(expected, output.cpu()))
self.assertTrue(
torch.allclose(expected, output.cpu(), 1e-4, 1e-5), msg=f"Expected: {expected}\nActual: {output.cpu()}"
)
def test_cpu_offload_with_unused_submodules(self):
model = ModelWithUnusedSubModulesForTest()
x = torch.randn(2, 3)
expected = model(x)
device = torch.device(0 if torch.cuda.is_available() else "cpu")
cpu_offload(model, execution_device=device, preload_module_classes=["ModuleWithUnusedSubModules"])
output = model(x)
self.assertTrue(
torch.allclose(expected, output.cpu(), 1e-4, 1e-5), msg=f"Expected: {expected}\nActual: {output.cpu()}"
)
# Clean up for next test.
remove_hook_from_submodules(model)
cpu_offload(
model,
execution_device=device,
offload_buffers=True,
preload_module_classes=["ModuleWithUnusedSubModules"],
)
output = model(x)
self.assertTrue(
torch.allclose(expected, output.cpu(), 1e-4, 1e-5), msg=f"Expected: {expected}\nActual: {output.cpu()}"
)
@slow
@require_cuda
@ -127,7 +181,9 @@ class BigModelingTester(unittest.TestCase):
with TemporaryDirectory() as tmp_dir:
disk_offload(model, tmp_dir, execution_device=device)
output = model(x)
self.assertTrue(torch.allclose(expected, output.cpu()))
self.assertTrue(
torch.allclose(expected, output.cpu(), 1e-4, 1e-5), msg=f"Expected: {expected}\nActual: {output.cpu()}"
)
# Clean up for next test.
remove_hook_from_submodules(model)
@ -135,7 +191,41 @@ class BigModelingTester(unittest.TestCase):
with TemporaryDirectory() as tmp_dir:
disk_offload(model, tmp_dir, execution_device=device, offload_buffers=True)
output = model(x)
self.assertTrue(torch.allclose(expected, output.cpu()))
self.assertTrue(
torch.allclose(expected, output.cpu(), 1e-4, 1e-5), msg=f"Expected: {expected}\nActual: {output.cpu()}"
)
def test_disk_offload_with_unused_submodules(self):
model = ModelWithUnusedSubModulesForTest()
x = torch.randn(2, 3)
expected = model(x)
device = torch.device(0 if torch.cuda.is_available() else "cpu")
with TemporaryDirectory() as tmp_dir:
disk_offload(
model, tmp_dir, execution_device=device, preload_module_classes=["ModuleWithUnusedSubModules"]
)
output = model(x)
self.assertTrue(
torch.allclose(expected, output.cpu(), 1e-4, 1e-5), msg=f"Expected: {expected}\nActual: {output.cpu()}"
)
# Clean up for next test.
remove_hook_from_submodules(model)
with TemporaryDirectory() as tmp_dir:
disk_offload(
model,
tmp_dir,
execution_device=device,
offload_buffers=True,
preload_module_classes=["ModuleWithUnusedSubModules"],
)
output = model(x)
self.assertTrue(
torch.allclose(expected, output.cpu(), 1e-4, 1e-5), msg=f"Expected: {expected}\nActual: {output.cpu()}"
)
@slow
@require_cuda
@ -229,6 +319,36 @@ class BigModelingTester(unittest.TestCase):
"Hello world! My name is Kiyoshi, and I'm a student at the University of Tokyo",
)
@require_cuda
def test_dispatch_model_with_unused_submodules(self):
model = ModelWithUnusedSubModulesForTest()
device_map = {"linear1": "cpu", "linear2": "disk", "batchnorm": "cpu", "linear3": 0, "linear4": 0}
x = torch.randn(2, 3)
expected = model(x)
with TemporaryDirectory() as tmp_dir:
dispatch_model(
model, device_map, offload_dir=tmp_dir, preload_module_classes=["ModuleWithUnusedSubModules"]
)
output = model(x)
self.assertTrue(torch.allclose(expected, output.cpu(), atol=1e-5))
@require_multi_gpu
def test_dispatch_model_with_unused_submodules_multi_gpu(self):
model = ModelWithUnusedSubModulesForTest()
device_map = {"linear1": "cpu", "linear2": "disk", "batchnorm": "cpu", "linear3": 0, "linear4": 1}
x = torch.randn(2, 3)
expected = model(x)
with TemporaryDirectory() as tmp_dir:
dispatch_model(
model, device_map, offload_dir=tmp_dir, preload_module_classes=["ModuleWithUnusedSubModules"]
)
output = model(x)
self.assertTrue(torch.allclose(expected, output.cpu(), atol=1e-5))
@require_cuda
def test_load_checkpoint_and_dispatch(self):
model = ModelForTest()
@ -274,3 +394,55 @@ class BigModelingTester(unittest.TestCase):
output = new_model(x)
self.assertTrue(torch.allclose(expected, output.cpu(), atol=1e-5))
@require_cuda
def test_load_checkpoint_and_dispatch_with_unused_submodules(self):
model = ModelWithUnusedSubModulesForTest()
device_map = {"linear1": "cpu", "linear2": "cpu", "batchnorm": 0, "linear3": 0, "linear4": 0}
x = torch.randn(2, 3)
expected = model(x)
with TemporaryDirectory() as tmp_dir:
checkpoint = os.path.join(tmp_dir, "pt_model.bin")
torch.save(model.state_dict(), checkpoint)
new_model = ModelWithUnusedSubModulesForTest()
new_model = load_checkpoint_and_dispatch(
new_model, checkpoint, device_map=device_map, preload_module_classes=["ModuleWithUnusedSubModules"]
)
# CPU-offloaded weights are on the meta device while waiting for the forward pass.
self.assertEqual(new_model.linear1.linear.weight.device, torch.device("meta"))
self.assertEqual(new_model.linear2.linear.weight.device, torch.device("meta"))
self.assertEqual(new_model.linear3.linear.weight.device, torch.device(0))
self.assertEqual(new_model.linear4.linear.weight.device, torch.device(0))
output = new_model(x)
self.assertTrue(torch.allclose(expected, output.cpu(), atol=1e-5))
@require_multi_gpu
def test_load_checkpoint_and_dispatch_multi_gpu_with_unused_submodules(self):
model = ModelWithUnusedSubModulesForTest()
device_map = {"linear1": "cpu", "linear2": "cpu", "batchnorm": 0, "linear3": 0, "linear4": 1}
x = torch.randn(2, 3)
expected = model(x)
with TemporaryDirectory() as tmp_dir:
checkpoint = os.path.join(tmp_dir, "pt_model.bin")
torch.save(model.state_dict(), checkpoint)
new_model = ModelWithUnusedSubModulesForTest()
new_model = load_checkpoint_and_dispatch(
new_model, checkpoint, device_map=device_map, preload_module_classes=["ModuleWithUnusedSubModules"]
)
# CPU-offloaded weights are on the meta device while waiting for the forward pass.
self.assertEqual(new_model.linear1.linear.weight.device, torch.device("meta"))
self.assertEqual(new_model.linear2.linear.weight.device, torch.device("meta"))
self.assertEqual(new_model.linear3.linear.weight.device, torch.device(0))
self.assertEqual(new_model.linear4.linear.weight.device, torch.device(1))
output = new_model(x)
self.assertTrue(torch.allclose(expected, output.cpu(), atol=1e-5))

View File

@ -15,9 +15,10 @@
import unittest
from accelerate import debug_launcher
from accelerate.test_utils import test_script
from accelerate.test_utils import require_cpu, test_script
@require_cpu
class MultiCPUTester(unittest.TestCase):
def test_cpu(self):
debug_launcher(test_script.main)

View File

@ -16,14 +16,14 @@ import ast
import os
import re
import shutil
import subprocess
import tempfile
import unittest
from unittest import mock
from accelerate import Accelerator
import torch
from accelerate.test_utils.examples import compare_against_test
from accelerate.test_utils.testing import TempDirTestCase, slow
from accelerate.test_utils.testing import TempDirTestCase, require_trackers, run_command, slow
from accelerate.utils import write_basic_config
@ -31,7 +31,14 @@ from accelerate.utils import write_basic_config
# Should mock `{script_name}.get_dataloaders` via:
# @mock.patch("{script_name}.get_dataloaders", mocked_dataloaders)
EXCLUDE_EXAMPLES = ["cross_validation.py", "multi_process_metrics.py", "memory.py", "fsdp_with_peak_mem_tracking.py"]
EXCLUDE_EXAMPLES = [
"cross_validation.py",
"gradient_accumulation.py",
"multi_process_metrics.py",
"memory.py",
"fsdp_with_peak_mem_tracking.py",
"deepspeed_with_config_support.py",
]
class ExampleDifferenceTests(unittest.TestCase):
@ -137,7 +144,7 @@ class FeatureExamplesTests(TempDirTestCase):
--checkpointing_steps epoch
--output_dir {self.tmpdir}
""".split()
_ = subprocess.run(self._launch_args + testargs, stdout=subprocess.PIPE)
run_command(self._launch_args + testargs)
self.assertTrue(os.path.exists(os.path.join(self.tmpdir, "epoch_1")))
def test_checkpointing_by_steps(self):
@ -146,7 +153,7 @@ class FeatureExamplesTests(TempDirTestCase):
--checkpointing_steps 1
--output_dir {self.tmpdir}
""".split()
_ = subprocess.run(self._launch_args + testargs, stdout=subprocess.PIPE, env=os.environ)
_ = run_command(self._launch_args + testargs)
self.assertTrue(os.path.exists(os.path.join(self.tmpdir, "step_5")))
def test_load_states_by_epoch(self):
@ -154,9 +161,7 @@ class FeatureExamplesTests(TempDirTestCase):
examples/by_feature/checkpointing.py
--resume_from_checkpoint {os.path.join(self.tmpdir, "epoch_1")}
""".split()
output = subprocess.run(
self._launch_args + testargs, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
).stdout
output = run_command(self._launch_args + testargs, return_stdout=True)
self.assertNotIn("epoch 0:", output)
self.assertNotIn("epoch 1:", output)
self.assertIn("epoch 2:", output)
@ -166,18 +171,18 @@ class FeatureExamplesTests(TempDirTestCase):
examples/by_feature/checkpointing.py
--resume_from_checkpoint {os.path.join(self.tmpdir, "step_5")}
""".split()
output = subprocess.run(
self._launch_args + testargs, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
).stdout
num_processes = Accelerator().num_processes
output = run_command(self._launch_args + testargs, return_stdout=True)
if torch.cuda.is_available():
num_processes = torch.cuda.device_count()
else:
num_processes = 1
if num_processes > 1:
self.assertNotIn("epoch 0:", output)
self.assertNotIn("epoch 1:", output)
self.assertIn("epoch 2:", output)
else:
self.assertNotIn("epoch 0:", output)
self.assertIn("epoch 1:", output)
self.assertIn("epoch 2:", output)
self.assertIn("epoch 2:", output)
@slow
def test_cross_validation(self):
@ -186,16 +191,16 @@ class FeatureExamplesTests(TempDirTestCase):
--num_folds 2
""".split()
with mock.patch.dict(os.environ, {"TESTING_MOCKED_DATALOADERS": "0"}):
output = subprocess.run(
self._launch_args + testargs, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
).stdout
output = run_command(self._launch_args + testargs, return_stdout=True)
results = ast.literal_eval(re.findall("({.+})", output)[-1])
self.assertGreaterEqual(results["accuracy"], 0.75)
def test_multi_process_metrics(self):
testargs = ["examples/by_feature/multi_process_metrics.py"]
_ = subprocess.run(self._launch_args + testargs, stdout=subprocess.PIPE)
run_command(self._launch_args + testargs)
@require_trackers
@mock.patch.dict(os.environ, {"WANDB_MODE": "offline"})
def test_tracking(self):
with tempfile.TemporaryDirectory() as tmpdir:
testargs = f"""
@ -203,5 +208,9 @@ class FeatureExamplesTests(TempDirTestCase):
--with_tracking
--logging_dir {tmpdir}
""".split()
_ = subprocess.run(self._launch_args + testargs, stdout=subprocess.PIPE)
run_command(self._launch_args + testargs)
self.assertTrue(os.path.exists(os.path.join(tmpdir, "tracking")))
def test_gradient_accumulation(self):
testargs = ["examples/by_feature/gradient_accumulation.py"]
run_command(self._launch_args + testargs)

tests/test_grad_sync.py (new file, 55 lines)
View File

@ -0,0 +1,55 @@
# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import inspect
import os
import unittest
import torch
import accelerate
from accelerate import debug_launcher
from accelerate.test_utils import (
execute_subprocess_async,
require_cpu,
require_multi_gpu,
require_single_gpu,
test_sync,
)
from accelerate.utils import get_launch_prefix, patch_environment
class SyncScheduler(unittest.TestCase):
def setUp(self):
mod_file = inspect.getfile(accelerate.test_utils)
self.test_file_path = os.path.sep.join(mod_file.split(os.path.sep)[:-1] + ["scripts", "test_sync.py"])
@require_cpu
def test_gradient_sync_cpu_noop(self):
debug_launcher(test_sync.main, num_processes=1)
@require_cpu
def test_gradient_sync_cpu_multi(self):
debug_launcher(test_sync.main)
@require_single_gpu
def test_gradient_sync_gpu(self):
test_sync.main()
@require_multi_gpu
def test_gradient_sync_gpu_multi(self):
print(f"Found {torch.cuda.device_count()} devices.")
cmd = get_launch_prefix() + [f"--nproc_per_node={torch.cuda.device_count()}", self.test_file_path]
with patch_environment(omp_num_threads=1):
execute_subprocess_async(cmd, env=os.environ.copy())

View File

@ -77,20 +77,20 @@ class HooksModelTester(unittest.TestCase):
test_hook = PreForwardHook()
add_hook_to_module(test_model, test_hook)
output1 = test_model(x)
self.assertTrue(torch.allclose(output1, expected))
self.assertTrue(torch.allclose(output1, expected, atol=1e-5))
# Attaching a hook to a model that already has one replaces the existing hook, it does not chain them
test_hook = PreForwardHook()
add_hook_to_module(test_model, test_hook)
output1 = test_model(x)
self.assertTrue(torch.allclose(output1, expected))
self.assertTrue(torch.allclose(output1, expected, atol=1e-5))
# You need to use the sequential hook to chain two or more hooks
test_hook = SequentialHook(PreForwardHook(), PreForwardHook())
add_hook_to_module(test_model, test_hook)
output2 = test_model(x)
assert torch.allclose(output2, expected2)
assert torch.allclose(output2, expected2, atol=1e-5)
def test_post_forward_hook_is_executed(self):
test_model = ModelForTest()
@ -100,20 +100,20 @@ class HooksModelTester(unittest.TestCase):
test_hook = PostForwardHook()
add_hook_to_module(test_model, test_hook)
output1 = test_model(x)
self.assertTrue(torch.allclose(output1, output + 1))
self.assertTrue(torch.allclose(output1, output + 1, atol=1e-5))
# Attaching a hook to a model that already has one replaces the existing hook, it does not chain them
test_hook = PostForwardHook()
add_hook_to_module(test_model, test_hook)
output1 = test_model(x)
self.assertTrue(torch.allclose(output1, output + 1))
self.assertTrue(torch.allclose(output1, output + 1, atol=1e-5))
# You need to use the sequential hook to chain two or more hooks
test_hook = SequentialHook(PostForwardHook(), PostForwardHook())
add_hook_to_module(test_model, test_hook)
output2 = test_model(x)
assert torch.allclose(output2, output + 2)
assert torch.allclose(output2, output + 2, atol=1e-5)
def test_no_grad_in_hook(self):
test_model = ModelForTest()

View File

@ -21,6 +21,7 @@ from dataclasses import dataclass
import torch
from accelerate import Accelerator, DistributedDataParallelKwargs, GradScalerKwargs
from accelerate.state import AcceleratorState
from accelerate.test_utils import execute_subprocess_async, require_cuda, require_multi_gpu
from accelerate.utils import KwargsHandler
@ -44,7 +45,8 @@ class DataLoaderTester(unittest.TestCase):
def test_grad_scaler_kwargs(self):
# If no defaults are changed, `to_kwargs` returns an empty dict.
scaler_handler = GradScalerKwargs(init_scale=1024, growth_factor=2)
accelerator = Accelerator(fp16=True, kwargs_handlers=[scaler_handler])
AcceleratorState._reset_state()
accelerator = Accelerator(mixed_precision="fp16", kwargs_handlers=[scaler_handler])
print(accelerator.use_fp16)
scaler = accelerator.scaler

View File

@ -14,7 +14,6 @@
import inspect
import os
import sys
import unittest
import torch
@ -22,35 +21,26 @@ import torch
import accelerate
from accelerate import Accelerator
from accelerate.test_utils import execute_subprocess_async, require_multi_gpu
from accelerate.utils import get_launch_prefix, patch_environment
class MultiGPUTester(unittest.TestCase):
def setUp(self):
mod_file = inspect.getfile(accelerate.test_utils)
self.test_file_path = os.path.sep.join(mod_file.split(os.path.sep)[:-1] + ["test_script.py"])
self.test_file_path = os.path.sep.join(mod_file.split(os.path.sep)[:-1] + ["scripts", "test_script.py"])
@require_multi_gpu
def test_multi_gpu(self):
print(f"Found {torch.cuda.device_count()} devices.")
distributed_args = f"""
-m torch.distributed.launch
--nproc_per_node={torch.cuda.device_count()}
--use_env
{self.test_file_path}
""".split()
cmd = [sys.executable] + distributed_args
execute_subprocess_async(cmd, env=os.environ.copy())
cmd = get_launch_prefix() + [self.test_file_path]
with patch_environment(omp_num_threads=1):
execute_subprocess_async(cmd, env=os.environ.copy())
@require_multi_gpu
def test_pad_across_processes(self):
distributed_args = f"""
-m torch.distributed.launch
--nproc_per_node={torch.cuda.device_count()}
--use_env
{inspect.getfile(self.__class__)}
""".split()
cmd = [sys.executable] + distributed_args
execute_subprocess_async(cmd, env=os.environ.copy())
cmd = get_launch_prefix() + [inspect.getfile(self.__class__)]
with patch_environment(omp_num_threads=1):
execute_subprocess_async(cmd, env=os.environ.copy())
if __name__ == "__main__":

View File

@ -19,7 +19,13 @@ from tempfile import TemporaryDirectory
import torch
import torch.nn as nn
from accelerate.utils import OffloadedWeightsLoader, offload_state_dict
from accelerate.utils import (
OffloadedWeightsLoader,
is_torch_version,
load_offloaded_weight,
offload_state_dict,
offload_weight,
)
class ModelForTest(nn.Module):
@ -35,8 +41,6 @@ class ModelForTest(nn.Module):
class OffloadTester(unittest.TestCase):
def test_offload_state_dict(self):
from tempfile import TemporaryDirectory
model = ModelForTest()
with TemporaryDirectory() as tmp_dir:
offload_state_dict(tmp_dir, model.state_dict())
@ -49,6 +53,22 @@ class OffloadTester(unittest.TestCase):
self.assertTrue(os.path.isfile(weight_file))
# TODO: add tests on the fact weights are properly loaded
def test_offload_weight(self):
dtypes = [torch.float16, torch.float32]
if is_torch_version(">=", "1.10"):
dtypes.append(torch.bfloat16)
for dtype in dtypes:
weight = torch.randn(2, 3, dtype=dtype)
with TemporaryDirectory() as tmp_dir:
index = offload_weight(weight, "weight", tmp_dir, {})
weight_file = os.path.join(tmp_dir, "weight.dat")
self.assertTrue(os.path.isfile(weight_file))
self.assertDictEqual(index, {"weight": {"shape": [2, 3], "dtype": str(dtype).split(".")[1]}})
new_weight = load_offloaded_weight(weight_file, index["weight"])
self.assertTrue(torch.equal(weight, new_weight))
def test_offload_weights_loader(self):
model = ModelForTest()
state_dict = model.state_dict()

View File

@ -18,6 +18,7 @@ from functools import partial
import torch
from accelerate import Accelerator, debug_launcher
from accelerate.test_utils import require_cpu
def scheduler_test(num_processes=2, step_scheduler_with_optimizer=True, split_batches=False):
@ -46,6 +47,7 @@ def scheduler_test(num_processes=2, step_scheduler_with_optimizer=True, split_ba
), f"Wrong lr found at second step, expected {expected_lr}, got {scheduler.get_last_lr()[0]}"
@require_cpu
class SchedulerTester(unittest.TestCase):
def test_scheduler_steps_with_optimizer_single_process(self):
debug_launcher(partial(scheduler_test, num_processes=1), num_processes=1)

View File

@ -24,7 +24,7 @@ from accelerate.test_utils import execute_subprocess_async, require_tpu
class MultiTPUTester(unittest.TestCase):
def setUp(self):
mod_file = inspect.getfile(accelerate.test_utils)
self.test_file_path = os.path.sep.join(mod_file.split(os.path.sep)[:-1] + ["test_script.py"])
self.test_file_path = os.path.sep.join(mod_file.split(os.path.sep)[:-1] + ["scripts", "test_script.py"])
self.test_dir = os.path.sep.join(inspect.getfile(self.__class__).split(os.path.sep)[:-1])
@require_tpu

View File

@ -30,31 +30,22 @@ from accelerate.test_utils.testing import (
MockingTestCase,
TempDirTestCase,
require_comet_ml,
require_tensorflow,
require_tensorboard,
require_wandb,
)
from accelerate.tracking import CometMLTracker, GeneralTracker
from accelerate.utils import is_comet_ml_available, is_tensorflow_available
from accelerate.utils import is_comet_ml_available
if is_comet_ml_available():
from comet_ml import OfflineExperiment
if is_tensorflow_available():
import tensorflow as tf
from tensorboard.plugins.hparams import plugin_data_pb2
from tensorflow.core.util import event_pb2
from tensorflow.python.summary.summary_iterator import summary_iterator
logger = logging.getLogger(__name__)
@require_tensorboard
class TensorBoardTrackingTest(unittest.TestCase):
@require_tensorflow
def test_init_trackers(self):
hps = None
project_name = "test_project_with_config"
with tempfile.TemporaryDirectory() as dirpath:
accelerator = Accelerator(log_with="tensorboard", logging_dir=dirpath)
@ -63,29 +54,9 @@ class TensorBoardTrackingTest(unittest.TestCase):
accelerator.end_training()
for child in Path(f"{dirpath}/{project_name}").glob("*/**"):
log = list(filter(lambda x: x.is_file(), child.iterdir()))[0]
# The config log is stored one layer deeper in the logged directory
# And names are randomly generated each time
si = summary_iterator(str(log))
# Pull HPS through careful parsing
for event in si:
for value in event.summary.value:
proto_bytes = value.metadata.plugin_data.content
plugin_data = plugin_data_pb2.HParamsPluginData.FromString(proto_bytes)
if plugin_data.HasField("session_start_info"):
hps = dict(plugin_data.session_start_info.hparams)
self.assertNotEqual(str(log), "")
self.assertTrue(isinstance(hps, dict))
keys = list(hps.keys())
keys.sort()
self.assertEqual(keys, ["learning_rate", "num_iterations", "some_boolean", "some_string"])
self.assertEqual(hps["num_iterations"].number_value, 12)
self.assertEqual(hps["learning_rate"].number_value, 0.01)
self.assertEqual(hps["some_boolean"].bool_value, False)
self.assertEqual(hps["some_string"].string_value, "some_value")
@require_tensorflow
def test_log(self):
step = None
project_name = "test_project_with_log"
with tempfile.TemporaryDirectory() as dirpath:
accelerator = Accelerator(log_with="tensorboard", logging_dir=dirpath)
@ -96,21 +67,7 @@ class TensorBoardTrackingTest(unittest.TestCase):
# Logged values are stored in the outermost-tfevents file and can be read in as a TFRecord
# Names are randomly generated each time
log = list(filter(lambda x: x.is_file(), Path(f"{dirpath}/{project_name}").iterdir()))[0]
serialized_examples = tf.data.TFRecordDataset(log)
for e in serialized_examples:
event = event_pb2.Event.FromString(e.numpy())
if step is None:
step = event.step
for value in event.summary.value:
if value.tag == "total_loss":
total_loss = value.simple_value
elif value.tag == "iteration":
iteration = value.simple_value
elif value.tag == "my_text/text_summary": # Append /text_summary to the key
my_text = value.tensor.string_val[0].decode()
self.assertAlmostEqual(total_loss, values["total_loss"])
self.assertEqual(iteration, values["iteration"])
self.assertEqual(my_text, values["my_text"])
self.assertNotEqual(str(log), "")
def test_logging_dir(self):
with self.assertRaisesRegex(ValueError, "Logging with `tensorboard` requires a `logging_dir`"):

View File

@ -23,7 +23,7 @@ from accelerate.test_utils.training import RegressionModel
from accelerate.utils import convert_outputs_to_fp32, find_device, patch_environment, send_to_device
TestNamedTuple = namedtuple("TestNamedTuple", "a b c")
ExampleNamedTuple = namedtuple("ExampleNamedTuple", "a b c")
class UtilsTester(unittest.TestCase):
@ -50,8 +50,8 @@ class UtilsTester(unittest.TestCase):
self.assertTrue(torch.equal(result2["b"][1].cpu(), tensor))
self.assertEqual(result2["c"], 1)
result3 = send_to_device(TestNamedTuple(a=tensor, b=[tensor, tensor], c=1), device)
self.assertIsInstance(result3, TestNamedTuple)
result3 = send_to_device(ExampleNamedTuple(a=tensor, b=[tensor, tensor], c=1), device)
self.assertIsInstance(result3, ExampleNamedTuple)
self.assertTrue(torch.equal(result3.a.cpu(), tensor))
self.assertIsInstance(result3.b, list)
self.assertTrue(torch.equal(result3.b[0].cpu(), tensor))

View File

@ -78,7 +78,6 @@ def main():
# Patch sys.argv
sys.argv = [args.training_script] + args.training_script_args + ["--tpu_num_cores", str(args.num_cores)]
xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)