Release: v0.9.0

Refactor some parts in utils (#380)
Better check for deepspeed availability (#379)
2025-11-17 16:04:35 +08:00 · 2022-05-20 13:46:17 -04:00 · 2022-05-20 12:23:54 -04:00 · 2022-05-20 11:05:18 -04:00 · 2022-05-20 08:55:03 -04:00 · 2022-05-20 17:18:17 +05:30
78 changed files with 8749 additions and 670 deletions
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@ -12,6 +12,19 @@ jobs:
      with:
        python-version: 3.6
    - name: Install Python dependencies
-      run: pip install -e .[test]
+      run: pip install setuptools==59.5.0; pip install -e .[test,test_trackers]
    - name: Run Tests
-      run: make test
+      run: make test
+      
+  test_examples:
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v2
+    - name: Set up Python 3.6
+      uses: actions/setup-python@v2
+      with:
+        python-version: 3.6
+    - name: Install Python dependencies
+      run: pip install setuptools==59.5.0; pip install -e .[test] tensorboard
+    - name: Run Tests
+      run: make test_examples
--- a/.gitignore
+++ b/.gitignore
@ -132,4 +132,7 @@ dmypy.json
 .vscode

 # IntelliJ
-.idea
+.idea
+
+# Mac .DS_Store
+.DS_Store
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@ -54,7 +54,7 @@ Did not find it? :( So we can act quickly on it, please follow these steps:
 * Include your **OS type and version**, the versions of **Python** and **PyTorch**.
 * A short, self-contained, code snippet that allows us to reproduce the bug in
  less than 30s;
-* Provide the with your Accelerate configuration (located by default in `~/.cache/huggingface/accelerate/default_congig.yml`)
+* Provide us with your Accelerate configuration (located by default in `~/.cache/huggingface/accelerate/default_config.yaml`)

 ### Do you want a new feature?

--- a/Makefile
+++ b/Makefile
@ -25,4 +25,7 @@ style:
 	
 # Run tests for the library
 test:
-	python -m pytest -n auto --dist=loadfile -s -v ./tests/
+	python -m pytest -n auto --dist=loadfile -s -v ./tests/ --ignore=./tests/test_examples.py
+
+test_examples:
+	python -m pytest -n auto --dist=loadfile -s -v ./tests/test_examples.py
--- a/README.md
+++ b/README.md
@ -168,7 +168,7 @@ mpirun -np 2 python examples/nlp_example.py

 ## Launching training using DeepSpeed

-🤗 Accelerate supports training on single/multiple GPUs using DeepSpeed. to use it, you don't need to change anything in your training code; you can set everything using just `accelerate config`. However, if you desire to tweak your DeepSpeed related args from your python script, we provide you the `DeepSpeedPlugin`.
+🤗 Accelerate supports training on single/multiple GPUs using DeepSpeed. To use it, you don't need to change anything in your training code; you can set everything using just `accelerate config`. However, if you desire to tweak your DeepSpeed related args from your python script, we provide you the `DeepSpeedPlugin`.

 ```python
 from accelerate import Accelerator, DeepSpeedPlugin
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@ -7,6 +7,8 @@
    title: Installation
  title: Get started
 - sections:
+  - local: big_modeling
+    title: Handling big models
  - local: sagemaker
    title: Amazon SageMaker
  title: Guides
@ -19,4 +21,12 @@
    title: Kwargs Handlers
  - local: internal
    title: Internals
+  - local: checkpoint
+    title: Checkpointing
+  - local: tracking
+    title: Experiment Tracking
+  - local: fsdp
+    title: Fully Sharded Data Parallel
+  - local: memory
+    title: Memory Utilities
  title: API Reference
--- a/docs/source/big_modeling.mdx
+++ b/docs/source/big_modeling.mdx
@ -0,0 +1,232 @@
+<!--Copyright 2022 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Handling big models
+
+When loading a pretrained model in PyTorch, the usual workflow looks like this:
+
+```py
+import torch
+
+my_model = ModelClass(...)
+state_dict = torch.load(checkpoint_file)
+my_model.load_state_dict(state_dict)
+```
+
+In plain English, those steps are:
+1. Create the model with randomly initialized weights
+2. Load the model weights (in a dictionary usually called a state dict) from the disk
+3. Load those weights inside the model
+
+While this works very well for regularly sized models, this workflow has some clear limitations when we deal with a huge model: in step 1, we load a full version of the model in RAM, and spend some time randomly initializing the weights (which will be discarded in step 3). In step 2, we load another full version of the model in RAM, with the pretrained weights. If you're loading a model with 6 billion parameters, you will need 24GB of RAM for each copy of the model in FP32, so 48GB in total (or half of that if you load the model in FP16).
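
The RAM figures above follow from the parameter count and the number of bytes per parameter. A minimal sketch, assuming 4 bytes per parameter in FP32, 2 bytes in FP16, and decimal gigabytes; the helper name `model_size_gb` is ours, for illustration only:

```python
def model_size_gb(num_params: int, bytes_per_param: int) -> float:
    """Size of one copy of the weights, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


num_params = 6_000_000_000

fp32_copy = model_size_gb(num_params, 4)  # one copy of the weights in FP32
fp16_copy = model_size_gb(num_params, 2)  # one copy of the weights in FP16

# The usual workflow holds two copies at once: the randomly initialized
# model (step 1) and the state dict loaded from disk (step 2).
peak_fp32 = 2 * fp32_copy

print(f"FP32 copy: {fp32_copy} GB, FP16 copy: {fp16_copy} GB, peak in FP32: {peak_fp32} GB")
```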
+
+<Tip warning={true}>
+
+This API is quite new and still in its experimental stage. While we strive to provide a stable API, it's possible some small parts of the public API will change in the future.
+
+</Tip>
+
+## Instantiating an empty model
+
+The first tool Accelerate introduces to help with big models is a context manager [`init_empty_weights`] that helps you initialize a model without using any RAM, so that step 1 can be done on models of any size. Here is how it works:
+
+```py
+from accelerate import init_empty_weights
+
+with init_empty_weights():
+    my_model = ModelClass(...)
+```
+
+For instance:
+
+```py
+with init_empty_weights():
+    model = nn.Sequential(*[nn.Linear(10000, 10000) for _ in range(1000)])
+```
+
+initializes an empty model with a bit more than 100B parameters. Behind the scenes, this relies on the meta device introduced in PyTorch 1.9. During the initialization under the context manager, each time a parameter is created, it is instantly moved to that device.
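
The "a bit more than 100B" figure can be checked without allocating anything. A small sketch in plain Python, relying on the fact that a linear layer with a bias holds an `in_features * out_features` weight matrix plus `out_features` bias terms; the helper name is hypothetical:

```python
def linear_param_count(in_features: int, out_features: int) -> int:
    """Parameters in one linear layer with a bias: weight matrix plus bias vector."""
    return in_features * out_features + out_features


# 1000 stacked layers of shape 10000 x 10000, as in the example above.
total = 1000 * linear_param_count(10_000, 10_000)
print(f"{total:,} parameters")
```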
+
+<Tip warning={true}>
+
+You can't move a model initialized like this to the CPU or another device directly, since it doesn't have any data. It's also very likely that a forward pass with that empty model will fail, as not all operations are supported on the meta device.
+
+</Tip>
+
+## Sharded checkpoints
+
+It's possible your model is so big that even a single copy won't fit in RAM. That doesn't mean it can't be loaded: if you have one or several GPUs, their memory is also available to store your model. In this case, it's better if your checkpoint is split into several smaller files that we call checkpoint shards.
+
+Accelerate will handle sharded checkpoints as long as they follow this format: your checkpoint should be in a folder, with several files containing the partial state dicts, and there should be an index in the JSON format that contains a dictionary mapping parameter names to the file containing their weights. For instance we could have a folder containing:
+
+```bash
+first_state_dict.bin
+index.json
+second_state_dict.bin
+```
+
+with index.json being the following file:
+
+```
+{
+  "linear1.weight": "first_state_dict.bin",
+  "linear1.bias": "first_state_dict.bin",
+  "linear2.weight": "second_state_dict.bin",
+  "linear2.bias": "second_state_dict.bin"
+}
+```
+
+and `first_state_dict.bin` containing the weights for `"linear1.weight"` and `"linear1.bias"`, `second_state_dict.bin` the ones for `"linear2.weight"` and `"linear2.bias"`.
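
To make the index format concrete, here is a sketch that parses such an index and groups parameter names by shard file, which is roughly the information a loader needs in order to open each shard only once. The function name `group_by_shard` is hypothetical and not part of Accelerate:

```python
import json
from collections import defaultdict

# Same content as the index.json shown above.
INDEX_JSON = """
{
  "linear1.weight": "first_state_dict.bin",
  "linear1.bias": "first_state_dict.bin",
  "linear2.weight": "second_state_dict.bin",
  "linear2.bias": "second_state_dict.bin"
}
"""


def group_by_shard(index: dict) -> dict:
    """Invert a {parameter name: shard file} index into {shard file: [parameter names]}."""
    shards = defaultdict(list)
    for param_name, shard_file in index.items():
        shards[shard_file].append(param_name)
    return dict(shards)


index = json.loads(INDEX_JSON)
shards = group_by_shard(index)
for shard_file, param_names in shards.items():
    print(shard_file, "->", param_names)
```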
+
+## Loading weights
+
+The second tool Accelerate introduces is a function [`load_checkpoint_and_dispatch`], that will allow you to load a checkpoint inside your empty model. This supports full checkpoints (a single file containing the whole state dict) as well as sharded checkpoints. It will also automatically dispatch those weights across the devices you have available (GPUs, CPU RAM), so if you are loading a sharded checkpoint, the maximum RAM usage will be the size of the biggest shard.
+
+Here is how we can use this to load the [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B) model. You can clone the sharded version of this model with:
+
+```bash
+git clone https://huggingface.co/sgugger/sharded-gpt-j-6B
+cd sharded-gpt-j-6B
+git-lfs install
+git pull
+```
+
+then we can initialize the model with
+
+```py
+from accelerate import init_empty_weights
+from transformers import AutoConfig, AutoModelForCausalLM
+
+checkpoint = "EleutherAI/gpt-j-6B"
+config = AutoConfig.from_pretrained(checkpoint)
+
+with init_empty_weights():
+    model = AutoModelForCausalLM.from_config(config)
+```
+
+and load the checkpoint we just downloaded with:
+
+```py
+from accelerate import load_checkpoint_and_dispatch
+
+model = load_checkpoint_and_dispatch(
+    model, "sharded-gpt-j-6B", device_map="auto", no_split_module_classes=["GPTJBlock"]
+)
+```
+
+By passing `device_map="auto"`, we tell Accelerate to determine automatically where to put each layer of the model depending on the available resources:
+- first we use the maximum space available on the GPU(s)
+- if we still need space, we store the remaining weights on the CPU
+- if there is not enough RAM, we store the remaining weights on the hard drive as memory-mapped tensors
+
+`no_split_module_classes=["GPTJBlock"]` indicates that the modules that are `GPTJBlock` should not be split on different devices. You should set here all blocks that include a residual connection of some kind.
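
The GPU, then CPU, then disk order described above can be pictured with a toy greedy assignment. This is only a sketch under simplifying assumptions (each entry is an unsplittable block with a known size, devices are filled strictly in order, and the last device is treated as unlimited); it is not how Accelerate actually computes its device map, and `toy_device_map` is a made-up name:

```python
def toy_device_map(layer_sizes: dict, capacities: dict) -> dict:
    """Assign layers, in model order, to devices filled one after the other.

    layer_sizes: layer name -> size (insertion order is the model order).
    capacities: device -> capacity, ordered from fastest to slowest; the last
    device is treated as unlimited (think "disk").
    """
    devices = list(capacities)
    remaining = dict(capacities)
    current = 0
    device_map = {}
    for name, size in layer_sizes.items():
        # Move on to the next device once the current one is full; never go back.
        while current < len(devices) - 1 and size > remaining[devices[current]]:
            current += 1
        device = devices[current]
        remaining[device] -= size
        device_map[name] = device
    return device_map


# Sizes and capacities are arbitrary units chosen for the illustration.
layers = {"embed": 4, "block.0": 3, "block.1": 3, "block.2": 3, "head": 4}
device_map = toy_device_map(layers, {0: 8, "cpu": 6, "disk": 0})
print(device_map)
```

Because the assignment never goes back to an earlier device, a single oversized early layer pushes everything after it onto the slower devices.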
+
+You can see the `device_map` that Accelerate picked by accessing the `hf_device_map` attribute of your model:
+
+```py
+model.hf_device_map
+```
+
+```python out
+{'transformer.wte': 0,
+ 'transformer.drop': 0,
+ 'transformer.h.0': 0,
+ 'transformer.h.1': 0,
+ 'transformer.h.2': 0,
+ 'transformer.h.3': 0,
+ 'transformer.h.4': 0,
+ 'transformer.h.5': 0,
+ 'transformer.h.6': 0,
+ 'transformer.h.7': 0,
+ 'transformer.h.8': 0,
+ 'transformer.h.9': 0,
+ 'transformer.h.10': 0,
+ 'transformer.h.11': 0,
+ 'transformer.h.12': 0,
+ 'transformer.h.13': 0,
+ 'transformer.h.14': 0,
+ 'transformer.h.15': 0,
+ 'transformer.h.16': 0,
+ 'transformer.h.17': 0,
+ 'transformer.h.18': 0,
+ 'transformer.h.19': 0,
+ 'transformer.h.20': 0,
+ 'transformer.h.21': 0,
+ 'transformer.h.22': 0,
+ 'transformer.h.23': 0,
+ 'transformer.h.24': 1,
+ 'transformer.h.25': 1,
+ 'transformer.h.26': 1,
+ 'transformer.h.27': 1,
+ 'transformer.ln_f': 1,
+ 'lm_head': 1}
+ ```
+
+You can also design your `device_map` yourself, if you prefer to explicitly decide where each layer should be. In this case, the command above becomes:
+
+```py
+model = load_checkpoint_and_dispatch(model, "sharded-gpt-j-6B", device_map=my_device_map)
+```
+
+## Run the model
+
+Now that we have done this, our model lies across several devices, and maybe the hard drive. But it can still be used as a regular PyTorch model:
+
+```py
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+inputs = tokenizer("Hello, my name is", return_tensors="pt")
+inputs = inputs.to(0)
+output = model.generate(inputs["input_ids"])
+tokenizer.decode(output[0].tolist())
+```
+
+Behind the scenes, Accelerate added hooks to the model, so that:
+- at each layer, the inputs are put on the right device (so even if your model is spread across several GPUs, it works)
+- for the weights offloaded on the CPU, they are put on a GPU just before the forward pass, and cleaned up just after
+- for the weights offloaded on the hard drive, they are loaded in RAM then put on a GPU just before the forward pass, and cleaned up just after
+
+This way, your model can run for inference even if it doesn't fit on one of the GPUs or the CPU RAM!
+
+<Tip warning={true}>
+
+This only supports inference of your model, not training. Most of the computation happens behind `torch.no_grad()` context managers to avoid spending some GPU memory with intermediate activations.
+
+</Tip>
+
+## Limits and further development
+
+We are aware of the current limitations in the API:
+
+- While this could theoretically work on just one CPU with potential disk offload, you need at least one GPU to run this API. This will be fixed in further development.
+- [`infer_auto_device_map`] (or `device_map="auto"` in [`load_checkpoint_and_dispatch`]) tries to maximize GPU and CPU RAM it sees available when you execute it. While PyTorch is very good at managing GPU RAM efficiently (and giving it back when not needed), it's not entirely true with Python and CPU RAM. Therefore, an automatically computed device map might be too intense on the CPU. Move a few modules to the disk device if you get crashes due to lack of RAM.
+- [`infer_auto_device_map`] (or `device_map="auto"` in [`load_checkpoint_and_dispatch`]) assigns devices sequentially (to avoid moving things back and forth), so if your first layer is bigger than the size of the GPU you have, it will end up with everything on the CPU/Disk.
+- [`load_checkpoint_and_dispatch`] and [`load_checkpoint_in_model`] do not perform any check on the correctness of your state dict compared to your model at the moment (this will be fixed in a future version), so you may get some weird errors if trying to load a checkpoint with mismatched or missing keys.
+- The model parallelism used when your model is split on several GPUs is naive and not optimized, meaning that only one GPU works at a given time while the others sit idle.
+- When weights are offloaded on the CPU/hard drive, there is no pre-fetching (yet; we will work on this for future versions), which means the weights are put on the GPU when they are needed and not before.
+- Hard-drive offloading might be very slow if the hardware you run on does not have fast communication between disk and CPU (as NVMe drives do).
+
+## API doc
+
+[[autodoc]] cpu_offload
+
+[[autodoc]] disk_offload
+
+[[autodoc]] dispatch_model
+
+[[autodoc]] infer_auto_device_map
+
+[[autodoc]] init_empty_weights
+
+[[autodoc]] load_checkpoint_and_dispatch
+
+[[autodoc]] load_checkpoint_in_model
--- a/docs/source/checkpoint.mdx
+++ b/docs/source/checkpoint.mdx
@ -0,0 +1,60 @@
+<!--Copyright 2022 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Checkpointing
+
+When training a PyTorch model with Accelerate, you may often want to save and continue a state of training. Doing so requires
+saving and loading the model, optimizer, random number generators, and the GradScaler. Accelerate provides two convenience functions to achieve this quickly:
+- Use [`~Accelerator.save_state`] for saving everything mentioned above to a folder location
+- Use [`~Accelerator.load_state`] for loading everything stored from an earlier `save_state`
+
+Note that those states are expected to come from the same training script; they should not come from two separate scripts.
+
+- By using [`~Accelerator.register_for_checkpointing`], you can register custom objects to be automatically stored or loaded from the two prior functions,
+so long as the object has both a `state_dict` **and** a `load_state_dict` method. This could include objects such as a learning rate scheduler.
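
As an illustration of that requirement, here is a sketch of a custom object exposing both methods. The `EpochCounter` class is hypothetical; with a real `Accelerator` instance, such an object would be passed to `register_for_checkpointing`:

```python
class EpochCounter:
    """Hypothetical stateful object exposing the two methods required for registration."""

    def __init__(self):
        self.epoch = 0

    def step(self):
        self.epoch += 1

    def state_dict(self):
        # Everything needed to restore the object later.
        return {"epoch": self.epoch}

    def load_state_dict(self, state):
        self.epoch = state["epoch"]


# Round trip: save the state of one instance and restore it into a fresh one.
counter = EpochCounter()
counter.step()
counter.step()
saved = counter.state_dict()

restored = EpochCounter()
restored.load_state_dict(saved)
print(restored.epoch)
```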
+
+Below is a brief example using checkpointing to save and reload a state during training:
+
+```python
+from accelerate import Accelerator
+import torch
+
+accelerator = Accelerator()
+
+my_scheduler = torch.optim.lr_scheduler.StepLR(my_optimizer, step_size=1, gamma=0.99)
+my_model, my_optimizer, my_training_dataloader = accelerator.prepare(my_model, my_optimizer, my_training_dataloader)
+
+# Register the LR scheduler
+accelerator.register_for_checkpointing(my_scheduler)
+
+# Save the starting state
+accelerator.save_state("my/save/path")
+
+device = accelerator.device
+my_model.to(device)
+
+# Perform training
+for epoch in range(num_epochs):
+    for batch in my_training_dataloader:
+        my_optimizer.zero_grad()
+        inputs, targets = batch
+        inputs = inputs.to(device)
+        targets = targets.to(device)
+        outputs = my_model(inputs)
+        loss = my_loss_function(outputs, targets)
+        accelerator.backward(loss)
+        my_optimizer.step()
+    my_scheduler.step()
+
+# Restore previous state
+accelerator.load_state("my/save/path")
+```
--- a/docs/source/fsdp.mdx
+++ b/docs/source/fsdp.mdx
@ -0,0 +1,120 @@
+<!--Copyright 2022 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Fully Sharded Data Parallel
+
+To accelerate training huge models on larger batch sizes, we can use a fully sharded data parallel model.
+This type of data parallel paradigm enables fitting more data and larger models by sharding the optimizer states, gradients and parameters.
+To read more about it and the benefits, check out the [Fully Sharded Data Parallel blog](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/).
+We have integrated the latest PyTorch Fully Sharded Data Parallel (FSDP) training feature.
+All you need to do is enable it through the config.
+
+## How it works out of the box
+
+On your machine(s) just run:
+
+```bash
+accelerate config
+```
+
+and answer the questions asked. This will generate a config file that will be used automatically to properly set the
+default options when doing
+
+```bash
+accelerate launch my_script.py --args_to_my_script
+```
+
+For instance, here is how you would run the NLP example (from the root of the repo) with FSDP enabled:
+
+```yaml
+compute_environment: LOCAL_MACHINE
+deepspeed_config: {}
+distributed_type: FSDP
+fsdp_config:
+  min_num_params: 2000
+  offload_params: false
+  sharding_strategy: 1
+machine_rank: 0
+main_process_ip: null
+main_process_port: null
+main_training_function: main
+mixed_precision: 'no'
+num_machines: 1
+num_processes: 2
+use_cpu: false
+```
+
+```bash
+accelerate launch examples/nlp_example.py
+```
+
+Currently, `Accelerate` supports the following config options through the CLI:
+
+```bash
+`Sharding Strategy`: [1] FULL_SHARD, [2] SHARD_GRAD_OP
+`Min Num Params`: FSDP's minimum number of parameters for Default Auto Wrapping.
+`Offload Params`: Decides whether to offload parameters and gradients to CPU.
+```
+
+## A few caveats to be aware of
+
+- PyTorch FSDP auto wraps sub-modules, flattens the parameters and shards the parameters in place.
+  Due to this, any optimizer created before model wrapping gets broken and occupies more memory.
+  Hence, it is highly recommended and efficient to prepare the model before creating the optimizer.
+  `Accelerate` will automatically wrap the model and create an optimizer for you in the case of a single model, with a warning message.
+  > FSDP Warning: When using FSDP, it is efficient and recommended to call prepare for the model before creating the optimizer
+
+However, below is the recommended way to prepare model and optimizer while using FSDP:
+
+```diff
+  model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
++ model = accelerator.prepare(model)
+
+  optimizer = torch.optim.AdamW(params=model.parameters(), lr=lr)
+
+- model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(model,
+-        optimizer, train_dataloader, eval_dataloader, lr_scheduler
+-    )
+
++ optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
++         optimizer, train_dataloader, eval_dataloader, lr_scheduler
++        )
+
+```
+
+- In case of a single model, if you have created optimizer with multiple parameter groups and called prepare with them together,
+  then the parameter groups will be lost and the following warning is displayed:
+  > FSDP Warning: When using FSDP, several parameter groups will be conflated into
+  > a single one due to nested module wrapping and parameter flattening.
+  
+  This is because parameter groups created before wrapping will have no meaning post wrapping due to parameter flattening of nested FSDP modules into 1D arrays (which can consume many layers).
+  For instance, below are the named parameters of an FSDP model on GPU 0 (when using 2 GPUs, there are around 55M (110M/2) params in 1D arrays, as this GPU holds the first shard of the parameters).
+  Here, if one has applied no weight decay for [bias, LayerNorm.weight] named parameters of unwrapped BERT model, 
+  it can't be applied to the below FSDP wrapped model as there are no named parameters with either of those strings and 
+  the parameters of those layers are concatenated with parameters of various other layers.
+  ```
+  {
+    '_fsdp_wrapped_module.flat_param': torch.Size([494209]), 
+    '_fsdp_wrapped_module._fpw_module.bert.embeddings.word_embeddings._fsdp_wrapped_module.flat_param': torch.Size([11720448]), 
+    '_fsdp_wrapped_module._fpw_module.bert.encoder._fsdp_wrapped_module.flat_param': torch.Size([42527232])
+  }
+  ```
+
+
+- In the case of multiple models, it is necessary to prepare the models before creating the optimizers; otherwise an error is thrown.
+- Mixed precision is currently not supported with FSDP.
+
+For more control, users can leverage the `FullyShardedDataParallelPlugin` wherein they can specify `auto_wrap_policy`, `backward_prefetch` and `ignored_modules`.
+After creating an instance of this class, users can pass it to the Accelerator class instantiation.
+For more information on these options, please refer to the PyTorch [FullyShardedDataParallel](https://github.com/pytorch/pytorch/blob/0df2e863fbd5993a7b9e652910792bd21a516ff3/torch/distributed/fsdp/fully_sharded_data_parallel.py#L236) code.
+
+[[autodoc]] utils.FullyShardedDataParallelPlugin
--- a/docs/source/index.mdx
+++ b/docs/source/index.mdx
@ -52,7 +52,7 @@ Changing it to work with accelerate is really easy and only adds a few lines of
 + device = accelerator.device
  my_model.to(device)
  # Pass every important object (model, optimizer, dataloader) to *accelerator.prepare*
-+ my_model, my_optimizer, my_training_dataloader = accelerate.prepare(
++ my_model, my_optimizer, my_training_dataloader = accelerator.prepare(
 +     my_model, my_optimizer, my_training_dataloader
 + )

--- a/docs/source/internal.mdx
+++ b/docs/source/internal.mdx
@ -34,6 +34,10 @@ The main work on your PyTorch `DataLoader` is done by the following function:

 [[autodoc]] data_loader.IterableDatasetShard

+## Scheduler
+
+[[autodoc]] scheduler.AcceleratedScheduler
+
 ## Distributed Config

 ### AcceleratorState
@ -44,6 +48,10 @@ The main work on your PyTorch `DataLoader` is done by the following function:

 [[autodoc]] state.DistributedType

+## Tracking
+
+[[autodoc]] tracking.GeneralTracker
+
 ## Utilities

 [[autodoc]] utils.extract_model_from_parallel
@ -59,3 +67,5 @@ The main work on your PyTorch `DataLoader` is done by the following function:
 [[autodoc]] utils.synchronize_rng_states

 [[autodoc]] utils.wait_for_everyone
+
+[[autodoc]] utils.write_basic_config
--- a/docs/source/memory.mdx
+++ b/docs/source/memory.mdx
@ -0,0 +1,51 @@
+<!--Copyright 2022 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Memory Utilities
+
+One of the most frustrating errors when it comes to running training scripts is hitting "CUDA Out-of-Memory", 
+as the entire script needs to be restarted, progress is lost, and typically a developer would want to simply
+start their script and let it run.
+
+`Accelerate` provides a utility heavily based on [toma](https://github.com/BlackHC/toma) to give this capability.
+
+## find_executable_batch_size
+
+This algorithm operates with exponential decay, halving the batch size after each failed run of some
+training script. To use it, restructure your training function to include an inner function decorated with this wrapper,
+and build your dataloaders inside it. At a minimum, this could look like 4 new lines of code.
+> Note: The inner function *must* take in the batch size as the first parameter, but we do not pass one to it when called. The wrapper handles this for us.
+
+```diff
+def training_function(args):
+    accelerator = Accelerator()
+    model = get_model()
+    model.to(accelerator.device)
+    optimizer = get_optimizer()
+
++   @find_executable_batch_size(starting_batch_size=args.batch_size)
++   def inner_training_loop(batch_size):
++       nonlocal model, optimizer # Ensure they can be used in our context
+        train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
+        lr_scheduler = get_scheduler(
+            optimizer, 
+            num_training_steps=len(train_dataloader)*num_epochs
+        )
+        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
+            model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
+        )
+        train(model, optimizer, train_dataloader, lr_scheduler)
+        validate(model, eval_dataloader)
++   inner_training_loop()
+```
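
The halving behaviour, and the way the wrapper injects the batch size as the first argument, can be pictured with a toy decorator. This is a sketch only: it is not Accelerate's implementation, the name `toy_find_executable_batch_size` is made up, and a plain `MemoryError` stands in for a CUDA out-of-memory error:

```python
import functools


def toy_find_executable_batch_size(starting_batch_size):
    """Toy stand-in: retry the wrapped function with half the batch size on MemoryError."""

    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            batch_size = starting_batch_size
            while batch_size > 0:
                try:
                    # The batch size is injected as the first argument,
                    # so the caller does not pass it.
                    return function(batch_size, *args, **kwargs)
                except MemoryError:
                    batch_size //= 2
            raise RuntimeError("No executable batch size found, reached zero.")

        return wrapper

    return decorator


attempts = []


@toy_find_executable_batch_size(starting_batch_size=128)
def train(batch_size):
    attempts.append(batch_size)
    if batch_size > 16:  # pretend anything above 16 does not fit in memory
        raise MemoryError
    return batch_size


result = train()  # called without a batch size, as in the note above
print(result, attempts)
```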
+
+[[autodoc]] utils.find_executable_batch_size
--- a/docs/source/quicktour.mdx
+++ b/docs/source/quicktour.mdx
@ -45,11 +45,13 @@ model on `accelerator.device` or your training will fail on TPU.

 </Tip>

-3. Pass all objects relevant to training (optimizer, model, training dataloader) to the
+3. Pass all objects relevant to training (optimizer, model, training dataloader, learning rate scheduler) to the
 [`~Accelerator.prepare`] method. This will make sure everything is ready for training.

 ```python
-model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)
+model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
+    model, optimizer, train_dataloader, lr_scheduler
+)
 ```

 In particular, your training dataloader will be sharded accross all GPUs/TPU cores available so that each one sees a
@ -74,14 +76,21 @@ training loop.

 <Tip warning={true}>

+You should only pass the learning rate scheduler to [`~Accelerator.prepare`] when the scheduler needs to be stepped
+at each optimizer step.
+
+</Tip>
+
+<Tip warning={true}>
+
 Your training dataloader may change length when going through this method: if you run on X GPUs, it will have its
 length divided by X (since your actual batch size will be multiplied by X), unless you set
 `split_batches=True`.

 </Tip>

-Any instruction using your training dataloader length (for instance if you need the number of total training steps
-to create a learning rate scheduler) should go after the call to [`~Accelerator.prepare`].
+Any instruction using your training dataloader length (for instance if you want to log the number of total training
+steps) should go after the call to [`~Accelerator.prepare`].

 You can perfectly send your dataloader to [`~Accelerator.prepare`] on its own, but it's best to send the
 model and optimizer to [`~Accelerator.prepare`] together.
@ -340,6 +349,16 @@ unwrapped_model.load_state_dict(torch.load(filename))

 Note that since all the model parameters are references to tensors, this will load your weights inside `model`.

+## Saving/loading entire states
+
+When training your model, you may want to save the current state of the model, optimizer, random generators, and potentially LR schedulers to be restored in the _same script_.
+You can use `accelerator.save_state` and `accelerator.load_state` respectively to do so, simply by passing in a save location.
+If you have registered any other stateful items to be stored through `accelerator.register_for_checkpointing`, they will also be saved and/or loaded.
+<Tip>
+    Every object passed to `register_for_checkpointing` must have a `load_state_dict` and a `state_dict` method to be stored.
+</Tip>
+
+
 ### Gradient clipping

 If you are using gradient clipping in your script, you should replace the calls to
@ -379,10 +398,6 @@ DeepSpeed support is experimental, so the underlying API will evolve in the near
 breaking changes. In particular, 🤗 Accelerate does not support DeepSpeed config you have written yourself yet, this
 will be added in a next version.

-One main caveat for the DeepSpeed integration is that the DeepSpeed launcher always passes a `local_rank` variable to
-the training script, so your training script should accept it (whether you launch training with the DeepSpeed launcher
-or `accelerate launch`).
-
 <Tip warning={true}>

 The [`notebook_launcher`] does not support the DeepSpeed integration yet.
--- a/docs/source/tracking.mdx
+++ b/docs/source/tracking.mdx
@ -0,0 +1,163 @@
+<!--Copyright 2022 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Tracking
+
+There are a large number of experiment tracking APIs available; however, getting them all to work in a multi-processing environment can oftentimes be complex.
+Accelerate provides a general tracking API that can be used to log useful items during your script through [`~Accelerator.log`].
+
+## Integrated Trackers
+
+Currently `Accelerate` supports three trackers out-of-the-box:
+
+
+[[autodoc]] tracking.TensorBoardTracker
+
+[[autodoc]] tracking.WandBTracker
+
+[[autodoc]] tracking.CometMLTracker
+
+To use any of them, pass in the selected type(s) to the `log_with` parameter in [`Accelerator`]:
+```python
+from accelerate import Accelerator
+from accelerate.utils import LoggerType
+
+accelerator = Accelerator(log_with="all")  # For all available trackers in the environment
+accelerator = Accelerator(log_with="wandb")
+accelerator = Accelerator(log_with=["wandb", LoggerType.TENSORBOARD])
+```
+
+At the start of your experiment, [`~Accelerator.init_trackers`] should be used to set up your project, and potentially add any experiment hyperparameters to be logged:
+```python
+hps = {"num_iterations": 5, "learning_rate": 1e-2}
+accelerator.init_trackers("my_project", config=hps)
+```
+
+When you are ready to log any data, [`~Accelerator.log`] should be used.
+A `step` can also be passed in to correlate the data with a particular step in the training loop.
+```python
+accelerator.log({"train_loss": 1.12, "valid_loss": 0.8}, step=1)
+```
+
+Once you've finished training, make sure to run [`~Accelerator.end_training`] so that all the trackers can run their finishing routines, if they have any.
+```python
+accelerator.end_training()
+```
+
+
+A full example is below:
+```python
+from accelerate import Accelerator
+
+accelerator = Accelerator(log_with="all")
+config = {
+    "num_iterations": 5,
+    "learning_rate": 1e-2,
+    "loss_function": str(my_loss_function),
+}
+
+accelerator.init_trackers("example_project", config=config)
+
+my_model, my_optimizer, my_training_dataloader = accelerator.prepare(my_model, my_optimizer, my_training_dataloader)
+device = accelerator.device
+my_model.to(device)
+
+for iteration in range(config["num_iterations"]):
+    for step, batch in enumerate(my_training_dataloader):
+        my_optimizer.zero_grad()
+        inputs, targets = batch
+        inputs = inputs.to(device)
+        targets = targets.to(device)
+        outputs = my_model(inputs)
+        loss = my_loss_function(outputs, targets)
+        accelerator.backward(loss)
+        my_optimizer.step()
+        accelerator.log({"training_loss": loss}, step=step)
+accelerator.end_training()
+```
+
+
+## Implementing Custom Trackers
+
+To implement a new tracker for use with `Accelerator`, subclass the [`~GeneralTracker`] class.
+Every tracker must implement three functions:
+  - `__init__`: 
+    - Should store a `run_name` and initialize the tracker API of the integrated library. 
+    - If a tracker stores its data locally (such as TensorBoard), a `logging_dir` parameter can be added.
+  - `store_init_configuration`: 
+    - Should take in a `values` dictionary and store them as a one-time experiment configuration
+  - `log`: 
+    - Should take in a `values` dictionary and a `step`, and should log them to the run
+
+A brief example can be seen below with an integration with Weights and Biases, containing only the relevant information:
+```python
+from accelerate.tracking import GeneralTracker
+from typing import Optional
+
+import wandb
+
+
+class MyCustomTracker(GeneralTracker):
+    def __init__(self, run_name: str):
+        self.run_name = run_name
+        # Pass the name by keyword so it is not bound to an unrelated positional argument
+        wandb.init(project=self.run_name)
+
+    def store_init_configuration(self, values: dict):
+        wandb.config.update(values)
+
+    def log(self, values: dict, step: Optional[int] = None):
+        wandb.log(values, step=step)
+```
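
The three-method contract described above can also be exercised without any tracking backend. What follows is only a minimal sketch: `InMemoryTracker` is a hypothetical name, it deliberately does not subclass `GeneralTracker` so that it runs without Accelerate installed, and it simply keeps whatever it is given in plain Python containers.

```python
from typing import Optional


class InMemoryTracker:
    """Hypothetical stand-in mirroring the three methods a tracker must implement."""

    def __init__(self, run_name: str):
        # Store the run name; a real tracker would also initialize its backend here
        self.run_name = run_name
        self.config = {}
        self.records = []

    def store_init_configuration(self, values: dict):
        # One-time experiment configuration (hyperparameters and similar)
        self.config.update(values)

    def log(self, values: dict, step: Optional[int] = None):
        # Keep a copy of each logged dictionary together with its step
        self.records.append((step, dict(values)))


tracker = InMemoryTracker("some_run_name")
tracker.store_init_configuration({"num_iterations": 5, "learning_rate": 1e-2})
tracker.log({"train_loss": 1.12, "valid_loss": 0.8}, step=1)
```

A class shaped like this can be handy for checking which values a training loop logs before wiring up a real backend.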
+
+When you are ready to build your `Accelerator` object, pass in an **instance** of your tracker to [`~Accelerator.log_with`] to have it automatically
+be used with the API:
+
+```python
+tracker = MyCustomTracker("some_run_name")
+accelerator = Accelerator(log_with=tracker)
+```
+
+These can also be mixed with existing trackers, including with `"all"`:
+
+```python
+tracker = MyCustomTracker("some_run_name")
+accelerator = Accelerator(log_with=[tracker, "all"])
+```
+
+## When a wrapper cannot work
+
+If a library's API does not follow a strict `.log` call that takes a single dictionary (Neptune.AI, for example), logging can be done manually under an `if accelerator.is_main_process` statement:
+```diff
+from accelerate import Accelerator
+ import neptune.new as neptune
+
+accelerator = Accelerator()
+ run = neptune.init(...)
+
+my_model, my_optimizer, my_training_dataloader = accelerator.prepare(my_model, my_optimizer, my_training_dataloader)
+device = accelerator.device
+my_model.to(device)
+
+for iteration in config["num_iterations"]:
+    for batch in my_training_dataloader:
+        my_optimizer.zero_grad()
+        inputs, targets = batch
+        inputs = inputs.to(device)
+        targets = targets.to(device)
+        outputs = my_model(inputs)
+        loss = my_loss_function(outputs, targets)
+        total_loss += loss
+        accelerator.backward(loss)
+        my_optimizer.step()
+       if accelerator.is_main_process:
+           run["logs/training/batch/loss"].log(loss)
+```
--- a/examples/README.md
+++ b/examples/README.md
@ -183,3 +183,23 @@ To run it in each of these various modes, use the following commands:
        ```
    * In PyTorch:
        Add an `xmp.spawn` line in your script as you usually do.
+
+## Finer Examples
+
+While the first two scripts are extremely barebones when it comes to what you can do with accelerate, more advanced features are documented in two other locations.
+
+### `by_feature` examples
+
+These scripts are *individual* examples highlighting one particular feature or use-case within Accelerate. They all stem from the [nlp_example.py](./nlp_example.py) script, and any changes or modifications are denoted with a `# New Code #` comment.
+
+Read the README.md file located in the `by_feature` folder for more information.
+
+### `complete_*` examples
+
+These two scripts contain *every* single feature currently available in Accelerate in one place, as one giant script.
+
+New arguments that can be passed include:
+
+- `checkpointing_steps`, whether the various states should be saved at the end of every `n` steps, or `"epoch"` for each epoch. States are then saved to folders named `step_{n}` or `epoch_{n}`
+- `resume_from_checkpoint`, should be used if you want to resume training from a previous run of the script that was passed a `checkpointing_steps` value.
+- `with_tracking`, should be used if you want to log the training run using all available experiment trackers in your environment. Currently supported trackers include TensorBoard, Weights and Biases, and CometML.
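
The `checkpointing_steps` argument arrives as a string and is read either as the literal `"epoch"` or as a step count. As a rough sketch of that interpretation (the helper name `parse_checkpointing_steps` is hypothetical and is not part of the example scripts), the logic could look like this:

```python
from typing import Optional, Union


def parse_checkpointing_steps(value: Optional[str]) -> Optional[Union[int, str]]:
    """Hypothetical helper: turn the raw CLI string into None, "epoch", or an int."""
    if value is None:
        # No checkpointing requested
        return None
    if value == "epoch":
        # Save once at the end of every epoch
        return "epoch"
    if value.isdigit():
        # Save every `n` steps
        return int(value)
    raise ValueError(
        f"`checkpointing_steps` must be either a number or `epoch`. `{value}` passed."
    )


every_epoch = parse_checkpointing_steps("epoch")
every_50 = parse_checkpointing_steps("50")
disabled = parse_checkpointing_steps(None)
```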
--- a/examples/by_feature/README.md
+++ b/examples/by_feature/README.md
@ -0,0 +1,68 @@
+# What are these scripts?
+
+All scripts in this folder originate from the `nlp_example.py` file, as it is a very simplistic NLP training example using Accelerate with zero extra features.
+
+From there, each further script adds in just **one** feature of Accelerate, showing how you can quickly modify your own scripts to implement these capabilities.
+
+A full example with all of these parts integrated together can be found in the `complete_nlp_example.py` script and `complete_cv_example.py` script.
+
+Adjustments to each script from the base `nlp_example.py` file can be found quickly by searching for "# New Code #"
+
+## Example Scripts by Feature and their Arguments
+
+### Base Example (`../nlp_example.py`)
+
+- Shows how to use `Accelerator` in an extremely simplistic PyTorch training loop
+- Arguments available:
+  - `mixed_precision`, whether to use mixed precision. ("no", "fp16", or "bf16")
+  - `cpu`, whether to train using only the CPU. (yes/no/1/0)
+
+All following scripts also accept these arguments in addition to their added ones.
+
+These arguments should be added at the end of any method for starting the python script (such as `python`, `accelerate launch`, `python -m torch.distributed.launch`), such as:
+
+```bash
+accelerate launch ../nlp_example.py --mixed_precision fp16 --cpu 0
+```
+
+### Checkpointing and Resuming Training (`checkpointing.py`)
+
+- Shows how to use `Accelerator.save_state` and `Accelerator.load_state` to save or continue training
+- **It is assumed you are continuing off the same training script**
+- Arguments available:
+  - `checkpointing_steps`, after how many steps the various states should be saved. ("epoch", 1, 2, ...)
+  - `output_dir`, where saved state folders should be saved to, default is current working directory
+  - `resume_from_checkpoint`, what checkpoint folder to resume from. ("epoch_0", "step_22", ...)
+
+These arguments should be added at the end of any method for starting the python script (such as `python`, `accelerate launch`, `python -m torch.distributed.launch`), such as:
+
+(Note, `resume_from_checkpoint` assumes that we've run the script for one epoch with the `--checkpointing_steps epoch` flag)
+
+```bash
+accelerate launch ./checkpointing.py --checkpointing_steps epoch --output_dir "checkpointing_tutorial" --resume_from_checkpoint "checkpointing_tutorial/epoch_0"
+```
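
Resuming depends on the checkpoint folder name: an `epoch_{n}` folder means training restarts at epoch `n + 1`, while a `step_{n}` folder is split into a starting epoch and the number of batches to skip inside it. The sketch below illustrates that arithmetic only; `resume_position` and `steps_per_epoch` are hypothetical names, with `steps_per_epoch` standing in for the length of the training dataloader.

```python
import os
from typing import Optional, Tuple


def resume_position(checkpoint_path: str, steps_per_epoch: int) -> Tuple[int, Optional[int]]:
    """Hypothetical helper: map a checkpoint folder to (starting_epoch, resume_step)."""
    name = os.path.basename(os.path.normpath(checkpoint_path))
    if name.startswith("epoch_"):
        # The saved epoch is complete, so continue with the next one
        return int(name[len("epoch_"):]) + 1, None
    if name.startswith("step_"):
        overall_step = int(name[len("step_"):])
        starting_epoch = overall_step // steps_per_epoch
        # Batches already consumed inside the starting epoch
        resume_step = overall_step - starting_epoch * steps_per_epoch
        return starting_epoch, resume_step
    raise ValueError(f"Unrecognized checkpoint folder name: {name}")


from_epoch = resume_position("checkpointing_tutorial/epoch_0", steps_per_epoch=100)
from_step = resume_position("checkpointing_tutorial/step_250", steps_per_epoch=100)
```

With 100 steps per epoch, `step_250` lands in the third epoch (index 2) with 50 batches left to skip.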
+
+### Experiment Tracking (`tracking.py`)
+
+- Shows how to use `Accelerator.init_trackers` and `Accelerator.log`
+- Can be used with Weights and Biases, TensorBoard, or CometML.
+- Arguments available:
+  - `with_tracking`, whether to load in all available experiment trackers from the environment.
+
+These arguments should be added at the end of any method for starting the python script (such as `python`, `accelerate launch`, `python -m torch.distributed.launch`), such as:
+
+```bash
+accelerate launch ./tracking.py --with_tracking
+```
+
+### Cross Validation (`cross_validation.py`)
+
+- Shows how to use `Accelerator.free_memory` and run cross validation efficiently with `datasets`.
+- Arguments available:
+  - `num_folds`, the number of folds the training dataset should be split into.
+
+These arguments should be added at the end of any method for starting the python script (such as `python`, `accelerate launch`, `python -m torch.distributed.launch`), such as:
+
+```bash
+accelerate launch ./cross_validation.py --num_folds 2
+```
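
At the end of the run, the script combines the folds by averaging each fold's test-set logits and then taking the argmax. The following is a dependency-free sketch of that averaging step on nested lists; the function name `average_fold_logits` and the toy numbers are made up for illustration, whereas the script itself works on `torch` tensors.

```python
from typing import List


def average_fold_logits(fold_logits: List[List[List[float]]]) -> List[int]:
    """Hypothetical helper: average logits over folds, then pick the best class."""
    num_folds = len(fold_logits)
    num_examples = len(fold_logits[0])
    predictions = []
    for i in range(num_examples):
        num_classes = len(fold_logits[0][i])
        # Mean logit per class across all folds for example `i`
        mean = [sum(fold[i][c] for fold in fold_logits) / num_folds for c in range(num_classes)]
        # Index of the largest averaged logit
        predictions.append(max(range(num_classes), key=mean.__getitem__))
    return predictions


# Two folds, two test examples, two classes (toy values)
fold_logits = [
    [[3.0, 1.0], [0.0, 1.0]],  # fold 0
    [[0.0, 1.5], [0.5, 2.0]],  # fold 1
]
preds = average_fold_logits(fold_logits)
```

In this toy case fold 1 alone would pick class 1 for the first example, but the averaged logits favor class 0.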
--- a/examples/by_feature/checkpointing.py
+++ b/examples/by_feature/checkpointing.py
@ -0,0 +1,304 @@
+# coding=utf-8
+# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import os
+
+import torch
+from torch.utils.data import DataLoader
+
+from accelerate import Accelerator, DistributedType
+from datasets import load_dataset, load_metric
+from transformers import (
+    AdamW,
+    AutoModelForSequenceClassification,
+    AutoTokenizer,
+    get_linear_schedule_with_warmup,
+    set_seed,
+)
+
+
+########################################################################
+# This is a fully working simple example to use Accelerate,
+# specifically showcasing the checkpointing capability,
+# and builds off the `nlp_example.py` script.
+#
+# This example trains a Bert base model on GLUE MRPC
+# in any of the following settings (with the same script):
+#   - single CPU or single GPU
+#   - multi GPUS (using PyTorch distributed mode)
+#   - (multi) TPUs
+#   - fp16 (mixed-precision) or fp32 (normal precision)
+#
+# To help focus on the differences in the code, building `DataLoaders`
+# was refactored into its own function.
+# New additions from the base script can be found quickly by
+# looking for the # New Code # tags
+#
+# To run it in each of these various modes, follow the instructions
+# in the readme for examples:
+# https://github.com/huggingface/accelerate/tree/main/examples
+#
+########################################################################
+
+MAX_GPU_BATCH_SIZE = 16
+EVAL_BATCH_SIZE = 32
+
+
+def get_dataloaders(accelerator: Accelerator, batch_size: int = 16):
+    """
+    Creates a set of `DataLoader`s for the `glue` dataset,
+    using "bert-base-cased" as the tokenizer.
+
+    Args:
+        accelerator (`Accelerator`):
+            An `Accelerator` object
+        batch_size (`int`, *optional*):
+            The batch size for the train and validation DataLoaders.
+    """
+    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
+    datasets = load_dataset("glue", "mrpc")
+
+    def tokenize_function(examples):
+        # max_length=None => use the model max length (it's actually the default)
+        outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
+        return outputs
+
+    # Apply the method we just defined to all the examples in all the splits of the dataset
+    tokenized_datasets = datasets.map(
+        tokenize_function,
+        batched=True,
+        remove_columns=["idx", "sentence1", "sentence2"],
+    )
+
+    # We also rename the 'label' column to 'labels' which is the expected name for labels by the models of the
+    # transformers library
+    tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
+
+    def collate_fn(examples):
+        # On TPU it's best to pad everything to the same length or training will be very slow.
+        if accelerator.distributed_type == DistributedType.TPU:
+            return tokenizer.pad(examples, padding="max_length", max_length=128, return_tensors="pt")
+        return tokenizer.pad(examples, padding="longest", return_tensors="pt")
+
+    # Instantiate dataloaders.
+    train_dataloader = DataLoader(
+        tokenized_datasets["train"], shuffle=True, collate_fn=collate_fn, batch_size=batch_size
+    )
+    eval_dataloader = DataLoader(
+        tokenized_datasets["validation"], shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
+    )
+
+    return train_dataloader, eval_dataloader
+
+
+# For testing only
+if os.environ.get("TESTING_MOCKED_DATALOADERS", None) == "1":
+    from accelerate.test_utils.training import mocked_dataloaders
+
+    get_dataloaders = mocked_dataloaders  # noqa: F811
+
+
+def training_function(config, args):
+    # Initialize accelerator
+    accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
+    # Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
+    lr = config["lr"]
+    num_epochs = int(config["num_epochs"])
+    correct_bias = config["correct_bias"]
+    seed = int(config["seed"])
+    batch_size = int(config["batch_size"])
+
+    # New Code #
+    # Parse out whether we are saving every epoch or after a certain number of batches
+    if hasattr(args.checkpointing_steps, "isdigit"):
+        if args.checkpointing_steps == "epoch":
+            checkpointing_steps = args.checkpointing_steps
+        elif args.checkpointing_steps.isdigit():
+            checkpointing_steps = int(args.checkpointing_steps)
+        else:
+            raise ValueError(
+                f"Argument `checkpointing_steps` must be either a number or `epoch`. `{args.checkpointing_steps}` passed."
+            )
+    else:
+        checkpointing_steps = None
+
+    set_seed(seed)
+
+    train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
+    metric = load_metric("glue", "mrpc")
+
+    # If the batch size is too big we use gradient accumulation
+    gradient_accumulation_steps = 1
+    if batch_size > MAX_GPU_BATCH_SIZE:
+        gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
+        batch_size = MAX_GPU_BATCH_SIZE
+
+    # Instantiate the model (we build the model here so that the seed also controls new weights initialization)
+    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
+
+    # We could avoid this line since the accelerator is set with `device_placement=True` (default value).
+    # Note that if you are placing tensors on devices manually, this line absolutely needs to be before the optimizer
+    # creation otherwise training will not work on TPU (`accelerate` will kindly throw an error to make us aware of that).
+    model = model.to(accelerator.device)
+
+    # Instantiate optimizer
+    optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
+
+    # Instantiate scheduler
+    lr_scheduler = get_linear_schedule_with_warmup(
+        optimizer=optimizer,
+        num_warmup_steps=100,
+        num_training_steps=(len(train_dataloader) * num_epochs) // gradient_accumulation_steps,
+    )
+
+    # Prepare everything
+    # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
+    # prepare method.
+    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
+        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
+    )
+
+    # New Code #
+    # We need to keep track of how many total steps we have iterated over
+    overall_step = 0
+    # We also need to keep track of the starting epoch so files are named properly
+    starting_epoch = 0
+
+    # We need to load the checkpoint back in before training here with `load_state`
+    # The total number of epochs is adjusted based on where the state is being loaded from,
+    # as we assume continuation of the same training script
+    if args.resume_from_checkpoint:
+        if args.resume_from_checkpoint is not None and args.resume_from_checkpoint != "":
+            accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}")
+            accelerator.load_state(args.resume_from_checkpoint)
+            path = os.path.basename(args.resume_from_checkpoint)
+        else:
+            # Get the most recent checkpoint
+            dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()]
+            dirs.sort(key=os.path.getctime)
+            path = dirs[-1]  # Sorts folders by date modified, most recent checkpoint is the last
+        # Extract `epoch_{i}` or `step_{i}`
+        training_difference = os.path.splitext(path)[0]
+
+        if "epoch" in training_difference:
+            starting_epoch = int(training_difference.replace("epoch_", "")) + 1
+            resume_step = None
+        else:
+            resume_step = int(training_difference.replace("step_", ""))
+            starting_epoch = resume_step // len(train_dataloader)
+            resume_step -= starting_epoch * len(train_dataloader)
+
+    # Now we train the model
+    for epoch in range(starting_epoch, num_epochs):
+        model.train()
+        for step, batch in enumerate(train_dataloader):
+            # New Code #
+            # We need to skip steps until we reach the resumed step during the first epoch
+            if args.resume_from_checkpoint and epoch == starting_epoch:
+                if resume_step is not None and step < resume_step:
+                    overall_step += 1
+                    continue
+            # We could avoid this line since we set the accelerator with `device_placement=True`.
+            batch.to(accelerator.device)
+            outputs = model(**batch)
+            loss = outputs.loss
+            loss = loss / gradient_accumulation_steps
+            accelerator.backward(loss)
+            if step % gradient_accumulation_steps == 0:
+                optimizer.step()
+                lr_scheduler.step()
+                optimizer.zero_grad()
+            # New Code #
+            overall_step += 1
+
+            # New Code #
+            # We save the model, optimizer, lr_scheduler, and seed states by calling `save_state`
+            # These are saved to folders named `step_{overall_step}`
+            # Will contain files: "pytorch_model.bin", "optimizer.bin", "scheduler.bin", and "random_states.pkl"
+            # If mixed precision was used, will also save a "scalar.bin" file
+            if isinstance(checkpointing_steps, int):
+                output_dir = f"step_{overall_step}"
+                if overall_step % checkpointing_steps == 0:
+                    if args.output_dir is not None:
+                        output_dir = os.path.join(args.output_dir, output_dir)
+                    accelerator.save_state(output_dir)
+
+        model.eval()
+        for step, batch in enumerate(eval_dataloader):
+            # We could avoid this line since we set the accelerator with `device_placement=True` (the default).
+            batch.to(accelerator.device)
+            with torch.no_grad():
+                outputs = model(**batch)
+            predictions = outputs.logits.argmax(dim=-1)
+            # It is slightly faster to call this once, than multiple times
+            predictions, references = accelerator.gather((predictions, batch["labels"]))
+            metric.add_batch(
+                predictions=predictions,
+                references=references,
+            )
+
+        eval_metric = metric.compute()
+        # Use accelerator.print to print only on the main process.
+        accelerator.print(f"epoch {epoch}:", eval_metric)
+
+        # New Code #
+        # We save the model, optimizer, lr_scheduler, and seed states by calling `save_state`
+        # These are saved to folders named `epoch_{epoch}`
+        # Will contain files: "pytorch_model.bin", "optimizer.bin", "scheduler.bin", and "random_states.pkl"
+        # If mixed precision was used, will also save a "scalar.bin" file
+        if checkpointing_steps == "epoch":
+            output_dir = f"epoch_{epoch}"
+            if args.output_dir is not None:
+                output_dir = os.path.join(args.output_dir, output_dir)
+            accelerator.save_state(output_dir)
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Simple example of training script.")
+    parser.add_argument(
+        "--mixed_precision",
+        type=str,
+        default="no",
+        choices=["no", "fp16", "bf16"],
+        help="Whether to use mixed precision. Choose "
+        "between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10 "
+        "and an Nvidia Ampere GPU.",
+    )
+    parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
+    parser.add_argument(
+        "--checkpointing_steps",
+        type=str,
+        default=None,
+        help="Whether the various states should be saved at the end of every n steps, or 'epoch' for each epoch.",
+    )
+    parser.add_argument(
+        "--output_dir",
+        type=str,
+        default=".",
+        help="Optional save directory where all checkpoint folders will be stored. Default is the current working directory.",
+    )
+    parser.add_argument(
+        "--resume_from_checkpoint",
+        type=str,
+        default=None,
+        help="If the training should continue from a checkpoint folder.",
+    )
+    args = parser.parse_args()
+    config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
+    training_function(config, args)
+
+
+if __name__ == "__main__":
+    main()
--- a/examples/by_feature/cross_validation.py
+++ b/examples/by_feature/cross_validation.py
@ -0,0 +1,275 @@
+# coding=utf-8
+# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+from typing import List
+
+import numpy as np
+import torch
+from torch.utils.data import DataLoader
+
+from accelerate import Accelerator, DistributedType
+from datasets import DatasetDict, load_dataset, load_metric
+
+# New Code #
+# We'll be using StratifiedKFold for this example
+from sklearn.model_selection import StratifiedKFold
+from transformers import (
+    AdamW,
+    AutoModelForSequenceClassification,
+    AutoTokenizer,
+    get_linear_schedule_with_warmup,
+    set_seed,
+)
+
+
+########################################################################
+# This is a fully working simple example to use Accelerate,
+# specifically showcasing how to perform Cross Validation,
+# and builds off the `nlp_example.py` script.
+#
+# This example trains a Bert base model on GLUE MRPC
+# in any of the following settings (with the same script):
+#   - single CPU or single GPU
+#   - multi GPUS (using PyTorch distributed mode)
+#   - (multi) TPUs
+#   - fp16 (mixed-precision) or fp32 (normal precision)
+#
+# To help focus on the differences in the code, building `DataLoaders`
+# was refactored into its own function.
+# New additions from the base script can be found quickly by
+# looking for the # New Code # tags
+#
+# To run it in each of these various modes, follow the instructions
+# in the readme for examples:
+# https://github.com/huggingface/accelerate/tree/main/examples
+#
+########################################################################
+
+
+MAX_GPU_BATCH_SIZE = 16
+EVAL_BATCH_SIZE = 32
+
+# New Code #
+# We need a different `get_dataloaders` function that will build dataloaders by indices
+
+
+def get_fold_dataloaders(
+    accelerator: Accelerator, dataset: DatasetDict, train_idxs: List[int], valid_idxs: List[int], batch_size: int = 16
+):
+    """
+    Gets a set of train, valid, and test dataloaders for a particular fold
+
+    Args:
+        accelerator (`Accelerator`):
+            The main `Accelerator` object
+        train_idxs (list of `int`):
+            The split indices for the training dataset
+        valid_idxs (list of `int`):
+            The split indices for the validation dataset
+        batch_size (`int`):
+            The size of the minibatch. Default is 16
+    """
+    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
+    datasets = DatasetDict(
+        {
+            "train": dataset["train"].select(train_idxs),
+            "validation": dataset["train"].select(valid_idxs),
+            "test": dataset["validation"],
+        }
+    )
+
+    def tokenize_function(examples):
+        # max_length=None => use the model max length (it's actually the default)
+        outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
+        return outputs
+
+    # Apply the method we just defined to all the examples in all the splits of the dataset
+    tokenized_datasets = datasets.map(
+        tokenize_function,
+        batched=True,
+        remove_columns=["idx", "sentence1", "sentence2"],
+    )
+
+    # We also rename the 'label' column to 'labels' which is the expected name for labels by the models of the
+    # transformers library
+    tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
+
+    def collate_fn(examples):
+        # On TPU it's best to pad everything to the same length or training will be very slow.
+        if accelerator.distributed_type == DistributedType.TPU:
+            return tokenizer.pad(examples, padding="max_length", max_length=128, return_tensors="pt")
+        return tokenizer.pad(examples, padding="longest", return_tensors="pt")
+
+    # Instantiate dataloaders.
+    train_dataloader = DataLoader(
+        tokenized_datasets["train"], shuffle=True, collate_fn=collate_fn, batch_size=batch_size
+    )
+    eval_dataloader = DataLoader(
+        tokenized_datasets["validation"], shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
+    )
+
+    test_dataloader = DataLoader(
+        tokenized_datasets["test"], shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
+    )
+
+    return train_dataloader, eval_dataloader, test_dataloader
+
+
+def training_function(config, args):
+    # New Code #
+    test_labels = None
+    test_predictions = []
+    # Download the dataset
+    datasets = load_dataset("glue", "mrpc")
+    # Create our splits
+    kfold = StratifiedKFold(n_splits=int(args.num_folds))
+    # Initialize accelerator
+    accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
+    # Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
+    lr = config["lr"]
+    num_epochs = int(config["num_epochs"])
+    correct_bias = config["correct_bias"]
+    seed = int(config["seed"])
+    batch_size = int(config["batch_size"])
+
+    metric = load_metric("glue", "mrpc")
+
+    # If the batch size is too big we use gradient accumulation
+    gradient_accumulation_steps = 1
+    if batch_size > MAX_GPU_BATCH_SIZE:
+        gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
+        batch_size = MAX_GPU_BATCH_SIZE
+
+    set_seed(seed)
+
+    # New Code #
+    # Create our folds:
+    folds = kfold.split(np.zeros(datasets["train"].num_rows), datasets["train"]["label"])
+
+    # Iterate over them
+    for train_idxs, valid_idxs in folds:
+        train_dataloader, eval_dataloader, test_dataloader = get_fold_dataloaders(
+            accelerator,
+            datasets,
+            train_idxs,
+            valid_idxs,
+        )
+        if test_labels is None:
+            test_labels = datasets["validation"]["label"]
+        # Instantiate the model (we build the model here so that the seed also controls new weights initialization)
+        model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
+
+        # We could avoid this line since the accelerator is set with `device_placement=True` (default value).
+        # Note that if you are placing tensors on devices manually, this line absolutely needs to be before the optimizer
+        # creation otherwise training will not work on TPU (`accelerate` will kindly throw an error to make us aware of that).
+        model = model.to(accelerator.device)
+
+        # Instantiate optimizer
+        optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
+
+        # Instantiate scheduler
+        lr_scheduler = get_linear_schedule_with_warmup(
+            optimizer=optimizer,
+            num_warmup_steps=100,
+            num_training_steps=(len(train_dataloader) * num_epochs) // gradient_accumulation_steps,
+        )
+
+        # Prepare everything
+        # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
+        # prepare method.
+        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
+            model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
+        )
+
+        # Now we train the model
+        for epoch in range(num_epochs):
+            model.train()
+            for step, batch in enumerate(train_dataloader):
+                # We could avoid this line since we set the accelerator with `device_placement=True`.
+                batch.to(accelerator.device)
+                outputs = model(**batch)
+                loss = outputs.loss
+                loss = loss / gradient_accumulation_steps
+                accelerator.backward(loss)
+                if step % gradient_accumulation_steps == 0:
+                    optimizer.step()
+                    lr_scheduler.step()
+                    optimizer.zero_grad()
+
+            model.eval()
+            for step, batch in enumerate(eval_dataloader):
+                # We could avoid this line since we set the accelerator with `device_placement=True`.
+                batch.to(accelerator.device)
+                with torch.no_grad():
+                    outputs = model(**batch)
+                predictions = outputs.logits.argmax(dim=-1)
+                predictions, references = accelerator.gather((predictions, batch["labels"]))
+                metric.add_batch(
+                    predictions=predictions,
+                    references=references,
+                )
+
+            eval_metric = metric.compute()
+            # Use accelerator.print to print only on the main process.
+            accelerator.print(f"epoch {epoch}:", eval_metric)
+
+        # New Code #
+        # We also run predictions on the test set at the very end
+        fold_predictions = []
+        for step, batch in enumerate(test_dataloader):
+            # We could avoid this line since we set the accelerator with `device_placement=True`.
+            batch.to(accelerator.device)
+            with torch.no_grad():
+                outputs = model(**batch)
+            predictions = outputs.logits
+            predictions, references = accelerator.gather((predictions, batch["labels"]))
+            fold_predictions.append(predictions.cpu())
+            metric.add_batch(
+                predictions=predictions.argmax(dim=-1),
+                references=references,
+            )
+        test_metric = metric.compute()
+        # Keep this fold's gathered logits so they can be averaged across all folds at the end.
+        test_predictions.append(torch.cat(fold_predictions, dim=0))
+        # We now need to release all our memory and get rid of the current model, optimizer, etc
+        accelerator.free_memory()
+    # New Code #
+    # Finally we check the accuracy of our folded results:
+    preds = torch.stack(test_predictions, dim=0).sum(dim=0).div(int(args.num_folds)).argmax(dim=-1)
+    test_metric = metric.compute(predictions=preds, references=test_labels)
+    accelerator.print("Average test metrics from all folds:", test_metric)
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Simple example of training script.")
+    parser.add_argument(
+        "--mixed_precision",
+        type=str,
+        default="no",
+        choices=["no", "fp16", "bf16"],
+        help="Whether to use mixed precision. Choose "
+        "between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10 "
+        "and an Nvidia Ampere GPU.",
+    )
+    parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
+    # New Code #
+    parser.add_argument("--num_folds", type=int, default=3, help="The number of splits to perform across the dataset")
+    args = parser.parse_args()
+    config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
+    training_function(config, args)
+
+
+if __name__ == "__main__":
+    main()
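The fold-averaging step that closes the cross-validation example (stack the per-fold logits, average them, take the argmax) can be sketched without PyTorch. This is only a minimal illustration: `average_fold_predictions`, its nested-list layout, and the sample logits are hypothetical and not part of the script.

```python
from typing import List


def average_fold_predictions(fold_logits: List[List[List[float]]]) -> List[int]:
    """Average per-class logits across folds and return the argmax class per example.

    `fold_logits[f][i][c]` is the logit of class `c` for example `i` in fold `f`
    (a hypothetical layout chosen for this sketch).
    """
    n_folds = len(fold_logits)
    n_examples = len(fold_logits[0])
    n_classes = len(fold_logits[0][0])
    predictions = []
    for i in range(n_examples):
        # Mean logit of each class over all folds for example i
        mean_logits = [sum(fold[i][c] for fold in fold_logits) / n_folds for c in range(n_classes)]
        # Index of the largest mean logit (argmax over classes)
        predictions.append(max(range(n_classes), key=lambda c: mean_logits[c]))
    return predictions


# Three folds, two examples, two classes (made-up numbers)
folds = [
    [[0.1, 0.9], [0.8, 0.2]],
    [[0.3, 0.7], [0.4, 0.6]],
    [[0.6, 0.4], [0.9, 0.1]],
]
print(average_fold_predictions(folds))
```

Averaging logits before the argmax lets a confident fold outweigh a hesitant one, which a simple majority vote over per-fold class labels would not do.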
--- a/examples/by_feature/fsdp_with_peak_mem_tracking.py
+++ b/examples/by_feature/fsdp_with_peak_mem_tracking.py
@ -0,0 +1,388 @@
+# coding=utf-8
+# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import gc
+import os
+
+import torch
+from torch.utils.data import DataLoader
+
+from accelerate import Accelerator, DistributedType
+from datasets import load_dataset, load_metric
+from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
+
+
+########################################################################
+# This is a fully working simple example to use Accelerate
+#
+# This example trains a Bert base model on GLUE MRPC
+# in any of the following settings (with the same script):
+#   - single CPU or single GPU
+#   - multi GPUS (using PyTorch distributed mode)
+#   - (multi) TPUs
+#   - fp16 (mixed-precision) or fp32 (normal precision)
+#   - FSDP
+#
+# This example also demonstrates the checkpointing and sharding capabilities
+#
+# To run it in each of these various modes, follow the instructions
+# in the readme for examples:
+# https://github.com/huggingface/accelerate/tree/main/examples
+#
+########################################################################
+
+
+MAX_GPU_BATCH_SIZE = 16
+EVAL_BATCH_SIZE = 32
+
+
+# New Code #
+# Converting Bytes to Megabytes
+def b2mb(x):
+    return int(x / 2**20)
+
+
+# New Code #
+# This context manager is used to track the peak memory usage of the process
+class TorchTracemalloc:
+    def __enter__(self):
+        gc.collect()
+        torch.cuda.empty_cache()
+        torch.cuda.reset_peak_memory_stats()  # reset the peak gauge to zero (replaces the deprecated reset_max_memory_allocated)
+        self.begin = torch.cuda.memory_allocated()
+        return self
+
+    def __exit__(self, *exc):
+        gc.collect()
+        torch.cuda.empty_cache()
+        self.end = torch.cuda.memory_allocated()
+        self.peak = torch.cuda.max_memory_allocated()
+        self.used = b2mb(self.end - self.begin)
+        self.peaked = b2mb(self.peak - self.begin)
+        # print(f"delta used/peak {self.used:4d}/{self.peaked:4d}")
+
+
+# For testing only
+if os.environ.get("TESTING_MOCKED_DATALOADERS", None) == "1":
+    from accelerate.test_utils.training import mocked_dataloaders
+
+    get_dataloaders = mocked_dataloaders  # noqa: F811
+
+
+def training_function(config, args):
+    # Initialize accelerator
+    if args.with_tracking:
+        accelerator = Accelerator(
+            cpu=args.cpu, mixed_precision=args.mixed_precision, log_with="wandb", logging_dir=args.logging_dir
+        )
+    else:
+        accelerator = Accelerator()
+    accelerator.print(accelerator.distributed_type)
+
+    if hasattr(args.checkpointing_steps, "isdigit"):
+        if args.checkpointing_steps == "epoch":
+            checkpointing_steps = args.checkpointing_steps
+        elif args.checkpointing_steps.isdigit():
+            checkpointing_steps = int(args.checkpointing_steps)
+        else:
+            raise ValueError(
+                f"Argument `checkpointing_steps` must be either a number or `epoch`. `{args.checkpointing_steps}` passed."
+            )
+    else:
+        checkpointing_steps = None
+    # Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
+    lr = config["lr"]
+    num_epochs = int(config["num_epochs"])
+    seed = int(config["seed"])
+    batch_size = int(config["batch_size"])
+
+    # We need to initialize the trackers we use, and also store our configuration
+    if args.with_tracking:
+        if accelerator.is_main_process:
+            run = os.path.split(__file__)[-1].split(".")[0]
+            if args.logging_dir:
+                run = os.path.join(args.logging_dir, run)
+                accelerator.print(run)
+            accelerator.init_trackers(run, config)
+
+    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
+    datasets = load_dataset("glue", "mrpc")
+    metric = load_metric("glue", "mrpc")
+
+    def tokenize_function(examples):
+        # max_length=None => use the model max length (it's actually the default)
+        outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
+        return outputs
+
+    # Apply the method we just defined to all the examples in all the splits of the dataset
+    tokenized_datasets = datasets.map(
+        tokenize_function,
+        batched=True,
+        remove_columns=["idx", "sentence1", "sentence2"],
+    )
+
+    # We also rename the 'label' column to 'labels' which is the expected name for labels by the models of the
+    # transformers library
+    tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
+
+    # If the batch size is too big we use gradient accumulation
+    gradient_accumulation_steps = 1
+    if batch_size > MAX_GPU_BATCH_SIZE:
+        gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
+        batch_size = MAX_GPU_BATCH_SIZE
+
+    def collate_fn(examples):
+        # On TPU it's best to pad everything to the same length or training will be very slow.
+        if accelerator.distributed_type == DistributedType.TPU:
+            return tokenizer.pad(examples, padding="max_length", max_length=128, return_tensors="pt")
+        return tokenizer.pad(examples, padding="longest", return_tensors="pt")
+
+    # Instantiate dataloaders.
+    train_dataloader = DataLoader(
+        tokenized_datasets["train"], shuffle=True, collate_fn=collate_fn, batch_size=batch_size
+    )
+    eval_dataloader = DataLoader(
+        tokenized_datasets["validation"], shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
+    )
+
+    set_seed(seed)
+
+    # Instantiate the model (we build the model here so that the seed also controls new weights initialization)
+    model = AutoModelForSequenceClassification.from_pretrained(args.model_name_or_path, return_dict=True)
+    # New Code #
+    # For FSDP feature, it is highly recommended and efficient to prepare the model before creating optimizer
+    model = accelerator.prepare(model)
+
+    # Instantiate optimizer
+    # New Code #
+    # For FSDP feature, at present it doesn't support multiple parameter groups,
+    # so we need to create a single parameter group for the whole model
+    optimizer = torch.optim.AdamW(params=model.parameters(), lr=lr, weight_decay=2e-4)
+
+    # Instantiate scheduler
+    lr_scheduler = get_linear_schedule_with_warmup(
+        optimizer=optimizer,
+        num_warmup_steps=10,
+        num_training_steps=(len(train_dataloader) * num_epochs) // gradient_accumulation_steps,
+    )
+
+    # New Code #
+    # For FSDP feature, prepare everything except the model as we have already prepared the model
+    # before creating the optimizer
+    # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
+    # prepare method.
+    optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
+        optimizer, train_dataloader, eval_dataloader, lr_scheduler
+    )
+
+    overall_step = 0
+
+    # Potentially load in the weights and states from a previous save
+    if args.resume_from_checkpoint:
+        if args.resume_from_checkpoint is not None and args.resume_from_checkpoint != "":
+            accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}")
+            accelerator.load_state(args.resume_from_checkpoint)
+            path = os.path.basename(args.resume_from_checkpoint)
+        else:
+            # Get the most recent checkpoint
+            dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()]
+            dirs.sort(key=os.path.getctime)
+            path = dirs[-1]  # Sorts folders by date modified, most recent checkpoint is the last
+        # Extract `epoch_{i}` or `step_{i}`
+        training_difference = os.path.splitext(path)[0]
+
+        if "epoch" in training_difference:
+            num_epochs -= int(training_difference.replace("epoch_", ""))
+            resume_step = None
+        else:
+            resume_step = int(training_difference.replace("step_", ""))
+            num_epochs -= resume_step // len(train_dataloader)
+            # If resuming by step, we also need to know exactly how far into the DataLoader we went
+            resume_step = (num_epochs * len(train_dataloader)) - resume_step
+
+    # Now we train the model
+    for epoch in range(num_epochs):
+        # New Code #
+        # context manager to track the peak memory usage during the training epoch
+        with TorchTracemalloc() as tracemalloc:
+            model.train()
+            if args.with_tracking:
+                total_loss = 0
+            for step, batch in enumerate(train_dataloader):
+                # We need to skip steps until we reach the resumed step
+                if args.resume_from_checkpoint and epoch == 0:
+                    if resume_step is not None and step < resume_step:
+                        continue
+                # We could avoid this line since we set the accelerator with `device_placement=True`.
+                batch.to(accelerator.device)
+                outputs = model(**batch)
+                loss = outputs.loss
+                loss = loss / gradient_accumulation_steps
+                # We keep track of the loss at each epoch
+                if args.with_tracking:
+                    total_loss += loss.detach().float()
+                accelerator.backward(loss)
+                if step % gradient_accumulation_steps == 0:
+                    optimizer.step()
+                    lr_scheduler.step()
+                    optimizer.zero_grad()
+                    # accelerator.print(lr_scheduler.get_lr())
+
+                overall_step += 1
+
+                if isinstance(checkpointing_steps, int):
+                    output_dir = f"step_{overall_step}"
+                    if overall_step % checkpointing_steps == 0:
+                        if args.output_dir is not None:
+                            output_dir = os.path.join(args.output_dir, output_dir)
+                        accelerator.save_state(output_dir)
+        # New Code #
+        # Printing the GPU memory usage details such as allocated memory, peak memory, and total memory usage
+        accelerator.print("Memory before entering the train : {}".format(b2mb(tracemalloc.begin)))
+        accelerator.print("Memory consumed at the end of the train (end-begin): {}".format(tracemalloc.used))
+        accelerator.print("Peak Memory consumed during the train (max-begin): {}".format(tracemalloc.peaked))
+        accelerator.print(
+            "Total Peak Memory consumed during the train (max): {}".format(
+                tracemalloc.peaked + b2mb(tracemalloc.begin)
+            )
+        )
+        # Logging the peak memory usage of the GPU to the tracker
+        if args.with_tracking:
+            accelerator.log(
+                {
+                    "train_total_peak_memory": tracemalloc.peaked + b2mb(tracemalloc.begin),
+                },
+                step=epoch,
+            )
+
+        # New Code #
+        # context manager to track the peak memory usage during the evaluation
+        with TorchTracemalloc() as tracemalloc:
+            model.eval()
+            samples_seen = 0
+            for step, batch in enumerate(eval_dataloader):
+                # We could avoid this line since we set the accelerator with `device_placement=True`.
+                batch.to(accelerator.device)
+                with torch.no_grad():
+                    outputs = model(**batch)
+                predictions = outputs.logits.argmax(dim=-1)
+                # It is slightly faster to call this once, than multiple times
+                predictions, references = accelerator.gather(
+                    (predictions, batch["labels"])
+                )  # If we are in a multiprocess environment, the last batch has duplicates
+                if accelerator.num_processes > 1:
+                    if step == len(eval_dataloader) - 1:
+                        predictions = predictions[: len(eval_dataloader.dataset) - samples_seen]
+                        references = references[: len(eval_dataloader.dataset) - samples_seen]
+                    else:
+                        samples_seen += references.shape[0]
+                metric.add_batch(
+                    predictions=predictions,
+                    references=references,
+                )
+
+            eval_metric = metric.compute()
+            # Use accelerator.print to print only on the main process.
+            accelerator.print(f"epoch {epoch}:", eval_metric)
+            if args.with_tracking:
+                accelerator.log(
+                    {
+                        "accuracy": eval_metric["accuracy"],
+                        "f1": eval_metric["f1"],
+                        "train_loss": total_loss.item(),
+                    },
+                    step=epoch,
+                )
+
+            if checkpointing_steps == "epoch":
+                output_dir = f"epoch_{epoch}"
+                if args.output_dir is not None:
+                    output_dir = os.path.join(args.output_dir, output_dir)
+                accelerator.save_state(output_dir)
+        # New Code #
+        # Printing the GPU memory usage details such as allocated memory, peak memory, and total memory usage
+        accelerator.print("Memory before entering the eval : {}".format(b2mb(tracemalloc.begin)))
+        accelerator.print("Memory consumed at the end of the eval (end-begin): {}".format(tracemalloc.used))
+        accelerator.print("Peak Memory consumed during the eval (max-begin): {}".format(tracemalloc.peaked))
+        accelerator.print(
+            "Total Peak Memory consumed during the eval (max): {}".format(tracemalloc.peaked + b2mb(tracemalloc.begin))
+        )
+        # Logging the peak memory usage of the GPU to the tracker
+        if args.with_tracking:
+            accelerator.log(
+                {
+                    "eval_total_peak_memory": tracemalloc.peaked + b2mb(tracemalloc.begin),
+                },
+                step=epoch,
+            )
+
+    if args.with_tracking:
+        accelerator.end_training()
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Simple example of training script.")
+    parser.add_argument(
+        "--mixed_precision",
+        type=str,
+        default="no",
+        choices=["no", "fp16", "bf16"],
+        help="Whether to use mixed precision. Choose "
+        "between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10 "
+        "and an Nvidia Ampere GPU.",
+    )
+    parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
+    parser.add_argument(
+        "--checkpointing_steps",
+        type=str,
+        default=None,
+        help="Whether the various states should be saved at the end of every n steps, or 'epoch' for each epoch.",
+    )
+    parser.add_argument(
+        "--resume_from_checkpoint",
+        type=str,
+        default=None,
+        help="If the training should continue from a checkpoint folder.",
+    )
+    parser.add_argument(
+        "--with_tracking",
+        action="store_true",
+        help="Whether to load in all available experiment trackers from the environment and use them for logging.",
+    )
+    parser.add_argument(
+        "--output_dir",
+        type=str,
+        default=".",
+        help="Optional save directory where all checkpoint folders will be stored. Default is the current working directory.",
+    )
+    parser.add_argument(
+        "--logging_dir",
+        type=str,
+        default="logs",
+        help="Location where experiment tracking logs are stored.",
+    )
+    parser.add_argument(
+        "--model_name_or_path",
+        type=str,
+        help="Path to pretrained model or model identifier from huggingface.co/models.",
+        required=True,
+    )
+    args = parser.parse_args()
+    config = {"lr": 2e-5, "num_epochs": 3, "seed": 1, "batch_size": 16}
+    training_function(config, args)
+
+
+if __name__ == "__main__":
+    main()
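The `TorchTracemalloc` context manager above records the allocated memory on entry and the current and peak values on exit. The same begin/end/peak pattern can be tried without a GPU by swapping in the standard-library `tracemalloc` module, which traces host (Python) allocations instead of CUDA ones. The sketch below is an analogue under that assumption, not the script's implementation; `HostTracemalloc` and `tracker` are hypothetical names.

```python
import tracemalloc


def b2mb(x):
    # Convert bytes to whole mebibytes, as in the example script
    return int(x / 2**20)


class HostTracemalloc:
    """Tracks host memory allocated by Python inside a `with` block (sketch)."""

    def __enter__(self):
        tracemalloc.start()
        # get_traced_memory() returns (current, peak) in bytes
        self.begin, _ = tracemalloc.get_traced_memory()
        return self

    def __exit__(self, *exc):
        self.end, self.peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        self.used = b2mb(self.end - self.begin)
        self.peaked = b2mb(self.peak - self.begin)


with HostTracemalloc() as tracker:
    buffer = bytearray(4 * 2**20)  # allocate roughly 4 MiB
    del buffer  # release it again before the block ends

print(f"used: {tracker.used} MB, peak: {tracker.peaked} MB")
```

Because the buffer is freed before the block exits, the end-minus-begin figure stays small while the peak still reflects the temporary allocation, which is the distinction the training script reports.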
--- a/examples/by_feature/memory.py
+++ b/examples/by_feature/memory.py
@ -0,0 +1,226 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import os
+
+import torch
+from torch.utils.data import DataLoader
+
+from accelerate import Accelerator, DistributedType
+
+# New Code #
+from accelerate.utils import find_executable_batch_size
+from datasets import load_dataset, load_metric
+from transformers import (
+    AdamW,
+    AutoModelForSequenceClassification,
+    AutoTokenizer,
+    get_linear_schedule_with_warmup,
+    set_seed,
+)
+
+
+########################################################################
+# This is a fully working simple example to use Accelerate,
+# specifically showcasing how to ensure out-of-memory errors never
+# interrupt training, and builds off the `nlp_example.py` script.
+#
+# This example trains a Bert base model on GLUE MRPC
+# in any of the following settings (with the same script):
+#   - single CPU or single GPU
+#   - multi GPUS (using PyTorch distributed mode)
+#   - (multi) TPUs
+#   - fp16 (mixed-precision) or fp32 (normal precision)
+#
+# New additions from the base script can be found quickly by
+# looking for the # New Code # tags
+#
+# To run it in each of these various modes, follow the instructions
+# in the readme for examples:
+# https://github.com/huggingface/accelerate/tree/main/examples
+#
+########################################################################
+
+
+MAX_GPU_BATCH_SIZE = 16
+EVAL_BATCH_SIZE = 32
+
+
+def get_dataloaders(accelerator: Accelerator, batch_size: int = 16):
+    """
+    Creates a set of `DataLoader`s for the `glue` dataset,
+    using "bert-base-cased" as the tokenizer.
+
+    Args:
+        accelerator (`Accelerator`):
+            An `Accelerator` object
+        batch_size (`int`, *optional*):
+            The batch size for the train and validation DataLoaders.
+    """
+    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
+    datasets = load_dataset("glue", "mrpc")
+
+    def tokenize_function(examples):
+        # max_length=None => use the model max length (it's actually the default)
+        outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
+        return outputs
+
+    # Apply the method we just defined to all the examples in all the splits of the dataset
+    tokenized_datasets = datasets.map(
+        tokenize_function,
+        batched=True,
+        remove_columns=["idx", "sentence1", "sentence2"],
+    )
+
+    # We also rename the 'label' column to 'labels' which is the expected name for labels by the models of the
+    # transformers library
+    tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
+
+    def collate_fn(examples):
+        # On TPU it's best to pad everything to the same length or training will be very slow.
+        if accelerator.distributed_type == DistributedType.TPU:
+            return tokenizer.pad(examples, padding="max_length", max_length=128, return_tensors="pt")
+        return tokenizer.pad(examples, padding="longest", return_tensors="pt")
+
+    # Instantiate dataloaders.
+    train_dataloader = DataLoader(
+        tokenized_datasets["train"], shuffle=True, collate_fn=collate_fn, batch_size=batch_size
+    )
+    eval_dataloader = DataLoader(
+        tokenized_datasets["validation"], shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
+    )
+
+    return train_dataloader, eval_dataloader
+
+
+# For testing only
+if os.environ.get("TESTING_MOCKED_DATALOADERS", None) == "1":
+    from accelerate.test_utils.training import mocked_dataloaders
+
+    get_dataloaders = mocked_dataloaders  # noqa: F811
+
+
+def training_function(config, args):
+    # Initialize accelerator
+    accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
+    # Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
+    lr = config["lr"]
+    num_epochs = int(config["num_epochs"])
+    correct_bias = config["correct_bias"]
+    seed = int(config["seed"])
+    batch_size = int(config["batch_size"])
+
+    metric = load_metric("glue", "mrpc")
+
+    # If the batch size is too big we use gradient accumulation
+    gradient_accumulation_steps = 1
+    if batch_size > MAX_GPU_BATCH_SIZE:
+        gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
+        batch_size = MAX_GPU_BATCH_SIZE
+
+    set_seed(seed)
+    # Instantiate the model (we build the model here so that the seed also controls new weights initialization)
+    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
+
+    # We could avoid this line since the accelerator is set with `device_placement=True` (default value).
+    # Note that if you are placing tensors on devices manually, this line absolutely needs to be before the optimizer
+    # creation otherwise training will not work on TPU (`accelerate` will kindly throw an error to make us aware of that).
+    model = model.to(accelerator.device)
+
+    # Instantiate optimizer
+    optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
+
+    # New Code #
+    # We now can define an inner training loop function. It should take a batch size as the only parameter,
+    # and build the dataloaders in there.
+    # It also gets our decorator
+    @find_executable_batch_size(starting_batch_size=batch_size)
+    def inner_training_loop(batch_size):
+        # And now just move everything below under this function
+        # Ensure that anything declared outside this function is set as `nonlocal`
+        # so it is in scope
+        nonlocal model, optimizer
+        train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
+
+        # Instantiate scheduler
+        lr_scheduler = get_linear_schedule_with_warmup(
+            optimizer=optimizer,
+            num_warmup_steps=100,
+            num_training_steps=(len(train_dataloader) * num_epochs) // gradient_accumulation_steps,
+        )
+
+        # Prepare everything
+        # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
+        # prepare method.
+        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
+            model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
+        )
+
+        # Now we train the model
+        for epoch in range(num_epochs):
+            model.train()
+            for step, batch in enumerate(train_dataloader):
+                # We could avoid this line since we set the accelerator with `device_placement=True`.
+                batch.to(accelerator.device)
+                outputs = model(**batch)
+                loss = outputs.loss
+                loss = loss / gradient_accumulation_steps
+                accelerator.backward(loss)
+                if step % gradient_accumulation_steps == 0:
+                    optimizer.step()
+                    lr_scheduler.step()
+                    optimizer.zero_grad()
+
+            model.eval()
+            for step, batch in enumerate(eval_dataloader):
+                # We could avoid this line since we set the accelerator with `device_placement=True`.
+                batch.to(accelerator.device)
+                with torch.no_grad():
+                    outputs = model(**batch)
+                predictions = outputs.logits.argmax(dim=-1)
+                predictions, references = accelerator.gather((predictions, batch["labels"]))
+                metric.add_batch(
+                    predictions=predictions,
+                    references=references,
+                )
+
+            eval_metric = metric.compute()
+            # Use accelerator.print to print only on the main process.
+            accelerator.print(f"epoch {epoch}:", eval_metric)
+
+    # New Code #
+    # And call it at the end with no arguments
+    # Note: You could also refactor this outside of your training loop function
+    inner_training_loop()
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Simple example of training script.")
+    parser.add_argument(
+        "--mixed_precision",
+        type=str,
+        default="no",
+        choices=["no", "fp16", "bf16"],
+        help="Whether to use mixed precision. Choose "
+        "between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10 "
+        "and an Nvidia Ampere GPU.",
+    )
+    parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
+    args = parser.parse_args()
+    config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
+    training_function(config, args)
+
+
+if __name__ == "__main__":
+    main()
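The decorator pattern used in `memory.py` (wrap an inner loop that takes `batch_size` as its first parameter, then call it with no arguments) can be illustrated with a simplified stand-in. This is a sketch of the idea only, not Accelerate's `find_executable_batch_size`: `retry_with_smaller_batch` is a hypothetical name, the built-in `MemoryError` stands in for a device out-of-memory error, and halving the batch size on each failure is an assumption of this sketch.

```python
import functools


def retry_with_smaller_batch(starting_batch_size):
    """Retry the wrapped function with a halved batch size after each MemoryError."""

    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            batch_size = starting_batch_size
            while batch_size > 0:
                try:
                    # The decorator supplies batch_size; callers pass no batch size
                    return function(batch_size, *args, **kwargs)
                except MemoryError:
                    batch_size //= 2
            raise RuntimeError("No executable batch size found, reached zero.")

        return wrapper

    return decorator


attempts = []


@retry_with_smaller_batch(starting_batch_size=16)
def fake_training_loop(batch_size):
    attempts.append(batch_size)
    if batch_size > 4:
        # Simulate running out of memory for large batches
        raise MemoryError("simulated out-of-memory")
    return batch_size


result = fake_training_loop()
print(result, attempts)
```

Building the dataloaders inside the wrapped function, as the script does, matters here: each retry must recreate them with the new batch size.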
--- a/examples/by_feature/multi_process_metrics.py
+++ b/examples/by_feature/multi_process_metrics.py
@ -0,0 +1,223 @@
+# coding=utf-8
+# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import os
+
+import torch
+from torch.utils.data import DataLoader
+
+from accelerate import Accelerator, DistributedType
+from datasets import load_dataset, load_metric
+from transformers import (
+    AdamW,
+    AutoModelForSequenceClassification,
+    AutoTokenizer,
+    get_linear_schedule_with_warmup,
+    set_seed,
+)
+
+
+########################################################################
+# This is a fully working simple example to use Accelerate,
+# specifically showcasing how to properly calculate the metrics on the
+# validation dataset when in a distributed system, and builds off the
+# `nlp_example.py` script.
+#
+# This example trains a Bert base model on GLUE MRPC
+# in any of the following settings (with the same script):
+#   - single CPU or single GPU
+#   - multi GPUS (using PyTorch distributed mode)
+#   - (multi) TPUs
+#   - fp16 (mixed-precision) or fp32 (normal precision)
+#
+# To help focus on the differences in the code, building `DataLoaders`
+# was refactored into its own function.
+# New additions from the base script can be found quickly by
+# looking for the # New Code # tags
+#
+# To run it in each of these various modes, follow the instructions
+# in the readme for examples:
+# https://github.com/huggingface/accelerate/tree/main/examples
+#
+########################################################################
+
+
+MAX_GPU_BATCH_SIZE = 16
+EVAL_BATCH_SIZE = 32
+
+
+def get_dataloaders(accelerator: Accelerator, batch_size: int = 16):
+    """
+    Creates a set of `DataLoader`s for the `glue` dataset,
+    using "bert-base-cased" as the tokenizer.
+
+    Args:
+        accelerator (`Accelerator`):
+            An `Accelerator` object
+        batch_size (`int`, *optional*):
+            The batch size for the train and validation DataLoaders.
+    """
+    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
+    datasets = load_dataset("glue", "mrpc")
+
+    def tokenize_function(examples):
+        # max_length=None => use the model max length (it's actually the default)
+        outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
+        return outputs
+
+    # Apply the method we just defined to all the examples in all the splits of the dataset
+    tokenized_datasets = datasets.map(
+        tokenize_function,
+        batched=True,
+        remove_columns=["idx", "sentence1", "sentence2"],
+    )
+
+    # We also rename the 'label' column to 'labels' which is the expected name for labels by the models of the
+    # transformers library
+    tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
+
+    def collate_fn(examples):
+        # On TPU it's best to pad everything to the same length or training will be very slow.
+        if accelerator.distributed_type == DistributedType.TPU:
+            return tokenizer.pad(examples, padding="max_length", max_length=128, return_tensors="pt")
+        return tokenizer.pad(examples, padding="longest", return_tensors="pt")
+
+    # Instantiate dataloaders.
+    train_dataloader = DataLoader(
+        tokenized_datasets["train"], shuffle=True, collate_fn=collate_fn, batch_size=batch_size
+    )
+    eval_dataloader = DataLoader(
+        tokenized_datasets["validation"], shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
+    )
+
+    return train_dataloader, eval_dataloader
+
+
+# For testing only
+if os.environ.get("TESTING_MOCKED_DATALOADERS", None) == "1":
+    from accelerate.test_utils.training import mocked_dataloaders
+
+    get_dataloaders = mocked_dataloaders  # noqa: F811
+
+
+def training_function(config, args):
+    # Initialize accelerator
+    accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
+    # Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
+    lr = config["lr"]
+    num_epochs = int(config["num_epochs"])
+    correct_bias = config["correct_bias"]
+    seed = int(config["seed"])
+    batch_size = int(config["batch_size"])
+
+    metric = load_metric("glue", "mrpc")
+
+    # If the batch size is too big we use gradient accumulation
+    gradient_accumulation_steps = 1
+    if batch_size > MAX_GPU_BATCH_SIZE:
+        gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
+        batch_size = MAX_GPU_BATCH_SIZE
+
+    set_seed(seed)
+    train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
+    # Instantiate the model (we build the model here so that the seed also controls new weights initialization)
+    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
+
+    # We could avoid this line since the accelerator is set with `device_placement=True` (default value).
+    # Note that if you are placing tensors on devices manually, this line absolutely needs to be before the optimizer
+    # creation otherwise training will not work on TPU (`accelerate` will kindly throw an error to make us aware of that).
+    model = model.to(accelerator.device)
+
+    # Instantiate optimizer
+    optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
+
+    # Instantiate scheduler
+    lr_scheduler = get_linear_schedule_with_warmup(
+        optimizer=optimizer,
+        num_warmup_steps=100,
+        num_training_steps=(len(train_dataloader) * num_epochs) // gradient_accumulation_steps,
+    )
+
+    # Prepare everything
+    # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
+    # prepare method.
+    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
+        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
+    )
+
+    # Now we train the model
+    for epoch in range(num_epochs):
+        model.train()
+        for step, batch in enumerate(train_dataloader):
+            # We could avoid this line since we set the accelerator with `device_placement=True`.
+            batch.to(accelerator.device)
+            outputs = model(**batch)
+            loss = outputs.loss
+            loss = loss / gradient_accumulation_steps
+            accelerator.backward(loss)
+            if step % gradient_accumulation_steps == 0:
+                optimizer.step()
+                lr_scheduler.step()
+                optimizer.zero_grad()
+
+        model.eval()
+        samples_seen = 0
+        for step, batch in enumerate(eval_dataloader):
+            # We could avoid this line since we set the accelerator with `device_placement=True`.
+            batch.to(accelerator.device)
+            with torch.no_grad():
+                outputs = model(**batch)
+            predictions = outputs.logits.argmax(dim=-1)
+            predictions, references = accelerator.gather((predictions, batch["labels"]))
+            # New Code #
+            # First we check if it's a distributed system
+            if accelerator.num_processes > 1:
+                # Then see if we're on the last batch of our eval dataloader
+                if step == len(eval_dataloader) - 1:
+                    # Last batch needs to be truncated on distributed systems as it contains additional samples
+                    predictions = predictions[: len(eval_dataloader.dataset) - samples_seen]
+                    references = references[: len(eval_dataloader.dataset) - samples_seen]
+                else:
+                    # Otherwise we add the number of samples seen
+                    samples_seen += references.shape[0]
+            metric.add_batch(
+                predictions=predictions,
+                references=references,
+            )
+
+        eval_metric = metric.compute()
+        # Use accelerator.print to print only on the main process.
+        accelerator.print(f"epoch {epoch}:", eval_metric)
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Simple example of training script.")
+    parser.add_argument(
+        "--mixed_precision",
+        type=str,
+        default="no",
+        choices=["no", "fp16", "bf16"],
+        help="Whether to use mixed precision. Choose "
+        "between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10 "
+        "and an Nvidia Ampere GPU.",
+    )
+    parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
+    args = parser.parse_args()
+    config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
+    training_function(config, args)
+
+
+if __name__ == "__main__":
+    main()
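Note on the evaluation loop in the file above: the truncation of the last gathered batch can be sketched in plain Python. This is a minimal illustration, not Accelerate code. The helper `keep_real_samples` and the toy index lists are hypothetical, and the sketch assumes the repeated samples sit at the end of the last gathered batch.

```python
def keep_real_samples(gathered, dataset_len, samples_seen, is_last_batch):
    """Drop the repeated tail of the last gathered batch and update the running count."""
    if is_last_batch:
        # Only `dataset_len - samples_seen` items of the last batch are real samples
        gathered = gathered[: dataset_len - samples_seen]
    return gathered, samples_seen + len(gathered)


# Hypothetical setup: 10 samples, 2 processes, 3 samples per process per step,
# so every gathered batch holds 6 items and the last one repeats 2 samples.
dataset_len = 10
gathered_batches = [
    [0, 1, 2, 3, 4, 5],
    [6, 7, 8, 9, 0, 1],  # indices 0 and 1 only fill up the last batch
]

kept = []
samples_seen = 0
for step, gathered in enumerate(gathered_batches):
    is_last = step == len(gathered_batches) - 1
    batch, samples_seen = keep_real_samples(gathered, dataset_len, samples_seen, is_last)
    kept.extend(batch)

print(kept, samples_seen)
```

Without the truncation, the two repeated samples would be counted twice by the metric.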
--- a/examples/by_feature/tracking.py
+++ b/examples/by_feature/tracking.py
@ -0,0 +1,267 @@
+# coding=utf-8
+# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import os
+
+import torch
+from torch.utils.data import DataLoader
+
+from accelerate import Accelerator, DistributedType
+from datasets import load_dataset, load_metric
+from transformers import (
+    AdamW,
+    AutoModelForSequenceClassification,
+    AutoTokenizer,
+    get_linear_schedule_with_warmup,
+    set_seed,
+)
+
+
+########################################################################
+# This is a fully working simple example to use Accelerate,
+# specifically showcasing the experiment tracking capability,
+# and builds off the `nlp_example.py` script.
+#
+# This example trains a Bert base model on GLUE MRPC
+# in any of the following settings (with the same script):
+#   - single CPU or single GPU
+#   - multi GPUs (using PyTorch distributed mode)
+#   - (multi) TPUs
+#   - fp16 (mixed-precision) or fp32 (normal precision)
+#
+# To help focus on the differences in the code, building `DataLoaders`
+# was refactored into its own function.
+# New additions from the base script can be found quickly by
+# looking for the # New Code # tags
+#
+# To run it in each of these various modes, follow the instructions
+# in the readme for examples:
+# https://github.com/huggingface/accelerate/tree/main/examples
+#
+########################################################################
+
+MAX_GPU_BATCH_SIZE = 16
+EVAL_BATCH_SIZE = 32
+
+
+def get_dataloaders(accelerator: Accelerator, batch_size: int = 16):
+    """
+    Creates a set of `DataLoader`s for the `glue` dataset,
+    using "bert-base-cased" as the tokenizer.
+
+    Args:
+        accelerator (`Accelerator`):
+            An `Accelerator` object
+        batch_size (`int`, *optional*):
+            The batch size for the train and validation DataLoaders.
+    """
+    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
+    datasets = load_dataset("glue", "mrpc")
+
+    def tokenize_function(examples):
+        # max_length=None => use the model max length (it's actually the default)
+        outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
+        return outputs
+
+    # Apply the method we just defined to all the examples in all the splits of the dataset
+    tokenized_datasets = datasets.map(
+        tokenize_function,
+        batched=True,
+        remove_columns=["idx", "sentence1", "sentence2"],
+    )
+
+    # We also rename the 'label' column to 'labels' which is the expected name for labels by the models of the
+    # transformers library
+    tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
+
+    def collate_fn(examples):
+        # On TPU it's best to pad everything to the same length or training will be very slow.
+        if accelerator.distributed_type == DistributedType.TPU:
+            return tokenizer.pad(examples, padding="max_length", max_length=128, return_tensors="pt")
+        return tokenizer.pad(examples, padding="longest", return_tensors="pt")
+
+    # Instantiate dataloaders.
+    train_dataloader = DataLoader(
+        tokenized_datasets["train"], shuffle=True, collate_fn=collate_fn, batch_size=batch_size
+    )
+    eval_dataloader = DataLoader(
+        tokenized_datasets["validation"], shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
+    )
+
+    return train_dataloader, eval_dataloader
+
+
+# For testing only
+if os.environ.get("TESTING_MOCKED_DATALOADERS", None) == "1":
+    from accelerate.test_utils.training import mocked_dataloaders
+
+    get_dataloaders = mocked_dataloaders  # noqa: F811
+
+
+def training_function(config, args):
+    # Initialize Accelerator
+
+    # New Code #
+    # We pass in "all" to `log_with` to grab all available trackers in the environment
+    # Note: If using a custom `Tracker` class, it should be passed in here, such as:
+    # >>> log_with = ["all", MyCustomTrackerClassInstance()]
+    if args.with_tracking:
+        accelerator = Accelerator(
+            cpu=args.cpu, mixed_precision=args.mixed_precision, log_with="all", logging_dir=args.logging_dir
+        )
+    else:
+        accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
+    # Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
+    lr = config["lr"]
+    num_epochs = int(config["num_epochs"])
+    correct_bias = config["correct_bias"]
+    seed = int(config["seed"])
+    batch_size = int(config["batch_size"])
+    set_seed(seed)
+    metric = load_metric("glue", "mrpc")
+
+    # If the batch size is too big we use gradient accumulation
+    gradient_accumulation_steps = 1
+    if batch_size > MAX_GPU_BATCH_SIZE:
+        gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
+        batch_size = MAX_GPU_BATCH_SIZE
+
+    train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
+
+    # Instantiate the model (we build the model here so that the seed also controls new weights initialization)
+    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
+
+    # We could avoid this line since the accelerator is set with `device_placement=True` (default value).
+    # Note that if you are placing tensors on devices manually, this line absolutely needs to be before the optimizer
+    # creation otherwise training will not work on TPU (`accelerate` will kindly throw an error to make us aware of that).
+    model = model.to(accelerator.device)
+
+    # Instantiate optimizer
+    optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
+
+    # Instantiate scheduler
+    lr_scheduler = get_linear_schedule_with_warmup(
+        optimizer=optimizer,
+        num_warmup_steps=100,
+        num_training_steps=(len(train_dataloader) * num_epochs) // gradient_accumulation_steps,
+    )
+
+    # Prepare everything
+    # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
+    # prepare method.
+    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
+        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
+    )
+
+    # New Code #
+    # We need to initialize the trackers we use. Overall configurations can also be stored
+    if args.with_tracking:
+        if accelerator.is_main_process:
+            run = os.path.split(__file__)[-1].split(".")[0]
+            if args.logging_dir:
+                run = os.path.join(args.logging_dir, run)
+            accelerator.init_trackers(run, config)
+
+    # Now we train the model
+    for epoch in range(num_epochs):
+        model.train()
+        # New Code #
+        # For our tracking example, we will log the total loss of each epoch
+        if args.with_tracking:
+            total_loss = 0
+        for step, batch in enumerate(train_dataloader):
+            # We could avoid this line since we set the accelerator with `device_placement=True`.
+            batch.to(accelerator.device)
+            outputs = model(**batch)
+            loss = outputs.loss
+            # New Code #
+            if args.with_tracking:
+                total_loss += loss.detach().float()
+            loss = loss / gradient_accumulation_steps
+            accelerator.backward(loss)
+            if step % gradient_accumulation_steps == 0:
+                optimizer.step()
+                lr_scheduler.step()
+                optimizer.zero_grad()
+
+        model.eval()
+        for step, batch in enumerate(eval_dataloader):
+            # We could avoid this line since we set the accelerator with `device_placement=True` (the default).
+            batch.to(accelerator.device)
+            with torch.no_grad():
+                outputs = model(**batch)
+            predictions = outputs.logits.argmax(dim=-1)
+            # It is slightly faster to call this once than multiple times
+            predictions, references = accelerator.gather((predictions, batch["labels"]))
+            metric.add_batch(
+                predictions=predictions,
+                references=references,
+            )
+
+        eval_metric = metric.compute()
+        # Use accelerator.print to print only on the main process.
+        accelerator.print(f"epoch {epoch}:", eval_metric)
+
+        # New Code #
+        # To actually log, we call `Accelerator.log`
+        # The values passed can be of `str`, `int`, `float` or `dict` of `str` to `float`/`int`
+        if args.with_tracking:
+            accelerator.log(
+                {
+                    "accuracy": eval_metric["accuracy"],
+                    "f1": eval_metric["f1"],
+                    "train_loss": total_loss.item(),
+                    "epoch": epoch,
+                },
+                step=epoch,
+            )
+
+    # New Code #
+    # When a run is finished, you should call `accelerator.end_training()`
+    # to close all of the open trackers
+    if args.with_tracking:
+        accelerator.end_training()
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Simple example of training script.")
+    parser.add_argument(
+        "--mixed_precision",
+        type=str,
+        default="no",
+        choices=["no", "fp16", "bf16"],
+        help="Whether to use mixed precision. Choose "
+        "between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10 "
+        "and an Nvidia Ampere GPU.",
+    )
+    parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
+    parser.add_argument(
+        "--with_tracking",
+        action="store_true",
+        help="Whether to load in all available experiment trackers from the environment and use them for logging.",
+    )
+    parser.add_argument(
+        "--logging_dir",
+        type=str,
+        default="logs",
+        help="Location where the experiment tracking logs are stored.",
+    )
+    args = parser.parse_args()
+    config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
+    training_function(config, args)
+
+
+if __name__ == "__main__":
+    main()
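Note on the gradient accumulation setup in the file above: the way a requested batch size is split into a per-step batch size and a number of accumulation steps can be sketched on its own. This is a minimal illustration; the helper name `split_batch_size` is hypothetical, and the default of 16 mirrors `MAX_GPU_BATCH_SIZE` from the script.

```python
def split_batch_size(batch_size, max_gpu_batch_size=16):
    """Return (per_step_batch_size, gradient_accumulation_steps)."""
    gradient_accumulation_steps = 1
    if batch_size > max_gpu_batch_size:
        # Integer division: any remainder of the requested batch size is dropped
        gradient_accumulation_steps = batch_size // max_gpu_batch_size
        batch_size = max_gpu_batch_size
    return batch_size, gradient_accumulation_steps


for requested in (8, 16, 64, 40):
    per_step, accumulation = split_batch_size(requested)
    print(requested, "->", per_step, "x", accumulation)
```

Because of the integer division, a requested batch size of 40 becomes 16 x 2, so the effective batch size is 32 rather than 40.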
--- a/examples/complete_cv_example.py
+++ b/examples/complete_cv_example.py
@ -0,0 +1,322 @@
+# coding=utf-8
+# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import os
+import re
+
+import numpy as np
+import torch
+from torch.optim.lr_scheduler import OneCycleLR
+from torch.utils.data import DataLoader, Dataset
+
+import PIL
+from accelerate import Accelerator
+from timm import create_model
+from torchvision.transforms import Compose, RandomResizedCrop, Resize, ToTensor
+
+
+########################################################################
+# This is a fully working simple example to use Accelerate
+#
+# This example trains a ResNet50 on the Oxford-IIIT Pet Dataset
+# in any of the following settings (with the same script):
+#   - single CPU or single GPU
+#   - multi GPUs (using PyTorch distributed mode)
+#   - (multi) TPUs
+#   - fp16 (mixed-precision) or fp32 (normal precision)
+#
+# To run it in each of these various modes, follow the instructions
+# in the readme for examples:
+# https://github.com/huggingface/accelerate/tree/main/examples
+#
+########################################################################
+
+
+# Function to get the label from the filename
+def extract_label(fname):
+    stem = fname.split(os.path.sep)[-1]
+    return re.search(r"^(.*)_\d+\.jpg$", stem).groups()[0]
+
+
+class PetsDataset(Dataset):
+    def __init__(self, file_names, image_transform=None, label_to_id=None):
+        self.file_names = file_names
+        self.image_transform = image_transform
+        self.label_to_id = label_to_id
+
+    def __len__(self):
+        return len(self.file_names)
+
+    def __getitem__(self, idx):
+        fname = self.file_names[idx]
+        raw_image = PIL.Image.open(fname)
+        image = raw_image.convert("RGB")
+        if self.image_transform is not None:
+            image = self.image_transform(image)
+        label = extract_label(fname)
+        if self.label_to_id is not None:
+            label = self.label_to_id[label]
+        return {"image": image, "label": label}
+
+
+def training_function(config, args):
+    # Initialize accelerator
+    if args.with_tracking:
+        accelerator = Accelerator(
+            cpu=args.cpu, mixed_precision=args.mixed_precision, log_with="all", logging_dir=args.logging_dir
+        )
+    else:
+        accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
+
+    # Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
+    lr = config["lr"]
+    num_epochs = int(config["num_epochs"])
+    seed = int(config["seed"])
+    batch_size = int(config["batch_size"])
+    image_size = config["image_size"]
+    if not isinstance(image_size, (list, tuple)):
+        image_size = (image_size, image_size)
+
+    # Parse out whether we are saving every epoch or after a certain number of batches
+    if hasattr(args.checkpointing_steps, "isdigit"):
+        if args.checkpointing_steps == "epoch":
+            checkpointing_steps = args.checkpointing_steps
+        elif args.checkpointing_steps.isdigit():
+            checkpointing_steps = int(args.checkpointing_steps)
+        else:
+            raise ValueError(
+                f"Argument `checkpointing_steps` must be either a number or `epoch`. `{args.checkpointing_steps}` passed."
+            )
+    else:
+        checkpointing_steps = None
+
+    # We need to initialize the trackers we use, and also store our configuration
+    if args.with_tracking:
+        if accelerator.is_main_process:
+            run = os.path.split(__file__)[-1].split(".")[0]
+            if args.logging_dir:
+                run = os.path.join(args.logging_dir, run)
+            accelerator.init_trackers(run, config)
+
+    # Grab all the image filenames
+    file_names = [os.path.join(args.data_dir, fname) for fname in os.listdir(args.data_dir) if fname.endswith(".jpg")]
+
+    # Build the label correspondences
+    all_labels = [extract_label(fname) for fname in file_names]
+    id_to_label = list(set(all_labels))
+    id_to_label.sort()
+    label_to_id = {lbl: i for i, lbl in enumerate(id_to_label)}
+
+    # Set the seed before splitting the data.
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    torch.cuda.manual_seed_all(seed)
+
+    # Split our filenames between train and validation
+    random_perm = np.random.permutation(len(file_names))
+    cut = int(0.8 * len(file_names))
+    train_split = random_perm[:cut]
+    eval_split = random_perm[cut:]
+
+    # For training we use a simple RandomResizedCrop
+    train_tfm = Compose([RandomResizedCrop(image_size, scale=(0.5, 1.0)), ToTensor()])
+    train_dataset = PetsDataset(
+        [file_names[i] for i in train_split], image_transform=train_tfm, label_to_id=label_to_id
+    )
+
+    # For evaluation, we use a deterministic Resize
+    eval_tfm = Compose([Resize(image_size), ToTensor()])
+    eval_dataset = PetsDataset([file_names[i] for i in eval_split], image_transform=eval_tfm, label_to_id=label_to_id)
+
+    # Instantiate dataloaders.
+    train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size, num_workers=4)
+    eval_dataloader = DataLoader(eval_dataset, shuffle=False, batch_size=batch_size, num_workers=4)
+
+    # Instantiate the model (we build the model here so that the seed also controls new weights initialization)
+    model = create_model("resnet50d", pretrained=True, num_classes=len(label_to_id))
+
+    # We could avoid this line since the accelerator is set with `device_placement=True` (default value).
+    # Note that if you are placing tensors on devices manually, this line absolutely needs to be before the optimizer
+    # creation otherwise training will not work on TPU (`accelerate` will kindly throw an error to make us aware of that).
+    model = model.to(accelerator.device)
+
+    # Freezing the base model
+    for param in model.parameters():
+        param.requires_grad = False
+    for param in model.get_classifier().parameters():
+        param.requires_grad = True
+
+    # We normalize the batches of images to be a bit faster.
+    mean = torch.tensor(model.default_cfg["mean"])[None, :, None, None].to(accelerator.device)
+    std = torch.tensor(model.default_cfg["std"])[None, :, None, None].to(accelerator.device)
+
+    # Instantiate optimizer
+    optimizer = torch.optim.Adam(params=model.parameters(), lr=lr / 25)
+
+    # Instantiate learning rate scheduler
+    lr_scheduler = OneCycleLR(optimizer=optimizer, max_lr=lr, epochs=num_epochs, steps_per_epoch=len(train_dataloader))
+
+    # Prepare everything
+    # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
+    # prepare method.
+    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
+        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
+    )
+    # We need to keep track of how many total steps we have iterated over
+    overall_step = 0
+    # We also need to keep track of the starting epoch so files are named properly
+    starting_epoch = 0
+
+    # Potentially load in the weights and states from a previous save
+    if args.resume_from_checkpoint:
+        if args.resume_from_checkpoint is not None and args.resume_from_checkpoint != "":
+            accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}")
+            accelerator.load_state(args.resume_from_checkpoint)
+            path = os.path.basename(args.resume_from_checkpoint)
+        else:
+            # Get the most recent checkpoint
+            dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()]
+            dirs.sort(key=os.path.getctime)
+            path = dirs[-1]  # Folders are sorted by ctime, so the most recent checkpoint is the last
+        # Extract `epoch_{i}` or `step_{i}`
+        training_difference = os.path.splitext(path)[0]
+
+        if "epoch" in training_difference:
+            starting_epoch = int(training_difference.replace("epoch_", "")) + 1
+            resume_step = None
+        else:
+            resume_step = int(training_difference.replace("step_", ""))
+            starting_epoch = resume_step // len(train_dataloader)
+            resume_step -= starting_epoch * len(train_dataloader)
+
+    # Now we train the model
+    for epoch in range(starting_epoch, num_epochs):
+        model.train()
+        if args.with_tracking:
+            total_loss = 0
+        for step, batch in enumerate(train_dataloader):
+            # We need to skip steps until we reach the resumed step
+            if args.resume_from_checkpoint and epoch == starting_epoch:
+                if resume_step is not None and step < resume_step:
+                    overall_step += 1
+                    continue
+            # We could avoid this line since we set the accelerator with `device_placement=True`.
+            batch = {k: v.to(accelerator.device) for k, v in batch.items()}
+            inputs = (batch["image"] - mean) / std
+            outputs = model(inputs)
+            loss = torch.nn.functional.cross_entropy(outputs, batch["label"])
+            # We keep track of the loss at each epoch
+            if args.with_tracking:
+                total_loss += loss.detach().float()
+            accelerator.backward(loss)
+            optimizer.step()
+            lr_scheduler.step()
+            optimizer.zero_grad()
+            overall_step += 1
+            if isinstance(checkpointing_steps, int):
+                output_dir = f"step_{overall_step}"
+                if overall_step % checkpointing_steps == 0:
+                    if args.output_dir is not None:
+                        output_dir = os.path.join(args.output_dir, output_dir)
+                    accelerator.save_state(output_dir)
+        model.eval()
+        accurate = 0
+        samples_seen = 0
+        for step, batch in enumerate(eval_dataloader):
+            # We could avoid this line since we set the accelerator with `device_placement=True`.
+            batch = {k: v.to(accelerator.device) for k, v in batch.items()}
+            inputs = (batch["image"] - mean) / std
+            with torch.no_grad():
+                outputs = model(inputs)
+            predictions = outputs.argmax(dim=-1)
+            predictions, references = accelerator.gather((predictions, batch["label"]))
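+            # On distributed setups the last gathered batch contains duplicated samples,
+            # so we truncate it to the number of samples that are actually left
+            if accelerator.num_processes > 1 and step == len(eval_dataloader) - 1:
+                predictions = predictions[: len(eval_dataloader.dataset) - samples_seen]
+                references = references[: len(eval_dataloader.dataset) - samples_seen]
+            # Count the samples after truncation so the accuracy denominator matches
+            # the number of predictions that were compared
+            samples_seen += references.shape[0]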
+            accurate_preds = predictions == references
+            accurate += accurate_preds.long().sum()
+
+        eval_metric = accurate.item() / samples_seen
+        # Use accelerator.print to print only on the main process.
+        accelerator.print(f"epoch {epoch}: {100 * eval_metric:.2f}")
+        if args.with_tracking:
+            accelerator.log(
+                {"accuracy": 100 * eval_metric, "total_loss": total_loss, "epoch": epoch}, step=overall_step
+            )
+        if checkpointing_steps == "epoch":
+            output_dir = f"epoch_{epoch}"
+            if args.output_dir is not None:
+                output_dir = os.path.join(args.output_dir, output_dir)
+            accelerator.save_state(output_dir)
+
+    if args.with_tracking:
+        accelerator.end_training()
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Simple example of training script.")
+    parser.add_argument("--data_dir", required=True, help="The data folder on disk.")
+    parser.add_argument("--fp16", action="store_true", help="If passed, will use FP16 training.")
+    parser.add_argument(
+        "--mixed_precision",
+        type=str,
+        default="no",
+        choices=["no", "fp16", "bf16"],
+        help="Whether to use mixed precision. Choose "
+        "between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10 "
+        "and an Nvidia Ampere GPU.",
+    )
+    parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
+    parser.add_argument(
+        "--checkpointing_steps",
+        type=str,
+        default=None,
+        help="Whether the various states should be saved at the end of every n steps, or 'epoch' for each epoch.",
+    )
+    parser.add_argument(
+        "--output_dir",
+        type=str,
+        default=".",
+        help="Optional save directory where all checkpoint folders will be stored. Default is the current working directory.",
+    )
+    parser.add_argument(
+        "--resume_from_checkpoint",
+        type=str,
+        default=None,
+        help="If the training should continue from a checkpoint folder.",
+    )
+    parser.add_argument(
+        "--with_tracking",
+        action="store_true",
+        help="Whether to load in all available experiment trackers from the environment and use them for logging.",
+    )
+    parser.add_argument(
+        "--logging_dir",
+        type=str,
+        default="logs",
+        help="Location where the experiment tracking logs are stored.",
+    )
+    args = parser.parse_args()
+    config = {"lr": 3e-2, "num_epochs": 3, "seed": 42, "batch_size": 64, "image_size": 224}
+    training_function(config, args)
+
+
+if __name__ == "__main__":
+    main()
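Note on the resume logic in the file above: the mapping from a checkpoint folder name (`epoch_{i}` or `step_{i}`) to a starting epoch and a number of steps to skip can be sketched in isolation. This is a minimal illustration under the naming scheme used by the script; the helper `parse_resume_point` and the example paths are hypothetical.

```python
import os


def parse_resume_point(checkpoint_path, steps_per_epoch):
    """Return (starting_epoch, resume_step) for an `epoch_{i}` or `step_{i}` folder."""
    name = os.path.splitext(os.path.basename(checkpoint_path))[0]
    if "epoch" in name:
        # The saved epoch is complete, so training restarts at the next one
        return int(name.replace("epoch_", "")) + 1, None
    resume_step = int(name.replace("step_", ""))
    starting_epoch = resume_step // steps_per_epoch
    # Steps already done inside the partially completed epoch
    resume_step -= starting_epoch * steps_per_epoch
    return starting_epoch, resume_step


print(parse_resume_point("out/epoch_1", steps_per_epoch=100))
print(parse_resume_point("out/step_250", steps_per_epoch=100))
```

With 100 steps per epoch, a `step_250` checkpoint resumes in epoch 2 after skipping its first 50 batches.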
--- a/examples/complete_nlp_example.py
+++ b/examples/complete_nlp_example.py
@ -0,0 +1,312 @@
+# coding=utf-8
+# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import os
+
+import torch
+from torch.utils.data import DataLoader
+
+from accelerate import Accelerator, DistributedType
+from datasets import load_dataset, load_metric
+from transformers import (
+    AdamW,
+    AutoModelForSequenceClassification,
+    AutoTokenizer,
+    get_linear_schedule_with_warmup,
+    set_seed,
+)
+
+
+########################################################################
+# This is a fully working simple example to use Accelerate
+#
+# This example trains a Bert base model on GLUE MRPC
+# in any of the following settings (with the same script):
+#   - single CPU or single GPU
+#   - multi GPUs (using PyTorch distributed mode)
+#   - (multi) TPUs
+#   - fp16 (mixed-precision) or fp32 (normal precision)
+#
+# This example also demonstrates the checkpointing and sharding capabilities
+#
+# To run it in each of these various modes, follow the instructions
+# in the readme for examples:
+# https://github.com/huggingface/accelerate/tree/main/examples
+#
+########################################################################
+
+
+MAX_GPU_BATCH_SIZE = 16
+EVAL_BATCH_SIZE = 32
+
+
+def training_function(config, args):
+    # Initialize accelerator
+    if args.with_tracking:
+        accelerator = Accelerator(
+            cpu=args.cpu, mixed_precision=args.mixed_precision, log_with="all", logging_dir=args.logging_dir
+        )
+    else:
+        accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
+
+    if hasattr(args.checkpointing_steps, "isdigit"):
+        if args.checkpointing_steps == "epoch":
+            checkpointing_steps = args.checkpointing_steps
+        elif args.checkpointing_steps.isdigit():
+            checkpointing_steps = int(args.checkpointing_steps)
+        else:
+            raise ValueError(
+                f"Argument `checkpointing_steps` must be either a number or `epoch`. `{args.checkpointing_steps}` passed."
+            )
+    else:
+        checkpointing_steps = None
+    # Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
+    lr = config["lr"]
+    num_epochs = int(config["num_epochs"])
+    correct_bias = config["correct_bias"]
+    seed = int(config["seed"])
+    batch_size = int(config["batch_size"])
+
+    # We need to initialize the trackers we use, and also store our configuration
+    if args.with_tracking:
+        if accelerator.is_main_process:
+            run = os.path.split(__file__)[-1].split(".")[0]
+            if args.logging_dir:
+                run = os.path.join(args.logging_dir, run)
+            accelerator.init_trackers(run, config)
+
+    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
+    datasets = load_dataset("glue", "mrpc")
+    metric = load_metric("glue", "mrpc")
+
+    def tokenize_function(examples):
+        # max_length=None => use the model max length (it's actually the default)
+        outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
+        return outputs
+
+    # Apply the method we just defined to all the examples in all the splits of the dataset
+    tokenized_datasets = datasets.map(
+        tokenize_function,
+        batched=True,
+        remove_columns=["idx", "sentence1", "sentence2"],
+    )
+
+    # We also rename the 'label' column to 'labels' which is the expected name for labels by the models of the
+    # transformers library
+    tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
+
+    # If the batch size is too big we use gradient accumulation
+    gradient_accumulation_steps = 1
+    if batch_size > MAX_GPU_BATCH_SIZE:
+        gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
+        batch_size = MAX_GPU_BATCH_SIZE
+
+    def collate_fn(examples):
+        # On TPU it's best to pad everything to the same length or training will be very slow.
+        if accelerator.distributed_type == DistributedType.TPU:
+            return tokenizer.pad(examples, padding="max_length", max_length=128, return_tensors="pt")
+        return tokenizer.pad(examples, padding="longest", return_tensors="pt")
+
+    # Instantiate dataloaders.
+    train_dataloader = DataLoader(
+        tokenized_datasets["train"], shuffle=True, collate_fn=collate_fn, batch_size=batch_size
+    )
+    eval_dataloader = DataLoader(
+        tokenized_datasets["validation"], shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
+    )
+
+    set_seed(seed)
+
+    # Instantiate the model (we build the model here so that the seed also controls new weights initialization)
+    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
+
+    # We could avoid this line since the accelerator is set with `device_placement=True` (default value).
+    # Note that if you are placing tensors on devices manually, this line absolutely needs to be before the optimizer
+    # creation otherwise training will not work on TPU (`accelerate` will kindly throw an error to make us aware of that).
+    model = model.to(accelerator.device)
+
+    # Instantiate optimizer
+    optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
+
+    # Instantiate scheduler
+    lr_scheduler = get_linear_schedule_with_warmup(
+        optimizer=optimizer,
+        num_warmup_steps=100,
+        num_training_steps=(len(train_dataloader) * num_epochs) // gradient_accumulation_steps,
+    )
+
+    # Prepare everything
+    # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
+    # prepare method.
+    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
+        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
+    )
+
+    # We need to keep track of how many total steps we have iterated over
+    overall_step = 0
+    # We also need to keep track of the starting epoch so files are named properly
+    starting_epoch = 0
+
+    # Potentially load in the weights and states from a previous save
+    if args.resume_from_checkpoint:
+        if args.resume_from_checkpoint is not None and args.resume_from_checkpoint != "":
+            accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}")
+            accelerator.load_state(args.resume_from_checkpoint)
+            path = os.path.basename(args.resume_from_checkpoint)
+        else:
+            # Get the most recent checkpoint
+            dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()]
+            dirs.sort(key=os.path.getctime)
+            path = dirs[-1]  # Folders are sorted by `os.path.getctime`, so the most recent checkpoint is the last
+        # Extract `epoch_{i}` or `step_{i}`
+        training_difference = os.path.splitext(path)[0]
+
+        if "epoch" in training_difference:
+            starting_epoch = int(training_difference.replace("epoch_", "")) + 1
+            resume_step = None
+        else:
+            resume_step = int(training_difference.replace("step_", ""))
+            starting_epoch = resume_step // len(train_dataloader)
+            resume_step -= starting_epoch * len(train_dataloader)
+
+    # Now we train the model
+    for epoch in range(starting_epoch, num_epochs):
+        model.train()
+        if args.with_tracking:
+            total_loss = 0
+        for step, batch in enumerate(train_dataloader):
+            # We need to skip steps until we reach the resumed step
+            if args.resume_from_checkpoint and epoch == starting_epoch:
+                if resume_step is not None and step < resume_step:
+                    overall_step += 1
+                    continue
+            # We could avoid this line since we set the accelerator with `device_placement=True`.
+            batch.to(accelerator.device)
+            outputs = model(**batch)
+            loss = outputs.loss
+            loss = loss / gradient_accumulation_steps
+            # We keep track of the loss at each epoch
+            if args.with_tracking:
+                total_loss += loss.detach().float()
+            accelerator.backward(loss)
+            if step % gradient_accumulation_steps == 0:
+                optimizer.step()
+                lr_scheduler.step()
+                optimizer.zero_grad()
+
+            overall_step += 1
+
+            if isinstance(checkpointing_steps, int):
+                output_dir = f"step_{overall_step}"
+                if overall_step % checkpointing_steps == 0:
+                    if args.output_dir is not None:
+                        output_dir = os.path.join(args.output_dir, output_dir)
+                    accelerator.save_state(output_dir)
+
+        model.eval()
+        samples_seen = 0
+        for step, batch in enumerate(eval_dataloader):
+            # We could avoid this line since we set the accelerator with `device_placement=True`.
+            batch.to(accelerator.device)
+            with torch.no_grad():
+                outputs = model(**batch)
+            predictions = outputs.logits.argmax(dim=-1)
+            # It is slightly faster to call `gather` once on a tuple than once per tensor
+            predictions, references = accelerator.gather(
+                (predictions, batch["labels"])
+            )  # If we are in a multiprocess environment, the last batch has duplicates
+            if accelerator.num_processes > 1:
+                if step == len(eval_dataloader) - 1:
+                    predictions = predictions[: len(eval_dataloader.dataset) - samples_seen]
+                    references = references[: len(eval_dataloader.dataset) - samples_seen]
+                else:
+                    samples_seen += references.shape[0]
+            metric.add_batch(
+                predictions=predictions,
+                references=references,
+            )
+
+        eval_metric = metric.compute()
+        # Use accelerator.print to print only on the main process.
+        accelerator.print(f"epoch {epoch}:", eval_metric)
+        if args.with_tracking:
+            accelerator.log(
+                {
+                    "accuracy": eval_metric["accuracy"],
+                    "f1": eval_metric["f1"],
+                    "train_loss": total_loss.item(),
+                    "epoch": epoch,
+                },
+                step=epoch,
+            )
+
+        if checkpointing_steps == "epoch":
+            output_dir = f"epoch_{epoch}"
+            if args.output_dir is not None:
+                output_dir = os.path.join(args.output_dir, output_dir)
+            accelerator.save_state(output_dir)
+
+    if args.with_tracking:
+        accelerator.end_training()
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Simple example of training script.")
+    parser.add_argument(
+        "--mixed_precision",
+        type=str,
+        default="no",
+        choices=["no", "fp16", "bf16"],
+        help="Whether to use mixed precision. Choose "
+        "between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10 "
+        "and an Nvidia Ampere GPU.",
+    )
+    parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
+    parser.add_argument(
+        "--checkpointing_steps",
+        type=str,
+        default=None,
+        help="Whether the various states should be saved at the end of every n steps, or 'epoch' for each epoch.",
+    )
+    parser.add_argument(
+        "--resume_from_checkpoint",
+        type=str,
+        default=None,
+        help="If the training should continue from a checkpoint folder.",
+    )
+    parser.add_argument(
+        "--with_tracking",
+        action="store_true",
+        help="Whether to load in all available experiment trackers from the environment and use them for logging.",
+    )
+    parser.add_argument(
+        "--output_dir",
+        type=str,
+        default=".",
+        help="Optional save directory where all checkpoint folders will be stored. Default is the current working directory.",
+    )
+    parser.add_argument(
+        "--logging_dir",
+        type=str,
+        default="logs",
+        help="Location in which to store experiment tracking logs.",
+    )
+    args = parser.parse_args()
+    config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
+    training_function(config, args)
+
+
+if __name__ == "__main__":
+    main()
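Review note: the resume logic added in `complete_nlp_example.py` can be checked in isolation. The following is a minimal sketch, not code from this PR. `parse_resume_point` is a hypothetical helper name, and the sketch assumes checkpoint folders are named `epoch_{i}` or `step_{i}`, as the script does.

```python
import os
from typing import Optional, Tuple


def parse_resume_point(checkpoint_path: str, steps_per_epoch: int) -> Tuple[int, Optional[int]]:
    """Hypothetical helper: map a checkpoint folder name to (starting_epoch, resume_step)."""
    # normpath drops a trailing slash so basename returns the folder name
    name = os.path.splitext(os.path.basename(os.path.normpath(checkpoint_path)))[0]
    if name.startswith("epoch_"):
        # An epoch checkpoint is written after the epoch finished, so resume at the next one
        return int(name[len("epoch_"):]) + 1, None
    if name.startswith("step_"):
        overall_step = int(name[len("step_"):])
        starting_epoch = overall_step // steps_per_epoch
        # Steps already consumed inside the epoch we resume into
        resume_step = overall_step - starting_epoch * steps_per_epoch
        return starting_epoch, resume_step
    raise ValueError(f"Unrecognized checkpoint folder name: {name!r}")


epoch_resume = parse_resume_point("out/epoch_2", steps_per_epoch=100)
step_resume = parse_resume_point("out/step_250/", steps_per_epoch=100)
```

With 100 batches per epoch, `step_250` resumes in epoch 2 and skips the first 50 batches of that epoch, which matches the `step < resume_step` check in the training loop.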
--- a/examples/cv_example.py
+++ b/examples/cv_example.py
@ -73,7 +73,7 @@ class PetsDataset(Dataset):

 def training_function(config, args):
    # Initialize accelerator
-    accelerator = Accelerator(fp16=args.fp16, cpu=args.cpu, mixed_precision=args.mix_precision)
+    accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)

    # Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
    lr = config["lr"]
@ -139,17 +139,16 @@ def training_function(config, args):
    # Instantiate optimizer
    optimizer = torch.optim.Adam(params=model.parameters(), lr=lr / 25)

+    # Instantiate learning rate scheduler
+    lr_scheduler = OneCycleLR(optimizer=optimizer, max_lr=lr, epochs=num_epochs, steps_per_epoch=len(train_dataloader))
+
    # Prepare everything
    # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
    # prepare method.
-    model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
-        model, optimizer, train_dataloader, eval_dataloader
+    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
+        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
    )

-    # Instantiate learning rate scheduler after preparing the training dataloader as the prepare method
-    # may change its length.
-    lr_scheduler = OneCycleLR(optimizer=optimizer, max_lr=lr, epochs=num_epochs, steps_per_epoch=len(train_dataloader))
-
    # Now we train the model
    for epoch in range(num_epochs):
        model.train()
@ -167,7 +166,7 @@ def training_function(config, args):
        model.eval()
        accurate = 0
        num_elems = 0
-        for step, batch in enumerate(eval_dataloader):
+        for _, batch in enumerate(eval_dataloader):
            # We could avoid this line since we set the accelerator with `device_placement=True`.
            batch = {k: v.to(accelerator.device) for k, v in batch.items()}
            inputs = (batch["image"] - mean) / std
@ -186,7 +185,6 @@ def training_function(config, args):
 def main():
    parser = argparse.ArgumentParser(description="Simple example of training script.")
    parser.add_argument("--data_dir", required=True, help="The data folder on disk.")
-    parser.add_argument("--fp16", action="store_true", help="If passed, will use FP16 training.")
    parser.add_argument(
        "--mixed_precision",
        type=str,
@ -196,6 +194,12 @@ def main():
        "between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10."
        "and an Nvidia Ampere GPU.",
    )
+    parser.add_argument(
+        "--checkpointing_steps",
+        type=str,
+        default=None,
+        help="Whether the various states should be saved at the end of every n steps, or 'epoch' for each epoch.",
+    )
    parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
    args = parser.parse_args()
    config = {"lr": 3e-2, "num_epochs": 3, "seed": 42, "batch_size": 64, "image_size": 224}
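Review note: `--checkpointing_steps` arrives as a string and has to be normalized before use. The sketch below shows one way to do that, modelled on the branch in `complete_nlp_example.py`. `normalize_checkpointing_steps` is a hypothetical name introduced here for illustration only.

```python
import argparse
from typing import Optional, Union


def normalize_checkpointing_steps(value: Optional[str]) -> Union[int, str, None]:
    """Hypothetical helper: turn the raw CLI string into None, "epoch", or an int."""
    if value is None:
        return None  # checkpointing disabled
    if value == "epoch":
        return value  # save once at the end of every epoch
    if value.isdigit():
        return int(value)  # save every n optimizer loop steps
    raise ValueError(
        f"Argument `checkpointing_steps` must be either a number or `epoch`. `{value}` passed."
    )


parser = argparse.ArgumentParser(description="Sketch of the checkpointing argument.")
parser.add_argument("--checkpointing_steps", type=str, default=None)
# An explicit argv list keeps the sketch independent of the real command line
args = parser.parse_args(["--checkpointing_steps", "500"])
checkpointing_steps = normalize_checkpointing_steps(args.checkpointing_steps)
```

Keeping the argument typed as `str` lets a single flag carry both the integer and the `epoch` form, at the cost of this explicit validation step.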
--- a/examples/nlp_example.py
+++ b/examples/nlp_example.py
@ -49,20 +49,19 @@ MAX_GPU_BATCH_SIZE = 16
 EVAL_BATCH_SIZE = 32


-def training_function(config, args):
-    # Initialize accelerator
-    accelerator = Accelerator(fp16=args.fp16, cpu=args.cpu, mixed_precision=args.mixed_precision)
-
-    # Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
-    lr = config["lr"]
-    num_epochs = int(config["num_epochs"])
-    correct_bias = config["correct_bias"]
-    seed = int(config["seed"])
-    batch_size = int(config["batch_size"])
+def get_dataloaders(accelerator: Accelerator, batch_size: int = 16):
+    """
+    Creates a set of `DataLoader`s for the `glue` dataset,
+    using "bert-base-cased" as the tokenizer.

+    Args:
+        accelerator (`Accelerator`):
+            An `Accelerator` object
+        batch_size (`int`, *optional*):
+            The batch size for the train and validation DataLoaders.
+    """
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    datasets = load_dataset("glue", "mrpc")
-    metric = load_metric("glue", "mrpc")

    def tokenize_function(examples):
        # max_length=None => use the model max length (it's actually the default)
@ -78,13 +77,7 @@ def training_function(config, args):

    # We also rename the 'label' column to 'labels' which is the expected name for labels by the models of the
    # transformers library
-    tokenized_datasets.rename_column_("label", "labels")
-
-    # If the batch size is too big we use gradient accumulation
-    gradient_accumulation_steps = 1
-    if batch_size > MAX_GPU_BATCH_SIZE:
-        gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
-        batch_size = MAX_GPU_BATCH_SIZE
+    tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

    def collate_fn(examples):
        # On TPU it's best to pad everything to the same length or training will be very slow.
@ -100,8 +93,29 @@ def training_function(config, args):
        tokenized_datasets["validation"], shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
    )

-    set_seed(seed)
+    return train_dataloader, eval_dataloader

+
+def training_function(config, args):
+    # Initialize accelerator
+    accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
+    # Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
+    lr = config["lr"]
+    num_epochs = int(config["num_epochs"])
+    correct_bias = config["correct_bias"]
+    seed = int(config["seed"])
+    batch_size = int(config["batch_size"])
+
+    metric = load_metric("glue", "mrpc")
+
+    # If the batch size is too big we use gradient accumulation
+    gradient_accumulation_steps = 1
+    if batch_size > MAX_GPU_BATCH_SIZE:
+        gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
+        batch_size = MAX_GPU_BATCH_SIZE
+
+    set_seed(seed)
+    train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
    # Instantiate the model (we build the model here so that the seed also control new weights initialization)
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)

@ -113,21 +127,20 @@ def training_function(config, args):
    # Instantiate optimizer
    optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)

-    # Prepare everything
-    # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
-    # prepare method.
-    model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
-        model, optimizer, train_dataloader, eval_dataloader
-    )
-
-    # Instantiate learning rate scheduler after preparing the training dataloader as the prepare method
-    # may change its length.
+    # Instantiate scheduler
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=100,
        num_training_steps=(len(train_dataloader) * num_epochs) // gradient_accumulation_steps,
    )

+    # Prepare everything
+    # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
+    # prepare method.
+    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
+        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
+    )
+
    # Now we train the model
    for epoch in range(num_epochs):
        model.train()
@ -150,9 +163,10 @@ def training_function(config, args):
            with torch.no_grad():
                outputs = model(**batch)
            predictions = outputs.logits.argmax(dim=-1)
+            predictions, references = accelerator.gather((predictions, batch["labels"]))
            metric.add_batch(
-                predictions=accelerator.gather(predictions),
-                references=accelerator.gather(batch["labels"]),
+                predictions=predictions,
+                references=references,
            )

        eval_metric = metric.compute()
@ -162,7 +176,6 @@ def training_function(config, args):

 def main():
    parser = argparse.ArgumentParser(description="Simple example of training script.")
-    parser.add_argument("--fp16", action="store_true", help="If passed, will use FP16 training.")
    parser.add_argument(
        "--mixed_precision",
        type=str,
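Review note: the gradient accumulation arithmetic that this diff moves into `training_function` can be exercised on its own. This is a minimal sketch under the same constant (`MAX_GPU_BATCH_SIZE = 16`). The function names `split_batch_size` and `num_scheduler_steps` are hypothetical and do not appear in the PR.

```python
from typing import Tuple

MAX_GPU_BATCH_SIZE = 16


def split_batch_size(batch_size: int, max_gpu_batch_size: int = MAX_GPU_BATCH_SIZE) -> Tuple[int, int]:
    """Hypothetical helper: return (per-step batch size, gradient accumulation steps)."""
    gradient_accumulation_steps = 1
    if batch_size > max_gpu_batch_size:
        # Integer division: a remainder is silently dropped
        gradient_accumulation_steps = batch_size // max_gpu_batch_size
        batch_size = max_gpu_batch_size
    return batch_size, gradient_accumulation_steps


def num_scheduler_steps(batches_per_epoch: int, num_epochs: int, gradient_accumulation_steps: int) -> int:
    """Hypothetical helper: number of optimizer updates the LR scheduler should plan for."""
    return (batches_per_epoch * num_epochs) // gradient_accumulation_steps


per_step, accumulation = split_batch_size(64)
planned_steps = num_scheduler_steps(batches_per_epoch=230, num_epochs=3, gradient_accumulation_steps=accumulation)
```

Note that a requested batch size of 40 yields `(16, 2)`, so the effective batch size becomes 32 rather than 40 because of the integer division.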
--- a/setup.py
+++ b/setup.py
@ -21,7 +21,13 @@ extras["docs"] = []
 extras["test"] = [
    "pytest",
    "pytest-xdist",
+    "pytest-subtests",
+    "datasets",
+    "transformers",
+    "scipy",
+    "sklearn"
 ]
+extras["test_trackers"] = ["wandb", "comet-ml", "tensorflow"]
 extras["dev"] = extras["quality"] + extras["test"]

 extras["sagemaker"] = [
@ -30,7 +36,7 @@ extras["sagemaker"] = [

 setup(
    name="accelerate",
-    version="0.6.1",
+    version="0.9.0",
    description="Accelerate",
    long_description=open("README.md", "r", encoding="utf-8").read(),
    long_description_content_type="text/markdown",
--- a/src/accelerate/__init__.py
+++ b/src/accelerate/__init__.py
@ -2,10 +2,20 @@
 # There's no way to ignore "F401 '...' imported but unused" warnings in this
 # module, but to preserve other warnings. So, don't check this module at all.

-__version__ = "0.6.1"
+__version__ = "0.9.0"

 from .accelerator import Accelerator
-from .kwargs_handlers import DistributedDataParallelKwargs, GradScalerKwargs, InitProcessGroupKwargs
+from .big_modeling import cpu_offload, disk_offload, dispatch_model, init_empty_weights, load_checkpoint_and_dispatch
 from .launchers import debug_launcher, notebook_launcher
-from .state import DistributedType
-from .utils import DeepSpeedPlugin, synchronize_rng_states
+from .utils import (
+    DeepSpeedPlugin,
+    DistributedDataParallelKwargs,
+    DistributedType,
+    FullyShardedDataParallelPlugin,
+    GradScalerKwargs,
+    InitProcessGroupKwargs,
+    find_executable_batch_size,
+    infer_auto_device_map,
+    load_checkpoint_in_model,
+    synchronize_rng_states,
+)
--- a/src/accelerate/accelerator.py
+++ b/src/accelerate/accelerator.py
@ -25,17 +25,29 @@ from packaging import version

 from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
 from .data_loader import prepare_data_loader
-from .kwargs_handlers import DistributedDataParallelKwargs, GradScalerKwargs, InitProcessGroupKwargs, KwargsHandler
+from .logging import get_logger
 from .optimizer import AcceleratedOptimizer
-from .state import AcceleratorState, DistributedType, is_deepspeed_available
+from .scheduler import AcceleratedScheduler
+from .state import AcceleratorState
+from .tracking import LOGGER_TYPE_TO_CLASS, GeneralTracker, filter_trackers
 from .utils import (
    DeepSpeedPlugin,
+    DistributedDataParallelKwargs,
+    DistributedType,
+    FullyShardedDataParallelPlugin,
+    GradScalerKwargs,
+    InitProcessGroupKwargs,
+    KwargsHandler,
+    LoggerType,
+    PrecisionType,
    RNGType,
    convert_outputs_to_fp32,
    extract_model_from_parallel,
    gather,
    get_pretty_name,
+    is_deepspeed_available,
    pad_across_processes,
+    reduce,
    save,
    wait_for_everyone,
 )
@ -44,12 +56,9 @@ from .utils import (
 if is_deepspeed_available():
    import deepspeed

-    from .deepspeed_utils import DeepSpeedEngineWrapper, DeepSpeedOptimizerWrapper
+    from .utils import DeepSpeedEngineWrapper, DeepSpeedOptimizerWrapper

-import logging
-
-
-logger = logging.getLogger(__name__)
+logger = get_logger(__name__)


 class Accelerator:
@ -76,6 +85,9 @@ class Accelerator:
        deepspeed_plugin (`DeepSpeedPlugin`, *optional*):
            Tweak your DeepSpeed related args using this argument. This argument is optional and can be configured
            directly using *accelerate config*
+        fsdp_plugin (`FullyShardedDataParallelPlugin`, *optional*):
+            Tweak your FSDP related args using this argument. This argument is optional and can be configured directly
+            using *accelerate config*
        rng_types (list of `str` or [`~utils.RNGType`]):
            The list of random number generators to synchronize at the beginning of each iteration in your prepared
            dataloaders. Should be one or several of:
@ -87,10 +99,24 @@ class Accelerator:
              dataloader) or of the iterable dataset (if it exists) if the underlying dataset is of that type.

            Will default to `["torch"]` for PyTorch versions <=1.5.1 and `["generator"]` for PyTorch versions >= 1.6.
+        log_with (list of `str`, [`~utils.LoggerType`] or [`~tracking.GeneralTracker`], *optional*):
+            A list of loggers to be set up for experiment tracking. Should be one or several of:
+
+            - `"all"`
+            - `"tensorboard"`
+            - `"wandb"`
+            - `"comet_ml"`
+            If `"all"` is selected, will pick up all available trackers in the environment and initialize them. Can also
+            accept implementations of `GeneralTracker` for custom trackers, and can be combined with `"all"`.
+        logging_dir (`str`, `os.PathLike`, *optional*):
+            A path to a directory for storing logs of locally-compatible loggers.
        dispatch_batches (`bool`, *optional*):
            If set to `True`, the dataloader prepared by the Accelerator is only iterated through on the main process
            and then the batches are split and broadcast to each process. Will default to `True` for `DataLoader` whose
            underlying dataset is an `IterableDataset`, `False` otherwise.
+        step_scheduler_with_optimizer (`bool`, *optional*, defaults to `True`):
+            Set `True` if the learning rate scheduler is stepped at the same time as the optimizer, `False` if only
+            done under certain circumstances (at the end of each epoch, for instance).
        kwargs_handlers (`List[KwargHandler]`, *optional*)
            A list of `KwargHandler` to customize how the objects related to distributed training or mixed precision
            are created. See [kwargs](kwargs) for more information.
@ -106,19 +132,25 @@ class Accelerator:
        device_placement: bool = True,
        split_batches: bool = False,
        fp16: bool = None,
-        mixed_precision: str = None,
+        mixed_precision: Union[PrecisionType, str] = None,
        cpu: bool = False,
        deepspeed_plugin: DeepSpeedPlugin = None,
+        fsdp_plugin: FullyShardedDataParallelPlugin = None,
        rng_types: Optional[List[Union[str, RNGType]]] = None,
+        log_with: Optional[List[Union[str, LoggerType, GeneralTracker]]] = None,
+        logging_dir: Optional[Union[str, os.PathLike]] = None,
        dispatch_batches: Optional[bool] = None,
+        step_scheduler_with_optimizer: bool = True,
        kwargs_handlers: Optional[List[KwargsHandler]] = None,
    ):
+        self.logging_dir = logging_dir
+        self.log_with = filter_trackers(log_with, self.logging_dir)

        if mixed_precision is not None:
-            mixed_precision = mixed_precision.lower()
-            if mixed_precision not in ["no", "fp16", "bf16"]:
+            mixed_precision = str(mixed_precision)
+            if mixed_precision not in PrecisionType:
                raise ValueError(
-                    f"Unknown mixed_precision mode: {mixed_precision}. Choose between 'no', 'fp16' and 'bf16'."
+                    f"Unknown mixed_precision mode: {mixed_precision}. Choose between {PrecisionType.list()}"
                )

        if fp16:
@ -131,6 +163,18 @@ class Accelerator:
            assert isinstance(
                deepspeed_plugin, DeepSpeedPlugin
            ), "`deepspeed_plugin` must be a DeepSpeedPlugin object."
+            os.environ["USE_DEEPSPEED"] = "true"  # use DeepSpeed if plugin is provided
+
+        if fsdp_plugin is None:  # init from env variables
+            fsdp_plugin = FullyShardedDataParallelPlugin() if os.environ.get("USE_FSDP", "false") == "true" else None
+        else:
+            if not isinstance(fsdp_plugin, FullyShardedDataParallelPlugin):
+                raise TypeError("`fsdp_plugin` must be a FullyShardedDataParallelPlugin object.")
+            os.environ["USE_FSDP"] = "true"  # use FSDP if plugin is provided
+
+        if os.environ.get("USE_FSDP", "false") == "true":
+            if version.parse(torch.__version__) < version.parse("1.12.0.dev20220418+cu113"):
+                raise ValueError("FSDP requires PyTorch >= 1.12.0.dev20220418+cu113")

        # Kwargs handlers
        self.ddp_handler = None
@ -160,6 +204,7 @@ class Accelerator:
            mixed_precision=mixed_precision,
            cpu=cpu,
            deepspeed_plugin=deepspeed_plugin,
+            fsdp_plugin=fsdp_plugin,
            _from_accelerator=True,
            **kwargs,
        )
@ -171,6 +216,7 @@ class Accelerator:
            raise ImportError(
                "Using `DataLoaderDispatcher` requires PyTorch 1.8.0 minimum. You have {torch.__version__}."
            )
+        self.step_scheduler_with_optimizer = step_scheduler_with_optimizer

        # Mixed precision attributes
        self.scaler = None
@ -193,6 +239,7 @@ class Accelerator:
        # Internal references to the training objects
        self._optimizers = []
        self._models = []
+        self._schedulers = []
        self._custom_objects = []

        # RNG Types
@ -237,9 +284,10 @@ class Accelerator:
    @property
    def mixed_precision(self):
        if self.distributed_type == DistributedType.DEEPSPEED:
-            if self.state.deepspeed_plugin.deepspeed_config["fp16"]["enabled"]:
+            config = self.state.deepspeed_plugin.deepspeed_config
+            if config.get("fp16", {}).get("enabled", False):
                mixed_precision = "fp16"
-            elif self.state.deepspeed_plugin.deepspeed_config["bf16"]["enabled"]:
+            elif config.get("bf16", {}).get("enabled", False):
                mixed_precision = "bf16"
            else:
                mixed_precision = "no"
@ -281,19 +329,63 @@ class Accelerator:
        if self.is_local_main_process:
            print(*args, **kwargs)

-    def _prepare_one(self, obj):
-        if isinstance(obj, torch.utils.data.DataLoader):
+    def _prepare_one(self, obj, first_pass=False):
+        # First pass of preparation: DataLoader, model, optimizer
+        if isinstance(obj, torch.utils.data.DataLoader) and first_pass:
            return self.prepare_data_loader(obj)
-        elif isinstance(obj, torch.nn.Module):
+        elif isinstance(obj, torch.nn.Module) and first_pass:
            self._models.append(obj)
            return self.prepare_model(obj)
-        elif isinstance(obj, torch.optim.Optimizer):
+        elif isinstance(obj, torch.optim.Optimizer) and first_pass:
            optimizer = self.prepare_optimizer(obj)
            self._optimizers.append(optimizer)
            return optimizer
+        # Second pass of preparation: LR scheduler (which needs the full list of optimizers)
+        elif isinstance(obj, torch.optim.lr_scheduler._LRScheduler) and not first_pass:
+            scheduler = self.prepare_scheduler(obj)
+            self._schedulers.append(scheduler)
+            return scheduler
        else:
            return obj

+    def _prepare_fsdp(self, *args):
+        result = []
+        for obj in args:
+            if isinstance(obj, torch.nn.Module):
+                model = obj
+                break
+        optimizers = []
+
+        self._schedulers = []
+        self._models = []
+        intermediate_result = []
+        for obj in args:
+            if isinstance(obj, torch.optim.Optimizer):
+                if len(obj.param_groups) > 1:
+                    logger.warning(
+                        "FSDP Warning: When using FSDP, several parameter groups will be conflated into "
+                        "a single one due to nested module wrapping and parameter flattening."
+                    )
+                optimizer = obj.optimizer.__class__(model.parameters(), **obj.optimizer.defaults)
+                obj = self.prepare_optimizer(optimizer)
+                optimizers.append(obj)
+            elif isinstance(obj, torch.nn.Module):
+                self._models.append(obj)
+            intermediate_result.append(obj)
+
+        for obj in intermediate_result:
+            if isinstance(obj, AcceleratedScheduler):
+                obj.optimizer = optimizers
+                for i, opt in enumerate(self._optimizers):
+                    if getattr(obj.scheduler, "optimizer", None) == opt.optimizer:
+                        obj.scheduler.optimizer = optimizers[i]
+                        obj.optimizers = [optimizers[i]]
+                        break
+                self._schedulers.append(obj)
+            result.append(obj)
+        self._optimizers = optimizers
+        return tuple(result)
+
    def prepare(self, *args):
        """
        Prepare all objects passed in `args` for distributed training and mixed precision, then return them in the same
@ -305,6 +397,25 @@ class Accelerator:
            - `torch.nn.Module`: PyTorch Module
            - `torch.optim.Optimizer`: PyTorch Optimizer
        """
+        if self.distributed_type == DistributedType.FSDP:
+            model_count = 0
+            optimizer_present = False
+            for obj in args:
+                if isinstance(obj, torch.nn.Module):
+                    model_count += 1
+                if isinstance(obj, torch.optim.Optimizer):
+                    optimizer_present = True
+            if model_count > 1 and optimizer_present:
+                raise ValueError(
+                    "For FSDP to work with multiple models (>1), "
+                    "prepare must be called for all the models before optimizers are created"
+                )
+            elif model_count == 1 and optimizer_present:
+                logger.warn(
+                    "FSDP Warning: When using FSDP, "
+                    "it is efficient and recommended to call prepare for the model before creating the optimizer"
+                )
+
        # On TPUs, putting the model on the XLA device will create new parameters, so the corresponding optimizer will
        # have parameters disconnected from the model (so no training :-( ).
        # If the model and optimizer have parameters on different devices we raise an error.
@ -328,7 +439,8 @@ class Accelerator:
        if self.distributed_type == DistributedType.DEEPSPEED:
            result = self._prepare_deepspeed(*args)
        else:
-            result = tuple(self._prepare_one(obj) for obj in args)
+            result = tuple(self._prepare_one(obj, first_pass=True) for obj in args)
+            result = tuple(self._prepare_one(obj) for obj in result)

        if tpu_should_fix_optimizer:
            # 2. grabbing new model parameters
@ -340,16 +452,36 @@ class Accelerator:
                if isinstance(obj, torch.optim.Optimizer):
                    obj._switch_parameters(mapping)

+        if self.distributed_type == DistributedType.FSDP and model_count == 1 and optimizer_present:
+            result = self._prepare_fsdp(*result)
+
        return result if len(result) > 1 else result[0]

    def prepare_model(self, model):
-        if self.device_placement:
+        if self.device_placement and self.distributed_type != DistributedType.FSDP:
            model = model.to(self.device)
        if self.distributed_type == DistributedType.MULTI_GPU:
            kwargs = self.ddp_handler.to_kwargs() if self.ddp_handler is not None else {}
            model = torch.nn.parallel.DistributedDataParallel(
                model, device_ids=[self.local_process_index], output_device=self.local_process_index, **kwargs
            )
+        elif self.distributed_type == DistributedType.FSDP:
+            from torch.distributed.fsdp.fully_sharded_data_parallel import FullyShardedDataParallel as FSDP
+
+            # Check if the model is already an FSDP model due to `Manual Wrapping` and if so,
+            # don't wrap it again
+            if type(model) != FSDP:
+                fsdp_plugin = self.state.fsdp_plugin
+                model = FSDP(
+                    model,
+                    sharding_strategy=fsdp_plugin.sharding_strategy,
+                    cpu_offload=fsdp_plugin.cpu_offload,
+                    auto_wrap_policy=fsdp_plugin.auto_wrap_policy,
+                    backward_prefetch=fsdp_plugin.backward_prefetch,
+                    ignored_modules=fsdp_plugin.ignored_modules,
+                )
+                if not fsdp_plugin.cpu_offload.offload_params:
+                    model.to(self.device)
        elif self.distributed_type == DistributedType.MULTI_CPU:
            kwargs = self.ddp_handler.to_kwargs() if self.ddp_handler is not None else {}
            model = torch.nn.parallel.DistributedDataParallel(model, **kwargs)
@ -385,7 +517,10 @@ class Accelerator:
            batch_size_per_device * deepspeed_plugin.gradient_accumulation_steps * self.num_processes
        )

-        result = [self._prepare_one(obj) if isinstance(obj, torch.utils.data.DataLoader) else obj for obj in args]
+        result = [
+            self._prepare_one(obj, first_pass=True) if isinstance(obj, torch.utils.data.DataLoader) else obj
+            for obj in args
+        ]

        model = None
        optimizer = None
@ -458,6 +593,21 @@ class Accelerator:
    def prepare_optimizer(self, optimizer):
        return AcceleratedOptimizer(optimizer, device_placement=self.device_placement, scaler=self.scaler)

+    def prepare_scheduler(self, scheduler):
+        # We try to find the optimizer associated with `scheduler`, the default is the full list.
+        optimizer = self._optimizers
+        for opt in self._optimizers:
+            if getattr(scheduler, "optimizer", None) == opt.optimizer:
+                optimizer = opt
+                break
+
+        return AcceleratedScheduler(
+            scheduler,
+            optimizer,
+            step_with_optimizer=self.step_scheduler_with_optimizer,
+            split_batches=self.split_batches,
+        )
+
    def backward(self, loss, **kwargs):
        """
        Use `accelerator.backward(loss)` in lieu of `loss.backward()`.
@ -493,6 +643,12 @@ class Accelerator:
        """
        Should be used in place of `torch.nn.utils.clip_grad_norm_`.
        """
+        if self.distributed_type == DistributedType.FSDP:
+            parameters = [p for p in parameters]
+            for model in self._models:
+                if parameters == [p for p in model.parameters()]:
+                    model.clip_grad_norm_(max_norm, norm_type)
+                    return
        self.unscale_gradients()
        torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=norm_type)

@ -505,7 +661,7 @@ class Accelerator:

    def gather(self, tensor):
        """
-        Gather the values in *tensor* accross all processes and concatenate them on the first dimension. Useful to
+        Gather the values in *tensor* across all processes and concatenate them on the first dimension. Useful to
        regroup the predictions from all processes when doing evaluation.

        Note:
@ -521,6 +677,18 @@ class Accelerator:
        """
        return gather(tensor)

+    def reduce(self, tensor: torch.Tensor, reduction="sum"):
+        """
+        Reduce the values in *tensor* across all processes based on *reduction*.
+
+        Args:
+            tensor (`torch.Tensor`):
+                The tensors to reduce across all processes.
+            reduction (`str`, *optional*, defaults to "sum"):
+                A reduction type, can be one of 'sum', 'mean', or 'none'. If 'none', will not perform any operation.
+        """
+        reduce(tensor, reduction)
+
    def pad_across_processes(self, tensor, dim=0, pad_index=0, pad_first=False):
        """
        Recursively pad the tensors in a nested list/tuple/dictionary of tensors from all devices to the same size so
@ -556,6 +724,54 @@ class Accelerator:
        """
        wait_for_everyone()

+    def init_trackers(self, project_name: str, config: Optional[dict] = None):
+        """
+        Initializes a run for all trackers stored in `self.log_with`, potentially with starting configurations
+
+        Args:
+            project_name (`str`):
+                The name of the project. All trackers will save their data based on this
+            config (`dict`, *optional*):
+                Optional starting configuration to be logged.
+        """
+        self.trackers = []
+        for tracker in self.log_with:
+            if issubclass(type(tracker), GeneralTracker):
+                # Custom trackers are already initialized
+                self.trackers.append(tracker)
+            else:
+                tracker_init = LOGGER_TYPE_TO_CLASS[str(tracker)]
+                if getattr(tracker_init, "requires_logging_directory"):
+                    # We can skip this check since it was done in `__init__`
+                    self.trackers.append(tracker_init(project_name, self.logging_dir))
+                else:
+                    self.trackers.append(tracker_init(project_name))
+        if config is not None:
+            for tracker in self.trackers:
+                tracker.store_init_configuration(config)
+
+    def log(self, values: dict, step: Optional[int] = None):
+        """
+        Logs `values` to all stored trackers in `self.trackers`.
+
+        Args:
+            values (`dict`):
+                Values should be a dictionary-like object containing only types `int`, `float`, or `str`.
+            step (`int`, *optional*):
+                The run step. If included, the log will be affiliated with this step.
+        """
+        if self.is_main_process:
+            for tracker in self.trackers:
+                tracker.log(values, step=step)
+
+    def end_training(self):
+        """
+        Runs any special end training behaviors, such as stopping trackers
+        """
+        if self.is_main_process:
+            for tracker in self.trackers:
+                tracker.finish()
+
    def save(self, obj, f):
        """
        Save the object passed to disk once per machine. Use in place of `torch.save`.
@ -581,7 +797,7 @@ class Accelerator:
        logger.info(f"Saving current state to {output_dir}")
        weights = [self.get_state_dict(m) for m in self._models]
        save_location = save_accelerator_state(
-            output_dir, weights, self._optimizers, self.state.process_index, self.scaler
+            output_dir, weights, self._optimizers, self._schedulers, self.state.process_index, self.scaler
        )
        for i, obj in enumerate(self._custom_objects):
            save_custom_state(obj, output_dir, i)
@ -600,7 +816,9 @@ class Accelerator:
        if not os.path.isdir(input_dir):
            raise ValueError(f"Tried to find {input_dir} but folder does not exist")
        logger.info(f"Loading states from {input_dir}")
-        load_accelerator_state(input_dir, self._models, self._optimizers, self.state.process_index, self.scaler)
+        load_accelerator_state(
+            input_dir, self._models, self._optimizers, self._schedulers, self.state.process_index, self.scaler
+        )
        custom_checkpoints = [f for f in os.listdir(input_dir) if "custom_checkpoint" in f]
        if len(custom_checkpoints) != len(self._custom_objects):
            err = "Warning! Number of found checkpoints does not match the number of registered objects:"
@ -617,12 +835,20 @@ class Accelerator:
        Will release all references to the internal objects stored and call the garbage collector. You should call this
        method between two trainings with different models/optimizers.
        """
+        self._schedulers = []
        self._optimizers = []
        self._models = []
        self.deepspeed_engine = None
        gc.collect()
        torch.cuda.empty_cache()

+    def clear(self):
+        """
+        Alias for [`Accelerator.free_memory`], releases all references to the internal objects stored and calls the
+        garbage collector. You should call this method between two trainings with different models/optimizers.
+        """
+        self.free_memory()
+
    def _get_named_parameters(self, *args):
        named_parameters = {}
        for obj in args:
@ -655,14 +881,15 @@ class Accelerator:
                is_zero_3 = self.state.deepspeed_plugin.zero_stage == 3

        if is_zero_3:
-            state_dict = model._zero3_consolidated_fp16_state_dict()
+            state_dict = model._zero3_consolidated_16bit_state_dict()
        else:
            model = self.unwrap_model(model)
            state_dict = model.state_dict()

-        for k in state_dict:
-            if state_dict[k].dtype == torch.float16:
-                state_dict[k] = state_dict[k].float()
+        if state_dict is not None:
+            for k in state_dict:
+                if state_dict[k].dtype == torch.float16:
+                    state_dict[k] = state_dict[k].float()

        return state_dict

@ -717,6 +944,6 @@ class Accelerator:
        case the learning rate should not be changed.
        """
        for optimizer in self._optimizers:
-            if optimizer.is_overflow:
+            if optimizer.step_was_skipped:
                return True
        return False
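
The two-pass preparation added to `prepare` above (optimizers in the first pass, LR schedulers in the second so that each scheduler can find its already-wrapped optimizer) can be illustrated with a small standalone sketch. Everything here (`ToyOptimizer`, `ToyScheduler`, `Wrapped`, `ToyPreparer`) is a hypothetical stand-in written for illustration and is not part of Accelerate or PyTorch; it only mirrors the control flow visible in the diff.

```python
class ToyOptimizer:
    """Hypothetical stand-in for an optimizer."""


class ToyScheduler:
    """Hypothetical stand-in for an LR scheduler holding its optimizer."""

    def __init__(self, optimizer):
        self.optimizer = optimizer


class Wrapped:
    """Hypothetical wrapper playing the role of the 'accelerated' objects."""

    def __init__(self, inner, optimizer=None):
        self.inner = inner
        self.optimizer = optimizer


class ToyPreparer:
    def __init__(self):
        self._optimizers = []
        self._schedulers = []

    def _prepare_one(self, obj, first_pass=False):
        # First pass: wrap optimizers and remember them.
        if isinstance(obj, ToyOptimizer) and first_pass:
            wrapped = Wrapped(obj)
            self._optimizers.append(wrapped)
            return wrapped
        # Second pass: wrap schedulers, now that every optimizer is known.
        if isinstance(obj, ToyScheduler) and not first_pass:
            match = self._optimizers  # default: the full list
            for opt in self._optimizers:
                if obj.optimizer is opt.inner:
                    match = opt
                    break
            wrapped = Wrapped(obj, optimizer=match)
            self._schedulers.append(wrapped)
            return wrapped
        # Anything else (including already-wrapped objects) passes through.
        return obj

    def prepare(self, *args):
        result = tuple(self._prepare_one(o, first_pass=True) for o in args)
        return tuple(self._prepare_one(o) for o in result)


opt = ToyOptimizer()
sched = ToyScheduler(opt)
# The scheduler is deliberately passed before its optimizer.
prepared_sched, prepared_opt = ToyPreparer().prepare(sched, opt)
```

Because scheduler handling is deferred to the second pass, the argument order does not matter in this sketch: the wrapped scheduler ends up pointing at the wrapped optimizer even though it was listed first.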
--- a/src/accelerate/big_modeling.py
+++ b/src/accelerate/big_modeling.py
@ -0,0 +1,285 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+from contextlib import contextmanager
+from typing import Dict, List, Optional, Union
+
+import torch
+import torch.nn as nn
+
+from .hooks import AlignDevicesHook, add_hook_to_module, attach_align_device_hook, attach_align_device_hook_on_blocks
+from .utils import (
+    OffloadedWeightsLoader,
+    check_device_map,
+    extract_submodules_state_dict,
+    infer_auto_device_map,
+    load_checkpoint_in_model,
+    offload_state_dict,
+)
+
+
+@contextmanager
+def init_empty_weights(include_buffers: bool = False):
+    """
+    A context manager under which models are initialized with all parameters on the meta device, therefore creating an
+    empty model. Useful when just initializing the model would blow the available RAM.
+
+    Args:
+        include_buffers (`bool`, *optional*, defaults to `False`):
+            Whether or not to also put all buffers on the meta device while initializing.
+
+    Example:
+
+    ```python
+    import torch.nn as nn
+    from accelerate import init_empty_weights
+
+    # Initialize a model with 100 billion parameters in no time and without using any RAM.
+    with init_empty_weights():
+        tst = nn.Sequential(*[nn.Linear(10000, 10000) for _ in range(1000)])
+    ```
+
+    <Tip warning={true}>
+
+    Any model created under this context manager has no weights. As such you can't do something like
+    `model.to(some_device)` with it. To load weights inside your empty model, see [`load_checkpoint_and_dispatch`].
+
+    </Tip>
+    """
+    old_register_parameter = nn.Module.register_parameter
+    if include_buffers:
+        old_register_buffer = nn.Module.register_buffer
+
+    def register_empty_parameter(module, name, param):
+        old_register_parameter(module, name, param)
+        if param is not None:
+            module._parameters[name] = nn.Parameter(module._parameters[name].to(torch.device("meta")))
+
+    def register_empty_buffer(module, name, buffer):
+        old_register_buffer(module, name, buffer)
+        if buffer is not None:
+            module._buffers[name] = module._buffers[name].to(torch.device("meta"))
+
+    try:
+        nn.Module.register_parameter = register_empty_parameter
+        if include_buffers:
+            nn.Module.register_buffer = register_empty_buffer
+        yield
+    finally:
+        nn.Module.register_parameter = old_register_parameter
+        if include_buffers:
+            nn.Module.register_buffer = old_register_buffer
+
+
+def cpu_offload(
+    model: nn.Module,
+    execution_device: Optional[torch.device] = None,
+    offload_buffers: bool = False,
+    state_dict: Optional[Dict[str, torch.Tensor]] = None,
+):
+    """
+    Activates full CPU offload for a model. As a result, all parameters of the model will be offloaded and only one
+    copy of the state dict of the model will be kept. During the forward pass, parameters will be extracted from that
+    state dict and put on the execution device passed as they are needed, then offloaded again.
+
+    Args:
+        model (`torch.nn.Module`):
+            The model to offload.
+        execution_device (`torch.device`, *optional*):
+            The device on which the forward pass of the model will be executed (should be a GPU). Will default to the
+            model's first parameter device.
+        offload_buffers (`bool`, *optional*, defaults to `False`):
+            Whether or not to offload the buffers with the model parameters.
+        state_dict (`Dict[str, torch.Tensor]`, *optional*):
+            The state dict of the model that will be kept on CPU.
+    """
+    if execution_device is None:
+        execution_device = next(iter(model.parameters())).device
+    if state_dict is None:
+        state_dict = {n: p.to("cpu") for n, p in model.state_dict().items()}
+    attach_align_device_hook(
+        model, execution_device=execution_device, offload=True, offload_buffers=offload_buffers, weights_map=state_dict
+    )
+    add_hook_to_module(model, AlignDevicesHook(io_same_device=True))
+    return model
+
+
+def disk_offload(
+    model: nn.Module,
+    offload_dir: Union[str, os.PathLike],
+    execution_device: Optional[torch.device] = None,
+    offload_buffers: bool = False,
+):
+    """
+    Activates full disk offload for a model. As a result, all parameters of the model will be offloaded as
+    memory-mapped arrays in a given folder. During the forward pass, parameters will be accessed from that folder and
+    put on the execution device passed as they are needed, then offloaded again.
+
+    Args:
+        model (`torch.nn.Module`): The model to offload.
+        offload_dir (`str` or `os.PathLike`):
+            The folder in which to offload the model weights (or where the model weights are already offloaded).
+        execution_device (`torch.device`, *optional*):
+            The device on which the forward pass of the model will be executed (should be a GPU). Will default to the
+            model's first parameter device.
+        offload_buffers (`bool`, *optional*, defaults to `False`):
+            Whether or not to offload the buffers with the model parameters.
+    """
+    if not os.path.isdir(offload_dir) or not os.path.isfile(os.path.join(offload_dir, "index.json")):
+        offload_state_dict(offload_dir, model.state_dict())
+    if execution_device is None:
+        execution_device = next(iter(model.parameters())).device
+    weights_map = OffloadedWeightsLoader(save_folder=offload_dir)
+    attach_align_device_hook(
+        model,
+        execution_device=execution_device,
+        offload=True,
+        offload_buffers=offload_buffers,
+        weights_map=weights_map,
+    )
+    add_hook_to_module(model, AlignDevicesHook(io_same_device=True))
+    return model
+
+
+def dispatch_model(
+    model: nn.Module,
+    device_map: Dict[str, Union[str, int, torch.device]],
+    main_device: Optional[torch.device] = None,
+    state_dict: Optional[Dict[str, torch.Tensor]] = None,
+    offload_dir: Union[str, os.PathLike] = None,
+    offload_buffers: bool = False,
+):
+    """
+    Dispatches a model according to a given device map. Layers of the model might be spread across GPUs, offloaded on
+    the CPU or even the disk.
+
+    Args:
+        model (`torch.nn.Module`):
+            The model to dispatch.
+        device_map (`Dict[str, Union[str, int, torch.device]]`):
+            A dictionary mapping module names in the model's `state_dict` to the device they should go to. Note that
+            `"disk"` is accepted even if it's not a proper value for `torch.device`.
+        main_device (`str`, `int` or `torch.device`, *optional*):
+            The main execution device. Will default to the first device in the `device_map` different from `"cpu"` or
+            `"disk"`.
+        state_dict (`Dict[str, torch.Tensor]`, *optional*):
+            The state dict of the part of the model that will be kept on CPU.
+        offload_dir (`str` or `os.PathLike`):
+            The folder in which to offload the model weights (or where the model weights are already offloaded).
+        offload_buffers (`bool`, *optional*, defaults to `False`):
+            Whether or not to offload the buffers with the model parameters.
+    """
+    # Error early if the device map is incomplete.
+    check_device_map(model, device_map)
+
+    if main_device is None:
+        main_device = [d for d in device_map.values() if d not in ["cpu", "disk"]][0]
+
+    cpu_modules = [name for name, device in device_map.items() if device == "cpu"]
+    if state_dict is None and len(cpu_modules) > 0:
+        state_dict = extract_submodules_state_dict(model.state_dict(), cpu_modules)
+
+    disk_modules = [name for name, device in device_map.items() if device == "disk"]
+    if offload_dir is None and len(disk_modules) > 0:
+        raise ValueError(
+            "We need an `offload_dir` to dispatch this model according to this `device_map`, the following submodules "
+            f"need to be offloaded: {', '.join(disk_modules)}."
+        )
+    if len(disk_modules) > 0 and (
+        not os.path.isdir(offload_dir) or not os.path.isfile(os.path.join(offload_dir, "index.json"))
+    ):
+        disk_state_dict = extract_submodules_state_dict(model.state_dict(), disk_modules)
+        offload_state_dict(offload_dir, disk_state_dict)
+
+    execution_device = {
+        name: main_device if device in ["cpu", "disk"] else device for name, device in device_map.items()
+    }
+    offload = {name: device in ["cpu", "disk"] for name, device in device_map.items()}
+    save_folder = offload_dir if len(disk_modules) > 0 else None
+    if state_dict is not None or save_folder is not None:
+        weights_map = OffloadedWeightsLoader(state_dict=state_dict, save_folder=save_folder)
+    else:
+        weights_map = None
+
+    attach_align_device_hook_on_blocks(
+        model,
+        execution_device=execution_device,
+        offload=offload,
+        offload_buffers=offload_buffers,
+        weights_map=weights_map,
+    )
+    model.hf_device_map = device_map
+    return model
+
+
+def load_checkpoint_and_dispatch(
+    model: nn.Module,
+    checkpoint: Union[str, os.PathLike],
+    device_map: Optional[Union[str, Dict[str, Union[int, str, torch.device]]]] = None,
+    max_memory: Optional[Dict[Union[int, str], Union[int, str]]] = None,
+    no_split_module_classes: Optional[List[str]] = None,
+    offload_folder: Optional[Union[str, os.PathLike]] = None,
+    offload_buffers: bool = False,
+    dtype: Optional[Union[str, torch.dtype]] = None,
+    offload_state_dict: bool = False,
+):
+    """
+    Loads a (potentially sharded) checkpoint inside a model, potentially sending weights to a given device as they are
+    loaded and adds the various hooks that will make this model run properly (even if split across devices).
+
+    Args:
+        model (`torch.nn.Module`): The model in which we want to load a checkpoint.
+        checkpoint (`str` or `os.PathLike`):
+            The folder checkpoint to load. It can be:
+            - a path to a file containing a whole model state dict
+            - a path to a `.json` file containing the index to a sharded checkpoint
+            - a path to a folder containing a unique `.index.json` file and the shards of a checkpoint.
+        device_map (`Dict[str, Union[int, str, torch.device]]`, *optional*):
+            A map that specifies where each submodule should go. It doesn't need to be refined to each parameter/buffer
+            name; once a given module name is inside, every submodule of it will be sent to the same device.
+
+            To have Accelerate compute the most optimized `device_map` automatically, set `device_map="auto"`.
+        max_memory (`Dict`, *optional*):
+            A dictionary mapping device identifiers to maximum memory. Will default to the maximum memory available for each GPU
+            and the available CPU RAM if unset.
+        no_split_module_classes (`List[str]`, *optional*):
+            A list of layer class names that should never be split across devices (for instance any layer that has a
+            residual connection).
+        offload_folder (`str` or `os.PathLike`, *optional*):
+            If the `device_map` contains any value `"disk"`, the folder where we will offload weights.
+        offload_buffers (`bool`, *optional*, defaults to `False`):
+            In the layers that are offloaded on the CPU or the hard drive, whether or not to offload the buffers as
+            well as the parameters.
+        dtype (`str` or `torch.dtype`, *optional*):
+            If provided, the weights will be converted to that type when loaded.
+        offload_state_dict (`bool`, *optional*, defaults to `False`):
+            If `True`, will temporarily offload the CPU state dict to the hard drive to avoid running out of CPU RAM if
+            the weight of the CPU state dict + the biggest shard does not fit.
+    """
+    if device_map == "auto":
+        device_map = infer_auto_device_map(
+            model, max_memory=max_memory, no_split_module_classes=no_split_module_classes, dtype=dtype
+        )
+    load_checkpoint_in_model(
+        model,
+        checkpoint,
+        device_map=device_map,
+        offload_folder=offload_folder,
+        dtype=dtype,
+        offload_state_dict=offload_state_dict,
+    )
+    if device_map is None:
+        return model
+    return dispatch_model(model, device_map=device_map, offload_dir=offload_folder, offload_buffers=offload_buffers)
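
As a rough illustration of how `dispatch_model` above derives its per-module settings from a `device_map`, the following sketch reproduces only the dictionary logic from the diff in plain Python. The module names (`embed`, `block1`, `block2`, `head`) are made up for the example, and no model or hooks are involved.

```python
# Hypothetical device map: two modules on GPUs, one on CPU, one on disk.
device_map = {"embed": 0, "block1": 1, "block2": "cpu", "head": "disk"}

# Default main device: first entry that is neither "cpu" nor "disk".
main_device = [d for d in device_map.values() if d not in ["cpu", "disk"]][0]

# Offloaded modules execute on the main device; others run where they live.
execution_device = {
    name: main_device if device in ["cpu", "disk"] else device
    for name, device in device_map.items()
}

# A module is offloaded exactly when it is mapped to "cpu" or "disk".
offload = {name: device in ["cpu", "disk"] for name, device in device_map.items()}

print(main_device)
print(execution_device)
print(offload)
```

In this sketch the CPU and disk modules are both assigned execution device `0`, which matches the docstring's statement that the main device defaults to the first non-CPU, non-disk device in the map.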
--- a/src/accelerate/checkpointing.py
+++ b/src/accelerate/checkpointing.py
@ -21,21 +21,34 @@ import numpy as np
 import torch
 from torch.cuda.amp import GradScaler

-from .state import is_tpu_available
-from .utils import MODEL_NAME, OPTIMIZER_NAME, RNG_STATE_NAME, SCALER_NAME, get_pretty_name, save
+from .utils import (
+    MODEL_NAME,
+    OPTIMIZER_NAME,
+    RNG_STATE_NAME,
+    SCALER_NAME,
+    SCHEDULER_NAME,
+    get_pretty_name,
+    is_tpu_available,
+    save,
+)


 if is_tpu_available():
    import torch_xla.core.xla_model as xm

-import logging
+from .logging import get_logger


-logger = logging.getLogger(__name__)
+logger = get_logger(__name__)


 def save_accelerator_state(
-    output_dir: str, model_states: List[dict], optimizers: list, process_index: int, scaler: GradScaler = None
+    output_dir: str,
+    model_states: List[dict],
+    optimizers: list,
+    schedulers: list,
+    process_index: int,
+    scaler: GradScaler = None,
 ):
    """
    Saves the current states of the models, optimizers, scaler, and RNG generators to a given directory.
@ -47,6 +60,8 @@ def save_accelerator_state(
            A list of model states
        optimizers (`List[torch.optim.Optimizer]`):
            A list of optimizer instances
+        schedulers (`List[torch.optim.lr_scheduler._LRScheduler]`):
+            A list of learning rate schedulers
        process_index (`int`):
            The current process index in the Accelerator state
        scaler (`torch.cuda.amp.GradScaler`, *optional*):
@ -65,6 +80,13 @@ def save_accelerator_state(
        output_optimizer_file = os.path.join(output_dir, optimizer_name)
        save(state, output_optimizer_file)
        logger.info(f"Optimizer state saved in {output_optimizer_file}")
+    # Scheduler states
+    for i, scheduler in enumerate(schedulers):
+        state = scheduler.state_dict()
+        scheduler_name = f"{SCHEDULER_NAME}.bin" if i == 0 else f"{SCHEDULER_NAME}_{i}.bin"
+        output_scheduler_file = os.path.join(output_dir, scheduler_name)
+        save(state, output_scheduler_file)
+        logger.info(f"Scheduler state saved in {output_scheduler_file}")
    # GradScaler state
    if scaler is not None:
        state = scaler.state_dict()
@ -80,14 +102,14 @@ def save_accelerator_state(
    states["torch_cuda_manual_seed"] = torch.cuda.get_rng_state_all()
    # ^^ safe to call this function even if cuda is not available
    if is_tpu_available():
-        states["xm_seed"] = torch.tensor(xm.get_rng_state())
+        states["xm_seed"] = xm.get_rng_state()
    output_states_file = os.path.join(output_dir, states_name)
    torch.save(states, output_states_file)
    logger.info(f"Random states saved in {output_states_file}")
    return output_dir


-def load_accelerator_state(input_dir, models, optimizers, process_index, scaler=None):
+def load_accelerator_state(input_dir, models, optimizers, schedulers, process_index, scaler=None):
    """
    Loads states of the models, optimizers, scaler, and RNG generators from a given directory.

@ -98,6 +120,8 @@ def load_accelerator_state(input_dir, models, optimizers, process_index, scaler=
            A list of model instances
        optimizers (`List[torch.optim.Optimizer]`):
            A list of optimizer instances
+        schedulers (`List[torch.optim.lr_scheduler._LRScheduler]`):
+            A list of learning rate schedulers
        process_index (`int`):
            The current process index in the Accelerator state
        scaler (`torch.cuda.amp.GradScaler`, *optional*):
@ -107,16 +131,23 @@ def load_accelerator_state(input_dir, models, optimizers, process_index, scaler=
    for i, model in enumerate(models):
        weights_name = f"{MODEL_NAME}.bin" if i == 0 else f"{MODEL_NAME}_{i}.bin"
        input_model_file = os.path.join(input_dir, weights_name)
-        models[i].load_state_dict(torch.load(input_model_file))
+        models[i].load_state_dict(torch.load(input_model_file, map_location="cpu"))
    logger.info("All model weights loaded successfully")

    # Optimizer states
    for i, opt in enumerate(optimizers):
        optimizer_name = f"{OPTIMIZER_NAME}.bin" if i == 0 else f"{OPTIMIZER_NAME}_{i}.bin"
        input_optimizer_file = os.path.join(input_dir, optimizer_name)
-        optimizers[i].load_state_dict(torch.load(input_optimizer_file))
+        optimizers[i].load_state_dict(torch.load(input_optimizer_file, map_location="cpu"))
    logger.info("All optimizer states loaded successfully")

+    # Scheduler states
+    for i, scheduler in enumerate(schedulers):
+        scheduler_name = f"{SCHEDULER_NAME}.bin" if i == 0 else f"{SCHEDULER_NAME}_{i}.bin"
+        input_scheduler_file = os.path.join(input_dir, scheduler_name)
+        scheduler.load_state_dict(torch.load(input_scheduler_file))
+    logger.info("All scheduler states loaded successfully")
+
    # GradScaler state
    if scaler is not None:
        input_scaler_file = os.path.join(input_dir, SCALER_NAME)
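
The checkpoint file naming used above for models, optimizers, and now schedulers follows one pattern: the first object gets the bare name and later ones get an index suffix. A minimal sketch of that pattern follows; the helper `checkpoint_filename` is hypothetical, and the base name `"scheduler"` is an assumption, since the actual value of `SCHEDULER_NAME` is not shown in this diff.

```python
def checkpoint_filename(base_name: str, index: int) -> str:
    """Return the file name for the index-th object of a given kind."""
    # Index 0 keeps the bare name; later objects get a numeric suffix.
    return f"{base_name}.bin" if index == 0 else f"{base_name}_{index}.bin"


# "scheduler" is an assumed base name used only for illustration.
names = [checkpoint_filename("scheduler", i) for i in range(3)]
print(names)
```

Since saving and loading both enumerate the objects in the same order, the order in which schedulers were prepared has to be the same when the state is restored.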
--- a/src/accelerate/commands/config/init.py
+++ b/src/accelerate/commands/config/init.py
@ -17,7 +17,7 @@
 import argparse
 import os

-from accelerate.state import ComputeEnvironment
+from accelerate.utils import ComputeEnvironment

 from .cluster import get_cluster_input
 from .config_args import cache_dir, default_config_file, default_yaml_config_file, load_config_from_file  # noqa: F401
--- a/src/accelerate/commands/config/cluster.py
+++ b/src/accelerate/commands/config/cluster.py
@ -14,9 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-from accelerate.state import ComputeEnvironment, DistributedType
-
-from ...utils import is_deepspeed_available
+from ...utils import ComputeEnvironment, DistributedType, is_deepspeed_available
 from .config_args import ClusterConfig
 from .config_utils import _ask_field, _convert_distributed_mode, _convert_yes_no_to_bool

@ -54,16 +52,17 @@ def get_cluster_input():

    if distributed_type == DistributedType.NO:
        use_cpu = _ask_field(
-            "Do you want to run your training on CPU only (even if a GPU is available)? [no]:",
-            lambda x: bool(x),
+            "Do you want to run your training on CPU only (even if a GPU is available)? [yes/NO]:",
+            _convert_yes_no_to_bool,
            default=False,
+            error_message="Please enter yes or no.",
        )
    elif distributed_type == DistributedType.MULTI_CPU:
        use_cpu = True
    else:
        use_cpu = False

-    deepspeed_config = None
+    deepspeed_config = {}
    if distributed_type in [DistributedType.MULTI_GPU, DistributedType.NO]:
        use_deepspeed = _ask_field(
            "Do you want to use DeepSpeed? [yes/NO]: ",
@ -77,7 +76,6 @@ def get_cluster_input():
                is_deepspeed_available()
            ), "DeepSpeed is not installed => run `pip3 install deepspeed` or build it from source"

-        deepspeed_config = {}
        if distributed_type == DistributedType.DEEPSPEED:
            deepspeed_config["zero_stage"] = _ask_field(
                "What should be your DeepSpeed's ZeRO optimization stage (0, 1, 2, 3)? [2]: ",
@ -98,6 +96,34 @@ def get_cluster_input():
                default=1,
            )

+    fsdp_config = {}
+    if distributed_type in [DistributedType.MULTI_GPU]:
+        use_fsdp = _ask_field(
+            "Do you want to use FullyShardedDataParallel? [yes/NO]: ",
+            _convert_yes_no_to_bool,
+            default=False,
+            error_message="Please enter yes or no.",
+        )
+        if use_fsdp:
+            distributed_type = DistributedType.FSDP
+        if distributed_type == DistributedType.FSDP:
+            fsdp_config["sharding_strategy"] = _ask_field(
+                "What should be your sharding strategy ([1] FULL_SHARD, [2] SHARD_GRAD_OP)? [1]: ",
+                lambda x: int(x),
+                default=1,
+            )
+            fsdp_config["offload_params"] = _ask_field(
+                "Do you want to offload parameters and gradients to CPU? [yes/NO]: ",
+                _convert_yes_no_to_bool,
+                default=False,
+                error_message="Please enter yes or no.",
+            )
+            fsdp_config["min_num_params"] = _ask_field(
+                "What should be your FSDP's minimum number of parameters for Default Auto Wrapping Policy? [1e8]: ",
+                lambda x: int(x),
+                default=int(1e8),
+            )
+
    if distributed_type == DistributedType.TPU:
        main_training_function = _ask_field(
            "What is the name of the function in your script that should be launched in all parallel scripts? [main]: ",
@ -106,12 +132,27 @@ def get_cluster_input():
    else:
        main_training_function = "main"

-    num_processes = _ask_field(
-        "How many processes in total will you use? [1]: ",
-        lambda x: int(x),
-        default=1,
-        error_message="Please enter an integer.",
-    )
+    if distributed_type in [DistributedType.MULTI_CPU, DistributedType.MULTI_GPU, DistributedType.TPU]:
+        machine_type = str(distributed_type).split(".")[1].replace("MULTI_", "")
+        if machine_type == "TPU":
+            machine_type += " cores"
+        else:
+            machine_type += "(s)"
+        num_processes = _ask_field(
+            f"How many {machine_type} should be used for distributed training? [1]:",
+            lambda x: int(x),
+            default=1,
+            error_message="Please enter an integer.",
+        )
+    elif distributed_type in [DistributedType.FSDP, DistributedType.DEEPSPEED]:
+        num_processes = _ask_field(
+            "How many GPU(s) should be used for distributed training? [1]:",
+            lambda x: int(x),
+            default=1,
+            error_message="Please enter an integer.",
+        )
+    else:
+        num_processes = 1

    if distributed_type != DistributedType.TPU:
        mixed_precision = _ask_field(
@ -133,5 +174,6 @@ def get_cluster_input():
        main_process_port=main_process_port,
        main_training_function=main_training_function,
        deepspeed_config=deepspeed_config,
+        fsdp_config=fsdp_config,
        use_cpu=use_cpu,
    )
--- a/src/accelerate/commands/config/config_args.py
+++ b/src/accelerate/commands/config/config_args.py
@ -21,7 +21,8 @@ from enum import Enum
 from typing import Optional, Union

 import yaml
-from accelerate.state import ComputeEnvironment, DistributedType, SageMakerDistributedType
+
+from ...utils import ComputeEnvironment, DistributedType, SageMakerDistributedType


 hf_cache_home = os.path.expanduser(
@ -136,6 +137,15 @@ class ClusterConfig(BaseConfig):

    # args for deepspeed_plugin
    deepspeed_config: dict = None
+    # args for fsdp
+    fsdp_config: dict = None
+
+    def __post_init__(self):
+        if self.deepspeed_config is None:
+            self.deepspeed_config = {}
+        if self.fsdp_config is None:
+            self.fsdp_config = {}
+        return super().__post_init__()


@dataclass
--- a/src/accelerate/commands/config/config_utils.py
+++ b/src/accelerate/commands/config/config_utils.py
@ -14,7 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-from accelerate.state import ComputeEnvironment, DistributedType, SageMakerDistributedType
+from ...utils.dataclasses import ComputeEnvironment, DistributedType, SageMakerDistributedType


 def _ask_field(input_text, convert_value=None, default=None, error_message=None):
--- a/src/accelerate/commands/config/sagemaker.py
+++ b/src/accelerate/commands/config/sagemaker.py
@ -16,9 +16,8 @@
 import json
 import os

-from accelerate.state import ComputeEnvironment, SageMakerDistributedType
-from accelerate.utils import is_boto3_available
-
+from ...utils.dataclasses import ComputeEnvironment, SageMakerDistributedType
+from ...utils.imports import is_boto3_available
 from .config_args import SageMakerConfig
 from .config_utils import _ask_field, _convert_sagemaker_distributed_mode

--- a/src/accelerate/commands/launch.py
+++ b/src/accelerate/commands/launch.py
@ -26,8 +26,15 @@ from typing import Dict, List

 from accelerate.commands.config import default_config_file, load_config_from_file
 from accelerate.commands.config.config_args import SageMakerConfig
-from accelerate.state import ComputeEnvironment, DistributedType
-from accelerate.utils import PrepareForLaunch, is_sagemaker_available
+from accelerate.utils import (
+    ComputeEnvironment,
+    DistributedType,
+    PrecisionType,
+    PrepareForLaunch,
+    is_sagemaker_available,
+)
+from accelerate.utils.versions import torch_version
+from packaging import version


 def launch_command_parser(subparsers=None):
@ -51,12 +58,35 @@ def launch_command_parser(subparsers=None):
        action="store_true",
        help="Whether to use deepspeed.",
    )
+    parser.add_argument(
+        "--use_fsdp",
+        default=False,
+        action="store_true",
+        help="Whether to use fsdp.",
+    )
+    parser.add_argument(
+        "--offload_params",
+        default="false",
+        type=str,
+        help="Decides whether (true|false) to offload parameters and gradients to CPU. (useful only when `use_fsdp` flag is passed).",
+    )
+    parser.add_argument(
+        "--min_num_params",
+        type=int,
+        default=int(1e8),
+        help="FSDP's minimum number of parameters for Default Auto Wrapping. (useful only when `use_fsdp` flag is passed).",
+    )
+    parser.add_argument(
+        "--sharding_strategy",
+        type=int,
+        default=1,
+        help="FSDP's Sharding Strategy. (useful only when `use_fsdp` flag is passed).",
+    )
    parser.add_argument(
        "--tpu", default=False, action="store_true", help="Whether or not this should launch a TPU training."
    )
    parser.add_argument(
        "--mixed_precision",
-        default="no",
        type=str,
        choices=["no", "fp16", "bf16"],
        help="Whether or not to use mixed precision training. "
@ -103,6 +133,12 @@ def launch_command_parser(subparsers=None):
        action="store_true",
        help="Skip prepending the training script with 'python' - just execute it directly. Useful when the script is not a Python script.",
    )
+    parser.add_argument(
+        "--num_cpu_threads_per_process",
+        type=int,
+        default=1,
+        help="The number of CPU threads per process. Can be tuned for optimal performance.",
+    )
    parser.add_argument(
        "--aws_access_key_id",
        type=str,
@ -163,10 +199,12 @@ def simple_launcher(args):

    current_env = os.environ.copy()
    current_env["USE_CPU"] = str(args.cpu)
-
-    mixed_precision = args.mixed_precision.lower()
-    if mixed_precision not in ["no", "fp16", "bf16"]:
-        raise ValueError(f"Unknown mixed_precision mode: {mixed_precision}. Choose between 'no', 'fp16' and 'bf16'.")
+    try:
+        mixed_precision = PrecisionType(args.mixed_precision.lower())
+    except ValueError:
+        raise ValueError(
+            f"Unknown mixed_precision mode: {args.mixed_precision.lower()}. Choose between {PrecisionType.list()}."
+        )

    if args.fp16:
        warnings.warn('--fp16 flag is deprecated. Use "--mixed_precision fp16" instead.', DeprecationWarning)
@ -181,7 +219,12 @@ def simple_launcher(args):


 def multi_gpu_launcher(args):
-    cmd = [sys.executable, "-m", "torch.distributed.launch", "--use_env"]
+    if torch_version >= version.parse("1.10.0"):
+        cmd = ["torchrun"]
+    elif torch_version >= version.parse("1.9.0"):
+        cmd = [sys.executable, "-m", "torch.distributed.run"]
+    else:
+        cmd = [sys.executable, "-m", "torch.distributed.launch", "--use_env"]
    if args.num_machines > 1:
        cmd.extend(
            [
@ -212,17 +255,24 @@ def multi_gpu_launcher(args):
    cmd.extend(args.training_script_args)

    current_env = os.environ.copy()
-    mixed_precision = args.mixed_precision.lower()
-
-    if mixed_precision not in ["no", "fp16", "bf16"]:
-        raise ValueError(f"Unknown mixed_precision mode: {mixed_precision}. Choose between 'no', 'fp16' and 'bf16'.")
+    try:
+        mixed_precision = PrecisionType(args.mixed_precision.lower())
+    except ValueError:
+        raise ValueError(
+            f"Unknown mixed_precision mode: {args.mixed_precision.lower()}. Choose between {PrecisionType.list()}."
+        )

    if args.fp16:
        warnings.warn('--fp16 flag is deprecated. Use "--mixed_precision fp16" instead.', DeprecationWarning)
        mixed_precision = "fp16"

    current_env["MIXED_PRECISION"] = str(mixed_precision)
-
+    if args.use_fsdp:
+        current_env["USE_FSDP"] = "true"
+        current_env["FSDP_OFFLOAD_PARAMS"] = str(args.offload_params).lower()
+        current_env["FSDP_MIN_NUM_PARAMS"] = str(args.min_num_params)
+        current_env["FSDP_SHARDING_STRATEGY"] = str(args.sharding_strategy)
+    current_env["OMP_NUM_THREADS"] = str(args.num_cpu_threads_per_process)
    process = subprocess.Popen(cmd, env=current_env)
    process.wait()
    if process.returncode != 0:
@ -230,7 +280,7 @@ def multi_gpu_launcher(args):


 def deepspeed_launcher(args):
-    cmd = ["deepspeed"]
+    cmd = ["deepspeed", "--no_local_rank"]
    if args.num_machines > 1:
        cmd.extend(
            [
@ -259,10 +309,12 @@ def deepspeed_launcher(args):
    cmd.extend(args.training_script_args)

    current_env = os.environ.copy()
-    mixed_precision = args.mixed_precision.lower()
-
-    if mixed_precision not in ["no", "fp16", "bf16"]:
-        raise ValueError(f"Unknown mixed_precision mode: {mixed_precision}. Choose between 'no', 'fp16' and 'bf16'.")
+    try:
+        mixed_precision = PrecisionType(args.mixed_precision.lower())
+    except ValueError:
+        raise ValueError(
+            f"Unknown mixed_precision mode: {args.mixed_precision.lower()}. Choose between {PrecisionType.list()}."
+        )

    if args.fp16:
        warnings.warn('--fp16 flag is deprecated. Use "--mixed_precision fp16" instead.', DeprecationWarning)
@ -272,7 +324,7 @@ def deepspeed_launcher(args):
    current_env["USE_DEEPSPEED"] = "true"
    current_env["DEEPSPEED_ZERO_STAGE"] = str(args.zero_stage)
    current_env["GRADIENT_ACCUMULATION_STEPS"] = str(args.gradient_accumulation_steps)
-    current_env["DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE"] = str(args.offload_optimizer_device)
+    current_env["DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE"] = str(args.offload_optimizer_device).lower()

    process = subprocess.Popen(cmd, env=current_env)
    process.wait()
@ -388,10 +440,12 @@ def sagemaker_launcher(sagemaker_config: SageMakerConfig, args):
    print("Converting Arguments to Hyperparameters")
    hyperparameters = _convert_nargs_to_dict(args.training_script_args)

-    mixed_precision = args.mixed_precision.lower()
-
-    if mixed_precision not in ["no", "fp16", "bf16"]:
-        raise ValueError(f"Unknown mixed_precision mode: {mixed_precision}. Choose between 'no', 'fp16' and 'bf16'.")
+    try:
+        mixed_precision = PrecisionType(args.mixed_precision.lower())
+    except ValueError:
+        raise ValueError(
+            f"Unknown mixed_precision mode: {args.mixed_precision.lower()}. Choose between {PrecisionType.list()}."
+        )

    if args.fp16:
        warnings.warn('--fp16 flag is deprecated. Use "--mixed_precision fp16" instead.', DeprecationWarning)
@ -426,8 +480,8 @@ def sagemaker_launcher(sagemaker_config: SageMakerConfig, args):

 def launch_command(args):
    # Sanity checks
-    if sum([args.multi_gpu, args.tpu, args.use_deepspeed]) > 1:
-        raise ValueError("You can only pick one between `--multi_gpu`, `--use_deepspeed`, `--tpu`.")
+    if sum([args.multi_gpu, args.tpu, args.use_deepspeed, args.use_fsdp]) > 1:
+        raise ValueError("You can only pick one between `--multi_gpu`, `--use_deepspeed`, `--tpu`, `--use_fsdp`.")

    defaults = None
    # Get the default from the config file.
@ -437,6 +491,7 @@ def launch_command(args):
            args.use_deepspeed = defaults.distributed_type == DistributedType.DEEPSPEED
            args.multi_gpu = defaults.distributed_type == DistributedType.MULTI_GPU
            args.tpu = defaults.distributed_type == DistributedType.TPU
+            args.use_fsdp = defaults.distributed_type == DistributedType.FSDP
        if defaults.compute_environment == ComputeEnvironment.LOCAL_MACHINE:
            # Update args with the defaults
            for name, attr in defaults.__dict__.items():
@ -444,6 +499,8 @@ def launch_command(args):
                    for k in defaults.deepspeed_config:
                        if getattr(args, k) is None:
                            setattr(args, k, defaults.deepspeed_config[k])
+                    for k in defaults.fsdp_config:
+                        setattr(args, k, defaults.fsdp_config[k])
                    continue

                # Those args are handled separately
@ -465,6 +522,8 @@ def launch_command(args):
    # Use the proper launcher
    if args.use_deepspeed and not args.cpu:
        deepspeed_launcher(args)
+    elif args.use_fsdp and not args.cpu:
+        multi_gpu_launcher(args)
    elif args.multi_gpu and not args.cpu:
        multi_gpu_launcher(args)
    elif args.tpu and not args.cpu:
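
The launchers above now validate `--mixed_precision` by constructing a `PrecisionType` enum rather than checking against a hard-coded list. The definition of `PrecisionType` is not part of this hunk, so the following is only a minimal sketch of that pattern using a hypothetical stand-in enum named `Precision` and a hypothetical helper `parse_precision`; the real class may differ.

```python
import enum


class Precision(enum.Enum):
    # Hypothetical stand-in for the PrecisionType enum referenced in the diff.
    NO = "no"
    FP16 = "fp16"
    BF16 = "bf16"

    def __str__(self):
        # Makes str(member) usable directly as an environment variable value.
        return self.value

    @classmethod
    def list(cls):
        # All accepted values, used to build the error message.
        return [member.value for member in cls]


def parse_precision(raw):
    # Enum lookup by value raises ValueError for unknown modes,
    # which is re-raised with a friendlier message.
    try:
        return Precision(raw.lower())
    except ValueError:
        raise ValueError(
            f"Unknown mixed_precision mode: {raw.lower()}. Choose between {Precision.list()}."
        ) from None
```

With this shape, adding a new precision mode only requires a new enum member; the validation and the error message pick it up without further changes.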
--- a/src/accelerate/data_loader.py
+++ b/src/accelerate/data_loader.py
@ -326,12 +326,21 @@ class DataLoaderDispatcher(DataLoader):
    """

    def __init__(self, dataset, split_batches: bool = False, **kwargs):
+        shuffle = False
+        if version.parse(torch.__version__) >= version.parse("1.11.0"):
+            from torch.utils.data.datapipes.iter.combinatorics import ShufflerIterDataPipe
+
+            # We need to save the shuffling state of the DataPipe
+            if isinstance(dataset, ShufflerIterDataPipe):
+                shuffle = dataset._shuffle_enabled
        super().__init__(dataset, **kwargs)
        self.split_batches = split_batches
        if version.parse(torch.__version__) < version.parse("1.8.0"):
            raise ImportError(
                "Using `DataLoaderDispatcher` requires PyTorch 1.8.0 minimum. You have {torch.__version__}."
            )
+        if shuffle:
+            torch.utils.data.graph_settings.apply_shuffle_settings(dataset, shuffle=shuffle)

    def __iter__(self):
        state = AcceleratorState()
--- a/src/accelerate/hooks.py
+++ b/src/accelerate/hooks.py
@ -0,0 +1,411 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import functools
+from typing import Dict, Mapping, Optional, Union
+
+import torch
+import torch.nn as nn
+
+from .utils import PrefixedDataset, find_device, named_module_tensors, send_to_device, set_module_tensor_to_device
+
+
+class ModelHook:
+    """
+    A hook that contains callbacks to be executed just before and after the forward method of a model. Unlike
+    PyTorch's existing hooks, these callbacks also receive the kwargs.
+
+    Class attribute:
+    - **no_grad** (`bool`, *optional*, defaults to `False`) -- Whether or not to execute the actual forward pass under
+      the `torch.no_grad()` context manager.
+    """
+
+    no_grad = False
+
+    def init_hook(self, module):
+        """
+        To be executed when the hook is attached to the module.
+
+        Args:
+            module (`torch.nn.Module`): The module attached to this hook.
+        """
+        return module
+
+    def pre_forward(self, module, *args, **kwargs):
+        """
+        To be executed just before the forward method of the model.
+
+        Args:
+            module (`torch.nn.Module`): The module whose forward pass will be executed just after this event.
+            args (`Tuple[Any]`): The positional arguments passed to the module.
+            kwargs (`Dict[str, Any]`): The keyword arguments passed to the module.
+
+        Returns:
+            `Tuple[Tuple[Any], Dict[str, Any]]`: A tuple with the treated `args` and `kwargs`.
+        """
+        return args, kwargs
+
+    def post_forward(self, module, output):
+        """
+        To be executed just after the forward method of the model.
+
+        Args:
+            module (`torch.nn.Module`): The module whose forward pass has been executed just before this event.
+            output (`Any`): The output of the module.
+
+        Returns:
+            `Any`: The processed `output`.
+        """
+        return output
+
+    def detach_hook(self, module):
+        """
+        To be executed when the hook is detached from a module.
+
+        Args:
+            module (`torch.nn.Module`): The module detached from this hook.
+        """
+        return module
+
+
+class SequentialHook(ModelHook):
+    """
+    A hook that can contain several hooks and iterates through them at each event.
+    """
+
+    def __init__(self, *hooks):
+        self.hooks = hooks
+
+    def init_hook(self, module):
+        for hook in self.hooks:
+            module = hook.init_hook(module)
+        return module
+
+    def pre_forward(self, module, *args, **kwargs):
+        for hook in self.hooks:
+            args, kwargs = hook.pre_forward(module, *args, **kwargs)
+        return args, kwargs
+
+    def post_forward(self, module, output):
+        for hook in self.hooks:
+            output = hook.post_forward(module, output)
+        return output
+
+    def detach_hook(self, module):
+        for hook in self.hooks:
+            module = hook.detach_hook(module)
+        return module
+
+
+def add_hook_to_module(module: nn.Module, hook: ModelHook):
+    """
+    Adds a hook to a given module. This will rewrite the `forward` method of the module to include the hook. To remove
+    this behavior and restore the original `forward` method, use `remove_hook_from_module`.
+
+    <Tip warning={true}>
+
+    If the module already contains a hook, this will replace it with the new hook passed. To chain two hooks together,
+    use the `SequentialHook` class.
+
+    </Tip>
+
+    Args:
+        module (`torch.nn.Module`): The module to attach a hook to.
+        hook (`ModelHook`): The hook to attach.
+
+    Returns:
+        `torch.nn.Module`: The same module, with the hook attached (the module is modified in place, so the result can
+        be discarded).
+    """
+    if hasattr(module, "_hf_hook") and hasattr(module, "_old_forward"):
+        # If we already put some hook on this module, we replace it with the new one.
+        old_forward = module._old_forward
+    else:
+        old_forward = module.forward
+        module._old_forward = old_forward
+
+    module = hook.init_hook(module)
+    module._hf_hook = hook
+
+    @functools.wraps(old_forward)
+    def new_forward(*args, **kwargs):
+        args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
+        if module._hf_hook.no_grad:
+            with torch.no_grad():
+                output = old_forward(*args, **kwargs)
+        else:
+            output = old_forward(*args, **kwargs)
+        return module._hf_hook.post_forward(module, output)
+
+    module.forward = new_forward
+    return module
+
+
+def remove_hook_from_module(module: nn.Module):
+    """
+    Removes any hook attached to a module via `add_hook_to_module`.
+
+    Args:
+        module (`torch.nn.Module`): The module to detach the hook from.
+
+    Returns:
+        `torch.nn.Module`: The same module, with the hook detached (the module is modified in place, so the result can
+        be discarded).
+    """
+    if hasattr(module, "_hf_hook"):
+        module._hf_hook.detach_hook(module)
+        delattr(module, "_hf_hook")
+
+    if hasattr(module, "_old_forward"):
+        module.forward = module._old_forward
+        delattr(module, "_old_forward")
+
+    return module
+
+
+class AlignDevicesHook(ModelHook):
+    """
+    A generic `ModelHook` that ensures inputs and model weights are on the same device for the forward pass of the
+    associated module, potentially offloading the weights after the forward pass.
+
+    Args:
+        execution_device (`torch.device`, *optional*):
+            The device on which inputs and model weights should be placed before the forward pass.
+        offload (`bool`, *optional*, defaults to `False`):
+            Whether or not the weights should be offloaded after the forward pass.
+        io_same_device (`bool`, *optional*, defaults to `False`):
+            Whether or not the output should be placed on the same device as the input was.
+        weights_map (`Mapping[str, torch.Tensor]`, *optional*):
+            When the model weights are offloaded, a (potentially lazy) map from param names to the tensor values.
+        offload_buffers (`bool`, *optional*, defaults to `False`):
+            Whether or not to include the associated module's buffers when offloading.
+        place_submodules (`bool`, *optional*, defaults to `False`):
+            Whether to place the submodules on `execution_device` during the `init_hook` event.
+    """
+
+    def __init__(
+        self,
+        execution_device: Optional[Union[int, str, torch.device]] = None,
+        offload: bool = False,
+        io_same_device: bool = False,
+        weights_map: Optional[Mapping] = None,
+        offload_buffers: bool = False,
+        place_submodules: bool = False,
+    ):
+        self.execution_device = execution_device
+        self.offload = offload
+        self.io_same_device = io_same_device
+        self.weights_map = weights_map
+        self.offload_buffers = offload_buffers
+        self.place_submodules = place_submodules
+
+        # Will contain the input device when `io_same_device=True`.
+        self.input_device = None
+        self.param_original_devices = {}
+        self.buffer_original_devices = {}
+
+    def init_hook(self, module):
+        if not self.offload and self.execution_device is not None:
+            for name, _ in named_module_tensors(module, recurse=self.place_submodules):
+                set_module_tensor_to_device(module, name, self.execution_device)
+        elif self.offload:
+            self.original_devices = {name: param.device for name, param in named_module_tensors(module)}
+            if self.weights_map is None:
+                self.weights_map = {
+                    name: param.to("cpu")
+                    for name, param in named_module_tensors(module, include_buffers=self.offload_buffers)
+                }
+
+            for name, _ in named_module_tensors(module, include_buffers=self.offload_buffers):
+                set_module_tensor_to_device(module, name, "meta")
+            if not self.offload_buffers and self.execution_device is not None:
+                for name, _ in module.named_buffers(recurse=False):
+                    set_module_tensor_to_device(module, name, self.execution_device)
+        return module
+
+    def pre_forward(self, module, *args, **kwargs):
+        if self.io_same_device:
+            self.input_device = find_device([args, kwargs])
+        if self.offload:
+            for name, _ in named_module_tensors(module, include_buffers=self.offload_buffers):
+                set_module_tensor_to_device(module, name, self.execution_device, value=self.weights_map[name])
+
+        return send_to_device(args, self.execution_device), send_to_device(kwargs, self.execution_device)
+
+    def post_forward(self, module, output):
+        if self.offload:
+            for name, _ in named_module_tensors(module, include_buffers=self.offload_buffers):
+                set_module_tensor_to_device(module, name, "meta")
+
+        if self.io_same_device and self.input_device is not None:
+            output = send_to_device(output, self.input_device)
+
+        return output
+
+    def detach_hook(self, module):
+        if self.offload:
+            for name, device in self.original_devices.items():
+                if device != torch.device("meta"):
+                    set_module_tensor_to_device(module, name, device, value=self.weights_map.get(name, None))
+        return module
+
+
+def attach_align_device_hook(
+    module: torch.nn.Module,
+    execution_device: Optional[torch.device] = None,
+    offload: bool = False,
+    weights_map: Optional[Mapping] = None,
+    offload_buffers: bool = False,
+    module_name: str = "",
+):
+    """
+    Recursively attaches `AlignDevicesHook` to all submodules of a given model that have direct parameters and/or
+    buffers.
+
+    Args:
+        module (`torch.nn.Module`):
+            The module where we want to attach the hooks.
+        execution_device (`torch.device`, *optional*):
+            The device on which inputs and model weights should be placed before the forward pass.
+        offload (`bool`, *optional*, defaults to `False`):
+            Whether or not the weights should be offloaded after the forward pass.
+        weights_map (`Mapping[str, torch.Tensor]`, *optional*):
+            When the model weights are offloaded, a (potentially lazy) map from param names to the tensor values.
+        offload_buffers (`bool`, *optional*, defaults to `False`):
+            Whether or not to include the associated module's buffers when offloading.
+        module_name (`str`, *optional*, defaults to `""`):
+            The name of the module.
+    """
+    # Attach the hook on this module if it has any direct tensor.
+    directs = named_module_tensors(module)
+    if len(list(directs)) > 0:
+        if weights_map is not None:
+            prefix = f"{module_name}." if len(module_name) > 0 else ""
+            prefixed_weights_map = PrefixedDataset(weights_map, prefix)
+        else:
+            prefixed_weights_map = None
+        hook = AlignDevicesHook(
+            execution_device=execution_device,
+            offload=offload,
+            weights_map=prefixed_weights_map,
+            offload_buffers=offload_buffers,
+        )
+        add_hook_to_module(module, hook)
+
+    # Recurse on all children of the module.
+    for child_name, child in module.named_children():
+        child_name = f"{module_name}.{child_name}" if len(module_name) > 0 else child_name
+        attach_align_device_hook(
+            child,
+            execution_device=execution_device,
+            offload=offload,
+            weights_map=weights_map,
+            offload_buffers=offload_buffers,
+            module_name=child_name,
+        )
+
+
+def remove_hook_from_submodules(module: nn.Module):
+    """
+    Recursively removes all hooks attached on the submodules of a given model.
+
+    Args:
+        module (`torch.nn.Module`): The module on which to remove all hooks.
+    """
+    remove_hook_from_module(module)
+    for child in module.children():
+        remove_hook_from_submodules(child)
+
+
+def attach_align_device_hook_on_blocks(
+    module: nn.Module,
+    execution_device: Optional[Union[torch.device, Dict[str, torch.device]]] = None,
+    offload: Union[bool, Dict[str, bool]] = False,
+    weights_map: Mapping = None,
+    offload_buffers: bool = False,
+    module_name: str = "",
+):
+    """
+    Attaches `AlignDevicesHook` to all blocks of a given model as needed.
+
+    Args:
+        module (`torch.nn.Module`):
+            The module where we want to attach the hooks.
+        execution_device (`torch.device` or `Dict[str, torch.device]`, *optional*):
+            The device on which inputs and model weights should be placed before the forward pass. It can be one device
+            for the whole module, or a dictionary mapping module name to device.
+        offload (`bool`, *optional*, defaults to `False`):
+            Whether or not the weights should be offloaded after the forward pass. It can be one boolean for the whole
+            module, or a dictionary mapping module name to boolean.
+        weights_map (`Mapping[str, torch.Tensor]`, *optional*):
+            When the model weights are offloaded, a (potentially lazy) map from param names to the tensor values.
+        offload_buffers (`bool`, *optional*, defaults to `False`):
+            Whether or not to include the associated module's buffers when offloading.
+        module_name (`str`, *optional*, defaults to `""`):
+            The name of the module.
+    """
+    # If one device and one offload, we've got one hook.
+    if not isinstance(execution_device, Mapping) and not isinstance(offload, dict):
+        if not offload:
+            hook = AlignDevicesHook(execution_device=execution_device, io_same_device=True, place_submodules=True)
+            add_hook_to_module(module, hook)
+        else:
+            attach_align_device_hook(
+                module,
+                execution_device=execution_device,
+                offload=True,
+                weights_map=weights_map,
+                offload_buffers=offload_buffers,
+                module_name=module_name,
+            )
+        return
+
+    if not isinstance(execution_device, Mapping):
+        execution_device = {key: execution_device for key in offload.keys()}
+    if not isinstance(offload, Mapping):
+        offload = {key: offload for key in execution_device.keys()}
+
+    if module_name in execution_device and not offload[module_name]:
+        hook = AlignDevicesHook(
+            execution_device=execution_device[module_name],
+            offload_buffers=offload_buffers,
+            io_same_device=(module_name == ""),
+            place_submodules=True,
+        )
+        add_hook_to_module(module, hook)
+    elif module_name in execution_device:
+        attach_align_device_hook(
+            module,
+            execution_device=execution_device[module_name],
+            offload=True,
+            weights_map=weights_map,
+            offload_buffers=offload_buffers,
+            module_name=module_name,
+        )
+        if not hasattr(module, "_hf_hook"):
+            hook = AlignDevicesHook(execution_device=execution_device[module_name], io_same_device=(module_name == ""))
+            add_hook_to_module(module, hook)
+    elif module_name == "":
+        hook = AlignDevicesHook(io_same_device=True)
+        add_hook_to_module(module, hook)
+
+    for child_name, child in module.named_children():
+        child_name = f"{module_name}.{child_name}" if len(module_name) > 0 else child_name
+        attach_align_device_hook_on_blocks(
+            child,
+            execution_device=execution_device,
+            offload=offload,
+            weights_map=weights_map,
+            offload_buffers=offload_buffers,
+            module_name=child_name,
+        )
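
The broadcasting step in `attach_align_device_hook_on_blocks`, where a single device or a single offload flag is expanded over the keys of the other argument, can be sketched in isolation. This is a minimal illustration in plain Python; `normalize_placement` is a hypothetical helper name and is not part of Accelerate.

```python
from collections.abc import Mapping


def normalize_placement(execution_device, offload):
    """Broadcast a scalar `execution_device` or `offload` over the other argument's keys.

    Hypothetical helper for illustration only; it is not part of Accelerate.
    """
    if not isinstance(execution_device, Mapping) and not isinstance(offload, Mapping):
        raise ValueError("At least one argument must be a mapping to broadcast over.")
    if not isinstance(execution_device, Mapping):
        # One device for every module named in `offload`.
        execution_device = {key: execution_device for key in offload}
    if not isinstance(offload, Mapping):
        # One offload flag for every module named in `execution_device`.
        offload = {key: offload for key in execution_device}
    return execution_device, offload


devices, offloads = normalize_placement({"encoder": 0, "decoder": "cpu"}, False)
print(devices)
print(offloads)
```

After this step both arguments are mappings with the same keys, so the per-module lookups by `module_name` are well defined.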
--- a/src/accelerate/kwargs_handlers.py
+++ b/src/accelerate/kwargs_handlers.py
@ -1,90 +0,0 @@
-# Copyright 2021 The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import copy
-from dataclasses import dataclass
-from datetime import timedelta
-from typing import Optional
-
-
-class KwargsHandler:
-    """
-    Internal mixin that implements a `to_kwargs()` method for a dataclass.
-    """
-
-    def to_dict(self):
-        return copy.deepcopy(self.__dict__)
-
-    def to_kwargs(self):
-        """
-        Returns a dictionary containing the attributes with values different from the default of this class.
-        """
-        default_dict = self.__class__().to_dict()
-        this_dict = self.to_dict()
-        return {k: v for k, v in this_dict.items() if default_dict[k] != v}
-
-
-@dataclass
-class DistributedDataParallelKwargs(KwargsHandler):
-    """
-    Use this object in your [`Accelerator`] to customize how your model is wrapped in a
-    `torch.nn.parallel.DistributedDataParallel`. Please refer to the documentation of this
-    [wrapper](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) for more
-    information on each argument.
-
-    <Tip warning={true}>
-
-    `gradient_as_bucket_view` is only available in PyTorch 1.7.0 and later versions.
-
-    </Tip>"""
-
-    dim: int = 0
-    broadcast_buffers: bool = True
-    bucket_cap_mb: int = 25
-    find_unused_parameters: bool = False
-    check_reduction: bool = False
-    gradient_as_bucket_view: bool = False
-
-
-@dataclass
-class GradScalerKwargs(KwargsHandler):
-    """
-    Use this object in your [`Accelerator`] to customize the behavior of mixed precision, specifically how the
-    `torch.cuda.amp.GradScaler` used is created. Please refer to the documentation of this
-    [scaler](https://pytorch.org/docs/stable/amp.html?highlight=gradscaler) for more information on each argument.
-
-    <Tip warning={true}>
-
-    `GradScaler` is only available in PyTorch 1.5.0 and later versions.
-
-    </Tip>"""
-
-    init_scale: float = 65536.0
-    growth_factor: float = 2.0
-    backoff_factor: float = 0.5
-    growth_interval: int = 2000
-    enabled: bool = True
-
-
-@dataclass
-class InitProcessGroupKwargs(KwargsHandler):
-    """
-    Use this object in your [`Accelerator`] to customize the initialization of the distributed processes. Please refer
-    to the documentation of this
-    [method](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) for more
-    information on each argument.
-    """
-
-    init_method: Optional[str] = None
-    timeout: timedelta = timedelta(seconds=1800)
--- a/src/accelerate/launchers.py
+++ b/src/accelerate/launchers.py
@ -22,7 +22,7 @@ import torch
 from packaging import version

 from .state import AcceleratorState
-from .utils import PrepareForLaunch, patch_environment
+from .utils import PrecisionType, PrepareForLaunch, patch_environment


 def notebook_launcher(function, args=(), num_processes=None, use_fp16=False, mixed_precision="no", use_port="29500"):
@ -80,7 +80,7 @@ def notebook_launcher(function, args=(), num_processes=None, use_fp16=False, mix
    else:
        if num_processes is None:
            raise ValueError(
-                "You have to specify the number of GPUs you would like to use, add `num_process=...` to your call."
+                "You have to specify the number of GPUs you would like to use, add `num_processes=...` to your call."
            )

        if num_processes > 1:
@ -107,10 +107,11 @@ def notebook_launcher(function, args=(), num_processes=None, use_fp16=False, mix
                    "function."
                )

-            mixed_precision = mixed_precision.lower()
-            if mixed_precision not in ["no", "fp16", "bf16"]:
+            try:
+                mixed_precision = PrecisionType(mixed_precision.lower())
+            except ValueError:
                raise ValueError(
-                    f"Unknown mixed_precision: {mixed_precision}. Choose between 'no', 'fp16' and 'bf16'."
+                    f"Unknown mixed_precision mode: {mixed_precision.lower()}. Choose between {PrecisionType.list()}."
                )

            if use_fp16:
--- a/src/accelerate/logging.py
+++ b/src/accelerate/logging.py
@ -0,0 +1,63 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import logging
+
+from .state import AcceleratorState
+
+
+class MultiProcessAdapter(logging.LoggerAdapter):
+    """
+    An adapter to assist with logging in multiprocess.
+
+    `log` accepts an extra `main_process_only` kwarg that controls whether a message is logged on all processes or
+    only on the main one. Defaults to `main_process_only=True`.
+    """
+
+    @staticmethod
+    def _should_log(main_process_only):
+        "Check if log should be performed"
+        return not main_process_only or (main_process_only and AcceleratorState().local_process_index == 0)
+
+    def log(self, level, msg, *args, **kwargs):
+        """
+        Delegates logger call after checking if we should log.
+
+        Accepts an extra `main_process_only` kwarg that controls whether the message is logged on all processes or
+        only on the main one. Defaults to `True` if not passed.
+        """
+        main_process_only = kwargs.pop("main_process_only", True)
+        if self.isEnabledFor(level) and self._should_log(main_process_only):
+            msg, kwargs = self.process(msg, kwargs)
+            self.logger.log(level, msg, *args, **kwargs)
+
+
+def get_logger(name: str):
+    """
+    Returns a `logging.Logger` for `name` that can handle multiprocessing.
+
+    If a log should be called on all processes, pass `main_process_only=False`
+
+    E.g.
+    ```python
+    logger.info("My log", main_process_only=False)
+    logger.debug("My log", main_process_only=False)
+    ```
+
+    Args:
+        name (`str`):
+            The name for the logger, such as `__file__`
+    """
+    logger = logging.getLogger(name)
+    return MultiProcessAdapter(logger, {})
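
The adapter pattern used by `MultiProcessAdapter` can be tried without a distributed setup. The sketch below uses only the standard library; `MainProcessAdapter`, its `is_main_process` argument, and `ListHandler` are hypothetical stand-ins (the real class queries `AcceleratorState` instead of taking a flag).

```python
import logging


class MainProcessAdapter(logging.LoggerAdapter):
    """Drops records on non-main processes unless `main_process_only=False` is passed."""

    def __init__(self, logger, is_main_process):
        super().__init__(logger, {})
        # Hypothetical stand-in for the process index check done via AcceleratorState.
        self.is_main_process = is_main_process

    def log(self, level, msg, *args, **kwargs):
        # Remove the custom kwarg before delegating, since `Logger.log` does not accept it.
        main_process_only = kwargs.pop("main_process_only", True)
        if self.isEnabledFor(level) and (self.is_main_process or not main_process_only):
            msg, kwargs = self.process(msg, kwargs)
            self.logger.log(level, msg, *args, **kwargs)


class ListHandler(logging.Handler):
    """Collects formatted messages in memory so the behaviour can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


base = logging.getLogger("adapter_sketch")
base.setLevel(logging.INFO)
base.propagate = False
handler = ListHandler()
base.addHandler(handler)

# Simulate a worker that is not the main process.
worker = MainProcessAdapter(base, is_main_process=False)
worker.info("skipped on this process")
worker.info("logged everywhere", main_process_only=False)
print(handler.messages)
```

Overriding `log` is enough because the convenience methods of `logging.LoggerAdapter` (`info`, `debug`, and so on) delegate to it.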
--- a/src/accelerate/memory_utils.py
+++ b/src/accelerate/memory_utils.py
@ -0,0 +1,29 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# flake8: noqa
+# There's no way to ignore "F401 '...' imported but unused" warnings in this
+# module, but to preserve other warnings. So, don't check this module at all
+
+
+import warnings
+
+
+warnings.warn(
+    "memory_utils has been reorganized to utils.memory. Import `find_executable_batch_size` from the main `__init__`: "
+    "`from accelerate import find_executable_batch_size` to avoid this warning.",
+    FutureWarning,
+)
+
+from .utils.memory import find_executable_batch_size
--- a/src/accelerate/optimizer.py
+++ b/src/accelerate/optimizer.py
@ -13,13 +13,14 @@
 # limitations under the License.

 import inspect
+import warnings

 import torch

 from packaging import version

-from .state import AcceleratorState, DistributedType, is_tpu_available
-from .utils import honor_type
+from .state import AcceleratorState
+from .utils import DistributedType, honor_type, is_tpu_available


 if is_tpu_available():
@ -141,4 +142,14 @@ class AcceleratedOptimizer(torch.optim.Optimizer):
    @property
    def is_overflow(self):
        """Whether or not the optimizer step was done, or skipped because of gradient overflow."""
+        warnings.warn(
+            "The `is_overflow` property is deprecated and will be removed in version 1.0 of Accelerate. Use "
+            "`optimizer.step_was_skipped` instead.",
+            FutureWarning,
+        )
+        return self._is_overflow
+
+    @property
+    def step_was_skipped(self):
+        """Whether or not the optimizer step was skipped."""
        return self._is_overflow
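
The rename from `is_overflow` to `step_was_skipped` follows a common deprecation pattern: the old property stays as a thin alias that warns. A minimal sketch, assuming a hypothetical `OptimizerSketch` class in place of `AcceleratedOptimizer`:

```python
import warnings


class OptimizerSketch:
    """Hypothetical stand-in for an optimizer wrapper; not the Accelerate class."""

    def __init__(self):
        self._is_overflow = False

    @property
    def step_was_skipped(self):
        """New name: whether the last optimizer step was skipped."""
        return self._is_overflow

    @property
    def is_overflow(self):
        """Old name: still works, but emits a FutureWarning."""
        warnings.warn(
            "`is_overflow` is deprecated, use `step_was_skipped` instead.",
            FutureWarning,
        )
        return self.step_was_skipped


opt = OptimizerSketch()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = opt.is_overflow

print(value, [w.category.__name__ for w in caught])
```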
--- a/src/accelerate/scheduler.py
+++ b/src/accelerate/scheduler.py
@ -0,0 +1,80 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .state import AcceleratorState
+
+
+class AcceleratedScheduler:
+    """
+    A wrapper around a learning rate scheduler that will only step when the optimizer(s) have a training step. Useful
+    to avoid making a scheduler step too fast when:
+
+    - gradients overflowed and there was no training step (in mixed precision training)
+    - step was skipped because of gradient accumulation
+
+    Args:
+        scheduler (`torch.optim.lr_scheduler._LRScheduler`):
+            The scheduler to wrap.
+        optimizers (one or a list of `torch.optim.Optimizer`):
+            The optimizers used.
+        step_with_optimizer (`bool`, *optional*, defaults to `True`):
+            Whether or not the scheduler should be stepped at each optimizer step.
+        split_batches (`bool`, *optional*, defaults to `False`):
+            Whether or not the dataloaders split one batch across the different processes (so batch size is the same
+            regardless of the number of processes) or create batches on each process (so batch size is the original
+            batch size multiplied by the number of processes).
+    """
+
+    def __init__(self, scheduler, optimizers, step_with_optimizer: bool = True, split_batches: bool = False):
+        self.scheduler = scheduler
+        self.optimizers = optimizers if isinstance(optimizers, (list, tuple)) else [optimizers]
+        self.split_batches = split_batches
+        self.step_with_optimizer = step_with_optimizer
+
+    def step(self, *args, **kwargs):
+        if not self.step_with_optimizer:
+            # No link between scheduler and optimizer -> just step
+            self.scheduler.step(*args, **kwargs)
+            return
+
+        # Otherwise, first make sure the optimizer was stepped.
+        for opt in self.optimizers:
+            if opt.step_was_skipped:
+                return
+
+        if self.split_batches:
+            # Split batches -> the training dataloader batch size is not changed so one step per training step
+            self.scheduler.step(*args, **kwargs)
+        else:
+            # Otherwise the training dataloader batch size was multiplied by `num_processes`, so we need to do
+            # num_processes steps per training step
+            num_processes = AcceleratorState().num_processes
+            for _ in range(num_processes):
+                self.scheduler.step(*args, **kwargs)
+
+    # Passthroughs
+    def get_last_lr(self):
+        return self.scheduler.get_last_lr()
+
+    def state_dict(self):
+        return self.scheduler.state_dict()
+
+    def load_state_dict(self, state_dict):
+        self.scheduler.load_state_dict(state_dict)
+
+    def get_lr(self):
+        return self.scheduler.get_lr()
+
+    def print_lr(self, *args, **kwargs):
+        return self.scheduler.print_lr(*args, **kwargs)
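
The stepping rule in `AcceleratedScheduler.step` can be checked with plain Python objects. In the sketch below, `CountingScheduler`, `FakeOptimizer`, and `scheduler_step` are hypothetical names, and `num_processes` is passed explicitly instead of being read from `AcceleratorState`.

```python
class CountingScheduler:
    """Hypothetical scheduler that only counts how often it was stepped."""

    def __init__(self):
        self.steps = 0

    def step(self):
        self.steps += 1


class FakeOptimizer:
    """Hypothetical optimizer exposing only the `step_was_skipped` flag."""

    def __init__(self, step_was_skipped=False):
        self.step_was_skipped = step_was_skipped


def scheduler_step(scheduler, optimizers, num_processes, split_batches=False):
    # If any optimizer skipped its step, the scheduler must not advance.
    if any(opt.step_was_skipped for opt in optimizers):
        return
    # Without split batches the effective batch size grew by `num_processes`,
    # so the scheduler advances that many times per training step.
    repeats = 1 if split_batches else num_processes
    for _ in range(repeats):
        scheduler.step()


sched = CountingScheduler()
scheduler_step(sched, [FakeOptimizer()], num_processes=4)
scheduler_step(sched, [FakeOptimizer(step_was_skipped=True)], num_processes=4)
scheduler_step(sched, [FakeOptimizer()], num_processes=4, split_batches=True)
print(sched.steps)
```

The first call advances the scheduler four times, the second not at all, and the third once.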
--- a/src/accelerate/state.py
+++ b/src/accelerate/state.py
@ -12,29 +12,17 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-import importlib
 import os
 from distutils.util import strtobool
-from enum import Enum

 import torch

-
-try:
-    import torch_ccl  # noqa: F401
-
-    _ccl_available = True
-except ImportError:
-    _ccl_available = False
+from .utils import DistributedType, is_ccl_available, is_deepspeed_available, is_tpu_available


-try:
+if is_tpu_available():
    import torch_xla.core.xla_model as xm

-    _tpu_available = True
-except ImportError:
-    _tpu_available = False
-

 def get_int_from_env(env_keys, default):
    """Returns the first positive env value found in the `env_keys` list or the default."""
@ -45,22 +33,6 @@ def get_int_from_env(env_keys, default):
    return default


-def is_ccl_available():
-    return _ccl_available
-
-
-def is_apex_available():
-    return importlib.util.find_spec("apex") is not None
-
-
-def is_tpu_available():
-    return _tpu_available
-
-
-def is_deepspeed_available():
-    return importlib.util.find_spec("deepspeed") is not None
-
-
 def parse_flag_from_env(key, default=False):
    value = os.environ.get(key, str(default))
    return strtobool(value) == 1  # As its name indicates `strtobool` actually returns an int...
@ -71,59 +43,6 @@ def parse_choice_from_env(key, default="no"):
    return value


-class DistributedType(str, Enum):
-    """
-    Represents a type of distributed environment.
-
-    Values:
-
-        - **NO** -- Not a distributed environment, just a single process.
-        - **MULTI_CPU** -- Distributed on multiple CPU nodes.
-        - **MULTI_GPU** -- Distributed on multiple GPUs.
-        - **DEEPSPEED** -- Using DeepSpeed.
-        - **TPU** -- Distributed on TPUs.
-    """
-
-    # Subclassing str as well as Enum allows the `DistributedType` to be JSON-serializable out of the box.
-    NO = "NO"
-    MULTI_CPU = "MULTI_CPU"
-    MULTI_GPU = "MULTI_GPU"
-    DEEPSPEED = "DEEPSPEED"
-    TPU = "TPU"
-
-
-class SageMakerDistributedType(str, Enum):
-    """
-    Represents a type of distributed environment.
-
-    Values:
-
-        - **NO** -- Not a distributed environment, just a single process.
-        - **DATA_PARALLEL** -- using sagemaker distributed data parallelism.
-        - **MODEL_PARALLEL** -- using sagemaker distributed model parallelism.
-    """
-
-    # Subclassing str as well as Enum allows the `SageMakerDistributedType` to be JSON-serializable out of the box.
-    NO = "NO"
-    DATA_PARALLEL = "DATA_PARALLEL"
-    MODEL_PARALLEL = "MODEL_PARALLEL"
-
-
-class ComputeEnvironment(str, Enum):
-    """
-    Represents a type of the compute environment.
-
-    Values:
-
-        - **LOCAL_MACHINE** -- private/custom cluster hardware.
-        - **AMAZON_SAGEMAKER** -- Amazon SageMaker as compute environment.
-    """
-
-    # Subclassing str as well as Enum allows the `ComputeEnvironment` to be JSON-serializable out of the box.
-    LOCAL_MACHINE = "LOCAL_MACHINE"
-    AMAZON_SAGEMAKER = "AMAZON_SAGEMAKER"
-
-
 # Inspired by Alex Martelli's 'Borg'.
 class AcceleratorState:
    """
@ -149,6 +68,7 @@ class AcceleratorState:
        mixed_precision: str = None,
        cpu: bool = False,
        deepspeed_plugin=None,
+        fsdp_plugin=None,
        _from_accelerator: bool = False,
        **kwargs,
    ):
@ -206,7 +126,13 @@ class AcceleratorState:
                self.mixed_precision = (
                    parse_choice_from_env("MIXED_PRECISION", "no") if mixed_precision is None else mixed_precision
                )
-
+                if os.environ.get("USE_FSDP", "false") == "true":
+                    self.distributed_type = DistributedType.FSDP
+                    if self.mixed_precision != "no":
+                        raise ValueError(
+                            "Mixed precision is currently not supported for FSDP. Please set `mixed_precision` to `no`."
+                        )
+                    self.fsdp_plugin = fsdp_plugin
            elif get_int_from_env(["PMI_SIZE", "OMPI_COMM_WORLD_SIZE", "MV2_COMM_WORLD_SIZE", "WORLD_SIZE"], 1) > 1:
                self.distributed_type = DistributedType.MULTI_CPU
                if is_ccl_available() and get_int_from_env(["CCL_WORKER_COUNT"], 0) > 0:
--- a/src/accelerate/test_utils/init.py
+++ b/src/accelerate/test_utils/init.py
@ -2,5 +2,5 @@
 # There's no way to ignore "F401 '...' imported but unused" warnings in this
 # module, but to preserve other warnings. So, don't check this module at all.

-from .testing import are_the_same_tensors, execute_subprocess_async, require_cuda, require_multi_gpu, require_tpu
+from .testing import are_the_same_tensors, execute_subprocess_async, require_cuda, require_multi_gpu, require_tpu, slow
 from .training import RegressionDataset, RegressionModel
--- a/src/accelerate/test_utils/examples.py
+++ b/src/accelerate/test_utils/examples.py
@ -0,0 +1,139 @@
+#!/usr/bin/env python
+
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+A collection of utilities for comparing `examples/complete_*_example.py` scripts with the capabilities inside of each
+`examples/by_feature` example. `compare_against_test` is the main function that should be used when testing, while the
+others are used to either get the code that matters, or to preprocess them (such as stripping comments)
+"""
+
+import os
+from typing import List
+
+
+def get_function_contents_by_name(lines: List[str], name: str):
+    """
+    Extracts a function from `lines` of segmented source code with the name `name`.
+
+    Args:
+        lines (`List[str]`):
+            Source code of a script separated by line.
+        name (`str`):
+            The name of the function to extract. Should be either `training_function` or `main`
+    """
+    if name != "training_function" and name != "main":
+        raise ValueError(f"Incorrect function name passed: {name}, choose either 'main' or 'training_function'")
+    good_lines, found_start = [], False
+    for line in lines:
+        if not found_start and f"def {name}" in line:
+            found_start = True
+            good_lines.append(line)
+            continue
+        if found_start:
+            if name == "training_function" and "def main" in line:
+                return good_lines
+            if name == "main" and "if __name__" in line:
+                return good_lines
+            good_lines.append(line)
+
+
+def clean_lines(lines: List[str]):
+    """
+    Filters `lines` and removes any entries that start with a comment ('#') or are just a newline ('\n')
+
+    Args:
+        lines (`List[str]`):
+            Source code of a script separated by line.
+    """
+    return [line for line in lines if not line.lstrip().startswith("#") and line != "\n"]
+
+
+def compare_against_test(base_filename: str, feature_filename: str, parser_only: bool, secondary_filename: str = None):
+    """
+    Tests whether the additional code inside of `feature_filename` was implemented in `base_filename`. This should be
+    used when testing to see if `complete_*_example.py` examples have all of the implementations from each of the
+    `examples/by_feature/*` scripts.
+
+    It utilizes `nlp_example.py` to extract out all of the repeated training code, so that only the new additional code
+    is examined and checked. If something *other* than `nlp_example.py` should be used, such as `cv_example.py` for the
+    `complete_cv_example.py` script, it should be passed in for the `secondary_filename` parameter.
+
+    Args:
+        base_filename (`str` or `os.PathLike`):
+            The filepath of a single "complete" example script to test, such as `examples/complete_cv_example.py`
+        feature_filename (`str` or `os.PathLike`):
+            The filepath of a single feature example script. The contents of this script are checked to see if they
+            exist in `base_filename`
+        parser_only (`bool`):
+            Whether to compare only the `main()` sections in both files, or to compare the contents of
+            `training_function()`
+        secondary_filename (`str`, *optional*):
+            A potential secondary filepath that should be included in the check. This function extracts the base
+            functionalities off of "examples/nlp_example.py", so if `base_filename` is a script other than
+            `complete_nlp_example.py`, the template script should be included here. Such as `examples/cv_example.py`
+    """
+    with open(base_filename, "r") as f:
+        base_file_contents = f.readlines()
+    with open(os.path.abspath(os.path.join("examples", "nlp_example.py")), "r") as f:
+        full_file_contents = f.readlines()
+    with open(feature_filename, "r") as f:
+        feature_file_contents = f.readlines()
+    if secondary_filename is not None:
+        with open(secondary_filename, "r") as f:
+            secondary_file_contents = f.readlines()
+
+    # This is our base, we remove all the code from here in our `full_filename` and `feature_filename` to find the new content
+    if parser_only:
+        base_file_func = clean_lines(get_function_contents_by_name(base_file_contents, "main"))
+        full_file_func = clean_lines(get_function_contents_by_name(full_file_contents, "main"))
+        feature_file_func = clean_lines(get_function_contents_by_name(feature_file_contents, "main"))
+        if secondary_filename is not None:
+            secondary_file_func = clean_lines(get_function_contents_by_name(secondary_file_contents, "main"))
+    else:
+        base_file_func = clean_lines(get_function_contents_by_name(base_file_contents, "training_function"))
+        full_file_func = clean_lines(get_function_contents_by_name(full_file_contents, "training_function"))
+        feature_file_func = clean_lines(get_function_contents_by_name(feature_file_contents, "training_function"))
+        if secondary_filename is not None:
+            secondary_file_func = clean_lines(
+                get_function_contents_by_name(secondary_file_contents, "training_function")
+            )
+
+    _dl_line = "train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)\n"
+
+    # Specific code in our script that differs from the full version, aka what is new
+    new_feature_code = []
+    passed_idxs = []  # We keep track of the idxs just in case it's a repeated statement
+    for i, line in enumerate(feature_file_func):
+        if i not in passed_idxs:
+            if (line not in full_file_func) and (line.lstrip() != _dl_line):
+                new_feature_code.append(line)
+                passed_idxs.append(i)
+
+    # Extract out just the new parts from the full_file_training_func
+    new_full_example_parts = []
+    passed_idxs = []  # We keep track of the idxs just in case it's a repeated statement
+    for i, line in enumerate(base_file_func):
+        if i not in passed_idxs:
+            if (line not in full_file_func) and (line.lstrip() != _dl_line):
+                new_full_example_parts.append(line)
+                passed_idxs.append(i)
+
+    # Finally, get the overall diff
+    diff_from_example = [line for line in new_feature_code if line not in new_full_example_parts]
+    if secondary_filename is not None:
+        diff_from_two = [line for line in full_file_contents if line not in secondary_file_func]
+        diff_from_example = [line for line in diff_from_example if line not in diff_from_two]
+
+    return diff_from_example
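
The core idea of `compare_against_test`, stripping template lines from both scripts and then reporting feature lines that the complete script lacks, can be sketched on small in-memory lists. The helper `new_lines` and the sample lines are hypothetical and only illustrate the list-difference logic, not the file handling.

```python
def new_lines(candidate, template):
    """Return the lines of `candidate` that do not appear in `template` (order preserved)."""
    return [line for line in candidate if line not in template]


# Hypothetical contents of a template script, a by_feature script, and a complete script.
template = ["model = Model()\n", "train(model)\n"]
feature = ["model = Model()\n", "accelerator.init_trackers('run')\n", "train(model)\n"]
complete = ["model = Model()\n", "train(model)\n"]

feature_only = new_lines(feature, template)
complete_only = new_lines(complete, template)

# Lines the feature script added that the complete script does not contain.
missing = [line for line in feature_only if line not in complete_only]
print(missing)
```

An empty result would mean the complete example already covers everything the feature example adds.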
--- a/src/accelerate/test_utils/test_script.py
+++ b/src/accelerate/test_utils/test_script.py
@ -19,9 +19,9 @@ from torch.utils.data import DataLoader

 from accelerate import Accelerator
 from accelerate.data_loader import prepare_data_loader
-from accelerate.state import AcceleratorState, DistributedType
+from accelerate.state import AcceleratorState
 from accelerate.test_utils import RegressionDataset, RegressionModel, are_the_same_tensors
-from accelerate.utils import gather, set_seed, synchronize_rng_states
+from accelerate.utils import DistributedType, gather, set_seed, synchronize_rng_states
 from packaging import version


--- a/src/accelerate/test_utils/testing.py
+++ b/src/accelerate/test_utils/testing.py
@ -13,13 +13,157 @@
 # limitations under the License.

 import asyncio
+import os
+import shutil
 import sys
+import tempfile
 import unittest
+from distutils.util import strtobool
+from pathlib import Path
+from typing import List, Union
+from unittest import mock

 import torch

-from ..state import AcceleratorState, is_tpu_available
-from ..utils import gather
+from ..state import AcceleratorState
+from ..utils import gather, is_comet_ml_available, is_tensorflow_available, is_tpu_available, is_wandb_available
+
+
+def parse_flag_from_env(key, default=False):
+    try:
+        value = os.environ[key]
+    except KeyError:
+        # KEY isn't set, default to `default`.
+        _value = default
+    else:
+        # KEY is set, convert it to True or False.
+        try:
+            _value = strtobool(value)
+        except ValueError:
+            # More values are supported, but let's keep the message simple.
+            raise ValueError(f"If set, {key} must be yes or no.")
+    return _value
+
+
+_run_slow_tests = parse_flag_from_env("RUN_SLOW", default=False)
+
+
+def slow(test_case):
+    """
+    Decorator marking a test as slow. Slow tests are skipped by default. Set the RUN_SLOW environment variable to a
+    truthy value to run them.
+    """
+    return unittest.skipUnless(_run_slow_tests, "test is slow")(test_case)
+
+
+def require_cuda(test_case):
+    """
+    Decorator marking a test that requires CUDA. These tests are skipped when there are no GPUs available.
+    """
+    return unittest.skipUnless(torch.cuda.is_available(), "test requires a GPU")(test_case)
+
+
+def require_tpu(test_case):
+    """
+    Decorator marking a test that requires TPUs. These tests are skipped when there are no TPUs available.
+    """
+    return unittest.skipUnless(is_tpu_available(), "test requires TPU")(test_case)
+
+
+def require_multi_gpu(test_case):
+    """
+    Decorator marking a test that requires a multi-GPU setup. These tests are skipped on a machine without multiple
+    GPUs.
+    """
+    return unittest.skipUnless(torch.cuda.device_count() > 1, "test requires multiple GPUs")(test_case)
+
+
+def require_tensorflow(test_case):
+    """
+    Decorator marking a test that requires TensorFlow installed. These tests are skipped when TensorFlow isn't
+    installed
+    """
+    return unittest.skipUnless(is_tensorflow_available(), "test requires TensorFlow")(test_case)
+
+
+def require_wandb(test_case):
+    """
+    Decorator marking a test that requires wandb installed. These tests are skipped when wandb isn't installed
+    """
+    return unittest.skipUnless(is_wandb_available(), "test requires wandb")(test_case)
+
+
+def require_comet_ml(test_case):
+    """
+    Decorator marking a test that requires comet_ml installed. These tests are skipped when comet_ml isn't installed
+    """
+    return unittest.skipUnless(is_comet_ml_available(), "test requires comet_ml")(test_case)
+
+
+class TempDirTestCase(unittest.TestCase):
+    """
+    A TestCase class that keeps a single `tempfile.TemporaryDirectory` open for the duration of the class, wipes its
+    data at the start of a test, and then destroys it at the end of the TestCase.
+
+    Useful for when a class or API requires a single constant folder throughout its use, such as Weights and Biases
+
+    The temporary directory location will be stored in `self.tmpdir`
+    """
+
+    clear_on_setup = True
+
+    @classmethod
+    def setUpClass(cls):
+        "Creates a temporary directory with `tempfile.mkdtemp` and stores its path in `cls.tmpdir`"
+        cls.tmpdir = tempfile.mkdtemp()
+
+    @classmethod
+    def tearDownClass(cls):
+        "Remove `cls.tmpdir` after test suite has finished"
+        if os.path.exists(cls.tmpdir):
+            shutil.rmtree(cls.tmpdir)
+
+    def setUp(self):
+        "Destroy all contents in `self.tmpdir`, but not `self.tmpdir`"
+        if self.clear_on_setup:
+            for path in Path(self.tmpdir).glob("**/*"):
+                if path.is_file():
+                    path.unlink()
+                elif path.is_dir():
+                    shutil.rmtree(path)
+
+
+class MockingTestCase(unittest.TestCase):
+    """
+    A TestCase class designed to dynamically add various mockers that should be used in every test, mimicking the
+    behavior of a class-wide mock when defining one normally will not do.
+
+    Useful when a mock requires information that only becomes available after `TestCase.setUpClass`, such as
+    setting an environment variable with that information.
+
+    The `add_mocks` function should be run at the end of a `TestCase`'s `setUp` function, after a call to
+    `super().setUp()` such as:
+    ```python
+    def setUp(self):
+        super().setUp()
+        mocks = mock.patch.dict(os.environ, {"SOME_ENV_VAR": "SOME_VALUE"})
+        self.add_mocks(mocks)
+    ```
+    """
+
+    def add_mocks(self, mocks: Union[mock.Mock, List[mock.Mock]]):
+        """
+        Add custom mocks for tests that should be repeated on each test. Should be called during
+        `MockingTestCase.setUp`, after `super().setUp()`.
+
+        Args:
+            mocks (`mock.Mock` or list of `mock.Mock`):
+                Mocks that should be added to the `TestCase` after `TestCase.setUpClass` has been run
+        """
+        self.mocks = mocks if isinstance(mocks, (tuple, list)) else [mocks]
+        for m in self.mocks:
+            m.start()
+            self.addCleanup(m.stop)


 def are_the_same_tensors(tensor):
@ -33,37 +177,6 @@ def are_the_same_tensors(tensor):
    return True


-def require_cuda(test_case):
-    """
-    Decorator marking a test that requires CUDA. These tests are skipped when there are no GPU available.
-    """
-    if not torch.cuda.is_available():
-        return unittest.skip("test requires a GPU")(test_case)
-    else:
-        return test_case
-
-
-def require_tpu(test_case):
-    """
-    Decorator marking a test that requires TPUs. These tests are skipped when there are no TPUs available.
-    """
-    if not is_tpu_available():
-        return unittest.skip("test requires TPU")(test_case)
-    else:
-        return test_case
-
-
-def require_multi_gpu(test_case):
-    """
-    Decorator marking a test that requires a multi-GPU setup. These tests are skipped on a machine without multiple
-    GPUs.
-    """
-    if torch.cuda.device_count() < 2:
-        return unittest.skip("test requires multiple GPUs")(test_case)
-    else:
-        return test_case
-
-
 class _RunOutput:
    def __init__(self, returncode, stdout, stderr):
        self.returncode = returncode
--- a/src/accelerate/test_utils/training.py
+++ b/src/accelerate/test_utils/training.py
@ -14,6 +14,11 @@

 import numpy as np
 import torch
+from torch.utils.data import DataLoader
+
+from accelerate.utils.dataclasses import DistributedType
+from datasets import load_dataset
+from transformers import AutoTokenizer


 class RegressionDataset:
@ -43,3 +48,40 @@ class RegressionModel(torch.nn.Module):
            print(f"Model dtype: {self.a.dtype}, {self.b.dtype}. Input dtype: {x.dtype}")
            self.first_batch = False
        return x * self.a + self.b
+
+
+def mocked_dataloaders(accelerator, batch_size: int = 16):
+    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
+    data_files = {"train": "tests/test_samples/MRPC/train.csv", "validation": "tests/test_samples/MRPC/dev.csv"}
+    datasets = load_dataset("csv", data_files=data_files)
+    label_list = datasets["train"].unique("label")
+
+    label_to_id = {v: i for i, v in enumerate(label_list)}
+
+    def tokenize_function(examples):
+        # max_length=None => use the model max length (it's actually the default)
+        outputs = tokenizer(
+            examples["sentence1"], examples["sentence2"], truncation=True, max_length=None, padding="max_length"
+        )
+        if "label" in examples:
+            outputs["labels"] = [label_to_id[l] for l in examples["label"]]
+        return outputs
+
+    # Apply the method we just defined to all the examples in all the splits of the dataset
+    tokenized_datasets = datasets.map(
+        tokenize_function,
+        batched=True,
+        remove_columns=["sentence1", "sentence2", "label"],
+    )
+
+    def collate_fn(examples):
+        # On TPU it's best to pad everything to the same length or training will be very slow.
+        if accelerator.distributed_type == DistributedType.TPU:
+            return tokenizer.pad(examples, padding="max_length", max_length=128, return_tensors="pt")
+        return tokenizer.pad(examples, padding="longest", return_tensors="pt")
+
+    # Instantiate dataloaders.
+    train_dataloader = DataLoader(tokenized_datasets["train"], shuffle=True, collate_fn=collate_fn, batch_size=2)
+    eval_dataloader = DataLoader(tokenized_datasets["validation"], shuffle=False, collate_fn=collate_fn, batch_size=1)
+
+    return train_dataloader, eval_dataloader
--- a/src/accelerate/tracking.py
+++ b/src/accelerate/tracking.py
@ -0,0 +1,332 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Expectation:
+# Provide a project dir name, then each type of logger gets stored in project/{`logging_dir`}
+
+import os
+from abc import ABCMeta, abstractmethod, abstractproperty
+from typing import List, Optional, Union
+
+from .logging import get_logger
+from .utils import LoggerType, is_comet_ml_available, is_tensorboard_available, is_wandb_available
+
+
+_available_trackers = []
+
+if is_tensorboard_available():
+    from torch.utils import tensorboard
+
+    _available_trackers.append(LoggerType.TENSORBOARD)
+
+if is_wandb_available():
+    import wandb
+
+    _available_trackers.append(LoggerType.WANDB)
+
+if is_comet_ml_available():
+    from comet_ml import Experiment
+
+    _available_trackers.append(LoggerType.COMETML)
+
+
+logger = get_logger(__name__)
+
+
+def get_available_trackers():
+    "Returns a list of all supported available trackers in the system"
+    return _available_trackers
+
+
+class GeneralTracker(object, metaclass=ABCMeta):
+    """
+    A base Tracker class to be used for all logging integration implementations.
+    """
+
+    @abstractproperty
+    def requires_logging_directory(self):
+        """
+        Whether the logger requires a directory to store its logs. Should either return `True` or `False`.
+        """
+        pass
+
+    @abstractmethod
+    def store_init_configuration(self, values: dict):
+        """
+        Logs `values` as hyperparameters for the run. Implementations should use the experiment configuration
+        functionality of a tracking API.
+
+        Args:
+            values (Dictionary `str` to `bool`, `str`, `float` or `int`):
+                Values to be stored as initial hyperparameters as key-value pairs. The values need to have type `bool`,
+                `str`, `float`, `int`, or `None`.
+        """
+        pass
+
+    @abstractmethod
+    def log(self, values: dict, step: Optional[int]):
+        """
+        Logs `values` to the current run. Base `log` implementations of a tracking API should go in here, along with
+        special behavior for the `step` parameter.
+
+        Args:
+            values (Dictionary `str` to `str`, `float`, or `int`):
+                Values to be logged as key-value pairs. The values need to have type `str`, `float`, or `int`.
+            step (`int`, *optional*):
+                The run step. If included, the log will be affiliated with this step.
+        """
+        pass
+
+    def finish(self):
+        """
+        Should run any finalizing functions within the tracking API. If the API does not have one, simply do not
+        override this method.
+        """
+        pass
+
+
+class TensorBoardTracker(GeneralTracker):
+    """
+    A `Tracker` class that supports `tensorboard`. Should be initialized at the start of your script.
+
+    Args:
+        run_name (`str`):
+            The name of the experiment run
+        logging_dir (`str`, `os.PathLike`):
+            Location for TensorBoard logs to be stored.
+    """
+
+    requires_logging_directory = True
+
+    def __init__(self, run_name: str, logging_dir: Optional[Union[str, os.PathLike]]):
+        self.run_name = run_name
+        self.logging_dir = os.path.join(logging_dir, run_name)
+        self.writer = tensorboard.SummaryWriter(self.logging_dir)
+        logger.info(f"Initialized TensorBoard project {self.run_name} logging to {self.logging_dir}")
+        logger.info(
+            "Make sure to log any initial configurations with `self.store_init_configuration` before training!"
+        )
+
+    def store_init_configuration(self, values: dict):
+        """
+        Logs `values` as hyperparameters for the run. Should be run at the beginning of your experiment.
+
+        Args:
+            values (Dictionary `str` to `bool`, `str`, `float` or `int`):
+                Values to be stored as initial hyperparameters as key-value pairs. The values need to have type `bool`,
+                `str`, `float`, `int`, or `None`.
+        """
+        self.writer.add_hparams(values, metric_dict={})
+        self.writer.flush()
+        logger.info("Stored initial configuration hyperparameters to TensorBoard")
+
+    def log(self, values: dict, step: Optional[int] = None):
+        """
+        Logs `values` to the current run.
+
+        Args:
+            values (Dictionary `str` to `str`, `float`, `int` or `dict` of `str` to `float`/`int`):
+                Values to be logged as key-value pairs. The values need to have type `str`, `float`, `int` or `dict` of
+                `str` to `float`/`int`.
+            step (`int`, *optional*):
+                The run step. If included, the log will be affiliated with this step.
+        """
+        for k, v in values.items():
+            if isinstance(v, (int, float)):
+                self.writer.add_scalar(k, v, global_step=step)
+            elif isinstance(v, str):
+                self.writer.add_text(k, v, global_step=step)
+            elif isinstance(v, dict):
+                self.writer.add_scalars(k, v, global_step=step)
+        self.writer.flush()
+        logger.info("Successfully logged to TensorBoard")
+
+    def finish(self):
+        """
+        Closes `TensorBoard` writer
+        """
+        self.writer.close()
+        logger.info("TensorBoard writer closed")
+
+
+class WandBTracker(GeneralTracker):
+    """
+    A `Tracker` class that supports `wandb`. Should be initialized at the start of your script.
+
+    Args:
+        run_name (`str`):
+            The name of the experiment run.
+    """
+
+    requires_logging_directory = False
+
+    def __init__(self, run_name: str):
+        self.run_name = run_name
+        self.run = wandb.init(project=self.run_name)
+        logger.info(f"Initialized WandB project {self.run_name}")
+        logger.info(
+            "Make sure to log any initial configurations with `self.store_init_configuration` before training!"
+        )
+
+    def store_init_configuration(self, values: dict):
+        """
+        Logs `values` as hyperparameters for the run. Should be run at the beginning of your experiment.
+
+        Args:
+            values (Dictionary `str` to `bool`, `str`, `float` or `int`):
+                Values to be stored as initial hyperparameters as key-value pairs. The values need to have type `bool`,
+                `str`, `float`, `int`, or `None`.
+        """
+        wandb.config.update(values)
+        logger.info("Stored initial configuration hyperparameters to WandB")
+
+    def log(self, values: dict, step: Optional[int] = None):
+        """
+        Logs `values` to the current run.
+
+        Args:
+            values (Dictionary `str` to `str`, `float`, `int` or `dict` of `str` to `float`/`int`):
+                Values to be logged as key-value pairs. The values need to have type `str`, `float`, `int` or `dict` of
+                `str` to `float`/`int`.
+            step (`int`, *optional*):
+                The run step. If included, the log will be affiliated with this step.
+        """
+        self.run.log(values, step=step)
+        logger.info("Successfully logged to WandB")
+
+    def finish(self):
+        """
+        Closes `wandb` writer
+        """
+        self.run.finish()
+        logger.info("WandB run closed")
+
+
+class CometMLTracker(GeneralTracker):
+    """
+    A `Tracker` class that supports `comet_ml`. Should be initialized at the start of your script.
+
+    API keys must be stored in a Comet config file.
+
+    Args:
+        run_name (`str`):
+            The name of the experiment run.
+    """
+
+    requires_logging_directory = False
+
+    def __init__(self, run_name: str):
+        self.run_name = run_name
+        self.writer = Experiment(project_name=run_name)
+        logger.info(f"Initialized CometML project {self.run_name}")
+        logger.info(
+            "Make sure to log any initial configurations with `self.store_init_configuration` before training!"
+        )
+
+    def store_init_configuration(self, values: dict):
+        """
+        Logs `values` as hyperparameters for the run. Should be run at the beginning of your experiment.
+
+        Args:
+            values (Dictionary `str` to `bool`, `str`, `float` or `int`):
+                Values to be stored as initial hyperparameters as key-value pairs. The values need to have type `bool`,
+                `str`, `float`, `int`, or `None`.
+        """
+        self.writer.log_parameters(values)
+        logger.info("Stored initial configuration hyperparameters to CometML")
+
+    def log(self, values: dict, step: Optional[int] = None):
+        """
+        Logs `values` to the current run.
+
+        Args:
+            values (Dictionary `str` to `str`, `float`, `int` or `dict` of `str` to `float`/`int`):
+                Values to be logged as key-value pairs. The values need to have type `str`, `float`, `int` or `dict` of
+                `str` to `float`/`int`.
+            step (`int`, *optional*):
+                The run step. If included, the log will be affiliated with this step.
+        """
+        if step is not None:
+            self.writer.set_step(step)
+        for k, v in values.items():
+            if isinstance(v, (int, float)):
+                self.writer.log_metric(k, v, step=step)
+            elif isinstance(v, str):
+                self.writer.log_other(k, v)
+            elif isinstance(v, dict):
+                self.writer.log_metrics(v, step=step)
+        logger.info("Successfully logged to CometML")
+
+    def finish(self):
+        """
+        Closes `comet-ml` writer
+        """
+        self.writer.end()
+        logger.info("CometML run closed")
+
+
+LOGGER_TYPE_TO_CLASS = {"tensorboard": TensorBoardTracker, "wandb": WandBTracker, "comet_ml": CometMLTracker}
+
+
+def filter_trackers(
+    log_with: List[Union[str, LoggerType, GeneralTracker]], logging_dir: Union[str, os.PathLike] = None
+):
+    """
+    Takes in a list of potential tracker types and checks that:
+        - The tracker wanted is available in that environment
+        - Filters out repeats of tracker types
+        - If `all` is in `log_with`, will return all trackers in the environment
+        - If a tracker requires a `logging_dir`, ensures that `logging_dir` is not `None`
+
+    Args:
+        log_with (list of `str`, [`~utils.LoggerType`] or [`~tracking.GeneralTracker`], *optional*):
+            A list of loggers to be set up for experiment tracking. Should be one or several of:
+
+            - `"all"`
+            - `"tensorboard"`
+            - `"wandb"`
+            - `"comet_ml"`
+            If `"all"` is selected, will pick up all available trackers in the environment and initialize them. Can also
+            accept implementations of `GeneralTracker` for custom trackers, and can be combined with `"all"`.
+        logging_dir (`str`, `os.PathLike`, *optional*):
+            A path to a directory for storing logs of locally-compatible loggers.
+    """
+    loggers = []
+    if log_with is not None:
+        if not isinstance(log_with, (list, tuple)):
+            log_with = [log_with]
+            logger.debug(f"{log_with}")
+        if "all" in log_with or LoggerType.ALL in log_with:
+            loggers = [o for o in log_with if issubclass(type(o), GeneralTracker)] + get_available_trackers()
+        else:
+            for log_type in log_with:
+                if log_type not in LoggerType and not issubclass(type(log_type), GeneralTracker):
+                    raise ValueError(f"Unsupported logging capability: {log_type}. Choose between {LoggerType.list()}")
+                if issubclass(type(log_type), GeneralTracker):
+                    loggers.append(log_type)
+                else:
+                    log_type = LoggerType(log_type)
+                    if log_type not in loggers:
+                        if log_type in get_available_trackers():
+                            tracker_init = LOGGER_TYPE_TO_CLASS[str(log_type)]
+                            if getattr(tracker_init, "requires_logging_directory"):
+                                if logging_dir is None:
+                                    raise ValueError(
+                                        f"Logging with `{str(log_type)}` requires a `logging_dir` to be passed in."
+                                    )
+                            loggers.append(log_type)
+                        else:
+                            logger.info(f"Tried adding logger {log_type}, but package is unavailable in the system.")
+
+    return loggers
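The docstring of `filter_trackers` notes that custom implementations of `GeneralTracker` can be passed alongside the built-in tracker names. As a rough illustration of what such a subclass has to provide, here is a minimal standalone sketch. It does not import `accelerate`: `SketchTracker` is a hypothetical stand-in that mirrors the abstract interface added in this file, and `InMemoryTracker` is an invented example class, not part of the library.

```python
from abc import ABC, abstractmethod
from typing import Optional


class SketchTracker(ABC):
    """Hypothetical stand-in mirroring the `GeneralTracker` interface from the diff."""

    @property
    @abstractmethod
    def requires_logging_directory(self) -> bool:
        """Whether the tracker needs a directory to store its logs."""

    @abstractmethod
    def store_init_configuration(self, values: dict):
        """Store hyperparameters for the run."""

    @abstractmethod
    def log(self, values: dict, step: Optional[int] = None):
        """Log `values`, optionally tied to `step`."""

    def finish(self):
        """Optional hook; subclasses override it only when cleanup is needed."""


class InMemoryTracker(SketchTracker):
    """Invented example: keeps everything in Python lists and dicts."""

    # A plain class attribute is enough to satisfy the abstract property.
    requires_logging_directory = False

    def __init__(self, run_name: str):
        self.run_name = run_name
        self.config = {}
        self.records = []

    def store_init_configuration(self, values: dict):
        self.config.update(values)

    def log(self, values: dict, step: Optional[int] = None):
        # Copy the dict so later mutation by the caller does not alter the record.
        self.records.append((step, dict(values)))
```

Because `finish` has a default no-op body in the base class, a tracker without any teardown step (as here) can leave it out entirely.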
--- a/src/accelerate/utils/__init__.py
+++ b/src/accelerate/utils/__init__.py
@ -0,0 +1,91 @@
+# flake8: noqa
+# There's no way to ignore "F401 '...' imported but unused" warnings in this
+# module, but to preserve other warnings. So, don't check this module at all
+
+from .constants import MODEL_NAME, OPTIMIZER_NAME, RNG_STATE_NAME, SCALER_NAME, SCHEDULER_NAME
+from .dataclasses import (
+    ComputeEnvironment,
+    DeepSpeedPlugin,
+    DistributedDataParallelKwargs,
+    DistributedType,
+    FullyShardedDataParallelPlugin,
+    GradScalerKwargs,
+    InitProcessGroupKwargs,
+    KwargsHandler,
+    LoggerType,
+    PrecisionType,
+    RNGType,
+    SageMakerDistributedType,
+    TensorInformation,
+)
+from .imports import (
+    is_apex_available,
+    is_boto3_available,
+    is_ccl_available,
+    is_comet_ml_available,
+    is_deepspeed_available,
+    is_sagemaker_available,
+    is_tensorboard_available,
+    is_tensorflow_available,
+    is_tpu_available,
+    is_wandb_available,
+)
+from .modeling import (
+    check_device_map,
+    compute_module_sizes,
+    convert_file_size_to_int,
+    dtype_byte_size,
+    find_tied_parameters,
+    get_max_layer_size,
+    get_max_memory,
+    infer_auto_device_map,
+    load_checkpoint_in_model,
+    load_offloaded_weights,
+    named_module_tensors,
+    set_module_tensor_to_device,
+)
+from .offload import (
+    OffloadedWeightsLoader,
+    PrefixedDataset,
+    extract_submodules_state_dict,
+    offload_state_dict,
+    offload_weight,
+    save_offload_index,
+)
+from .operations import (
+    broadcast,
+    broadcast_object_list,
+    concatenate,
+    convert_outputs_to_fp32,
+    convert_to_fp32,
+    find_batch_size,
+    find_device,
+    gather,
+    gather_object,
+    get_data_structure,
+    honor_type,
+    initialize_tensors,
+    is_tensor_information,
+    is_torch_tensor,
+    pad_across_processes,
+    recursively_apply,
+    reduce,
+    send_to_device,
+    slice_tensors,
+)
+
+
+if is_deepspeed_available():
+    from .deepspeed import DeepSpeedEngineWrapper, DeepSpeedOptimizerWrapper
+
+from .launch import PrepareForLaunch
+from .memory import find_executable_batch_size
+from .other import (
+    extract_model_from_parallel,
+    get_pretty_name,
+    patch_environment,
+    save,
+    wait_for_everyone,
+    write_basic_config,
+)
+from .random import set_seed, synchronize_rng_state, synchronize_rng_states
--- a/src/accelerate/utils/constants.py
+++ b/src/accelerate/utils/constants.py
@ -0,0 +1,19 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+SCALER_NAME = "scaler.pt"
+MODEL_NAME = "pytorch_model"
+RNG_STATE_NAME = "random_states"
+OPTIMIZER_NAME = "optimizer"
+SCHEDULER_NAME = "scheduler"
--- a/src/accelerate/utils/dataclasses.py
+++ b/src/accelerate/utils/dataclasses.py
@ -0,0 +1,304 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+General namespace and dataclass related classes
+"""
+
+import copy
+import enum
+import functools
+import os
+import typing
+from dataclasses import dataclass, field
+from datetime import timedelta
+from typing import Callable, Iterable, Optional
+
+import torch
+
+
+class KwargsHandler:
+    """
+    Internal mixin that implements a `to_kwargs()` method for a dataclass.
+    """
+
+    def to_dict(self):
+        return copy.deepcopy(self.__dict__)
+
+    def to_kwargs(self):
+        """
+        Returns a dictionary containing the attributes with values different from the default of this class.
+        """
+        default_dict = self.__class__().to_dict()
+        this_dict = self.to_dict()
+        return {k: v for k, v in this_dict.items() if default_dict[k] != v}
+
+
+@dataclass
+class DistributedDataParallelKwargs(KwargsHandler):
+    """
+    Use this object in your [`Accelerator`] to customize how your model is wrapped in a
+    `torch.nn.parallel.DistributedDataParallel`. Please refer to the documentation of this
+    [wrapper](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) for more
+    information on each argument.
+
+    <Tip warning={true}>
+
+    `gradient_as_bucket_view` is only available in PyTorch 1.7.0 and later versions.
+
+    </Tip>"""
+
+    dim: int = 0
+    broadcast_buffers: bool = True
+    bucket_cap_mb: int = 25
+    find_unused_parameters: bool = False
+    check_reduction: bool = False
+    gradient_as_bucket_view: bool = False
+
+
+@dataclass
+class GradScalerKwargs(KwargsHandler):
+    """
+    Use this object in your [`Accelerator`] to customize the behavior of mixed precision, specifically how the
+    `torch.cuda.amp.GradScaler` used is created. Please refer to the documentation of this
+    [scaler](https://pytorch.org/docs/stable/amp.html?highlight=gradscaler) for more information on each argument.
+
+    <Tip warning={true}>
+
+    `GradScaler` is only available in PyTorch 1.5.0 and later versions.
+
+    </Tip>"""
+
+    init_scale: float = 65536.0
+    growth_factor: float = 2.0
+    backoff_factor: float = 0.5
+    growth_interval: int = 2000
+    enabled: bool = True
+
+
+@dataclass
+class InitProcessGroupKwargs(KwargsHandler):
+    """
+    Use this object in your [`Accelerator`] to customize the initialization of the distributed processes. Please refer
+    to the documentation of this
+    [method](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) for more
+    information on each argument.
+    """
+
+    init_method: Optional[str] = None
+    timeout: timedelta = timedelta(seconds=1800)
+
+
+class DistributedType(str, enum.Enum):
+    """
+    Represents a type of distributed environment.
+
+    Values:
+
+        - **NO** -- Not a distributed environment, just a single process.
+        - **MULTI_CPU** -- Distributed on multiple CPU nodes.
+        - **MULTI_GPU** -- Distributed on multiple GPUs.
+        - **DEEPSPEED** -- Using DeepSpeed.
+        - **TPU** -- Distributed on TPUs.
+    """
+
+    # Subclassing str as well as Enum allows the `DistributedType` to be JSON-serializable out of the box.
+    NO = "NO"
+    MULTI_CPU = "MULTI_CPU"
+    MULTI_GPU = "MULTI_GPU"
+    DEEPSPEED = "DEEPSPEED"
+    FSDP = "FSDP"
+    TPU = "TPU"
+
+
+class SageMakerDistributedType(str, enum.Enum):
+    """
+    Represents a type of distributed environment.
+
+    Values:
+
+        - **NO** -- Not a distributed environment, just a single process.
+        - **DATA_PARALLEL** -- using sagemaker distributed data parallelism.
+        - **MODEL_PARALLEL** -- using sagemaker distributed model parallelism.
+    """
+
+    # Subclassing str as well as Enum allows the `SageMakerDistributedType` to be JSON-serializable out of the box.
+    NO = "NO"
+    DATA_PARALLEL = "DATA_PARALLEL"
+    MODEL_PARALLEL = "MODEL_PARALLEL"
+
+
+class ComputeEnvironment(str, enum.Enum):
+    """
+    Represents a type of the compute environment.
+
+    Values:
+
+        - **LOCAL_MACHINE** -- private/custom cluster hardware.
+        - **AMAZON_SAGEMAKER** -- Amazon SageMaker as compute environment.
+    """
+
+    # Subclassing str as well as Enum allows the `ComputeEnvironment` to be JSON-serializable out of the box.
+    LOCAL_MACHINE = "LOCAL_MACHINE"
+    AMAZON_SAGEMAKER = "AMAZON_SAGEMAKER"
+
+
+class EnumWithContains(enum.EnumMeta):
+    "A metaclass that adds the ability to check if `self` contains an item with the `in` operator"
+
+    def __contains__(cls, item):
+        try:
+            cls(item)
+        except ValueError:
+            return False
+        return True
+
+
+class BaseEnum(enum.Enum, metaclass=EnumWithContains):
+    "An enum class that can get the value of an item with `str(Enum.key)`"
+
+    def __str__(self):
+        return self.value
+
+    @classmethod
+    def list(cls):
+        "Method to list all the possible items in `cls`"
+        return list(map(lambda item: str(item), cls))
+
+
+class LoggerType(BaseEnum):
+    ALL = "all"
+    TENSORBOARD = "tensorboard"
+    WANDB = "wandb"
+    COMETML = "comet_ml"
+
+
+class PrecisionType(BaseEnum):
+    NO = "no"
+    FP16 = "fp16"
+    BF16 = "bf16"
+
+
+class RNGType(BaseEnum):
+    TORCH = "torch"
+    CUDA = "cuda"
+    XLA = "xla"
+    GENERATOR = "generator"
+
+
+# data classes
+
+
+@dataclass
+class TensorInformation:
+    shape: torch.Size
+    dtype: torch.dtype
+
+
+@dataclass
+class DeepSpeedPlugin:
+
+    gradient_accumulation_steps: int = field(
+        default=None, metadata={"help": "Number of steps to accumulate gradients before updating optimizer states"}
+    )
+    zero_stage: int = field(
+        default=None,
+        metadata={"help": "Possible options are 0,1,2,3; Default will be taken from environment variable"},
+    )
+    is_train_batch_min: bool = field(
+        default=True,
+        metadata={"help": "If both train & eval dataloaders are specified, this will decide the train_batch_size"},
+    )
+
+    auto_opt_mapping: bool = field(
+        default=True,
+        metadata={"help": "whether to map torch.adam to deepspeed optimizer version of adam based on config"},
+    )
+
+    offload_optimizer_device: str = field(default=None, metadata={"help": "Possible options are none|cpu|nvme"})
+
+    def __post_init__(self):
+
+        if self.gradient_accumulation_steps is None:
+            self.gradient_accumulation_steps = int(os.environ.get("GRADIENT_ACCUMULATION_STEPS", 1))
+
+        if self.zero_stage is None:
+            self.zero_stage = int(os.environ.get("DEEPSPEED_ZERO_STAGE", 2))
+
+        if self.offload_optimizer_device is None:
+            self.offload_optimizer_device = os.environ.get("DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE", "none")
+
+        self.deepspeed_config = {
+            "train_batch_size": None,
+            "gradient_accumulation_steps": self.gradient_accumulation_steps,
+            "zero_optimization": {
+                "stage": self.zero_stage,
+                "offload_optimizer": {
+                    "device": self.offload_optimizer_device,
+                },
+            },
+            "steps_per_print": float("inf"),  # this will stop deepspeed from logging @ stdout
+            "zero_allow_untested_optimizer": True,
+        }
+
+
+@dataclass
+class FullyShardedDataParallelPlugin:
+    """
+    This plugin is used to enable fully sharded data parallelism.
+    """
+
+    sharding_strategy: "typing.Any" = field(
+        default=None,
+        metadata={"help": "Possible options are [1] FULL_SHARD, [2] SHARD_GRAD_OP"},
+    )
+    backward_prefetch: "typing.Any" = field(
+        default=None,
+        metadata={"help": "Possible options are [1] BACKWARD_PRE, [2] BACKWARD_POST"},
+    )
+    auto_wrap_policy: "typing.Any" = field(
+        default=None,
+        metadata={"help": "A callable specifying a policy to recursively wrap layers with FSDP"},
+    )
+    cpu_offload: Optional[Callable] = field(
+        default=None,
+        metadata={"help": "Decides whether to offload parameters and gradients to CPU."},
+    )
+    min_num_params: int = field(
+        default=None, metadata={"help": "FSDP's minimum number of parameters for Default Auto Wrapping."}
+    )
+    ignored_modules: Optional[Iterable[torch.nn.Module]] = field(
+        default=None,
+        metadata={"help": "A list of modules to ignore for FSDP."},
+    )
+
+    def __post_init__(self):
+        from torch.distributed.fsdp.fully_sharded_data_parallel import CPUOffload, ShardingStrategy
+        from torch.distributed.fsdp.wrap import default_auto_wrap_policy
+
+        if self.sharding_strategy is None:
+            self.sharding_strategy = ShardingStrategy(int(os.environ.get("FSDP_SHARDING_STRATEGY", 1)))
+
+        if self.cpu_offload is None:
+            if os.environ.get("FSDP_OFFLOAD_PARAMS", "false") == "true":
+                self.cpu_offload = CPUOffload(offload_params=True)
+            else:
+                self.cpu_offload = CPUOffload(offload_params=False)
+
+        if self.min_num_params is None:
+            self.min_num_params = int(os.environ.get("FSDP_MIN_NUM_PARAMS", 0))
+
+        if self.auto_wrap_policy is None:
+            if self.min_num_params > 0:
+                self.auto_wrap_policy = functools.partial(default_auto_wrap_policy, min_num_params=self.min_num_params)
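The `KwargsHandler.to_kwargs` method in this file returns only the fields whose values differ from the class defaults. The sketch below reproduces that logic outside the library to show the effect. `KwargsSketch` and `ScalerOptions` are hypothetical names used only for illustration; the field names are borrowed from `GradScalerKwargs`.

```python
import copy
from dataclasses import dataclass


class KwargsSketch:
    """Hypothetical copy of the mixin logic shown in the diff."""

    def to_dict(self):
        return copy.deepcopy(self.__dict__)

    def to_kwargs(self):
        # Build a default instance and keep only the fields that were changed.
        default_dict = self.__class__().to_dict()
        this_dict = self.to_dict()
        return {k: v for k, v in this_dict.items() if default_dict[k] != v}


@dataclass
class ScalerOptions(KwargsSketch):
    # Illustrative fields; every field needs a default so that
    # `self.__class__()` can be called without arguments.
    init_scale: float = 65536.0
    growth_factor: float = 2.0
    enabled: bool = True


opts = ScalerOptions(growth_factor=4.0)
print(opts.to_kwargs())  # prints {'growth_factor': 4.0}
```

One consequence of this design is that a dataclass using the mixin must be constructible with no arguments, since the default instance is created on every call.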
--- a/src/accelerate/utils/deepspeed.py
+++ b/src/accelerate/utils/deepspeed.py
@ -12,8 +12,8 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-from .optimizer import AcceleratedOptimizer
-from .state import is_apex_available, is_deepspeed_available
+from ..optimizer import AcceleratedOptimizer
+from .imports import is_apex_available, is_deepspeed_available


 if is_deepspeed_available():
--- a/src/accelerate/utils/imports.py
+++ b/src/accelerate/utils/imports.py
@ -0,0 +1,87 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import importlib.util
+import sys
+
+
+# The package importlib_metadata is in a different place, depending on the Python version.
+if sys.version_info < (3, 8):
+    import importlib_metadata
+else:
+    import importlib.metadata as importlib_metadata
+
+
+try:
+    import torch_ccl  # noqa: F401
+
+    _ccl_available = True
+except ImportError:
+    _ccl_available = False
+
+
+try:
+    import torch_xla.core.xla_model as xm  # noqa: F401
+
+    _tpu_available = True
+except ImportError:
+    _tpu_available = False
+
+
+def is_ccl_available():
+    return _ccl_available
+
+
+def is_apex_available():
+    return importlib.util.find_spec("apex") is not None
+
+
+def is_tpu_available():
+    return _tpu_available
+
+
+def is_deepspeed_available():
+    package_exists = importlib.util.find_spec("deepspeed") is not None
+    # Check we're importing the actual library, not a stray "deepspeed" directory, by looking up its package metadata.
+    if package_exists:
+        try:
+            _ = importlib_metadata.metadata("deepspeed")
+            return True
+        except importlib_metadata.PackageNotFoundError:
+            return False
+    return False
+
+
+def is_tensorflow_available():
+    return importlib.util.find_spec("tensorflow") is not None
+
+
+def is_tensorboard_available():
+    return importlib.util.find_spec("tensorboard") is not None or importlib.util.find_spec("tensorboardX") is not None
+
+
+def is_wandb_available():
+    return importlib.util.find_spec("wandb") is not None
+
+
+def is_comet_ml_available():
+    return importlib.util.find_spec("comet_ml") is not None
+
+
+def is_boto3_available():
+    return importlib.util.find_spec("boto3") is not None
+
+
+def is_sagemaker_available():
+    return importlib.util.find_spec("sagemaker") is not None
--- a/src/accelerate/utils/launch.py
+++ b/src/accelerate/utils/launch.py
@ -0,0 +1,55 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+
+import torch
+
+from .dataclasses import DistributedType
+
+
+class PrepareForLaunch:
+    """
+    Prepare a function that will be launched in a distributed setup.
+
+    Args:
+        launcher (`Callable`):
+            The function to launch.
+        distributed_type ([`~state.DistributedType`]):
+            The distributed type to prepare for.
+        debug (`bool`, *optional*, defaults to `False`):
+            Whether or not this is a debug launch.
+    """
+
+    def __init__(self, launcher, distributed_type="NO", debug=False):
+        self.launcher = launcher
+        self.distributed_type = DistributedType(distributed_type)
+        self.debug = debug
+
+    def __call__(self, index, *args):
+        if self.debug:
+            world_size = int(os.environ.get("WORLD_SIZE"))
+            rdv_file = os.environ.get("ACCELERATE_DEBUG_RDV_FILE")
+            torch.distributed.init_process_group(
+                "gloo",
+                rank=index,
+                store=torch.distributed.FileStore(rdv_file, world_size),
+                world_size=world_size,
+            )
+        elif self.distributed_type == DistributedType.MULTI_GPU or self.distributed_type == DistributedType.MULTI_CPU:
+            # Prepare the environment for torch.distributed
+            os.environ["LOCAL_RANK"] = str(index)
+            os.environ["RANK"] = str(index)
+
+        self.launcher(*args)
--- a/src/accelerate/utils/memory.py
+++ b/src/accelerate/utils/memory.py
@ -0,0 +1,88 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+A collection of utilities for ensuring that training can always occur. Heavily influenced by the
+[toma](https://github.com/BlackHC/toma) library.
+"""
+
+import functools
+import gc
+import inspect
+
+import torch
+
+
+def should_reduce_batch_size(exception: Exception) -> bool:
+    """
+    Checks if `exception` relates to CUDA out-of-memory, CUDNN not supported, or CPU out-of-memory.
+
+    Args:
+        exception (`Exception`):
+            An exception
+    """
+    _statements = [
+        "CUDA out of memory.",  # CUDA OOM
+        "cuDNN error: CUDNN_STATUS_NOT_SUPPORTED.",  # CUDNN SNAFU
+        "DefaultCPUAllocator: can't allocate memory",  # CPU OOM
+    ]
+    if isinstance(exception, RuntimeError) and len(exception.args) == 1:
+        return any(err in exception.args[0] for err in _statements)
+    return False
+
+
+def find_executable_batch_size(function: callable = None, starting_batch_size: int = 128):
+    """
+    A basic decorator that will try to execute `function`. If it fails from exceptions related to out-of-memory or
+    CUDNN, the batch size is cut in half and passed to `function` again.
+
+    `function` must take in a `batch_size` parameter as its first argument.
+
+    Args:
+        function (`callable`, *optional*):
+            A function to wrap
+        starting_batch_size (`int`, *optional*):
+            The batch size to try and fit into memory
+    """
+    if function is None:
+        return functools.partial(find_executable_batch_size, starting_batch_size=starting_batch_size)
+
+    batch_size = starting_batch_size
+
+    def decorator(*args, **kwargs):
+        nonlocal batch_size
+        gc.collect()
+        torch.cuda.empty_cache()
+        params = list(inspect.signature(function).parameters.keys())
+        # Guard against user error
+        if len(params) < (len(args) + 1):
+            arg_str = ", ".join([f"{arg}={value}" for arg, value in zip(params[1:], args[1:])])
+            raise TypeError(
+                f"Batch size was passed into `{function.__name__}` as the first argument when called. "
+                f"Remove this as the decorator already does so: `{function.__name__}({arg_str})`"
+            )
+        while True:
+            if batch_size == 0:
+                raise RuntimeError("No executable batch size found, reached zero.")
+            try:
+                return function(batch_size, *args, **kwargs)
+            except Exception as e:
+                if should_reduce_batch_size(e):
+                    gc.collect()
+                    torch.cuda.empty_cache()
+                    batch_size //= 2
+                else:
+                    raise
+
+    return decorator
--- a/src/accelerate/utils/modeling.py
+++ b/src/accelerate/utils/modeling.py
@ -0,0 +1,614 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import json
+import os
+import re
+import shutil
+import tempfile
+from collections import defaultdict
+from typing import Dict, List, Optional, Tuple, Union
+
+import numpy as np
+import torch
+import torch.nn as nn
+
+from .offload import offload_weight, save_offload_index
+
+
+WEIGHTS_INDEX_NAME = "pytorch_model.bin.index.json"
+
+
+def convert_file_size_to_int(size: Union[int, str]):
+    """
+    Converts a size expressed as a string with digits and a unit (like `"5MB"`) to an integer (in bytes).
+
+    Args:
+        size (`int` or `str`): The size to convert. Will be directly returned if an `int`.
+
+    Example:
+
+    ```py
+    >>> convert_file_size_to_int("1MiB")
+    1048576
+    ```
+    """
+    if isinstance(size, int):
+        return size
+    if size.upper().endswith("GIB"):
+        return int(size[:-3]) * (2**30)
+    if size.upper().endswith("MIB"):
+        return int(size[:-3]) * (2**20)
+    if size.upper().endswith("KIB"):
+        return int(size[:-3]) * (2**10)
+    if size.upper().endswith("GB"):
+        int_size = int(size[:-2]) * (10**9)
+        return int_size // 8 if size.endswith("b") else int_size
+    if size.upper().endswith("MB"):
+        int_size = int(size[:-2]) * (10**6)
+        return int_size // 8 if size.endswith("b") else int_size
+    if size.upper().endswith("KB"):
+        int_size = int(size[:-2]) * (10**3)
+        return int_size // 8 if size.endswith("b") else int_size
+    raise ValueError("`size` is not in a valid format. Use an integer followed by the unit, e.g., '5GB'.")
+
+
+def dtype_byte_size(dtype: torch.dtype):
+    """
+    Returns the size (in bytes) occupied by one parameter of type `dtype`.
+
+    Example:
+
+    ```py
+    >>> dtype_byte_size(torch.float32)
+    4
+    ```
+    """
+    if dtype == torch.bool:
+        return 1 / 8
+    bit_search = re.search(r"[^\d](\d+)$", str(dtype))
+    if bit_search is None:
+        raise ValueError(f"`dtype` is not a valid dtype: {dtype}.")
+    bit_size = int(bit_search.groups()[0])
+    return bit_size // 8
+
+
+def set_module_tensor_to_device(
+    module: nn.Module, tensor_name: str, device: Union[int, str, torch.device], value: Optional[torch.Tensor] = None
+):
+    """
+    A helper function to set a given tensor (parameter or buffer) of a module on a specific device (note that doing
+    `param.to(device)` creates a new tensor not linked to the parameter, which is why we need this function).
+
+    Args:
+        module (`torch.nn.Module`): The module in which the tensor we want to move lives.
+        tensor_name (`str`): The full name of the parameter/buffer.
+        device (`int`, `str` or `torch.device`): The device on which to set the tensor.
+        value (`torch.Tensor`, *optional*): The value of the tensor (useful when going from the meta device to any
+            other device).
+    """
+    # Recurse if needed
+    if "." in tensor_name:
+        splits = tensor_name.split(".")
+        for split in splits[:-1]:
+            new_module = getattr(module, split, None)
+            if new_module is None:
+                raise ValueError(f"{module} has no attribute {split}.")
+            module = new_module
+        tensor_name = splits[-1]
+
+    if tensor_name not in module._parameters and tensor_name not in module._buffers:
+        raise ValueError(f"{module} does not have a parameter or a buffer named {tensor_name}.")
+    is_buffer = tensor_name in module._buffers
+    old_value = getattr(module, tensor_name)
+
+    if old_value.device == torch.device("meta") and device not in ["meta", torch.device("meta")] and value is None:
+        raise ValueError(f"{tensor_name} is on the meta device, we need a `value` to put it on {device}.")
+
+    with torch.no_grad():
+        if value is None:
+            new_value = old_value.to(device)
+        elif isinstance(value, torch.Tensor):
+            new_value = value.to(device)
+        else:
+            new_value = torch.tensor(value, device=device)
+    if is_buffer:
+        module._buffers[tensor_name] = new_value
+    else:
+        new_value = nn.Parameter(new_value, requires_grad=old_value.requires_grad)
+        module._parameters[tensor_name] = new_value
+
+
+def named_module_tensors(module: nn.Module, include_buffers: bool = True, recurse: bool = False):
+    """
+    A helper function that gathers all the tensors (parameters + buffers) of a given module. If `include_buffers=True`
+    it's the same as doing `module.named_parameters(recurse=recurse) + module.named_buffers(recurse=recurse)`.
+
+    Args:
+        module (`torch.nn.Module`): The module whose tensors we want.
+        include_buffers (`bool`, *optional*, defaults to `True`): Whether or not to include the buffers in the result.
+        recurse (`bool`, *optional*, defaults to `False`):
+            Whether or not to go look in every submodule or just return the direct parameters and buffers.
+    """
+    for named_parameter in module.named_parameters(recurse=recurse):
+        yield named_parameter
+
+    if include_buffers:
+        for named_buffer in module.named_buffers(recurse=recurse):
+            yield named_buffer
+
+
+def find_tied_parameters(model: nn.Module, **kwargs):
+    """
+    Find the tied parameters in a given model.
+
+    Args:
+        model (`torch.nn.Module`): The model to inspect.
+
+    <Tip warning={true}>
+
+    The signature accepts keyword arguments, but they are for the recursive part of this function and you should ignore
+    them.
+
+    </Tip>
+
+    Example:
+
+
+    ```py
+    >>> from collections import OrderedDict
+    >>> import torch.nn as nn
+
+    >>> test_model = nn.Sequential(OrderedDict([("linear1", nn.Linear(4, 4)), ("linear2", nn.Linear(4, 4))]))
+    >>> model.linear2.weight = test_model.linear1.weight
+    >>> find_tied_parameters(test_model)
+    {'linear1.weight': 'linear2.weight'}
+    ```
+
+    Returns:
+        Dict[str, str]: A dictionary mapping tied parameter names to the name of the parameter they are tied to.
+    """
+    # Initialize result and named_parameters before recursing.
+    named_parameters = kwargs.get("named_parameters", None)
+    prefix = kwargs.get("prefix", "")
+    result = kwargs.get("result", {})
+
+    if named_parameters is None:
+        named_parameters = {n: p for n, p in model.named_parameters()}
+    else:
+        # A tied parameter will not be in the full `named_parameters` seen above but will be in the `named_parameters`
+        # of the submodule it belongs to. So while recursing we track the names that are not in the initial
+        # `named_parameters`.
+        for name, parameter in model.named_parameters():
+            full_name = name if prefix == "" else f"{prefix}.{name}"
+            if full_name not in named_parameters:
+                # When we find one, it has to be one of the existing parameters.
+                for new_name, new_param in named_parameters.items():
+                    if new_param is parameter:
+                        result[new_name] = full_name
+
+    # Once we have treated direct parameters, we move to the child modules.
+    for name, child in model.named_children():
+        child_name = name if prefix == "" else f"{prefix}.{name}"
+        find_tied_parameters(child, named_parameters=named_parameters, prefix=child_name, result=result)
+
+    return result
+
+
+def compute_module_sizes(model: nn.Module, dtype: Optional[Union[str, torch.dtype]] = None):
+    """
+    Compute the size of each submodule of a given model.
+    """
+    if isinstance(dtype, str):
+        # We accept "torch.float16" or just "float16"
+        dtype = dtype.replace("torch.", "")
+        dtype = getattr(torch, dtype)
+    if dtype is not None:
+        dtype_size = dtype_byte_size(dtype)
+    module_sizes = defaultdict(int)
+    for name, tensor in named_module_tensors(model, recurse=True):
+        if dtype is None:
+            size = tensor.numel() * dtype_byte_size(tensor.dtype)
+        else:
+            size = tensor.numel() * min(dtype_size, dtype_byte_size(tensor.dtype))
+        name_parts = name.split(".")
+        for idx in range(len(name_parts) + 1):
+            module_sizes[".".join(name_parts[:idx])] += size
+
+    return module_sizes
+
+
+def get_max_layer_size(
+    modules: List[Tuple[str, torch.nn.Module]], module_sizes: Dict[str, int], no_split_module_classes: List[str]
+):
+    """
+    Utility function that will scan a list of named modules and return the maximum size used by one full layer. The
+    definition of a layer being:
+    - a module with no direct children (just parameters and buffers)
+    - a module whose class name is in the list `no_split_module_classes`
+
+    Args:
+        modules (`List[Tuple[str, torch.nn.Module]]`):
+            The list of named modules where we want to determine the maximum layer size.
+        module_sizes (`Dict[str, int]`):
+            A dictionary mapping each layer name to its size (as generated by `compute_module_sizes`).
+        no_split_module_classes (`List[str]`):
+            A list of class names for layers we don't want to be split.
+
+    Returns:
+        `Tuple[int, List[str]]`: The maximum size of a layer with the list of layer names realizing that maximum size.
+    """
+    max_size = 0
+    layer_names = []
+    modules_to_treat = modules.copy()
+    while len(modules_to_treat) > 0:
+        module_name, module = modules_to_treat.pop(0)
+        modules_children = list(module.named_children())
+        if len(modules_children) == 0 or module.__class__.__name__ in no_split_module_classes:
+            # No splitting this one so we compare to the max_size
+            size = module_sizes[module_name]
+            if size > max_size:
+                max_size = size
+                layer_names = [module_name]
+            elif size == max_size:
+                layer_names.append(module_name)
+        else:
+            modules_to_treat = [(f"{module_name}.{n}", v) for n, v in modules_children] + modules_to_treat
+    return max_size, layer_names
+
+
+def get_max_memory(max_memory: Optional[Dict[Union[int, str], Union[int, str]]] = None):
+    """
+    Get the maximum memory available if nothing is passed, converts string to int otherwise.
+    """
+    import psutil
+
+    if max_memory is None:
+        if not torch.cuda.is_available():
+            max_memory = {}
+        else:
+            # Make sure CUDA is initialized on each GPU to have the right memory info.
+            for i in range(torch.cuda.device_count()):
+                _ = torch.tensor([0], device=i)
+            max_memory = {i: torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())}
+        max_memory["cpu"] = psutil.virtual_memory().available
+        return max_memory
+
+    for key in max_memory:
+        if isinstance(max_memory[key], str):
+            max_memory[key] = convert_file_size_to_int(max_memory[key])
+    return max_memory
+
+
+def clean_device_map(device_map: Dict[str, Union[int, str, torch.device]], module_name: str = ""):
+    """
+    Cleans a device_map by grouping all submodules that go on the same device together.
+    """
+    # Get the value of the current module and if there is only one split across several keys, regroup it.
+    prefix = "" if module_name == "" else f"{module_name}."
+    values = [v for k, v in device_map.items() if k.startswith(prefix)]
+    if len(set(values)) == 1 and len(values) > 1:
+        for k in [k for k in device_map if k.startswith(prefix)]:
+            del device_map[k]
+        device_map[module_name] = values[0]
+
+    # Recurse over the children
+    children_modules = [k for k in device_map.keys() if k.startswith(module_name) and len(k) > len(module_name)]
+    idx = len(module_name.split(".")) + 1 if len(module_name) > 0 else 1
+    children_modules = set(".".join(k.split(".")[:idx]) for k in children_modules)
+    for child in children_modules:
+        clean_device_map(device_map, module_name=child)
+
+    return device_map
+
+
+def load_offloaded_weights(model, index, offload_folder):
+    if index is None or len(index) == 0:
+        # Nothing to do
+        return
+
+    for param_name, metadata in index.items():
+        tensor_file = os.path.join(offload_folder, f"{param_name}.dat")
+        shape = tuple(metadata["shape"])
+        weight = np.memmap(tensor_file, dtype=metadata["dtype"], mode="r", shape=shape)
+        set_module_tensor_to_device(model, param_name, "cpu", value=torch.tensor(weight))
+
+
+def infer_auto_device_map(
+    model: nn.Module,
+    max_memory: Optional[Dict[Union[int, str], Union[int, str]]] = None,
+    no_split_module_classes: Optional[List[str]] = None,
+    dtype: Optional[Union[str, torch.dtype]] = None,
+):
+    """
+    Compute a device map for a given model giving priority to GPUs, then offload on CPU and finally offload to disk,
+    such that:
+    - we don't exceed the memory available on any of the GPUs.
+    - if offload to the CPU is needed, there is always room left on GPU 0 to put back the layer offloaded on CPU that
+      has the largest size.
+    - if offload to the CPU is needed, we don't exceed the RAM available on the CPU.
+    - if offload to the disk is needed, there is always room left on the CPU to put back the layer offloaded on disk
+      that has the largest size.
+
+    <Tip>
+
+    All computation is done analyzing sizes and dtypes of the model parameters. As a result, the model can be on the
+    meta device (as it would if initialized within the `init_empty_weights` context manager).
+
+    </Tip>
+
+    Args:
+        model (`torch.nn.Module`): The model to analyze.
+        max_memory (`Dict`, *optional*):
+            A dictionary mapping device identifiers to maximum memory. Will default to the maximum memory available if unset.
+        no_split_module_classes (`List[str]`, *optional*):
+            A list of layer class names that should never be split across devices (for instance any layer that has a
+            residual connection).
+        dtype (`str` or `torch.dtype`, *optional*):
+            If provided, the weights will be converted to that type when loaded.
+    """
+    # Get default / clean up max_memory
+    max_memory = get_max_memory(max_memory)
+    if no_split_module_classes is None:
+        no_split_module_classes = []
+    elif not isinstance(no_split_module_classes, (list, tuple)):
+        no_split_module_classes = [no_split_module_classes]
+
+    devices = list(max_memory.keys())
+    gpus = [device for device in devices if device != "cpu"]
+    if "disk" not in devices:
+        devices.append("disk")
+
+    # Devices that need to keep space for a potential offloaded layer.
+    main_devices = [gpus[0], "cpu"] if len(gpus) > 0 else ["cpu"]
+
+    module_sizes = compute_module_sizes(model, dtype=dtype)
+    tied_parameters = find_tied_parameters(model)
+
+    device_map = {}
+    current_device = 0
+    current_memory_used = 0
+
+    # Direct submodules and parameters
+    modules_to_treat = list(model.named_parameters(recurse=False)) + list(model.named_children())
+    # Initialize the size of the largest layer, to know how much space to keep free in memory
+    max_layer_size, max_layer_names = get_max_layer_size(modules_to_treat, module_sizes, no_split_module_classes)
+
+    # Ready ? This is going to be a bit messy.
+    while len(modules_to_treat) > 0:
+        name, module = modules_to_treat.pop(0)
+        # Max size in the remaining layers may have changed since we took one, so we maybe update it.
+        max_layer_names = [n for n in max_layer_names if not n.startswith(name)]
+        if len(max_layer_names) == 0:
+            max_layer_size, max_layer_names = get_max_layer_size(
+                [(n, m) for n, m in modules_to_treat if isinstance(m, torch.nn.Module)],
+                module_sizes,
+                no_split_module_classes,
+            )
+        # Assess size needed
+        module_size = module_sizes[name]
+        tied_params = [v for k, v in tied_parameters.items() if name in k]
+        # We ignore parameters that are tied when they're tied to more than one other parameter
+        tied_param = tied_params[0] if len(tied_params) == 1 else None
+
+        device = devices[current_device]
+        current_max_size = max_memory[device] if device != "disk" else None
+        # Reduce max size available by the largest layer.
+        if devices[current_device] in main_devices:
+            current_max_size = current_max_size - max_layer_size
+        # Case 1 -> We're too big!
+        if current_max_size is not None and current_memory_used + module_size > current_max_size:
+            # Split or not split?
+            modules_children = list(module.named_children())
+            if len(modules_children) == 0 or module.__class__.__name__ in no_split_module_classes:
+                # -> no split, we go to the next device
+                current_device += 1
+                modules_to_treat = [(name, module)] + modules_to_treat
+                current_memory_used = 0
+            else:
+                # -> split, we replace the module studied by its children + parameters
+                modules_children = list(module.named_parameters(recurse=False)) + modules_children
+                modules_to_treat = [(f"{name}.{n}", v) for n, v in modules_children] + modules_to_treat
+                # Update the max layer size.
+                max_layer_size, max_layer_names = get_max_layer_size(
+                    [(n, m) for n, m in modules_to_treat if isinstance(m, torch.nn.Module)],
+                    module_sizes,
+                    no_split_module_classes,
+                )
+
+        # Case 2, it fits! We're not entirely out of the woods though, because we may have some tied parameters.
+        elif tied_param is not None:
+            # Determine the size occupied by this module + the module containing the tied parameter
+            tied_module_size = module_size
+            tied_module_index = [i for i, (n, _) in enumerate(modules_to_treat) if n in tied_param][0]
+            tied_module_name, tied_module = modules_to_treat[tied_module_index]
+            tied_module_size += module_sizes[tied_module_name] - module_sizes[tied_param]
+            if current_max_size is not None and current_memory_used + tied_module_size > current_max_size:
+                # Split or not split?
+                tied_module_children = list(tied_module.named_children())
+                if len(tied_module_children) == 0 or tied_module.__class__.__name__ in no_split_module_classes:
+                    # If the tied module is not split, we go to the next device
+                    current_device += 1
+                    modules_to_treat = [(name, module)] + modules_to_treat
+                    current_memory_used = 0
+                else:
+                    # Otherwise, we replace the tied module by its children.
+                    tied_module_children = list(tied_module.named_parameters(recurse=False)) + tied_module_children
+                    tied_module_children = [(f"{tied_module_name}.{n}", v) for n, v in tied_module_children]
+                    modules_to_treat = (
+                        [(name, module)]
+                        + modules_to_treat[:tied_module_index]
+                        + tied_module_children
+                        + modules_to_treat[tied_module_index + 1 :]
+                    )
+                    # Update the max layer size.
+                    max_layer_size, max_layer_names = get_max_layer_size(
+                        [(n, m) for n, m in modules_to_treat if isinstance(m, torch.nn.Module)],
+                        module_sizes,
+                        no_split_module_classes,
+                    )
+            else:
+                # We really really fit!
+                current_memory_used += tied_module_size
+                device_map[name] = devices[current_device]
+                modules_to_treat.pop(tied_module_index)
+                device_map[tied_module_name] = devices[current_device]
+        else:
+            current_memory_used += module_size
+            device_map[name] = devices[current_device]
+
+    return clean_device_map(device_map)
+
+
+def check_device_map(model: nn.Module, device_map: Dict[str, Union[int, str, torch.device]]):
+    """
+    Checks a device map covers everything in a given model.
+
+    Args:
+        model (`torch.nn.Module`): The model to check the device map against.
+        device_map (`Dict[str, Union[int, str, torch.device]]`): The device map to check.
+    """
+    all_model_tensors = [name for name, _ in model.state_dict().items()]
+    for module_name in device_map.keys():
+        all_model_tensors = [name for name in all_model_tensors if not name.startswith(module_name)]
+    if len(all_model_tensors) > 0:
+        non_covered_params = ", ".join(all_model_tensors)
+        raise ValueError(
+            f"The device_map provided does not give any device for the following parameters: {non_covered_params}"
+        )
+
+
+def load_checkpoint_in_model(
+    model: nn.Module,
+    checkpoint: Union[str, os.PathLike],
+    device_map: Optional[Dict[str, Union[int, str, torch.device]]] = None,
+    offload_folder: Optional[Union[str, os.PathLike]] = None,
+    dtype: Optional[Union[str, torch.dtype]] = None,
+    offload_state_dict: bool = False,
+):
+    """
+    Loads a (potentially sharded) checkpoint inside a model, potentially sending weights to a given device as they are
+    loaded.
+
+    <Tip warning={true}>
+
+    Once loaded across devices, you still need to call [`dispatch_model`] on your model to make it able to run. To
+    group the checkpoint loading and dispatch in one single call, use [`load_checkpoint_and_dispatch`].
+
+    </Tip>
+
+    Args:
+        model (`torch.nn.Module`): The model in which we want to load a checkpoint.
+        checkpoint (`str` or `os.PathLike`):
+            The folder checkpoint to load. It can be:
+            - a path to a file containing a whole model state dict
+            - a path to a `.json` file containing the index to a sharded checkpoint
+            - a path to a folder containing a unique `.index.json` file and the shards of a checkpoint.
+        device_map (`Dict[str, Union[int, str, torch.device]]`, *optional*):
+            A map that specifies where each submodule should go. It doesn't need to be refined to each parameter/buffer
+            name, once a given module name is inside, every submodule of it will be sent to the same device.
+        offload_folder (`str` or `os.PathLike`, *optional*):
+            If the `device_map` contains any value `"disk"`, the folder where we will offload weights.
+        dtype (`str` or `torch.dtype`, *optional*):
+            If provided, the weights will be converted to that type when loaded.
+        offload_state_dict (`bool`, *optional*, defaults to `False`):
+            If `True`, will temporarily offload the CPU state dict on the hard drive to avoid running out of CPU RAM if
+            the weight of the CPU state dict + the biggest shard does not fit.
+    """
+    if offload_folder is None and device_map is not None and "disk" in device_map.values():
+        raise ValueError(
+            "At least one of the model submodules will be offloaded to disk, please pass along an `offload_folder`."
+        )
+    elif offload_folder is not None and device_map is not None and "disk" in device_map.values():
+        os.makedirs(offload_folder, exist_ok=True)
+
+    if isinstance(dtype, str):
+        # We accept "torch.float16" or just "float16"
+        dtype = dtype.replace("torch.", "")
+        dtype = getattr(torch, dtype)
+
+    checkpoint_files = None
+    index_filename = None
+    if os.path.isfile(checkpoint):
+        if str(checkpoint).endswith(".json"):
+            index_filename = checkpoint
+        else:
+            checkpoint_files = [checkpoint]
+    elif os.path.isdir(checkpoint):
+        potential_index = [f for f in os.listdir(checkpoint) if f.endswith(".index.json")]
+        if len(potential_index) == 0:
+            raise ValueError(f"{checkpoint} is not a folder containing a `.index.json` file.")
+        elif len(potential_index) == 1:
+            index_filename = os.path.join(checkpoint, potential_index[0])
+        else:
+            raise ValueError(f"{checkpoint} contains more than one `.index.json` file, delete the irrelevant ones.")
+    else:
+        raise ValueError(
+            "`checkpoint` should be the path to a file containing a whole state dict, or the index of a sharded "
+            f"checkpoint, or a folder containing a sharded checkpoint, but got {checkpoint}."
+        )
+
+    if index_filename is not None:
+        checkpoint_folder = os.path.split(index_filename)[0]
+        with open(index_filename, "r") as f:
+            index = json.loads(f.read())
+
+        if "weight_map" in index:
+            index = index["weight_map"]
+        checkpoint_files = sorted(list(set(index.values())))
+        checkpoint_files = [os.path.join(checkpoint_folder, f) for f in checkpoint_files]
+
+    # Logic for missing/unexpected keys goes here.
+
+    offload_index = {}
+    if offload_state_dict:
+        state_dict_folder = tempfile.mkdtemp()
+        state_dict_index = {}
+
+    for checkpoint_file in checkpoint_files:
+        checkpoint = torch.load(checkpoint_file)
+        if device_map is None:
+            model.load_state_dict(checkpoint, strict=False)
+        else:
+            for param_name, param in checkpoint.items():
+                module_name = param_name
+                if dtype is not None:
+                    param = param.to(dtype)
+                while len(module_name) > 0 and module_name not in device_map:
+                    module_name = ".".join(module_name.split(".")[:-1])
+                if module_name == "" and "" not in device_map:
+                    # TODO: group all errors and raise at the end.
+                    raise ValueError(f"{param_name} doesn't have any device set.")
+                param_device = device_map[module_name]
+
+                if param_device == "disk":
+                    set_module_tensor_to_device(model, param_name, "meta")
+                    offload_weight(param, param_name, offload_folder, index=offload_index)
+                elif param_device == "cpu" and offload_state_dict:
+                    set_module_tensor_to_device(model, param_name, "meta")
+                    offload_weight(param, param_name, state_dict_folder, index=state_dict_index)
+                else:
+                    set_module_tensor_to_device(model, param_name, param_device, value=param)
+
+        # Force Python to clean up.
+        del checkpoint
+        gc.collect()
+
+    save_offload_index(offload_index, offload_folder)
+
+    # Load back offloaded state dict on CPU
+    if offload_state_dict:
+        load_offloaded_weights(model, state_dict_index, state_dict_folder)
+        shutil.rmtree(state_dict_folder)
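
The device lookup in the loop above walks up the dotted parameter name until it finds a key of `device_map`. A minimal standalone sketch of that lookup, assuming a plain dictionary as the map; the helper name `resolve_device` and the sample map are hypothetical and not part of this diff:

```python
from typing import Dict, Union

Device = Union[int, str]


def resolve_device(param_name: str, device_map: Dict[str, Device]) -> Device:
    """Return the device of the closest ancestor module listed in `device_map`.

    Hypothetical helper: it mirrors the `while` loop in the diff above, nothing more.
    """
    module_name = param_name
    # Drop the last dotted component until a key of `device_map` matches.
    while len(module_name) > 0 and module_name not in device_map:
        module_name = ".".join(module_name.split(".")[:-1])
    if module_name == "" and "" not in device_map:
        raise ValueError(f"{param_name} doesn't have any device set.")
    return device_map[module_name]


# Sample map (an assumption for illustration): a more specific key wins over its parent.
device_map = {"encoder": 0, "encoder.layer.1": "cpu", "decoder": "disk"}

on_gpu = resolve_device("encoder.layer.0.weight", device_map)
on_cpu = resolve_device("encoder.layer.1.bias", device_map)
on_disk = resolve_device("decoder.embed.weight", device_map)
```

Because the lookup starts from the full parameter name, an entry for `encoder.layer.1` takes precedence over the broader `encoder` entry, and an empty-string key acts as a catch-all for the whole model.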
--- a/src/accelerate/utils/offload.py
+++ b/src/accelerate/utils/offload.py
@ -0,0 +1,171 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import json
+import os
+from collections.abc import Mapping
+from typing import Dict, List, Optional, Union
+
+import numpy as np
+import torch
+
+
+def offload_state_dict(save_dir: Union[str, os.PathLike], state_dict: Dict[str, torch.Tensor]):
+    """
+    Offload a state dict in a given folder.
+
+    Args:
+        save_dir (`str` or `os.PathLike`): The directory in which to offload the state dict.
+        state_dict (`Dict[str, torch.Tensor]`): The dictionary of tensors to offload.
+    """
+    os.makedirs(save_dir, exist_ok=True)
+    index = {}
+    for name, parameter in state_dict.items():
+        tensor_file = os.path.join(save_dir, f"{name}.dat")
+        array = parameter.numpy()
+        index[name] = {"dtype": str(array.dtype), "shape": list(array.shape)}
+        if array.ndim == 0:
+            array = array[None]
+        file_array = np.memmap(tensor_file, dtype=array.dtype, mode="w+", shape=array.shape)
+        file_array[:] = array[:]
+        file_array.flush()
+
+    # Update index
+    index_file = os.path.join(save_dir, "index.json")
+    if os.path.isfile(index_file):
+        with open(index_file, "r", encoding="utf-8") as f:
+            current_index = json.load(f)
+    else:
+        current_index = {}
+    current_index.update(index)
+
+    with open(index_file, "w", encoding="utf-8") as f:
+        json.dump(current_index, f, indent=2)
+
+
+def offload_weight(weight, weight_name, offload_folder, index=None):
+    array = weight.numpy()
+    tensor_file = os.path.join(offload_folder, f"{weight_name}.dat")
+    if index is not None:
+        index[weight_name] = {"dtype": str(array.dtype), "shape": list(array.shape)}
+    file_array = np.memmap(tensor_file, dtype=array.dtype, mode="w+", shape=array.shape)
+    file_array[:] = array[:]
+    file_array.flush()
+    return index
+
+
+def save_offload_index(index, offload_folder):
+    if index is None or len(index) == 0:
+        # Nothing to save
+        return
+
+    offload_index_file = os.path.join(offload_folder, "index.json")
+    if os.path.isfile(offload_index_file):
+        with open(offload_index_file, "r", encoding="utf-8") as f:
+            current_index = json.load(f)
+    else:
+        current_index = {}
+    current_index.update(index)
+
+    with open(offload_index_file, "w", encoding="utf-8") as f:
+        json.dump(current_index, f, indent=2)
+
+
+class PrefixedDataset(Mapping):
+    """
+    Will access keys in a given dataset by adding a prefix.
+
+    Args:
+        dataset (`Mapping`): Any map with string keys.
+        prefix (`str`): A prefix to add when trying to access any element in the underlying dataset.
+    """
+
+    def __init__(self, dataset: Mapping, prefix: str):
+        self.dataset = dataset
+        self.prefix = prefix
+
+    def __getitem__(self, key):
+        return self.dataset[f"{self.prefix}{key}"]
+
+    def __iter__(self):
+        return iter([key for key in self.dataset if key.startswith(self.prefix)])
+
+    def __len__(self):
+        return len(self.dataset)
+
+
+class OffloadedWeightsLoader(Mapping):
+    """
+    A collection that loads weights stored in a given state dict or memory-mapped on disk.
+
+    Args:
+        state_dict (`Dict[str, torch.Tensor]`, *optional*):
+            A dictionary mapping parameter names to tensors.
+        save_folder (`str` or `os.PathLike`, *optional*):
+            The directory in which the weights are stored (by `offload_state_dict` for instance).
+        index (`Dict`, *optional*):
+            A dictionary from weight name to their information (`dtype` and `shape`). Will default to the index saved
+            in `save_folder`.
+    """
+
+    def __init__(
+        self,
+        state_dict: Dict[str, torch.Tensor] = None,
+        save_folder: Optional[Union[str, os.PathLike]] = None,
+        index: Mapping = None,
+    ):
+        if state_dict is None and save_folder is None:
+            raise ValueError("Need either a `state_dict` or a `save_folder` containing offloaded weights.")
+
+        self.state_dict = {} if state_dict is None else state_dict
+        self.save_folder = save_folder
+        if index is None and save_folder is not None:
+            with open(os.path.join(save_folder, "index.json")) as f:
+                index = json.load(f)
+        self.index = {} if index is None else index
+        self.all_keys = list(self.state_dict.keys())
+        self.all_keys.extend([key for key in self.index if key not in self.all_keys])
+
+    def __getitem__(self, key: str):
+        # State dict gets priority
+        if key in self.state_dict:
+            return self.state_dict[key]
+        weight_info = self.index[key]
+        weight_file = os.path.join(self.save_folder, f"{key}.dat")
+        shape = tuple(weight_info["shape"])
+        if shape == ():
+            weight = np.memmap(weight_file, dtype=weight_info["dtype"], shape=(1,), mode="r")[0]
+        else:
+            weight = np.memmap(weight_file, dtype=weight_info["dtype"], shape=shape, mode="r")
+        return torch.tensor(weight)
+
+    def __iter__(self):
+        return iter(self.all_keys)
+
+    def __len__(self):
+        return len(self.all_keys)
+
+
+def extract_submodules_state_dict(state_dict: Dict[str, torch.Tensor], submodule_names: List[str]):
+    """
+    Extract the sub state-dict corresponding to a list of given submodules.
+
+    Args:
+        state_dict (`Dict[str, torch.Tensor]`): The state dict to extract from.
+        submodule_names (`List[str]`): The list of submodule names we want to extract.
+    """
+    result = {}
+    for module_name in submodule_names:
+        result.update({key: param for key, param in state_dict.items() if key.startswith(module_name)})
+    return result
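
Both `offload_state_dict` and `save_offload_index` above merge new entries into an existing `index.json` rather than overwriting it. A minimal sketch of that read-update-write step using only the standard library; the function name `merge_index` and the sample entries are hypothetical:

```python
import json
import os
import tempfile


def merge_index(folder: str, new_entries: dict) -> dict:
    """Merge `new_entries` into `folder/index.json` and return the merged index.

    Hypothetical helper following the same pattern as the diff above.
    """
    index_file = os.path.join(folder, "index.json")
    if os.path.isfile(index_file):
        with open(index_file, "r", encoding="utf-8") as f:
            current = json.load(f)
    else:
        current = {}
    # Entries with the same weight name are overwritten, others are kept.
    current.update(new_entries)
    with open(index_file, "w", encoding="utf-8") as f:
        json.dump(current, f, indent=2)
    return current


with tempfile.TemporaryDirectory() as folder:
    merge_index(folder, {"linear.weight": {"dtype": "float32", "shape": [4, 4]}})
    merged = merge_index(folder, {"linear.bias": {"dtype": "float32", "shape": [4]}})
    with open(os.path.join(folder, "index.json"), encoding="utf-8") as f:
        on_disk = json.load(f)
```

Merging this way lets several shards be offloaded to the same folder one after another while the index keeps describing all of them.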
--- a/src/accelerate/utils/operations.py
+++ b/src/accelerate/utils/operations.py
@ -1,4 +1,4 @@
-# Copyright 2021 The HuggingFace Team. All rights reserved.
+# Copyright 2022 The HuggingFace Team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@ -12,116 +12,34 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-import importlib
-import os
-import random
-from collections.abc import Mapping
-from contextlib import contextmanager
-from dataclasses import dataclass, field
-from enum import Enum
-from functools import update_wrapper
-from typing import Any, List, Optional, Union
+"""
+A set of basic tensor ops compatible with tpu, gpu, and multigpu
+"""
+
+
+from functools import update_wrapper
+from typing import Any, Mapping

-import numpy as np
 import torch
+from torch.distributed import ReduceOp

 from packaging import version

-from .state import AcceleratorState, DistributedType, is_deepspeed_available, is_tpu_available
+from ..state import AcceleratorState
+from .dataclasses import DistributedType, TensorInformation
+from .imports import is_tpu_available


 if is_tpu_available():
    import torch_xla.core.xla_model as xm


-def is_boto3_available():
-    return importlib.util.find_spec("boto3") is not None
+def is_torch_tensor(tensor):
+    return isinstance(tensor, torch.Tensor)


-def is_sagemaker_available():
-    return importlib.util.find_spec("sagemaker") is not None
-
-
-if is_deepspeed_available():
-    from deepspeed import DeepSpeedEngine
-
-SCALER_NAME = "scaler.pt"
-MODEL_NAME = "pytorch_model"
-RNG_STATE_NAME = "random_states"
-OPTIMIZER_NAME = "optimizer"
-
-
-class RNGType(Enum):
-    TORCH = "torch"
-    CUDA = "cuda"
-    XLA = "xla"
-    GENERATOR = "generator"
-
-
-@dataclass
-class TensorInformation:
-    shape: torch.Size
-    dtype: torch.dtype
-
-
-def set_seed(seed: int, device_specific: bool = False):
-    """
-    Helper function for reproducible behavior to set the seed in `random`, `numpy`, `torch`.
-
-    Args:
-        seed (`int`): The seed to set.
-        device_specific (`bool`, *optional*, defaults to `False`):
-            Whether to differ the seed on each device slightly with `self.process_index`.
-    """
-    if device_specific:
-        seed += AcceleratorState().process_index
-    random.seed(seed)
-    np.random.seed(seed)
-    torch.manual_seed(seed)
-    torch.cuda.manual_seed_all(seed)
-    # ^^ safe to call this function even if cuda is not available
-    if is_tpu_available():
-        xm.set_rng_state(seed)
-
-
-def synchronize_rng_state(rng_type: Optional[RNGType] = None, generator: Optional[torch.Generator] = None):
-    # Get the proper rng state
-    if rng_type == RNGType.TORCH:
-        rng_state = torch.get_rng_state()
-    elif rng_type == RNGType.CUDA:
-        rng_state = torch.cuda.get_rng_state()
-    elif rng_type == RNGType.XLA:
-        assert is_tpu_available(), "Can't synchronize XLA seeds on an environment without TPUs."
-        rng_state = torch.tensor(xm.get_rng_state())
-    elif rng_type == RNGType.GENERATOR:
-        assert generator is not None, "Need a generator to synchronize its seed."
-        rng_state = generator.get_state()
-
-    # Broadcast the rng state from device 0 to other devices
-    state = AcceleratorState()
-    if state.distributed_type == DistributedType.TPU:
-        rng_state = xm.mesh_reduce("random_seed", rng_state, lambda x: x[0])
-    elif state.distributed_type in [DistributedType.DEEPSPEED, DistributedType.MULTI_GPU]:
-        rng_state = rng_state.to(state.device)
-        torch.distributed.broadcast(rng_state, 0)
-        rng_state = rng_state.cpu()
-    elif state.distributed_type == DistributedType.MULTI_CPU:
-        torch.distributed.broadcast(rng_state, 0)
-
-    # Set the broadcast rng state
-    if rng_type == RNGType.TORCH:
-        torch.set_rng_state(rng_state)
-    elif rng_type == RNGType.CUDA:
-        torch.cuda.set_rng_state(rng_state)
-    elif rng_type == RNGType.XLA:
-        xm.set_rng_state(rng_state.item())
-    elif rng_type == RNGType.GENERATOR:
-        generator.set_state(rng_state)
-
-
-def synchronize_rng_states(rng_types: List[Union[str, RNGType]], generator: Optional[torch.Generator] = None):
-    for rng_type in rng_types:
-        synchronize_rng_state(RNGType(rng_type), generator=generator)
+def is_tensor_information(tensor_info):
+    return isinstance(tensor_info, TensorInformation)


 def honor_type(obj, generator):
@ -135,14 +53,6 @@ def honor_type(obj, generator):
        return type(obj)(*list(generator))


-def is_torch_tensor(tensor):
-    return isinstance(tensor, torch.Tensor)
-
-
-def is_tensor_information(tensor_info):
-    return isinstance(tensor_info, TensorInformation)
-
-
 def recursively_apply(func, data, *args, test_type=is_torch_tensor, error_on_other_type=False, **kwargs):
    """
    Recursively apply a function on a data structure that is a nested list/tuple/dictionary of a given base type.
@ -249,73 +159,24 @@ def initialize_tensors(data_structure):
    return recursively_apply(_initialize_tensor, data_structure, test_type=is_tensor_information)


-def convert_to_fp32(tensor):
+def find_batch_size(data):
    """
-    Recursively converts the elements nested list/tuple/dictionary of tensors in FP16/BF16 precision to FP32.
+    Recursively finds the batch size in a nested list/tuple/dictionary of lists of tensors.

    Args:
-        tensor (nested list/tuple/dictionary of `torch.Tensor`):
-            The data to convert from FP16/BF16 to FP32.
+        data (nested list/tuple/dictionary of `torch.Tensor`): The data from which to find the batch size.

    Returns:
-        The same data structure as `tensor` with all tensors that were in FP16/BF16 precision converted to FP32.
+        `int`: The batch size.
    """
-
-    def _convert_to_fp32(tensor):
-        return tensor.float()
-
-    def _is_fp16_bf16_tensor(tensor):
-        return hasattr(tensor, "dtype") and (
-            tensor.dtype == torch.float16
-            or (version.parse(torch.__version__) >= version.parse("1.10") and tensor.dtype == torch.bfloat16)
-        )
-
-    return recursively_apply(_convert_to_fp32, tensor, test_type=_is_fp16_bf16_tensor)
-
-
-class ConvertOutputsToFp32:
-    """
-    Decorator to apply to a function outputing tensors (like a model forward pass) that ensures the outputs in FP16
-    precision will be convert back to FP32.
-
-    Use a class instead of a decorator because otherwise, the prepared model can no longer be pickled (issue #273).
-
-    Args:
-        model_forward (`Callable`):
-            The function which outputs we want to treat.
-
-    Returns:
-        The same function as `model_forward` but with converted outputs.
-    """
-
-    def __init__(self, model_forward):
-        self.model_forward = model_forward
-        update_wrapper(self, model_forward)
-
-    def __call__(self, *args, **kwargs):
-        return convert_to_fp32(self.model_forward(*args, **kwargs))
-
-
-convert_outputs_to_fp32 = ConvertOutputsToFp32
-
-
-def extract_model_from_parallel(model):
-    """
-    Extract a model from its distributed containers.
-
-    Args:
-        model (`torch.nn.Module`): The model to extract.
-
-    Returns:
-        `torch.nn.Module`: The extracted model.
-    """
-    options = (torch.nn.parallel.DistributedDataParallel, torch.nn.DataParallel)
-    if is_deepspeed_available():
-        options += (DeepSpeedEngine,)
-
-    while isinstance(model, options):
-        model = model.module
-    return model
+    if isinstance(data, (tuple, list)):
+        return find_batch_size(data[0])
+    elif isinstance(data, Mapping):
+        for k in data.keys():
+            return find_batch_size(data[k])
+    elif not isinstance(data, torch.Tensor):
+        raise TypeError(f"Can only find the batch size of tensors but got {type(data)}.")
+    return data.shape[0]


 def _tpu_gather(tensor, name="gather tensor"):
@ -480,26 +341,6 @@ def slice_tensors(data, tensor_slice):
    return recursively_apply(_slice_tensor, data, tensor_slice)


-def find_batch_size(data):
-    """
-    Recursively finds the batch size in a nested list/tuple/dictionary of lists of tensors.
-
-    Args:
-        data (nested list/tuple/dictionary of `torch.Tensor`): The data from which to find the batch size.
-
-    Returns:
-        `int`: The batch size.
-    """
-    if isinstance(data, (tuple, list)):
-        return find_batch_size(data[0])
-    elif isinstance(data, Mapping):
-        for k in data.keys():
-            return find_batch_size(data[k])
-    elif not isinstance(data, torch.Tensor):
-        raise TypeError(f"Can only find the batch size of tensors but got {type(data)}.")
-    return data.shape[0]
-
-
 def concatenate(data, dim=0):
    """
    Recursively concatenate the tensors in a nested list/tuple/dictionary of lists of tensors with the same shape.
@ -568,147 +409,105 @@ def pad_across_processes(tensor, dim=0, pad_index=0, pad_first=False):
    )


-def wait_for_everyone():
+def reduce(tensor, reduction="mean"):
    """
-    Introduces a blocking point in the script, making sure all processes have reached this point before continuing.
-
-    <Tip warning={true}>
-
-    Make sure all processes will reach this instruction otherwise one of your processes will hang forever.
-
-    </Tip>
-    """
-    if (
-        AcceleratorState().distributed_type == DistributedType.MULTI_GPU
-        or AcceleratorState().distributed_type == DistributedType.MULTI_CPU
-        or AcceleratorState().distributed_type == DistributedType.DEEPSPEED
-    ):
-        torch.distributed.barrier()
-    elif AcceleratorState().distributed_type == DistributedType.TPU:
-        xm.rendezvous("accelerate.utils.wait_for_everyone")
-
-
-def save(obj, f):
-    """
-    Save the data to disk. Use in place of `torch.save()`.
+    Recursively reduce the tensors in a nested list/tuple/dictionary of lists of tensors across all processes with a
+    given reduction operation.

    Args:
-        obj: The data to save
-        f: The file (or file-like object) to use to save the data
+        tensor (nested list/tuple/dictionary of `torch.Tensor`):
+            The data to reduce.
+        reduction (`str`, *optional*, defaults to `"mean"`):
+            A reduction method. Can be one of "mean", "sum", or "none".
+
+    Returns:
+        The same data structure as `tensor` with all the tensors reduced.
    """
-    if AcceleratorState().distributed_type == DistributedType.TPU:
-        xm.save(obj, f)
-    elif AcceleratorState().local_process_index == 0:
-        torch.save(obj, f)
+
+    def _reduce_across_processes(tensor, reduction="mean"):
+        state = AcceleratorState()
+        cloned_tensor = tensor.clone()
+        if state.distributed_type == DistributedType.TPU:
+            xm.all_reduce("sum", cloned_tensor)
+            return cloned_tensor
+        elif state.distributed_type in [DistributedType.DEEPSPEED, DistributedType.MULTI_GPU]:
+            torch.distributed.all_reduce(cloned_tensor, ReduceOp.SUM)
+            return cloned_tensor
+        else:
+            if reduction == "sum":
+                return cloned_tensor.sum()
+            else:
+                return cloned_tensor.mean()
+
+    return recursively_apply(_reduce_across_processes, tensor, error_on_other_type=True, reduction=reduction)


-class PrepareForLaunch:
+def convert_to_fp32(tensor):
    """
-    Prepare a function that will launched in a distributed setup.
+    Recursively converts the elements of a nested list/tuple/dictionary of tensors in FP16/BF16 precision to FP32.

    Args:
-        launcher (`Callable`):
-            The function to launch.
-        distributed_type ([`~state.DistributedType`]):
-            The distributed type to prepare for.
-        debug (`bool`, *optional*, defaults to `False`):
-            Whether or not this is a debug launch.
+        tensor (nested list/tuple/dictionary of `torch.Tensor`):
+            The data to convert from FP16/BF16 to FP32.
+
+    Returns:
+        The same data structure as `tensor` with all tensors that were in FP16/BF16 precision converted to FP32.
    """

-    def __init__(self, launcher, distributed_type="NO", debug=False):
-        self.launcher = launcher
-        self.distributed_type = DistributedType(distributed_type)
-        self.debug = debug
+    def _convert_to_fp32(tensor):
+        return tensor.float()

-    def __call__(self, index, *args):
-        if self.debug:
-            world_size = int(os.environ.get("WORLD_SIZE"))
-            rdv_file = os.environ.get("ACCELERATE_DEBUG_RDV_FILE")
-            torch.distributed.init_process_group(
-                "gloo",
-                rank=index,
-                store=torch.distributed.FileStore(rdv_file, world_size),
-                world_size=world_size,
-            )
-        elif self.distributed_type == DistributedType.MULTI_GPU or self.distributed_type == DistributedType.MULTI_CPU:
-            # Prepare the environment for torch.distributed
-            os.environ["LOCAL_RANK"] = str(index)
-            os.environ["RANK"] = str(index)
+    def _is_fp16_bf16_tensor(tensor):
+        return hasattr(tensor, "dtype") and (
+            tensor.dtype == torch.float16
+            or (version.parse(torch.__version__) >= version.parse("1.10") and tensor.dtype == torch.bfloat16)
+        )

-        self.launcher(*args)
+    return recursively_apply(_convert_to_fp32, tensor, test_type=_is_fp16_bf16_tensor)


-@dataclass
-class DeepSpeedPlugin:
-
-    gradient_accumulation_steps: int = field(
-        default=None, metadata={"help": "Number of steps to accumulate gradients before updating optimizer states"}
-    )
-    zero_stage: int = field(
-        default=None,
-        metadata={"help": "Possible options are 0,1,2,3; Default will be taken from environment variable"},
-    )
-    is_train_batch_min: str = field(
-        default=True,
-        metadata={"help": "If both train & eval dataloaders are specified, this will decide the train_batch_size"},
-    )
-
-    auto_opt_mapping: bool = field(
-        default=True,
-        metadata={"help": "whether to map torch.adam to deepspeed optimizer version of adam based on config"},
-    )
-
-    offload_optimizer_device: bool = field(default=None, metadata={"help": "Possible options are none|cpu|nvme"})
-
-    def __post_init__(self):
-
-        if self.gradient_accumulation_steps is None:
-            self.gradient_accumulation_steps = int(os.environ.get("GRADIENT_ACCUMULATION_STEPS", 1))
-
-        if self.zero_stage is None:
-            self.zero_stage = int(os.environ.get("DEEPSPEED_ZERO_STAGE", 2))
-
-        if self.offload_optimizer_device is None:
-            self.offload_optimizer_device = os.environ.get("DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE", "none")
-
-        self.deepspeed_config = {
-            "train_batch_size": None,
-            "gradient_accumulation_steps": self.gradient_accumulation_steps,
-            "zero_optimization": {
-                "stage": self.zero_stage,
-                "offload_optimizer": {
-                    "device": self.offload_optimizer_device,
-                },
-            },
-            "steps_per_print": float("inf"),  # this will stop deepspeed from logging @ stdout
-            "zero_allow_untested_optimizer": True,
-        }
-
-
-@contextmanager
-def patch_environment(**kwargs):
+class ConvertOutputsToFp32:
    """
-    A context manager that will add each keyword argument passed to `os.environ` and remove them when exiting.
+    Decorator to apply to a function outputting tensors (like a model forward pass) that ensures the outputs in FP16
+    precision will be converted back to FP32.

-    Will convert the values in `kwargs` to strings and upper-case all the keys.
+    Use a class instead of a decorator because otherwise, the prepared model can no longer be pickled (issue #273).
+
+    Args:
+        model_forward (`Callable`):
+            The function whose outputs we want to treat.
+
+    Returns:
+        The same function as `model_forward` but with converted outputs.
    """
-    for key, value in kwargs.items():
-        os.environ[key.upper()] = str(value)

-    yield
+    def __init__(self, model_forward):
+        self.model_forward = model_forward
+        update_wrapper(self, model_forward)

-    for key in kwargs:
-        del os.environ[key.upper()]
+    def __call__(self, *args, **kwargs):
+        return convert_to_fp32(self.model_forward(*args, **kwargs))


-def get_pretty_name(obj):
+convert_outputs_to_fp32 = ConvertOutputsToFp32
+
+
+def find_device(data):
    """
-    Gets a pretty name from `obj`.
+    Finds the device on which a nested dict/list/tuple of tensors lies (assuming they are all on the same device).
+
+    Args:
+        data (nested list/tuple/dictionary of `torch.Tensor`): The data we want to know the device of.
    """
-    if not hasattr(obj, "__qualname__") and not hasattr(obj, "__name__"):
-        obj = getattr(obj, "__class__", obj)
-    if hasattr(obj, "__qualname__"):
-        return obj.__qualname__
-    if hasattr(obj, "__name__"):
-        return obj.__name__
-    return str(obj)
+    if isinstance(data, Mapping):
+        for obj in data.values():
+            device = find_device(obj)
+            if device is not None:
+                return device
+    elif isinstance(data, (tuple, list)):
+        for obj in data:
+            device = find_device(obj)
+            if device is not None:
+                return device
+    elif isinstance(data, torch.Tensor):
+        return data.device
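
`find_batch_size` and `find_device` above share the same recursive walk over nested lists, tuples and mappings. A minimal sketch of that traversal that runs without PyTorch, assuming a hypothetical `FakeTensor` stand-in that only models `shape` and `device`:

```python
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class FakeTensor:
    """Hypothetical stand-in for `torch.Tensor`; only `shape` and `device` are modelled."""

    shape: Tuple[int, ...]
    device: str = "cpu"


def find_batch_size(data: Any) -> int:
    # Follow the first element/value until a tensor-like leaf is reached.
    if isinstance(data, (tuple, list)):
        return find_batch_size(data[0])
    elif isinstance(data, Mapping):
        for k in data.keys():
            return find_batch_size(data[k])
    elif not isinstance(data, FakeTensor):
        raise TypeError(f"Can only find the batch size of tensors but got {type(data)}.")
    return data.shape[0]


def find_device(data: Any) -> Optional[str]:
    # Return the device of the first tensor-like leaf, or None if there is none.
    if isinstance(data, Mapping):
        for obj in data.values():
            device = find_device(obj)
            if device is not None:
                return device
    elif isinstance(data, (tuple, list)):
        for obj in data:
            device = find_device(obj)
            if device is not None:
                return device
    elif isinstance(data, FakeTensor):
        return data.device
    return None


batch = {"input_ids": FakeTensor((8, 128), "cuda:0"), "labels": [FakeTensor((8,), "cuda:0")]}
batch_size = find_batch_size(batch)
device = find_device(batch)
no_device = find_device({"meta": "text", "ids": [1, 2]})
```

The two helpers differ in how they treat non-tensor leaves: the batch-size lookup raises a `TypeError`, while the device lookup skips them and keeps searching.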
--- a/src/accelerate/utils/other.py
+++ b/src/accelerate/utils/other.py
@ -0,0 +1,156 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+from contextlib import contextmanager
+from pathlib import Path
+
+import torch
+
+from ..commands.config.cluster import ClusterConfig
+from ..commands.config.config_args import default_json_config_file
+from ..state import AcceleratorState
+from .dataclasses import DistributedType
+from .imports import is_deepspeed_available, is_tpu_available
+
+
+if is_deepspeed_available():
+    from deepspeed import DeepSpeedEngine
+
+if is_tpu_available():
+    import torch_xla.core.xla_model as xm
+
+
+def extract_model_from_parallel(model):
+    """
+    Extract a model from its distributed containers.
+
+    Args:
+        model (`torch.nn.Module`): The model to extract.
+
+    Returns:
+        `torch.nn.Module`: The extracted model.
+    """
+    options = (torch.nn.parallel.DistributedDataParallel, torch.nn.DataParallel)
+    if is_deepspeed_available():
+        options += (DeepSpeedEngine,)
+
+    while isinstance(model, options):
+        model = model.module
+    return model
+
+
+def wait_for_everyone():
+    """
+    Introduces a blocking point in the script, making sure all processes have reached this point before continuing.
+
+    <Tip warning={true}>
+
+    Make sure all processes will reach this instruction otherwise one of your processes will hang forever.
+
+    </Tip>
+    """
+    if (
+        AcceleratorState().distributed_type == DistributedType.MULTI_GPU
+        or AcceleratorState().distributed_type == DistributedType.MULTI_CPU
+        or AcceleratorState().distributed_type == DistributedType.DEEPSPEED
+    ):
+        torch.distributed.barrier()
+    elif AcceleratorState().distributed_type == DistributedType.TPU:
+        xm.rendezvous("accelerate.utils.wait_for_everyone")
+
+
+def save(obj, f):
+    """
+    Save the data to disk. Use in place of `torch.save()`.
+
+    Args:
+        obj: The data to save
+        f: The file (or file-like object) to use to save the data
+    """
+    if AcceleratorState().distributed_type == DistributedType.TPU:
+        xm.save(obj, f)
+    elif AcceleratorState().local_process_index == 0:
+        torch.save(obj, f)
+
+
+@contextmanager
+def patch_environment(**kwargs):
+    """
+    A context manager that will add each keyword argument passed to `os.environ` and remove them when exiting.
+
+    Will convert the values in `kwargs` to strings and upper-case all the keys.
+    """
+    for key, value in kwargs.items():
+        os.environ[key.upper()] = str(value)
+
+    yield
+
+    for key in kwargs:
+        del os.environ[key.upper()]
+
+
+def get_pretty_name(obj):
+    """
+    Gets a pretty name from `obj`.
+    """
+    if not hasattr(obj, "__qualname__") and not hasattr(obj, "__name__"):
+        obj = getattr(obj, "__class__", obj)
+    if hasattr(obj, "__qualname__"):
+        return obj.__qualname__
+    if hasattr(obj, "__name__"):
+        return obj.__name__
+    return str(obj)
+
+
+def write_basic_config(mixed_precision="no", save_location: str = default_json_config_file):
+    """
+    Creates and saves a basic cluster config to be used on a local machine with potentially multiple GPUs. Will also
+    set CPU if it is a CPU-only machine.
+
+    Args:
+        mixed_precision (`str`, *optional*, defaults to "no"):
+            Mixed Precision to use. Should be one of "no", "fp16", or "bf16"
+        save_location (`str`, *optional*, defaults to `default_json_config_file`):
+            Optional custom save location. Should be passed to `--config_file` when using `accelerate launch`. Default
+            location is inside the huggingface cache folder (`~/.cache/huggingface`) but can be overridden by setting
+            the `HF_HOME` environment variable, followed by `accelerate/default_config.yaml`.
+    """
+    path = Path(save_location)
+    path.parent.mkdir(parents=True, exist_ok=True)
+    if path.exists():
+        print(
+            f"Configuration already exists at {save_location}, will not override. Run `accelerate config` manually or pass a different `save_location`."
+        )
+        return
+    mixed_precision = mixed_precision.lower()
+    if mixed_precision not in ["no", "fp16", "bf16"]:
+        raise ValueError(f"`mixed_precision` should be one of 'no', 'fp16', or 'bf16'. Received {mixed_precision}")
+    config = {"compute_environment": "LOCAL_MACHINE", "mixed_precision": mixed_precision}
+    if torch.cuda.is_available():
+        num_gpus = torch.cuda.device_count()
+        config["num_processes"] = num_gpus
+        config["use_cpu"] = False
+        if num_gpus > 1:
+            config["distributed_type"] = "MULTI_GPU"
+        else:
+            config["distributed_type"] = "NO"
+    else:
+        num_gpus = 0
+        config["use_cpu"] = True
+        config["num_processes"] = 1
+        config["distributed_type"] = "NO"
+    if not path.exists():
+        config = ClusterConfig(**config)
+        config.to_json_file(path)
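Review note on `patch_environment` above: the sketch below is a hypothetical, stricter variant and not the library's code. The name `patched_env` and the `SKETCH_*` variable names are assumptions made for illustration. It shows two behaviors a caller might want from such a context manager: cleanup that still runs when the `with` body raises, and restoration of any value that was already set before entry.

```python
import os
from contextlib import contextmanager


@contextmanager
def patched_env(**kwargs):
    """Hypothetical variant of `patch_environment` (illustration only)."""
    keys = [key.upper() for key in kwargs]
    # Remember prior values so pre-existing variables can be restored on exit.
    previous = {key: os.environ.get(key) for key in keys}
    for key, value in kwargs.items():
        os.environ[key.upper()] = str(value)
    try:
        yield
    finally:
        # Runs on normal exit and when the `with` body raises.
        for key, old in previous.items():
            if old is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old


with patched_env(sketch_demo_flag=1):
    inside = os.environ["SKETCH_DEMO_FLAG"]  # key is upper-cased, value is a string
after = os.environ.get("SKETCH_DEMO_FLAG")  # removed again once the block exits
```

Whether restoring prior values is desirable depends on the caller; the version in the diff only promises to remove the keys it added.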
--- a/src/accelerate/utils/random.py
+++ b/src/accelerate/utils/random.py
@ -0,0 +1,87 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import random
+from typing import List, Optional, Union
+
+import numpy as np
+import torch
+
+from ..state import AcceleratorState
+from .dataclasses import DistributedType, RNGType
+from .imports import is_tpu_available
+
+
+if is_tpu_available():
+    import torch_xla.core.xla_model as xm
+
+
+def set_seed(seed: int, device_specific: bool = False):
+    """
+    Helper function for reproducible behavior to set the seed in `random`, `numpy`, `torch`.
+
+    Args:
+        seed (`int`): The seed to set.
+        device_specific (`bool`, *optional*, defaults to `False`):
+            Whether to offset the seed on each device by its process index (`AcceleratorState().process_index`).
+    """
+    if device_specific:
+        seed += AcceleratorState().process_index
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    torch.cuda.manual_seed_all(seed)
+    # ^^ safe to call this function even if cuda is not available
+    if is_tpu_available():
+        xm.set_rng_state(seed)
+
+
+def synchronize_rng_state(rng_type: Optional[RNGType] = None, generator: Optional[torch.Generator] = None):
+    # Get the proper rng state
+    if rng_type == RNGType.TORCH:
+        rng_state = torch.get_rng_state()
+    elif rng_type == RNGType.CUDA:
+        rng_state = torch.cuda.get_rng_state()
+    elif rng_type == RNGType.XLA:
+        assert is_tpu_available(), "Can't synchronize XLA seeds on an environment without TPUs."
+        rng_state = torch.tensor(xm.get_rng_state())
+    elif rng_type == RNGType.GENERATOR:
+        assert generator is not None, "Need a generator to synchronize its seed."
+        rng_state = generator.get_state()
+
+    # Broadcast the rng state from device 0 to other devices
+    state = AcceleratorState()
+    if state.distributed_type == DistributedType.TPU:
+        rng_state = xm.mesh_reduce("random_seed", rng_state, lambda x: x[0])
+    elif state.distributed_type in [DistributedType.DEEPSPEED, DistributedType.MULTI_GPU]:
+        rng_state = rng_state.to(state.device)
+        torch.distributed.broadcast(rng_state, 0)
+        rng_state = rng_state.cpu()
+    elif state.distributed_type == DistributedType.MULTI_CPU:
+        torch.distributed.broadcast(rng_state, 0)
+
+    # Set the broadcast rng state
+    if rng_type == RNGType.TORCH:
+        torch.set_rng_state(rng_state)
+    elif rng_type == RNGType.CUDA:
+        torch.cuda.set_rng_state(rng_state)
+    elif rng_type == RNGType.XLA:
+        xm.set_rng_state(rng_state.item())
+    elif rng_type == RNGType.GENERATOR:
+        generator.set_state(rng_state)
+
+
+def synchronize_rng_states(rng_types: List[Union[str, RNGType]], generator: Optional[torch.Generator] = None):
+    for rng_type in rng_types:
+        synchronize_rng_state(RNGType(rng_type), generator=generator)
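Review note on `set_seed` above: the effect of `device_specific` can be illustrated with the standard library alone. This is a minimal sketch, assuming a hypothetical helper `seed_for_process` that mirrors the `seed += process_index` branch; it does not touch `numpy`, `torch`, or `AcceleratorState`, and the rank values are made up.

```python
import random


def seed_for_process(seed: int, process_index: int, device_specific: bool = False) -> int:
    """Hypothetical helper mirroring the `device_specific` branch of `set_seed`."""
    return seed + process_index if device_specific else seed


def draw(seed: int, n: int = 3):
    # Use a private generator so the example leaves global RNG state alone.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# Two simulated processes (ranks 0 and 1) starting from the same base seed.
shared = [draw(seed_for_process(42, rank)) for rank in range(2)]
per_device = [draw(seed_for_process(42, rank, device_specific=True)) for rank in range(2)]
```

With the default, both simulated processes produce the same stream; with `device_specific=True` each one gets its own reproducible stream, which is typically what per-device data augmentation needs.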
--- a/src/accelerate/utils/versions.py
+++ b/src/accelerate/utils/versions.py
@ -0,0 +1,25 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import sys
+
+from packaging.version import parse
+
+
+if sys.version_info < (3, 8):
+    import importlib_metadata
+else:
+    import importlib.metadata as importlib_metadata
+
+torch_version = parse(importlib_metadata.version("torch"))
--- a/tests/test_big_modeling.py
+++ b/tests/test_big_modeling.py
@ -0,0 +1,276 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import unittest
+from tempfile import TemporaryDirectory
+
+import torch
+import torch.nn as nn
+
+from accelerate.big_modeling import (
+    cpu_offload,
+    disk_offload,
+    dispatch_model,
+    init_empty_weights,
+    load_checkpoint_and_dispatch,
+)
+from accelerate.hooks import remove_hook_from_submodules
+from accelerate.test_utils import require_cuda, require_multi_gpu, slow
+from accelerate.utils import offload_state_dict
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+
+class ModelForTest(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.linear1 = nn.Linear(3, 4)
+        self.batchnorm = nn.BatchNorm1d(4)
+        self.linear2 = nn.Linear(4, 5)
+
+    def forward(self, x):
+        return self.linear2(self.batchnorm(self.linear1(x)))
+
+
+class BiggerModelForTest(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.linear1 = nn.Linear(3, 4)
+        self.linear2 = nn.Linear(4, 5)
+        self.batchnorm = nn.BatchNorm1d(5)
+        self.linear3 = nn.Linear(5, 6)
+        self.linear4 = nn.Linear(6, 5)
+
+    def forward(self, x):
+        return self.linear4(self.linear3(self.batchnorm(self.linear2(self.linear1(x)))))
+
+
+class BigModelingTester(unittest.TestCase):
+    def test_init_empty_weights(self):
+        # base use
+        with init_empty_weights():
+            module = nn.Linear(4, 5)
+        self.assertEqual(module.weight.device, torch.device("meta"))
+
+        # base use with buffers, they are not touched
+        with init_empty_weights():
+            module = nn.BatchNorm1d(4)
+        self.assertEqual(module.weight.device, torch.device("meta"))
+        self.assertEqual(module.running_mean.device, torch.device("cpu"))
+
+        # Use with include_buffers=True
+        with init_empty_weights(include_buffers=True):
+            module = nn.BatchNorm1d(4)
+        self.assertEqual(module.weight.device, torch.device("meta"))
+        self.assertEqual(module.running_mean.device, torch.device("meta"))
+
+        # Double check we didn't break PyTorch
+        module = nn.BatchNorm1d(4)
+        self.assertEqual(module.weight.device, torch.device("cpu"))
+        self.assertEqual(module.running_mean.device, torch.device("cpu"))
+
+    def test_init_empty_weights_very_large_model(self):
+        # This is a 100-billion-parameter model.
+        with init_empty_weights():
+            _ = nn.Sequential(*[nn.Linear(10000, 10000) for _ in range(1000)])
+
+    def test_cpu_offload(self):
+        model = ModelForTest()
+        x = torch.randn(2, 3)
+        expected = model(x)
+
+        device = torch.device(0 if torch.cuda.is_available() else "cpu")
+
+        cpu_offload(model, execution_device=device)
+        output = model(x)
+        self.assertTrue(torch.allclose(expected, output.cpu()))
+
+        # Clean up for next test.
+        remove_hook_from_submodules(model)
+
+        cpu_offload(model, execution_device=device, offload_buffers=True)
+        output = model(x)
+        self.assertTrue(torch.allclose(expected, output.cpu()))
+
+    @slow
+    @require_cuda
+    def test_cpu_offload_gpt2(self):
+        tokenizer = AutoTokenizer.from_pretrained("gpt2")
+        inputs = tokenizer("Hello world! My name is", return_tensors="pt").to(0)
+
+        gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
+        cpu_offload(gpt2, execution_device=0)
+        outputs = gpt2.generate(inputs["input_ids"])
+        self.assertEqual(
+            tokenizer.decode(outputs[0].tolist()),
+            "Hello world! My name is Kiyoshi, and I'm a student at the University of Tokyo",
+        )
+
+    def test_disk_offload(self):
+        model = ModelForTest()
+        x = torch.randn(2, 3)
+        expected = model(x)
+
+        device = torch.device(0 if torch.cuda.is_available() else "cpu")
+
+        with TemporaryDirectory() as tmp_dir:
+            disk_offload(model, tmp_dir, execution_device=device)
+            output = model(x)
+            self.assertTrue(torch.allclose(expected, output.cpu()))
+
+            # Clean up for next test.
+            remove_hook_from_submodules(model)
+
+        with TemporaryDirectory() as tmp_dir:
+            disk_offload(model, tmp_dir, execution_device=device, offload_buffers=True)
+            output = model(x)
+            self.assertTrue(torch.allclose(expected, output.cpu()))
+
+    @slow
+    @require_cuda
+    def test_disk_offload_gpt2(self):
+        tokenizer = AutoTokenizer.from_pretrained("gpt2")
+        inputs = tokenizer("Hello world! My name is", return_tensors="pt").to(0)
+
+        gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
+        with TemporaryDirectory() as tmp_dir:
+            disk_offload(gpt2, tmp_dir, execution_device=0)
+            outputs = gpt2.generate(inputs["input_ids"])
+            self.assertEqual(
+                tokenizer.decode(outputs[0].tolist()),
+                "Hello world! My name is Kiyoshi, and I'm a student at the University of Tokyo",
+            )
+
+    @require_cuda
+    def test_dispatch_model(self):
+        model = ModelForTest()
+        device_map = {"linear1": "disk", "batchnorm": "cpu", "linear2": 0}
+
+        x = torch.randn(2, 3)
+        expected = model(x)
+
+        with TemporaryDirectory() as tmp_dir:
+            dispatch_model(model, device_map, offload_dir=tmp_dir)
+            output = model(x)
+            self.assertTrue(torch.allclose(expected, output.cpu(), atol=1e-5))
+
+    @require_multi_gpu
+    def test_dispatch_model_multi_gpu(self):
+        model = BiggerModelForTest()
+        device_map = {"linear1": "cpu", "linear2": "disk", "batchnorm": "cpu", "linear3": 0, "linear4": 1}
+
+        x = torch.randn(2, 3)
+        expected = model(x)
+
+        with TemporaryDirectory() as tmp_dir:
+            dispatch_model(model, device_map, offload_dir=tmp_dir)
+            output = model(x)
+            self.assertTrue(torch.allclose(expected, output.cpu(), atol=1e-5))
+
+    @slow
+    @require_multi_gpu
+    def test_dispatch_model_gpt2_on_two_gpus(self):
+        tokenizer = AutoTokenizer.from_pretrained("gpt2")
+        inputs = tokenizer("Hello world! My name is", return_tensors="pt").to(0)
+
+        gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
+        # Dispatch on GPUs 0 and 1
+        device_map = {
+            "transformer.wte": 0,
+            "transformer.wpe": 0,
+            "transformer.ln_f": 1,
+            "lm_head": 1,
+        }
+        for i in range(12):
+            device_map[f"transformer.h.{i}"] = 0 if i <= 5 else 1
+
+        gpt2 = dispatch_model(gpt2, device_map)
+        outputs = gpt2.generate(inputs["input_ids"])
+        self.assertEqual(
+            tokenizer.decode(outputs[0].tolist()),
+            "Hello world! My name is Kiyoshi, and I'm a student at the University of Tokyo",
+        )
+
+        # Dispatch with a bit of CPU offload
+        gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
+        for i in range(4):
+            device_map[f"transformer.h.{i}"] = "cpu"
+        gpt2 = dispatch_model(gpt2, device_map)
+        outputs = gpt2.generate(inputs["input_ids"])
+        self.assertEqual(
+            tokenizer.decode(outputs[0].tolist()),
+            "Hello world! My name is Kiyoshi, and I'm a student at the University of Tokyo",
+        )
+        # Dispatch with a bit of CPU and disk offload
+        gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
+        for i in range(2):
+            device_map[f"transformer.h.{i}"] = "disk"
+
+        with TemporaryDirectory() as tmp_dir:
+            state_dict = {
+                k: p for k, p in gpt2.state_dict().items() if "transformer.h.0" in k or "transformer.h.1" in k
+            }
+            offload_state_dict(tmp_dir, state_dict)
+            gpt2 = dispatch_model(gpt2, device_map, offload_dir=tmp_dir)
+            outputs = gpt2.generate(inputs["input_ids"])
+            self.assertEqual(
+                tokenizer.decode(outputs[0].tolist()),
+                "Hello world! My name is Kiyoshi, and I'm a student at the University of Tokyo",
+            )
+
+    @require_cuda
+    def test_load_checkpoint_and_dispatch(self):
+        model = ModelForTest()
+        device_map = {"linear1": "cpu", "batchnorm": "cpu", "linear2": 0}
+
+        x = torch.randn(2, 3)
+        expected = model(x)
+
+        with TemporaryDirectory() as tmp_dir:
+            checkpoint = os.path.join(tmp_dir, "pt_model.bin")
+            torch.save(model.state_dict(), checkpoint)
+
+            new_model = ModelForTest()
+            new_model = load_checkpoint_and_dispatch(new_model, checkpoint, device_map=device_map)
+
+        # CPU-offloaded weights are on the meta device while waiting for the forward pass.
+        self.assertEqual(new_model.linear1.weight.device, torch.device("meta"))
+        self.assertEqual(new_model.linear2.weight.device, torch.device(0))
+
+        output = new_model(x)
+        self.assertTrue(torch.allclose(expected, output.cpu(), atol=1e-5))
+
+    @require_multi_gpu
+    def test_load_checkpoint_and_dispatch_multi_gpu(self):
+        model = BiggerModelForTest()
+        device_map = {"linear1": "cpu", "linear2": "cpu", "batchnorm": 0, "linear3": 0, "linear4": 1}
+
+        x = torch.randn(2, 3)
+        expected = model(x)
+
+        with TemporaryDirectory() as tmp_dir:
+            checkpoint = os.path.join(tmp_dir, "pt_model.bin")
+            torch.save(model.state_dict(), checkpoint)
+
+            new_model = BiggerModelForTest()
+            new_model = load_checkpoint_and_dispatch(new_model, checkpoint, device_map=device_map)
+
+        # CPU-offloaded weights are on the meta device while waiting for the forward pass.
+        self.assertEqual(new_model.linear1.weight.device, torch.device("meta"))
+        self.assertEqual(new_model.linear2.weight.device, torch.device("meta"))
+        self.assertEqual(new_model.linear3.weight.device, torch.device(0))
+        self.assertEqual(new_model.linear4.weight.device, torch.device(1))
+
+        output = new_model(x)
+        self.assertTrue(torch.allclose(expected, output.cpu(), atol=1e-5))
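Review note on `test_dispatch_model_gpt2_on_two_gpus` above: the test mutates one `device_map` dict across its three phases, so the CPU and disk overrides accumulate. The sketch below replays only the dictionary logic in plain Python; `gpt2_device_map` is a hypothetical helper name, and no model or GPU is involved.

```python
def gpt2_device_map(num_layers: int = 12):
    """Hypothetical helper reproducing the map built in the two-GPU test."""
    device_map = {
        "transformer.wte": 0,
        "transformer.wpe": 0,
        "transformer.ln_f": 1,
        "lm_head": 1,
    }
    # First half of the transformer blocks on GPU 0, second half on GPU 1.
    for i in range(num_layers):
        device_map[f"transformer.h.{i}"] = 0 if i <= 5 else 1
    return device_map


device_map = gpt2_device_map()
# Phase 2 of the test: offload the first four blocks to CPU.
for i in range(4):
    device_map[f"transformer.h.{i}"] = "cpu"
# Phase 3 of the test: move the first two of those to disk.
for i in range(2):
    device_map[f"transformer.h.{i}"] = "disk"

placement = [device_map[f"transformer.h.{i}"] for i in range(12)]
```

By the last phase, blocks 0 and 1 are on disk, blocks 2 and 3 on CPU, blocks 4 and 5 on GPU 0, and the rest on GPU 1, which is why only the weights of `transformer.h.0` and `transformer.h.1` are written to the offload folder.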
--- a/tests/test_examples.py
+++ b/tests/test_examples.py
@ -0,0 +1,200 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import ast
+import os
+import re
+import shutil
+import subprocess
+import tempfile
+import unittest
+from unittest import mock
+
+from accelerate.test_utils.examples import compare_against_test
+from accelerate.test_utils.testing import TempDirTestCase, slow
+from accelerate.utils import write_basic_config
+
+
+# DataLoaders built from `test_samples/MRPC` for quick testing
+# Should mock `{script_name}.get_dataloaders` via:
+# @mock.patch("{script_name}.get_dataloaders", mocked_dataloaders)
+
+EXCLUDE_EXAMPLES = ["cross_validation.py", "multi_process_metrics.py", "memory.py", "fsdp_with_peak_mem_tracking.py"]
+
+
+class ExampleDifferenceTests(unittest.TestCase):
+    """
+    This TestCase checks that all of the `complete_*` scripts contain all of the
+    information found in the `by_feature` scripts, line for line. If one fails,
+    then a complete example does not contain all of the features in the features
+    scripts, and should be updated.
+
+    Each example script should be a single test (such as `test_nlp_example`),
+    and should run `one_complete_example` twice: once with `parser_only=True`,
+    and the other with `parser_only=False`. This is so that when the test
+    failures are returned to the user, they understand if the discrepancy lies in
+    the `main` function, or the `training_function` function. Otherwise it will be
+    unclear.
+
+    Also, if there are any expected differences between the base script used and
+    `complete_nlp_example.py` (the canonical base script), these should be included in
+    `special_strings`. These would be differences in how something is logged, print statements,
+    etc. (such as calls to `Accelerator.log()`)
+    """
+
+    def one_complete_example(
+        self, complete_file_name: str, parser_only: bool, secondary_filename: str = None, special_strings: list = None
+    ):
+        """
+        Tests a single `complete` example against all of the implemented `by_feature` scripts
+
+        Args:
+            complete_file_name (`str`):
+                The filename of a complete example
+            parser_only (`bool`):
+                Whether to compare only the argument parser in `main()` (`True`) or the training function (`False`)
+            secondary_filename (`str`, *optional*):
+                A potential secondary base file to strip all script information not relevant for checking,
+                such as "cv_example.py" when testing "complete_cv_example.py"
+            special_strings (`list`, *optional*):
+                A list of strings to potentially remove before checking no differences are left. These should be
+                diffs that are file specific, such as different logging variations between files.
+        """
+        self.maxDiff = None
+        by_feature_path = os.path.abspath(os.path.join("examples", "by_feature"))
+        examples_path = os.path.abspath("examples")
+        for item in os.listdir(by_feature_path):
+            if item not in EXCLUDE_EXAMPLES:
+                item_path = os.path.join(by_feature_path, item)
+                if os.path.isfile(item_path) and item_path.endswith(".py"):
+                    with self.subTest(
+                        tested_script=complete_file_name,
+                        feature_script=item,
+                        tested_section="main()" if parser_only else "training_function()",
+                    ):
+                        diff = compare_against_test(
+                            os.path.join(examples_path, complete_file_name), item_path, parser_only, secondary_filename
+                        )
+                        diff = "\n".join(diff)
+                        if special_strings is not None:
+                            for string in special_strings:
+                                diff = diff.replace(string, "")
+                        self.assertEqual(diff, "")
+
+    def test_nlp_examples(self):
+        self.one_complete_example("complete_nlp_example.py", True)
+        self.one_complete_example("complete_nlp_example.py", False)
+
+    def test_cv_examples(self):
+        cv_path = os.path.abspath(os.path.join("examples", "cv_example.py"))
+        special_strings = [
+            " " * 16 + "{\n\n",
+            " " * 18 + '"accuracy": eval_metric["accuracy"],\n\n',
+            " " * 18 + '"f1": eval_metric["f1"],\n\n',
+            " " * 18 + '"train_loss": total_loss.item(),\n\n',
+            " " * 18 + '"epoch": epoch,\n\n',
+            " " * 16 + "},\n\n",
+            " " * 16 + "step=epoch,\n",
+            " " * 8,
+        ]
+        self.one_complete_example("complete_cv_example.py", True, cv_path, special_strings)
+        self.one_complete_example("complete_cv_example.py", False, cv_path, special_strings)
+
+
+@mock.patch.dict(os.environ, {"TESTING_MOCKED_DATALOADERS": "1"})
+class FeatureExamplesTests(TempDirTestCase):
+    clear_on_setup = False
+
+    @classmethod
+    def setUpClass(cls):
+        super().setUpClass()
+        cls._tmpdir = tempfile.mkdtemp()
+        cls.configPath = os.path.join(cls._tmpdir, "default_config.yml")
+
+        write_basic_config(save_location=cls.configPath)
+        cls._launch_args = ["accelerate", "launch", "--config_file", cls.configPath]
+
+    @classmethod
+    def tearDownClass(cls):
+        super().tearDownClass()
+        shutil.rmtree(cls._tmpdir)
+
+    def test_checkpointing_by_epoch(self):
+        testargs = f"""
+        examples/by_feature/checkpointing.py
+        --checkpointing_steps epoch
+        --output_dir {self.tmpdir}
+        """.split()
+        _ = subprocess.run(self._launch_args + testargs, stdout=subprocess.PIPE)
+        self.assertTrue(os.path.exists(os.path.join(self.tmpdir, "epoch_1")))
+
+    def test_checkpointing_by_steps(self):
+        testargs = f"""
+        examples/by_feature/checkpointing.py
+        --checkpointing_steps 2
+        --output_dir {self.tmpdir}
+        """.split()
+        _ = subprocess.run(self._launch_args + testargs, stdout=subprocess.PIPE, env=os.environ)
+        self.assertTrue(os.path.exists(os.path.join(self.tmpdir, "step_4")))
+
+    def test_load_states_by_epoch(self):
+        testargs = f"""
+        examples/by_feature/checkpointing.py
+        --resume_from_checkpoint {os.path.join(self.tmpdir, "epoch_1")}
+        """.split()
+        output = subprocess.run(
+            self._launch_args + testargs, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
+        ).stdout
+        self.assertNotIn("epoch 0:", output)
+        self.assertNotIn("epoch 1:", output)
+        self.assertIn("epoch 2:", output)
+
+    def test_load_states_by_steps(self):
+        testargs = f"""
+        examples/by_feature/checkpointing.py
+        --resume_from_checkpoint {os.path.join(self.tmpdir, "step_4")}
+        """.split()
+        output = subprocess.run(
+            self._launch_args + testargs, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
+        ).stdout
+        self.assertNotIn("epoch 0:", output)
+        self.assertIn("epoch 1:", output)
+        self.assertIn("epoch 2:", output)
+
+    @slow
+    def test_cross_validation(self):
+        testargs = """
+        examples/by_feature/cross_validation.py
+        --num_folds 2
+        """.split()
+        with mock.patch.dict(os.environ, {"TESTING_MOCKED_DATALOADERS": "0"}):
+            output = subprocess.run(
+                self._launch_args + testargs, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
+            ).stdout
+            results = ast.literal_eval(re.findall("({.+})", output)[-1])
+            self.assertGreaterEqual(results["accuracy"], 0.75)
+
+    def test_multi_process_metrics(self):
+        testargs = ["examples/by_feature/multi_process_metrics.py"]
+        _ = subprocess.run(self._launch_args + testargs, stdout=subprocess.PIPE)
+
+    def test_tracking(self):
+        with tempfile.TemporaryDirectory() as tmpdir:
+            testargs = f"""
+            examples/by_feature/tracking.py
+            --with_tracking
+            --logging_dir {tmpdir}
+            """.split()
+            _ = subprocess.run(self._launch_args + testargs, stdout=subprocess.PIPE)
+            self.assertTrue(os.path.exists(os.path.join(tmpdir, "tracking")))
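Review note on `test_cross_validation` above: the test recovers the metrics by taking the last `{...}` group found in the subprocess output and parsing it with `ast.literal_eval`. A minimal sketch of that step follows; the sample `output` string is hypothetical, since the real script's log lines are not shown in this diff.

```python
import ast
import re

# Hypothetical captured stdout; the real script's log lines may differ.
output = "\n".join(
    [
        "epoch 0: {'accuracy': 0.70, 'f1': 0.80}",
        "epoch 1: {'accuracy': 0.78, 'f1': 0.85}",
        "Average test metrics from all folds: {'accuracy': 0.80, 'f1': 0.86}",
    ]
)

# `.` does not match newlines, so each line yields at most one (greedy) match.
matches = re.findall(r"({.+})", output)
# `literal_eval` only accepts Python literals, so it is safe on untrusted text.
results = ast.literal_eval(matches[-1])
```

This approach relies on the final metrics being the last dict printed on a single line; if nothing matches, `matches[-1]` raises `IndexError`, which surfaces as a test error rather than a failed assertion.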
--- a/tests/test_hooks.py
+++ b/tests/test_hooks.py
@ -0,0 +1,330 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import inspect
+import unittest
+
+import torch
+import torch.nn as nn
+
+from accelerate.hooks import (
+    AlignDevicesHook,
+    ModelHook,
+    SequentialHook,
+    add_hook_to_module,
+    attach_align_device_hook,
+    remove_hook_from_module,
+    remove_hook_from_submodules,
+)
+from accelerate.test_utils import require_multi_gpu
+
+
+class ModelForTest(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.linear1 = nn.Linear(3, 4)
+        self.batchnorm = nn.BatchNorm1d(4)
+        self.linear2 = nn.Linear(4, 5)
+
+    def forward(self, x):
+        return self.linear2(self.batchnorm(self.linear1(x)))
+
+
+class PreForwardHook(ModelHook):
+    def pre_forward(self, module, *args, **kwargs):
+        return (args[0] + 1,) + args[1:], kwargs
+
+
+class PostForwardHook(ModelHook):
+    def post_forward(self, module, output):
+        return output + 1
+
+
+class HooksModelTester(unittest.TestCase):
+    def test_add_and_remove_hooks(self):
+        test_model = ModelForTest()
+        test_hook = ModelHook()
+
+        add_hook_to_module(test_model, test_hook)
+        self.assertEqual(test_model._hf_hook, test_hook)
+        self.assertTrue(hasattr(test_model, "_old_forward"))
+
+        # Check adding the hook did not change the name or the signature
+        self.assertEqual(test_model.forward.__name__, "forward")
+        self.assertListEqual(list(inspect.signature(test_model.forward).parameters), ["x"])
+
+        remove_hook_from_module(test_model)
+        self.assertFalse(hasattr(test_model, "_hf_hook"))
+        self.assertFalse(hasattr(test_model, "_old_forward"))
+
+    def test_pre_forward_hook_is_executed(self):
+        test_model = ModelForTest()
+        x = torch.randn(2, 3)
+        expected = test_model(x + 1)
+        expected2 = test_model(x + 2)
+
+        test_hook = PreForwardHook()
+        add_hook_to_module(test_model, test_hook)
+        output1 = test_model(x)
+        self.assertTrue(torch.allclose(output1, expected))
+
+        # Attaching a hook to a model that already has one replaces the existing hook; it does not chain them
+        test_hook = PreForwardHook()
+        add_hook_to_module(test_model, test_hook)
+        output1 = test_model(x)
+        self.assertTrue(torch.allclose(output1, expected))
+
+        # You need to use the sequential hook to chain two or more hooks
+        test_hook = SequentialHook(PreForwardHook(), PreForwardHook())
+        add_hook_to_module(test_model, test_hook)
+
+        output2 = test_model(x)
+        self.assertTrue(torch.allclose(output2, expected2))
+
+    def test_post_forward_hook_is_executed(self):
+        test_model = ModelForTest()
+        x = torch.randn(2, 3)
+        output = test_model(x)
+
+        test_hook = PostForwardHook()
+        add_hook_to_module(test_model, test_hook)
+        output1 = test_model(x)
+        self.assertTrue(torch.allclose(output1, output + 1))
+
+        # Attaching a hook to a model that already has one replaces the existing hook; it does not chain them
+        test_hook = PostForwardHook()
+        add_hook_to_module(test_model, test_hook)
+        output1 = test_model(x)
+        self.assertTrue(torch.allclose(output1, output + 1))
+
+        # You need to use the sequential hook to chain two or more hooks
+        test_hook = SequentialHook(PostForwardHook(), PostForwardHook())
+        add_hook_to_module(test_model, test_hook)
+
+        output2 = test_model(x)
+        self.assertTrue(torch.allclose(output2, output + 2))
+
+    def test_no_grad_in_hook(self):
+        test_model = ModelForTest()
+        x = torch.randn(2, 3)
+        output = test_model(x)
+
+        test_hook = PostForwardHook()
+        add_hook_to_module(test_model, test_hook)
+        output1 = test_model(x)
+        self.assertTrue(torch.allclose(output1, output + 1))
+        self.assertTrue(output1.requires_grad)
+
+        test_hook.no_grad = True
+        output1 = test_model(x)
+        self.assertFalse(output1.requires_grad)
+
+    @require_multi_gpu
+    def test_align_devices_as_model_parallelism(self):
+        model = ModelForTest()
+        # Everything is on CPU
+        self.assertEqual(model.linear1.weight.device, torch.device("cpu"))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("cpu"))
+        self.assertEqual(model.linear2.weight.device, torch.device("cpu"))
+
+        # This will move each submodule to its own execution device
+        add_hook_to_module(model.linear1, AlignDevicesHook(execution_device=0))
+        add_hook_to_module(model.batchnorm, AlignDevicesHook(execution_device=0))
+        add_hook_to_module(model.linear2, AlignDevicesHook(execution_device=1))
+
+        self.assertEqual(model.linear1.weight.device, torch.device(0))
+        self.assertEqual(model.batchnorm.weight.device, torch.device(0))
+        self.assertEqual(model.batchnorm.running_mean.device, torch.device(0))
+        self.assertEqual(model.linear2.weight.device, torch.device(1))
+
+        # We can still make a forward pass. The input does not need to be on any particular device
+        x = torch.randn(2, 3)
+        output = model(x)
+        self.assertEqual(output.device, torch.device(1))
+
+        # We can add a general hook to put the output back on the same device as the input.
+        add_hook_to_module(model, AlignDevicesHook(io_same_device=True))
+        x = torch.randn(2, 3).to(0)
+        output = model(x)
+        self.assertEqual(output.device, torch.device(0))
+
+    def test_align_devices_as_cpu_offload(self):
+        model = ModelForTest()
+
+        # Everything is on CPU
+        self.assertEqual(model.linear1.weight.device, torch.device("cpu"))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("cpu"))
+        self.assertEqual(model.linear2.weight.device, torch.device("cpu"))
+
+        # This will offload the weights of each submodule and run it on the execution device
+        hook_kwargs = {"execution_device": 0 if torch.cuda.is_available() else "cpu", "offload": True}
+
+        add_hook_to_module(model.linear1, AlignDevicesHook(**hook_kwargs))
+        add_hook_to_module(model.batchnorm, AlignDevicesHook(**hook_kwargs))
+        add_hook_to_module(model.linear2, AlignDevicesHook(**hook_kwargs))
+
+        # Parameters have been offloaded, so on the meta device
+        self.assertEqual(model.linear1.weight.device, torch.device("meta"))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("meta"))
+        self.assertEqual(model.linear2.weight.device, torch.device("meta"))
+        # Buffers are not included in the offload by default, so are on the execution device
+        device = torch.device(hook_kwargs["execution_device"])
+        self.assertEqual(model.batchnorm.running_mean.device, device)
+
+        x = torch.randn(2, 3)
+        output = model(x)
+        self.assertEqual(output.device, device)
+
+        # Removing hooks loads back the weights in the model.
+        remove_hook_from_module(model.linear1)
+        remove_hook_from_module(model.batchnorm)
+        remove_hook_from_module(model.linear2)
+        self.assertEqual(model.linear1.weight.device, torch.device("cpu"))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("cpu"))
+        self.assertEqual(model.linear2.weight.device, torch.device("cpu"))
+
+        # Now test with buffers included in the offload
+        hook_kwargs = {
+            "execution_device": 0 if torch.cuda.is_available() else "cpu",
+            "offload": True,
+            "offload_buffers": True,
+        }
+
+        add_hook_to_module(model.linear1, AlignDevicesHook(**hook_kwargs))
+        add_hook_to_module(model.batchnorm, AlignDevicesHook(**hook_kwargs))
+        add_hook_to_module(model.linear2, AlignDevicesHook(**hook_kwargs))
+
+        # Parameters have been offloaded, so on the meta device, buffers included
+        self.assertEqual(model.linear1.weight.device, torch.device("meta"))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("meta"))
+        self.assertEqual(model.linear2.weight.device, torch.device("meta"))
+        self.assertEqual(model.batchnorm.running_mean.device, torch.device("meta"))
+
+        x = torch.randn(2, 3)
+        output = model(x)
+        self.assertEqual(output.device, device)
+
+        # Removing hooks loads back the weights in the model.
+        remove_hook_from_module(model.linear1)
+        remove_hook_from_module(model.batchnorm)
+        remove_hook_from_module(model.linear2)
+        self.assertEqual(model.linear1.weight.device, torch.device("cpu"))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("cpu"))
+        self.assertEqual(model.linear2.weight.device, torch.device("cpu"))
+
+    def test_attach_align_device_hook_as_cpu_offload(self):
+        model = ModelForTest()
+
+        # Everything is on CPU
+        self.assertEqual(model.linear1.weight.device, torch.device("cpu"))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("cpu"))
+        self.assertEqual(model.linear2.weight.device, torch.device("cpu"))
+
+        # This will offload each submodule's weights and run its forward pass on the execution device
+        execution_device = 0 if torch.cuda.is_available() else "cpu"
+        attach_align_device_hook(model, execution_device=execution_device, offload=True)
+
+        # Parameters have been offloaded, so on the meta device
+        self.assertEqual(model.linear1.weight.device, torch.device("meta"))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("meta"))
+        self.assertEqual(model.linear2.weight.device, torch.device("meta"))
+        # Buffers are not included in the offload by default, so are on the execution device
+        device = torch.device(execution_device)
+        self.assertEqual(model.batchnorm.running_mean.device, device)
+
+        x = torch.randn(2, 3)
+        output = model(x)
+        self.assertEqual(output.device, device)
+
+        # Removing hooks loads back the weights in the model.
+        remove_hook_from_submodules(model)
+        self.assertEqual(model.linear1.weight.device, torch.device("cpu"))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("cpu"))
+        self.assertEqual(model.linear2.weight.device, torch.device("cpu"))
+
+        # Now test with buffers included in the offload
+        attach_align_device_hook(model, execution_device=execution_device, offload=True, offload_buffers=True)
+
+        # Parameters have been offloaded, so on the meta device, buffers included
+        self.assertEqual(model.linear1.weight.device, torch.device("meta"))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("meta"))
+        self.assertEqual(model.linear2.weight.device, torch.device("meta"))
+        self.assertEqual(model.batchnorm.running_mean.device, torch.device("meta"))
+
+        x = torch.randn(2, 3)
+        output = model(x)
+        self.assertEqual(output.device, device)
+
+        # Removing hooks loads back the weights in the model.
+        remove_hook_from_submodules(model)
+        self.assertEqual(model.linear1.weight.device, torch.device("cpu"))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("cpu"))
+        self.assertEqual(model.linear2.weight.device, torch.device("cpu"))
+
+    def test_attach_align_device_hook_as_cpu_offload_with_weight_map(self):
+        model = ModelForTest()
+
+        # Everything is on CPU
+        self.assertEqual(model.linear1.weight.device, torch.device("cpu"))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("cpu"))
+        self.assertEqual(model.linear2.weight.device, torch.device("cpu"))
+
+        # This will offload each submodule's weights and run its forward pass on the execution device
+        execution_device = 0 if torch.cuda.is_available() else "cpu"
+        attach_align_device_hook(
+            model, execution_device=execution_device, offload=True, weights_map=model.state_dict()
+        )
+
+        # Parameters have been offloaded, so on the meta device
+        self.assertEqual(model.linear1.weight.device, torch.device("meta"))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("meta"))
+        self.assertEqual(model.linear2.weight.device, torch.device("meta"))
+        # Buffers are not included in the offload by default, so are on the execution device
+        device = torch.device(execution_device)
+        self.assertEqual(model.batchnorm.running_mean.device, device)
+
+        x = torch.randn(2, 3)
+        output = model(x)
+        self.assertEqual(output.device, device)
+
+        # Removing hooks loads back the weights in the model.
+        remove_hook_from_submodules(model)
+        self.assertEqual(model.linear1.weight.device, torch.device("cpu"))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("cpu"))
+        self.assertEqual(model.linear2.weight.device, torch.device("cpu"))
+
+        # Now test with buffers included in the offload
+        attach_align_device_hook(
+            model,
+            execution_device=execution_device,
+            offload=True,
+            weights_map=model.state_dict(),
+            offload_buffers=True,
+        )
+
+        # Parameters have been offloaded, so on the meta device, buffers included
+        self.assertEqual(model.linear1.weight.device, torch.device("meta"))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("meta"))
+        self.assertEqual(model.linear2.weight.device, torch.device("meta"))
+        self.assertEqual(model.batchnorm.running_mean.device, torch.device("meta"))
+
+        x = torch.randn(2, 3)
+        output = model(x)
+        self.assertEqual(output.device, device)
+
+        # Removing hooks loads back the weights in the model.
+        remove_hook_from_submodules(model)
+        self.assertEqual(model.linear1.weight.device, torch.device("cpu"))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("cpu"))
+        self.assertEqual(model.linear2.weight.device, torch.device("cpu"))
--- a/tests/test_kwargs_handlers.py
+++ b/tests/test_kwargs_handlers.py
@ -21,8 +21,8 @@ from dataclasses import dataclass
 import torch

 from accelerate import Accelerator, DistributedDataParallelKwargs, GradScalerKwargs
-from accelerate.kwargs_handlers import KwargsHandler
 from accelerate.test_utils import execute_subprocess_async, require_cuda, require_multi_gpu
+from accelerate.utils import KwargsHandler


@dataclass
--- a/tests/test_memory_utils.py
+++ b/tests/test_memory_utils.py
@ -0,0 +1,91 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+from accelerate.utils.memory import find_executable_batch_size
+
+
+def raise_fake_out_of_memory():
+    raise RuntimeError("CUDA out of memory.")
+
+
+class MemoryTest(unittest.TestCase):
+    def test_memory_implicit(self):
+        batch_sizes = []
+
+        @find_executable_batch_size(starting_batch_size=128)
+        def mock_training_loop_function(batch_size):
+            nonlocal batch_sizes
+            batch_sizes.append(batch_size)
+            if batch_size != 8:
+                raise_fake_out_of_memory()
+
+        mock_training_loop_function()
+        self.assertListEqual(batch_sizes, [128, 64, 32, 16, 8])
+
+    def test_memory_explicit(self):
+        batch_sizes = []
+
+        @find_executable_batch_size(starting_batch_size=128)
+        def mock_training_loop_function(batch_size, arg1):
+            nonlocal batch_sizes
+            batch_sizes.append(batch_size)
+            if batch_size != 8:
+                raise_fake_out_of_memory()
+            return batch_size, arg1
+
+        bs, arg1 = mock_training_loop_function("hello")
+        self.assertListEqual(batch_sizes, [128, 64, 32, 16, 8])
+        self.assertListEqual([bs, arg1], [8, "hello"])
+
+    def test_start_zero(self):
+        @find_executable_batch_size(starting_batch_size=0)
+        def mock_training_loop_function(batch_size):
+            pass
+
+        with self.assertRaises(RuntimeError) as cm:
+            mock_training_loop_function()
+        self.assertIn("No executable batch size found, reached zero.", cm.exception.args[0])
+
+    def test_approach_zero(self):
+        @find_executable_batch_size(starting_batch_size=16)
+        def mock_training_loop_function(batch_size):
+            if batch_size > 0:
+                raise_fake_out_of_memory()
+            pass
+
+        with self.assertRaises(RuntimeError) as cm:
+            mock_training_loop_function()
+        self.assertIn("No executable batch size found, reached zero.", cm.exception.args[0])
+
+    def test_verbose_guard(self):
+        @find_executable_batch_size(starting_batch_size=128)
+        def mock_training_loop_function(batch_size, arg1, arg2):
+            if batch_size != 8:
+                raise_fake_out_of_memory()
+
+        with self.assertRaises(TypeError) as cm:
+            mock_training_loop_function(128, "hello", "world")
+            self.assertIn("Batch size was passed into `f`", cm.exception.args[0])
+            self.assertIn("`f(arg1='hello', arg2='world')", cm.exception.args[0])
+
+    def test_any_other_error(self):
+        @find_executable_batch_size(starting_batch_size=16)
+        def mock_training_loop_function(batch_size):
+            raise ValueError("Oops, we had an error!")
+
+        with self.assertRaises(ValueError) as cm:
+            mock_training_loop_function()
+        self.assertIn("Oops, we had an error!", cm.exception.args[0])
--- a/tests/test_modeling_utils.py
+++ b/tests/test_modeling_utils.py
@ -0,0 +1,360 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import json
+import os
+import tempfile
+import unittest
+
+import torch
+import torch.nn as nn
+
+from accelerate.test_utils import require_cuda, require_multi_gpu
+from accelerate.utils.modeling import (
+    check_device_map,
+    clean_device_map,
+    compute_module_sizes,
+    find_tied_parameters,
+    infer_auto_device_map,
+    load_checkpoint_in_model,
+    named_module_tensors,
+    set_module_tensor_to_device,
+)
+
+
+class ModelForTest(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.linear1 = nn.Linear(3, 4)
+        self.batchnorm = nn.BatchNorm1d(4)
+        self.linear2 = nn.Linear(4, 5)
+
+    def forward(self, x):
+        return self.linear2(self.batchnorm(self.linear1(x)))
+
+
+class ModelingUtilsTester(unittest.TestCase):
+    def check_set_module_tensor_for_device(self, model, device1, device2):
+        self.assertEqual(model.linear1.weight.device, torch.device(device1))
+
+        with self.subTest("Access by submodule and direct name for a parameter"):
+            set_module_tensor_to_device(model.linear1, "weight", device2)
+            self.assertEqual(model.linear1.weight.device, torch.device(device2))
+
+            if torch.device(device2) == torch.device("meta"):
+                with self.assertRaises(ValueError):
+                    # We need a `value` to set the weight back on device1
+                    set_module_tensor_to_device(model.linear1, "weight", device1)
+
+                set_module_tensor_to_device(model.linear1, "weight", device1, value=torch.randn(4, 3))
+            else:
+                set_module_tensor_to_device(model.linear1, "weight", device1)
+            self.assertEqual(model.linear1.weight.device, torch.device(device1))
+
+        with self.subTest("Access by module and full name for a parameter"):
+            set_module_tensor_to_device(model, "linear1.weight", device2)
+            self.assertEqual(model.linear1.weight.device, torch.device(device2))
+
+            if torch.device(device2) == torch.device("meta"):
+                with self.assertRaises(ValueError):
+                    # We need a `value` to set the weight back on device1
+                    set_module_tensor_to_device(model, "linear1.weight", device1)
+                set_module_tensor_to_device(model, "linear1.weight", device1, value=torch.randn(4, 3))
+            else:
+                set_module_tensor_to_device(model, "linear1.weight", device1)
+            self.assertEqual(model.linear1.weight.device, torch.device(device1))
+
+        self.assertEqual(model.batchnorm.running_mean.device, torch.device(device1))
+
+        with self.subTest("Access by submodule and direct name for a buffer"):
+            set_module_tensor_to_device(model.batchnorm, "running_mean", device2)
+            self.assertEqual(model.batchnorm.running_mean.device, torch.device(device2))
+
+            if torch.device(device2) == torch.device("meta"):
+                with self.assertRaises(ValueError):
+                    # We need a `value` to set the buffer back on device1
+                    set_module_tensor_to_device(model.batchnorm, "running_mean", device1)
+                set_module_tensor_to_device(model.batchnorm, "running_mean", device1, value=torch.randn(4))
+            else:
+                set_module_tensor_to_device(model.batchnorm, "running_mean", device1)
+            self.assertEqual(model.batchnorm.running_mean.device, torch.device(device1))
+
+        with self.subTest("Access by module and full name for a buffer"):
+            set_module_tensor_to_device(model, "batchnorm.running_mean", device2)
+            self.assertEqual(model.batchnorm.running_mean.device, torch.device(device2))
+
+            if torch.device(device2) == torch.device("meta"):
+                with self.assertRaises(ValueError):
+                    # We need a `value` to set the buffer back on device1
+                    set_module_tensor_to_device(model, "batchnorm.running_mean", device1)
+
+                set_module_tensor_to_device(model, "batchnorm.running_mean", device1, value=torch.randn(4))
+            else:
+                set_module_tensor_to_device(model, "batchnorm.running_mean", device1)
+            self.assertEqual(model.batchnorm.running_mean.device, torch.device(device1))
+
+    def test_set_module_tensor_to_meta_and_cpu(self):
+        model = ModelForTest()
+        self.check_set_module_tensor_for_device(model, "cpu", "meta")
+
+    @require_cuda
+    def test_set_module_tensor_to_cpu_and_gpu(self):
+        model = ModelForTest()
+        self.check_set_module_tensor_for_device(model, "cpu", 0)
+
+    @require_cuda
+    def test_set_module_tensor_to_meta_and_gpu(self):
+        model = ModelForTest().to(0)
+        self.check_set_module_tensor_for_device(model, 0, "meta")
+
+    @require_multi_gpu
+    def test_set_module_tensor_between_gpus(self):
+        model = ModelForTest().to(0)
+        self.check_set_module_tensor_for_device(model, 0, 1)
+
+    def test_named_tensors(self):
+        model = nn.BatchNorm1d(4)
+        named_tensors = named_module_tensors(model)
+        self.assertListEqual(
+            [name for name, _ in named_tensors],
+            ["weight", "bias", "running_mean", "running_var", "num_batches_tracked"],
+        )
+
+        named_tensors = named_module_tensors(model, include_buffers=False)
+        self.assertListEqual([name for name, _ in named_tensors], ["weight", "bias"])
+
+        model = ModelForTest()
+        named_tensors = named_module_tensors(model)
+        self.assertListEqual([name for name, _ in named_tensors], [])
+
+        named_tensors = named_module_tensors(model, recurse=True)
+        self.assertListEqual(
+            [name for name, _ in named_tensors],
+            [
+                "linear1.weight",
+                "linear1.bias",
+                "batchnorm.weight",
+                "batchnorm.bias",
+                "linear2.weight",
+                "linear2.bias",
+                "batchnorm.running_mean",
+                "batchnorm.running_var",
+                "batchnorm.num_batches_tracked",
+            ],
+        )
+
+        named_tensors = named_module_tensors(model, include_buffers=False, recurse=True)
+        self.assertListEqual(
+            [name for name, _ in named_tensors],
+            ["linear1.weight", "linear1.bias", "batchnorm.weight", "batchnorm.bias", "linear2.weight", "linear2.bias"],
+        )
+
+    def test_find_tied_parameters(self):
+        model = ModelForTest()
+        self.assertDictEqual(find_tied_parameters(model), {})
+        model.linear2.weight = model.linear1.weight
+        self.assertDictEqual(find_tied_parameters(model), {"linear1.weight": "linear2.weight"})
+
+    def test_compute_module_sizes(self):
+        model = ModelForTest()
+        expected_sizes = {"": 236, "linear1": 64, "linear1.weight": 48, "linear1.bias": 16}
+        expected_sizes.update({"linear2": 100, "linear2.weight": 80, "linear2.bias": 20})
+        expected_sizes.update({"batchnorm": 72, "batchnorm.weight": 16, "batchnorm.bias": 16})
+        expected_sizes.update(
+            {"batchnorm.running_mean": 16, "batchnorm.running_var": 16, "batchnorm.num_batches_tracked": 8}
+        )
+
+        module_sizes = compute_module_sizes(model)
+        self.assertDictEqual(module_sizes, expected_sizes)
+
+        model.half()
+        expected_sizes = {k: s // 2 for k, s in expected_sizes.items()}
+        # This one is not converted to half.
+        expected_sizes["batchnorm.num_batches_tracked"] = 8
+        # This impacts batchnorm and total
+        expected_sizes["batchnorm"] += 4
+        expected_sizes[""] += 4
+
+        module_sizes = compute_module_sizes(model)
+        self.assertDictEqual(module_sizes, expected_sizes)
+
+    def test_check_device_map(self):
+        model = ModelForTest()
+        check_device_map(model, {"": 0})
+        with self.assertRaises(ValueError):
+            check_device_map(model, {"linear1": 0, "linear2": 1})
+
+        check_device_map(model, {"linear1": 0, "linear2": 1, "batchnorm": 1})
+
+    def shard_test_model(self, model, tmp_dir):
+        module_index = {
+            "linear1": "checkpoint_part1.bin",
+            "batchnorm": "checkpoint_part2.bin",
+            "linear2": "checkpoint_part3.bin",
+        }
+        index = {}
+        for name, _ in model.state_dict().items():
+            module = name.split(".")[0]
+            index[name] = module_index[module]
+
+        with open(os.path.join(tmp_dir, "weight_map.index.json"), "w") as f:
+            json.dump(index, f)
+
+        for module, fname in module_index.items():
+            state_dict = {k: v for k, v in model.state_dict().items() if k.startswith(module)}
+            full_fname = os.path.join(tmp_dir, fname)
+            torch.save(state_dict, full_fname)
+
+    def test_load_checkpoint_in_model(self):
+        # Check with whole checkpoint
+        model = ModelForTest()
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            fname = os.path.join(tmp_dir, "pt_model.bin")
+            torch.save(model.state_dict(), fname)
+            load_checkpoint_in_model(model, fname)
+
+        # Check with sharded index
+        model = ModelForTest()
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            self.shard_test_model(model, tmp_dir)
+            index_file = os.path.join(tmp_dir, "weight_map.index.json")
+            load_checkpoint_in_model(model, index_file)
+
+        # Check with sharded checkpoint folder
+        model = ModelForTest()
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            self.shard_test_model(model, tmp_dir)
+            load_checkpoint_in_model(model, tmp_dir)
+
+    @require_cuda
+    def test_load_checkpoint_in_model_one_gpu(self):
+        device_map = {"linear1": 0, "batchnorm": "cpu", "linear2": "cpu"}
+
+        # Check with whole checkpoint
+        model = ModelForTest()
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            fname = os.path.join(tmp_dir, "pt_model.bin")
+            torch.save(model.state_dict(), fname)
+            load_checkpoint_in_model(model, fname, device_map=device_map)
+        self.assertEqual(model.linear1.weight.device, torch.device(0))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("cpu"))
+        self.assertEqual(model.linear2.weight.device, torch.device("cpu"))
+
+        # Check with sharded index
+        model = ModelForTest()
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            self.shard_test_model(model, tmp_dir)
+            index_file = os.path.join(tmp_dir, "weight_map.index.json")
+            load_checkpoint_in_model(model, index_file, device_map=device_map)
+
+        self.assertEqual(model.linear1.weight.device, torch.device(0))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("cpu"))
+        self.assertEqual(model.linear2.weight.device, torch.device("cpu"))
+
+        # Check with sharded checkpoint folder
+        model = ModelForTest()
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            self.shard_test_model(model, tmp_dir)
+            load_checkpoint_in_model(model, tmp_dir, device_map=device_map)
+
+        self.assertEqual(model.linear1.weight.device, torch.device(0))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("cpu"))
+        self.assertEqual(model.linear2.weight.device, torch.device("cpu"))
+
+    @require_multi_gpu
+    def test_load_checkpoint_in_model_two_gpu(self):
+        device_map = {"linear1": 0, "batchnorm": "cpu", "linear2": 1}
+
+        # Check with whole checkpoint
+        model = ModelForTest()
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            fname = os.path.join(tmp_dir, "pt_model.bin")
+            torch.save(model.state_dict(), fname)
+            load_checkpoint_in_model(model, fname, device_map=device_map)
+        self.assertEqual(model.linear1.weight.device, torch.device(0))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("cpu"))
+        self.assertEqual(model.linear2.weight.device, torch.device(1))
+
+        # Check with sharded index
+        model = ModelForTest()
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            self.shard_test_model(model, tmp_dir)
+            index_file = os.path.join(tmp_dir, "weight_map.index.json")
+            load_checkpoint_in_model(model, index_file, device_map=device_map)
+
+        self.assertEqual(model.linear1.weight.device, torch.device(0))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("cpu"))
+        self.assertEqual(model.linear2.weight.device, torch.device(1))
+
+        # Check with sharded checkpoint folder
+        model = ModelForTest()
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            self.shard_test_model(model, tmp_dir)
+            load_checkpoint_in_model(model, tmp_dir, device_map=device_map)
+
+        self.assertEqual(model.linear1.weight.device, torch.device(0))
+        self.assertEqual(model.batchnorm.weight.device, torch.device("cpu"))
+        self.assertEqual(model.linear2.weight.device, torch.device(1))
+
+    def test_clean_device_map(self):
+        # Regroup everything if all is on the same device
+        self.assertDictEqual(clean_device_map({"a": 0, "b": 0, "c": 0}), {"": 0})
+        # Regroups children of level 1 on the same device
+        self.assertDictEqual(
+            clean_device_map({"a.x": 0, "a.y": 0, "b.x": 1, "b.y": 1, "c": 1}), {"a": 0, "b": 1, "c": 1}
+        )
+        # Regroups children of level 2 on the same device
+        self.assertDictEqual(
+            clean_device_map({"a.x": 0, "a.y": 0, "b.x.0": 1, "b.x.1": 1, "b.y.0": 2, "b.y.1": 2, "c": 2}),
+            {"a": 0, "b.x": 1, "b.y": 2, "c": 2},
+        )
+
+    def test_infer_auto_device_map(self):
+        model = ModelForTest()
+        # model has size 236: linear1 64, batchnorm 72, linear2 100
+
+        device_map = infer_auto_device_map(model, max_memory={0: 200, 1: 200})
+        # only linear1 fits on device 0 as we keep memory available for the maximum layer in case of offload
+        self.assertDictEqual(device_map, {"linear1": 0, "batchnorm": 1, "linear2": 1})
+
+        device_map = infer_auto_device_map(model, max_memory={0: 200, 1: 172, 2: 200})
+        # On device 1, we don't care about keeping size available for the max layer, so even if there is just the
+        # size available for batchnorm + linear2, they fit here.
+        self.assertDictEqual(device_map, {"linear1": 0, "batchnorm": 1, "linear2": 1})
+
+        model.linear1.weight = model.linear2.weight
+        device_map = infer_auto_device_map(model, max_memory={0: 200, 1: 200})
+        # By tying weights, the whole model fits on device 0
+        self.assertDictEqual(device_map, {"": 0})
+
+        # When splitting a bigger model, the split is done at the layer level
+        model = nn.Sequential(ModelForTest(), ModelForTest(), ModelForTest())
+        device_map = infer_auto_device_map(model, max_memory={0: 500, 1: 500})
+        self.assertDictEqual(device_map, {"0": 0, "1.linear1": 0, "1.batchnorm": 0, "1.linear2": 1, "2": 1})
+
+        # With no_split_module_classes, it's done at that module level
+        model = nn.Sequential(ModelForTest(), ModelForTest(), ModelForTest())
+        device_map = infer_auto_device_map(
+            model, max_memory={0: 500, 1: 500}, no_split_module_classes=["ModelForTest"]
+        )
+        self.assertDictEqual(device_map, {"0": 0, "1": 1, "2": 1})
+
+        # Now if we have weights tied inside submodules, tied weights are on the same device.
+        model = nn.Sequential(ModelForTest(), ModelForTest(), ModelForTest())
+        layer0 = getattr(model, "0")
+        layer2 = getattr(model, "2")
+        layer0.linear2.weight = layer2.linear2.weight
+        device_map = infer_auto_device_map(model, max_memory={0: 400, 1: 500})
+        expected = {"0": 0, "2.linear2": 0, "1": 1, "2.linear1": 1, "2.batchnorm": 1}
+        self.assertDictEqual(device_map, expected)
--- a/tests/test_offload.py
+++ b/tests/test_offload.py
@ -0,0 +1,87 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import unittest
+from tempfile import TemporaryDirectory
+
+import torch
+import torch.nn as nn
+
+from accelerate.utils import OffloadedWeightsLoader, offload_state_dict
+
+
+class ModelForTest(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.linear1 = nn.Linear(3, 4)
+        self.batchnorm = nn.BatchNorm1d(4)
+        self.linear2 = nn.Linear(4, 5)
+
+    def forward(self, x):
+        return self.linear2(self.batchnorm(self.linear1(x)))
+
+
+class OffloadTester(unittest.TestCase):
+    def test_offload_state_dict(self):
+        from tempfile import TemporaryDirectory
+
+        model = ModelForTest()
+        with TemporaryDirectory() as tmp_dir:
+            offload_state_dict(tmp_dir, model.state_dict())
+            index_file = os.path.join(tmp_dir, "index.json")
+            self.assertTrue(os.path.isfile(index_file))
+            # TODO: add tests on what is inside the index
+
+            for key in ["linear1.weight", "linear1.bias", "linear2.weight", "linear2.bias"]:
+                weight_file = os.path.join(tmp_dir, f"{key}.dat")
+                self.assertTrue(os.path.isfile(weight_file))
+                # TODO: add tests on the fact weights are properly loaded
+
+    def test_offload_weights_loader(self):
+        model = ModelForTest()
+        state_dict = model.state_dict()
+        cpu_part = {k: v for k, v in state_dict.items() if "linear2" not in k}
+        disk_part = {k: v for k, v in state_dict.items() if "linear2" in k}
+
+        with TemporaryDirectory() as tmp_dir:
+            offload_state_dict(tmp_dir, disk_part)
+            weight_map = OffloadedWeightsLoader(state_dict=cpu_part, save_folder=tmp_dir)
+
+            # Every key is there with the right value
+            self.assertEqual(sorted(weight_map), sorted(state_dict.keys()))
+            for key, param in state_dict.items():
+                self.assertTrue(torch.allclose(param, weight_map[key]))
+
+        cpu_part = {k: v for k, v in state_dict.items() if "weight" in k}
+        disk_part = {k: v for k, v in state_dict.items() if "weight" not in k}
+
+        with TemporaryDirectory() as tmp_dir:
+            offload_state_dict(tmp_dir, disk_part)
+            weight_map = OffloadedWeightsLoader(state_dict=cpu_part, save_folder=tmp_dir)
+
+            # Every key is there with the right value
+            self.assertEqual(sorted(weight_map), sorted(state_dict.keys()))
+            for key, param in state_dict.items():
+                self.assertTrue(torch.allclose(param, weight_map[key]))
+
+        with TemporaryDirectory() as tmp_dir:
+            offload_state_dict(tmp_dir, state_dict)
+            # Duplicates are removed
+            weight_map = OffloadedWeightsLoader(state_dict=cpu_part, save_folder=tmp_dir)
+
+            # Every key is there with the right value
+            self.assertEqual(sorted(weight_map), sorted(state_dict.keys()))
+            for key, param in state_dict.items():
+                self.assertTrue(torch.allclose(param, weight_map[key]))
--- a/tests/test_sagemaker.py
+++ b/tests/test_sagemaker.py
@ -4,7 +4,7 @@ from dataclasses import dataclass
 import pytest
 from accelerate.commands.config.config_args import SageMakerConfig
 from accelerate.commands.launch import _convert_nargs_to_dict
-from accelerate.state import ComputeEnvironment
+from accelerate.utils import ComputeEnvironment


@dataclass
--- a/tests/test_samples/MRPC/dev.csv
+++ b/tests/test_samples/MRPC/dev.csv
@ -0,0 +1,7 @@
+label,sentence1,sentence2
+equivalent,He said the foodservice pie business doesn 't fit the company 's long-term growth strategy .,""" The foodservice pie business does not fit our long-term growth strategy ."
+not_equivalent,Magnarelli said Racicot hated the Iraqi regime and looked forward to using his long years of training in the war .,"His wife said he was "" 100 percent behind George Bush "" and looked forward to using his years of training in the war ."
+not_equivalent,"The dollar was at 116.92 yen against the yen , flat on the session , and at 1.2891 against the Swiss franc , also flat .","The dollar was at 116.78 yen JPY = , virtually flat on the session , and at 1.2871 against the Swiss franc CHF = , down 0.1 percent ."
+equivalent,The AFL-CIO is waiting until October to decide if it will endorse a candidate .,The AFL-CIO announced Wednesday that it will decide in October whether to endorse a candidate before the primaries .
+not_equivalent,No dates have been set for the civil or the criminal trial .,"No dates have been set for the criminal or civil cases , but Shanley has pleaded not guilty ."
+equivalent,Wal-Mart said it would check all of its million-plus domestic workers to ensure they were legally employed .,It has also said it would review all of its domestic employees more than 1 million to ensure they have legal status .
--- a/tests/test_samples/MRPC/train.csv
+++ b/tests/test_samples/MRPC/train.csv
@ -0,0 +1,7 @@
+label,sentence1,sentence2
+equivalent,He said the foodservice pie business doesn 't fit the company 's long-term growth strategy .,""" The foodservice pie business does not fit our long-term growth strategy ."
+not_equivalent,Magnarelli said Racicot hated the Iraqi regime and looked forward to using his long years of training in the war .,"His wife said he was "" 100 percent behind George Bush "" and looked forward to using his years of training in the war ."
+not_equivalent,"The dollar was at 116.92 yen against the yen , flat on the session , and at 1.2891 against the Swiss franc , also flat .","The dollar was at 116.78 yen JPY = , virtually flat on the session , and at 1.2871 against the Swiss franc CHF = , down 0.1 percent ."
+equivalent,The AFL-CIO is waiting until October to decide if it will endorse a candidate .,The AFL-CIO announced Wednesday that it will decide in October whether to endorse a candidate before the primaries .
+not_equivalent,No dates have been set for the civil or the criminal trial .,"No dates have been set for the criminal or civil cases , but Shanley has pleaded not guilty ."
+equivalent,Wal-Mart said it would check all of its million-plus domestic workers to ensure they were legally employed .,It has also said it would review all of its domestic employees more than 1 million to ensure they have legal status .
--- a/tests/test_scheduler.py
+++ b/tests/test_scheduler.py
@ -0,0 +1,62 @@
+# Copyright 2021 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+from functools import partial
+
+import torch
+
+from accelerate import Accelerator, debug_launcher
+
+
+def scheduler_test(num_processes=2, step_scheduler_with_optimizer=True, split_batches=False):
+    accelerator = Accelerator(step_scheduler_with_optimizer=step_scheduler_with_optimizer, split_batches=split_batches)
+    model = torch.nn.Linear(2, 4)
+    optimizer = torch.optim.AdamW(model.parameters(), lr=1.0)
+    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda n: 1 - n / 10)
+
+    model, optimizer, scheduler = accelerator.prepare(model, optimizer, scheduler)
+
+    # Optimizer has stepped
+    optimizer._is_overflow = False
+    scheduler.step()
+    expected_lr = 1 - (num_processes if (step_scheduler_with_optimizer and not split_batches) else 1) / 10
+    assert (
+        scheduler.get_last_lr()[0] == expected_lr
+    ), f"Wrong lr found at first step, expected {expected_lr}, got {scheduler.get_last_lr()[0]}"
+
+    # Optimizer has not stepped
+    optimizer._is_overflow = True
+    scheduler.step()
+    if not step_scheduler_with_optimizer:
+        expected_lr = 1 - 2 / 10
+    assert (
+        scheduler.get_last_lr()[0] == expected_lr
+    ), f"Wrong lr found at second step, expected {expected_lr}, got {scheduler.get_last_lr()[0]}"
+
+
+class SchedulerTester(unittest.TestCase):
+    def test_scheduler_steps_with_optimizer_single_process(self):
+        debug_launcher(partial(scheduler_test, num_processes=1), num_processes=1)
+        debug_launcher(partial(scheduler_test, num_processes=1, split_batches=True), num_processes=1)
+
+    def test_scheduler_not_step_with_optimizer_single_process(self):
+        debug_launcher(partial(scheduler_test, num_processes=1, step_scheduler_with_optimizer=False), num_processes=1)
+
+    def test_scheduler_steps_with_optimizer_multiprocess(self):
+        debug_launcher(scheduler_test)
+        debug_launcher(partial(scheduler_test, num_processes=1, split_batches=True), num_processes=1)
+
+    def test_scheduler_not_step_with_optimizer_multiprocess(self):
+        debug_launcher(partial(scheduler_test, step_scheduler_with_optimizer=False))
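
The learning-rate values asserted in `scheduler_test` above follow from simple arithmetic on the `lr_lambda` used in the test (`1 - n / 10` with a base learning rate of 1.0). The sketch below reproduces only that arithmetic in plain Python; `expected_lrs` is a hypothetical helper written for illustration, not part of Accelerate, and it assumes the prepared scheduler advances `num_processes` times per call when it steps with the optimizer and batches are not split, and is skipped when the optimizer step was skipped.

```python
def expected_lrs(num_processes=2, step_scheduler_with_optimizer=True, split_batches=False):
    """Return the (first, second) learning rates the test above expects.

    Hypothetical helper: mirrors the assertions in `scheduler_test`, it does
    not call Accelerate or PyTorch.
    """
    # Same schedule as the test: lr multiplier after n scheduler steps.
    def lr_lambda(n):
        return 1 - n / 10

    # First call: the optimizer has stepped. The scheduler is assumed to
    # advance once per process unless batches are split across processes.
    if step_scheduler_with_optimizer and not split_batches:
        steps_per_call = num_processes
    else:
        steps_per_call = 1
    first = lr_lambda(steps_per_call)

    # Second call: the optimizer step was skipped. A scheduler tied to the
    # optimizer keeps its learning rate; an independent one steps again.
    second = first if step_scheduler_with_optimizer else lr_lambda(2)
    return first, second


if __name__ == "__main__":
    print(expected_lrs())                                      # two processes, tied to optimizer
    print(expected_lrs(num_processes=1))                       # single process
    print(expected_lrs(step_scheduler_with_optimizer=False))   # scheduler steps on its own
```

Under these assumptions, the learning rate changes on the second call only in the independent case, which is the one branch where the test overrides `expected_lr`.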
--- a/tests/test_tracking.py
+++ b/tests/test_tracking.py
@ -0,0 +1,321 @@
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import csv
+import json
+import logging
+import os
+import re
+import tempfile
+import unittest
+import zipfile
+from pathlib import Path
+from typing import Optional
+from unittest import mock
+
+# We use TF to parse the logs
+from accelerate import Accelerator
+from accelerate.test_utils.testing import (
+    MockingTestCase,
+    TempDirTestCase,
+    require_comet_ml,
+    require_tensorflow,
+    require_wandb,
+)
+from accelerate.tracking import CometMLTracker, GeneralTracker
+from accelerate.utils import is_comet_ml_available, is_tensorflow_available
+
+
+if is_comet_ml_available():
+    from comet_ml import OfflineExperiment
+
+
+if is_tensorflow_available():
+    import tensorflow as tf
+    from tensorboard.plugins.hparams import plugin_data_pb2
+    from tensorflow.core.util import event_pb2
+    from tensorflow.python.summary.summary_iterator import summary_iterator
+
+
+logger = logging.getLogger(__name__)
+
+
+class TensorBoardTrackingTest(unittest.TestCase):
+    @require_tensorflow
+    def test_init_trackers(self):
+        hps = None
+        project_name = "test_project_with_config"
+        with tempfile.TemporaryDirectory() as dirpath:
+            accelerator = Accelerator(log_with="tensorboard", logging_dir=dirpath)
+            config = {"num_iterations": 12, "learning_rate": 1e-2, "some_boolean": False, "some_string": "some_value"}
+            accelerator.init_trackers(project_name, config)
+            accelerator.end_training()
+            for child in Path(f"{dirpath}/{project_name}").glob("*/**"):
+                log = list(filter(lambda x: x.is_file(), child.iterdir()))[0]
+                # The config log is stored one layer deeper in the logged directory
+                # And names are randomly generated each time
+            si = summary_iterator(str(log))
+            # Pull HPS through careful parsing
+            for event in si:
+                for value in event.summary.value:
+                    proto_bytes = value.metadata.plugin_data.content
+                    plugin_data = plugin_data_pb2.HParamsPluginData.FromString(proto_bytes)
+                    if plugin_data.HasField("session_start_info"):
+                        hps = dict(plugin_data.session_start_info.hparams)
+
+        self.assertTrue(isinstance(hps, dict))
+        keys = list(hps.keys())
+        keys.sort()
+        self.assertEqual(keys, ["learning_rate", "num_iterations", "some_boolean", "some_string"])
+        self.assertEqual(hps["num_iterations"].number_value, 12)
+        self.assertEqual(hps["learning_rate"].number_value, 0.01)
+        self.assertEqual(hps["some_boolean"].bool_value, False)
+        self.assertEqual(hps["some_string"].string_value, "some_value")
+
+    @require_tensorflow
+    def test_log(self):
+        step = None
+        project_name = "test_project_with_log"
+        with tempfile.TemporaryDirectory() as dirpath:
+            accelerator = Accelerator(log_with="tensorboard", logging_dir=dirpath)
+            accelerator.init_trackers(project_name)
+            values = {"total_loss": 0.1, "iteration": 1, "my_text": "some_value"}
+            accelerator.log(values, step=0)
+            accelerator.end_training()
+            # Logged values are stored in the outermost tfevents file and can be read in as a TFRecord
+            # Names are randomly generated each time
+            log = list(filter(lambda x: x.is_file(), Path(f"{dirpath}/{project_name}").iterdir()))[0]
+            serialized_examples = tf.data.TFRecordDataset(log)
+            for e in serialized_examples:
+                event = event_pb2.Event.FromString(e.numpy())
+                if step is None:
+                    step = event.step
+                for value in event.summary.value:
+                    if value.tag == "total_loss":
+                        total_loss = value.simple_value
+                    elif value.tag == "iteration":
+                        iteration = value.simple_value
+                    elif value.tag == "my_text/text_summary":  # Append /text_summary to the key
+                        my_text = value.tensor.string_val[0].decode()
+        self.assertAlmostEqual(total_loss, values["total_loss"])
+        self.assertEqual(iteration, values["iteration"])
+        self.assertEqual(my_text, values["my_text"])
+
+    def test_logging_dir(self):
+        with self.assertRaisesRegex(ValueError, "Logging with `tensorboard` requires a `logging_dir`"):
+            _ = Accelerator(log_with="tensorboard")
+        with tempfile.TemporaryDirectory() as dirpath:
+            _ = Accelerator(log_with="tensorboard", logging_dir=dirpath)
+
+
+@require_wandb
+@mock.patch.dict(os.environ, {"WANDB_MODE": "offline"})
+class WandBTrackingTest(TempDirTestCase, MockingTestCase):
+    def setUp(self):
+        super().setUp()
+        # wandb lets us override where logs are stored via the WANDB_DIR env var
+        self.add_mocks(mock.patch.dict(os.environ, {"WANDB_DIR": self.tmpdir}))
+
+    @staticmethod
+    def get_value_from_log(key: str, log: str, key_occurance: int = 0):
+        """
+        Parses wandb log for `key` and returns the value.
+        If parsing through multiple calls to .log, pass in a `key_occurance`
+        """
+        res = re.findall(rf"(?<={key} )[^\s]+", log)[key_occurance]
+        if '"' in res:
+            return re.findall(r'"([^"]*)"', res)[0]
+        else:
+            return res
+
+    def test_init_trackers(self):
+        project_name = "test_project_with_config"
+        accelerator = Accelerator(log_with="wandb")
+        config = {"num_iterations": 12, "learning_rate": 1e-2, "some_boolean": False, "some_string": "some_value"}
+        accelerator.init_trackers(project_name, config)
+        accelerator.end_training()
+        # The latest offline log is stored at wandb/latest-run/*.wandb
+        for child in Path(f"{self.tmpdir}/wandb/latest-run").glob("*"):
+            logger.info(child)
+            if child.is_file() and child.suffix == ".wandb":
+                with open(child, "rb") as f:
+                    content = f.read()
+                break
+
+        # Check HPS through careful parsing and cleaning
+        cleaned_log = re.sub(r"[\x00-\x1f]+", " ", content.decode("utf8", "ignore"))
+        self.assertEqual(self.get_value_from_log("num_iterations", cleaned_log), "12")
+        self.assertEqual(self.get_value_from_log("learning_rate", cleaned_log), "0.01")
+        self.assertEqual(self.get_value_from_log("some_boolean", cleaned_log), "false")
+        self.assertEqual(self.get_value_from_log("some_string", cleaned_log), "some_value")
+
+    def test_log(self):
+        project_name = "test_project_with_log"
+        accelerator = Accelerator(log_with="wandb")
+        accelerator.init_trackers(project_name)
+        values = {"total_loss": 0.1, "iteration": 1, "my_text": "some_value"}
+        accelerator.log(values, step=0)
+        accelerator.end_training()
+        # The latest offline log is stored at wandb/latest-run/*.wandb
+        for child in Path(f"{self.tmpdir}/wandb/latest-run").glob("*"):
+            if child.is_file() and child.suffix == ".wandb":
+                with open(child, "rb") as f:
+                    content = f.read()
+                break
+        # Check logged values through careful parsing and cleaning
+        cleaned_log = re.sub(r"[\x00-\x1f]+", " ", content.decode("utf8", "ignore"))
+        self.assertTrue("0.1" in self.get_value_from_log("total_loss", cleaned_log))
+        self.assertTrue("1" in self.get_value_from_log("iteration", cleaned_log))
+        self.assertTrue("some_value" in self.get_value_from_log("my_text", cleaned_log))
+        self.assertTrue("0" in self.get_value_from_log("_step", cleaned_log))
+
+
+# Comet has a special `OfflineExperiment` we need to use for testing
+def offline_init(self, run_name: str, tmpdir: str):
+    self.run_name = run_name
+    self.writer = OfflineExperiment(project_name=run_name, offline_directory=tmpdir)
+    logger.info(f"Initialized offline CometML project {self.run_name}")
+    logger.info("Make sure to log any initial configurations with `self.store_init_configuration` before training!")
+
+
+@require_comet_ml
+@mock.patch.object(CometMLTracker, "__init__", offline_init)
+class CometMLTest(unittest.TestCase):
+    @staticmethod
+    def get_value_from_key(log_list, key: str, is_param: bool = False):
+        "Extracts `key` from Comet `log`"
+        for log in log_list:
+            j = json.loads(log)["payload"]
+            if is_param and "param" in j.keys():
+                if j["param"]["paramName"] == key:
+                    return j["param"]["paramValue"]
+            if "log_other" in j.keys():
+                if j["log_other"]["key"] == key:
+                    return j["log_other"]["val"]
+            if "metric" in j.keys():
+                if j["metric"]["metricName"] == key:
+                    return j["metric"]["metricValue"]
+
+    def test_init_trackers(self):
+        with tempfile.TemporaryDirectory() as d:
+            tracker = CometMLTracker("test_project_with_config", d)
+            accelerator = Accelerator(log_with=tracker)
+            config = {"num_iterations": 12, "learning_rate": 1e-2, "some_boolean": False, "some_string": "some_value"}
+            accelerator.init_trackers(None, config)
+            accelerator.end_training()
+            log = os.listdir(d)[0]  # Comet is nice, it's just a zip file here
+            # We parse the raw logs
+            p = os.path.join(d, log)
+            archive = zipfile.ZipFile(p, "r")
+            log = archive.open("messages.json").read().decode("utf-8")
+        list_of_json = log.split("\n")[:-1]
+        self.assertEqual(self.get_value_from_key(list_of_json, "num_iterations", True), 12)
+        self.assertEqual(self.get_value_from_key(list_of_json, "learning_rate", True), 0.01)
+        self.assertEqual(self.get_value_from_key(list_of_json, "some_boolean", True), False)
+        self.assertEqual(self.get_value_from_key(list_of_json, "some_string", True), "some_value")
+
+    def test_log(self):
+        with tempfile.TemporaryDirectory() as d:
+            tracker = CometMLTracker("test_project_with_config", d)
+            accelerator = Accelerator(log_with=tracker)
+            accelerator.init_trackers(None)
+            values = {"total_loss": 0.1, "iteration": 1, "my_text": "some_value"}
+            accelerator.log(values, step=0)
+            accelerator.end_training()
+            log = os.listdir(d)[0]  # Comet is nice, it's just a zip file here
+            # We parse the raw logs
+            p = os.path.join(d, log)
+            archive = zipfile.ZipFile(p, "r")
+            log = archive.open("messages.json").read().decode("utf-8")
+        list_of_json = log.split("\n")[:-1]
+        self.assertEqual(self.get_value_from_key(list_of_json, "curr_step", True), 0)
+        self.assertEqual(self.get_value_from_key(list_of_json, "total_loss"), 0.1)
+        self.assertEqual(self.get_value_from_key(list_of_json, "iteration"), 1)
+        self.assertEqual(self.get_value_from_key(list_of_json, "my_text"), "some_value")
+
+
+class MyCustomTracker(GeneralTracker):
+    "Basic tracker that writes to a csv for testing"
+    _col_names = [
+        "total_loss",
+        "iteration",
+        "my_text",
+        "learning_rate",
+        "num_iterations",
+        "some_boolean",
+        "some_string",
+    ]
+
+    requires_logging_directory = False
+
+    def __init__(self, dir: str):
+        self.f = open(f"{dir}/log.csv", "w+")
+        self.writer = csv.DictWriter(self.f, fieldnames=self._col_names)
+        self.writer.writeheader()
+
+    def store_init_configuration(self, values: dict):
+        logger.info("Call init")
+        self.writer.writerow(values)
+
+    def log(self, values: dict, step: Optional[int]):
+        logger.info("Call log")
+        self.writer.writerow(values)
+
+    def finish(self):
+        self.f.close()
+
+
+class CustomTrackerTestCase(unittest.TestCase):
+    def test_init_trackers(self):
+        with tempfile.TemporaryDirectory() as d:
+            tracker = MyCustomTracker(d)
+            accelerator = Accelerator(log_with=tracker)
+            config = {"num_iterations": 12, "learning_rate": 1e-2, "some_boolean": False, "some_string": "some_value"}
+            accelerator.init_trackers("Some name", config)
+            accelerator.end_training()
+            with open(f"{d}/log.csv", "r") as f:
+                data = csv.DictReader(f)
+                data = next(data)
+                truth = {
+                    "total_loss": "",
+                    "iteration": "",
+                    "my_text": "",
+                    "learning_rate": "0.01",
+                    "num_iterations": "12",
+                    "some_boolean": "False",
+                    "some_string": "some_value",
+                }
+                self.assertDictEqual(data, truth)
+
+    def test_log(self):
+        with tempfile.TemporaryDirectory() as d:
+            tracker = MyCustomTracker(d)
+            accelerator = Accelerator(log_with=tracker)
+            accelerator.init_trackers("Some name")
+            values = {"total_loss": 0.1, "iteration": 1, "my_text": "some_value"}
+            accelerator.log(values, step=0)
+            accelerator.end_training()
+            with open(f"{d}/log.csv", "r") as f:
+                data = csv.DictReader(f)
+                data = next(data)
+                truth = {
+                    "total_loss": "0.1",
+                    "iteration": "1",
+                    "my_text": "some_value",
+                    "learning_rate": "",
+                    "num_iterations": "",
+                    "some_boolean": "",
+                    "some_string": "",
+                }
+                self.assertDictEqual(data, truth)
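
The `truth` dictionaries in `CustomTrackerTestCase` contain empty strings for every column that was not logged. This comes from the standard-library `csv` module rather than from Accelerate: `csv.DictWriter` fills missing keys with its `restval` (an empty string by default), and `csv.DictReader` returns every field as a string. A minimal sketch with an in-memory buffer and a shortened, illustrative column list:

```python
import csv
import io

# Illustrative subset of the tracker's columns.
columns = ["total_loss", "iteration", "learning_rate"]

buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
# "learning_rate" is not provided, so DictWriter writes restval ("") for it.
writer.writerow({"total_loss": 0.1, "iteration": 1})

# Read the single data row back; all values come back as strings.
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(dict(row))
```

This is why the tests compare against `"0.1"` and `"1"` rather than the numeric values that were logged.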
--- a/tests/test_utils.py
+++ b/tests/test_utils.py
@ -20,7 +20,7 @@ from collections import UserDict, namedtuple
 import torch

 from accelerate.test_utils.training import RegressionModel
-from accelerate.utils import convert_outputs_to_fp32, patch_environment, send_to_device
+from accelerate.utils import convert_outputs_to_fp32, find_device, patch_environment, send_to_device


 TestNamedTuple = namedtuple("TestNamedTuple", "a b c")
@ -78,3 +78,8 @@ class UtilsTester(unittest.TestCase):
        model = RegressionModel()
        model.forward = convert_outputs_to_fp32(model.forward)
        _ = pickle.dumps(model)
+
+    def test_find_device(self):
+        self.assertEqual(find_device([1, "a", torch.tensor([1, 2, 3])]), torch.device("cpu"))
+        self.assertEqual(find_device({"a": 1, "b": torch.tensor([1, 2, 3])}), torch.device("cpu"))
+        self.assertIsNone(find_device([1, "a"]))
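
The three assertions in `test_find_device` suggest a recursive search through nested containers that returns the device of the first tensor found, or `None` if there is none. The sketch below illustrates that behaviour only. `find_device_sketch` and `FakeTensor` are hypothetical names, the stand-in class replaces `torch.Tensor` so the example runs without PyTorch, and the real `accelerate.utils.find_device` may be implemented differently.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor: only carries a `device` attribute."""
    device: str = "cpu"


def find_device_sketch(data):
    """Return the device of the first tensor-like object in `data`, else None."""
    if isinstance(data, dict):
        # Only the values of a mapping can hold tensors.
        data = list(data.values())
    if isinstance(data, (list, tuple)):
        for item in data:
            device = find_device_sketch(item)
            if device is not None:
                return device
        return None
    # Leaf object: ints and strings have no `device`, so this yields None.
    return getattr(data, "device", None)


if __name__ == "__main__":
    print(find_device_sketch([1, "a", FakeTensor()]))
    print(find_device_sketch({"a": 1, "b": FakeTensor()}))
    print(find_device_sketch([1, "a"]))
```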
Author	SHA1	Message	Date
Sylvain Gugger	f626d87eb7	Release: v0.9.0	2022-05-20 13:46:17 -04:00
Sylvain Gugger	8b8c5345cd	Refactor some parts in utils (#380 )	2022-05-20 12:23:54 -04:00
Sylvain Gugger	41427c594a	Better check for deepspeed availability (#379 ) * Better check for deepspeed availability * Address comment * Simplify a bit	2022-05-20 11:05:18 -04:00
Loubna Ben Allal	3c45b6f760	fix shuffling for ShufflerIterDataPipe instances (#376 ) * fix shuffling for ShufflerIterDataPipe instances * add versioning test for Pytorch * fix minimum Pytorch version Co-authored-by: Loubna ben allal <loubnabenallal@gmail.com>	2022-05-20 08:55:03 -04:00
Sourab Mangrulkar	b922c63322	fix zero stage-1 (#378 )	2022-05-20 17:18:17 +05:30
Zachary Mueller	23c0341262	Refactor tests to use accelerate launch (#373 ) Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2022-05-19 11:48:12 -04:00
Sourab Mangrulkar	6163e20b14	deepspeed save model temp fix (#374 ) * fix deepspeed model saving * fix deepspeed zero stage-3 model save fixes #369 Co-Authored-By: Kovvuri Satyanarayana Reddy <54667784+KOVVURISATYANARAYANAREDDY@users.noreply.github.com> Co-authored-by: Kovvuri Satyanarayana Reddy <54667784+KOVVURISATYANARAYANAREDDY@users.noreply.github.com>	2022-05-19 18:01:53 +05:30
Sourab Mangrulkar	d33dc39a32	fix deepspeed model saving (#370 )	2022-05-19 00:07:20 +05:30
Zachary Mueller	043d2ec52d	Add a utility for writing a barebones config file (#371 ) * Create a basic_config function	2022-05-18 13:39:19 -04:00
Zachary Mueller	64e41a4995	Remove tensor call (#365 )	2022-05-13 10:51:14 -04:00
Sourab Mangrulkar	4736c754bf	fix tracking (#361 ) * fixing trackers * quality * bug fix * bug fix * addressing comments and fixing tests * Fixing script diff test	2022-05-13 17:20:27 +05:30
Tanishq Abraham	28edac2c4c	Update launchers.py (#363 )	2022-05-13 07:25:44 -04:00
Zachary Mueller	1700716760	Handle deprication errors in launch (#360 ) * Adjust based on deprication	2022-05-12 11:13:50 -04:00
Sylvain Gugger	aa9b614967	v0.9.0.dev0	2022-05-12 11:02:19 -04:00
Sylvain Gugger	2943172b8f	v0.8.0 Release	2022-05-12 10:52:54 -04:00
Sylvain Gugger	f56f4441b3	Big model inference (#345 ) * Big model inference * Reorganize port cleanup * Last cleanup * Test fix * Quality * Update src/accelerate/big_modeling.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Fix bug in default mem * Check device map is complete * More tests * Make load function more general * Apply suggestions from code review Co-authored-by: Zachary Mueller <muellerzr@gmail.com> * Quality * Address more review comments * Check generation results for gpt2 * Add main wrapper around everything * Tests for final API * Clean infer_auto_device * Type annotations * Apply suggestions from code review Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com> Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr> * Address review comments * Last review comment for now * Fix bug in clean_device_map * Add doc * Style * Fixes + dtype support * Fix test * Add option to offload CPU state_dict * Indent typo * Final tweaks Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> Co-authored-by: Zachary Mueller <muellerzr@gmail.com> Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com> Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>	2022-05-12 10:09:28 -04:00
Sourab Mangrulkar	45359a73ff	DeepSpeed and FSDP plugin support through script (#356 ) * DeepSpeed and FSDP plugin support through script Setting env variables when DeepSpeed /FSDP plugins are provided directly through script without using accelerate launch. * quality	2022-05-11 19:37:49 +05:30
Sourab Mangrulkar	b5b68fbb4d	Fixing metric eval in distributed setup (#355 )	2022-05-10 17:17:22 +05:30
Zachary Mueller	d190ed7e41	Fix sample calculation in examples (#352 ) * Fix metric calculation across examples	2022-05-09 15:44:49 -04:00
Sourab Mangrulkar	b923e134e7	Fix prompt for num_processes (#347 ) * Fix prompt for num_processes * Fix prompting Handling FSDP and DeepSpeed num_processes while prompting. * quality	2022-05-06 17:42:23 +05:30
Zachary Mueller	b2956acbe9	Better prompt for number of training devices (#344 ) * TPU specific	2022-05-05 13:12:32 -04:00
Sourab Mangrulkar	be0f7ce44f	Handle Manual Wrapping in FSDP. Minor fix of fsdp example. (#342 ) * Handle manual wrapping in FSDP. Fix fsdp example.	2022-05-05 21:15:53 +05:30
Zachary Mueller	603a53f056	Improve num_processes question in CLI (#343 ) * Rephrase num_processes question Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2022-05-05 11:07:23 -04:00
Zachary Mueller	02e2ed567b	Refactor utils into its own module (#340 ) Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2022-05-05 10:48:07 -04:00
Zachary Mueller	8abd274a7f	Introduce multiprocess logger (#337 )	2022-05-02 09:45:10 -04:00
Idodox	b05d483944	Fixed a typo to enable running accelerate correctly (#339 )	2022-05-02 07:54:57 -04:00
Sourab Mangrulkar	a74c7c9538	Create peak_memory_uasge_tracker.py (#336 ) * Create peak_memory_uasge_tracker.py Adding the example by feature for tracking peak memory usage of GPU. One example of usage is to track the peak memory reduction when using FSDP. * fixing the typo in the file name * reformatting * exclude peak_memory_usage_tracker.py from tests * renaming and highlighting proper usage * Update test_examples.py 😅	2022-04-29 22:38:34 +05:30
Zachary Mueller	a60640d7e2	Patchfix infinite loop (#335 )	2022-04-29 08:34:37 -04:00
Zachary Mueller	611546f12d	Add guards for batch size finder (#334 ) * Fix zero reached Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2022-04-28 16:34:07 -04:00
Zachary Mueller	7d2a259e3d	Fix fdsp config in cluster (#331 ) Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2022-04-28 16:01:28 -04:00
Zachary Mueller	e5c17f36a8	Clean up tests + fix import (#330 )	2022-04-28 13:37:02 -04:00
Sylvain Gugger	20de3fc959	v0.8.0.dev0 with setup	2022-04-28 11:27:50 -04:00
Sylvain Gugger	f84cb0c1fa	v0.8.0.dev0	2022-04-28 11:27:39 -04:00
Sylvain Gugger	136437e3e8	Fix default config dicts (#329 ) * Fix default config dicts * style	2022-04-28 11:23:44 -04:00
Sourab Mangrulkar	2622cc0f98	PyTorch FSDP Feature Incorporation (#321 ) * PyTorch FSDP Feature Incorporation Changes to enable the PyTorch FSDP features. * removing fsdp_kwargs * Addressing the comments and removing the .DS_Store files * adding fsdp_config to the FSDP Plugin * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Addressing comments and little refactoring * Create fsdp.mdx * Update _toctree.yml * refactoring documentation and undo indentation in _toctree.yml Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2022-04-28 17:09:01 +05:30
Zachary Mueller	5f433673e1	Introduce reduce operator (#326 ) Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2022-04-26 15:39:11 -04:00
Zachary Mueller	b028a1981d	Add a memory-aware decorator for CUDA OOM avoidance (#324 ) Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2022-04-26 10:43:06 -04:00
Zachary Mueller	3e14dd16be	Fixup all checkpointing examples (#323 )	2022-04-21 14:25:10 -04:00
Zachary Mueller	fa476d03ce	Update examples to show how to deal with extra validation copies (#319 ) Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2022-04-20 14:02:58 -04:00
Loubna Ben Allal	53638352a0	fix typo (#320 )	2022-04-20 07:29:41 -04:00
Zachary Mueller	5791d3dd6b	Create alias for Accelerator.free_memory (#318 ) * Add `Accelerator.clear` alias Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2022-04-19 16:26:03 -04:00
Zachary Mueller	2d7fbbdc73	Create Cross-Validation example (#317 )	2022-04-19 16:14:07 -04:00
Zachary Mueller	461ac7d476	Refactor Tracker logic and write guards for logging_dir (#316 )	2022-04-19 10:21:11 -04:00
Zachary Mueller	209db19dc8	Create a testing framework for example scripts and fix current ones (#313 ) Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2022-04-13 13:24:36 -04:00
Christopher Dewan	381ae20027	Fix DataLoader sharding for deepspeed in accelerate (#315 ) * set first_pass on calls from deepspeed to _prepare_one(...) so that it is not a noop and actually wraps our dataloaders * fixed style	2022-04-13 11:35:06 -04:00
Zachary Mueller	8595834292	Refactor Examples by Feature (#312 ) Splits up examples into by feature scripts	2022-04-11 15:59:13 -04:00
Zachary Mueller	fa2ec4ba16	Fix Accelerate CLI CPU option + small fix for W&B tests (#311 ) * Fix command input * Make W&B log test more stable by changing assertEqual -> assertTrue	2022-04-08 12:22:08 -04:00
Sylvain Gugger	1d95ebdaa4	Use --no_local_rank for DeepSpeed launch (#309 ) * Use --no_local_rank for DeepSpeed launch * Plus one typo	2022-04-04 17:58:55 -04:00
Zachary Mueller	38e6d941fa	Update example scripts (#307 ) Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2022-04-04 17:19:25 -04:00
Sylvain Gugger	7eb5255694	Fix training in DeepSpeed (#308 ) * Fix training in DeepSpeed * Be more defensive * Apply suggestions from code review Co-authored-by: Zachary Mueller <muellerzr@gmail.com> Co-authored-by: Zachary Mueller <muellerzr@gmail.com>	2022-04-04 16:52:05 -04:00
Zachary Mueller	e72a125502	Write tests for comet_ml (#306 ) * Write tests for comet_ml * No need for second mock	2022-03-31 17:39:18 -04:00
Zachary Mueller	e361dcc2a7	Have custom trackers work with the API (#305 ) Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2022-03-31 14:57:19 -04:00
Zachary Mueller	e66ba31af2	Create new TestCase classes and clean up W&B tests (#304 )	2022-03-31 14:04:26 -04:00
Sylvain Gugger	2c554b056c	Pass `lr_scheduler` to `Accelerator.prepare` (#301 ) * Work in progress * Pass scheduler to Accelerator.prapre * Fix tests * Apply suggestions from code review Co-authored-by: Zachary Mueller <muellerzr@gmail.com> * Style post comments Co-authored-by: Zachary Mueller <muellerzr@gmail.com>	2022-03-31 09:55:41 -04:00
Zachary Mueller	5668270de7	Add logging capabilities (#293 ) Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> - Added experiment tracking API, and support for Weights and Biases, TensorBoard, and CometML + Tests - Added `tensorflow` to a new dependency list to be used during tests - Added three new functions in `Accelerator` to interact with the API	2022-03-30 17:40:32 -04:00
Sylvain Gugger	f03f18252f	Leave default as None (#300 )	2022-03-30 13:48:10 -04:00
Sylvain Gugger	5b2e6edab2	Fix example for datasets v2 (#298 )	2022-03-29 15:48:16 -04:00
Sylvain Gugger	1e0b96f814	Load model and optimizet states on CPU to void OOMs (#299 )	2022-03-29 14:49:44 -04:00
Zachary Mueller	5d83eed3d2	Refactor precisions to its own enum (#292 ) * Refactor precision * Add in enum subclass and inheritence	2022-03-24 12:44:06 -04:00
Zachary Mueller	69ff072643	Document save/load state (#290 )	2022-03-23 14:19:07 -04:00