mirror of https://github.com/huggingface/accelerate.git synced 2025-10-21 02:33:46 +08:00

Matej Sirovatka d7c741a6bc Initial FSDP2 support (#3394 )

2025-03-27 15:01:18 -04:00

Utility functions and classes

Below are a variety of utility functions that 🤗 Accelerate provides, broken down by use-case.

Constants

Constants used throughout 🤗 Accelerate, listed here for reference.

The following constants are used by [Accelerator.save_state]:

utils.MODEL_NAME: "pytorch_model"
utils.OPTIMIZER_NAME: "optimizer"
utils.RNG_STATE_NAME: "random_states"
utils.SCALER_NAME: "scaler.pt"
utils.SCHEDULER_NAME: "scheduler"
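As a minimal sketch of how these names might be used, the snippet below looks in a checkpoint folder for files whose names start with each constant. Several of the values are stems rather than complete file names, so the sketch matches on prefix and assumes nothing about suffixes. `find_state_files` is a hypothetical helper, not part of Accelerate, and the fallback values are copied from the list above for environments where Accelerate cannot be imported.

```python
from pathlib import Path

try:
    # The constants documented above, imported from accelerate.utils.
    from accelerate.utils import (
        MODEL_NAME,
        OPTIMIZER_NAME,
        RNG_STATE_NAME,
        SCALER_NAME,
        SCHEDULER_NAME,
    )
except ImportError:
    # Fallback values copied from the list above.
    MODEL_NAME = "pytorch_model"
    OPTIMIZER_NAME = "optimizer"
    RNG_STATE_NAME = "random_states"
    SCALER_NAME = "scaler.pt"
    SCHEDULER_NAME = "scheduler"

STATE_NAMES = {
    "model": MODEL_NAME,
    "optimizer": OPTIMIZER_NAME,
    "rng": RNG_STATE_NAME,
    "scaler": SCALER_NAME,
    "scheduler": SCHEDULER_NAME,
}


def find_state_files(checkpoint_dir):
    """Hypothetical helper: list, per state kind, the files whose names start with its constant."""
    folder = Path(checkpoint_dir)
    return {
        kind: sorted(path.name for path in folder.glob(f"{name}*"))
        for kind, name in STATE_NAMES.items()
    }
```

Calling `find_state_files("path/to/checkpoint")` returns a dictionary keyed by state kind, with an empty list for any kind that has no matching file.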

The following constants are used by [Accelerator.save_model]:

utils.WEIGHTS_NAME: "pytorch_model.bin"
utils.SAFE_WEIGHTS_NAME: "model.safetensors"
utils.WEIGHTS_INDEX_NAME: "pytorch_model.bin.index.json"
utils.SAFE_WEIGHTS_INDEX_NAME: "model.safetensors.index.json"
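The file names above can be used to tell which kind of export a folder holds. The sketch below is one hedged way to do that: it assumes an index file is present only when the weights were split into shards, and `describe_saved_model` is a hypothetical helper rather than an Accelerate function. The values are copied from the list above; in a real script they could be imported from `accelerate.utils` instead.

```python
from pathlib import Path

# Values copied from the list above.
WEIGHTS_NAME = "pytorch_model.bin"
SAFE_WEIGHTS_NAME = "model.safetensors"
WEIGHTS_INDEX_NAME = "pytorch_model.bin.index.json"
SAFE_WEIGHTS_INDEX_NAME = "model.safetensors.index.json"


def describe_saved_model(save_dir):
    """Hypothetical helper: return (format, is_sharded), or None if no known file is found."""
    names = {path.name for path in Path(save_dir).iterdir()}
    # Check index files first: their presence is taken to mean a sharded export.
    if SAFE_WEIGHTS_INDEX_NAME in names:
        return ("safetensors", True)
    if SAFE_WEIGHTS_NAME in names:
        return ("safetensors", False)
    if WEIGHTS_INDEX_NAME in names:
        return ("pytorch", True)
    if WEIGHTS_NAME in names:
        return ("pytorch", False)
    return None
```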

Data Classes

These are basic dataclasses used throughout 🤗 Accelerate; they can be passed in as parameters.

Standalone

These are standalone dataclasses used for checks, such as the type of distributed system being used.

autodoc utils.ComputeEnvironment

autodoc utils.DistributedType

autodoc utils.DynamoBackend

autodoc utils.LoggerType

autodoc utils.PrecisionType

autodoc utils.RNGType

autodoc utils.SageMakerDistributedType

Kwargs

These are configurable arguments for specific interactions throughout the PyTorch ecosystem that Accelerate handles under the hood.

autodoc utils.AutocastKwargs

autodoc utils.DistributedDataParallelKwargs

autodoc utils.FP8RecipeKwargs

autodoc utils.GradScalerKwargs

autodoc utils.InitProcessGroupKwargs

autodoc utils.KwargsHandler

Plugins

These are plugins that can be passed to the [Accelerator] object. They are documented in detail elsewhere and are collected here for convenience:

autodoc utils.DeepSpeedPlugin

autodoc utils.FullyShardedDataParallelPlugin

autodoc utils.GradientAccumulationPlugin

autodoc utils.MegatronLMPlugin

autodoc utils.TorchDynamoPlugin

Configurations

These are classes that can be configured and passed through to the appropriate integration.

autodoc utils.BnbQuantizationConfig

autodoc utils.DataLoaderConfiguration

autodoc utils.ProjectConfiguration

Environment Variables

These are environment variables that can be enabled for different use cases.

ACCELERATE_DEBUG_MODE (str): Whether to run accelerate in debug mode. More info available here.
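One way to enable the flag is to set it in the environment of the launched process. The sketch below builds such an environment in Python. It assumes that `"1"` is accepted as a truthy value for the variable; `debug_env` is a hypothetical helper and `train.py` is a placeholder script name.

```python
import os


def debug_env(base_env=None):
    """Return a copy of the environment with ACCELERATE_DEBUG_MODE switched on."""
    env = dict(os.environ if base_env is None else base_env)
    env["ACCELERATE_DEBUG_MODE"] = "1"  # assumption: "1" is parsed as true
    return env


# Example launch (not executed here; requires the accelerate CLI and a real script):
# import subprocess
# subprocess.run(["accelerate", "launch", "train.py"], env=debug_env(), check=True)
```

Building a copy rather than mutating `os.environ` keeps the flag scoped to the child process.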

Data Manipulation and Operations

These are data operations that mirror the corresponding torch ops but can be used across distributed processes.

autodoc utils.broadcast

autodoc utils.broadcast_object_list

autodoc utils.concatenate

autodoc utils.convert_outputs_to_fp32

autodoc utils.convert_to_fp32

autodoc utils.gather

autodoc utils.gather_object

autodoc utils.get_grad_scaler

autodoc utils.get_mixed_precision_context_manager

autodoc utils.listify

autodoc utils.pad_across_processes

autodoc utils.recursively_apply

autodoc utils.reduce

autodoc utils.send_to_device

autodoc utils.slice_tensors

Environment Checks

These functions check the state of the current working environment, including information about the operating system itself, what it can support, and whether particular dependencies are installed.

autodoc utils.is_bf16_available

autodoc utils.is_ipex_available

autodoc utils.is_mps_available

autodoc utils.is_npu_available

autodoc utils.is_torch_version

autodoc utils.is_torch_xla_available

autodoc utils.is_xpu_available

Environment Manipulation

autodoc utils.patch_environment

autodoc utils.clear_environment

autodoc utils.write_basic_config

When setting up 🤗 Accelerate for the first time, [~utils.write_basic_config] can be used for quick configuration as an alternative to running accelerate config.
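A minimal sketch of that quick configuration follows. It writes the file into a temporary folder so that any existing configuration is left untouched. The `mixed_precision` and `save_location` arguments are used as I understand the function's signature, so treat them as assumptions to check against the autodoc entry above. `write_config_to` is a hypothetical wrapper that returns `None` when Accelerate is not installed.

```python
import tempfile
from pathlib import Path


def write_config_to(directory, mixed_precision="no"):
    """Write a basic config into `directory`; return its path, or None if Accelerate is unavailable."""
    try:
        from accelerate.utils import write_basic_config
    except ImportError:
        return None
    target = Path(directory) / "default_config.json"
    # Assumed arguments: the precision mode and the path of the file to create.
    write_basic_config(mixed_precision=mixed_precision, save_location=str(target))
    return target


with tempfile.TemporaryDirectory() as tmp:
    config_path = write_config_to(tmp)
    if config_path is not None:
        # Show what was generated for this machine.
        print(config_path.read_text())
```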

autodoc utils.set_numa_affinity

autodoc utils.environment.override_numa_affinity

autodoc utils.purge_accelerate_environment

Memory

autodoc utils.find_executable_batch_size

Modeling

These utilities relate to interacting with PyTorch models

autodoc utils.calculate_maximum_sizes

autodoc utils.compute_module_sizes

autodoc utils.extract_model_from_parallel

autodoc utils.get_balanced_memory

autodoc utils.get_max_layer_size

autodoc utils.infer_auto_device_map

autodoc utils.load_checkpoint_in_model

autodoc utils.load_offloaded_weights

autodoc utils.load_state_dict

autodoc utils.offload_state_dict

autodoc utils.retie_parameters

autodoc utils.set_module_tensor_to_device

autodoc utils.get_module_children_bottom_up

Parallel

These include general utilities that should be used when working in parallel.

autodoc utils.extract_model_from_parallel

autodoc utils.save

autodoc utils.load

autodoc utils.wait_for_everyone

Random

These utilities relate to setting and synchronizing all random states.

autodoc utils.set_seed

autodoc utils.synchronize_rng_state

autodoc utils.synchronize_rng_states

PyTorch XLA

These include utilities that are useful while using PyTorch with XLA.

autodoc utils.install_xla

Loading model weights

These include utilities that are useful to load checkpoints.