Compare commits

...

229 Commits

Author SHA1 Message Date
65b5d2e74a Try to make it work for BLOOM 2022-05-31 13:42:50 -04:00
95d1edbf8d Update requirements to pin tensorboard and include psutil (#408)
* Update test requirements to include psutil, tensorboard, and the right tensorflow version
2022-05-31 09:52:16 -04:00
a91575f1bb Fix CUDA examples tests (#407)
* Fix CUDA tests

* Use num_processes to keep everything under one test
2022-05-31 09:51:21 -04:00
146ce3df48 Move datasets and transformers to under func (#411) 2022-05-31 08:47:16 -04:00
94d88fb50d Fix CUDA Dockerfile (#409)
* Install git

* Fix CPU image as well
2022-05-31 08:47:08 -04:00
b515800947 Hotfix all failing GPU tests (#401)
* Fix up makefile
2022-05-26 14:13:19 -04:00
d1f7f99684 improve metrics logged in examples (#399) 2022-05-26 17:29:49 +05:30
00ee34d9a6 Refactor offload_state_dict and fix in offload_weight (#398) 2022-05-25 16:09:25 -04:00
f6ec2660f0 Refactor version checking into a utility (#395)
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-05-25 14:07:39 -04:00
b3e21686de Include fastai in frameworks (#396) 2022-05-25 13:42:09 -04:00
f12ef1416e Add packaging to requirements (#394)
* Add packaging to requirements
2022-05-25 11:33:14 -04:00
18085fa250 Better dispatch for submodules (#392) 2022-05-25 10:51:18 -04:00
6be221f15e Build Docker Images nightly (#391) 2022-05-24 15:02:08 -04:00
3c4308e8cd Revert "Better dispatch for modules"
This reverts commit 17046bfaf8b805ebbc8ac4695f731b58c61004ed.
2022-05-24 13:48:19 -04:00
17046bfaf8 Better dispatch for modules 2022-05-24 13:47:39 -04:00
07ed7e92b5 Small bugfix for the stalebot workflow (#390)
* Bugfix dispatch
2022-05-24 11:58:26 -04:00
5a679d08d3 Introduce stalebot (#387)
* Add stalebot
2022-05-23 17:10:14 -04:00
5a00ece500 Create Dockerfiles for Accelerate (#377) 2022-05-23 17:09:56 -04:00
f62ae86cfb Mix precision -> Mixed precision (#388) 2022-05-23 15:02:29 -04:00
f9de557037 Fix OneCycle step length when in multiprocess (#385)
* Special onecycle fix
2022-05-23 12:28:44 -04:00
517cbf408b V0.10.0.dev0 2022-05-20 13:51:21 -04:00
f626d87eb7 Release: v0.9.0 2022-05-20 13:46:17 -04:00
8b8c5345cd Refactor some parts in utils (#380) 2022-05-20 12:23:54 -04:00
41427c594a Better check for deepspeed availability (#379)
* Better check for deepspeed availability

* Address comment

* Simplify a bit
2022-05-20 11:05:18 -04:00
3c45b6f760 fix shuffling for ShufflerIterDataPipe instances (#376)
* fix shuffling for ShufflerIterDataPipe instances

* add versioning test for Pytorch

* fix minimum Pytorch version

Co-authored-by: Loubna ben allal <loubnabenallal@gmail.com>
2022-05-20 08:55:03 -04:00
b922c63322 fix zero stage-1 (#378) 2022-05-20 17:18:17 +05:30
23c0341262 Refactor tests to use accelerate launch (#373)
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-05-19 11:48:12 -04:00
6163e20b14 deepspeed save model temp fix (#374)
* fix deepspeed model saving

* fix deepspeed zero stage-3 model save

fixes #369

Co-Authored-By: Kovvuri Satyanarayana Reddy <54667784+KOVVURISATYANARAYANAREDDY@users.noreply.github.com>

Co-authored-by: Kovvuri Satyanarayana Reddy <54667784+KOVVURISATYANARAYANAREDDY@users.noreply.github.com>
2022-05-19 18:01:53 +05:30
d33dc39a32 fix deepspeed model saving (#370) 2022-05-19 00:07:20 +05:30
043d2ec52d Add a utility for writing a barebones config file (#371)
* Create a basic_config function
2022-05-18 13:39:19 -04:00
64e41a4995 Remove tensor call (#365) 2022-05-13 10:51:14 -04:00
4736c754bf fix tracking (#361)
* fixing trackers

* quality

* bug fix

* bug fix

* addressing comments and fixing tests

* Fixing script diff test
2022-05-13 17:20:27 +05:30
28edac2c4c Update launchers.py (#363) 2022-05-13 07:25:44 -04:00
1700716760 Handle deprication errors in launch (#360)
* Adjust based on deprication
2022-05-12 11:13:50 -04:00
aa9b614967 v0.9.0.dev0 2022-05-12 11:02:19 -04:00
2943172b8f v0.8.0 Release 2022-05-12 10:52:54 -04:00
f56f4441b3 Big model inference (#345)
* Big model inference

* Reorganize port cleanup

* Last cleanup

* Test fix

* Quality

* Update src/accelerate/big_modeling.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Fix bug in default mem

* Check device map is complete

* More tests

* Make load function more general

* Apply suggestions from code review

Co-authored-by: Zachary Mueller <muellerzr@gmail.com>

* Quality

* Address more review comments

* Check generation results for gpt2

* Add main wrapper around everything

* Tests for final API

* Clean infer_auto_device

* Type annotations

* Apply suggestions from code review

Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>

* Address review comments

* Last review comment for now

* Fix bug in clean_device_map

* Add doc

* Style

* Fixes + dtype support

* Fix test

* Add option to offload CPU state_dict

* Indent typo

* Final tweaks

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Zachary Mueller <muellerzr@gmail.com>
Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
2022-05-12 10:09:28 -04:00
45359a73ff DeepSpeed and FSDP plugin support through script (#356)
* DeepSpeed and FSDP plugin support through script

Setting env variables when DeepSpeed /FSDP plugins are provided directly through script without using accelerate launch.

* quality
2022-05-11 19:37:49 +05:30
b5b68fbb4d Fixing metric eval in distributed setup (#355) 2022-05-10 17:17:22 +05:30
d190ed7e41 Fix sample calculation in examples (#352)
* Fix metric calculation across examples
2022-05-09 15:44:49 -04:00
b923e134e7 Fix prompt for num_processes (#347)
* Fix prompt for num_processes

* Fix prompting

Handling FSDP and DeepSpeed num_processes while prompting.

* quality
2022-05-06 17:42:23 +05:30
b2956acbe9 Better prompt for number of training devices (#344)
* TPU specific
2022-05-05 13:12:32 -04:00
be0f7ce44f Handle Manual Wrapping in FSDP. Minor fix of fsdp example. (#342)
* Handle manual wrapping in FSDP. Fix fsdp example.
2022-05-05 21:15:53 +05:30
603a53f056 Improve num_processes question in CLI (#343)
* Rephrase num_processes question
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-05-05 11:07:23 -04:00
02e2ed567b Refactor utils into its own module (#340)
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-05-05 10:48:07 -04:00
8abd274a7f Introduce multiprocess logger (#337) 2022-05-02 09:45:10 -04:00
b05d483944 Fixed a typo to enable running accelerate correctly (#339) 2022-05-02 07:54:57 -04:00
a74c7c9538 Create peak_memory_uasge_tracker.py (#336)
* Create peak_memory_uasge_tracker.py

Adding the example by feature for tracking peak memory usage of GPU. One example of usage is to track the peak memory reduction when using FSDP.

* fixing the typo in the file name

* reformatting

* exclude peak_memory_usage_tracker.py from tests

* renaming and highlighting proper usage

* Update test_examples.py

😅
2022-04-29 22:38:34 +05:30
a60640d7e2 Patchfix infinite loop (#335) 2022-04-29 08:34:37 -04:00
611546f12d Add guards for batch size finder (#334)
* Fix zero reached

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-04-28 16:34:07 -04:00
7d2a259e3d Fix fdsp config in cluster (#331)
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-04-28 16:01:28 -04:00
e5c17f36a8 Clean up tests + fix import (#330) 2022-04-28 13:37:02 -04:00
20de3fc959 v0.8.0.dev0 with setup 2022-04-28 11:27:50 -04:00
f84cb0c1fa v0.8.0.dev0 2022-04-28 11:27:39 -04:00
136437e3e8 Fix default config dicts (#329)
* Fix default config dicts

* style
2022-04-28 11:23:44 -04:00
2622cc0f98 PyTorch FSDP Feature Incorporation (#321)
* PyTorch FSDP Feature Incorporation

Changes to enable the PyTorch FSDP features.

* removing fsdp_kwargs

* Addressing the comments and removing the .DS_Store files

* adding fsdp_config to the FSDP Plugin

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Addressing comments and little refactoring

* Create fsdp.mdx

* Update _toctree.yml

* refactoring documentation and undo indentation in _toctree.yml

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-04-28 17:09:01 +05:30
5f433673e1 Introduce reduce operator (#326)
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-04-26 15:39:11 -04:00
b028a1981d Add a memory-aware decorator for CUDA OOM avoidance (#324)
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-04-26 10:43:06 -04:00
3e14dd16be Fixup all checkpointing examples (#323) 2022-04-21 14:25:10 -04:00
fa476d03ce Update examples to show how to deal with extra validation copies (#319)
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-04-20 14:02:58 -04:00
53638352a0 fix typo (#320) 2022-04-20 07:29:41 -04:00
5791d3dd6b Create alias for Accelerator.free_memory (#318)
* Add `Accelerator.clear` alias

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-04-19 16:26:03 -04:00
2d7fbbdc73 Create Cross-Validation example (#317) 2022-04-19 16:14:07 -04:00
461ac7d476 Refactor Tracker logic and write guards for logging_dir (#316) 2022-04-19 10:21:11 -04:00
209db19dc8 Create a testing framework for example scripts and fix current ones (#313)
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-04-13 13:24:36 -04:00
381ae20027 Fix DataLoader sharding for deepspeed in accelerate (#315)
* set first_pass on calls from deepspeed to _prepare_one(...) so that it is not a noop and actually wraps our dataloaders

* fixed style
2022-04-13 11:35:06 -04:00
8595834292 Refactor Examples by Feature (#312)
Splits up examples into by feature scripts
2022-04-11 15:59:13 -04:00
fa2ec4ba16 Fix Accelerate CLI CPU option + small fix for W&B tests (#311)
* Fix command input

* Make W&B log test more stable by changing assertEqual -> assertTrue
2022-04-08 12:22:08 -04:00
1d95ebdaa4 Use --no_local_rank for DeepSpeed launch (#309)
* Use --no_local_rank for DeepSpeed launch

* Plus one typo
2022-04-04 17:58:55 -04:00
38e6d941fa Update example scripts (#307)
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-04-04 17:19:25 -04:00
7eb5255694 Fix training in DeepSpeed (#308)
* Fix training in DeepSpeed

* Be more defensive

* Apply suggestions from code review

Co-authored-by: Zachary Mueller <muellerzr@gmail.com>

Co-authored-by: Zachary Mueller <muellerzr@gmail.com>
2022-04-04 16:52:05 -04:00
e72a125502 Write tests for comet_ml (#306)
* Write tests for comet_ml

* No need for second mock
2022-03-31 17:39:18 -04:00
e361dcc2a7 Have custom trackers work with the API (#305)
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-03-31 14:57:19 -04:00
e66ba31af2 Create new TestCase classes and clean up W&B tests (#304) 2022-03-31 14:04:26 -04:00
2c554b056c Pass lr_scheduler to Accelerator.prepare (#301)
* Work in progress

* Pass scheduler to Accelerator.prapre

* Fix tests

* Apply suggestions from code review

Co-authored-by: Zachary Mueller <muellerzr@gmail.com>

* Style post comments

Co-authored-by: Zachary Mueller <muellerzr@gmail.com>
2022-03-31 09:55:41 -04:00
5668270de7 Add logging capabilities (#293)
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

- Added experiment tracking API, and support for Weights and Biases, TensorBoard, and CometML + Tests
- Added `tensorflow` to a new dependency list to be used during tests
- Added three new functions in `Accelerator` to interact with the API
2022-03-30 17:40:32 -04:00
f03f18252f Leave default as None (#300) 2022-03-30 13:48:10 -04:00
5b2e6edab2 Fix example for datasets v2 (#298) 2022-03-29 15:48:16 -04:00
1e0b96f814 Load model and optimizet states on CPU to void OOMs (#299) 2022-03-29 14:49:44 -04:00
5d83eed3d2 Refactor precisions to its own enum (#292)
* Refactor precision

* Add in enum subclass and inheritence
2022-03-24 12:44:06 -04:00
69ff072643 Document save/load state (#290) 2022-03-23 14:19:07 -04:00
211e6555fa Fix breaking change 2022-03-18 17:40:32 -04:00
a5b782b0a1 v0.7.0.dev0 2022-03-18 09:40:22 -04:00
339d4e0372 Release for real 2022-03-18 09:36:58 -04:00
3cfebcc93a Release v0.6.0 2022-03-18 09:33:02 -04:00
4628652866 Pass along execution info to the exit of autocast (#284) 2022-03-16 12:20:35 -04:00
0e0ac26fdf Use workflow from doc-builder (#275)
* Use workflow from doc-builder to build PR docs

* Adjust branch

* Consecutive jobs

* Transfer lib install to the workflow

* Remove dep

* Add dev install

* Use delete doc comment workflow

* Trigger

* Last job and better token maybe?

* Adapt token

* Use temp variable

* Use temp variable for real

* Pass the token better

* Let the template fetch the token

* Try to build the main doc!

* With the right name, preferably

* Notebook try

* Test

* Put back

* Final cleanup

* Final cleanup for realsies

* Switch to main branch
2022-03-14 14:56:30 -04:00
2fcbc81d4b replace texts and link (master -> main) (#282)
* replace texts and link (master -> main)

* replace texts and link (master -> main)
2022-03-14 13:26:07 -04:00
06df083041 add env command (#280)
* add env command

modified:   src/accelerate/commands/accelerate_cli.py
	- added the the env command to the command cli

new file:   src/accelerate/commands/env.py
	- added the env command parser and env command

new file:   src/accelerate/file_utils.py
	- added is_torch_available
		- based on a69e185074/src/transformers/file_utils.py (L69)

modified:   src/accelerate/utils.py
	- add import of importlib_metadata
		- maybe can do this at the file_utils? or maybe added the is_torch_available to the `utils.py`? I just based in the organization from `transformers` repo

* remove unnecessary is torch available
modified:   src/accelerate/commands/env.py
- remove use of is_torch_available

deleted:    src/accelerate/file_utils.py
- remove is torch available

modified:   src/accelerate/utils.py
- revert to 00e80dcfff899440b743cf9d1453cc762268591b

* add default configs of accelerate

* add default configs of accelerate

* Update src/accelerate/commands/env.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-03-14 11:27:41 -04:00
f7dc733685 Contributing guide (#254)
* Contributing guide

* Typo

* Review comments

Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update CONTRIBUTING.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-03-11 14:36:39 -05:00
00e80dcfff Make accelerated model with AMP possible to pickle (#274)
* Make accelerated model with AMP possible to pickle

Solves issue 273

As discussed in the issue, the solution is to use a class instead of a
closure to achieve the same result.

I created an alias convert_outputs_to_fp32 = ConvertOutputsToFp32 so
that importing convert_outputs_to_fp32 can still be imported.

Alternatively, I could remove the alias and change import to use
ConvertOutputsToFp32 directly, but this may break backwards
compatibility. Or I could name the class convert_outputs_to_fp32 because
it works like a function.

Regarding the testing, I added a check to test_script.py for the model
trained with mixed_precision=fp16. Locally, this test could not trigger
the error in the issue because the forward method is never replaced. I
believe this is because AcceleratorState detects that my machine can't
perform fp16 training. I hope that in CI, this would be detected.

* Move test and style (#1)

* Remove unnecessary import

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-03-10 17:18:55 -05:00
a2e3e5ebec Trigger doc build 2022-03-10 12:06:43 -05:00
3ef1724b3f Fix typo 2022-03-10 11:51:41 -05:00
f4bd8e3cc5 Proper install 2022-03-10 11:44:41 -05:00
1e7ec4a5c2 Update doc build job 2022-03-10 11:24:21 -05:00
986d5b93b7 Trigger doc build 2022-03-10 11:20:25 -05:00
fb5ed62c10 Convert documentation to the new front (#271)
* Main conversion

* Doc styling

* Style

* New front deploy

* Fixes

* Fixes

* Fix new docstrings

* Style
2022-03-10 11:13:40 -05:00
6ffab178ac Implementation of saving and loading custom states (#270) 2022-03-08 16:52:14 -05:00
515fcca9ed Don't use dispatch_batches when torch is < 1.8.0 (#269) 2022-03-07 12:20:00 -05:00
2a5f4c6311 Ability to set the seed with randomness from inside Accelerate (#266)
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-03-02 17:31:20 -05:00
1e630cd3a7 Add in checkpointing capability (#255)
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-03-02 12:26:12 -05:00
bbccd2c3fb Basic fixes for DeepSpeed (#264) 2022-03-01 15:04:53 -05:00
c7e9e10bad Add a flag to use CPU only in the config (#263) 2022-03-01 14:15:53 -05:00
49658cdc20 enhance compatibility of honor type (#241)
* enhance compatibility of honor type

fixes #237

* format code
2022-02-24 19:23:04 +01:00
503a9ffa7f Add debug_launcher (#259)
* Add debug_launcher

* Pass along local rank for CPU

* Add util test
2022-02-24 15:53:42 +01:00
badfbced27 Update README.md (#249) 2022-02-24 15:30:22 +01:00
9e5e7f32d5 Add launch flags --module and --no_python (#256) (#258)
* Add launch flags --module and --no_python (#256)

* Update src/accelerate/commands/launch.py

Use `elif` instead of consecutive `if`

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Use `elif` instead of consecutive `if`

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Run black formatting

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-02-24 09:22:03 +01:00
d742bce525 add support of gather_object (#238)
* add support of gather_object

fixes #235

* Update src/accelerate/utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/accelerate/utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* skip error_on_other_type

* fix picklable object described as any object

* do not concat gathered objects

* Update utils.py

* format code

* Update src/accelerate/utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/accelerate/utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-02-22 21:28:54 +01:00
4fc586f5af [WIP] Add bfloat16 support #243 (#247)
* Add bfloat16 support (#243)

* Compatibility with previous config files

* Fix non-default argument 'num_processes' follows default argument

* Compatibility with previous config files

* Fix user input config

* Add use_fp16 compatibility

* Show dtype

* Verbosity

* Remove dtype verbosity

* Update src/accelerate/accelerator.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/accelerate/commands/launch.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Add version checks and fix "no"/None issues

* Update src/accelerate/accelerator.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/accelerate/commands/launch.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/accelerate/commands/launch.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/accelerate/commands/launch.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/accelerate/commands/launch.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/accelerate/notebook_launcher.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/accelerate/commands/launch.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Raise error if required PyTorch version is not available

* Style

* Make style

* Update src/accelerate/accelerator.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Delete unused import

* Raise error if requited Pytorch is not available (Fix previous commit)

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-02-10 11:41:29 -05:00
76b35124dd Upgrade black to version ~=22.0 (#250) 2022-02-08 14:05:38 -05:00
a0995e1ccb make deepspeed optimizer match parameters of passed optimizer (#246)
* make deepspeed optimizer match parameters of passed optimizer, instead of all model parameters

* style

Co-authored-by: Jack Hessel <jackh@allenai.org>
2022-02-04 15:12:00 -05:00
ace103ee63 docs: fix transformers -> accelerate (#234) 2022-01-28 07:55:39 -05:00
29a09a8ddc Add customization point for init_process_group kwargs (#228)
* Fix lr scheduler num samples

* Add kwargs to customize init_process_group
2022-01-11 07:31:55 -05:00
c607532f4f Fix lr scheduler num samples (#227) 2022-01-10 14:11:40 -05:00
21f2c5bce4 Rename state to avoid name conflicts with pytorch's Optimizer class. (#224)
Co-authored-by: YU Xinyuan <yuxinyuan02@corp.netease.com>
2022-01-10 07:57:04 -05:00
18e5a56cbb Pass along drop_last in DispatchDataLoader (#212)
* Pass along drop_last in DispatchDataLoader

* Fix last empty batches
2021-12-15 11:15:19 -05:00
31fa5e0ce3 fix rng_types in accelerator (#206) 2021-12-08 13:36:45 -05:00
a5ea2a932c Add high-level API reference to README (#204)
* Add high level API reference to README.md

Update 'Why shouldn't I use' section with reference to pytorch-accelerated

* Fix typo in notebook launcher

Update 'Launching a training' to 'Launching training' in one instance

* Create 'Frameworks Using Accelerate' section

* Improve README.md formatting

* Remove newlines
2021-12-06 11:42:07 -05:00
d820a584d7 fix typo in code snippet (#199)
Both `accelerator` or `Accelerator` would do the trick, but `accelerate` won't since we never import it and even we do, the `backward()` method doesn't exist in `accelerate`.
2021-11-12 07:53:32 -05:00
39a0b30a95 Add signature check for set_to_none in Optimizer.zero_grad (#189) 2021-10-20 10:27:19 -04:00
75421766d3 Update README.md (#187)
This adds a sentence to the README to remind the user that they can still use the default python launch methods.
2021-10-19 11:17:16 -04:00
34a4e4ea15 fix: use store_true on argparse in nlp example (#183) 2021-10-11 08:11:30 -04:00
c5c73e0238 Use collections.abc.Mapping to handle both the dict and the UserDict types (#180)
* Use Mapping to handle dict and UserDict

* Address comments

* Remove ** syntax
2021-10-04 09:58:23 -04:00
5343b4e8e2 Handle UserDict in all utils (#179) 2021-09-30 12:23:59 -04:00
120b82bfce Fix send_to_device with non-tensor data (#177) 2021-09-29 10:54:55 -04:00
d0cc908438 Document v0.5.1 2021-09-27 11:09:44 -04:00
19ec4a782c Release v0.5.1 2021-09-27 11:04:29 -04:00
929e17d3b0 Fix length of DispatchDataLoader (#175) 2021-09-27 10:38:50 -04:00
d6247b7cc1 fix bug #172 (#173) 2021-09-26 12:01:48 -04:00
1b1463fe2c v0.6.0.dev0 2021-09-23 10:42:47 -04:00
56d8760856 Release: v0.5.0 2021-09-23 10:38:47 -04:00
e35baa5274 Raise errors instead of warnings with better tests (#170) 2021-09-21 14:33:13 -04:00
270e978159 Dynamic default for dispatch_batches (#168)
* Dynamic default for `dispatch_batches`

* Fix style
2021-09-20 16:14:58 -04:00
67141ebef7 Fix wrong copy-paste 2021-09-20 12:17:43 -04:00
7ad23dc269 And the second one 2021-09-20 11:55:34 -04:00
a8de5bd93f Replace error by warning 2021-09-20 11:54:07 -04:00
abbf844423 Fix missing numpy dep 2021-09-20 10:06:17 -04:00
e549cea65c Fix doc 2021-09-20 10:03:13 -04:00
37e4f036f0 Central dataloader (#164)
* PoC on main dataloader

* Support `split_batches`

* Add TPU support

* Fix typo

* More fixes

* Final fix

* Remove last print

* Add comments in the code

* Add test

* Style and sanity check

* Update src/accelerate/accelerator.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Address review comments

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-09-16 15:29:28 -04:00
3379d64dab allow untested optimizers deepspeed (#150) 2021-09-01 15:50:41 +02:00
545cddb528 Fix gather for 0d tensor (#152) 2021-08-31 06:34:15 -04:00
50ac7483de up (#151) 2021-08-30 10:20:03 -04:00
be3cc4d144 fix fp16 covert back to fp32 (#149) 2021-08-30 09:58:55 -04:00
eb8b342dd4 v0.5.0.dev0 2021-08-10 04:40:24 -04:00
5d99345b78 Release: v0.4.0 2021-08-10 04:35:40 -04:00
49d1f04b4f DeepSpeed documentation (#140)
* DeepSpeed documentation

* Update docs/source/quicktour.rst

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Little mention

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-08-10 10:15:04 +02:00
c4c6ea51a5 Document support for DeepSpeed 2021-08-09 08:37:46 -04:00
1d9366b439 Add optimizer not stepped property (#139)
* Add optimizer not stepped property

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-08-09 13:54:53 +02:00
54433053b2 Add caveat on weight-tying on TPUs (#138) 2021-08-09 13:53:27 +02:00
c8c9314f59 Fix fp16 by converting outputs back to FP32 (#134)
* Fix FP16 mode by converting model outputs to FP32

* Add context manager and doc

* Update src/accelerate/accelerator.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-08-05 18:41:55 +02:00
b08fd560a4 Fix typo 2021-07-28 08:45:03 -04:00
05c1eacfa0 Fix #126 (#127) 2021-07-20 10:22:48 +02:00
cc69904b52 Fix DataLoader length when split_batches=True (#121) 2021-07-08 07:21:25 -04:00
29da658a20 Unwrap optimizer before unscaling (#115) 2021-06-28 10:21:29 -04:00
6608466d0a Use independent copy of rng_types 2021-06-28 09:48:55 -04:00
42cb31c107 Unpin 2021-06-16 13:06:39 -04:00
43f0694151 Use real pyyaml as a dep 2021-06-16 13:03:08 -04:00
f7d5676322 Move import that requires a specific torch version (#108) 2021-06-16 12:57:13 -04:00
45c185d847 added closure argument to optimizer.step() (#105)
* added closure argument to optimizer.step()

* XLA argument as kwarg

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* check for closure in XLA args

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-06-15 14:08:00 -04:00
02ad92a14e Use right address :facepalms: 2021-06-15 10:01:20 -04:00
7e0249c24e Add course banner (#107) 2021-06-15 09:57:34 -04:00
8fd891668b Pass along kwargs to backward (#104)
* Pass along kwargs to backward

* Fix import error

* Revert "Fix import error"

This reverts commit b65f47ddd110049b08a04c113aecda4bccea252d.
2021-06-15 07:44:18 -04:00
c95dff8748 Fix import error 2021-06-14 16:29:58 -04:00
f0cdbf152e [Feature] Add context manager to allow main process first. (#98)
* Added with master first method

* Remove examples from docstring

* Renamed is_master to is_main
2021-06-03 15:22:02 -04:00
f1333b54ad Add DeepSpeed support (#82)
* add script with acclerator

* squash

* save progess

* fix some; deepspeed giving error now

* fixed everything

* rebased

* stage2 fix

* fix optimizer cpu offload

* small fix

* fix suggestions

* update readme

* fix suggestions

* extract fp16 state_dict

* remove deepspeed dependency

* add fp16-32 conversion; readme update

* remove run script

* make style

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* make quality

Co-authored-by: Ubuntu <ubuntu@ip-172-31-71-52.ec2.internal>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-05-27 16:50:33 -04:00
ee04aece8d Add Accelerator.free_memory (#89) 2021-05-17 09:55:32 -04:00
13c8c6dff0 Add unscale_gradients method. (#88) 2021-05-17 09:51:17 -04:00
d525265d68 Update README.md (#87)
fix typo in pip install
2021-05-17 07:49:55 -04:00
a96fbaaf19 Use optimizer for consistency (#81) 2021-05-13 08:19:23 -04:00
8fd72e6655 Fix accelerate test with no config file (#79) 2021-05-11 14:02:41 -04:00
4ad11b12d9 Pass args in notebook_launcher for multi-GPU (#78) 2021-05-11 11:42:38 -04:00
5e6447c257 TPU not available in kaggle (#73)
in_colab_or_kaggle will end false if run in kaggle. Changed the sequence of if in line 45-48.
2021-05-10 10:29:07 -04:00
f523ab611b Fix examples README (#70) 2021-05-09 14:32:26 -04:00
70b0aba7fd Honor namedtuples in inputs/outputs (#67) 2021-05-07 08:56:23 -04:00
df260fa71a Add distributed multi-node cpu only support (MULTI_CPU) (#63) 2021-05-04 08:12:54 -04:00
78b775391c Fix batch_sampler error for IterableDataset (#62)
* Fix batch_sampler error for IterableDataset

* Update src/accelerate/data_loader.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/accelerate/data_loader.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-05-03 09:57:38 -04:00
de65feacbd Dev 0.4.0 2021-04-29 11:49:31 -04:00
e00ce7b2b1 Documentation v0.3.0 2021-04-29 11:48:35 -04:00
dd9f7aa657 Release: v0.3.0 2021-04-29 11:41:05 -04:00
96c21c66b4 Omit master port addition when it's not defined 2021-04-29 10:48:29 -04:00
4f376df41c Support for multi-GPU in notebook_launcher (#56)
* Support for multi-GPU in notebook_launcher

* Quality

* Address review comment

* Update src/accelerate/notebook_launcher.py

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
2021-04-28 11:45:43 -04:00
727e8eeaf2 update launch.py (#58)
update launch.py for supporting the case in which one wanna run multi accelerate instances in single node multigpu. avoiding the port conflict.
2021-04-28 11:19:00 -04:00
ae578b2c05 fix #53 (#54) 2021-04-27 07:58:00 -04:00
9b0dad4c64 Notebook Example (#52) 2021-04-26 17:47:05 -04:00
72c58d69c2 Notebook launcher (#44)
* Notebook launcher

* Add to init

* Fix arg name

* Default start method

* No mutli-GPU yet

* Documentation

* Update docs/source/launcher.rst

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>

* Add support for Kaggle kernels

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
2021-04-26 11:03:09 -04:00
67a32dc471 Fix port in config creation (#50) 2021-04-26 09:29:26 -04:00
0d0985fd10 Fix launch command for multinode training
Co-authored-by: fengdalu@gmail.com
2021-04-26 08:30:54 -04:00
8394fe4f16 Pin black to 21.4b0 2021-04-26 08:13:22 -04:00
e0a420f7cb Add set_to_none to AcceleratedOptimizer.zero_grad (#43) 2021-04-23 14:22:59 -04:00
4d8a2f6452 Documentation links 2021-04-23 09:31:42 -04:00
548109297b Cleanup 2021-04-22 16:42:21 -04:00
e2243b55f2 Set all defaults from config in launcher (#38)
* Set all defaults from config in launcher

* Style
2021-04-22 11:25:56 -04:00
9f59f393bc Organize diffs better 2021-04-20 17:28:18 -04:00
d6845ee3bc fix cluster.py indent error (#35) 2021-04-20 14:57:55 -04:00
991c5e3481 docs: minor spelling tweaks (#33) 2021-04-19 19:43:21 -04:00
b19088684e Bad copy paste 2021-04-19 11:06:10 -04:00
57e340b41b Merge remote-tracking branch 'origin/main' into main 2021-04-19 11:03:33 -04:00
43ebab877f Sanity checks in pad_across_processes 2021-04-19 11:03:26 -04:00
693b30a2ff Fix load from config (#31) 2021-04-19 08:38:09 -04:00
10b911c971 Fix typos in examples README (#28) 2021-04-16 20:29:15 -04:00
9d3edb1d3b Document release and v0.3.0.dev0 2021-04-15 11:51:19 -04:00
499a5e506a Release: v0.2.0 2021-04-15 11:45:06 -04:00
e93cb7a3bd Launch script on sagemaker (#26)
* fixed loading SageMaker Environment

* added utils and dynamic args parser for sagemaker

* added args converted for sagemaker with type inference

* added launch test

* added sagemaker launcher

* added test

* better print statements

* accelerate as requirements.txt for sagemaker

* make style

* adjusted nlp example and remove action_store since sagemaker cannot handle this

* added documentation side

* added pyaml as dependency

* added doc changes

* reworked doc to .rst to highlight warning and notes better

* quality

* added error raise for store actions and added test

* quality

* Update docs/source/sagemaker.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update docs/source/sagemaker.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* moved fp16 from parameter to environment

* Update docs/source/sagemaker.rst

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update docs/source/sagemaker.rst

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update docs/source/sagemaker.rst

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-04-14 16:51:27 -04:00
de3a54137a Typos in the README 2021-04-08 09:09:20 -04:00
379b3d7c09 Set default device in distributed training 2021-04-08 09:06:32 -04:00
b3d181fa6b added thumbnail (#25)
* added thumbnail

* added extension
2021-04-07 20:27:19 +02:00
93ee98d2b4 Pin docutils 2021-04-06 08:44:05 -04:00
adcb68b17d Fix accelerate launch when no config is present 2021-04-01 12:36:13 -04:00
13656cda38 Cleaner diffs in README and index (#22) 2021-03-30 15:58:45 -04:00
5ab0a2d6f7 Add defaults for compute_environmnent (#23) 2021-03-30 15:58:36 -04:00
f7e0c26881 Add Configuration setup for SageMaker (#17)
* decoupled config sub-cli

* added more ci workflows

* added sagemaker config

* fix actions

* fixed matrix actions

* removed python matrix from actions

* changed actions name

* changed step name

* changed CUSTOM_CLUSTER to LOCAL_MACHINE and added feedback

* added feedback

* make style

* make quality

* replaced private variable with method

* is it fixing quality
2021-03-30 08:25:54 +02:00
d46d1e85fd Use proper size (#21) 2021-03-29 11:00:39 -04:00
8e61853039 Alternate diff (#20) 2021-03-29 08:50:22 -04:00
82971af8c5 Add utility to pad tensor across processes to max length (#19) 2021-03-26 14:33:17 -04:00
703c702ecb Fix types in launch parser 2021-03-26 11:12:36 -04:00
f477f935b6 Add YAML config support (#16)
* Add YAML config support

* Update src/accelerate/commands/config.py

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
2021-03-24 12:20:54 -04:00
2575bc829e Add KwargsHandlers (#15)
* Add KwargsHandlers

* Typo

* Add more tests
2021-03-23 10:06:59 -04:00
25bf0bcafb Don't error on non-Tensors objects in move to device (#13) 2021-03-18 12:57:26 -04:00
b206aad14b Add CV example (#10)
* Add CV example

* Update README

* Make get_label picklable
2021-03-15 09:33:45 -04:00
1320618cf7 Readme clean-up (#9)
No need for a device actually
2021-03-09 18:14:49 -05:00
e67aa2e525 More flexible RNG synchronization (#8)
* More flexible RNG synchronization

* Address review comment
2021-03-09 14:44:41 -05:00
38c4138de0 Fix typos and tighten grammar in README (#7) 2021-03-09 10:59:21 -05:00
cd37674729 Update README.md (#6)
fix typo - accelerate should be accelerator #5
2021-03-08 07:43:14 -05:00
03a5f8870d Fix TPU training in example (#4)
* fix TPU error in example (accelerator not created)

* few clean-ups
2021-03-05 17:17:14 -05:00
68013d06b9 Fix example name in README (#3)
* Mention which repo

* Actually fix script name

* Update README.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-03-05 17:01:14 -05:00
16d20d7bc9 Merge pull request #2 from huggingface/thomwolf-patch-1
Quick wording proposal
2021-03-05 16:59:58 -05:00
1495069ad1 Doc release 2021-03-05 16:59:26 -05:00
58d58a1a8d fix link in example 2021-03-05 22:55:02 +01:00
9a4ba4ab90 quick wording proposal 2021-03-05 22:52:08 +01:00
118 changed files with 14443 additions and 2747 deletions

.github/deploy_doc.sh (deleted, 38 lines removed)

@@ -1,38 +0,0 @@
#!/bin/bash
set -ex

function deploy_doc(){
    echo "Creating doc at commit $1 and pushing to folder $2"
    git checkout $1
    cd "$GITHUB_WORKSPACE"
    pip install -U .
    cd "$GITHUB_WORKSPACE/docs"
    if [ ! -z "$2" ]
    then
        if [ "$2" == "main" ]; then
            echo "Pushing main"
            make clean && make html && scp -r -oStrictHostKeyChecking=no _build/html/* $DOC_HOST:$DOC_PATH/$2/
            cp -r _build/html/_static .
        elif ssh -oStrictHostKeyChecking=no $DOC_HOST "[ -d $DOC_PATH/$2 ]"; then
            echo "Directory" $2 "already exists"
            scp -r -oStrictHostKeyChecking=no _static/* $DOC_HOST:$DOC_PATH/$2/_static/
        else
            echo "Pushing version" $2
            make clean && make html
            rm -rf _build/html/_static
            cp -r _static _build/html
            scp -r -oStrictHostKeyChecking=no _build/html $DOC_HOST:$DOC_PATH/$2
        fi
    else
        echo "Pushing stable"
        make clean && make html
        rm -rf _build/html/_static
        cp -r _static _build/html
        scp -r -oStrictHostKeyChecking=no _build/html/* $DOC_HOST:$DOC_PATH
    fi
}

# You can find the commit for each tag on https://github.com/huggingface/accelerate/tags
deploy_doc "main" main
deploy_doc "main" # No stable-release yet

(new workflow file)

@@ -0,0 +1,53 @@
name: Build Docker images (scheduled)

on:
  repository_dispatch:
  schedule:
    - cron: "0 1 * * *"

concurrency:
  group: docker-image-builds
  cancel-in-progress: false

jobs:
  latest-cpu:
    name: "Latest Accelerate CPU [dev]"
    runs-on: ubuntu-latest
    steps:
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1
      - name: Check out code
        uses: actions/checkout@v2
      - name: Login to DockerHub
        uses: docker/login-action@v1
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_PASSWORD }}
      - name: Build and Push CPU
        uses: docker/build-push-action@v2
        with:
          context: ./docker/accelerate-cpu
          push: true
          tags: huggingface/accelerate-cpu

  latest-cuda:
    name: "Latest Accelerate GPU [dev]"
    runs-on: ubuntu-latest
    steps:
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1
      - name: Check out code
        uses: actions/checkout@v2
      - name: Login to DockerHub
        uses: docker/login-action@v1
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_PASSWORD }}
      - name: Build and Push GPU
        uses: docker/build-push-action@v2
        with:
          context: ./docker/accelerate-gpu
          push: true
          tags: huggingface/accelerate-gpu

(new workflow file)

@@ -0,0 +1,17 @@
name: Build documentation

on:
  push:
    branches:
      - main
      - doc-builder*
      - v*-release

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
    with:
      commit_sha: ${{ github.sha }}
      package: accelerate
    secrets:
      token: ${{ secrets.HUGGINGFACE_PUSH }}

(new workflow file)

@@ -0,0 +1,16 @@
name: Build PR Documentation

on:
  pull_request:

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
    with:
      commit_sha: ${{ github.event.pull_request.head.sha }}
      pr_number: ${{ github.event.number }}
      package: accelerate

(new workflow file)

@@ -0,0 +1,13 @@
name: Delete dev documentation

on:
  pull_request:
    types: [ closed ]

jobs:
  delete:
    uses: huggingface/doc-builder/.github/workflows/delete_doc_comment.yml@main
    with:
      pr_number: ${{ github.event.number }}
      package: accelerate

(deleted workflow file)

@@ -1,37 +0,0 @@
name: Deploy Documentation

on:
  push:
    branches:
      - main

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v1
        with:
          fetch-depth: 0
      - name: Install SSH Key
        uses: shimataro/ssh-key-action@v2
        with:
          key: ${{ secrets.DOC_SSH_KEY }}
          name: id_rsa
          known_hosts: ${{ secrets.DOC_KNOWN_HOST }}
      - name: Install Python
        uses: actions/setup-python@v1
        with:
          python-version: 3.6
      - name: Install Python dependencies
        working-directory: ./
        run: pip install -e .[docs]
      - name: Deploy documentation
        env:
          DOC_HOST: ${{ secrets.DOC_HOST }}
          DOC_PATH: ${{ secrets.DOC_PATH }}
        run: ./.github/deploy_doc.sh

.github/workflows/quality.yml (new file, 17 lines)

@@ -0,0 +1,17 @@
name: Quality Check

on: [pull_request]

jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python 3.6
        uses: actions/setup-python@v2
        with:
          python-version: 3.6
      - name: Install Python dependencies
        run: pip install -e .[quality]
      - name: Run Quality check
        run: make quality

.github/workflows/stale.yml (new file, 28 lines)

@@ -0,0 +1,28 @@
name: Stale Bot

on:
  schedule:
    - cron: "0 15 * * *"
  workflow_dispatch:

jobs:
  close_stale_issues:
    name: Close Stale Issues
    if: github.repository == 'huggingface/accelerate'
    runs-on: ubuntu-latest
    env:
      GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    steps:
      - uses: actions/checkout@v2
      - name: Setup Python
        uses: actions/setup-python@v1
        with:
          python-version: 3.7
      - name: Install requirements
        run: |
          pip install PyGithub
      - name: Close stale issues
        run: |
          python utils/stale.py

.github/workflows/test.yml (new file, 30 lines)

@@ -0,0 +1,30 @@
name: Run Tests

on: [pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python 3.6
        uses: actions/setup-python@v2
        with:
          python-version: 3.6
      - name: Install Python dependencies
        run: pip install setuptools==59.5.0; pip install -e .[test,test_trackers]
      - name: Run Tests
        run: make test_cpu

  test_examples:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python 3.6
        uses: actions/setup-python@v2
        with:
          python-version: 3.6
      - name: Install Python dependencies
        run: pip install setuptools==59.5.0; pip install -e .[test] tensorboard
      - name: Run Tests
        run: make test_examples

.gitignore (6 lines added)

@@ -130,3 +130,9 @@ dmypy.json
# VSCode
.vscode
# IntelliJ
.idea
# Mac .DS_Store
.DS_Store

CONTRIBUTING.md (new file, 235 lines)

@@ -0,0 +1,235 @@
<!---
Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# How to contribute to 🤗 Accelerate?
Everyone is welcome to contribute, and we value everybody's contribution. Code
is thus not the only way to help the community. Answering questions, helping
others, reaching out and improving the documentation are immensely valuable to
the community.
It also helps us if you spread the word: reference the library from blog posts
on the awesome projects it made possible, shout out on Twitter every time it has
helped you, or simply star the repo to say "thank you".
Whichever way you choose to contribute, please be mindful to respect our
[code of conduct](https://github.com/huggingface/accelerate/blob/main/CODE_OF_CONDUCT.md).
## You can contribute in so many ways!
Some of the ways you can contribute to Accelerate:
* Fixing outstanding issues with the existing code;
* Contributing to the examples or to the documentation;
* Submitting issues related to bugs or desired new features.
## Submitting a new issue or feature request
Do your best to follow these guidelines when submitting an issue or a feature
request. It will make it easier for us to come back to you quickly and with good
feedback.
### Did you find a bug?
The 🤗 Accelerate library is robust and reliable thanks to the users who notify us of
the problems they encounter. So thank you for reporting an issue.
First, we would really appreciate it if you could **make sure the bug was not
already reported** (use the search bar on Github under Issues).
Did not find it? :( So we can act quickly on it, please follow these steps:
* Include your **OS type and version**, the versions of **Python** and **PyTorch**.
* A short, self-contained, code snippet that allows us to reproduce the bug in
less than 30s;
* Provide us with your Accelerate configuration (located by default in `~/.cache/huggingface/accelerate/default_config.yaml`)
### Do you want a new feature?
A good feature request addresses the following points:
1. Motivation first:
* Is it related to a problem/frustration with the library? If so, please explain
why. Providing a code snippet that demonstrates the problem is best.
* Is it related to something you would need for a project? We'd love to hear
about it!
* Is it something you worked on and think could benefit the community?
Awesome! Tell us what problem it solved for you.
2. Write a *full paragraph* describing the feature;
3. Provide a **code snippet** that demonstrates its future use;
4. In case this is related to a paper, please attach a link;
5. Attach any additional information (drawings, screenshots, etc.) you think may help.
If your issue is well written we're already 80% of the way there by the time you
post it.
## Submitting a pull request (PR)
Before writing code, we strongly advise you to search through the existing PRs or
issues to make sure that nobody is already working on the same thing. If you are
unsure, it is always a good idea to open an issue to get some feedback.
You will need basic `git` proficiency to be able to contribute to
🤗 Accelerate. `git` is not the easiest tool to use but it has the greatest
manual. Type `git --help` in a shell and enjoy. If you prefer books, [Pro
Git](https://git-scm.com/book/en/v2) is a very good reference.
Follow these steps to start contributing:
1. Fork the [repository](https://github.com/huggingface/accelerate) by
clicking on the 'Fork' button on the repository's page. This creates a copy of the code
under your GitHub user account.
2. Clone your fork to your local disk, and add the base repository as a remote. The following command
assumes you have your public SSH key uploaded to GitHub. See the following guide for more
[information](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository).
```bash
$ git clone git@github.com:<your Github handle>/accelerate.git
$ cd accelerate
$ git remote add upstream https://github.com/huggingface/accelerate.git
```
3. Create a new branch to hold your development changes, and do this for every new PR you work on.
Start by synchronizing your `main` branch with the `upstream/main` branch (more details in the [GitHub Docs](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/syncing-a-fork)):
```bash
$ git checkout main
$ git fetch upstream
$ git merge upstream/main
```
Once your `main` branch is synchronized, create a new branch from it:
```bash
$ git checkout -b a-descriptive-name-for-my-changes
```
**Do not** work on the `main` branch.
4. Set up a development environment by running the following command in a conda or a virtual environment you've created for working on this library:
```bash
$ pip install -e ".[quality]"
```
(If accelerate was already installed in the virtual environment, remove
it with `pip uninstall accelerate` before reinstalling it in editable
mode with the `-e` flag.)
5. Develop the features on your branch.
As you work on the features, you should make sure that the test suite
passes. You should run the tests impacted by your changes like this (see
below an explanation regarding the environment variable):
```bash
$ pytest tests/<TEST_TO_RUN>.py
```
> For the following commands leveraging the `make` utility, we recommend using the WSL system when running on
> Windows. More information [here](https://docs.microsoft.com/en-us/windows/wsl/about).
You can also run the full suite with the following command.
```bash
$ make test
```
`accelerate` relies on `black` and `isort` to format its source code
consistently. After you make changes, apply automatic style corrections and code verifications
that can't be automated in one go with:
This target is also optimized to only work with files modified by the PR you're working on.
If you prefer to run the checks one after the other, the following command applies the
style corrections:
```bash
$ make style
```
`accelerate` also uses `flake8` and a few custom scripts to check for coding mistakes. Quality
control runs in CI, however you can also run the same checks with:
```bash
$ make quality
```
Once you're happy with your changes, add changed files using `git add` and
make a commit with `git commit` to record your changes locally:
```bash
$ git add modified_file.py
$ git commit
```
Please write [good commit messages](https://chris.beams.io/posts/git-commit/).
It is a good idea to sync your copy of the code with the original
repository regularly. This way you can quickly account for changes:
```bash
$ git fetch upstream
$ git rebase upstream/main
```
Push the changes to your account using:
```bash
$ git push -u origin a-descriptive-name-for-my-changes
```
6. Once you are satisfied (**and the checklist below is happy too**), go to the
webpage of your fork on GitHub. Click on 'Pull request' to send your changes
to the project maintainers for review.
7. It's ok if maintainers ask you for changes. It happens to core contributors
too! So everyone can see the changes in the Pull request, work in your local
branch and push the changes to your fork. They will automatically appear in
the pull request.
### Checklist
1. The title of your pull request should be a summary of its contribution;
2. If your pull request addresses an issue, please mention the issue number in
the pull request description to make sure they are linked (and people
consulting the issue know you are working on it);
3. To indicate a work in progress please prefix the title with `[WIP]`, or mark
the PR as a draft PR. These are useful to avoid duplicated work, and to differentiate
it from PRs ready to be merged;
4. Make sure existing tests pass;
5. Add high-coverage tests. No quality testing = no merge.
See an example of a good PR here: https://github.com/huggingface/accelerate/pull/255
### Tests
An extensive test suite is included to test the library behavior and several examples. Library tests can be found in
the [tests folder](https://github.com/huggingface/accelerate/tree/main/tests).
We use `pytest` in order to run the tests. From the root of the
repository, here's how to run tests with `pytest` for the library:
```bash
$ python -m pytest -sv ./tests
```
In fact, that's how `make test` is implemented (sans the `pip install` line)!
You can specify a smaller set of tests in order to test only the feature
you're working on.
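For instance, you can run a single test module, or filter tests by name with pytest's `-k` flag (the module name below is only an illustrative assumption; substitute the one covering your change):
```bash
$ python -m pytest -sv ./tests/test_utils.py
$ python -m pytest -sv ./tests -k "checkpoint"
```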

Makefile

@@ -24,9 +24,12 @@ style:
	python utils/style_doc.py src/accelerate docs/source --max_len 119
# Run tests for the library
test:
	python -m pytest -n auto --dist=loadfile -s -v ./tests/
test_cpu:
	python -m pytest -s -v ./tests/ --ignore=./tests/test_examples.py
# Check that docs can build
docs:
	cd docs && make html SPHINXOPTS="-W"
test_cuda:
	python -m pytest -s -v ./tests/ --ignore=./tests/test_examples.py --ignore=./tests/test_scheduler.py --ignore=./tests/test_cpu.py
	python -m pytest -s -v ./tests/test_cpu.py ./tests/test_scheduler.py
test_examples:
	python -m pytest -s -v ./tests/test_examples.py

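With the old single `test` target split up, local runs pick the target matching the available hardware; a brief usage sketch (target names taken from the Makefile diff above, the comments are only illustration):

```bash
# Quality checks plus CPU-only library tests (example tests are skipped)
make quality
make test_cpu

# On a GPU machine: main library tests, then the CPU/scheduler tests in a separate run
make test_cuda

# Example-script tests only
make test_examples
```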
README.md (257 lines changed)

@@ -26,9 +26,16 @@ limitations under the License.
<img alt="Build" src="https://img.shields.io/circleci/build/github/huggingface/transformers/master">
</a>
-->
<a href="https://github.com/huggingface/accelerate/blob/master/LICENSE">
<a href="https://github.com/huggingface/accelerate/blob/main/LICENSE">
<img alt="License" src="https://img.shields.io/github/license/huggingface/accelerate.svg?color=blue">
</a>
<a href="https://huggingface.co/transformers/index.html">
<img alt="Documentation" src="https://img.shields.io/website/http/huggingface.co/transformers/index.html.svg?down_color=red&down_message=offline&up_message=online">
<a href="https://huggingface.co/docs/accelerate/index.html">
<img alt="Documentation" src="https://img.shields.io/website/http/huggingface.co/docs/accelerate/index.html.svg?down_color=red&down_message=offline&up_message=online">
</a>
<a href="https://github.com/huggingface/accelerate/releases">
<img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/accelerate.svg">
</a>
<a href="https://github.com/huggingface/accelerate/blob/master/CODE_OF_CONDUCT.md">
<a href="https://github.com/huggingface/accelerate/blob/main/CODE_OF_CONDUCT.md">
<img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-v2.0%20adopted-ff69b4.svg">
</a>
</p>
@@ -44,88 +44,35 @@ limitations under the License.
<p>Run your *raw* PyTorch training script on any kind of device
</h3>
<h3 align="center">
<a href="https://hf.co/course"><img src="https://raw.githubusercontent.com/huggingface/accelerate/main/docs/source/imgs/course_banner.png"></a>
</h3>
## Easy to integrate
🤗 Accelerate was created for PyTorch users who like to write the training loop of PyTorch models but are reluctant to write and maintain the boiler code needed to use multi-GPUs/TPU/fp16.
🤗 Accelerate was created for PyTorch users who like to write the training loop of PyTorch models but are reluctant to write and maintain the boilerplate code needed to use multi-GPUs/TPU/fp16.
🤗 Accelerate abstracts exactly and only the boiler code related to multi-GPUs/TPU/fp16 and let the rest of your code unchanged.
🤗 Accelerate abstracts exactly and only the boilerplate code related to multi-GPUs/TPU/fp16 and leaves the rest of your code unchanged.
Here is an example:
<table>
<tr>
<th> Original training code <br> (CPU or mono-GPU only)</th>
<th> With Accelerate <br> (CPU/GPU/multi-GPUs/TPUs/fp16) </th>
</tr>
<tr>
<td>
```python
import torch
import torch.nn.functional as F
from datasets import load_dataset
device = 'cpu'
model = torch.nn.Transformer().to(device)
optim = torch.optim.Adam(
model.parameters()
)
dataset = load_dataset('my_dataset')
data = torch.utils.data.Dataloader(
dataset
)
model.train()
for epoch in range(10):
for source, targets in data:
source = source.to(device)
targets = targets.to(device)
optimizer.zero_grad()
output = model(source, targets)
loss = F.cross_entropy(
output, targets
)
loss.backward()
optimizer.step()
```
</td>
<td>
```python
```diff
import torch
import torch.nn.functional as F
from datasets import load_dataset
+ from accelerate import Accelerator
+ accelerator = Accelerator()
- device = 'cpu'
+ device = accelerator.device
model = torch.nn.Transformer().to(device)
optim = torch.optim.Adam(
model.parameters()
)
optimizer = torch.optim.Adam(model.parameters())
dataset = load_dataset('my_dataset')
data = torch.utils.data.Dataloader(
dataset
)
data = torch.utils.data.DataLoader(dataset, shuffle=True)
+ model, optim, data = accelerator.prepare(
+ model, optim, data
+ )
+ model, optimizer, data = accelerator.prepare(model, optimizer, data)
model.train()
for epoch in range(10):
@@ -135,126 +82,61 @@ for epoch in range(10):
optimizer.zero_grad()
output = model(source, targets)
loss = F.cross_entropy(
output, targets
)
output = model(source)
loss = F.cross_entropy(output, targets)
+ accelerate.backward(loss)
- loss.backward()
+ accelerator.backward(loss)
optimizer.step()
```
</td>
</tr>
</table>
As you can see in this example, by adding 5-lines to any standard PyTorch training script you can now run on any kind of single or distributed node setting (single CPU, single GPU, multi-GPUs and TPUs) as well as with or without mixed precision (fp16).
As you can see on this example, by adding 5-lines to any standard PyTorch training script you can now run on any kind of single or distributed node setting (single CPU, single GPU, multi-GPUs and TPUs) as well as with or without mixed precision (fp16).
In particular, the same code can then be run without modification on your local machine for debugging or your training environment.
The same code can then in particular run without modification on your local machine for debugging or your training environment.
🤗 Accelerate even handles the device placement for you (which requires a few more changes to your code, but is safer in general), so you can even simplify your training loop further:
🤗 Accelerate even handles the device placement for you (a bit more changes to your code but safer in general), so you can even simplify your training loop further:
<table>
<tr>
<th> Original training code <br> (CPU or mono-GPU only)</th>
<th> With Accelerate <br> (CPU/GPU/multi-GPUs/TPUs/fp16) </th>
</tr>
<tr>
<td>
```python
import torch
import torch.nn.functional as F
from datasets import load_dataset

device = 'cpu'

model = torch.nn.Transformer().to(device)
optimizer = torch.optim.Adam(model.parameters())

dataset = load_dataset('my_dataset')
data = torch.utils.data.DataLoader(dataset, shuffle=True)

model.train()
for epoch in range(10):
    for source, targets in data:
        source = source.to(device)
        targets = targets.to(device)

        optimizer.zero_grad()

        output = model(source, targets)
        loss = F.cross_entropy(output, targets)

        loss.backward()

        optimizer.step()
```
</td>
<td>
```diff
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset

+ from accelerate import Accelerator
- device = 'cpu'
+ accelerator = Accelerator()

- model = torch.nn.Transformer().to(device)
+ model = torch.nn.Transformer()
  optimizer = torch.optim.Adam(model.parameters())

  dataset = load_dataset('my_dataset')
  data = torch.utils.data.DataLoader(dataset, shuffle=True)

+ model, optimizer, data = accelerator.prepare(model, optimizer, data)

  model.train()
  for epoch in range(10):
      for source, targets in data:
-         source = source.to(device)
-         targets = targets.to(device)

          optimizer.zero_grad()

          output = model(source, targets)
          loss = F.cross_entropy(output, targets)

-         loss.backward()
+         accelerator.backward(loss)

          optimizer.step()
```
</td>
</tr>
</table>
Want to learn more? Check out the [documentation](https://huggingface.co/docs/accelerate) or have a look at our [examples](https://github.com/huggingface/accelerate/tree/main/examples).
## Launching script
🤗 Accelerate also provides an optional CLI tool that allows you to quickly configure and test your training environment before launching the scripts. No need to remember how to use `torch.distributed.launch` or to write a specific launcher for TPU training!
On your machine(s) just run:
```bash
accelerate config
```
and answer the questions asked. This will generate a config file that will be used automatically to properly set the default options when doing
```bash
accelerate launch my_script.py --args_to_my_script
```
For instance, here is how you would run the NLP example (from the root of the repo):
```bash
accelerate launch examples/nlp_example.py
```
This CLI tool is **optional**, and you can still use `python my_script.py` or `python -m torch.distributed.launch my_script.py` at your convenience.
## Launching multi-CPU run using MPI
🤗 Here is another way to launch a multi-CPU run using MPI. You can learn how to install Open MPI on [this page](https://www.open-mpi.org/faq/?category=building#easy-build). You can use Intel MPI or MVAPICH as well.
Once you have MPI set up on your cluster, just run:
```bash
mpirun -np 2 python examples/nlp_example.py
```
## Launching training using DeepSpeed
🤗 Accelerate supports training on single/multiple GPUs using DeepSpeed. To use it, you don't need to change anything in your training code; you can set everything up with just `accelerate config`. However, if you want to tweak your DeepSpeed-related arguments from your Python script, we provide the `DeepSpeedPlugin`.
```python
from accelerate import Accelerator, DeepSpeedPlugin

# DeepSpeed needs to know your gradient accumulation steps beforehand, so don't forget to pass it
# Remember you still need to do gradient accumulation yourself, just like you would have done without DeepSpeed
deepspeed_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=2)
accelerator = Accelerator(fp16=True, deepspeed_plugin=deepspeed_plugin)
# How to save your 🤗 Transformer?
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(save_dir, save_function=accelerator.save, state_dict=accelerator.get_state_dict(model))
```
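Since gradient accumulation stays in your hands here, below is a minimal sketch of what that could look like inside the training loop. The names `training_dataloader`, `loss_function`, `model` and `optimizer` are illustrative placeholders, and the accumulation value must match the `gradient_accumulation_steps` passed to `DeepSpeedPlugin` above:
```python
gradient_accumulation_steps = 2  # keep in sync with the DeepSpeedPlugin above

for step, batch in enumerate(training_dataloader):
    inputs, targets = batch
    outputs = model(inputs)
    # Average the loss over the accumulation window, as in plain PyTorch gradient accumulation
    loss = loss_function(outputs, targets) / gradient_accumulation_steps
    accelerator.backward(loss)
    # Only update the weights every `gradient_accumulation_steps` batches
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```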
Note: DeepSpeed support is experimental for now. If you run into a problem, please open an issue.
## Launching your training from a notebook
🤗 Accelerate also provides a `notebook_launcher` function you can use in a notebook to launch distributed training. This is especially useful for Colab or Kaggle notebooks with a TPU backend. Just define your training loop in a `training_function`, then in your last cell, add:
```python
from accelerate import notebook_launcher
notebook_launcher(training_function)
```
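If your training function takes arguments, or you want to be explicit about the number of processes, `notebook_launcher` accepts those as well. A small sketch (the argument values below are just an illustration):
```python
from accelerate import notebook_launcher

# Launch `training_function(model, optimizer)` on 8 processes (e.g. the 8 cores of a TPU)
notebook_launcher(training_function, args=(model, optimizer), num_processes=8)
```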
An example can be found in [this notebook](https://github.com/huggingface/notebooks/blob/master/examples/accelerate/simple_nlp_example.ipynb). [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/accelerate/simple_nlp_example.ipynb)
## Why should I use 🤗 Accelerate?
You should use 🤗 Accelerate when you want to easily run your training scripts in a distributed environment without having to renounce full control over your training loop. This is not a high-level framework above PyTorch, just a thin wrapper, so you don't have to learn a new library. In fact, the whole API of 🤗 Accelerate is in one class, the `Accelerator` object.
## Why shouldn't I use 🤗 Accelerate?
You shouldn't use 🤗 Accelerate if you don't want to write a training loop yourself. There are plenty of high-level libraries above PyTorch that will offer you that, 🤗 Accelerate is not one of them.
## Frameworks using 🤗 Accelerate
If you like the simplicity of 🤗 Accelerate but would prefer a higher-level abstraction around your training loop, some frameworks that are built on top of 🤗 Accelerate are listed below:
* [Animus](https://github.com/Scitator/animus) is a minimalistic framework to run machine learning experiments. Animus highlights common "breakpoints" in ML experiments and provides a unified interface for them within [IExperiment](https://github.com/Scitator/animus/blob/main/animus/core.py#L76).
* [Catalyst](https://github.com/catalyst-team/catalyst#getting-started) is a PyTorch framework for Deep Learning Research and Development. It focuses on reproducibility, rapid experimentation, and codebase reuse so you can create something new rather than write yet another train loop. Catalyst provides a [Runner](https://catalyst-team.github.io/catalyst/api/core.html#runner) to connect all parts of the experiment: hardware backend, data transformations, model train, and inference logic.
* [fastai](https://github.com/fastai/fastai#installing) is a PyTorch framework for Deep Learning that simplifies training fast and accurate neural nets using modern best practices. fastai provides a [Learner](https://docs.fast.ai/learner.html#Learner) to handle the training, fine-tuning, and inference of deep learning algorithms.
* [Kornia](https://kornia.readthedocs.io/en/latest/get-started/introduction.html) is a differentiable library that allows classical computer vision to be integrated into deep learning models. Kornia provides a [Trainer](https://kornia.readthedocs.io/en/latest/x.html#kornia.x.Trainer) with the specific purpose to train and fine-tune the supported deep learning algorithms within the library.
* [pytorch-accelerated](https://github.com/Chris-hughes10/pytorch-accelerated) is a lightweight training library, with a streamlined feature set centred around a general-purpose [Trainer](https://pytorch-accelerated.readthedocs.io/en/latest/trainer.html), that places a huge emphasis on simplicity and transparency; enabling users to understand exactly what is going on under the hood, but without having to write and maintain the boilerplate themselves!
## Installation
This repository is tested on Python 3.6+ and PyTorch 1.4.0+
```bash
pip install accelerate
```
## Supported integrations
- CPU only
- multi-CPU on one node (machine)
- multi-CPU on several nodes (machines)
- single GPU
- multi-GPU on one node (machine)
- multi-GPU on several nodes (machines)
- TPU
- FP16 with native AMP (apex on the roadmap)
- DeepSpeed support (experimental)

View File

@ -0,0 +1,38 @@
# Builds CPU-only Docker image of PyTorch
# Uses multi-staged approach to reduce size
# Stage 1
FROM python:3.6-slim as compile-image
ARG DEBIAN_FRONTEND=noninteractive
RUN apt update
RUN apt-get install -y --no-install-recommends \
build-essential \
git \
gcc
# Setup virtual environment for Docker
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv ${VIRTUAL_ENV}
# Make sure we use the virtualenv
ENV PATH="${VIRTUAL_ENV}/bin:$PATH"
WORKDIR /workspace
# Install specific CPU torch wheel to save on space
RUN python3 -m pip install --upgrade --no-cache-dir pip
RUN python3 -m pip install --no-cache-dir \
jupyter \
torch --extra-index-url https://download.pytorch.org/whl/cpu \
git+https://github.com/huggingface/accelerate#egg=accelerate[dev]
# Stage 2
FROM python:3.6-slim AS build-image
COPY --from=compile-image /opt/venv /opt/venv
# Install apt libs
RUN apt-get update && \
apt-get install -y curl git wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists*
# Make sure we use the virtualenv
ENV PATH="/opt/venv/bin:$PATH"
CMD ["/bin/bash"]

View File

@ -0,0 +1,42 @@
# Builds GPU docker image of PyTorch
# Uses multi-staged approach to reduce size
# Stage 1
# Use base conda image to reduce time
FROM continuumio/miniconda3:latest AS compile-image
# Specify py version
ENV PYTHON_VERSION=3.6
# Install apt libs
RUN apt-get update && \
apt-get install -y curl git wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists*
# Create our conda env
RUN conda create --name accelerate python=${PYTHON_VERSION} ipython jupyter pip
# We don't install pytorch here yet since CUDA isn't available
# instead we use the direct torch wheel
ENV PATH /opt/conda/envs/accelerate/bin:$PATH
# Activate our bash shell
RUN chsh -s /bin/bash
SHELL ["/bin/bash", "-c"]
# Activate the conda env and install torch + accelerate
RUN source activate accelerate && \
python3 -m pip install --no-cache-dir \
torch --extra-index-url https://download.pytorch.org/whl/cu113 \
git+https://github.com/huggingface/accelerate#egg=accelerate[dev]
# Stage 2
FROM nvidia/cuda:11.2.2-cudnn8-devel-ubuntu20.04 AS build-image
COPY --from=compile-image /opt/conda /opt/conda
ENV PATH /opt/conda/bin:$PATH
# Install apt libs
RUN apt-get update && \
apt-get install -y curl git wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists*
RUN echo "source activate accelerate" >> ~/.profile
# Activate the virtualenv
CMD ["/bin/bash"]

View File

@ -1,16 +0,0 @@
.highlight .c1, .highlight .sd{
color: #999
}
.highlight .nn, .highlight .k, .highlight .s1, .highlight .nb, .highlight .bp, .highlight .kc {
color: #FB8D68;
}
.highlight .kn, .highlight .nv, .highlight .s2, .highlight .ow {
color: #6670FF;
}
.highlight .gp {
color: #FB8D68;
}

View File

@ -1,350 +0,0 @@
/* Our DOM objects */
/* Colab dropdown */
table.center-aligned-table td {
text-align: center;
}
table.center-aligned-table th {
text-align: center;
vertical-align: middle;
}
.colab-dropdown {
position: relative;
display: inline-block;
}
.colab-dropdown-content {
display: none;
position: absolute;
background-color: #f9f9f9;
min-width: 117px;
box-shadow: 0px 8px 16px 0px rgba(0,0,0,0.2);
z-index: 1;
}
.colab-dropdown-content button {
color: #6670FF;
background-color: #f9f9f9;
font-size: 12px;
border: none;
min-width: 117px;
padding: 5px 5px;
text-decoration: none;
display: block;
}
.colab-dropdown-content button:hover {background-color: #eee;}
.colab-dropdown:hover .colab-dropdown-content {display: block;}
/* Version control */
.version-button {
background-color: #6670FF;
color: white;
border: none;
padding: 5px;
font-size: 15px;
cursor: pointer;
}
.version-button:hover, .version-button:focus {
background-color: #A6B0FF;
}
.version-dropdown {
display: none;
background-color: #6670FF;
min-width: 160px;
overflow: auto;
font-size: 15px;
}
.version-dropdown a {
color: white;
padding: 3px 4px;
text-decoration: none;
display: block;
}
.version-dropdown a:hover {
background-color: #A6B0FF;
}
.version-show {
display: block;
}
/* Framework selector */
.framework-selector {
display: flex;
flex-direction: row;
justify-content: flex-end;
margin-right: 30px;
}
.framework-selector > button {
background-color: white;
color: #6670FF;
border: 1px solid #6670FF;
padding: 5px;
}
.framework-selector > button.selected{
background-color: #6670FF;
color: white;
border: 1px solid #6670FF;
padding: 5px;
}
/* Copy button */
a.copybtn {
margin: 3px;
}
/* The literal code blocks */
.rst-content tt.literal, .rst-content tt.literal, .rst-content code.literal {
color: #6670FF;
}
/* To keep the logo centered */
.wy-side-scroll {
width: auto;
font-size: 20px;
}
/* The div that holds the Hugging Face logo */
.HuggingFaceDiv {
width: 100%
}
/* The research field on top of the toc tree */
.wy-side-nav-search{
padding-top: 0;
background-color: #6670FF;
}
/* The toc tree */
.wy-nav-side{
background-color: #6670FF;
}
/* The section headers in the toc tree */
.wy-menu-vertical p.caption{
background-color: #4d59ff;
line-height: 40px;
}
/* The selected items in the toc tree */
.wy-menu-vertical li.current{
background-color: #A6B0FF;
}
/* When a list item that does belong to the selected block from the toc tree is hovered */
.wy-menu-vertical li.current a:hover{
background-color: #B6C0FF;
}
/* When a list item that does NOT belong to the selected block from the toc tree is hovered. */
.wy-menu-vertical li a:hover{
background-color: #A7AFFB;
}
/* The text items on the toc tree */
.wy-menu-vertical a {
color: #FFFFDD;
font-family: Calibre-Light, sans-serif;
}
.wy-menu-vertical header, .wy-menu-vertical p.caption{
color: white;
font-family: Calibre-Light, sans-serif;
}
/* The color inside the selected toc tree block */
.wy-menu-vertical li.toctree-l2 a, .wy-menu-vertical li.toctree-l3 a, .wy-menu-vertical li.toctree-l4 a {
color: black;
}
/* Inside the depth-2 selected toc tree block */
.wy-menu-vertical li.toctree-l2.current>a {
background-color: #B6C0FF
}
.wy-menu-vertical li.toctree-l2.current li.toctree-l3>a {
background-color: #C6D0FF
}
/* Inside the depth-3 selected toc tree block */
.wy-menu-vertical li.toctree-l3.current li.toctree-l4>a{
background-color: #D6E0FF
}
/* Inside code snippets */
.rst-content dl:not(.docutils) dt{
font-size: 15px;
}
/* Links */
a {
color: #6670FF;
}
/* Content bars */
.rst-content dl:not(.docutils) dt {
background-color: rgba(251, 141, 104, 0.1);
border-right: solid 2px #FB8D68;
border-left: solid 2px #FB8D68;
color: #FB8D68;
font-family: Calibre-Light, sans-serif;
border-top: none;
font-style: normal !important;
}
/* Expand button */
.wy-menu-vertical li.toctree-l2 span.toctree-expand,
.wy-menu-vertical li.on a span.toctree-expand, .wy-menu-vertical li.current>a span.toctree-expand,
.wy-menu-vertical li.toctree-l3 span.toctree-expand{
color: black;
}
/* Max window size */
.wy-nav-content{
max-width: 1200px;
}
/* Mobile header */
.wy-nav-top{
background-color: #6670FF;
}
/* Source spans */
.rst-content .viewcode-link, .rst-content .viewcode-back{
color: #6670FF;
font-size: 110%;
letter-spacing: 2px;
text-transform: uppercase;
}
/* It would be better for table to be visible without horizontal scrolling */
.wy-table-responsive table td, .wy-table-responsive table th{
white-space: normal;
}
.footer {
margin-top: 20px;
}
.footer__Social {
display: flex;
flex-direction: row;
}
.footer__CustomImage {
margin: 2px 5px 0 0;
}
/* class and method names in doc */
.rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) tt.descclassname, .rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) code.descname, .rst-content dl:not(.docutils) tt.descclassname, .rst-content dl:not(.docutils) code.descclassname{
font-family: Calibre, sans-serif;
font-size: 20px !important;
}
/* class name in doc*/
.rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) code.descname{
margin-right: 10px;
font-family: Calibre-Medium, sans-serif;
}
/* Method and class parameters */
.sig-param{
line-height: 23px;
}
/* Class introduction "class" string at beginning */
.rst-content dl:not(.docutils) .property{
font-size: 18px;
color: black;
}
/* FONTS */
body{
font-family: Calibre, sans-serif;
font-size: 16px;
}
h1 {
font-family: Calibre-Thin, sans-serif;
font-size: 70px;
}
h2, .rst-content .toctree-wrapper p.caption, h3, h4, h5, h6, legend{
font-family: Calibre-Medium, sans-serif;
}
@font-face {
font-family: Calibre-Medium;
src: url(./Calibre-Medium.otf);
font-weight:400;
}
@font-face {
font-family: Calibre;
src: url(./Calibre-Regular.otf);
font-weight:400;
}
@font-face {
font-family: Calibre-Light;
src: url(./Calibre-Light.ttf);
font-weight:400;
}
@font-face {
font-family: Calibre-Thin;
src: url(./Calibre-Thin.otf);
font-weight:400;
}
/**
* Nav Links to other parts of huggingface.co
*/
div.menu {
position: absolute;
top: 0;
right: 0;
padding-top: 20px;
padding-right: 20px;
z-index: 1000;
}
div.menu a {
font-size: 14px;
letter-spacing: 0.3px;
text-transform: uppercase;
color: white;
-webkit-font-smoothing: antialiased;
background: linear-gradient(0deg, #6671ffb8, #9a66ffb8 50%);
padding: 10px 16px 6px 16px;
border-radius: 3px;
margin-left: 12px;
position: relative;
}
div.menu a:active {
top: 1px;
}
@media (min-width: 768px) and (max-width: 1750px) {
.wy-breadcrumbs {
margin-top: 32px;
}
}
@media (max-width: 768px) {
div.menu {
display: none;
}
}

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long


32
docs/source/_toctree.yml Normal file
View File

@ -0,0 +1,32 @@
- sections:
- local: index
title: 🤗 Accelerate
- local: quicktour
title: Quick tour
- local: installation
title: Installation
title: Get started
- sections:
- local: big_modeling
title: Handling big models
- local: sagemaker
title: Amazon SageMaker
title: Guides
- sections:
- local: accelerator
title: Accelerator
- local: launcher
title: Notebook Launcher
- local: kwargs
title: Kwargs Handlers
- local: internal
title: Internals
- local: checkpoint
title: Checkpointing
- local: tracking
title: Experiment Tracking
- local: fsdp
title: Fully Sharded Data Parallel
- local: memory
title: Memory Utilities
title: API Reference

View File

@ -0,0 +1,41 @@
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Accelerator
The [`Accelerator`] is the main class provided by 🤗 Accelerate. It serves as the main entry point for
the API. To quickly adapt your script to work on any kind of setup with 🤗 Accelerate, just:
1. Initialize an [`Accelerator`] object (that we will call `accelerator` in the rest of this
page) as early as possible in your script.
2. Pass along your model(s), optimizer(s), dataloader(s) to the [`~Accelerator.prepare`] method.
3. (Optional but best practice) Remove all the `.cuda()` or `.to(device)` in your code and let the
`accelerator` handle device placement for you.
4. Replace the `loss.backward()` in your code by `accelerator.backward(loss)`.
5. (Optional, when using distributed evaluation) Gather your predictions and labels before storing them or using them
for metric computation using [`~Accelerator.gather`].
This is all that is needed in most cases. For more advanced cases or a nicer experience, here are the functions you
should search for and replace with the corresponding methods of your `accelerator` (a short sketch using a few of them follows the list below):
- `print` statements should be replaced by [`~Accelerator.print`] to be only printed once per
process.
- Use [`~Accelerator.is_local_main_process`] for statements that should be executed once per server.
- Use [`~Accelerator.is_main_process`] for statements that should be executed once only.
- Use [`~Accelerator.wait_for_everyone`] to make sure all processes join that point before continuing
(useful before a model save for instance).
- Use [`~Accelerator.unwrap_model`] to unwrap your model before saving it.
- Use [`~Accelerator.save`] instead of `torch.save`.
- Use [`~Accelerator.clip_grad_norm_`] instead of `torch.nn.utils.clip_grad_norm_` and
[`~Accelerator.clip_grad_value_`] instead of `torch.nn.utils.clip_grad_value_`.
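As a minimal sketch of how a few of these methods fit together around saving a model (`model` and the file name are placeholders):
```python
# Printed only once instead of on every process
accelerator.print("Training finished, saving the model")

# Make sure every process has finished its work before saving
accelerator.wait_for_everyone()

# Remove the distributed wrapper before saving
unwrapped_model = accelerator.unwrap_model(model)

# `accelerator.save` only writes from the main process
accelerator.save(unwrapped_model.state_dict(), "my_model.pt")
```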
[[autodoc]] Accelerator

View File

@ -1,43 +0,0 @@
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Accelerator
=======================================================================================================================
The :class:`~accelerate.Accelerator` is the main class provided by 🤗 Accelerate. It serves at the main entrypoint for
the API. To quickly adapt your script to work on any kind of setup with 🤗 Accelerate juste:
1. Initialize an :class:`~accelerate.Accelerator` object (that we will call :obj:`accelerator` in the rest of this
page) as early as possible in your script.
2. Pass along your model(s), optimizer(s), dataloader(s) to the :meth:`~accelerate.Accelerator.prepare` method.
3. (Optional but best practice) Remove all the :obj:`.cuda()` or :obj:`.to(device)` in your code and let the
:obj:`accelerator` handle device placement for you.
4. Replace the :obj:`loss.backward()` in your code by :obj:`accelerator.backward(loss)`.
5. (Optional, when using distributed evaluation) Gather your predictions and labelsbefore storing them or using them
for metric computation using :meth:`~accelerate.Accelerator.gather`.
This is all what is needed in most cases. For more advanced case or a nicer experience here are the functions you
should search for and replace by the corresponding methods of your :obj:`accelerator`:
- :obj:`print` statements should be replaced by :meth:`~accelerate.Accelerator.print` to be only printed once per
process.
- Use :meth:`~accelerate.Accelerator.is_local_main_process` for statements that should be executed once per server.
- Use :meth:`~accelerate.Accelerator.is_main_process` for statements that should be executed once only.
- Use :meth:`~accelerate.Accelerator.wait_for_everyone` to make sure all processes join that point before continuing
(useful before a model save for instance).
- Use :meth:`~accelerate.Accelerator.unwrap_model` to unwrap your model before saving it.
- Use :meth:`~accelerate.Accelerator.save` instead of :obj:`torch.save`.
- Use :meth:`~accelerate.Accelerator.clip_grad_norm_` instead of :obj:`torch.nn.utils.clip_grad_norm_` and
:meth:`~accelerate.Accelerator.clip_grad_value_` instead of :obj:`torch.nn.utils.clip_grad_value_`.
.. autoclass:: accelerate.Accelerator
:members:

View File

@ -0,0 +1,232 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Handling big models
When loading a pretrained model in PyTorch, the usual workflow looks like this:
```py
import torch
my_model = ModelClass(...)
state_dict = torch.load(checkpoint_file)
my_model.load_state_dict(state_dict)
```
In plain English, those steps are:
1. Create the model with randomly initialized weights
2. Load the model weights (in a dictionary usually called a state dict) from the disk
3. Load those weights inside the model
While this works very well for regularly sized models, this workflow has some clear limitations when we deal with a huge model: in step 1, we load a full version of the model in RAM, and spend some time randomly initializing the weights (which will be discarded in step 3). In step 2, we load another full version of the model in RAM, with the pretrained weights. If you're loading a model with 6 billion parameters, you will need 24GB of RAM for each copy of the model, so 48GB in total (or half of that if the model is loaded in FP16).
<Tip warning={true}>
This API is quite new and still in its experimental stage. While we strive to provide a stable API, it's possible some small parts of the public API will change in the future.
</Tip>
## Instantiating an empty model
The first tool Accelerate introduces to help with big models is a context manager [`init_empty_weights`] that helps you initialize a model without using any RAM, so that step 1 can be done on models of any size. Here is how it works:
```py
from accelerate import init_empty_weights
with init_empty_weights():
my_model = ModelClass(...)
```
For instance:
```py
with init_empty_weights():
model = nn.Sequential(*[nn.Linear(10000, 10000) for _ in range(1000)])
```
initializes an empty model with a bit more than 100B parameters. Behind the scenes, this relies on the meta device introduced in PyTorch 1.9. During the initialization under the context manager, each time a parameter is created, it is instantly moved to that device.
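To get a feel for what the meta device means, here is a tiny illustration in plain PyTorch (assuming PyTorch >= 1.9): the tensor has a shape and dtype, but no storage behind it.
```py
import torch

weight = torch.empty(10000, 10000, device="meta")
print(weight.device)  # meta
# The size this tensor *would* take, even though no memory is actually allocated
print(weight.element_size() * weight.nelement())  # 400000000 bytes with the default float32
```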
<Tip warning={true}>
You can't move a model initialized like this to CPU or another device directly, since it doesn't have any data. It's also very likely that a forward pass with that empty model will fail, as not all operations are supported on the meta device.
</Tip>
## Sharded checkpoints
It's possible your model is so big that even a single copy won't fit in RAM. That doesn't mean it can't be loaded: if you have one or several GPUs, that is more memory available to store your model. In this case, it's better if your checkpoint is split into several smaller files that we call checkpoint shards.
Accelerate will handle sharded checkpoints as long as you follow the following format: your checkpoint should be in a folder, with several files containing the partial state dicts, and there should be an index in JSON format that contains a dictionary mapping parameter names to the file containing their weights. For instance, we could have a folder containing:
```bash
first_state_dict.bin
index.json
second_state_dict.bin
```
with index.json being the following file:
```
{
"linear1.weight": "first_state_dict.bin",
"linear1.bias": "first_state_dict.bin",
"linear2.weight": "second_state_dict.bin",
"linear2.bias": "second_state_dict.bin"
}
```
and `first_state_dict.bin` containing the weights for `"linear1.weight"` and `"linear1.bias"`, and `second_state_dict.bin` the ones for `"linear2.weight"` and `"linear2.bias"`.
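For illustration, here is one way such a folder could be produced by hand for a small two-layer model. This is only a sketch: the file and parameter names mirror the example above and this is not a utility provided by the library.
```py
import json

import torch
import torch.nn as nn


class SmallModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(4, 4)
        self.linear2 = nn.Linear(4, 4)


model = SmallModel()
state_dict = model.state_dict()

# Split the state dict into two shards and remember which file each parameter went to
shards = {
    "first_state_dict.bin": {k: v for k, v in state_dict.items() if k.startswith("linear1")},
    "second_state_dict.bin": {k: v for k, v in state_dict.items() if k.startswith("linear2")},
}
index = {}
for filename, shard in shards.items():
    torch.save(shard, filename)
    index.update({name: filename for name in shard})

with open("index.json", "w") as f:
    json.dump(index, f, indent=2)
```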
## Loading weights
The second tool Accelerate introduces is a function [`load_checkpoint_and_dispatch`], that will allow you to load a checkpoint inside your empty model. This supports full checkpoints (a single file containing the whole state dict) as well as sharded checkpoints. It will also automatically dispatch those weights across the devices you have available (GPUs, CPU RAM), so if you are loading a sharded checkpoint, the maximum RAM usage will be the size of the biggest shard.
Here is how we can use this to load the [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B) model. You clone the sharded version of this model with:
```bash
git clone https://huggingface.co/sgugger/sharded-gpt-j-6B
cd sharded-gpt-j-6B
git-lfs install
git pull
```
then we can initialize the model with
```py
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM
checkpoint = "EleutherAI/gpt-j-6B"
config = AutoConfig.from_pretrained(checkpoint)
with init_empty_weights():
model = AutoModelForCausalLM.from_config(config)
```
and load the checkpoint we just downloaded with:
```py
from accelerate import load_checkpoint_and_dispatch
model = load_checkpoint_and_dispatch(
model, "sharded-gpt-j-6B", device_map="auto", no_split_module_classes=["GPTJBlock"]
)
```
By passing `device_map="auto"`, we tell Accelerate to determine automatically where to put each layer of the model depending on the available resources:
- first we use the maximum space available on the GPU(s)
- if we still need space, we store the remaining weights on the CPU
- if there is not enough RAM, we store the remaining weights on the hard drive as memory-mapped tensors
`no_split_module_classes=["GPTJBlock"]` indicates that the modules that are `GPTJBlock` should not be split on different devices. You should set here all blocks that include a residual connection of some kind.
You can see the `device_map` that Accelerate picked by accessing the `hf_device_map` attribute of your model:
```py
model.hf_device_map
```
```python out
{'transformer.wte': 0,
'transformer.drop': 0,
'transformer.h.0': 0,
'transformer.h.1': 0,
'transformer.h.2': 0,
'transformer.h.3': 0,
'transformer.h.4': 0,
'transformer.h.5': 0,
'transformer.h.6': 0,
'transformer.h.7': 0,
'transformer.h.8': 0,
'transformer.h.9': 0,
'transformer.h.10': 0,
'transformer.h.11': 0,
'transformer.h.12': 0,
'transformer.h.13': 0,
'transformer.h.14': 0,
'transformer.h.15': 0,
'transformer.h.16': 0,
'transformer.h.17': 0,
'transformer.h.18': 0,
'transformer.h.19': 0,
'transformer.h.20': 0,
'transformer.h.21': 0,
'transformer.h.22': 0,
'transformer.h.23': 0,
'transformer.h.24': 1,
'transformer.h.25': 1,
'transformer.h.26': 1,
'transformer.h.27': 1,
'transformer.ln_f': 1,
'lm_head': 1}
```
You can also design your `device_map` yourself, if you prefer to explicitly decide where each layer should be. In this case, the command above becomes:
```py
model = load_checkpoint_and_dispatch(model, "sharded-gpt-j-6B", device_map=my_device_map)
```
## Run the model
Now that we have done this, our model lies across several devices, and maybe the hard drive. But it can still be used as a regular PyTorch model:
```py
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer("Hello, my name is", return_tensors="pt")
inputs = inputs.to(0)
output = model.generate(inputs["input_ids"])
tokenizer.decode(output[0].tolist())
```
Behind the scenes, Accelerate added hooks to the model, so that:
- at each layer, the inputs are put on the right device (so even if your model is spread across several GPUs, it works)
- for the weights offloaded on the CPU, they are put on a GPU just before the forward pass, and cleaned up just after
- for the weights offloaded on the hard drive, they are loaded in RAM then put on a GPU just before the forward pass, and cleaned up just after
This way, your model can run for inference even if it doesn't fit on one of the GPUs or in CPU RAM!
<Tip warning={true}>
This only supports inference of your model, not training. Most of the computation happens behind `torch.no_grad()` context managers to avoid spending some GPU memory with intermediate activations.
</Tip>
## Limits and further development
We are aware of the current limitations in the API:
- While this could theoretically work on just one CPU with potential disk offload, you need at least one GPU to run this API. This will be fixed in further development.
- [`infer_auto_device_map`] (or `device_map="auto"` in [`load_checkpoint_and_dispatch`]) tries to maximize the GPU and CPU RAM it sees available when you execute it. While PyTorch is very good at managing GPU RAM efficiently (and giving it back when not needed), the same is not entirely true of Python and CPU RAM. Therefore, an automatically computed device map might be too demanding on CPU RAM. Move a few modules to the disk device if you get crashes due to lack of RAM (one way to cap the automatic map is sketched after this list).
- [`infer_auto_device_map`] (or `device_map="auto"` in [`load_checkpoint_and_dispatch`]) assigns devices sequentially (to avoid moving things back and forth), so if your first layer is bigger than the size of the GPU you have, it will end up with everything on the CPU/Disk.
- [`load_checkpoint_and_dispatch`] and [`load_checkpoint_in_model`] do not perform any check on the correctness of your state dict compared to your model at the moment (this will be fixed in a future version), so you may get some weird errors if trying to load a checkpoint with mismatched or missing keys.
- The model parallelism used when your model is split on several GPUs is naive and not optimized, meaning that only one GPU works at a given time and the other sits idle.
- When weights are offloaded on the CPU/hard drive, there is no pre-fetching (yet, we will work on this for future versions) which means the weights are put on the GPU when they are needed and not before.
- Hard-drive offloading might be very slow if the hardware you run on does not have fast communication between disk and CPU (for example, without NVMe drives).
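If the automatically computed map is too demanding on CPU RAM, one workaround (assuming the `max_memory` argument is available in your version) is to cap how much memory the automatic dispatch may use per device and send the rest to disk, for example:
```py
from accelerate import infer_auto_device_map, load_checkpoint_and_dispatch

# Cap GPU 0 at 10GiB and CPU RAM at 30GiB; what doesn't fit is left for the disk
device_map = infer_auto_device_map(
    model, max_memory={0: "10GiB", "cpu": "30GiB"}, no_split_module_classes=["GPTJBlock"]
)
model = load_checkpoint_and_dispatch(
    model, "sharded-gpt-j-6B", device_map=device_map, offload_folder="offload"
)
```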
## API doc
[[autodoc]] cpu_offload
[[autodoc]] disk_offload
[[autodoc]] dispatch_model
[[autodoc]] infer_auto_device_map
[[autodoc]] init_empty_weights
[[autodoc]] load_checkpoint_and_dispatch
[[autodoc]] load_checkpoint_in_model

View File

@ -0,0 +1,60 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Checkpointing
When training a PyTorch model with Accelerate, you may often want to save and later resume the state of training. Doing so requires
saving and loading the model, optimizer, RNG generators, and the GradScaler. Inside Accelerate are two convenience functions to achieve this quickly:
- Use [`~Accelerator.save_state`] for saving everything mentioned above to a folder location
- Use [`~Accelerator.load_state`] for loading everything stored from an earlier `save_state`
It should be noted that the expectation is that those states come from the same training script; they should not be from two separate scripts.
- By using [`~Accelerator.register_for_checkpointing`], you can register custom objects to be automatically stored or loaded from the two prior functions,
so long as the object has a `state_dict` **and** a `load_state_dict` functionality. This could include objects such as a learning rate scheduler.
Below is a brief example using checkpointing to save and reload a state during training:
```python
from accelerate import Accelerator
import torch
accelerator = Accelerator()
my_scheduler = torch.optim.lr_scheduler.StepLR(my_optimizer, step_size=1, gamma=0.99)
my_model, my_optimizer, my_training_dataloader = accelerator.prepare(my_model, my_optimizer, my_training_dataloader)

# Register the LR scheduler
accelerator.register_for_checkpointing(my_scheduler)

# Save the starting state
accelerator.save_state("my/save/path")

device = accelerator.device
my_model.to(device)

# Perform training
for epoch in range(num_epochs):
    for batch in my_training_dataloader:
        my_optimizer.zero_grad()
        inputs, targets = batch
        inputs = inputs.to(device)
        targets = targets.to(device)
        outputs = my_model(inputs)
        loss = my_loss_function(outputs, targets)
        accelerator.backward(loss)
        my_optimizer.step()
    my_scheduler.step()

# Restore previous state
accelerator.load_state("my/save/path")
```
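As a small sketch of the `register_for_checkpointing` mechanism with a custom object (the `StepCounter` class is purely illustrative): any object exposing `state_dict` and `load_state_dict` can be registered the same way as the scheduler above.
```python
class StepCounter:
    """A hypothetical custom object tracking how many optimizer steps were taken."""

    def __init__(self):
        self.steps = 0

    def state_dict(self):
        return {"steps": self.steps}

    def load_state_dict(self, state_dict):
        self.steps = state_dict["steps"]


counter = StepCounter()
# Saved and restored together with the rest of the state by save_state/load_state
accelerator.register_for_checkpointing(counter)
```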

View File

@ -1,196 +0,0 @@
# -*- coding: utf-8 -*-
#
# Configuration file for the Sphinx documentation builder.
#
# This file does only contain a selection of the most common options. For a
# full list see the documentation:
# http://www.sphinx-doc.org/en/master/config
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
sys.path.insert(0, os.path.abspath('../../src'))
# -- Project information -----------------------------------------------------
project = u'accelerate'
copyright = u'2020, The Hugging Face Team, Licenced under the Apache License, Version 2.0'
author = u'huggingface'
# The short X.Y version
version = u'0.1.0'
# -- General configuration ---------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.extlinks',
'sphinx.ext.coverage',
'sphinx.ext.napoleon',
'recommonmark',
'sphinx.ext.viewcode',
'sphinx_markdown_tables',
'sphinx_copybutton'
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
source_suffix = ['.rst', '.md']
# source_suffix = '.rst'
# The master toctree document.
master_doc = 'index'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = [u'_build', 'Thumbs.db', '.DS_Store']
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = None
# Remove the prompt when copying examples
copybutton_prompt_text = r">>> |\.\.\. "
copybutton_prompt_is_regexp = True
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'sphinx_rtd_theme'
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
html_theme_options = {
'analytics_id': 'UA-83738774-2'
}
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
# The default sidebars (for documents that don't match any pattern) are
# defined by theme itself. Builtin themes are using these templates by
# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
# 'searchbox.html']``.
#
# html_sidebars = {}
# This must be the name of an image file (path relative to the configuration
# directory) that is the favicon of the docs. Modern browsers use this as
# the icon for tabs, windows and bookmarks. It should be a Windows-style
# icon file (.ico).
html_favicon = 'favicon.ico'
# -- Options for HTMLHelp output ---------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = 'acceleratedoc'
# -- Options for LaTeX output ------------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#
# 'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#
# 'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#
# 'preamble': '',
# Latex figure (float) alignment
#
# 'figure_align': 'htbp',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, 'accelerate.tex', u'accelerate Documentation',
u'huggingface', 'manual'),
]
# -- Options for manual page output ------------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
(master_doc, 'accelerate', u'accelerate Documentation',
[author], 1)
]
# -- Options for Texinfo output ----------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(master_doc, 'accelerate', u'accelerate Documentation',
author, 'accelerate', 'One line description of project.',
'Miscellaneous'),
]
# -- Options for Epub output -------------------------------------------------
# Bibliographic Dublin Core info.
epub_title = project
# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#
# epub_identifier = ''
# A unique identification for the text.
#
# epub_uid = ''
# A list of files that should not be packed into the epub file.
epub_exclude_files = ['search.html']
def setup(app):
app.add_css_file('css/huggingface.css')
app.add_css_file('css/code-snippets.css')
app.add_js_file('js/custom.js')
# -- Extension configuration -------------------------------------------------


120
docs/source/fsdp.mdx Normal file
View File

@ -0,0 +1,120 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Fully Sharded Data Parallel
To accelerate training huge models on larger batch sizes, we can use a fully sharded data parallel model.
This type of data parallel paradigm enables fitting more data and larger models by sharding the optimizer states, gradients and parameters.
To read more about it and the benefits, check out the [Fully Sharded Data Parallel blog](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/).
We have integrated PyTorch's latest Fully Sharded Data Parallel (FSDP) training feature.
All you need to do is enable it through the config.
## How it works out of the box
On your machine(s) just run:
```bash
accelerate config
```
and answer the questions asked. This will generate a config file that will be used automatically to properly set the
default options when doing
```bash
accelerate launch my_script.py --args_to_my_script
```
For instance, here is how you would run the NLP example (from the root of the repo) with FSDP enabled, using a config file like the following:
```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: FSDP
fsdp_config:
min_num_params: 2000
offload_params: false
sharding_strategy: 1
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
use_cpu: false
```
```bash
accelerate launch examples/nlp_example.py
```
Currently, `Accelerate` supports the following config options through the CLI:
```bash
`Sharding Strategy`: [1] FULL_SHARD, [2] SHARD_GRAD_OP
`Min Num Params`: FSDP's minimum number of parameters for Default Auto Wrapping.
`Offload Params`: Decides whether to offload parameters and gradients to CPU.
```
## A few caveats to be aware of
- PyTorch FSDP auto wraps sub-modules, flattens the parameters and shards the parameters in place.
Due to this, any optimizer created before model wrapping gets broken and occupies more memory.
Hence, it is highly recommended and efficient to prepare the model before creating the optimizer.
In the case of a single model, `Accelerate` will automatically wrap the model and create an optimizer for you, with a warning message.
> FSDP Warning: When using FSDP, it is efficient and recommended to call prepare for the model before creating the optimizer
However, below is the recommended way to prepare model and optimizer while using FSDP:
```diff
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
+ model = accelerator.prepare(model)
optimizer = torch.optim.AdamW(params=model.parameters(), lr=lr)
- model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(model,
- optimizer, train_dataloader, eval_dataloader, lr_scheduler
- )
+ optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
+ optimizer, train_dataloader, eval_dataloader, lr_scheduler
+ )
```
- In case of a single model, if you have created optimizer with multiple parameter groups and called prepare with them together,
then the parameter groups will be lost and the following warning is displayed:
> FSDP Warning: When using FSDP, several parameter groups will be conflated into
> a single one due to nested module wrapping and parameter flattening.
This is because parameter groups created before wrapping will have no meaning post wrapping, due to the parameter flattening of nested FSDP modules into 1D arrays (which can consume many layers).
For instance, below are the named parameters of an FSDP model on GPU 0 (when using 2 GPUs; around 55M (110M/2) parameters end up in 1D arrays, as this rank holds the first shard of the parameters).
Here, if one has applied no weight decay to the [bias, LayerNorm.weight] named parameters of the unwrapped BERT model,
it can't be applied to the FSDP-wrapped model below, as there are no named parameters containing either of those strings and
the parameters of those layers are concatenated with parameters of various other layers.
```
{
'_fsdp_wrapped_module.flat_param': torch.Size([494209]),
'_fsdp_wrapped_module._fpw_module.bert.embeddings.word_embeddings._fsdp_wrapped_module.flat_param': torch.Size([11720448]),
'_fsdp_wrapped_module._fpw_module.bert.encoder._fsdp_wrapped_module.flat_param': torch.Size([42527232])
}
```
- In case of multiple models, it is necessary to prepare the models before creating the optimizers, otherwise it will throw an error.
- Mixed precision is currently not supported with FSDP.
For more control, users can leverage the `FullyShardedDataParallelPlugin` wherein they can specify `auto_wrap_policy`, `backward_prefetch` and `ignored_modules`.
After creating an instance of this class, users can pass it to the Accelerator class instantiation.
For more information on these options, please refer to the PyTorch [FullyShardedDataParallel](https://github.com/pytorch/pytorch/blob/0df2e863fbd5993a7b9e652910792bd21a516ff3/torch/distributed/fsdp/fully_sharded_data_parallel.py#L236) code.
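A minimal sketch of that flow, assuming the plugin can be imported from `accelerate.utils` (as referenced below) and passed to `Accelerator` via the `fsdp_plugin` argument; the model, learning rate and dataloaders are placeholders:
```py
import torch
from accelerate import Accelerator
from accelerate.utils import FullyShardedDataParallelPlugin

# Customize auto_wrap_policy, backward_prefetch or ignored_modules here if needed
fsdp_plugin = FullyShardedDataParallelPlugin()
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

# Prepare the model first (recommended with FSDP), then create and prepare the optimizer
model = accelerator.prepare(model)
optimizer = torch.optim.AdamW(params=model.parameters(), lr=lr)
optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
    optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
```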
[[autodoc]] utils.FullyShardedDataParallelPlugin


132
docs/source/index.mdx Normal file
View File

@ -0,0 +1,132 @@
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Accelerate
Run your *raw* PyTorch training script on any kind of device
## Features
- 🤗 Accelerate provides an easy API to make your scripts run with mixed precision and on any kind of distributed
setting (multi-GPUs, TPUs etc.) while still letting you write your own training loop. The same code can then run
seamlessly on your local machine for debugging or in your training environment.
- 🤗 Accelerate also provides a CLI tool that allows you to quickly configure and test your training environment then
launch the scripts.
## Easy to integrate
A traditional training loop in PyTorch looks like this:
```python
my_model.to(device)
for batch in my_training_dataloader:
my_optimizer.zero_grad()
inputs, targets = batch
inputs = inputs.to(device)
targets = targets.to(device)
outputs = my_model(inputs)
loss = my_loss_function(outputs, targets)
loss.backward()
my_optimizer.step()
```
Changing it to work with accelerate is really easy and only adds a few lines of code:
```diff
+ from accelerate import Accelerator
+ accelerator = Accelerator()
# Use the device given by the *accelerator* object.
+ device = accelerator.device
my_model.to(device)
# Pass every important object (model, optimizer, dataloader) to *accelerator.prepare*
+ my_model, my_optimizer, my_training_dataloader = accelerator.prepare(
+ my_model, my_optimizer, my_training_dataloader
+ )
for batch in my_training_dataloader:
my_optimizer.zero_grad()
inputs, targets = batch
inputs = inputs.to(device)
targets = targets.to(device)
outputs = my_model(inputs)
loss = my_loss_function(outputs, targets)
# Just a small change for the backward instruction
- loss.backward()
+ accelerator.backward(loss)
my_optimizer.step()
```
and with this, your script can now run in a distributed environment (multi-GPU, TPU).
You can even simplify your script a bit by letting 🤗 Accelerate handle the device placement for you (which is safer,
especially for TPU training):
```diff
+ from accelerate import Accelerator
+ accelerator = Accelerator()
- my_model.to(device)
# Pass every important object (model, optimizer, dataloader) to *accelerator.prepare*
+ my_model, my_optimizer, my_training_dataloader = accelerator.prepare(
+ my_model, my_optimizer, my_training_dataloader
+ )
for batch in my_training_dataloader:
my_optimizer.zero_grad()
inputs, targets = batch
- inputs = inputs.to(device)
- targets = targets.to(device)
outputs = my_model(inputs)
loss = my_loss_function(outputs, targets)
# Just a small change for the backward instruction
- loss.backward()
+ accelerator.backward(loss)
my_optimizer.step()
```
## Script launcher
No need to remember how to use `torch.distributed.launch` or to write a specific launcher for TPU training! 🤗
Accelerate comes with a CLI tool that will make your life easier when launching distributed scripts.
On your machine(s) just run:
```bash
accelerate config
```
and answer the questions asked. This will generate a config file that will be used automatically to properly set the
default options when doing
```bash
accelerate launch my_script.py --args_to_my_script
```
For instance, here is how you would run the NLP example (from the root of the repo):
```bash
accelerate launch examples/nlp_example.py
```
## Supported integrations
- CPU only
- single GPU
- multi-GPU on one node (machine)
- multi-GPU on several nodes (machines)
- TPU
- FP16 with native AMP (apex on the roadmap)
- DeepSpeed (experimental support)

View File

@ -1,148 +0,0 @@
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Accelerate
=======================================================================================================================
Run your *raw* PyTorch training script on any kind of device
Features
-----------------------------------------------------------------------------------------------------------------------
- 🤗 Accelerate provides an easy API to make your scripts run with mixed precision and on any kind of distributed
setting (multi-GPUs, TPUs etc.) while still letting you write your own training loop. The same code can then runs
seamlessly on your local machine for debugging or your training environment.
- 🤗 Accelerate also provides a CLI tool that allows you to quickly configure and test your training environment then
launch the scripts.
Easy to integrate
-----------------------------------------------------------------------------------------------------------------------
A traditional training loop in PyTorch looks like this:
.. code-block:: python
my_model.to(device)
for batch in my_training_dataloader:
my_optimizer.zero_grad()
inputs, targets = batch
inputs = inputs.to(device)
targets = targets.to(device)
outputs = my_model(inputs)
loss = my_loss_function(outputs, targets)
loss.backward()
my_optimizer.step()
Changing it to work with accelerate is really easy and only adds a few lines of code:
.. code-block:: python
from accelerate import Accelerator
accelerator = Accelerator()
# Use the device given by the `accelerator` object.
device = accelerator.device
my_model.to(device)
# Pass every important object (model, optimizer, dataloader) to `accelerator.prepare`
my_model, my_optimizer, my_training_dataloader = accelerate.prepare(
my_model, my_optimizer, my_training_dataloader
)
for batch in my_training_dataloader:
my_optimizer.zero_grad()
inputs, targets = batch
inputs = inputs.to(device)
targets = targets.to(device)
outputs = my_model(inputs)
loss = my_loss_function(outputs, targets)
# Just a small change for the backward instruction
accelerate.backward(loss)
my_optimizer.step()
and with this, your script can now run in a distributed environment (multi-GPU, TPU).
You can even simplify your script a bit by letting 🤗 Accelerate handle the device placement for you (which is safer,
especially for TPU training):
.. code-block:: python
from accelerate import Accelerator
accelerator = Accelerator()
# Pass every important object (model, optimizer, dataloader) to `accelerator.prepare`
my_model, my_optimizer, my_training_dataloader = accelerate.prepare(
my_model, my_optimizer, my_training_dataloader
)
for batch in my_training_dataloader:
my_optimizer.zero_grad()
inputs, targets = batch
outputs = my_model(inputs)
loss = my_loss_function(outputs, targets)
# Just a small change for the backward instruction
accelerate.backward(loss)
my_optimizer.step()
Script launcher
-----------------------------------------------------------------------------------------------------------------------
No need to remember how to use ``torch.distributed.launch`` or to write a specific launcher for TPU training! 🤗
Accelerate comes with a CLI tool that will make your life easier when launching distributed scripts.
On your machine(s) just run:
.. code-block:: bash
accelerate config
and answer the questions asked. This will generate a config file that will be used automatically to properly set the
default options when doing
.. code-block:: bash
accelerate launch my_script.py --args_to_my_script
For instance, here is how you would run the NLP example (from the root of the repo):
.. code-block:: bash
accelerate launch examples/nlp_example.py
Supported integrations
-----------------------------------------------------------------------------------------------------------------------
- CPU only
- single GPU
- multi-GPU on one node (machine)
- multi-GPU on several nodes (machines)
- TPU
- FP16 with native AMP (apex on the roadmap)
.. toctree::
:maxdepth: 2
:caption: Get started
quicktour
installation
.. toctree::
:maxdepth: 2
:caption: API reference
accelerator
internal


@ -55,9 +55,9 @@ Here is how to quickly install `accelerate` from source:
pip install git+https://github.com/huggingface/accelerate
```
Note that this will install not the latest released version, but the bleeding edge `master` version, which you may want to use in case a bug has been fixed since the last official release and a new release hasn't been yet rolled out.
Note that this will install not the latest released version, but the bleeding edge `main` version, which you may want to use in case a bug has been fixed since the last official release and a new release hasn't been yet rolled out.
While we strive to keep `master` operational at all times, if you notice some issues, they usually get fixed within a few hours or a day, and you're more than welcome to help us detect any problems by opening an [Issue](https://github.com/huggingface/accelerate/issues) and this way, things will get fixed even sooner.
While we strive to keep `main` operational at all times, if you notice some issues, they usually get fixed within a few hours or a day, and you're more than welcome to help us detect any problems by opening an [Issue](https://github.com/huggingface/accelerate/issues) and this way, things will get fixed even sooner.
Again, you can run:
@ -69,11 +69,11 @@ to check 🤗 Accelerate is properly installed.
## Editable install
If you want to constantly use the bleeding edge `master` version of the source code, or if you want to contribute to the library and need to test the changes in the code you're making, you will need an editable install. This is done by cloning the repository and installing with the following commands:
If you want to constantly use the bleeding edge `main` version of the source code, or if you want to contribute to the library and need to test the changes in the code you're making, you will need an editable install. This is done by cloning the repository and installing with the following commands:
``` bash
git clone https://github.com/huggingface/accelerate.git
cd transformers
cd accelerate
pip install -e .
```
@ -85,9 +85,9 @@ now this editable install will reside where you clone the folder to, e.g. `~/acc
Do note that you have to keep that `accelerate` folder around and not delete it to continue using the 🤗 Accelerate library.
Now, let's get to the real benefit of this installation approach. Say, you saw some new feature has been just committed into `master`. If you have already performed all the steps above, to update your accelerate repo to include all the latest commits, all you need to do is to `cd` into that cloned repository folder and update the clone to the latest version:
Now, let's get to the real benefit of this installation approach. Say, you saw some new feature has been just committed into `main`. If you have already performed all the steps above, to update your accelerate repo to include all the latest commits, all you need to do is to `cd` into that cloned repository folder and update the clone to the latest version:
```
```bash
cd ~/accelerate/
git pull
```

docs/source/internal.mdx Normal file

@ -0,0 +1,71 @@
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Internals
## Optimizer
[[autodoc]] optimizer.AcceleratedOptimizer
## DataLoader
The main work on your PyTorch `DataLoader` is done by the following function:
[[autodoc]] data_loader.prepare_data_loader
### DataLoaderShard
[[autodoc]] data_loader.DataLoaderShard
### BatchSamplerShard
[[autodoc]] data_loader.BatchSamplerShard
### IterableDatasetShard
[[autodoc]] data_loader.IterableDatasetShard
## Scheduler
[[autodoc]] scheduler.AcceleratedScheduler
## Distributed Config
### AcceleratorState
[[autodoc]] state.AcceleratorState
### DistributedType
[[autodoc]] state.DistributedType
## Tracking
[[autodoc]] tracking.GeneralTracker
## Utilities
[[autodoc]] utils.extract_model_from_parallel
[[autodoc]] utils.gather
[[autodoc]] utils.send_to_device
[[autodoc]] utils.set_seed
[[autodoc]] utils.synchronize_rng_state
[[autodoc]] utils.synchronize_rng_states
[[autodoc]] utils.wait_for_everyone
[[autodoc]] utils.write_basic_config


@ -1,83 +0,0 @@
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Internals
=======================================================================================================================
Optimizer
-----------------------------------------------------------------------------------------------------------------------
.. autoclass:: accelerate.optimizer.AcceleratedOptimizer
DataLoader
-----------------------------------------------------------------------------------------------------------------------
The main work on your PyTorch :obj:`DataLoader` is done by the following function:
.. autofunction:: accelerate.data_loader.prepare_data_loader
BatchSamplerShard
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: accelerate.data_loader.DataLoaderShard
:members:
BatchSamplerShard
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: accelerate.data_loader.BatchSamplerShard
:members:
IterableDatasetShard
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: accelerate.data_loader.IterableDatasetShard
:members:
Distributed Config
-----------------------------------------------------------------------------------------------------------------------
AcceleratorState
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: accelerate.state.AcceleratorState
:members:
DistributedType
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: accelerate.state.DistributedType
:members:
Utilities
-----------------------------------------------------------------------------------------------------------------------
.. autofunction:: accelerate.utils.extract_model_from_parallel
.. autofunction:: accelerate.utils.gather
.. autofunction:: accelerate.utils.send_to_device
.. autofunction:: accelerate.utils.set_seed
.. autofunction:: accelerate.utils.synchronize_rng_states
.. autofunction:: accelerate.utils.wait_for_everyone

docs/source/kwargs.mdx Normal file

@ -0,0 +1,29 @@
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Kwargs Handlers
The following objects can be passed to the main [`Accelerator`] to customize how some PyTorch objects
related to distributed training or mixed precision are created.
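For example, a minimal sketch of customizing the underlying `DistributedDataParallel` wrapper could look like this (the `find_unused_parameters=True` value is just an illustrative choice):
```python
from accelerate import Accelerator, DistributedDataParallelKwargs

# Let DDP tolerate parameters that receive no gradient during the forward pass
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])
```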
## DistributedDataParallelKwargs
[[autodoc]] DistributedDataParallelKwargs
## GradScalerKwargs
[[autodoc]] GradScalerKwargs
## InitProcessGroupKwargs
[[autodoc]] InitProcessGroupKwargs

docs/source/launcher.mdx Normal file

@ -0,0 +1,28 @@
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Notebook Launcher
Launch your training function inside a notebook. Currently supports launching a training with TPUs on [Google
Colab](https://colab.research.google.com/) and [Kaggle kernels](https://www.kaggle.com/code), as well as training on
several GPUs (if the machine on which you are running your notebook has them).
An example can be found in [this notebook](https://github.com/huggingface/notebooks/blob/master/examples/accelerate/simple_nlp_example.ipynb).
<Tip warning={true}>
Your `Accelerator` object should only be defined inside the training function. This is because the
initialization should be done inside the launcher only.
</Tip>
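As a quick sketch, launching a (hypothetical) `training_function` on multiple processes could look like this; the `num_processes` value is only an illustrative assumption:
```python
from accelerate import notebook_launcher

# Launch `training_function` on 8 TPU cores (or as many GPU processes as you request)
notebook_launcher(training_function, num_processes=8)
```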
[[autodoc]] notebook_launcher

docs/source/memory.mdx Normal file

@ -0,0 +1,51 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Memory Utilities
One of the most frustrating errors when it comes to running training scripts is hitting "CUDA Out-of-Memory":
the entire script needs to be restarted, progress is lost, and typically a developer would simply want to
start their script and let it run.
`Accelerate` provides a utility heavily based on [toma](https://github.com/BlackHC/toma) to give this capability.
## find_executable_batch_size
This algorithm operates with exponential decay, cutting the batch size in half after each failed run of the
training script. To use it, restructure your training function to include an inner function that includes this wrapper,
and build your dataloaders inside it. At a minimum, this could look like 4 new lines of code.
> Note: The inner function *must* take in the batch size as the first parameter, but we do not pass one to it when called. The wrapper handles this for us.
```diff
def training_function(args):
    accelerator = Accelerator()
    model = get_model()
    model.to(accelerator.device)
    optimizer = get_optimizer()
+   @find_executable_batch_size(starting_batch_size=args.batch_size)
+   def inner_training_loop(batch_size):
+       nonlocal model, optimizer  # Ensure they can be used in our context
        train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
        lr_scheduler = get_scheduler(
            optimizer,
            num_training_steps=len(train_dataloader) * num_epochs
        )
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
            model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
        )
        train(model, optimizer, train_dataloader, lr_scheduler)
        validate(model, eval_dataloader)
+   inner_training_loop()
```
[[autodoc]] utils.find_executable_batch_size

docs/source/quicktour.mdx Normal file

@ -0,0 +1,460 @@
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Quick tour
Let's have a look at the 🤗 Accelerate main features and traps to avoid.
## Main use
To use 🤗 Accelerate in your own script, you have to change four things:
1. Import the [`Accelerator`] main class and instantiate one in an `accelerator` object:
```python
from accelerate import Accelerator
accelerator = Accelerator()
```
This should happen as early as possible in your training script as it will initialize everything necessary for
distributed training. You don't need to indicate the kind of environment you are in (just one machine with a GPU, one
machine with several GPUs, several machines with multiple GPUs or a TPU), the library will detect this automatically.
2. Remove the call `.to(device)` or `.cuda()` for your model and input data. The `accelerator` object
will handle this for you and place all those objects on the right device for you. If you know what you're doing, you
can leave those `.to(device)` calls but you should use the device provided by the `accelerator` object:
`accelerator.device`.
To fully deactivate the automatic device placement, pass along `device_placement=False` when initializing your
[`Accelerator`].
<Tip warning={true}>
If you place your objects manually on the proper device, be careful to create your optimizer after putting your
model on `accelerator.device` or your training will fail on TPU.
</Tip>
3. Pass all objects relevant to training (optimizer, model, training dataloader, learning rate scheduler) to the
[`~Accelerator.prepare`] method. This will make sure everything is ready for training.
```python
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)
```
In particular, your training dataloader will be sharded across all GPUs/TPU cores available so that each one sees a
different portion of the training dataset. Also, the random states of all processes will be synchronized at the
beginning of each iteration through your dataloader, to make sure the data is shuffled the same way (if you decided to
use `shuffle=True` or any kind of random sampler).
<Tip>
The actual batch size for your training will be the number of devices used multiplied by the batch size you set in
your script: for instance training on 4 GPUs with a batch size of 16 set when creating the training dataloader will
train at an actual batch size of 64.
</Tip>
Alternatively, you can use the option `split_batches=True` when initializing your
[`Accelerator`], in which case the batch size will always stay the same, whether you run your
script on 1, 2, 4 or 64 GPUs.
You should execute this instruction as soon as all objects for training are created, before starting your actual
training loop.
<Tip warning={true}>
You should only pass the learning rate scheduler to [`~Accelerator.prepare`] when the scheduler needs to be stepped
at each optimizer step.
</Tip>
<Tip warning={true}>
Your training dataloader may change length when going through this method: if you run on X GPUs, it will have its
length divided by X (since your actual batch size will be multiplied by X), unless you set
`split_batches=True`.
</Tip>
Any instruction using your training dataloader length (for instance if you want to log the number of total training
steps) should go after the call to [`~Accelerator.prepare`].
You can perfectly send your dataloader to [`~Accelerator.prepare`] on its own, but it's best to send the
model and optimizer to [`~Accelerator.prepare`] together.
You may or may not want to send your validation dataloader to [`~Accelerator.prepare`], depending on
whether you want to run distributed evaluation or not (see below).
4. Replace the line `loss.backward()` by `accelerator.backward(loss)`.
And you're all set! With all these changes, your script will run on your local machine as well as on multiple GPUs or a
TPU! You can either use your favorite tool to launch the distributed training, or you can use the 🤗 Accelerate
launcher.
## Distributed evaluation
You can perform regular evaluation in your training script, if you leave your validation dataloader out of the
[`~Accelerator.prepare`] method. In this case, you will need to put the input data on the
`accelerator.device` manually.
To perform distributed evaluation, send along your validation dataloader to the [`~Accelerator.prepare`]
method:
```python
validation_dataloader = accelerator.prepare(validation_dataloader)
```
Like for your training dataloader, it will mean that (should you run your script on multiple devices) each device will
only see part of the evaluation data. This means you will need to group your predictions together. This is very easy to
do with the [`~Accelerator.gather`] method.
```python
for inputs, targets in validation_dataloader:
    predictions = model(inputs)
    # Gather all predictions and targets
    all_predictions = accelerator.gather(predictions)
    all_targets = accelerator.gather(targets)
    # Example of use with a *Datasets.Metric*
    metric.add_batch(all_predictions, all_targets)
```
<Tip warning={true}>
Like for the training dataloader, passing your validation dataloader through
[`~Accelerator.prepare`] may change its length: if you run on X GPUs, it will have its length divided by X
(since your actual batch size will be multiplied by X), unless you set `split_batches=True`.
Any instruction using your training dataloader length (for instance if you need the number of total training steps
to create a learning rate scheduler) should go after the call to [`~Accelerator.prepare`].
</Tip>
<Tip warning={true}>
The [`~Accelerator.gather`] method requires the tensors to be all the same size on each process. If
you have tensors of different sizes on each process (for instance when dynamically padding to the maximum length in
a batch), you should use the [`~Accelerator.pad_across_processes`] method to pad your tensors to the
biggest size across processes.
</Tip>
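As a minimal sketch, assuming `predictions` and `targets` are 2D tensors padded along their second dimension (the `pad_index=-100` value is just an illustrative choice), this could look like:
```python
for inputs, targets in validation_dataloader:
    predictions = model(inputs).argmax(dim=-1)
    # Pad the per-process tensors to the same length before gathering them
    predictions = accelerator.pad_across_processes(predictions, dim=1, pad_index=-100)
    targets = accelerator.pad_across_processes(targets, dim=1, pad_index=-100)
    all_predictions = accelerator.gather(predictions)
    all_targets = accelerator.gather(targets)
```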
## Launching your distributed script
You can use the regular commands to launch your distributed training (like `torch.distributed.launch` for
PyTorch), they are fully compatible with 🤗 Accelerate. The only caveat here is that 🤗 Accelerate uses the environment
to determine all useful information, so `torch.distributed.launch` should be used with the flag `--use_env`.
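For instance, a possible launch on a single machine with 2 GPUs could look like this (the process count is only an illustrative assumption):
```bash
python -m torch.distributed.launch --use_env --nproc_per_node 2 path_to_script.py --args_for_the_script
```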
🤗 Accelerate also provides a CLI tool that unifies all launchers, so you only have to remember one command. To use it,
just run
```bash
accelerate config
```
on your machine and reply to the questions asked. This will save a *default_config.yaml* file in your cache folder for
🤗 Accelerate. That cache folder is (with decreasing order of priority):
- The content of your environment variable `HF_HOME` suffixed with *accelerate*.
- If it does not exist, the content of your environment variable `XDG_CACHE_HOME` suffixed with
*huggingface/accelerate*.
- If this does not exist either, the folder *~/.cache/huggingface/accelerate*
You can also specify with the flag `--config_file` the location of the file you want to save.
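For example, to write the config to a custom location (the path is illustrative):
```bash
accelerate config --config_file path_to_config.yaml
```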
Once this is done, you can test everything is going well on your setup by running
```bash
accelerate test
```
This will launch a short script that will test the distributed environment. If it runs fine, you are ready for the next
step!
Note that if you specified a location for the config file in the previous step, you need to pass it here as well:
```bash
accelerate test --config_file path_to_config.yaml
```
Now that this is done, you can run your script with the following command:
```bash
accelerate launch path_to_script.py --args_for_the_script
```
If you stored the config file in a non-default location, you can indicate it to the launcher like this:
```bash
accelerate launch --config_file path_to_config.yaml path_to_script.py --args_for_the_script
```
You can also override any of the arguments determined by your config file, see TODO: insert ref here.
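As a sketch, forcing a multi-GPU run on 2 processes regardless of what the config file says could look like this (the flag values are illustrative):
```bash
accelerate launch --multi_gpu --num_processes 2 path_to_script.py --args_for_the_script
```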
## Launching training from a notebook
In Accelerate 0.3.0, a new [`notebook_launcher`] has been introduced to help you launch your training
function from a notebook. This launcher supports launching a training with TPUs on Colab or Kaggle, as well as training
on several GPUs (if the machine on which you are running your notebook has them).
Just define a function responsible for your whole training and/or evaluation in a cell of the notebook, then execute a
cell with the following code:
```python
from accelerate import notebook_launcher
notebook_launcher(training_function)
```
<Tip warning={true}>
Your `Accelerator` object should only be defined inside the training function. This is because the
initialization should be done inside the launcher only.
</Tip>
## Training on TPU
If you want to launch your script on TPUs, there are a few caveats you should be aware of. Behind the scenes, the TPUs
will create a graph of all the operations happening in your training step (forward pass, backward pass and optimizer
step). This is why your first step of training will always be very long as building and compiling this graph for
optimizations takes some time.
The good news is that this compilation will be cached so the second step and all the following will be much faster. The
bad news is that it only applies if all of your steps do exactly the same operations, which implies:
- having all tensors of the same length in all your batches
- having static code (i.e., no for loop whose length could change from step to step)
Having any of the things above change between two steps will trigger a new compilation which will, once again, take a
lot of time. In practice, that means you must take special care to have all the tensors in your inputs of the same
shape (so no dynamic padding for instance if you are in an NLP problem) and should not use layers with for loops that
have different lengths depending on the inputs (such as an LSTM) or the training will be excruciatingly slow.
To introduce special behavior in your script for TPUs you can check the `distributed_type` of your
`accelerator`:
```python docstyle-ignore
from accelerate import DistributedType

if accelerator.distributed_type == DistributedType.TPU:
    # do something of static shape
else:
    # go crazy and be dynamic
```
The [NLP example](https://github.com/huggingface/accelerate/blob/main/examples/nlp_example.py) shows an example of such a
situation, with dynamic padding.
One last thing to pay close attention to: if your model has tied weights (such as language models which tie the weights
of the embedding matrix with the weights of the decoder), moving this model to the TPU (either yourself or after you
passed your model to [`~Accelerator.prepare`]) will break the tying. You will need to retie the weights
after. You can find an example of this in the [run_clm_no_trainer](https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm.py) script in
the Transformers repository.
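A minimal sketch of re-tying, assuming a 🤗 Transformers model (which exposes a `tie_weights` method), could look like this:
```python
model = accelerator.prepare(model)
# On TPU, the tied weights were disconnected when the model was moved, so restore the ties
if accelerator.distributed_type == DistributedType.TPU:
    model.tie_weights()
```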
## Other caveats
We list here all smaller issues you could have in your script conversion and how to resolve them.
### Execute a statement only on one process
Some of your instructions only need to run for one process on a given server: for instance a data download or a log
statement. To do this, wrap the statement in a test like this:
```python docstyle-ignore
if accelerator.is_local_main_process:
    # Is executed once per server
```
Another example is progress bars: to avoid having multiple progress bars in your output, you should only display one on
the local main process:
```python
from tqdm.auto import tqdm
progress_bar = tqdm(range(args.max_train_steps), disable=not accelerator.is_local_main_process)
```
The *local* means per machine: if you are running your training on two servers with several GPUs, the instruction will
be executed once on each of those servers. If you need to execute something only once for all processes (and not once
per machine), for instance uploading the final model to the 🤗 model hub, wrap it in a test like this:
```python docstyle-ignore
if accelerator.is_main_process:
    # Is executed once only
```
For printing statements you only want executed once per machine, you can just replace the `print` function by
`accelerator.print`.
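For instance (assuming `epoch` and `loss` exist in your loop), this prints once per machine instead of once per process:
```python
accelerator.print(f"epoch {epoch}: loss {loss.item():.4f}")
```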
### Defer execution
When you run your usual script, instructions are executed in order. Using 🤗 Accelerate to deploy your script on several
GPUs at the same time introduces a complication: while each process executes all instructions in order, some may be
faster than others.
You might need to wait for all processes to have reached a certain point before executing a given instruction. For
instance, you shouldn't save a model before being sure every process is done with training. To do this, just write the
following line in your code:
```
accelerator.wait_for_everyone()
```
This instruction will block all the processes that arrive there first until all the other processes have reached that
point (if you run your script on just one GPU or CPU, this won't do anything).
### Saving/loading a model
Saving the model you trained might need a bit of adjustment: first you should wait for all processes to reach that
point in the script as shown above, and then, you should unwrap your model before saving it. This is because when going
through the [`~Accelerator.prepare`] method, your model may have been placed inside a bigger model,
which deals with the distributed training. This in turn means that saving your model state dictionary without taking
any precaution will take that potential extra layer into account, and you will end up with weights you can't load back
in your base model.
This is why it's recommended to *unwrap* your model first. Here is an example:
```
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
accelerator.save(unwrapped_model.state_dict(), filename)
```
If your script contains logic to load a checkpoint, we also recommend you load your weights in the unwrapped model
(this is only useful if you use the load function after making your model go through
[`~Accelerator.prepare`]). Here is an example:
```
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.load_state_dict(torch.load(filename))
```
Note that since all the model parameters are references to tensors, this will load your weights inside `model`.
### Saving/loading entire states
When training your model, you may want to save the current state of the model, optimizer, random generators, and potentially LR schedulers to be restored in the _same script_.
You can use `accelerator.save_state` and `accelerator.load_state` respectively to do so, simply by passing in a save location.
If you have registered any other stateful items to be stored through `accelerator.register_for_checkpointing`, they will also be saved and/or loaded.
<Tip>
Every object passed to `register_for_checkpointing` must have a `load_state_dict` and `state_dict` function to be stored
</Tip>
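A minimal sketch, assuming a hypothetical custom `my_scheduler` object and an illustrative checkpoint path, could look like this:
```python
# Register any extra stateful object (must expose `state_dict`/`load_state_dict`)
accelerator.register_for_checkpointing(my_scheduler)

accelerator.wait_for_everyone()
# Save model, optimizer, RNG states and all registered objects
accelerator.save_state("checkpoints/step_1000")

# Later, in the same script, restore everything
accelerator.load_state("checkpoints/step_1000")
```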
### Gradient clipping
If you are using gradient clipping in your script, you should replace the calls to
`torch.nn.utils.clip_grad_norm_` or `torch.nn.utils.clip_grad_value_` with `accelerator.clip_grad_norm_`
and `accelerator.clip_grad_value_` respectively.
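For example, a sketch of clipping by norm inside the training loop (the `max_norm` value is just an illustrative choice):
```python
accelerator.backward(loss)
# Replaces torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```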
### Mixed Precision training
If you are running your training in Mixed Precision with Accelerate, you will get the best result with your loss being
computed inside your model (like in Transformer models for instance). Every computation outside of the model will be
executed in full precision (which is generally what you want for loss computation, especially if it involves a
softmax). However, you might want to put your loss computation inside the *accelerator.autocast* context manager:
```
with accelerator.autocast():
    loss = complex_loss_function(outputs, target)
```
Another caveat with Mixed Precision training is that the gradient updates will be skipped a few times at the beginning and
sometimes during training: because of the dynamic loss scaling strategy, there are points during training where the
gradients have overflowed, and the loss scaling factor is reduced to avoid this happening again at the next step.
This means that you may update your learning rate scheduler when there was no optimizer step, which is fine in general, but may
have an impact when you have very little training data, or if the first learning rate values of your scheduler are very
important. In this case, you can skip the learning rate scheduler updates when the optimizer step was not performed,
like this:
```
if not accelerator.optimizer_step_was_skipped:
    lr_scheduler.step()
```
### DeepSpeed
DeepSpeed support is experimental, so the underlying API will evolve in the near future and may have some slight
breaking changes. In particular, 🤗 Accelerate does not yet support a DeepSpeed config you have written yourself; this
will be added in a future version.
<Tip warning={true}>
The [`notebook_launcher`] does not support the DeepSpeed integration yet.
</Tip>
## Internal mechanism
Internally, the library works by first analyzing the environment in which the script is launched to determine which
kind of distributed setup is used, how many different processes there are and which one the current script is in. All
that information is stored in the [`~AcceleratorState`].
This class is initialized the first time you instantiate an [`Accelerator`], as well as performing any
specific initialization your distributed setup needs. Its state is then uniquely shared through all instances of
[`~state.AcceleratorState`].
Then, when calling [`~Accelerator.prepare`], the library:
- wraps your model(s) in the container adapted for the distributed setup,
- wraps your optimizer(s) in a [`~optimizer.AcceleratedOptimizer`],
- creates a new version of your dataloader(s) in a [`~data_loader.DataLoaderShard`].
While the model(s) and optimizer(s) are just put in simple wrappers, the dataloader(s) are re-created. This is mostly
because PyTorch does not let the user change the `batch_sampler` of a dataloader once it's been created and the
library handles the sharding of your data between processes by changing that `batch_sampler` to yield every other
`num_processes` batches.
The [`~data_loader.DataLoaderShard`] subclasses `DataLoader` to add the following functionality:
- it synchronizes the appropriate random number generator of all processes at each new iteration, to ensure any
randomization (like shuffling) is done the exact same way across processes.
- it puts the batches on the proper device before yielding them (unless you have opted out of
`device_placement=True`).
The random number generator synchronization will by default synchronize:
- the `generator` attribute of a given sampler (like the PyTorch `RandomSampler`) for PyTorch >= 1.6
- the main random number generator in PyTorch <=1.5.1
You can choose which random number generator(s) to synchronize with the `rng_types` argument of the main
[`Accelerator`]. In PyTorch >= 1.6, it is recommended to rely on local `generator` to avoid
setting the same seed in the main random number generator in all processes.
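As a sketch, restricting synchronization to the samplers' local generators could look like this:
```python
from accelerate import Accelerator

# Only synchronize the local `generator` of the samplers (recommended on PyTorch >= 1.6)
accelerator = Accelerator(rng_types=["generator"])
```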
<Tip warning={true}>
Synchronizing the main torch (or CUDA or XLA) random number generator will affect any other potential random
artifacts you could have in your dataset (like random data augmentation) in the sense that all processes will get the
same random numbers from the torch random modules (so they will apply the same random data augmentation if it's
controlled by torch).
</Tip>
<Tip>
The randomization part of your custom sampler, batch sampler or iterable dataset should be done using a local
`torch.Generator` object (in PyTorch >= 1.6); see the traditional `RandomSampler` as an example.
</Tip>
See more details about the internals in the [Internals page](internal).


@ -1,359 +0,0 @@
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Quick tour
=======================================================================================================================
Let's have a look at a look at 🤗 Accelerate main features and traps to avoid.
Main use
-----------------------------------------------------------------------------------------------------------------------
To use 🤗 Accelerate in your own script, you have to change four things:
1. Import the :class:`~accelerate.Accelerator` main class instantiate one in an :obj:`accelerator` object:
.. code-block:: python
from accelerate import Accelerator
accelerator = Accelerator()
This should happen as early as possible in your training script as it will initialize everything necessary for
distributed training. You don't need to indicate the kind of environment you are in (just one machine with a GPU, one
match with several GPUs, several machines with multiple GPUs or a TPU), the library will detect this automatically.
2. Remove the call :obj:`.to(device)` or :obj:`.cuda()` for your model and input data. The :obj:`accelerator` object
will handle this for you and place all those objects on the right device for you. If you know what you're doing, you
can leave those :obj:`.to(device)` calls but you should use the device provided by the :obj:`accelerator` object:
:obj:`accelerator.device`.
To fully deactivate the automatic device placement, pass along :obj:`device_placement=False` when initializing your
:class:`~accelerate.Accelerator`.
.. Warning::
If you place your objects manually on the proper device, be careful to create your optimizer after putting your
model on :obj:`accelerator.device` or your training will fail on TPU.
3. Pass all objects relevant to training (optimizer, model, training dataloader) to the
:meth:`~accelerate.Accelerator.prepare` method. This will make sure everything is ready for training.
.. code-block:: python
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)
In particular, your training dataloader will be sharded accross all GPUs/TPU cores available so that each one sees a
different portion of the training dataset. Also, the random states of all processes will be synchronized at the
beginning of each iteration through your dataloader, to make sure the data is shuffled the same way (if you decided to
use :obj:`shuffle=True` or any kind of random sampler).
.. Note::
The actual batch size for your training will be the number of devices used multiplied by the batch size you set in
your script: for instance training on 4 GPUs with a batch size of 16 set when creating the training dataloader will
train at an actual batch size of 64.
Alternatively, you can use the option :obj:`split_batches=True` when creating initializing your
:class:`~accelerate.Accelerator`, in which case the batch size will always stay the same, whether your run your
script on 1, 2, 4 or 64 GPUs.
You should execute this instruction as soon as all objects for training are created, before starting your actual
training loop.
.. Warning::
Your training dataloader may change length when going through this method: if you run on X GPUs, it will have its
length divided by X (since your actual batch size will be multiplied by X), unless you set
:obj:`split_batches=True`.
Any instruction using your training dataloader length (for instance if you need the number of total training steps
to create a learning rate scheduler) should go after the call to :meth:`~accelerate.Accelerator.prepare`.
You can perfectly send your dataloader to :meth:`~accelerate.Accelerator.prepare` on its own, but it's best to send the
model and optimizer to :meth:`~accelerate.Accelerator.prepare` together.
You may or may not want to send your validation dataloader to :meth:`~accelerate.Accelerator.prepare`, depending on
whether you want to run distributed evaluation or not (see below).
4. Replace the line :obj:`loss.backward()` by :obj:`accelerator.backward(loss)`.
And you're all set! With all these changes, your script will run on your local machine as well as on multiple GPUs or a
TPU! You can either use your favorite tool to launch the distributed training, or you can use the 🤗 Accelerate
launcher.
Distributed evaluation
-----------------------------------------------------------------------------------------------------------------------
You can perform regular evaluation in your training script, if you leave your validation dataloader out of the
:meth:`~accelerate.Accelerator.prepare` method. In this case, you will need to put the input data on the
:obj:`accelerator.device` manually.
To perform distributed evaluation, send along your validation dataloader to the :meth:`~accelerate.Accelerator.prepare`
method:
.. code-block:: python
validation_dataloader = accelerator.prepare(validation_dataloader)
Like for your training dataloader, it will mean that (should you run your script on multiple devices) each device will
only see part of the evaluation data. This means you will need to group your predictions together. This is very easy to
do with the :meth:`~accelerate.Accelerator.gather` method.
.. code-block:: python
for inputs, targets in validation_dataloader:
predictions = model(inputs)
# Gather all predictions and targets
all_predictions = accelerator.gather(predictions)
all_targets = accelerator.gather(targets)
# Example of use with a `Datasets.Metric`
metric.add_batch(all_predictions, all_targets)
.. Warning::
Like for the training dataloader, passing your validation dataloader through
:meth:`~accelerate.Accelerator.prepare` may change its: if you run on X GPUs, it will have its length divided by X
(since your actual batch size will be multiplied by X), unless you set :obj:`split_batches=True`.
Any instruction using your training dataloader length (for instance if you need the number of total training steps
to create a learning rate scheduler) should go after the call to :meth:`~accelerate.Accelerator.prepare`.
Launching your distributed script
-----------------------------------------------------------------------------------------------------------------------
You can use the regular commands to launch your distributed training (like :obj:`torch.distributed.launch` for
PyTorch), they are fully compatible with 🤗 Accelerate. The only caveat here is that 🤗 Accelerate uses the environment
to determine all useful information, so :obj:`torch.distributed.launch` should be used with the flag :obj:`--use_env`.
🤗 Accelerate also provides a CLI tool that unifies all launcher, so you only have to remember one command. To use it,
just run
.. code-block:: bash
accelerate config
on your machine and reply to the questions asked. This will save a `default_config.json` file in your cache folder for
🤗 Accelerate. That cache folder is (with decreasing order of priority):
- The content of your environment variable ``HF_HOME`` suffixed with `accelerate`.
- If it does not exist, the content of your environment variable ``XDG_CACHE_HOME`` suffixed with
`huggingface/accelerate`.
- If this does not exist either, the folder `~/.cache/huggingface/accelerate`
You can also specify with the flag :obj:`--config_file` the location of the file you want to save.
Once this is done, you can test everything is going well on your setup by running
.. code-block:: bash
accelerate test
This will launch a short script that will test the distributed environment. If it runs fine, you are ready for the next
step!
Note that if you specified a location for the config file in the previous step, you need to pass it here as well:
.. code-block:: bash
accelerate test --config_file path_to_config.json
Now that this is done, you can run your script with the following command:
.. code-block:: bash
accelerate launch path_to_script.py --args_for_the_script
If you stored the config file in a non-default location, you can indicate it to the launcher like his:
.. code-block:: bash
accelerate launch --config_file path_to_config.json path_to_script.py --args_for_the_script
You can also override any of the arguments determined by your config file, see TODO: insert ref here.
Training on TPU
-----------------------------------------------------------------------------------------------------------------------
If you want to launch your script on TPUs, there are a few caveats you should be aware of. Behind the scenes, the TPUs
will create a graph of all the operations happening im your training step (forward pass, backward pass and optimizer
step). This is why your first step of training will always be very long as building and compiling this graph for
optimizations takes some time.
The good news is that this compilation will be cached so the second step and all the following will be much faster. The
bas news is that it only applies if all of your steps do exactly the same operations, which implies:
- having all tensors of the same length in all your lenghts
- having static code (i.e., not a foor loop of length that could change from step to step)
Having any of the things above change between two steps will trigger a new compilation which will, once again, take a
lof of time. In practice, that means you must take special care to have all your tensors in your inputs of the same
shape (so no dynamic padding for instance if you are in an NLP problem) and should not use layer with for loops that
have different lengths depending on the inputs (such as an LSTM) or the training will be excruciatingly slow.
To introduce special behavior in your script for TPUs you can check the :obj:`distributed_type` of your :obj:`accelerator`:
.. code-block:: python
from accelerate import DistributedType
if accelerator.distributed_type == DistributedType.TPU:
# do something of static shape
else:
# go crazy and be dynamic
The `NLP example <https://github.com/huggingface/accelerate/blob/main/examples/nlp_example.py>`__ shows an example in
situation with dynamic padding.
Other caveats
-----------------------------------------------------------------------------------------------------------------------
We list here all smaller issues you could have in your script conversion and how to resolve them.
Execute a statement only on one processes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some of your instructions only need to run for one process on a given server: for instance a data download or a log
statement. To do this, wrap the statement in a test like this:
.. code-block:: python
if accelerator.is_local_main_process:
# Is executed once per server
Another example is progress bars: to avoid having multiple progress bars in your output, you should only display one on
the local main process:
.. code-block:: python
from tqdm.auto import tqdm
progress_bar = tqdm(range(args.max_train_steps), disable=not accelerator.is_local_main_process)
The `local` means per machine: if you are running your training on two servers with several GPUs, the instruction will
be executed once on each of those servers. If you need to execute something only once for all processes (and not per
machine) for instance, uploading the final model to the 🤗 model hub, wrap it in a test like this:
.. code-block:: python
if accelerator.is_main_process:
# Is executed once only
For printing statements you only want executed once per machine, you can just replace the :obj:`print` function by
:obj:`accelerator.print`.
Defer execution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When you run your usual script, instructions are executed in order. Using 🤗 Accelerate to deploy your script on several
GPUs at the same time introduces a complication: while each process executes all instructions in order, some may be
faster than others.
You might need to wait for all processes to have reached a certain point before executing a given instruction. For
instance, you shouldn't save a model before being sure every process is done with training. To do this, just write the
following line in your code:
.. code-block::
accelerator.wait_for_everyone()
This instruction will block all the processes that arrive them first until all the other processes have reached that
point (if you run your script on just one GPU or CPU, this wont' do anything).
Saving/loading a model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Saving the model you trained might need a bit of adjustment: first you should wait for all processes to reach that
point in the script as shown above, and then, you should unwrap your model before saving it. This is because when going
through the :meth:`~accelerate.Accelerator.prepare` method, your model may have been placed inside a bigger model,
which deals with the distributed training. This in turn means that saving your model state dictionary without taking
any precaution will take that potential extra layer into account, and you will end up with weights you can't load back
in your base model.
This is why it's recommended to `unwrap` your model first. Here is an example:
.. code-block::
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
accelerator.save(unwrapped_model.state_dict(), filename)
If your script contains a logic to load checkpoint, we also recommend you load your weights in the unwrapped model
(this is only useful if you use the load function after making your model go through
:meth:`~accelerate.Accelerator.prepare`). Here is an example:
.. code-block::
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.load_state_dict(torch.load(filename))
Note that since all the model parameters are references to tensors, this will load your weights inside :obj:`model`.
Gradient clipping
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you are using gradient clipping in your script, you should replace the calls to
:obj:`torch.nn.utils.clip_grad_norm_` or :obj:`torch.nn.utils.clip_grad_value_` with :obj:`accelerator.clip_grad_norm_`
and :obj:`accelerator.clip_grad_value_` respectively.
Internal mechanism
-----------------------------------------------------------------------------------------------------------------------
Internally, the library works by first analyzing the environment in which the script is launched to determine which
kind of distributed setup is used, how many different processes there are and which one the current script is in. All
that information is stored in the :class:`~accelerate.state.AcceleratorState`.
This class is initialized the first time you instantiate a :class:`~accelerate.Accelerator` as well as performing any
specific initialization your distributed setup needs. Its state is then uniquely shared through all instances of
:class:`~accelerate.state.AcceleratorState`.
Then, when calling :meth:`~accelerate.Accelerator.prepare`, the library:
- wraps your model(s) in the container adapted for the distributed setup,
- wraps your optimizer(s) in a :class:`~accelerate.optimizer.AcceleratedOptimizer`,
- creates a new version of your dataloader(s) in a :class:`~accelerate.data_loader.DataLoaderShard`.
While the model(s) and optimizer(s) are just put in simple wrappers, the dataloader(s) are re-created. This is mostly
because PyTorch does not let the user change the :obj:`batch_sampler` of a dataloader once it's been created and the
library handles the sharding of your data between processes by changing that :obj:`batch_sampler` to yield every other
:obj:`num_processes` batches.
The :class:`~accelerate.data_loader.DataLoaderShard` subclasses :obj:`DataLoader` to add the following functionality:
- it synchronizes the torch random number generators of all processes at each new iteration, to ensure any
randomization (like shuffling) is done the exact same way across processes.
- it puts the batches on the proper device before yielding them (unless you have opted out of
:obj:`device_placement=True`).
.. Warning::
The random number generator synchronization will affect any other potential random artifacts you could have in your
dataset (like random data augmentation) in the sense all processes will get the same random numbers from the torch
random modules (so will apply the same random data augmentation if it's controlled by torch). While this is usually
fine, you should use the random number generator from the Python :obj:`random` module or NumPy for your data
augmentation if you think this will be a problem.
The randomization part of your sampler on the other hand should absolutely be done using the torch random number
generator (like in the traditional :obj:`RandomSampler`).
See more details about the internal in the :doc:`Internals page <internal>`.

docs/source/sagemaker.mdx Normal file

@ -0,0 +1,150 @@
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Amazon SageMaker
Hugging Face and Amazon introduced new [Hugging Face Deep Learning Containers (DLCs)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-training-containers) to
make it easier than ever to train Hugging Face Transformer models in [Amazon SageMaker](https://aws.amazon.com/sagemaker/).
## Getting Started
### Setup & Installation
Before you can run your 🤗 Accelerate scripts on Amazon SageMaker you need to sign up for an AWS account. If you do not
have an AWS account yet learn more [here](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-set-up.html).
After you have your AWS account, you need to install the `sagemaker` SDK for 🤗 Accelerate with:
```bash
pip install "accelerate[sagemaker]" --upgrade
```
🤗 Accelerate currently uses the 🤗 DLCs, with `transformers`, `datasets` and `tokenizers` pre-installed. 🤗
Accelerate is not in the DLC yet (it will soon be added!), so to use it within Amazon SageMaker you need to create a
`requirements.txt` in the same directory where your training script is located and add it as a dependency.
```
accelerate
```
You should also add any other dependencies you have to this `requirements.txt`.
### Configure 🤗 Accelerate
You can configure the launch configuration for Amazon SageMaker the same as you do for non SageMaker training jobs with
the 🤗 Accelerate CLI.
```bash
accelerate config
# In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 1
```
🤗 Accelerate will go through a questionnaire about your Amazon SageMaker setup and create a config file you can edit.
<Tip>
🤗 Accelerate is not saving any of your credentials.
</Tip>
### Prepare a 🤗 Accelerate fine-tuning script
The training script is very similar to a training script you might run outside of SageMaker, but to save your model
after training you need to specify either `/opt/ml/model` or use `os.environ["SM_MODEL_DIR"]` as your save
directory. After training, artifacts in this directory are uploaded to S3.
```diff
- torch.save('/opt/ml/model')
+ accelerator.save('/opt/ml/model')
```
<Tip warning={true}>
SageMaker doesn't support argparse actions. If you want to use, for example, boolean hyperparameters, you need to
specify type as bool in your script and provide an explicit True or False value for this hyperparameter. [[REF]](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#prepare-a-pytorch-training-script).
</Tip>
### Launch Training
You can launch your training with the 🤗 Accelerate CLI with:
```bash
accelerate launch path_to_script.py --args_to_the_script
```
This will launch your training script using your configuration. The only thing you have to do is provide all the
arguments needed by your training script as named arguments.
**Examples**
<Tip>
If you run one of the example scripts, don't forget to add `accelerator.save('/opt/ml/model')` to it.
</Tip>
```bash
accelerate launch ./examples/sagemaker_example.py
```
Outputs:
```
Configuring Amazon SageMaker environment
Converting Arguments to Hyperparameters
Creating Estimator
2021-04-08 11:56:50 Starting - Starting the training job...
2021-04-08 11:57:13 Starting - Launching requested ML instancesProfilerReport-1617883008: InProgress
.........
2021-04-08 11:58:54 Starting - Preparing the instances for training.........
2021-04-08 12:00:24 Downloading - Downloading input data
2021-04-08 12:00:24 Training - Downloading the training image..................
2021-04-08 12:03:39 Training - Training image download completed. Training in progress..
........
epoch 0: {'accuracy': 0.7598039215686274, 'f1': 0.8178438661710037}
epoch 1: {'accuracy': 0.8357843137254902, 'f1': 0.882249560632689}
epoch 2: {'accuracy': 0.8406862745098039, 'f1': 0.8869565217391304}
........
2021-04-08 12:05:40 Uploading - Uploading generated training model
2021-04-08 12:05:40 Completed - Training job completed
Training seconds: 331
Billable seconds: 331
You can find your model data at: s3://your-bucket/accelerate-sagemaker-1-2021-04-08-11-56-47-108/output/model.tar.gz
```
## Advanced Features
### Distributed Training: Data Parallelism
*currently in development, will be supported soon.*
### Distributed Training: Model Parallelism
*currently in development, will be supported soon.*
### Python packages and dependencies
🤗 Accelerate currently uses the 🤗 DLCs, with `transformers`, `datasets` and `tokenizers` pre-installed. If you
want to use different/other Python packages you can do this by adding them to the `requirements.txt`. These packages
will be installed before your training script is started.
### Remote scripts: Use scripts located on Github
*undecided if feature is needed. Contact us if you would like this feature.*
### Use Spot Instances
*undecided if feature is needed. Contact us if you would like this feature.*

docs/source/tracking.mdx Normal file

@ -0,0 +1,163 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Tracking
There are a large number of experiment tracking APIs available, however getting them all to work in a multi-processing environment can oftentimes be complex.
Accelerate provides a general tracking API that can be used to log useful items during your script through [`~Accelerator.log`].
## Integrated Trackers
Currently `Accelerate` supports three trackers out-of-the-box:
[[autodoc]] tracking.TensorBoardTracker
[[autodoc]] tracking.WandBTracker
[[autodoc]] tracking.CometMLTracker
To use any of them, pass in the selected type(s) to the `log_with` parameter in [`Accelerator`]:
```python
from accelerate import Accelerator
from accelerate.utils import LoggerType
accelerator = Accelerator(log_with="all") # For all available trackers in the environment
accelerator = Accelerator(log_with="wandb")
accelerator = Accelerator(log_with=["wandb", LoggerType.TENSORBOARD])
```
At the start of your experiment [`~Accelerator.init_trackers`] should be used to setup your project, and potentially add any experiment hyperparameters to be logged:
```python
hps = {"num_iterations": 5, "learning_rate": 1e-2}
accelerator.init_trackers("my_project", config=hps)
```
When you are ready to log any data, [`~Accelerator.log`] should be used.
A `step` can also be passed in to correlate the data with a particular step in the training loop.
```python
accelerator.log({"train_loss": 1.12, "valid_loss": 0.8}, step=1)
```
Once you've finished training, make sure to run [`~Accelerator.end_training`] so that all the trackers can run their finish functionalities if they have any.
```python
accelerator.end_training()
```
A full example is below:
```python
from accelerate import Accelerator
accelerator = Accelerator(log_with="all")
config = {
    "num_iterations": 5,
    "learning_rate": 1e-2,
    "loss_function": str(my_loss_function),
}

accelerator.init_trackers("example_project", config=config)

my_model, my_optimizer, my_training_dataloader = accelerator.prepare(my_model, my_optimizer, my_training_dataloader)
device = accelerator.device
my_model.to(device)

for iteration in range(config["num_iterations"]):
    for step, batch in enumerate(my_training_dataloader):
        my_optimizer.zero_grad()
        inputs, targets = batch
        inputs = inputs.to(device)
        targets = targets.to(device)
        outputs = my_model(inputs)
        loss = my_loss_function(outputs, targets)
        accelerator.backward(loss)
        my_optimizer.step()
        accelerator.log({"training_loss": loss}, step=step)
accelerator.end_training()
```
## Implementing Custom Trackers
To implement a new tracker to be used in `Accelerator`, a new one can be made by subclassing the [`~GeneralTracker`] class.
Every tracker must implement three functions:
- `__init__`:
- Should store a `run_name` and initialize the tracker API of the integrated library.
- If a tracker stores its data locally (such as TensorBoard), a `logging_dir` parameter can be added.
- `store_init_configuration`:
- Should take in a `values` dictionary and store them as a one-time experiment configuration
- `log`:
- Should take in a `values` dictionary and a `step`, and should log them to the run
A brief example can be seen below of an integration with Weights and Biases, containing only the relevant information:
```python
from accelerate.tracking import GeneralTracker
from typing import Optional
import wandb
class MyCustomTracker(GeneralTracker):
    def __init__(self, run_name: str):
        self.run_name = run_name
        wandb.init(project=self.run_name)

    def store_init_configuration(self, values: dict):
        wandb.config.update(values)

    def log(self, values: dict, step: Optional[int] = None):
        wandb.log(values, step=step)
```
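As noted above, a tracker that writes its data locally can also accept a `logging_dir`. Below is a minimal, hypothetical sketch built on `torch.utils.tensorboard.SummaryWriter` (distinct from the built-in `TensorBoardTracker`, and assuming `tensorboard` is installed), just to show where such a parameter fits:
```python
import os
from typing import Optional

from torch.utils.tensorboard import SummaryWriter

from accelerate.tracking import GeneralTracker


class MyLocalTracker(GeneralTracker):
    def __init__(self, run_name: str, logging_dir: str = "runs"):
        self.run_name = run_name
        # Everything is written under logging_dir/run_name
        self.writer = SummaryWriter(log_dir=os.path.join(logging_dir, run_name))

    def store_init_configuration(self, values: dict):
        # Store the hyperparameters once, as plain text
        self.writer.add_text("config", str(values))

    def log(self, values: dict, step: Optional[int] = None):
        for name, value in values.items():
            self.writer.add_scalar(name, value, global_step=step)
        self.writer.flush()
```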
When you are ready to build your `Accelerator` object, pass in an **instance** of your tracker to the `log_with` parameter of [`Accelerator`] so it is automatically used with the API:
```python
tracker = MyCustomTracker("some_run_name")
accelerator = Accelerator(log_with=tracker)
```
These can also be mixed with existing trackers, including with `"all"`:
```python
tracker = MyCustomTracker("some_run_name")
accelerator = Accelerator(log_with=[tracker, "all"])
```
## When a wrapper cannot work
If a library has an API that does not follow a strict `.log` call with an overall dictionary, such as Neptune.AI, logging can be done manually under an `if accelerator.is_main_process` statement:
```diff
from accelerate import Accelerator
+ import neptune.new as neptune

accelerator = Accelerator()
+ run = neptune.init(...)

my_model, my_optimizer, my_training_dataloader = accelerator.prepare(my_model, my_optimizer, my_training_dataloader)
device = accelerator.device
my_model.to(device)

for iteration in range(config["num_iterations"]):
    for batch in my_training_dataloader:
        my_optimizer.zero_grad()
        inputs, targets = batch
        inputs = inputs.to(device)
        targets = targets.to(device)
        outputs = my_model(inputs)
        loss = my_loss_function(outputs, targets)
        total_loss += loss
        accelerator.backward(loss)
        my_optimizer.step()
+       if accelerator.is_main_process:
+           run["logs/training/batch/loss"].log(loss)
```


@ -18,11 +18,17 @@ limitations under the License.
## Simple NLP example
The [nlp_example.py](./nlp_example.py) script is a simple example to train a Bert model on a classification task ([GLUE's MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398)).
Prior to running it you should install 🤗 Datasets and 🤗 Transformers:
```bash
pip install datasets transformers
```
The same script can be run in any of the following configurations:
- single CPU or single GPU
- multi GPUs (using PyTorch distributed mode)
- (multi) TPUs
- fp16 (mixed-precision) or fp32 (normal precision)
@ -51,8 +57,8 @@ To run it in each of these various modes, use the following commands:
```
* from any server with Accelerate launcher
```bash
accelerate launch --fp16 ./nlp_example.py
```
- multi GPUs (using PyTorch distributed mode)
* With Accelerate config and launcher
```bash
accelerate config # This will create a config file on your server
@ -89,3 +95,111 @@ To run it in each of these various modes, use the following commands:
```
* In PyTorch:
Add an `xmp.spawn` line in your script as you usually do.
## Simple vision example
The [cv_example.py](./cv_example.py) script is a simple example to fine-tune a ResNet-50 on a classification task ([Oxford-IIIT Pet Dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/)).
The same script can be run in any of the following configurations:
- single CPU or single GPU
- multi GPUs (using PyTorch distributed mode)
- (multi) TPUs
- fp16 (mixed-precision) or fp32 (normal precision)
Prior to running it you should install timm and torchvision:
```bash
pip install timm torchvision
```
and you should download the data with the following commands:
```bash
wget https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz
tar -xzf images.tar.gz
```
To run it in each of these various modes, use the following commands:
- single CPU:
* from a server without GPU
```bash
python ./cv_example.py --data_dir path_to_data
```
* from any server by passing `cpu=True` to the `Accelerator`.
```bash
python ./cv_example.py --data_dir path_to_data --cpu
```
* from any server with Accelerate launcher
```bash
accelerate launch --cpu ./cv_example.py --data_dir path_to_data
```
- single GPU:
```bash
python ./cv_example.py --data_dir path_to_data # from a server with a GPU
```
- with fp16 (mixed-precision)
* from any server by passing `fp16=True` to the `Accelerator`.
```bash
python ./cv_example.py --data_dir path_to_data --fp16
```
* from any server with Accelerate launcher
```bash
accelerate launch --fp16 ./cv_example.py --data_dir path_to_data
```
- multi GPUs (using PyTorch distributed mode)
* With Accelerate config and launcher
```bash
accelerate config # This will create a config file on your server
accelerate launch ./cv_example.py --data_dir path_to_data # This will run the script on your server
```
* With traditional PyTorch launcher
```bash
python -m torch.distributed.launch --nproc_per_node 2 --use_env ./cv_example.py --data_dir path_to_data
```
- multi GPUs, multi node (several machines, using PyTorch distributed mode)
* With Accelerate config and launcher, on each machine:
```bash
accelerate config # This will create a config file on each server
accelerate launch ./cv_example.py --data_dir path_to_data # This will run the script on each server
```
* With PyTorch launcher only
```bash
python -m torch.distributed.launch --nproc_per_node 2 \
--use_env \
--node_rank 0 \
--master_addr master_node_ip_address \
./cv_example.py --data_dir path_to_data # On the first server
python -m torch.distributed.launch --nproc_per_node 2 \
--use_env \
--node_rank 1 \
--master_addr master_node_ip_address \
./cv_example.py --data_dir path_to_data # On the second server
```
- (multi) TPUs
* With Accelerate config and launcher
```bash
accelerate config # This will create a config file on your TPU server
accelerate launch ./cv_example.py --data_dir path_to_data # This will run the script on each server
```
* In PyTorch:
Add an `xmp.spawn` line in your script as you usually do.
## Finer Examples
While the first two scripts are extremely barebones when it comes to what you can do with Accelerate, more advanced features are documented in two other locations.
### `by_feature` examples
These scripts are *individual* examples highlighting one particular feature or use-case within Accelerate. They all stem from the [nlp_example.py](./nlp_example.py) script, and any changes or modifications are denoted with a `# New Code #` comment.
Read the README.md file located in the `by_feature` folder for more information.
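As a purely illustrative, hypothetical sketch of the marker convention (not code from any particular script), an added feature sits right below the comment so it stands out from the base training loop:
```python
# New Code #
# Hypothetical addition: save the full training state at the end of each epoch.
# `accelerator` and `epoch` come from the base training loop of `nlp_example.py`.
accelerator.save_state(f"epoch_{epoch}")
```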
### `complete_*` examples
These two scripts contain *every* single feature currently available in Accelerate in one place, as one giant script.
New arguments that can be passed include:
- `checkpointing_steps`, whether the various states should be saved at the end of every `n` steps, or `"epoch"` for each epoch. States are then saved to folders named `step_{n}` or `epoch_{n}`
- `resume_from_checkpoint`, should be used if you want to resume training from a previous run of the script that was given a `checkpointing_steps` value.
- `with_tracking`, should be used if you want to log the training run using all available experiment trackers in your environment. Currently supported trackers include TensorBoard, Weights and Biases, and CometML.
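For example, assuming the complete NLP script keeps the `complete_nlp_example.py` name used in the `by_feature` README and that an earlier run already produced an `epoch_0` checkpoint folder, a run combining these options might look like:
```bash
accelerate launch ./complete_nlp_example.py \
    --checkpointing_steps epoch \
    --resume_from_checkpoint epoch_0 \
    --with_tracking
```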


@ -0,0 +1,68 @@
# What are these scripts?
All scripts in this folder originate from the `nlp_example.py` file, as it is a very simplistic NLP training example using Accelerate with zero extra features.
From there, each further script adds in just **one** feature of Accelerate, showing how you can quickly modify your own scripts to implement these capabilities.
A full example with all of these parts integrated together can be found in the `complete_nlp_example.py` script and `complete_cv_example.py` script.
Adjustments to each script from the base `nlp_example.py` file can be found quickly by searching for "# New Code #"
## Example Scripts by Feature and their Arguments
### Base Example (`../nlp_example.py`)
- Shows how to use `Accelerator` in an extremely simplistic PyTorch training loop
- Arguments available:
- `mixed_precision`, whether to use mixed precision. ("no", "fp16", or "bf16")
- `cpu`, whether to train using only the CPU. (yes/no/1/0)
All following scripts also accept these arguments in addition to their added ones.
These arguments should be added at the end of any method for starting the python script (such as `python`, `accelerate launch`, `python -m torch.distributed.launch`), such as:
```bash
accelerate launch ../nlp_example.py --mixed_precision fp16 --cpu 0
```
### Checkpointing and Resuming Training (`checkpointing.py`)
- Shows how to use `Accelerator.save_state` and `Accelerator.load_state` to save or continue training
- **It is assumed you are continuing off the same training script**
- Arguments available:
- `checkpointing_steps`, after how many steps the various states should be saved. ("epoch", 1, 2, ...)
- `output_dir`, where saved state folders should be saved to, default is current working directory
- `resume_from_checkpoint`, what checkpoint folder to resume from. ("epoch_0", "step_22", ...)
These arguments should be added at the end of any method for starting the python script (such as `python`, `accelerate launch`, `python -m torch.distributed.launch`), such as:
(Note, `resume_from_checkpoint` assumes that we've run the script for one epoch with the `--checkpointing_steps epoch` flag)
```bash
accelerate launch ./checkpointing.py --checkpointing_steps epoch --output_dir "checkpointing_tutorial" --resume_from_checkpoint "checkpointing_tutorial/epoch_0"
```
### Experiment Tracking (`tracking.py`)
- Shows how to use `Accelerator.init_trackers` and `Accelerator.log`
- Can be used with Weights and Biases, TensorBoard, or CometML.
- Arguments available:
- `with_tracking`, whether to load in all available experiment trackers from the environment.
These arguments should be added at the end of any method for starting the python script (such as `python`, `accelerate launch`, `python -m torch.distributed.launch`), such as:
```bash
accelerate launch ./tracking.py --with_tracking
```
### Cross Validation (`cross_validation.py`)
- Shows how to use `Accelerator.free_memory` and run cross validation efficiently with `datasets`.
- Arguments available:
- `num_folds`, the number of folds the training dataset should be split into.
These arguments should be added at the end of any method for starting the python script (such as `python`, `accelerate launch`, `python -m torch.distributed.launch`), such as:
```bash
accelerate launch ./cross_validation.py --num_folds 2
```


@ -0,0 +1,304 @@
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator, DistributedType
from datasets import load_dataset, load_metric
from transformers import (
AdamW,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
set_seed,
)
########################################################################
# This is a fully working simple example to use Accelerate,
# specifically showcasing the checkpointing capability,
# and builds off the `nlp_example.py` script.
#
# This example trains a Bert base model on GLUE MRPC
# in any of the following settings (with the same script):
# - single CPU or single GPU
# - multi GPUS (using PyTorch distributed mode)
# - (multi) TPUs
# - fp16 (mixed-precision) or fp32 (normal precision)
#
# To help focus on the differences in the code, building `DataLoaders`
# was refactored into its own function.
# New additions from the base script can be found quickly by
# looking for the # New Code # tags
#
# To run it in each of these various modes, follow the instructions
# in the readme for examples:
# https://github.com/huggingface/accelerate/tree/main/examples
#
########################################################################
MAX_GPU_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 32
def get_dataloaders(accelerator: Accelerator, batch_size: int = 16):
"""
Creates a set of `DataLoader`s for the `glue` dataset,
using "bert-base-cased" as the tokenizer.
Args:
accelerator (`Accelerator`):
An `Accelerator` object
batch_size (`int`, *optional*):
The batch size for the train and validation DataLoaders.
"""
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
datasets = load_dataset("glue", "mrpc")
def tokenize_function(examples):
# max_length=None => use the model max length (it's actually the default)
outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
return outputs
# Apply the method we just defined to all the examples in all the splits of the dataset
tokenized_datasets = datasets.map(
tokenize_function,
batched=True,
remove_columns=["idx", "sentence1", "sentence2"],
)
# We also rename the 'label' column to 'labels' which is the expected name for labels by the models of the
# transformers library
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
def collate_fn(examples):
# On TPU it's best to pad everything to the same length or training will be very slow.
if accelerator.distributed_type == DistributedType.TPU:
return tokenizer.pad(examples, padding="max_length", max_length=128, return_tensors="pt")
return tokenizer.pad(examples, padding="longest", return_tensors="pt")
# Instantiate dataloaders.
train_dataloader = DataLoader(
tokenized_datasets["train"], shuffle=True, collate_fn=collate_fn, batch_size=batch_size
)
eval_dataloader = DataLoader(
tokenized_datasets["validation"], shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
)
return train_dataloader, eval_dataloader
# For testing only
if os.environ.get("TESTING_MOCKED_DATALOADERS", None) == "1":
from accelerate.test_utils.training import mocked_dataloaders
get_dataloaders = mocked_dataloaders # noqa: F811
def training_function(config, args):
# Initialize accelerator
accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
# New Code #
# Parse out whether we are saving every epoch or after a certain number of batches
if hasattr(args.checkpointing_steps, "isdigit"):
if args.checkpointing_steps == "epoch":
checkpointing_steps = args.checkpointing_steps
elif args.checkpointing_steps.isdigit():
checkpointing_steps = int(args.checkpointing_steps)
else:
raise ValueError(
f"Argument `checkpointing_steps` must be either a number or `epoch`. `{args.checkpointing_steps}` passed."
)
else:
checkpointing_steps = None
set_seed(seed)
train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
metric = load_metric("glue", "mrpc")
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
# Instantiate the model (we build the model here so that the seed also controls new weight initialization)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
# We could avoid this line since the accelerator is set with `device_placement=True` (default value).
# Note that if you are placing tensors on devices manually, this line absolutely needs to be before the optimizer
# creation otherwise training will not work on TPU (`accelerate` will kindly throw an error to make us aware of that).
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
optimizer=optimizer,
num_warmup_steps=100,
num_training_steps=(len(train_dataloader) * num_epochs) // gradient_accumulation_steps,
)
# Prepare everything
# There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
# prepare method.
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
# New Code #
# We need to keep track of how many total steps we have iterated over
overall_step = 0
# We also need to keep track of the starting epoch so files are named properly
starting_epoch = 0
# We need to load the checkpoint back in before training here with `load_state`
# The total number of epochs is adjusted based on where the state is being loaded from,
# as we assume continuation of the same training script
if args.resume_from_checkpoint:
if args.resume_from_checkpoint is not None and args.resume_from_checkpoint != "":
accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}")
accelerator.load_state(args.resume_from_checkpoint)
path = os.path.basename(args.resume_from_checkpoint)
else:
# Get the most recent checkpoint
dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()]
dirs.sort(key=os.path.getctime)
path = dirs[-1] # Sorts folders by date modified, most recent checkpoint is the last
# Extract `epoch_{i}` or `step_{i}`
training_difference = os.path.splitext(path)[0]
if "epoch" in training_difference:
starting_epoch = int(training_difference.replace("epoch_", "")) + 1
resume_step = None
else:
resume_step = int(training_difference.replace("step_", ""))
starting_epoch = resume_step // len(train_dataloader)
resume_step -= starting_epoch * len(train_dataloader)
# Now we train the model
for epoch in range(starting_epoch, num_epochs):
model.train()
for step, batch in enumerate(train_dataloader):
# New Code #
# We need to skip steps until we reach the resumed step during the first epoch
if args.resume_from_checkpoint and epoch == starting_epoch:
if resume_step is not None and step < resume_step:
overall_step += 1
continue
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch.to(accelerator.device)
outputs = model(**batch)
loss = outputs.loss
loss = loss / gradient_accumulation_steps
accelerator.backward(loss)
if step % gradient_accumulation_steps == 0:
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
# New Code #
overall_step += 1
# New Code #
# We save the model, optimizer, lr_scheduler, and seed states by calling `save_state`
# These are saved to folders named `step_{overall_step}`
# Will contain files: "pytorch_model.bin", "optimizer.bin", "scheduler.bin", and "random_states.pkl"
# If mixed precision was used, will also save a "scalar.bin" file
if isinstance(checkpointing_steps, int):
output_dir = f"step_{overall_step}"
if overall_step % checkpointing_steps == 0:
if args.output_dir is not None:
output_dir = os.path.join(args.output_dir, output_dir)
accelerator.save_state(output_dir)
model.eval()
for step, batch in enumerate(eval_dataloader):
# We could avoid this line since we set the accelerator with `device_placement=True` (the default).
batch.to(accelerator.device)
with torch.no_grad():
outputs = model(**batch)
predictions = outputs.logits.argmax(dim=-1)
# It is slightly faster to call this once, than multiple times
predictions, references = accelerator.gather((predictions, batch["labels"]))
metric.add_batch(
predictions=predictions,
references=references,
)
eval_metric = metric.compute()
# Use accelerator.print to print only on the main process.
accelerator.print(f"epoch {epoch}:", eval_metric)
# New Code #
# We save the model, optimizer, lr_scheduler, and seed states by calling `save_state`
# These are saved to folders named `epoch_{epoch}`
# Will contain files: "pytorch_model.bin", "optimizer.bin", "scheduler.bin", and "random_states.pkl"
# If mixed precision was used, will also save a "scalar.bin" file
if checkpointing_steps == "epoch":
output_dir = f"epoch_{epoch}"
if args.output_dir is not None:
output_dir = os.path.join(args.output_dir, output_dir)
accelerator.save_state(output_dir)
def main():
parser = argparse.ArgumentParser(description="Simple example of training script.")
parser.add_argument(
"--mixed_precision",
type=str,
default="no",
choices=["no", "fp16", "bf16"],
help="Whether to use mixed precision. Choose"
"between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10."
"and an Nvidia Ampere GPU.",
)
parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
parser.add_argument(
"--checkpointing_steps",
type=str,
default=None,
help="Whether the various states should be saved at the end of every n steps, or 'epoch' for each epoch.",
)
parser.add_argument(
"--output_dir",
type=str,
default=".",
help="Optional save directory where all checkpoint folders will be stored. Default is the current working directory.",
)
parser.add_argument(
"--resume_from_checkpoint",
type=str,
default=None,
help="If the training should continue from a checkpoint folder.",
)
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
training_function(config, args)
if __name__ == "__main__":
main()


@ -0,0 +1,275 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from typing import List
import numpy as np
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator, DistributedType
from datasets import DatasetDict, load_dataset, load_metric
# New Code #
# We'll be using StratifiedKFold for this example
from sklearn.model_selection import StratifiedKFold
from transformers import (
AdamW,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
set_seed,
)
########################################################################
# This is a fully working simple example to use Accelerate,
# specifically showcasing how to perform Cross Validation,
# and builds off the `nlp_example.py` script.
#
# This example trains a Bert base model on GLUE MRPC
# in any of the following settings (with the same script):
# - single CPU or single GPU
# - multi GPUS (using PyTorch distributed mode)
# - (multi) TPUs
# - fp16 (mixed-precision) or fp32 (normal precision)
#
# To help focus on the differences in the code, building `DataLoaders`
# was refactored into its own function.
# New additions from the base script can be found quickly by
# looking for the # New Code # tags
#
# To run it in each of these various modes, follow the instructions
# in the readme for examples:
# https://github.com/huggingface/accelerate/tree/main/examples
#
########################################################################
MAX_GPU_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 32
# New Code #
# We need a different `get_dataloaders` function that will build dataloaders from lists of indices
def get_fold_dataloaders(
accelerator: Accelerator, dataset: DatasetDict, train_idxs: List[int], valid_idxs: List[int], batch_size: int = 16
):
"""
Gets a set of train, valid, and test dataloaders for a particular fold
Args:
accelerator (`Accelerator`):
The main `Accelerator` object
train_idxs (list of `int`):
The split indices for the training dataset
valid_idxs (list of `int`):
The split indices for the validation dataset
batch_size (`int`):
The size of the minibatch. Default is 16
"""
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
datasets = DatasetDict(
{
"train": dataset["train"].select(train_idxs),
"validation": dataset["train"].select(valid_idxs),
"test": dataset["validation"],
}
)
def tokenize_function(examples):
# max_length=None => use the model max length (it's actually the default)
outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
return outputs
# Apply the method we just defined to all the examples in all the splits of the dataset
tokenized_datasets = datasets.map(
tokenize_function,
batched=True,
remove_columns=["idx", "sentence1", "sentence2"],
)
# We also rename the 'label' column to 'labels' which is the expected name for labels by the models of the
# transformers library
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
def collate_fn(examples):
# On TPU it's best to pad everything to the same length or training will be very slow.
if accelerator.distributed_type == DistributedType.TPU:
return tokenizer.pad(examples, padding="max_length", max_length=128, return_tensors="pt")
return tokenizer.pad(examples, padding="longest", return_tensors="pt")
# Instantiate dataloaders.
train_dataloader = DataLoader(
tokenized_datasets["train"], shuffle=True, collate_fn=collate_fn, batch_size=batch_size
)
eval_dataloader = DataLoader(
tokenized_datasets["validation"], shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
)
test_dataloader = DataLoader(
tokenized_datasets["test"], shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
)
return train_dataloader, eval_dataloader, test_dataloader
def training_function(config, args):
# New Code #
test_labels = None
test_predictions = []
# Download the dataset
datasets = load_dataset("glue", "mrpc")
# Create our splits
kfold = StratifiedKFold(n_splits=int(args.num_folds))
# Initialize accelerator
accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
metric = load_metric("glue", "mrpc")
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
set_seed(seed)
# New Code #
# Create our folds:
folds = kfold.split(np.zeros(datasets["train"].num_rows), datasets["train"]["label"])
# Iterate over them
for train_idxs, valid_idxs in folds:
train_dataloader, eval_dataloader, test_dataloader = get_fold_dataloaders(
accelerator,
datasets,
train_idxs,
valid_idxs,
)
if test_labels is None:
test_labels = datasets["validation"]["label"]
# Instantiate the model (we build the model here so that the seed also controls new weight initialization)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
# We could avoid this line since the accelerator is set with `device_placement=True` (default value).
# Note that if you are placing tensors on devices manually, this line absolutely needs to be before the optimizer
# creation otherwise training will not work on TPU (`accelerate` will kindly throw an error to make us aware of that).
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
optimizer=optimizer,
num_warmup_steps=100,
num_training_steps=(len(train_dataloader) * num_epochs) // gradient_accumulation_steps,
)
# Prepare everything
# There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
# prepare method.
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
# Now we train the model
for epoch in range(num_epochs):
model.train()
for step, batch in enumerate(train_dataloader):
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch.to(accelerator.device)
outputs = model(**batch)
loss = outputs.loss
loss = loss / gradient_accumulation_steps
accelerator.backward(loss)
if step % gradient_accumulation_steps == 0:
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
model.eval()
for step, batch in enumerate(eval_dataloader):
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch.to(accelerator.device)
with torch.no_grad():
outputs = model(**batch)
predictions = outputs.logits.argmax(dim=-1)
predictions, references = accelerator.gather((predictions, batch["labels"]))
metric.add_batch(
predictions=predictions,
references=references,
)
eval_metric = metric.compute()
# Use accelerator.print to print only on the main process.
accelerator.print(f"epoch {epoch}:", eval_metric)
# New Code #
# We also run predictions on the test set at the very end
fold_predictions = []
for step, batch in enumerate(test_dataloader):
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch.to(accelerator.device)
with torch.no_grad():
outputs = model(**batch)
predictions = outputs.logits
predictions, references = accelerator.gather((predictions, batch["labels"]))
fold_predictions.append(predictions.cpu())
metric.add_batch(
predictions=predictions.argmax(dim=-1),
references=references,
)
test_metric = metric.compute()
# Use accelerator.print to print only on the main process.
test_predictions.append(torch.cat(fold_predictions, dim=0))
# We now need to release all our memory and get rid of the current model, optimizer, etc
accelerator.free_memory()
# New Code #
# Finally we check the accuracy of our folded results:
preds = torch.stack(test_predictions, dim=0).sum(dim=0).div(int(args.num_folds)).argmax(dim=-1)
test_metric = metric.compute(predictions=preds, references=test_labels)
accelerator.print("Average test metrics from all folds:", test_metric)
def main():
parser = argparse.ArgumentParser(description="Simple example of training script.")
parser.add_argument(
"--mixed_precision",
type=str,
default="no",
choices=["no", "fp16", "bf16"],
help="Whether to use mixed precision. Choose"
"between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10."
"and an Nvidia Ampere GPU.",
)
parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
# New Code #
parser.add_argument("--num_folds", type=int, default=3, help="The number of splits to perform across the dataset")
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
training_function(config, args)
if __name__ == "__main__":
main()


@ -0,0 +1,388 @@
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import gc
import os
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator, DistributedType
from datasets import load_dataset, load_metric
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
########################################################################
# This is a fully working simple example to use Accelerate
#
# This example trains a Bert base model on GLUE MRPC
# in any of the following settings (with the same script):
# - single CPU or single GPU
# - multi GPUS (using PyTorch distributed mode)
# - (multi) TPUs
# - fp16 (mixed-precision) or fp32 (normal precision)
# - FSDP
#
# This example also demonstrates the checkpointing and sharding capabilities
#
# To run it in each of these various modes, follow the instructions
# in the readme for examples:
# https://github.com/huggingface/accelerate/tree/main/examples
#
########################################################################
MAX_GPU_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 32
# New Code #
# Converting Bytes to Megabytes
def b2mb(x):
return int(x / 2**20)
# New Code #
# This context manager is used to track the peak memory usage of the process
class TorchTracemalloc:
def __enter__(self):
gc.collect()
torch.cuda.empty_cache()
torch.cuda.reset_max_memory_allocated() # reset the peak gauge to zero
self.begin = torch.cuda.memory_allocated()
return self
def __exit__(self, *exc):
gc.collect()
torch.cuda.empty_cache()
self.end = torch.cuda.memory_allocated()
self.peak = torch.cuda.max_memory_allocated()
self.used = b2mb(self.end - self.begin)
self.peaked = b2mb(self.peak - self.begin)
# print(f"delta used/peak {self.used:4d}/{self.peaked:4d}")
# For testing only
if os.environ.get("TESTING_MOCKED_DATALOADERS", None) == "1":
from accelerate.test_utils.training import mocked_dataloaders
get_dataloaders = mocked_dataloaders # noqa: F811
def training_function(config, args):
# Initialize accelerator
if args.with_tracking:
accelerator = Accelerator(
cpu=args.cpu, mixed_precision=args.mixed_precision, log_with="wandb", logging_dir=args.logging_dir
)
else:
accelerator = Accelerator()
accelerator.print(accelerator.distributed_type)
if hasattr(args.checkpointing_steps, "isdigit"):
if args.checkpointing_steps == "epoch":
checkpointing_steps = args.checkpointing_steps
elif args.checkpointing_steps.isdigit():
checkpointing_steps = int(args.checkpointing_steps)
else:
raise ValueError(
f"Argument `checkpointing_steps` must be either a number or `epoch`. `{args.checkpointing_steps}` passed."
)
else:
checkpointing_steps = None
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
seed = int(config["seed"])
batch_size = int(config["batch_size"])
# We need to initialize the trackers we use, and also store our configuration
if args.with_tracking:
if accelerator.is_main_process:
run = os.path.split(__file__)[-1].split(".")[0]
if args.logging_dir:
run = os.path.join(args.logging_dir, run)
accelerator.print(run)
accelerator.init_trackers(run, config)
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
datasets = load_dataset("glue", "mrpc")
metric = load_metric("glue", "mrpc")
def tokenize_function(examples):
# max_length=None => use the model max length (it's actually the default)
outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
return outputs
# Apply the method we just defined to all the examples in all the splits of the dataset
tokenized_datasets = datasets.map(
tokenize_function,
batched=True,
remove_columns=["idx", "sentence1", "sentence2"],
)
# We also rename the 'label' column to 'labels' which is the expected name for labels by the models of the
# transformers library
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
def collate_fn(examples):
# On TPU it's best to pad everything to the same length or training will be very slow.
if accelerator.distributed_type == DistributedType.TPU:
return tokenizer.pad(examples, padding="max_length", max_length=128, return_tensors="pt")
return tokenizer.pad(examples, padding="longest", return_tensors="pt")
# Instantiate dataloaders.
train_dataloader = DataLoader(
tokenized_datasets["train"], shuffle=True, collate_fn=collate_fn, batch_size=batch_size
)
eval_dataloader = DataLoader(
tokenized_datasets["validation"], shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
)
set_seed(seed)
# Instantiate the model (we build the model here so that the seed also controls new weight initialization)
model = AutoModelForSequenceClassification.from_pretrained(args.model_name_or_path, return_dict=True)
# New Code #
# For FSDP feature, it is highly recommended and efficient to prepare the model before creating optimizer
model = accelerator.prepare(model)
# Instantiate optimizer
# New Code #
# For FSDP feature, at present it doesn't support multiple parameter groups,
# so we need to create a single parameter group for the whole model
optimizer = torch.optim.AdamW(params=model.parameters(), lr=lr, weight_decay=2e-4)
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
optimizer=optimizer,
num_warmup_steps=10,
num_training_steps=(len(train_dataloader) * num_epochs) // gradient_accumulation_steps,
)
# New Code #
# For FSDP feature, prepare everything except the model as we have already prepared the model
# before creating the optimizer
# There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
# prepare method.
optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
overall_step = 0
# Potentially load in the weights and states from a previous save
if args.resume_from_checkpoint:
if args.resume_from_checkpoint is not None and args.resume_from_checkpoint != "":
accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}")
accelerator.load_state(args.resume_from_checkpoint)
path = os.path.basename(args.resume_from_checkpoint)
else:
# Get the most recent checkpoint
dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()]
dirs.sort(key=os.path.getctime)
path = dirs[-1] # Sorts folders by date modified, most recent checkpoint is the last
# Extract `epoch_{i}` or `step_{i}`
training_difference = os.path.splitext(path)[0]
if "epoch" in training_difference:
num_epochs -= int(training_difference.replace("epoch_", ""))
resume_step = None
else:
resume_step = int(training_difference.replace("step_", ""))
num_epochs -= resume_step // len(train_dataloader)
# If resuming by step, we also need to know exactly how far into the DataLoader we went
resume_step = (num_epochs * len(train_dataloader)) - resume_step
# Now we train the model
for epoch in range(num_epochs):
# New Code #
# context manager to track the peak memory usage during the training epoch
with TorchTracemalloc() as tracemalloc:
model.train()
if args.with_tracking:
total_loss = 0
for step, batch in enumerate(train_dataloader):
# We need to skip steps until we reach the resumed step
if args.resume_from_checkpoint and epoch == 0:
if resume_step is not None and step < resume_step:
continue
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch.to(accelerator.device)
outputs = model(**batch)
loss = outputs.loss
loss = loss / gradient_accumulation_steps
# We keep track of the loss at each epoch
if args.with_tracking:
total_loss += loss.detach().float()
accelerator.backward(loss)
if step % gradient_accumulation_steps == 0:
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
# accelerator.print(lr_scheduler.get_lr())
overall_step += 1
if isinstance(checkpointing_steps, int):
output_dir = f"step_{overall_step}"
if overall_step % checkpointing_steps == 0:
if args.output_dir is not None:
output_dir = os.path.join(args.output_dir, output_dir)
accelerator.save_state(output_dir)
# New Code #
# Printing the GPU memory usage details such as allocated memory, peak memory, and total memory usage
accelerator.print("Memory before entering the train : {}".format(b2mb(tracemalloc.begin)))
accelerator.print("Memory consumed at the end of the train (end-begin): {}".format(tracemalloc.used))
accelerator.print("Peak Memory consumed during the train (max-begin): {}".format(tracemalloc.peaked))
accelerator.print(
"Total Peak Memory consumed during the train (max): {}".format(
tracemalloc.peaked + b2mb(tracemalloc.begin)
)
)
# Logging the peak memory usage of the GPU to the tracker
if args.with_tracking:
accelerator.log(
{
"train_total_peak_memory": tracemalloc.peaked + b2mb(tracemalloc.begin),
},
step=epoch,
)
# New Code #
# context manager to track the peak memory usage during the evaluation
with TorchTracemalloc() as tracemalloc:
model.eval()
samples_seen = 0
for step, batch in enumerate(eval_dataloader):
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch.to(accelerator.device)
with torch.no_grad():
outputs = model(**batch)
predictions = outputs.logits.argmax(dim=-1)
# It is slightly faster to call this once, than multiple times
predictions, references = accelerator.gather(
(predictions, batch["labels"])
) # If we are in a multiprocess environment, the last batch has duplicates
if accelerator.num_processes > 1:
if step == len(eval_dataloader) - 1:
predictions = predictions[: len(eval_dataloader.dataset) - samples_seen]
references = references[: len(eval_dataloader.dataset) - samples_seen]
else:
samples_seen += references.shape[0]
metric.add_batch(
predictions=predictions,
references=references,
)
eval_metric = metric.compute()
# Use accelerator.print to print only on the main process.
accelerator.print(f"epoch {epoch}:", eval_metric)
if args.with_tracking:
accelerator.log(
{
"accuracy": eval_metric["accuracy"],
"f1": eval_metric["f1"],
"train_loss": total_loss.item() / len(train_dataloader),
},
step=epoch,
)
if checkpointing_steps == "epoch":
output_dir = f"epoch_{epoch}"
if args.output_dir is not None:
output_dir = os.path.join(args.output_dir, output_dir)
accelerator.save_state(output_dir)
# New Code #
# Printing the GPU memory usage details such as allocated memory, peak memory, and total memory usage
accelerator.print("Memory before entering the eval : {}".format(b2mb(tracemalloc.begin)))
accelerator.print("Memory consumed at the end of the eval (end-begin): {}".format(tracemalloc.used))
accelerator.print("Peak Memory consumed during the eval (max-begin): {}".format(tracemalloc.peaked))
accelerator.print(
"Total Peak Memory consumed during the eval (max): {}".format(tracemalloc.peaked + b2mb(tracemalloc.begin))
)
# Logging the peak memory usage of the GPU to the tracker
if args.with_tracking:
accelerator.log(
{
"eval_total_peak_memory": tracemalloc.peaked + b2mb(tracemalloc.begin),
},
step=epoch,
)
if args.with_tracking:
accelerator.end_training()
def main():
parser = argparse.ArgumentParser(description="Simple example of training script.")
parser.add_argument(
"--mixed_precision",
type=str,
default="no",
choices=["no", "fp16", "bf16"],
help="Whether to use mixed precision. Choose"
"between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10."
"and an Nvidia Ampere GPU.",
)
parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
parser.add_argument(
"--checkpointing_steps",
type=str,
default=None,
help="Whether the various states should be saved at the end of every n steps, or 'epoch' for each epoch.",
)
parser.add_argument(
"--resume_from_checkpoint",
type=str,
default=None,
help="If the training should continue from a checkpoint folder.",
)
parser.add_argument(
"--with_tracking",
action="store_true",
help="Whether to load in all available experiment trackers from the environment and use them for logging.",
)
parser.add_argument(
"--output_dir",
type=str,
default=".",
help="Optional save directory where all checkpoint folders will be stored. Default is the current working directory.",
)
parser.add_argument(
"--logging_dir",
type=str,
default="logs",
help="Location on where to store experiment tracking logs`",
)
parser.add_argument(
"--model_name_or_path",
type=str,
help="Path to pretrained model or model identifier from huggingface.co/models.",
required=True,
)
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "seed": 1, "batch_size": 16}
training_function(config, args)
if __name__ == "__main__":
main()


@ -0,0 +1,226 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator, DistributedType
# New Code #
from accelerate.utils import find_executable_batch_size
from datasets import load_dataset, load_metric
from transformers import (
AdamW,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
set_seed,
)
########################################################################
# This is a fully working simple example to use Accelerate,
# specifically showcasing how to ensure out-of-memory errors never
# interrupt training, and builds off the `nlp_example.py` script.
#
# This example trains a Bert base model on GLUE MRPC
# in any of the following settings (with the same script):
# - single CPU or single GPU
# - multi GPUS (using PyTorch distributed mode)
# - (multi) TPUs
# - fp16 (mixed-precision) or fp32 (normal precision)
#
# New additions from the base script can be found quickly by
# looking for the # New Code # tags
#
# To run it in each of these various modes, follow the instructions
# in the readme for examples:
# https://github.com/huggingface/accelerate/tree/main/examples
#
########################################################################
MAX_GPU_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 32
def get_dataloaders(accelerator: Accelerator, batch_size: int = 16):
"""
Creates a set of `DataLoader`s for the `glue` dataset,
using "bert-base-cased" as the tokenizer.
Args:
accelerator (`Accelerator`):
An `Accelerator` object
batch_size (`int`, *optional*):
The batch size for the train and validation DataLoaders.
"""
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
datasets = load_dataset("glue", "mrpc")
def tokenize_function(examples):
# max_length=None => use the model max length (it's actually the default)
outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
return outputs
# Apply the method we just defined to all the examples in all the splits of the dataset
tokenized_datasets = datasets.map(
tokenize_function,
batched=True,
remove_columns=["idx", "sentence1", "sentence2"],
)
# We also rename the 'label' column to 'labels' which is the expected name for labels by the models of the
# transformers library
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
def collate_fn(examples):
# On TPU it's best to pad everything to the same length or training will be very slow.
if accelerator.distributed_type == DistributedType.TPU:
return tokenizer.pad(examples, padding="max_length", max_length=128, return_tensors="pt")
return tokenizer.pad(examples, padding="longest", return_tensors="pt")
# Instantiate dataloaders.
train_dataloader = DataLoader(
tokenized_datasets["train"], shuffle=True, collate_fn=collate_fn, batch_size=batch_size
)
eval_dataloader = DataLoader(
tokenized_datasets["validation"], shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
)
return train_dataloader, eval_dataloader
# For testing only
if os.environ.get("TESTING_MOCKED_DATALOADERS", None) == "1":
from accelerate.test_utils.training import mocked_dataloaders
get_dataloaders = mocked_dataloaders # noqa: F811
def training_function(config, args):
# Initialize accelerator
accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
metric = load_metric("glue", "mrpc")
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
set_seed(seed)
# Instantiate the model (we build the model here so that the seed also controls new weight initialization)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
# We could avoid this line since the accelerator is set with `device_placement=True` (default value).
# Note that if you are placing tensors on devices manually, this line absolutely needs to be before the optimizer
# creation otherwise training will not work on TPU (`accelerate` will kindly throw an error to make us aware of that).
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
# New Code #
# We now can define an inner training loop function. It should take a batch size as the only parameter,
# and build the dataloaders in there.
# It also gets our decorator
@find_executable_batch_size(starting_batch_size=batch_size)
def inner_training_loop(batch_size):
# And now just move everything below under this function
# Ensure that anything declared outside this function is set as `nonlocal`
# so it is in scope
nonlocal model, optimizer
train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
optimizer=optimizer,
num_warmup_steps=100,
num_training_steps=(len(train_dataloader) * num_epochs) // gradient_accumulation_steps,
)
# Prepare everything
# There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
# prepare method.
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
# Now we train the model
for epoch in range(num_epochs):
model.train()
for step, batch in enumerate(train_dataloader):
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch.to(accelerator.device)
outputs = model(**batch)
loss = outputs.loss
loss = loss / gradient_accumulation_steps
accelerator.backward(loss)
if step % gradient_accumulation_steps == 0:
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
model.eval()
for step, batch in enumerate(eval_dataloader):
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch.to(accelerator.device)
with torch.no_grad():
outputs = model(**batch)
predictions = outputs.logits.argmax(dim=-1)
predictions, references = accelerator.gather((predictions, batch["labels"]))
metric.add_batch(
predictions=predictions,
references=references,
)
eval_metric = metric.compute()
# Use accelerator.print to print only on the main process.
accelerator.print(f"epoch {epoch}:", eval_metric)
# New Code #
# And call it at the end with no arguments
# Note: You could also refactor this outside of your training loop function
inner_training_loop()
def main():
parser = argparse.ArgumentParser(description="Simple example of training script.")
parser.add_argument(
"--mixed_precision",
type=str,
default="no",
choices=["no", "fp16", "bf16"],
help="Whether to use mixed precision. Choose"
"between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10."
"and an Nvidia Ampere GPU.",
)
parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
training_function(config, args)
if __name__ == "__main__":
main()


@ -0,0 +1,223 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator, DistributedType
from datasets import load_dataset, load_metric
from transformers import (
AdamW,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
set_seed,
)
########################################################################
# This is a fully working simple example to use Accelerate,
# specifically showcasing how to properly calculate the metrics on the
# validation dataset when in a distributed system, and builds off the
# `nlp_example.py` script.
#
# This example trains a Bert base model on GLUE MRPC
# in any of the following settings (with the same script):
# - single CPU or single GPU
# - multi GPUS (using PyTorch distributed mode)
# - (multi) TPUs
# - fp16 (mixed-precision) or fp32 (normal precision)
#
# To help focus on the differences in the code, building `DataLoaders`
# was refactored into its own function.
# New additions from the base script can be found quickly by
# looking for the # New Code # tags
#
# To run it in each of these various modes, follow the instructions
# in the readme for examples:
# https://github.com/huggingface/accelerate/tree/main/examples
#
########################################################################
MAX_GPU_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 32
def get_dataloaders(accelerator: Accelerator, batch_size: int = 16):
"""
Creates a set of `DataLoader`s for the `glue` dataset,
using "bert-base-cased" as the tokenizer.
Args:
accelerator (`Accelerator`):
An `Accelerator` object
batch_size (`int`, *optional*):
The batch size for the train and validation DataLoaders.
"""
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
datasets = load_dataset("glue", "mrpc")
def tokenize_function(examples):
# max_length=None => use the model max length (it's actually the default)
outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
return outputs
# Apply the method we just defined to all the examples in all the splits of the dataset
tokenized_datasets = datasets.map(
tokenize_function,
batched=True,
remove_columns=["idx", "sentence1", "sentence2"],
)
# We also rename the 'label' column to 'labels' which is the expected name for labels by the models of the
# transformers library
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
def collate_fn(examples):
# On TPU it's best to pad everything to the same length or training will be very slow.
if accelerator.distributed_type == DistributedType.TPU:
return tokenizer.pad(examples, padding="max_length", max_length=128, return_tensors="pt")
return tokenizer.pad(examples, padding="longest", return_tensors="pt")
# Instantiate dataloaders.
train_dataloader = DataLoader(
tokenized_datasets["train"], shuffle=True, collate_fn=collate_fn, batch_size=batch_size
)
eval_dataloader = DataLoader(
tokenized_datasets["validation"], shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
)
return train_dataloader, eval_dataloader
# For testing only
if os.environ.get("TESTING_MOCKED_DATALOADERS", None) == "1":
from accelerate.test_utils.training import mocked_dataloaders
get_dataloaders = mocked_dataloaders # noqa: F811
def training_function(config, args):
# Initialize accelerator
accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
metric = load_metric("glue", "mrpc")
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
set_seed(seed)
train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
# Instantiate the model (we build the model here so that the seed also controls new weights initialization)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
# We could avoid this line since the accelerator is set with `device_placement=True` (default value).
# Note that if you are placing tensors on devices manually, this line absolutely needs to be before the optimizer
# creation otherwise training will not work on TPU (`accelerate` will kindly throw an error to make us aware of that).
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
optimizer=optimizer,
num_warmup_steps=100,
num_training_steps=(len(train_dataloader) * num_epochs) // gradient_accumulation_steps,
)
# Prepare everything
# There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
# prepare method.
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
# Now we train the model
for epoch in range(num_epochs):
model.train()
for step, batch in enumerate(train_dataloader):
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch.to(accelerator.device)
outputs = model(**batch)
loss = outputs.loss
loss = loss / gradient_accumulation_steps
accelerator.backward(loss)
if step % gradient_accumulation_steps == 0:
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
model.eval()
samples_seen = 0
for step, batch in enumerate(eval_dataloader):
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch.to(accelerator.device)
with torch.no_grad():
outputs = model(**batch)
predictions = outputs.logits.argmax(dim=-1)
predictions, references = accelerator.gather((predictions, batch["labels"]))
# New Code #
# First we check if it's a distributed system
if accelerator.num_processes > 1:
# Then see if we're on the last batch of our eval dataloader
if step == len(eval_dataloader) - 1:
# Last batch needs to be truncated on distributed systems as it contains additional samples
predictions = predictions[: len(eval_dataloader.dataset) - samples_seen]
references = references[: len(eval_dataloader.dataset) - samples_seen]
else:
# Otherwise we add the number of samples seen
samples_seen += references.shape[0]
metric.add_batch(
predictions=predictions,
references=references,
)
eval_metric = metric.compute()
# Use accelerator.print to print only on the main process.
accelerator.print(f"epoch {epoch}:", eval_metric)
def main():
parser = argparse.ArgumentParser(description="Simple example of training script.")
parser.add_argument(
"--mixed_precision",
type=str,
default="no",
choices=["no", "fp16", "bf16"],
help="Whether to use mixed precision. Choose"
"between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10."
"and an Nvidia Ampere GPU.",
)
parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
training_function(config, args)
if __name__ == "__main__":
main()
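The truncation in the `# New Code #` block compensates for how `gather` behaves across processes; the tiny sketch below (illustrative only, not part of the script) shows the shape change that makes it necessary.

# A tiny, self-contained sketch (illustrative only) of the `gather` behaviour that the
# truncation above compensates for: every process contributes its own batch, so the
# gathered tensor has `num_processes` times as many rows as the local one, and the
# final evaluation batch may therefore contain duplicated samples.
import torch
from accelerate import Accelerator

accelerator = Accelerator()
local_predictions = torch.zeros(4, device=accelerator.device)  # per-process batch of 4
all_predictions = accelerator.gather(local_predictions)
accelerator.print(all_predictions.shape[0])  # prints 4 * accelerator.num_processes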


@ -0,0 +1,267 @@
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator, DistributedType
from datasets import load_dataset, load_metric
from transformers import (
AdamW,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
set_seed,
)
########################################################################
# This is a fully working simple example to use Accelerate,
# specifically showcasing the experiment tracking capability,
# and builds off the `nlp_example.py` script.
#
# This example trains a Bert base model on GLUE MRPC
# in any of the following settings (with the same script):
# - single CPU or single GPU
# - multi GPUS (using PyTorch distributed mode)
# - (multi) TPUs
# - fp16 (mixed-precision) or fp32 (normal precision)
#
# To help focus on the differences in the code, building `DataLoaders`
# was refactored into its own function.
# New additions from the base script can be found quickly by
# looking for the # New Code # tags
#
# To run it in each of these various modes, follow the instructions
# in the readme for examples:
# https://github.com/huggingface/accelerate/tree/main/examples
#
########################################################################
MAX_GPU_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 32
def get_dataloaders(accelerator: Accelerator, batch_size: int = 16):
"""
Creates a set of `DataLoader`s for the `glue` dataset,
using "bert-base-cased" as the tokenizer.
Args:
accelerator (`Accelerator`):
An `Accelerator` object
batch_size (`int`, *optional*):
The batch size for the train and validation DataLoaders.
"""
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
datasets = load_dataset("glue", "mrpc")
def tokenize_function(examples):
# max_length=None => use the model max length (it's actually the default)
outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
return outputs
# Apply the method we just defined to all the examples in all the splits of the dataset
tokenized_datasets = datasets.map(
tokenize_function,
batched=True,
remove_columns=["idx", "sentence1", "sentence2"],
)
# We also rename the 'label' column to 'labels' which is the expected name for labels by the models of the
# transformers library
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
def collate_fn(examples):
# On TPU it's best to pad everything to the same length or training will be very slow.
if accelerator.distributed_type == DistributedType.TPU:
return tokenizer.pad(examples, padding="max_length", max_length=128, return_tensors="pt")
return tokenizer.pad(examples, padding="longest", return_tensors="pt")
# Instantiate dataloaders.
train_dataloader = DataLoader(
tokenized_datasets["train"], shuffle=True, collate_fn=collate_fn, batch_size=batch_size
)
eval_dataloader = DataLoader(
tokenized_datasets["validation"], shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
)
return train_dataloader, eval_dataloader
# For testing only
if os.environ.get("TESTING_MOCKED_DATALOADERS", None) == "1":
from accelerate.test_utils.training import mocked_dataloaders
get_dataloaders = mocked_dataloaders # noqa: F811
def training_function(config, args):
# Initialize Accelerator
# New Code #
# We pass in "all" to `log_with` to grab all available trackers in the environment
# Note: If using a custom `Tracker` class, it should be passed in here as well, such as:
# >>> log_with = ["all", MyCustomTrackerClassInstance()]
if args.with_tracking:
accelerator = Accelerator(
cpu=args.cpu, mixed_precision=args.mixed_precision, log_with="all", logging_dir=args.logging_dir
)
else:
accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
set_seed(seed)
train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
metric = load_metric("glue", "mrpc")
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
# Instantiate the model (we build the model here so that the seed also controls new weights initialization)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
# We could avoid this line since the accelerator is set with `device_placement=True` (default value).
# Note that if you are placing tensors on devices manually, this line absolutely needs to be before the optimizer
# creation otherwise training will not work on TPU (`accelerate` will kindly throw an error to make us aware of that).
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
optimizer=optimizer,
num_warmup_steps=100,
num_training_steps=(len(train_dataloader) * num_epochs) // gradient_accumulation_steps,
)
# Prepare everything
# There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
# prepare method.
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
# New Code #
# We need to initialize the trackers we use. Overall configurations can also be stored
if args.with_tracking:
if accelerator.is_main_process:
run = os.path.split(__file__)[-1].split(".")[0]
if args.logging_dir:
run = os.path.join(args.logging_dir, run)
accelerator.init_trackers(run, config)
# Now we train the model
for epoch in range(num_epochs):
model.train()
# New Code #
# For our tracking example, we will log the total loss of each epoch
if args.with_tracking:
total_loss = 0
for step, batch in enumerate(train_dataloader):
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch.to(accelerator.device)
outputs = model(**batch)
loss = outputs.loss
# New Code #
if args.with_tracking:
total_loss += loss.detach().float()
loss = loss / gradient_accumulation_steps
accelerator.backward(loss)
if step % gradient_accumulation_steps == 0:
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
model.eval()
for step, batch in enumerate(eval_dataloader):
# We could avoid this line since we set the accelerator with `device_placement=True` (the default).
batch.to(accelerator.device)
with torch.no_grad():
outputs = model(**batch)
predictions = outputs.logits.argmax(dim=-1)
# It is slightly faster to call this once than multiple times
predictions, references = accelerator.gather((predictions, batch["labels"]))
metric.add_batch(
predictions=predictions,
references=references,
)
eval_metric = metric.compute()
# Use accelerator.print to print only on the main process.
accelerator.print(f"epoch {epoch}:", eval_metric)
# New Code #
# To actually log, we call `Accelerator.log`
# The values passed can be of `str`, `int`, `float` or `dict` of `str` to `float`/`int`
if args.with_tracking:
accelerator.log(
{
"accuracy": eval_metric["accuracy"],
"f1": eval_metric["f1"],
"train_loss": total_loss.item() / len(train_dataloader),
"epoch": epoch,
},
step=epoch,
)
# New Code #
# When a run is finished, you should call `accelerator.end_training()`
# to close all of the open trackers
if args.with_tracking:
accelerator.end_training()
def main():
parser = argparse.ArgumentParser(description="Simple example of training script.")
parser.add_argument(
"--mixed_precision",
type=str,
default="no",
choices=["no", "fp16", "bf16"],
help="Whether to use mixed precision. Choose"
"between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10."
"and an Nvidia Ampere GPU.",
)
parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
parser.add_argument(
"--with_tracking",
action="store_true",
help="Whether to load in all available experiment trackers from the environment and use them for logging.",
)
parser.add_argument(
"--logging_dir",
type=str,
default="logs",
help="Location on where to store experiment tracking logs`",
)
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
training_function(config, args)
if __name__ == "__main__":
main()
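Stripped of the training loop, the tracking flow above comes down to four calls. The sketch below is assumed usage only; the run name and logged values are made up.

# A minimal, self-contained sketch (assumed usage) of the tracking calls used above:
# `log_with="all"` picks up every tracker installed in the environment (e.g. TensorBoard),
# `init_trackers` opens a run, `log` records scalar values and `end_training` closes
# the trackers. The run name and logged numbers here are made up for illustration.
from accelerate import Accelerator

accelerator = Accelerator(log_with="all", logging_dir="logs")
accelerator.init_trackers("tracking_sketch", config={"lr": 2e-5, "num_epochs": 3})
for epoch in range(3):
    accelerator.log({"train_loss": 1.0 / (epoch + 1), "epoch": epoch}, step=epoch)
accelerator.end_training()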


@ -0,0 +1,327 @@
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import re
import numpy as np
import torch
from torch.optim.lr_scheduler import OneCycleLR
from torch.utils.data import DataLoader, Dataset
import PIL
from accelerate import Accelerator
from timm import create_model
from torchvision.transforms import Compose, RandomResizedCrop, Resize, ToTensor
########################################################################
# This is a fully working simple example to use Accelerate
#
# This example trains a ResNet50 on the Oxford-IIIT Pet Dataset
# in any of the following settings (with the same script):
# - single CPU or single GPU
# - multi GPUS (using PyTorch distributed mode)
# - (multi) TPUs
# - fp16 (mixed-precision) or fp32 (normal precision)
#
# To run it in each of these various modes, follow the instructions
# in the readme for examples:
# https://github.com/huggingface/accelerate/tree/main/examples
#
########################################################################
# Function to get the label from the filename
def extract_label(fname):
stem = fname.split(os.path.sep)[-1]
return re.search(r"^(.*)_\d+\.jpg$", stem).groups()[0]
class PetsDataset(Dataset):
def __init__(self, file_names, image_transform=None, label_to_id=None):
self.file_names = file_names
self.image_transform = image_transform
self.label_to_id = label_to_id
def __len__(self):
return len(self.file_names)
def __getitem__(self, idx):
fname = self.file_names[idx]
raw_image = PIL.Image.open(fname)
image = raw_image.convert("RGB")
if self.image_transform is not None:
image = self.image_transform(image)
label = extract_label(fname)
if self.label_to_id is not None:
label = self.label_to_id[label]
return {"image": image, "label": label}
def training_function(config, args):
# Initialize accelerator
if args.with_tracking:
accelerator = Accelerator(
cpu=args.cpu, mixed_precision=args.mixed_precision, log_with="all", logging_dir=args.logging_dir
)
else:
accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
seed = int(config["seed"])
batch_size = int(config["batch_size"])
image_size = config["image_size"]
if not isinstance(image_size, (list, tuple)):
image_size = (image_size, image_size)
# Parse out whether we are saving every epoch or after a certain number of batches
if hasattr(args.checkpointing_steps, "isdigit"):
if args.checkpointing_steps == "epoch":
checkpointing_steps = args.checkpointing_steps
elif args.checkpointing_steps.isdigit():
checkpointing_steps = int(args.checkpointing_steps)
else:
raise ValueError(
f"Argument `checkpointing_steps` must be either a number or `epoch`. `{args.checkpointing_steps}` passed."
)
else:
checkpointing_steps = None
# We need to initialize the trackers we use, and also store our configuration
if args.with_tracking:
if accelerator.is_main_process:
run = os.path.split(__file__)[-1].split(".")[0]
if args.logging_dir:
run = os.path.join(args.logging_dir, run)
accelerator.init_trackers(run, config)
# Grab all the image filenames
file_names = [os.path.join(args.data_dir, fname) for fname in os.listdir(args.data_dir) if fname.endswith(".jpg")]
# Build the label correspondences
all_labels = [extract_label(fname) for fname in file_names]
id_to_label = list(set(all_labels))
id_to_label.sort()
label_to_id = {lbl: i for i, lbl in enumerate(id_to_label)}
# Set the seed before splitting the data.
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
# Split our filenames between train and validation
random_perm = np.random.permutation(len(file_names))
cut = int(0.8 * len(file_names))
train_split = random_perm[:cut]
eval_split = random_perm[cut:]
# For training we use a simple RandomResizedCrop
train_tfm = Compose([RandomResizedCrop(image_size, scale=(0.5, 1.0)), ToTensor()])
train_dataset = PetsDataset(
[file_names[i] for i in train_split], image_transform=train_tfm, label_to_id=label_to_id
)
# For evaluation, we use a deterministic Resize
eval_tfm = Compose([Resize(image_size), ToTensor()])
eval_dataset = PetsDataset([file_names[i] for i in eval_split], image_transform=eval_tfm, label_to_id=label_to_id)
# Instantiate dataloaders.
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size, num_workers=4)
eval_dataloader = DataLoader(eval_dataset, shuffle=False, batch_size=batch_size, num_workers=4)
# Instantiate the model (we build the model here so that the seed also controls new weights initialization)
model = create_model("resnet50d", pretrained=True, num_classes=len(label_to_id))
# We could avoid this line since the accelerator is set with `device_placement=True` (default value).
# Note that if you are placing tensors on devices manually, this line absolutely needs to be before the optimizer
# creation otherwise training will not work on TPU (`accelerate` will kindly throw an error to make us aware of that).
model = model.to(accelerator.device)
# Freezing the base model
for param in model.parameters():
param.requires_grad = False
for param in model.get_classifier().parameters():
param.requires_grad = True
# We normalize the batches of images to be a bit faster.
mean = torch.tensor(model.default_cfg["mean"])[None, :, None, None].to(accelerator.device)
std = torch.tensor(model.default_cfg["std"])[None, :, None, None].to(accelerator.device)
# Instantiate optimizer
optimizer = torch.optim.Adam(params=model.parameters(), lr=lr / 25)
# Instantiate learning rate scheduler
lr_scheduler = OneCycleLR(optimizer=optimizer, max_lr=lr, epochs=num_epochs, steps_per_epoch=len(train_dataloader))
# Prepare everything
# There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
# prepare method.
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
# We need to keep track of how many total steps we have iterated over
overall_step = 0
# We also need to keep track of the starting epoch so files are named properly
starting_epoch = 0
# Potentially load in the weights and states from a previous save
if args.resume_from_checkpoint:
if args.resume_from_checkpoint is not None and args.resume_from_checkpoint != "":
accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}")
accelerator.load_state(args.resume_from_checkpoint)
path = os.path.basename(args.resume_from_checkpoint)
else:
# Get the most recent checkpoint
dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()]
dirs.sort(key=os.path.getctime)
path = dirs[-1] # Sorts folders by date modified, most recent checkpoint is the last
# Extract `epoch_{i}` or `step_{i}`
training_difference = os.path.splitext(path)[0]
if "epoch" in training_difference:
starting_epoch = int(training_difference.replace("epoch_", "")) + 1
resume_step = None
else:
resume_step = int(training_difference.replace("step_", ""))
starting_epoch = resume_step // len(train_dataloader)
resume_step -= starting_epoch * len(train_dataloader)
# Now we train the model
for epoch in range(starting_epoch, num_epochs):
model.train()
if args.with_tracking:
total_loss = 0
for step, batch in enumerate(train_dataloader):
# We need to skip steps until we reach the resumed step
if args.resume_from_checkpoint and epoch == starting_epoch:
if resume_step is not None and step < resume_step:
overall_step += 1
continue
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch = {k: v.to(accelerator.device) for k, v in batch.items()}
inputs = (batch["image"] - mean) / std
outputs = model(inputs)
loss = torch.nn.functional.cross_entropy(outputs, batch["label"])
# We keep track of the loss at each epoch
if args.with_tracking:
total_loss += loss.detach().float()
accelerator.backward(loss)
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
overall_step += 1
if isinstance(checkpointing_steps, int):
output_dir = f"step_{overall_step}"
if overall_step % checkpointing_steps == 0:
if args.output_dir is not None:
output_dir = os.path.join(args.output_dir, output_dir)
accelerator.save_state(output_dir)
model.eval()
accurate = 0
samples_seen = 0
for step, batch in enumerate(eval_dataloader):
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch = {k: v.to(accelerator.device) for k, v in batch.items()}
inputs = (batch["image"] - mean) / std
with torch.no_grad():
outputs = model(inputs)
predictions = outputs.argmax(dim=-1)
predictions, references = accelerator.gather((predictions, batch["label"]))
if accelerator.num_processes > 1:
if step == len(eval_dataloader) - 1:
predictions = predictions[: len(eval_dataloader.dataset) - samples_seen]
references = references[: len(eval_dataloader.dataset) - samples_seen]
else:
samples_seen += references.shape[0]
else:
samples_seen += references.shape[0]
accurate_preds = predictions == references
accurate += accurate_preds.long().sum()
eval_metric = accurate.item() / samples_seen
# Use accelerator.print to print only on the main process.
accelerator.print(f"epoch {epoch}: {100 * eval_metric:.2f}")
if args.with_tracking:
accelerator.log(
{
"accuracy": 100 * eval_metric,
"train_loss": total_loss.item() / len(train_dataloader),
"epoch": epoch,
},
step=overall_step,
)
if checkpointing_steps == "epoch":
output_dir = f"epoch_{epoch}"
if args.output_dir is not None:
output_dir = os.path.join(args.output_dir, output_dir)
accelerator.save_state(output_dir)
if args.with_tracking:
accelerator.end_training()
def main():
parser = argparse.ArgumentParser(description="Simple example of training script.")
parser.add_argument("--data_dir", required=True, help="The data folder on disk.")
parser.add_argument("--fp16", action="store_true", help="If passed, will use FP16 training.")
parser.add_argument(
"--mixed_precision",
type=str,
default="no",
choices=["no", "fp16", "bf16"],
help="Whether to use mixed precision. Choose"
"between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10."
"and an Nvidia Ampere GPU.",
)
parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
parser.add_argument(
"--checkpointing_steps",
type=str,
default=None,
help="Whether the various states should be saved at the end of every n steps, or 'epoch' for each epoch.",
)
parser.add_argument(
"--output_dir",
type=str,
default=".",
help="Optional save directory where all checkpoint folders will be stored. Default is the current working directory.",
)
parser.add_argument(
"--resume_from_checkpoint",
type=str,
default=None,
help="If the training should continue from a checkpoint folder.",
)
parser.add_argument(
"--with_tracking",
action="store_true",
help="Whether to load in all available experiment trackers from the environment and use them for logging.",
)
parser.add_argument(
"--logging_dir",
type=str,
default="logs",
help="Location on where to store experiment tracking logs`",
)
args = parser.parse_args()
config = {"lr": 3e-2, "num_epochs": 3, "seed": 42, "batch_size": 64, "image_size": 224}
training_function(config, args)
if __name__ == "__main__":
main()
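The checkpointing above is driven by two calls, `save_state` and `load_state`. Below is a small sketch of that pair in isolation (assumed usage; the toy model and folder name are illustrative).

# A small, self-contained sketch (assumed usage) of the checkpointing calls used above:
# `save_state` writes the model, optimizer, scheduler and RNG states of every prepared
# object into a folder, and `load_state` restores them later. The folder name follows
# the `step_{overall_step}` / `epoch_{epoch}` convention of the script.
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = accelerator.prepare(model, optimizer)

accelerator.save_state("step_100")   # everything needed to resume lands in ./step_100
# ... later, after re-creating and preparing the same objects:
accelerator.load_state("step_100")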


@ -0,0 +1,312 @@
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator, DistributedType
from datasets import load_dataset, load_metric
from transformers import (
AdamW,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
set_seed,
)
########################################################################
# This is a fully working simple example to use Accelerate
#
# This example trains a Bert base model on GLUE MRPC
# in any of the following settings (with the same script):
# - single CPU or single GPU
# - multi GPUS (using PyTorch distributed mode)
# - (multi) TPUs
# - fp16 (mixed-precision) or fp32 (normal precision)
#
# This example also demonstrates the checkpointing and sharding capabilities
#
# To run it in each of these various modes, follow the instructions
# in the readme for examples:
# https://github.com/huggingface/accelerate/tree/main/examples
#
########################################################################
MAX_GPU_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 32
def training_function(config, args):
# Initialize accelerator
if args.with_tracking:
accelerator = Accelerator(
cpu=args.cpu, mixed_precision=args.mixed_precision, log_with="all", logging_dir=args.logging_dir
)
else:
accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
if hasattr(args.checkpointing_steps, "isdigit"):
if args.checkpointing_steps == "epoch":
checkpointing_steps = args.checkpointing_steps
elif args.checkpointing_steps.isdigit():
checkpointing_steps = int(args.checkpointing_steps)
else:
raise ValueError(
f"Argument `checkpointing_steps` must be either a number or `epoch`. `{args.checkpointing_steps}` passed."
)
else:
checkpointing_steps = None
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
# We need to initialize the trackers we use, and also store our configuration
if args.with_tracking:
if accelerator.is_main_process:
run = os.path.split(__file__)[-1].split(".")[0]
if args.logging_dir:
run = os.path.join(args.logging_dir, run)
accelerator.init_trackers(run, config)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
datasets = load_dataset("glue", "mrpc")
metric = load_metric("glue", "mrpc")
def tokenize_function(examples):
# max_length=None => use the model max length (it's actually the default)
outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
return outputs
# Apply the method we just defined to all the examples in all the splits of the dataset
tokenized_datasets = datasets.map(
tokenize_function,
batched=True,
remove_columns=["idx", "sentence1", "sentence2"],
)
# We also rename the 'label' column to 'labels' which is the expected name for labels by the models of the
# transformers library
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
def collate_fn(examples):
# On TPU it's best to pad everything to the same length or training will be very slow.
if accelerator.distributed_type == DistributedType.TPU:
return tokenizer.pad(examples, padding="max_length", max_length=128, return_tensors="pt")
return tokenizer.pad(examples, padding="longest", return_tensors="pt")
# Instantiate dataloaders.
train_dataloader = DataLoader(
tokenized_datasets["train"], shuffle=True, collate_fn=collate_fn, batch_size=batch_size
)
eval_dataloader = DataLoader(
tokenized_datasets["validation"], shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
)
set_seed(seed)
# Instantiate the model (we build the model here so that the seed also controls new weights initialization)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
# We could avoid this line since the accelerator is set with `device_placement=True` (default value).
# Note that if you are placing tensors on devices manually, this line absolutely needs to be before the optimizer
# creation otherwise training will not work on TPU (`accelerate` will kindly throw an error to make us aware of that).
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
optimizer=optimizer,
num_warmup_steps=100,
num_training_steps=(len(train_dataloader) * num_epochs) // gradient_accumulation_steps,
)
# Prepare everything
# There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
# prepare method.
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
# We need to keep track of how many total steps we have iterated over
overall_step = 0
# We also need to keep track of the starting epoch so files are named properly
starting_epoch = 0
# Potentially load in the weights and states from a previous save
if args.resume_from_checkpoint:
if args.resume_from_checkpoint is not None and args.resume_from_checkpoint != "":
accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}")
accelerator.load_state(args.resume_from_checkpoint)
path = os.path.basename(args.resume_from_checkpoint)
else:
# Get the most recent checkpoint
dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()]
dirs.sort(key=os.path.getctime)
path = dirs[-1] # Sorts folders by date modified, most recent checkpoint is the last
# Extract `epoch_{i}` or `step_{i}`
training_difference = os.path.splitext(path)[0]
if "epoch" in training_difference:
starting_epoch = int(training_difference.replace("epoch_", "")) + 1
resume_step = None
else:
resume_step = int(training_difference.replace("step_", ""))
starting_epoch = resume_step // len(train_dataloader)
resume_step -= starting_epoch * len(train_dataloader)
# Now we train the model
for epoch in range(starting_epoch, num_epochs):
model.train()
if args.with_tracking:
total_loss = 0
for step, batch in enumerate(train_dataloader):
# We need to skip steps until we reach the resumed step
if args.resume_from_checkpoint and epoch == starting_epoch:
if resume_step is not None and step < resume_step:
overall_step += 1
continue
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch.to(accelerator.device)
outputs = model(**batch)
loss = outputs.loss
loss = loss / gradient_accumulation_steps
# We keep track of the loss at each epoch
if args.with_tracking:
total_loss += loss.detach().float()
accelerator.backward(loss)
if step % gradient_accumulation_steps == 0:
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
overall_step += 1
if isinstance(checkpointing_steps, int):
output_dir = f"step_{overall_step}"
if overall_step % checkpointing_steps == 0:
if args.output_dir is not None:
output_dir = os.path.join(args.output_dir, output_dir)
accelerator.save_state(output_dir)
model.eval()
samples_seen = 0
for step, batch in enumerate(eval_dataloader):
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch.to(accelerator.device)
with torch.no_grad():
outputs = model(**batch)
predictions = outputs.logits.argmax(dim=-1)
# It is slightly faster to call this once than multiple times
predictions, references = accelerator.gather(
(predictions, batch["labels"])
) # If we are in a multiprocess environment, the last batch has duplicates
if accelerator.num_processes > 1:
if step == len(eval_dataloader) - 1:
predictions = predictions[: len(eval_dataloader.dataset) - samples_seen]
references = references[: len(eval_dataloader.dataset) - samples_seen]
else:
samples_seen += references.shape[0]
metric.add_batch(
predictions=predictions,
references=references,
)
eval_metric = metric.compute()
# Use accelerator.print to print only on the main process.
accelerator.print(f"epoch {epoch}:", eval_metric)
if args.with_tracking:
accelerator.log(
{
"accuracy": eval_metric["accuracy"],
"f1": eval_metric["f1"],
"train_loss": total_loss.item() / len(train_dataloader),
"epoch": epoch,
},
step=epoch,
)
if checkpointing_steps == "epoch":
output_dir = f"epoch_{epoch}"
if args.output_dir is not None:
output_dir = os.path.join(args.output_dir, output_dir)
accelerator.save_state(output_dir)
if args.with_tracking:
accelerator.end_training()
def main():
parser = argparse.ArgumentParser(description="Simple example of training script.")
parser.add_argument(
"--mixed_precision",
type=str,
default="no",
choices=["no", "fp16", "bf16"],
help="Whether to use mixed precision. Choose"
"between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10."
"and an Nvidia Ampere GPU.",
)
parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
parser.add_argument(
"--checkpointing_steps",
type=str,
default=None,
help="Whether the various states should be saved at the end of every n steps, or 'epoch' for each epoch.",
)
parser.add_argument(
"--resume_from_checkpoint",
type=str,
default=None,
help="If the training should continue from a checkpoint folder.",
)
parser.add_argument(
"--with_tracking",
action="store_true",
help="Whether to load in all available experiment trackers from the environment and use them for logging.",
)
parser.add_argument(
"--output_dir",
type=str,
default=".",
help="Optional save directory where all checkpoint folders will be stored. Default is the current working directory.",
)
parser.add_argument(
"--logging_dir",
type=str,
default="logs",
help="Location on where to store experiment tracking logs`",
)
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
training_function(config, args)
if __name__ == "__main__":
main()
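The `epoch_{i}` / `step_{i}` parsing above maps a checkpoint folder name back to a resume point; the arithmetic is easier to follow in isolation. In the sketch below (illustrative only), `updates_per_epoch` is a hypothetical stand-in for `len(train_dataloader)`.

# A tiny, self-contained sketch (illustrative only) of the resume-point arithmetic above.
# `updates_per_epoch` is a hypothetical stand-in for `len(train_dataloader)`.
import os

def parse_resume_point(checkpoint_path, updates_per_epoch):
    name = os.path.splitext(os.path.basename(checkpoint_path))[0]
    if name.startswith("epoch_"):
        # That epoch finished, so training resumes at the next one
        return int(name.replace("epoch_", "")) + 1, None
    resume_step = int(name.replace("step_", ""))
    starting_epoch = resume_step // updates_per_epoch
    return starting_epoch, resume_step - starting_epoch * updates_per_epoch

print(parse_resume_point("epoch_2", 100))   # (3, None)
print(parse_resume_point("step_350", 100))  # (3, 50): skip the first 50 steps of epoch 3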

examples/cv_example.py (new file, 210 lines)

@ -0,0 +1,210 @@
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import re
import numpy as np
import torch
from torch.optim.lr_scheduler import OneCycleLR
from torch.utils.data import DataLoader, Dataset
import PIL
from accelerate import Accelerator
from timm import create_model
from torchvision.transforms import Compose, RandomResizedCrop, Resize, ToTensor
########################################################################
# This is a fully working simple example to use Accelerate
#
# This example trains a ResNet50 on the Oxford-IIIT Pet Dataset
# in any of the following settings (with the same script):
# - single CPU or single GPU
# - multi GPUS (using PyTorch distributed mode)
# - (multi) TPUs
# - fp16 (mixed-precision) or fp32 (normal precision)
#
# To run it in each of these various modes, follow the instructions
# in the readme for examples:
# https://github.com/huggingface/accelerate/tree/main/examples
#
########################################################################
# Function to get the label from the filename
def extract_label(fname):
stem = fname.split(os.path.sep)[-1]
return re.search(r"^(.*)_\d+\.jpg$", stem).groups()[0]
class PetsDataset(Dataset):
def __init__(self, file_names, image_transform=None, label_to_id=None):
self.file_names = file_names
self.image_transform = image_transform
self.label_to_id = label_to_id
def __len__(self):
return len(self.file_names)
def __getitem__(self, idx):
fname = self.file_names[idx]
raw_image = PIL.Image.open(fname)
image = raw_image.convert("RGB")
if self.image_transform is not None:
image = self.image_transform(image)
label = extract_label(fname)
if self.label_to_id is not None:
label = self.label_to_id[label]
return {"image": image, "label": label}
def training_function(config, args):
# Initialize accelerator
accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
seed = int(config["seed"])
batch_size = int(config["batch_size"])
image_size = config["image_size"]
if not isinstance(image_size, (list, tuple)):
image_size = (image_size, image_size)
# Grab all the image filenames
file_names = [os.path.join(args.data_dir, fname) for fname in os.listdir(args.data_dir) if fname.endswith(".jpg")]
# Build the label correspondences
all_labels = [extract_label(fname) for fname in file_names]
id_to_label = list(set(all_labels))
id_to_label.sort()
label_to_id = {lbl: i for i, lbl in enumerate(id_to_label)}
# Set the seed before splitting the data.
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
# Split our filenames between train and validation
random_perm = np.random.permutation(len(file_names))
cut = int(0.8 * len(file_names))
train_split = random_perm[:cut]
eval_split = random_perm[cut:]
# For training we use a simple RandomResizedCrop
train_tfm = Compose([RandomResizedCrop(image_size, scale=(0.5, 1.0)), ToTensor()])
train_dataset = PetsDataset(
[file_names[i] for i in train_split], image_transform=train_tfm, label_to_id=label_to_id
)
# For evaluation, we use a deterministic Resize
eval_tfm = Compose([Resize(image_size), ToTensor()])
eval_dataset = PetsDataset([file_names[i] for i in eval_split], image_transform=eval_tfm, label_to_id=label_to_id)
# Instantiate dataloaders.
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size, num_workers=4)
eval_dataloader = DataLoader(eval_dataset, shuffle=False, batch_size=batch_size, num_workers=4)
# Instantiate the model (we build the model here so that the seed also controls new weights initialization)
model = create_model("resnet50d", pretrained=True, num_classes=len(label_to_id))
# We could avoid this line since the accelerator is set with `device_placement=True` (default value).
# Note that if you are placing tensors on devices manually, this line absolutely needs to be before the optimizer
# creation otherwise training will not work on TPU (`accelerate` will kindly throw an error to make us aware of that).
model = model.to(accelerator.device)
# Freezing the base model
for param in model.parameters():
param.requires_grad = False
for param in model.get_classifier().parameters():
param.requires_grad = True
# We normalize the batches of images to be a bit faster.
mean = torch.tensor(model.default_cfg["mean"])[None, :, None, None].to(accelerator.device)
std = torch.tensor(model.default_cfg["std"])[None, :, None, None].to(accelerator.device)
# Instantiate optimizer
optimizer = torch.optim.Adam(params=model.parameters(), lr=lr / 25)
# Instantiate learning rate scheduler
lr_scheduler = OneCycleLR(optimizer=optimizer, max_lr=lr, epochs=num_epochs, steps_per_epoch=len(train_dataloader))
# Prepare everything
# There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
# prepare method.
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
# Now we train the model
for epoch in range(num_epochs):
model.train()
for step, batch in enumerate(train_dataloader):
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch = {k: v.to(accelerator.device) for k, v in batch.items()}
inputs = (batch["image"] - mean) / std
outputs = model(inputs)
loss = torch.nn.functional.cross_entropy(outputs, batch["label"])
accelerator.backward(loss)
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
model.eval()
accurate = 0
num_elems = 0
for _, batch in enumerate(eval_dataloader):
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch = {k: v.to(accelerator.device) for k, v in batch.items()}
inputs = (batch["image"] - mean) / std
with torch.no_grad():
outputs = model(inputs)
predictions = outputs.argmax(dim=-1)
accurate_preds = accelerator.gather(predictions) == accelerator.gather(batch["label"])
num_elems += accurate_preds.shape[0]
accurate += accurate_preds.long().sum()
eval_metric = accurate.item() / num_elems
# Use accelerator.print to print only on the main process.
accelerator.print(f"epoch {epoch}: {100 * eval_metric:.2f}")
def main():
parser = argparse.ArgumentParser(description="Simple example of training script.")
parser.add_argument("--data_dir", required=True, help="The data folder on disk.")
parser.add_argument(
"--mixed_precision",
type=str,
default="no",
choices=["no", "fp16", "bf16"],
help="Whether to use mixed precision. Choose"
"between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10."
"and an Nvidia Ampere GPU.",
)
parser.add_argument(
"--checkpointing_steps",
type=str,
default=None,
help="Whether the various states should be saved at the end of every n steps, or 'epoch' for each epoch.",
)
parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
args = parser.parse_args()
config = {"lr": 3e-2, "num_epochs": 3, "seed": 42, "batch_size": 64, "image_size": 224}
training_function(config, args)
if __name__ == "__main__":
main()


@ -14,6 +14,7 @@
# limitations under the License.
import argparse
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator, DistributedType
@ -30,7 +31,7 @@ from transformers import (
########################################################################
# This is a fully working simple example to use Accelerate
#
# This example train a Bert base model on GLUE MRPC
# This example trains a Bert base model on GLUE MRPC
# in any of the following settings (with the same script):
# - single CPU or single GPU
# - multi GPUS (using PyTorch distributed mode)
@ -39,7 +40,7 @@ from transformers import (
#
# To run it in each of these various modes, follow the instructions
# in the readme for examples:
# https://github.com/huggingface/accelerate/examples
# https://github.com/huggingface/accelerate/tree/main/examples
#
########################################################################
@ -48,17 +49,19 @@ MAX_GPU_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 32
def training_function(config, args):
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
def get_dataloaders(accelerator: Accelerator, batch_size: int = 16):
"""
Creates a set of `DataLoader`s for the `glue` dataset,
using "bert-base-cased" as the tokenizer.
Args:
accelerator (`Accelerator`):
An `Accelerator` object
batch_size (`int`, *optional*):
The batch size for the train and validation DataLoaders.
"""
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
datasets = load_dataset("glue", "mrpc")
metric = load_metric("glue", "mrpc")
def tokenize_function(examples):
# max_length=None => use the model max length (it's actually the default)
@ -74,13 +77,7 @@ def training_function(config, args):
# We also rename the 'label' column to 'labels' which is the expected name for labels by the models of the
# transformers library
tokenized_datasets.rename_column_("label", "labels")
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
def collate_fn(examples):
# On TPU it's best to pad everything to the same length or training will be very slow.
@ -96,36 +93,55 @@ def training_function(config, args):
tokenized_datasets["validation"], shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
)
set_seed(seed)
# Initialize accelerator
accelerator = Accelerator(fp16=args.fp16, cpu=args.cpu)
return train_dataloader, eval_dataloader
def training_function(config, args):
# Initialize accelerator
accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
metric = load_metric("glue", "mrpc")
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
set_seed(seed)
train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
# Instantiate the model (we build the model here so that the seed also controls new weights initialization)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
# We could avoid this line since we set the accelerator with `device_placement=True`.
# If setting devices manually, this line absolutely needs to be before the optimizer creation otherwise training
# will not work on TPU (`accelerate` will kindly throw an error to make us aware of that).
# We could avoid this line since the accelerator is set with `device_placement=True` (default value).
# Note that if you are placing tensors on devices manually, this line absolutely needs to be before the optimizer
# creation otherwise training will not work on TPU (`accelerate` will kindly throw an error to make us aware of that).
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
# Prepare everything
# There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
# prepare method.
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader
)
# Instantiate learning rate scheduler after preparing the training dataloader as the prepare method
# may change its length.
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
optimizer=optimizer,
num_warmup_steps=100,
num_training_steps=len(train_dataloader) * num_epochs,
num_training_steps=(len(train_dataloader) * num_epochs) // gradient_accumulation_steps,
)
# Now we train the model - We prune bad trials after each epoch if needed
# Prepare everything
# There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
# prepare method.
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
# Now we train the model
for epoch in range(num_epochs):
model.train()
for step, batch in enumerate(train_dataloader):
@ -144,11 +160,13 @@ def training_function(config, args):
for step, batch in enumerate(eval_dataloader):
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch.to(accelerator.device)
outputs = model(**batch)
with torch.no_grad():
outputs = model(**batch)
predictions = outputs.logits.argmax(dim=-1)
predictions, references = accelerator.gather((predictions, batch["labels"]))
metric.add_batch(
predictions=accelerator.gather(predictions),
references=accelerator.gather(batch["labels"]),
predictions=predictions,
references=references,
)
eval_metric = metric.compute()
@ -159,15 +177,15 @@ def training_function(config, args):
def main():
parser = argparse.ArgumentParser(description="Simple example of training script.")
parser.add_argument(
"--fp16",
action="store_true",
help="If passed, will use FP16 training.",
)
parser.add_argument(
"--cpu",
action="store_true",
help="If passed, will train on the CPU.",
"--mixed_precision",
type=str,
default="no",
choices=["no", "fp16", "bf16"],
help="Whether to use mixed precision. Choose"
"between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10."
"and an Nvidia Ampere GPU.",
)
parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
training_function(config, args)


@ -0,0 +1 @@
accelerate # used to be installed in Amazon SageMaker environment


@ -16,12 +16,28 @@ from setuptools import setup
from setuptools import find_packages
extras = {}
extras["quality"] = ["black >= 20.8b1", "isort >= 5.5.4", "flake8 >= 3.8.3"]
extras["docs"] = ["recommonmark", "sphinx==3.2.1", "sphinx-markdown-tables", "sphinx-rtd-theme==0.4.3", "sphinx-copybutton"]
extras["quality"] = ["black ~= 22.0", "isort >= 5.5.4", "flake8 >= 3.8.3"]
extras["docs"] = []
extras["test"] = [
"psutil",
"pytest",
"pytest-xdist",
"pytest-subtests",
"datasets",
"transformers",
"scipy",
"sklearn"
]
extras["test_trackers"] = ["wandb", "comet-ml", "tensorflow>=2.6.2", "tensorboard"]
extras["dev"] = extras["quality"] + extras["test"]
extras["sagemaker"] = [
"sagemaker", # boto3 is a required package in sagemaker
]
setup(
name="accelerate",
version="0.1.0",
version="0.10.0.dev0",
description="Accelerate",
long_description=open("README.md", "r", encoding="utf-8").read(),
long_description_content_type="text/markdown",
@ -32,13 +48,15 @@ setup(
url="https://github.com/huggingface/accelerate",
package_dir={"": "src"},
packages=find_packages("src"),
entry_points={"console_scripts": [
"accelerate=accelerate.commands.accelerate_cli:main",
"accelerate-config=accelerate.commands.config:main",
"accelerate-launch=accelerate.commands.launch:main",
]},
entry_points={
"console_scripts": [
"accelerate=accelerate.commands.accelerate_cli:main",
"accelerate-config=accelerate.commands.config:main",
"accelerate-launch=accelerate.commands.launch:main",
]
},
python_requires=">=3.6.0",
install_requires=["torch>=1.4.0"],
install_requires=["numpy>=1.17", "packaging>=20.0", "pyyaml", "torch>=1.4.0"],
extras_require=extras,
classifiers=[
"Development Status :: 5 - Production/Stable",
@ -55,10 +73,10 @@ setup(
)
# Release checklist
# 1. Change the version in __init__.py, setup.py as well as docs/source/conf.py.
# 1. Change the version in __init__.py and setup.py.
# 2. Commit these changes with the message: "Release: VERSION"
# 3. Add a tag in git to mark the release: "git tag VERSION -m 'Adds tag VERSION for pypi' "
# Push the tag to git: git push --tags origin master
# Push the tag to git: git push --tags origin main
# 4. Run the following commands in the top-level directory:
# python setup.py bdist_wheel
# python setup.py sdist
@ -67,8 +85,9 @@ setup(
# twine upload dist/* -r pypitest --repository-url=https://test.pypi.org/legacy/
# 6. Check that you can install it in a virtualenv by running:
# pip install -i https://testpypi.python.org/pypi accelerate
# accelerate env
# accelerate test
# 7. Upload the final version to actual pypi:
# twine upload dist/* -r pypi
# 8. Add release notes to RELEASE.md and the tag in github once everything is looking hunky-dory.
# 9. Add the release version to docs/source/_static/js/custom.js and .circleci/deploy.sh
# 10. Update the version in __init__.py, setup.py to the new version "-dev" and push to master
# 8. Add release notes to the tag in github once everything is looking hunky-dory.
# 9. Update the version in __init__.py, setup.py to the new version "-dev" and push to master


@ -2,8 +2,20 @@
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
__version__ = "0.1.0"
__version__ = "0.10.0.dev0"
from .accelerator import Accelerator
from .state import DistributedType
from .utils import synchronize_rng_states
from .big_modeling import cpu_offload, disk_offload, dispatch_model, init_empty_weights, load_checkpoint_and_dispatch
from .launchers import debug_launcher, notebook_launcher
from .utils import (
DeepSpeedPlugin,
DistributedDataParallelKwargs,
DistributedType,
FullyShardedDataParallelPlugin,
GradScalerKwargs,
InitProcessGroupKwargs,
find_executable_batch_size,
infer_auto_device_map,
load_checkpoint_in_model,
synchronize_rng_states,
)
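Among the new top-level exports, `notebook_launcher` lets the training function be spawned directly from Python, for instance from a notebook. The sketch below is assumed usage only; running with `num_processes=2` presumes a machine with at least two GPUs or a TPU.

# A minimal sketch (assumed usage) of the newly exported `notebook_launcher`: it spawns
# `num_processes` copies of a training function, e.g. from a Jupyter notebook.
# Running with num_processes=2 assumes a machine with at least two GPUs (or a TPU).
from accelerate import Accelerator, notebook_launcher

def training_function():
    accelerator = Accelerator()
    accelerator.print(f"Running on {accelerator.num_processes} process(es)")

notebook_launcher(training_function, args=(), num_processes=2)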


@ -12,14 +12,52 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import gc
import os
import sys
import warnings
from contextlib import contextmanager
from typing import List, Optional, Union
import torch
from packaging import version
from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
from .data_loader import prepare_data_loader
from .logging import get_logger
from .optimizer import AcceleratedOptimizer
from .state import AcceleratorState, DistributedType
from .utils import extract_model_from_parallel, gather, save, wait_for_everyone
from .scheduler import AcceleratedScheduler
from .state import AcceleratorState
from .tracking import LOGGER_TYPE_TO_CLASS, GeneralTracker, filter_trackers
from .utils import (
DeepSpeedPlugin,
DistributedDataParallelKwargs,
DistributedType,
FullyShardedDataParallelPlugin,
GradScalerKwargs,
InitProcessGroupKwargs,
KwargsHandler,
LoggerType,
PrecisionType,
RNGType,
convert_outputs_to_fp32,
extract_model_from_parallel,
gather,
get_pretty_name,
is_deepspeed_available,
is_torch_version,
pad_across_processes,
reduce,
save,
wait_for_everyone,
)
if is_deepspeed_available():
import deepspeed
from .utils import DeepSpeedEngineWrapper, DeepSpeedOptimizerWrapper
logger = get_logger(__name__)
class Accelerator:
@@ -27,45 +65,186 @@ class Accelerator:
Creates an instance of an accelerator for distributed training (on multi-GPU, TPU) or mixed precision training.
Args:
device_placement (:obj:`bool`, `optional`, defaults to :obj:`True`):
device_placement (`bool`, *optional*, defaults to `True`):
Whether or not the accelerator should put objects on device (tensors yielded by the dataloader, model,
etc...).
split_batches (:obj:`bool`, `optional`, defaults to :obj:`False`):
split_batches (`bool`, *optional*, defaults to `False`):
Whether or not the accelerator should split the batches yielded by the dataloaders across the devices. If
:obj:`True` the actual batch size used will be the same on any kind of distributed processes, but it must
be a round multiple of the :obj:`num_processes` you are using. If :obj:`False`, actual batch size used will
be the one set in your script multiplied by the number of processes.
fp16 (:obj:`bool`, `optional`):
Whether or not to use mixed precision training. Will default to the value in the environment variable
:obj:`USE_FP16`, which will use the default value in the accelerate config of the current system or the
flag passed with the :obj:`accelerate.launch` command.
cpu (:obj:`bool`, `optional`):
Whether or not to force the script to execute on CPU. Will ignore GPU available if set to :obj:`True` and
force the execution on one process only.
`True` the actual batch size used will be the same on any kind of distributed processes, but it must be a
round multiple of the `num_processes` you are using. If `False`, actual batch size used will be the one set
in your script multiplied by the number of processes.
mixed_precision (`str`, *optional*):
Whether or not to use mixed precision training (fp16 or bfloat16). Choose from 'no', 'fp16', 'bf16'. Will
default to the value in the environment variable `MIXED_PRECISION`, which will use the default value in the
accelerate config of the current system or the flag passed with the `accelerate.launch` command. 'fp16'
requires pytorch 1.6 or higher. 'bf16' requires pytorch 1.10 or higher.
cpu (`bool`, *optional*):
Whether or not to force the script to execute on CPU. Will ignore any available GPU if set to `True` and force
the execution on one process only.
deepspeed_plugin (`DeepSpeedPlugin`, *optional*):
Tweak your DeepSpeed related args using this argument. This argument is optional and can be configured
directly using *accelerate config*
fsdp_plugin (`FullyShardedDataParallelPlugin`, *optional*):
Tweak your FSDP related args using this argument. This argument is optional and can be configured directly
using *accelerate config*
rng_types (list of `str` or [`~utils.RNGType`]):
The list of random number generators to synchronize at the beginning of each iteration in your prepared
dataloaders. Should be one or several of:
- `"torch"`: the base torch random number generator
- `"cuda"`: the CUDA random number generator (GPU only)
- `"xla"`: the XLA random number generator (TPU only)
- `"generator"`: the `torch.Generator` of the sampler (or batch sampler if there is no sampler in your
dataloader) or of the iterable dataset (if it exists) if the underlying dataset is of that type.
Will default to `["torch"]` for PyTorch versions <=1.5.1 and `["generator"]` for PyTorch versions >= 1.6.
log_with (list of `str`, [`~utils.LoggerType`] or [`~tracking.GeneralTracker`], *optional*):
A list of loggers to be setup for experiment tracking. Should be one or several of:
- `"all"`
- `"tensorboard"`
- `"wandb"`
- `"comet_ml"`
If `"all`" is selected, will pick up all available trackers in the environment and intialize them. Can also
accept implementations of `GeneralTracker` for custom trackers, and can be combined with `"all"`.
logging_dir (`str`, `os.PathLike`, *optional*):
A path to a directory for storing logs of locally-compatible loggers.
dispatch_batches (`bool`, *optional*):
If set to `True`, the dataloader prepared by the Accelerator is only iterated through on the main process
and then the batches are split and broadcast to each process. Will default to `True` for `DataLoader` whose
underlying dataset is an `IterableDataset`, `False` otherwise.
step_scheduler_with_optimizer (`bool`, *optional*, defaults to `True`):
Set `True` if the learning rate scheduler is stepped at the same time as the optimizer, `False` if only
done under certain circumstances (at the end of each epoch, for instance).
kwargs_handlers (`List[KwargsHandler]`, *optional*):
A list of `KwargsHandler` to customize how the objects related to distributed training or mixed precision
are created. See [kwargs](kwargs) for more information.
Attributes
- **device** (:obj:`torch.device`) -- The device to use.
- **state** (:class:`~accelerate.AcceleratorState`) -- The distributed setup state.
- **device** (`torch.device`) -- The device to use.
- **state** ([`~state.AcceleratorState`]) -- The distributed setup state.
"""
def __init__(
self, device_placement: bool = True, split_batches: bool = False, fp16: bool = None, cpu: bool = False
self,
device_placement: bool = True,
split_batches: bool = False,
fp16: bool = None,
mixed_precision: Union[PrecisionType, str] = None,
cpu: bool = False,
deepspeed_plugin: DeepSpeedPlugin = None,
fsdp_plugin: FullyShardedDataParallelPlugin = None,
rng_types: Optional[List[Union[str, RNGType]]] = None,
log_with: Optional[List[Union[str, LoggerType, GeneralTracker]]] = None,
logging_dir: Optional[Union[str, os.PathLike]] = None,
dispatch_batches: Optional[bool] = None,
step_scheduler_with_optimizer: bool = True,
kwargs_handlers: Optional[List[KwargsHandler]] = None,
):
self.state = AcceleratorState(fp16=fp16, cpu=cpu, _from_accelerator=True)
self.logging_dir = logging_dir
self.log_with = filter_trackers(log_with, self.logging_dir)
if mixed_precision is not None:
mixed_precision = str(mixed_precision)
if mixed_precision not in PrecisionType:
raise ValueError(
f"Unknown mixed_precision mode: {mixed_precision}. Choose between {PrecisionType.list()}"
)
if fp16:
warnings.warn('fp16=True is deprecated. Use mixed_precision="fp16" instead.', DeprecationWarning)
mixed_precision = "fp16"
if deepspeed_plugin is None: # init from env variables
deepspeed_plugin = DeepSpeedPlugin() if os.environ.get("USE_DEEPSPEED", "false") == "true" else None
else:
assert isinstance(
deepspeed_plugin, DeepSpeedPlugin
), "`deepspeed_plugin` must be a DeepSpeedPlugin object."
os.environ["USE_DEEPSPEED"] = "true" # use DeepSpeed if plugin is provided
if fsdp_plugin is None: # init from env variables
fsdp_plugin = FullyShardedDataParallelPlugin() if os.environ.get("USE_FSDP", "false") == "true" else None
else:
if not isinstance(fsdp_plugin, FullyShardedDataParallelPlugin):
raise TypeError("`fsdp_plugin` must be a FullyShardedDataParallelPlugin object.")
os.environ["USE_FSDP"] = "true" # use FSDP if plugin is provided
if os.environ.get("USE_FSDP", "false") == "true":
if is_torch_version("<", "1.12.0.dev20220418+cu113"):
raise ValueError("FSDP requires PyTorch >= 1.12.0.dev20220418+cu113")
# Kwargs handlers
self.ddp_handler = None
self.scaler_handler = None
self.init_handler = None
if kwargs_handlers is not None:
for handler in kwargs_handlers:
assert isinstance(handler, KwargsHandler), f"Unsupported kwargs handler passed: {handler}."
if isinstance(handler, DistributedDataParallelKwargs):
if self.ddp_handler is not None:
raise ValueError("You can only pass one `DistributedDataParallelKwargs` in `kwargs_handler`.")
else:
self.ddp_handler = handler
elif isinstance(handler, GradScalerKwargs):
if self.scaler_handler is not None:
raise ValueError("You can only pass one `GradScalerKwargs` in `kwargs_handler`.")
else:
self.scaler_handler = handler
elif isinstance(handler, InitProcessGroupKwargs):
if self.init_handler is not None:
raise ValueError("You can only pass one `InitProcessGroupKwargs` in `kwargs_handler`.")
else:
self.init_handler = handler
kwargs = self.init_handler.to_kwargs() if self.init_handler is not None else {}
self.state = AcceleratorState(
mixed_precision=mixed_precision,
cpu=cpu,
deepspeed_plugin=deepspeed_plugin,
fsdp_plugin=fsdp_plugin,
_from_accelerator=True,
**kwargs,
)
self.device_placement = device_placement
self.split_batches = split_batches
self.dispatch_batches = dispatch_batches
if dispatch_batches is True and is_torch_version("<", "1.8.0"):
raise ImportError(
"Using `DataLoaderDispatcher` requires PyTorch 1.8.0 minimum. You have {torch.__version__}."
)
self.step_scheduler_with_optimizer = step_scheduler_with_optimizer
# Mixed precision attributes
self.scaler = None
self.native_amp = False
if self.state.use_fp16:
self.native_amp = version.parse(torch.__version__) >= version.parse("1.6")
self.scaler = torch.cuda.amp.GradScaler()
if self.state.mixed_precision == "fp16":
self.native_amp = is_torch_version(">=", "1.6")
if not self.native_amp:
raise ValueError("fp16 mixed precision requires PyTorch >= 1.6")
kwargs = self.scaler_handler.to_kwargs() if self.scaler_handler is not None else {}
self.scaler = torch.cuda.amp.GradScaler(**kwargs)
elif self.state.mixed_precision == "bf16":
self.native_amp = is_torch_version(">=", "1.10")
if mixed_precision == "bf16" and not self.native_amp:
raise ValueError("bf16 mixed precision requires PyTorch >= 1.10")
kwargs = self.scaler_handler.to_kwargs() if self.scaler_handler is not None else {}
self.scaler = torch.cuda.amp.GradScaler(**kwargs)
# Internal references to the training objects
self._optimizers = []
self._models = []
self._schedulers = []
self._custom_objects = []
# RNG Types
self.rng_types = rng_types
if self.rng_types is None:
self.rng_types = ["torch"] if is_torch_version("<=", "1.5.1") else ["generator"]
@property
def distributed_type(self):
@@ -99,39 +278,143 @@ class Accelerator:
@property
def use_fp16(self):
return self.state.use_fp16
return self.mixed_precision != "no"
@property
def mixed_precision(self):
if self.distributed_type == DistributedType.DEEPSPEED:
config = self.state.deepspeed_plugin.deepspeed_config
if config.get("fp16", {}).get("enabled", False):
mixed_precision = "fp16"
elif config.get("bf16", {}).get("enabled", False):
mixed_precision = "bf16"
else:
mixed_precision = "no"
else:
mixed_precision = self.state.mixed_precision
return mixed_precision
@contextmanager
def local_main_process_first(self):
"""
Lets the local main process go inside a with block.
The other processes will enter the with block after the main process exits.
"""
yield from self._goes_first(self.is_local_main_process)
@contextmanager
def main_process_first(self):
"""
Lets the main process go first inside a with block.
The other processes will enter the with block after the main process exits.
"""
yield from self._goes_first(self.is_main_process)
def _goes_first(self, is_main):
if not is_main:
self.wait_for_everyone()
yield
if is_main:
self.wait_for_everyone()
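# Illustrative sketch of the context managers above, reusing an `accelerator` instance:
# the main process runs the body first (e.g. to download and cache a dataset) while the
# other processes wait, then they run it in turn. `prepare_dataset` is a hypothetical helper.
with accelerator.main_process_first():
    dataset = prepare_dataset()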
def print(self, *args, **kwargs):
"""
Use in replacement of :obj:`print()` to only print once per server.
Use in place of `print()` to only print once per server.
"""
if self.is_local_main_process:
print(*args, **kwargs)
def _prepare_one(self, obj):
if isinstance(obj, torch.utils.data.DataLoader):
def _prepare_one(self, obj, first_pass=False):
# First pass of preparation: DataLoader, model, optimizer
if isinstance(obj, torch.utils.data.DataLoader) and first_pass:
return self.prepare_data_loader(obj)
elif isinstance(obj, torch.nn.Module):
elif isinstance(obj, torch.nn.Module) and first_pass:
self._models.append(obj)
return self.prepare_model(obj)
elif isinstance(obj, torch.optim.Optimizer):
elif isinstance(obj, torch.optim.Optimizer) and first_pass:
optimizer = self.prepare_optimizer(obj)
self._optimizers.append(optimizer)
return optimizer
# Second pass of preparation: LR scheduler (which needs the full list of optimizers)
elif isinstance(obj, torch.optim.lr_scheduler._LRScheduler) and not first_pass:
scheduler = self.prepare_scheduler(obj)
self._schedulers.append(scheduler)
return scheduler
else:
return obj
def _prepare_fsdp(self, *args):
result = []
for obj in args:
if isinstance(obj, torch.nn.Module):
model = obj
break
optimizers = []
self._schedulers = []
self._models = []
intermediate_result = []
for obj in args:
if isinstance(obj, torch.optim.Optimizer):
if len(obj.param_groups) > 1:
logger.warn(
"FSDP Warning: When using FSDP, several parameter groups will be conflated into "
"a single one due to nested module wrapping and parameter flattening."
)
optimizer = obj.optimizer.__class__(model.parameters(), **obj.optimizer.defaults)
obj = self.prepare_optimizer(optimizer)
optimizers.append(obj)
elif isinstance(obj, torch.nn.Module):
self._models.append(obj)
intermediate_result.append(obj)
for obj in intermediate_result:
if isinstance(obj, AcceleratedScheduler):
obj.optimizer = optimizers
for i, opt in enumerate(self._optimizers):
if getattr(obj.scheduler, "optimizer", None) == opt.optimizer:
obj.scheduler.optimizer = optimizers[i]
obj.optimizers = [optimizers[i]]
break
self._schedulers.append(obj)
result.append(obj)
self._optimizers = optimizers
return tuple(result)
def prepare(self, *args):
"""
Prepare all objects passed in :obj:`args` for distributed training and mixed precision, then return them in the
same order.
Prepare all objects passed in `args` for distributed training and mixed precision, then return them in the same
order.
Accepts the following type of objects:
- :obj:`torch.utils.data.DataLoader`: PyTorch Dataloader
- :obj:`torch.nn.Module`: PyTorch Module
- :obj:`torch.optim.Optimizer`: PyTorch Optimizer
- `torch.utils.data.DataLoader`: PyTorch Dataloader
- `torch.nn.Module`: PyTorch Module
- `torch.optim.Optimizer`: PyTorch Optimizer
"""
if self.distributed_type == DistributedType.FSDP:
model_count = 0
optimizer_present = False
for obj in args:
if isinstance(obj, torch.nn.Module):
model_count += 1
if isinstance(obj, torch.optim.Optimizer):
optimizer_present = True
if model_count > 1 and optimizer_present:
raise ValueError(
"For FSDP to work with multiple models (>1), "
"prepare must be called for all the models before optimizers are created"
)
elif model_count == 1 and optimizer_present:
logger.warn(
"FSDP Warning: When using FSDP, "
"it is efficient and recommended to call prepare for the model before creating the optimizer"
)
# On TPUs, putting the model on the XLA device will create new parameters, so the corresponding optimizer will
# have parameters disconnected from the model (so no training :-( ).
# If the model and optimizer have parameters on different devices we raise an error.
@@ -152,7 +435,11 @@ class Accelerator:
# 1. grabbing old model parameters
old_named_params = self._get_named_parameters(*args)
result = tuple(self._prepare_one(obj) for obj in args)
if self.distributed_type == DistributedType.DEEPSPEED:
result = self._prepare_deepspeed(*args)
else:
result = tuple(self._prepare_one(obj, first_pass=True) for obj in args)
result = tuple(self._prepare_one(obj) for obj in result)
if tpu_should_fix_optimizer:
# 2. grabbing new model parameters
@@ -164,21 +451,132 @@ class Accelerator:
if isinstance(obj, torch.optim.Optimizer):
obj._switch_parameters(mapping)
return result
if self.distributed_type == DistributedType.FSDP and model_count == 1 and optimizer_present:
result = self._prepare_fsdp(*result)
return result if len(result) > 1 else result[0]
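# Illustrative sketch of the two-pass `prepare` call described above: dataloaders, models
# and optimizers are handled on the first pass, schedulers on the second so each can be
# tied to its prepared optimizer. The four objects are assumed to exist in user code.
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)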
def prepare_model(self, model):
if self.device_placement:
if self.device_placement and self.distributed_type != DistributedType.FSDP:
model = model.to(self.device)
if self.distributed_type == DistributedType.MULTI_GPU:
kwargs = self.ddp_handler.to_kwargs() if self.ddp_handler is not None else {}
model = torch.nn.parallel.DistributedDataParallel(
model,
device_ids=[self.local_process_index],
output_device=self.local_process_index,
model, device_ids=[self.local_process_index], output_device=self.local_process_index, **kwargs
)
elif self.distributed_type == DistributedType.FSDP:
from torch.distributed.fsdp.fully_sharded_data_parallel import FullyShardedDataParallel as FSDP
# Check if the model is already a FSDP model due to `Manual Wrapping` and if so,
# don't wrap it again
if type(model) != FSDP:
fsdp_plugin = self.state.fsdp_plugin
model = FSDP(
model,
sharding_strategy=fsdp_plugin.sharding_strategy,
cpu_offload=fsdp_plugin.cpu_offload,
auto_wrap_policy=fsdp_plugin.auto_wrap_policy,
backward_prefetch=fsdp_plugin.backward_prefetch,
ignored_modules=fsdp_plugin.ignored_modules,
)
if not fsdp_plugin.cpu_offload.offload_params:
model.to(self.device)
elif self.distributed_type == DistributedType.MULTI_CPU:
kwargs = self.ddp_handler.to_kwargs() if self.ddp_handler is not None else {}
model = torch.nn.parallel.DistributedDataParallel(model, **kwargs)
if self.native_amp:
model.forward = torch.cuda.amp.autocast()(model.forward)
if self.mixed_precision == "fp16" and is_torch_version(">=", "1.10"):
model.forward = torch.cuda.amp.autocast(dtype=torch.float16)(model.forward)
elif self.mixed_precision == "bf16":
model.forward = torch.cuda.amp.autocast(dtype=torch.bfloat16)(model.forward)
else:
model.forward = torch.cuda.amp.autocast()(model.forward)
model.forward = convert_outputs_to_fp32(model.forward)
return model
def _prepare_deepspeed(self, *args):
deepspeed_plugin = self.state.deepspeed_plugin
self.deepspeed_config = deepspeed_plugin.deepspeed_config
batch_sizes = [obj.batch_size for obj in args if hasattr(obj, "batch_size")]
if len(batch_sizes) == 0:
raise ValueError(
"You must specify a training or evaluation dataloader in `accelerate.prepare()` when using DeepSpeed."
)
batch_size_per_device = min(batch_sizes) if deepspeed_plugin.is_train_batch_min else max(batch_sizes)
if len(batch_sizes) > 1:
logger.info(
f"Since you passed both train and evaluation dataloader, `is_train_batch_min` (here \
{deepspeed_plugin.is_train_batch_min} will decide the `train_batch_size` ({batch_size_per_device})."
)
self.deepspeed_config["train_batch_size"] = (
batch_size_per_device * deepspeed_plugin.gradient_accumulation_steps * self.num_processes
)
result = [
self._prepare_one(obj, first_pass=True) if isinstance(obj, torch.utils.data.DataLoader) else obj
for obj in args
]
model = None
optimizer = None
for obj in result:
if isinstance(obj, torch.nn.Module):
model = obj
elif isinstance(obj, (torch.optim.Optimizer, dict)):
optimizer = obj
if deepspeed_plugin.auto_opt_mapping:
is_adam = isinstance(optimizer, torch.optim.Adam)
is_adamw = isinstance(optimizer, torch.optim.AdamW)
if (is_adam or is_adamw) and deepspeed_plugin.offload_optimizer_device == "cpu":
defaults = optimizer.defaults
params = []
for group in optimizer.param_groups:
params.extend(group["params"])
optimizer = deepspeed.ops.adam.DeepSpeedCPUAdam(
params,
lr=defaults["lr"],
bias_correction=True,
betas=defaults["betas"],
eps=defaults["eps"],
weight_decay=defaults["weight_decay"],
amsgrad=defaults["amsgrad"],
adamw_mode=is_adamw,
)
# useful when only eval_dataloader is given into `accelerator.prepare()`
if model is not None:
engine = DeepSpeedEngineWrapper(
args=None,
model=model,
optimizer=optimizer,
config_params=self.deepspeed_config,
dist_init_required=False,
)
for i in range(len(result)):
if isinstance(result[i], torch.nn.Module):
result[i] = engine
elif isinstance(result[i], torch.optim.Optimizer):
result[i] = DeepSpeedOptimizerWrapper(engine.optimizer, engine)
self.deepspeed_engine = engine  # keep a reference so `accelerator.backward()` can call `deepspeed_engine.backward()`
self._models.append(engine)
self._optimizers.append(engine.optimizer)
assert (
len(self._models) == 1
), "You can't use same `Accelerator()` instance with 2 models when using DeepSpeed"
if self.distributed_type == DistributedType.DEEPSPEED:
assert hasattr(
self, "deepspeed_engine"
), "You need to pass the model along the optimizer when using Deepspeed."
return tuple(result)
def prepare_data_loader(self, data_loader):
return prepare_data_loader(
data_loader,
@@ -187,66 +585,133 @@ class Accelerator:
process_index=self.process_index,
split_batches=self.split_batches,
put_on_device=self.device_placement,
rng_types=self.rng_types.copy(),
dispatch_batches=self.dispatch_batches,
)
def prepare_optimizer(self, optimizer):
return AcceleratedOptimizer(optimizer, device_placement=self.device_placement, scaler=self.scaler)
def backward(self, loss):
def prepare_scheduler(self, scheduler):
# We try to find the optimizer associated with `scheduler`; the default is the full list of optimizers.
optimizer = self._optimizers
for opt in self._optimizers:
if getattr(scheduler, "optimizer", None) == opt.optimizer:
optimizer = opt
break
return AcceleratedScheduler(
scheduler,
optimizer,
step_with_optimizer=self.step_scheduler_with_optimizer,
split_batches=self.split_batches,
)
def backward(self, loss, **kwargs):
"""
Use :obj:`accelerator.backward(loss)` in lieu of :obj:`loss.backward()`.
Use `accelerator.backward(loss)` in lieu of `loss.backward()`.
"""
if self.scaler is not None:
self.scaler.scale(loss).backward()
if self.distributed_type == DistributedType.DEEPSPEED:
self.deepspeed_engine.backward(loss, **kwargs)
elif self.scaler is not None:
self.scaler.scale(loss).backward(**kwargs)
else:
loss.backward()
loss.backward(**kwargs)
def unscale_gradients(self, optimizer=None):
"""
Unscale the gradients in mixed precision training with AMP. This is a noop in all other settings.
Args:
optimizer (`torch.optim.Optimizer` or `List[torch.optim.Optimizer]`, *optional*):
The optimizer(s) for which to unscale gradients. If not set, will unscale gradients on all optimizers
that were passed to [`~Accelerator.prepare`].
"""
if self.use_fp16 and self.native_amp:
if optimizer is None:
# TODO: this unscales all optimizers where we should only unscale the one where parameters are.
optimizer = self._optimizers
elif not isinstance(optimizer, (tuple, list)):
optimizer = [optimizer]
for opt in optimizer:
while isinstance(opt, AcceleratedOptimizer):
opt = opt.optimizer
self.scaler.unscale_(opt)
def clip_grad_norm_(self, parameters, max_norm, norm_type=2):
"""
Should be used in place of :func:`torch.nn.utils.clip_grad_norm_`.
Should be used in place of `torch.nn.utils.clip_grad_norm_`.
"""
# TODO: this unscales all optimizers where we should only unscale the one where parameters are.
if self.state.use_fp16 and self.native_amp:
for optimizer in self._optimizers:
self.scaler.unscale_(optimizer)
if self.distributed_type == DistributedType.FSDP:
parameters = [p for p in parameters]
for model in self._models:
if parameters == [p for p in model.parameters()]:
model.clip_grad_norm_(max_norm, norm_type)
return
self.unscale_gradients()
torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=norm_type)
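# A minimal training-step sketch (illustrative), continuing the `prepare` sketch above and
# combining `backward`, gradient unscaling and clipping as documented; it assumes `batch`
# is a dict of tensors and 1.0 is an arbitrary clipping threshold.
for batch in train_dataloader:
    loss = model(**batch).loss
    accelerator.backward(loss)
    accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()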
def clip_grad_value_(self, parameters, clip_value):
"""
Should be used in place of :func:`torch.nn.utils.clip_grad_value_`.
Should be used in place of `torch.nn.utils.clip_grad_value_`.
"""
# TODO: this unscales all optimizers where we should only unscale the one where parameters are.
if self.state.use_fp16 and self.native_amp:
for optimizer in self._optimizers:
self.scaler.unscale_(optimizer)
self.unscale_gradients()
torch.nn.utils.clip_grad_value_(parameters, clip_value)
def gather(self, tensor):
"""
Gather the values in `tensor` accross all processes and concatenate them on the first dimension. Useful to
Gather the values in *tensor* across all processes and concatenate them on the first dimension. Useful to
regroup the predictions from all processes when doing evaluation.
Note:
This gather happens in all processes.
Args:
tensor (:obj:`torch.Tensor`, or a nested tuple/list/dictionary of :obj:`torch.Tensor`):
The tensors to gather accross all processes.
tensor (`torch.Tensor`, or a nested tuple/list/dictionary of `torch.Tensor`):
The tensors to gather across all processes.
Returns:
:obj:`torch.Tensor`, or a nested tuple/list/dictionary of :obj:`torch.Tensor`: The gathered tensor(s). Note
that the first dimension of the result is `num_processes` multiplied by the first dimension of the input
tensors.
`torch.Tensor`, or a nested tuple/list/dictionary of `torch.Tensor`: The gathered tensor(s). Note that the
first dimension of the result is *num_processes* multiplied by the first dimension of the input tensors.
"""
return gather(tensor)
def unwrap_model(self, model):
def reduce(self, tensor: torch.Tensor, reduction="sum"):
"""
Unwraps the :obj:`model` from the additional layer possible added by :meth:`~accelerate.Accelerator.prepare`.
Useful before saving the model.
Reduce the values in *tensor* across all processes based on *reduction*.
Args:
model (:obj:`torch.nn.Module`):
tensor (`torch.Tensor`):
The tensors to reduce across all processes.
reduction (`str`, *optional*, defaults to "sum"):
A reduction type, can be one of 'sum', 'mean', or 'none'. If 'none', will not perform any operation.
"""
return reduce(tensor, reduction)
def pad_across_processes(self, tensor, dim=0, pad_index=0, pad_first=False):
"""
Recursively pad the tensors in a nested list/tuple/dictionary of tensors from all devices to the same size so
they can safely be gathered.
Args:
tensor (nested list/tuple/dictionary of `torch.Tensor`):
The data to gather.
dim (`int`, *optional*, defaults to 0):
The dimension on which to pad.
pad_index (`int`, *optional*, defaults to 0):
The value with which to pad.
pad_first (`bool`, *optional*, defaults to `False`):
Whether to pad at the beginning or the end.
"""
return pad_across_processes(tensor, dim=dim, pad_index=pad_index, pad_first=pad_first)
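# Evaluation-loop sketch (illustrative): pad variable-length outputs to a common size
# across processes, then gather them for metric computation; `eval_dataloader` and the
# shape of the model outputs are assumptions.
all_logits = []
for batch in eval_dataloader:
    logits = model(**batch).logits
    logits = accelerator.pad_across_processes(logits, dim=1, pad_index=0)
    all_logits.append(accelerator.gather(logits))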
def unwrap_model(self, model):
"""
Unwraps the `model` from the additional layer possibly added by [`~Accelerator.prepare`]. Useful before saving
the model.
Args:
model (`torch.nn.Module`):
The model to unwrap.
"""
return extract_model_from_parallel(model)
@@ -258,17 +723,131 @@ class Accelerator:
"""
wait_for_everyone()
def init_trackers(self, project_name: str, config: Optional[dict] = None):
"""
Initializes a run for all trackers stored in `self.log_with`, potentially with starting configurations
Args:
project_name (`str`):
The name of the project. All trackers will save their data based on this
config (`dict`, *optional*):
Optional starting configuration to be logged.
"""
self.trackers = []
for tracker in self.log_with:
if issubclass(type(tracker), GeneralTracker):
# Custom trackers are already initialized
self.trackers.append(tracker)
else:
tracker_init = LOGGER_TYPE_TO_CLASS[str(tracker)]
if getattr(tracker_init, "requires_logging_directory"):
# We can skip this check since it was done in `__init__`
self.trackers.append(tracker_init(project_name, self.logging_dir))
else:
self.trackers.append(tracker_init(project_name))
if config is not None:
for tracker in self.trackers:
tracker.store_init_configuration(config)
def log(self, values: dict, step: Optional[int] = None):
"""
Logs `values` to all stored trackers in `self.trackers`.
Args:
values (`dict`):
Values should be a dictionary-like object containing only types `int`, `float`, or `str`.
step (`int`, *optional*):
The run step. If included, the log will be affiliated with this step.
"""
if self.is_main_process:
for tracker in self.trackers:
tracker.log(values, step=step)
def end_training(self):
"""
Runs any special end-of-training behaviors, such as stopping trackers.
"""
if self.is_main_process:
for tracker in self.trackers:
tracker.finish()
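# Tracking sketch (illustrative): initialize the configured trackers once, log scalar
# values while training, then close them; the project name and values are assumptions.
accelerator.init_trackers("my_project", config={"learning_rate": 3e-4})
accelerator.log({"train_loss": 0.42}, step=1)
accelerator.end_training()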
def save(self, obj, f):
"""
Save the object passed to disk once per machine. Use in place of :obj:`torch.save`.
Save the object passed to disk once per machine. Use in place of `torch.save`.
Args:
obj: The object to save.
f (:obj:`str` or :obj:`os.PathLike`):
Where to save the content of :obj:`obj`.
f (`str` or `os.PathLike`):
Where to save the content of `obj`.
"""
save(obj, f)
def save_state(self, output_dir: str):
"""
Saves the current states of the model, optimizer, scaler, RNG generators, and registered objects.
Args:
output_dir (`str` or `os.PathLike`):
The name of the folder to save all relevant weights and states.
"""
# Check if folder exists
output_dir = os.path.expanduser(output_dir)
os.makedirs(output_dir, exist_ok=True)
logger.info(f"Saving current state to {output_dir}")
weights = [self.get_state_dict(m) for m in self._models]
save_location = save_accelerator_state(
output_dir, weights, self._optimizers, self._schedulers, self.state.process_index, self.scaler
)
for i, obj in enumerate(self._custom_objects):
save_custom_state(obj, output_dir, i)
return save_location
def load_state(self, input_dir: str):
"""
Loads the current states of the model, optimizer, scaler, RNG generators, and registered objects.
Args:
input_dir (`str` or `os.PathLike`):
The name of the folder all relevant weights and states were saved in.
"""
# Check if folder exists
input_dir = os.path.expanduser(input_dir)
if not os.path.isdir(input_dir):
raise ValueError(f"Tried to find {input_dir} but folder does not exist")
logger.info(f"Loading states from {input_dir}")
load_accelerator_state(
input_dir, self._models, self._optimizers, self._schedulers, self.state.process_index, self.scaler
)
custom_checkpoints = [f for f in os.listdir(input_dir) if "custom_checkpoint" in f]
if len(custom_checkpoints) != len(self._custom_objects):
err = "Warning! Number of found checkpoints does not match the number of registered objects:"
err += f"\n\tFound checkpoints: {len(custom_checkpoints)}"
err += f"\n\tRegistered objects: {len(self._custom_objects)}\nSkipping."
logger.warn(err)
else:
logger.info(f"Loading in {len(custom_checkpoints)} custom states")
for index, obj in enumerate(self._custom_objects):
load_custom_state(obj, input_dir, index)
def free_memory(self):
"""
Will release all references to the internal objects stored and call the garbage collector. You should call this
method between two trainings with different models/optimizers.
"""
self._schedulers = []
self._optimizers = []
self._models = []
self.deepspeed_engine = None
gc.collect()
torch.cuda.empty_cache()
def clear(self):
"""
Alias for [`Accelerator.free_memory`]; releases all references to the internal objects stored and calls the
garbage collector. You should call this method between two trainings with different models/optimizers.
"""
self.free_memory()
def _get_named_parameters(self, *args):
named_parameters = {}
for obj in args:
@@ -293,3 +872,77 @@ class Accelerator:
optimizer_device = param_group["params"][0].device
break
return (model_device, optimizer_device)
def get_state_dict(self, model):
is_zero_3 = False
if is_deepspeed_available():
if isinstance(model, DeepSpeedEngineWrapper) and self.distributed_type == DistributedType.DEEPSPEED:
is_zero_3 = self.state.deepspeed_plugin.zero_stage == 3
if is_zero_3:
state_dict = model._zero3_consolidated_16bit_state_dict()
else:
model = self.unwrap_model(model)
state_dict = model.state_dict()
if state_dict is not None:
for k in state_dict:
if state_dict[k].dtype == torch.float16:
state_dict[k] = state_dict[k].float()
return state_dict
def register_for_checkpointing(self, *objects):
"""
Makes note of `objects` and will save or load them during `save_state` or `load_state`.
These should be utilized when the state is being loaded or saved in the same script. It is not designed to be
used in different scripts
<Tip>
Every `object` must have a `load_state_dict` and `state_dict` function to be stored.
</Tip>
"""
invalid_objects = []
for obj in objects:
if not hasattr(obj, "state_dict") or not hasattr(obj, "load_state_dict"):
invalid_objects.append(obj)
if len(invalid_objects) > 0:
err = "All `objects` must include a `state_dict` and `load_state_dict` function to be stored. The following inputs are invalid:"
for index, obj in enumerate(invalid_objects):
err += f"\n\t- Item at index {index}, `{get_pretty_name(obj)}`"
raise ValueError(err)
self._custom_objects.extend(objects)
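# Checkpointing sketch (illustrative): any extra stateful object exposing `state_dict` and
# `load_state_dict` can be registered, then everything is saved to and restored from one
# folder; `custom_scheduler` and the folder name are assumptions.
accelerator.register_for_checkpointing(custom_scheduler)
accelerator.save_state("checkpoints/step_1000")
accelerator.load_state("checkpoints/step_1000")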
@contextmanager
def autocast(self):
"""
Will apply automatic mixed precision inside the block under this context manager, if it is enabled. Nothing
different will happen otherwise.
"""
if self.native_amp:
if self.mixed_precision == "fp16" and is_torch_version(">=", "1.10"):
autocast_context = torch.cuda.amp.autocast(dtype=torch.float16)
elif self.mixed_precision == "bf16":
autocast_context = torch.cuda.amp.autocast(dtype=torch.bfloat16)
else:
autocast_context = torch.cuda.amp.autocast()
autocast_context.__enter__()
yield
autocast_context.__exit__(*sys.exc_info())
else:
yield
@property
def optimizer_step_was_skipped(self):
"""
Whether or not the optimizer update was skipped (because of gradient overflow in mixed precision), in which
case the learning rate should not be changed.
"""
for optimizer in self._optimizers:
if optimizer.step_was_skipped:
return True
return False


@@ -0,0 +1,285 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from contextlib import contextmanager
from typing import Dict, List, Optional, Union
import torch
import torch.nn as nn
from .hooks import AlignDevicesHook, add_hook_to_module, attach_align_device_hook, attach_align_device_hook_on_blocks
from .utils import (
OffloadedWeightsLoader,
check_device_map,
extract_submodules_state_dict,
infer_auto_device_map,
load_checkpoint_in_model,
offload_state_dict,
)
@contextmanager
def init_empty_weights(include_buffers: bool = False):
"""
A context manager under which models are initialized with all parameters on the meta device, therefore creating an
empty model. Useful when just initializing the model would blow the available RAM.
Args:
include_buffers (`bool`, *optional*, defaults to `False`):
Whether or not to also put all buffers on the meta device while initializing.
Example:
```python
import torch.nn as nn
from accelerate import init_empty_weights
# Initialize a model with 100 billion parameters in no time and without using any RAM.
with init_empty_weights():
tst = nn.Sequential(*[nn.Linear(10000, 10000) for _ in range(1000)])
```
<Tip warning={true}>
Any model created under this context manager has no weights. As such you can't do something like
`model.to(some_device)` with it. To load weights inside your empty model, see [`load_checkpoint_and_dispatch`].
</Tip>
"""
old_register_parameter = nn.Module.register_parameter
if include_buffers:
old_register_buffer = nn.Module.register_buffer
def register_empty_parameter(module, name, param):
old_register_parameter(module, name, param)
if param is not None:
module._parameters[name] = nn.Parameter(module._parameters[name].to(torch.device("meta")))
def register_empty_buffer(module, name, buffer):
old_register_buffer(module, name, buffer)
if buffer is not None:
module._buffers[name] = module._buffers[name].to(torch.device("meta"))
try:
nn.Module.register_parameter = register_empty_parameter
if include_buffers:
nn.Module.register_buffer = register_empty_buffer
yield
finally:
nn.Module.register_parameter = old_register_parameter
if include_buffers:
nn.Module.register_buffer = old_register_buffer
def cpu_offload(
model: nn.Module,
execution_device: Optional[torch.device] = None,
offload_buffers: bool = False,
state_dict: Optional[Dict[str, torch.Tensor]] = None,
):
"""
Activates full CPU offload for a model. As a result, all parameters of the model will be offloaded and only one
copy of the state dict of the model will be kept. During the forward pass, parameters will be extracted from that
state dict and put on the execution device passed as they are needed, then offloaded again.
Args:
model (`torch.nn.Module`):
The model to offload.
execution_device (`torch.device`, *optional*):
The device on which the forward pass of the model will be executed (should be a GPU). Will default to the
model's first parameter device.
offload_buffers (`bool`, *optional*, defaults to `False`):
Whether or not to offload the buffers with the model parameters.
state_dict (`Dict[str, torch.Tensor]`, *optional*):
The state dict of the model that will be kept on CPU.
"""
if execution_device is None:
execution_device = next(iter(model.parameters())).device
if state_dict is None:
state_dict = {n: p.to("cpu") for n, p in model.state_dict().items()}
attach_align_device_hook(
model, execution_device=execution_device, offload=True, offload_buffers=offload_buffers, weights_map=state_dict
)
add_hook_to_module(model, AlignDevicesHook(io_same_device=True))
return model
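# Offload sketch (illustrative, reusing the `torch`/`nn` imports at the top of this file):
# a single CPU copy of the weights is kept and streamed to the execution device during
# each forward pass; it assumes a CUDA device is available.
tiny_model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
tiny_model = cpu_offload(tiny_model, execution_device=torch.device("cuda:0"))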
def disk_offload(
model: nn.Module,
offload_dir: Union[str, os.PathLike],
execution_device: Optional[torch.device] = None,
offload_buffers: bool = False,
):
"""
Activates full disk offload for a model. As a result, all parameters of the model will be offloaded as
memory-mapped arrays in a given folder. During the forward pass, parameters will be accessed from that folder and
put on the execution device passed as they are needed, then offloaded again.
Args:
model (`torch.nn.Module`): The model to offload.
offload_dir (`str` or `os.PathLike`):
The folder in which to offload the model weights (or where the model weights are already offloaded).
execution_device (`torch.device`, *optional*):
The device on which the forward pass of the model will be executed (should be a GPU). Will default to the
model's first parameter device.
offload_buffers (`bool`, *optional*, defaults to `False`):
Whether or not to offload the buffers with the model parameters.
"""
if not os.path.isdir(offload_dir) or not os.path.isfile(os.path.join(offload_dir, "index.json")):
offload_state_dict(offload_dir, model.state_dict())
if execution_device is None:
execution_device = next(iter(model.parameters())).device
weights_map = OffloadedWeightsLoader(save_folder=offload_dir)
attach_align_device_hook(
model,
execution_device=execution_device,
offload=True,
offload_buffers=offload_buffers,
weights_map=weights_map,
)
add_hook_to_module(model, AlignDevicesHook(io_same_device=True))
return model
def dispatch_model(
model: nn.Module,
device_map: Dict[str, Union[str, int, torch.device]],
main_device: Optional[torch.device] = None,
state_dict: Optional[Dict[str, torch.Tensor]] = None,
offload_dir: Union[str, os.PathLike] = None,
offload_buffers: bool = False,
):
"""
Dispatches a model according to a given device map. Layers of the model might be spread across GPUs, offloaded on
the CPU or even the disk.
Args:
model (`torch.nn.Module`):
The model to dispatch.
device_map (`Dict[str, Union[str, int, torch.device]]`):
A dictionary mapping module names in the model's `state_dict` to the device they should go to. Note that
`"disk"` is accepted even if it's not a proper value for `torch.device`.
main_device (`str`, `int` or `torch.device`, *optional*):
The main execution device. Will default to the first device in the `device_map` different from `"cpu"` or
`"disk"`.
state_dict (`Dict[str, torch.Tensor]`, *optional*):
The state dict of the part of the model that will be kept on CPU.
offload_dir (`str` or `os.PathLike`, *optional*):
The folder in which to offload the model weights (or where the model weights are already offloaded).
offload_buffers (`bool`, *optional*, defaults to `False`):
Whether or not to offload the buffers with the model parameters.
"""
# Error early if the device map is incomplete.
check_device_map(model, device_map)
if main_device is None:
main_device = [d for d in device_map.values() if d not in ["cpu", "disk"]][0]
cpu_modules = [name for name, device in device_map.items() if device == "cpu"]
if state_dict is None and len(cpu_modules) > 0:
state_dict = extract_submodules_state_dict(model.state_dict(), cpu_modules)
disk_modules = [name for name, device in device_map.items() if device == "disk"]
if offload_dir is None and len(disk_modules) > 0:
raise ValueError(
"We need an `offload_dir` to dispatch this model according to this `device_map`, the following submodules "
f"need to be offloaded: {', '.join(disk_modules)}."
)
if len(disk_modules) > 0 and (
not os.path.isdir(offload_dir) or not os.path.isfile(os.path.join(offload_dir, "index.json"))
):
disk_state_dict = extract_submodules_state_dict(model.state_dict(), disk_modules)
offload_state_dict(offload_dir, disk_state_dict)
execution_device = {
name: main_device if device in ["cpu", "disk"] else device for name, device in device_map.items()
}
offload = {name: device in ["cpu", "disk"] for name, device in device_map.items()}
save_folder = offload_dir if len(disk_modules) > 0 else None
if state_dict is not None or save_folder is not None:
weights_map = OffloadedWeightsLoader(state_dict=state_dict, save_folder=save_folder)
else:
weights_map = None
attach_align_device_hook_on_blocks(
model,
execution_device=execution_device,
offload=offload,
offload_buffers=offload_buffers,
weights_map=weights_map,
)
model.hf_device_map = device_map
return model
def load_checkpoint_and_dispatch(
model: nn.Module,
checkpoint: Union[str, os.PathLike],
device_map: Optional[Union[str, Dict[str, Union[int, str, torch.device]]]] = None,
max_memory: Optional[Dict[Union[int, str], Union[int, str]]] = None,
no_split_module_classes: Optional[List[str]] = None,
offload_folder: Optional[Union[str, os.PathLike]] = None,
offload_buffers: bool = False,
dtype: Optional[Union[str, torch.dtype]] = None,
offload_state_dict: bool = False,
):
"""
Loads a (potentially sharded) checkpoint inside a model, potentially sending weights to a given device as they are
loaded and adds the various hooks that will make this model run properly (even if split across devices).
Args:
model (`torch.nn.Module`): The model in which we want to load a checkpoint.
checkpoint (`str` or `os.PathLike`):
The checkpoint (file or folder) to load. It can be:
- a path to a file containing a whole model state dict
- a path to a `.json` file containing the index to a sharded checkpoint
- a path to a folder containing a unique `.index.json` file and the shards of a checkpoint.
device_map (`Dict[str, Union[int, str, torch.device]]`, *optional*):
A map that specifies where each submodule should go. It doesn't need to be refined to each parameter/buffer
name; once a given module name is inside, every submodule of it will be sent to the same device.
To have Accelerate compute the most optimized `device_map` automatically, set `device_map="auto"`.
max_memory (`Dict`, *optional*):
A dictionary mapping device identifiers to maximum memory. Will default to the maximum memory available for each GPU
and the available CPU RAM if unset.
no_split_module_classes (`List[str]`, *optional*):
A list of layer class names that should never be split across devices (for instance any layer that has a
residual connection).
offload_folder (`str` or `os.PathLike`, *optional*):
If the `device_map` contains any value `"disk"`, the folder where we will offload weights.
offload_buffers (`bool`, *optional*, defaults to `False`):
In the layers that are offloaded on the CPU or the hard drive, whether or not to offload the buffers as
well as the parameters.
dtype (`str` or `torch.dtype`, *optional*):
If provided, the weights will be converted to that type when loaded.
offload_state_dict (`bool`, *optional*, defaults to `False`):
If `True`, will temporarily offload the CPU state dict to the hard drive to avoid running out of CPU RAM if
the weight of the CPU state dict + the biggest shard does not fit.
"""
if device_map == "auto":
device_map = infer_auto_device_map(
model, max_memory=max_memory, no_split_module_classes=no_split_module_classes, dtype=dtype
)
load_checkpoint_in_model(
model,
checkpoint,
device_map=device_map,
offload_folder=offload_folder,
dtype=dtype,
offload_state_dict=offload_state_dict,
)
if device_map is None:
return model
return dispatch_model(model, device_map=device_map, offload_dir=offload_folder, offload_buffers=offload_buffers)
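# End-to-end sketch (illustrative): build the model skeleton without allocating RAM, then
# load and dispatch a (possibly sharded) checkpoint across the available devices; the
# model class, its config, and the checkpoint path are hypothetical.
with init_empty_weights():
    big_model = MyBigModel(config)

big_model = load_checkpoint_and_dispatch(
    big_model,
    "path/to/checkpoint",
    device_map="auto",
)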


@@ -0,0 +1,185 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import random
from pathlib import Path
from typing import List
import numpy as np
import torch
from torch.cuda.amp import GradScaler
from .utils import (
MODEL_NAME,
OPTIMIZER_NAME,
RNG_STATE_NAME,
SCALER_NAME,
SCHEDULER_NAME,
get_pretty_name,
is_tpu_available,
save,
)
if is_tpu_available():
import torch_xla.core.xla_model as xm
from .logging import get_logger
logger = get_logger(__name__)
def save_accelerator_state(
output_dir: str,
model_states: List[dict],
optimizers: list,
schedulers: list,
process_index: int,
scaler: GradScaler = None,
):
"""
Saves the current states of the models, optimizers, scaler, and RNG generators to a given directory.
Args:
output_dir (`str` or `os.PathLike`):
The name of the folder to save all relevant weights and states.
model_states (`List[dict]`):
A list of model states
optimizers (`List[torch.optim.Optimizer]`):
A list of optimizer instances
schedulers (`List[torch.optim.lr_scheduler._LRScheduler]`):
A list of learning rate schedulers
process_index (`int`):
The current process index in the Accelerator state
scaler (`torch.cuda.amp.GradScaler`, *optional*):
An optional gradient scaler instance to save
"""
# Model states
for i, state in enumerate(model_states):
weights_name = f"{MODEL_NAME}.bin" if i == 0 else f"{MODEL_NAME}_{i}.bin"
output_model_file = os.path.join(output_dir, weights_name)
save(state, output_model_file)
logger.info(f"Model weights saved in {output_model_file}")
# Optimizer states
for i, opt in enumerate(optimizers):
state = opt.state_dict()
optimizer_name = f"{OPTIMIZER_NAME}.bin" if i == 0 else f"{OPTIMIZER_NAME}_{i}.bin"
output_optimizer_file = os.path.join(output_dir, optimizer_name)
save(state, output_optimizer_file)
logger.info(f"Optimizer state saved in {output_optimizer_file}")
# Scheduler states
for i, scheduler in enumerate(schedulers):
state = scheduler.state_dict()
scheduler_name = f"{SCHEDULER_NAME}.bin" if i == 0 else f"{SCHEDULER_NAME}_{i}.bin"
output_scheduler_file = os.path.join(output_dir, scheduler_name)
save(state, output_scheduler_file)
logger.info(f"Scheduler state saved in {output_scheduler_file}")
# GradScaler state
if scaler is not None:
state = scaler.state_dict()
output_scaler_file = os.path.join(output_dir, SCALER_NAME)
torch.save(state, output_scaler_file)
logger.info(f"Gradient scaler state saved in {output_scaler_file}")
# Random number generator states
states = {}
states_name = f"{RNG_STATE_NAME}_{process_index}.pkl"
states["random_state"] = random.getstate()
states["numpy_random_seed"] = np.random.get_state()
states["torch_manual_seed"] = torch.get_rng_state()
states["torch_cuda_manual_seed"] = torch.cuda.get_rng_state_all()
# ^^ safe to call this function even if cuda is not available
if is_tpu_available():
states["xm_seed"] = xm.get_rng_state()
output_states_file = os.path.join(output_dir, states_name)
torch.save(states, output_states_file)
logger.info(f"Random states saved in {output_states_file}")
return output_dir
def load_accelerator_state(input_dir, models, optimizers, schedulers, process_index, scaler=None):
"""
Loads states of the models, optimizers, scaler, and RNG generators from a given directory.
Args:
input_dir (`str` or `os.PathLike`):
The name of the folder to load all relevant weights and states.
models (`List[torch.nn.Module]`):
A list of model instances
optimizers (`List[torch.optim.Optimizer]`):
A list of optimizer instances
schedulers (`List[torch.optim.lr_scheduler._LRScheduler]`):
A list of learning rate schedulers
process_index (`int`):
The current process index in the Accelerator state
scaler (`torch.cuda.amp.GradScaler`, *optional*):
An optional *GradScaler* instance to load
"""
# Model states
for i, model in enumerate(models):
weights_name = f"{MODEL_NAME}.bin" if i == 0 else f"{MODEL_NAME}_{i}.bin"
input_model_file = os.path.join(input_dir, weights_name)
models[i].load_state_dict(torch.load(input_model_file, map_location="cpu"))
logger.info("All model weights loaded successfully")
# Optimizer states
for i, opt in enumerate(optimizers):
optimizer_name = f"{OPTIMIZER_NAME}.bin" if i == 0 else f"{OPTIMIZER_NAME}_{i}.bin"
input_optimizer_file = os.path.join(input_dir, optimizer_name)
optimizers[i].load_state_dict(torch.load(input_optimizer_file, map_location="cpu"))
logger.info("All optimizer states loaded successfully")
# Scheduler states
for i, scheduler in enumerate(schedulers):
scheduler_name = f"{SCHEDULER_NAME}.bin" if i == 0 else f"{SCHEDULER_NAME}_{i}.bin"
input_scheduler_file = os.path.join(input_dir, scheduler_name)
scheduler.load_state_dict(torch.load(input_scheduler_file))
logger.info("All scheduler states loaded successfully")
# GradScaler state
if scaler is not None:
input_scaler_file = os.path.join(input_dir, SCALER_NAME)
scaler.load_state_dict(torch.load(input_scaler_file))
logger.info("GradScaler state loaded successfully")
# Random states
states = torch.load(os.path.join(input_dir, f"{RNG_STATE_NAME}_{process_index}.pkl"))
random.setstate(states["random_state"])
np.random.set_state(states["numpy_random_seed"])
torch.set_rng_state(states["torch_manual_seed"])
torch.cuda.set_rng_state_all(states["torch_cuda_manual_seed"])
# ^^ safe to call this function even if cuda is not available
if is_tpu_available():
xm.set_rng_state(states["xm_seed"])
logger.info("All random states loaded successfully")
def save_custom_state(obj, path, index: int = 0):
"""
Saves the state of `obj` to `{path}/custom_checkpoint_{index}.pkl`
"""
# Should this be the right way to get a qual_name type value from `obj`?
save_location = Path(path) / f"custom_checkpoint_{index}.pkl"
logger.info(f"Saving the state of {get_pretty_name(obj)} to {save_location}")
torch.save(obj.state_dict(), save_location)
def load_custom_state(obj, path, index: int = 0):
"""
Loads the state of `obj` at `{path}/custom_checkpoint_{index}.pkl`
"""
load_location = f"{path}/custom_checkpoint_{index}.pkl"
logger.info(f"Loading the state of {get_pretty_name(obj)} from {load_location}")
obj.load_state_dict(torch.load(load_location))


@@ -17,6 +17,7 @@
from argparse import ArgumentParser
from accelerate.commands.config import config_command_parser
from accelerate.commands.env import env_command_parser
from accelerate.commands.launch import launch_command_parser
from accelerate.commands.test import test_command_parser
@@ -29,6 +30,7 @@ def main():
config_command_parser(subparsers=subparsers)
launch_command_parser(subparsers=subparsers)
test_command_parser(subparsers=subparsers)
env_command_parser(subparsers=subparsers)
# Let's go
args = parser.parse_args()


@@ -1,184 +0,0 @@
#!/usr/bin/env python
# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import json
import os
from dataclasses import dataclass
from typing import Optional
from accelerate.state import DistributedType
hf_cache_home = os.path.expanduser(
os.getenv("HF_HOME", os.path.join(os.getenv("XDG_CACHE_HOME", "~/.cache"), "huggingface"))
)
cache_dir = os.path.join(hf_cache_home, "accelerate")
default_config_file = os.path.join(cache_dir, "default_config.json")
@dataclass
class LaunchConfig:
distributed_type: DistributedType
num_processes: int
fp16: bool
machine_rank: int = 0
num_machines: int = 1
main_process_ip: Optional[str] = None
main_process_port: Optional[int] = None
main_training_function: str = "main"
@classmethod
def from_json_file(cls, json_file=None):
json_file = default_config_file if json_file is None else json_file
with open(json_file, "r", encoding="utf-8") as f:
return cls(**json.load(f))
def to_json_file(self, json_file):
with open(json_file, "w", encoding="utf-8") as f:
content = json.dumps(self.__dict__, indent=2, sort_keys=True) + "\n"
f.write(content)
def config_command_parser(subparsers=None):
if subparsers is not None:
parser = subparsers.add_parser("config")
else:
parser = argparse.ArgumentParser("Accelerate config command")
parser.add_argument(
"--config_file",
default=None,
help=(
"The path to use to store the config file. Will default to a file named default_config.json in the cache "
"location, which is the content of the environment `HF_HOME` suffixed with 'accelerate', or if you don't have "
"such an environment variable, your cache directory ('~/.cache' or the content of `XDG_CACHE_HOME`) suffixed "
"with 'huggingface'."
),
)
if subparsers is not None:
parser.set_defaults(func=config_command)
return parser
def _ask_field(input_text, convert_value=None, default=None, error_message=None):
ask_again = True
while ask_again:
result = input(input_text)
try:
if default is not None and len(result) == 0:
return default
return convert_value(result) if convert_value is not None else result
except:
if error_message is not None:
print(error_message)
def get_user_input():
def _convert_distributed_mode(value):
value = int(value)
return DistributedType(["NO", "MULTI_GPU", "TPU"][value])
def _convert_yes_no_to_bool(value):
return {"yes": True, "no": False}[value.lower()]
distributed_type = _ask_field(
"Which type of machine are you using? ([0] No distributed training, [1] multi-GPU, [2] TPU): ",
_convert_distributed_mode,
error_message="Please enter 0, 1 or 2.",
)
machine_rank = 0
num_machines = 1
main_process_ip = None
main_process_port = None
if distributed_type == DistributedType.MULTI_GPU:
num_machines = _ask_field(
"How many different machines will you use (use more than 1 for multi-node training)? [1]: ",
lambda x: int(x),
default=1,
)
if num_machines > 1:
machine_rank = _ask_field(
"What is the rank of this machine (from 0 to the number of machines - 1 )? [0]: ",
lambda x: int(x),
default=0,
)
main_process_ip = _ask_field(
"What is the IP address of the machine that will host the main process? ",
)
main_process_ip = _ask_field(
"What is the port you will use to communicate with the main process? ",
lambda x: int(x),
)
if distributed_type == DistributedType.TPU:
main_training_function = _ask_field(
"What is the name of the function in your script that should be launched in all parallel scripts? [main]: ",
default="main",
)
else:
main_training_function = "main"
num_processes = _ask_field(
"How many processes in total will you use? [1]: ",
lambda x: int(x),
default=1,
error_message="Please enter an integer.",
)
if distributed_type != DistributedType.TPU:
fp16 = _ask_field(
"Do you wish to use FP16 (mixed precision)? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
else:
fp16 = False
return LaunchConfig(
distributed_type=distributed_type,
num_processes=num_processes,
fp16=fp16,
machine_rank=machine_rank,
num_machines=num_machines,
main_process_ip=main_process_ip,
main_process_port=main_process_port,
main_training_function=main_training_function,
)
def config_command(args):
config = get_user_input()
if args.config_file is not None:
config_file = args.config_file
else:
if not os.path.isdir(cache_dir):
os.makedirs(cache_dir)
config_file = default_config_file
config.to_json_file(config_file)
def main():
parser = config_command_parser()
args = parser.parse_args()
config_command(args)
if __name__ == "__main__":
main()
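For orientation, here is a minimal sketch of the JSON round-trip the config class above relies on; the dataclass below is a stand-in for illustration only (not the real class), and the file path is illustrative.

import json
from dataclasses import asdict, dataclass

@dataclass
class _DemoConfig:  # stand-in for illustration only
    num_processes: int = 1
    fp16: bool = False

    def to_json_file(self, json_file):
        with open(json_file, "w", encoding="utf-8") as f:
            f.write(json.dumps(asdict(self), indent=2, sort_keys=True) + "\n")

    @classmethod
    def from_json_file(cls, json_file):
        with open(json_file, "r", encoding="utf-8") as f:
            return cls(**json.load(f))

cfg = _DemoConfig(num_processes=2, fp16=True)
cfg.to_json_file("demo_config.json")  # illustrative path
print(_DemoConfig.from_json_file("demo_config.json"))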

View File

@ -0,0 +1,85 @@
#!/usr/bin/env python
# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
from accelerate.utils import ComputeEnvironment
from .cluster import get_cluster_input
from .config_args import cache_dir, default_config_file, default_yaml_config_file, load_config_from_file # noqa: F401
from .config_utils import _ask_field, _convert_compute_environment
from .sagemaker import get_sagemaker_input
def get_user_input():
compute_environment = _ask_field(
"In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): ",
_convert_compute_environment,
error_message="Please enter 0 or 1",
)
if compute_environment == ComputeEnvironment.AMAZON_SAGEMAKER:
config = get_sagemaker_input()
else:
config = get_cluster_input()
return config
def config_command_parser(subparsers=None):
if subparsers is not None:
parser = subparsers.add_parser("config")
else:
parser = argparse.ArgumentParser("Accelerate config command")
parser.add_argument(
"--config_file",
default=None,
help=(
"The path to use to store the config file. Will default to a file named default_config.yaml in the cache "
"location, which is the content of the environment `HF_HOME` suffixed with 'accelerate', or if you don't have "
"such an environment variable, your cache directory ('~/.cache' or the content of `XDG_CACHE_HOME`) suffixed "
"with 'huggingface'."
),
)
if subparsers is not None:
parser.set_defaults(func=config_command)
return parser
def config_command(args):
config = get_user_input()
if args.config_file is not None:
config_file = args.config_file
else:
if not os.path.isdir(cache_dir):
os.makedirs(cache_dir)
config_file = default_yaml_config_file
if config_file.endswith(".json"):
config.to_json_file(config_file)
else:
config.to_yaml_file(config_file)
def main():
parser = config_command_parser()
args = parser.parse_args()
config_command(args)
if __name__ == "__main__":
main()
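As a quick illustration of the `set_defaults(func=...)` pattern these command parsers rely on, here is a self-contained sketch; the parser and handler names below are illustrative, not the actual `accelerate` console script.

import argparse

def fake_config_command(args):  # stand-in for config_command
    print("would write a config to", args.config_file or "the default location")

parser = argparse.ArgumentParser("accelerate-like CLI (sketch)")
subparsers = parser.add_subparsers(dest="command")
config_parser = subparsers.add_parser("config")
config_parser.add_argument("--config_file", default=None)
config_parser.set_defaults(func=fake_config_command)

args = parser.parse_args(["config", "--config_file", "my_config.yaml"])
args.func(args)  # dispatches to the handler registered via set_defaults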

View File

@ -0,0 +1,179 @@
#!/usr/bin/env python
# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from ...utils import ComputeEnvironment, DistributedType, is_deepspeed_available
from .config_args import ClusterConfig
from .config_utils import _ask_field, _convert_distributed_mode, _convert_yes_no_to_bool
def get_cluster_input():
distributed_type = _ask_field(
"Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): ",
_convert_distributed_mode,
error_message="Please enter 0, 1, 2 or 3.",
)
machine_rank = 0
num_machines = 1
main_process_ip = None
main_process_port = None
if distributed_type in [DistributedType.MULTI_GPU, DistributedType.MULTI_CPU]:
num_machines = _ask_field(
"How many different machines will you use (use more than 1 for multi-node training)? [1]: ",
lambda x: int(x),
default=1,
)
if num_machines > 1:
machine_rank = _ask_field(
"What is the rank of this machine (from 0 to the number of machines - 1 )? [0]: ",
lambda x: int(x),
default=0,
)
main_process_ip = _ask_field(
"What is the IP address of the machine that will host the main process? ",
)
main_process_port = _ask_field(
"What is the port you will use to communicate with the main process? ",
lambda x: int(x),
)
if distributed_type == DistributedType.NO:
use_cpu = _ask_field(
"Do you want to run your training on CPU only (even if a GPU is available)? [yes/NO]:",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
elif distributed_type == DistributedType.MULTI_CPU:
use_cpu = True
else:
use_cpu = False
deepspeed_config = {}
if distributed_type in [DistributedType.MULTI_GPU, DistributedType.NO]:
use_deepspeed = _ask_field(
"Do you want to use DeepSpeed? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
if use_deepspeed:
distributed_type = DistributedType.DEEPSPEED
assert (
is_deepspeed_available()
), "DeepSpeed is not installed => run `pip3 install deepspeed` or build it from source"
if distributed_type == DistributedType.DEEPSPEED:
deepspeed_config["zero_stage"] = _ask_field(
"What should be your DeepSpeed's ZeRO optimization stage (0, 1, 2, 3)? [2]: ",
lambda x: int(x),
default=2,
)
if deepspeed_config["zero_stage"] >= 2:
deepspeed_config["offload_optimizer_device"] = _ask_field(
"Where to offload optimizer states? [NONE/cpu/nvme]: ",
lambda x: str(x),
default="none",
)
deepspeed_config["gradient_accumulation_steps"] = _ask_field(
"How many gradient accumulation steps you're passing in your script? [1]: ",
lambda x: int(x),
default=1,
)
fsdp_config = {}
if distributed_type in [DistributedType.MULTI_GPU]:
use_fsdp = _ask_field(
"Do you want to use FullyShardedDataParallel? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
if use_fsdp:
distributed_type = DistributedType.FSDP
if distributed_type == DistributedType.FSDP:
fsdp_config["sharding_strategy"] = _ask_field(
"What should be your sharding strategy ([1] FULL_SHARD, [2] SHARD_GRAD_OP)? [1]: ",
lambda x: int(x),
default=1,
)
fsdp_config["offload_params"] = _ask_field(
"Do you want to offload parameters and gradients to CPU? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
fsdp_config["min_num_params"] = _ask_field(
"What should be your FSDP's minimum number of parameters for Default Auto Wrapping Policy? [1e8]: ",
lambda x: int(x),
default=1e8,
)
if distributed_type == DistributedType.TPU:
main_training_function = _ask_field(
"What is the name of the function in your script that should be launched in all parallel scripts? [main]: ",
default="main",
)
else:
main_training_function = "main"
if distributed_type in [DistributedType.MULTI_CPU, DistributedType.MULTI_GPU, DistributedType.TPU]:
machine_type = str(distributed_type).split(".")[1].replace("MULTI_", "")
if machine_type == "TPU":
machine_type += " cores"
else:
machine_type += "(s)"
num_processes = _ask_field(
f"How many {machine_type} should be used for distributed training? [1]:",
lambda x: int(x),
default=1,
error_message="Please enter an integer.",
)
elif distributed_type in [DistributedType.FSDP, DistributedType.DEEPSPEED]:
num_processes = _ask_field(
"How many GPU(s) should be used for distributed training? [1]:",
lambda x: int(x),
default=1,
error_message="Please enter an integer.",
)
else:
num_processes = 1
if distributed_type != DistributedType.TPU:
mixed_precision = _ask_field(
"Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: ",
lambda x: str(x).lower(),
default="no",
)
else:
mixed_precision = "no"
return ClusterConfig(
compute_environment=ComputeEnvironment.LOCAL_MACHINE,
distributed_type=distributed_type,
num_processes=num_processes,
mixed_precision=mixed_precision,
machine_rank=machine_rank,
num_machines=num_machines,
main_process_ip=main_process_ip,
main_process_port=main_process_port,
main_training_function=main_training_function,
deepspeed_config=deepspeed_config,
fsdp_config=fsdp_config,
use_cpu=use_cpu,
)
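For reference, the object returned by `get_cluster_input` for a plain single-node multi-GPU answer set looks roughly like the construction below. This is a sketch: it assumes the import paths `accelerate.commands.config.config_args` and `accelerate.utils` shown elsewhere in this diff, and the YAML path is illustrative.

from accelerate.commands.config.config_args import ClusterConfig
from accelerate.utils import ComputeEnvironment, DistributedType

config = ClusterConfig(
    compute_environment=ComputeEnvironment.LOCAL_MACHINE,
    distributed_type=DistributedType.MULTI_GPU,
    mixed_precision="fp16",
    use_cpu=False,
    num_processes=2,
)
config.to_yaml_file("default_config.yaml")  # illustrative target path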

View File

@ -0,0 +1,160 @@
#!/usr/bin/env python
# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import os
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union
import yaml
from ...utils import ComputeEnvironment, DistributedType, SageMakerDistributedType
hf_cache_home = os.path.expanduser(
os.getenv("HF_HOME", os.path.join(os.getenv("XDG_CACHE_HOME", "~/.cache"), "huggingface"))
)
cache_dir = os.path.join(hf_cache_home, "accelerate")
default_json_config_file = os.path.join(cache_dir, "default_config.json")
default_yaml_config_file = os.path.join(cache_dir, "default_config.yaml")
# For backward compatibility: the default config is the json one if it's the only existing file.
if os.path.isfile(default_yaml_config_file) or not os.path.isfile(default_json_config_file):
default_config_file = default_yaml_config_file
else:
default_config_file = default_json_config_file
def load_config_from_file(config_file):
config_file_exists = config_file is not None and os.path.isfile(config_file)
config_file = config_file if config_file_exists else default_config_file
with open(config_file, "r", encoding="utf-8") as f:
if config_file.endswith(".json"):
if (
json.load(f).get("compute_environment", ComputeEnvironment.LOCAL_MACHINE)
== ComputeEnvironment.LOCAL_MACHINE
):
config_class = ClusterConfig
else:
config_class = SageMakerConfig
return config_class.from_json_file(json_file=config_file)
else:
if (
yaml.safe_load(f).get("compute_environment", ComputeEnvironment.LOCAL_MACHINE)
== ComputeEnvironment.LOCAL_MACHINE
):
config_class = ClusterConfig
else:
config_class = SageMakerConfig
return config_class.from_yaml_file(yaml_file=config_file)
@dataclass
class BaseConfig:
compute_environment: ComputeEnvironment
distributed_type: Union[DistributedType, SageMakerDistributedType]
mixed_precision: str
use_cpu: bool
def to_dict(self):
result = self.__dict__
# For serialization, it's best to convert Enums to strings (or their underlying value type).
for key, value in result.items():
if isinstance(value, Enum):
result[key] = value.value
return result
@classmethod
def from_json_file(cls, json_file=None):
json_file = default_json_config_file if json_file is None else json_file
with open(json_file, "r", encoding="utf-8") as f:
config_dict = json.load(f)
if "compute_environment" not in config_dict:
config_dict["compute_environment"] = ComputeEnvironment.LOCAL_MACHINE
if "mixed_precision" not in config_dict:
config_dict["mixed_precision"] = "fp16" if ("fp16" in config_dict and config_dict["fp16"]) else "no"
if "fp16" in config_dict: # Convert the config to the new format.
del config_dict["fp16"]
if "use_cpu" not in config_dict:
config_dict["use_cpu"] = False
return cls(**config_dict)
def to_json_file(self, json_file):
with open(json_file, "w", encoding="utf-8") as f:
content = json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
f.write(content)
@classmethod
def from_yaml_file(cls, yaml_file=None):
yaml_file = default_yaml_config_file if yaml_file is None else yaml_file
with open(yaml_file, "r", encoding="utf-8") as f:
config_dict = yaml.safe_load(f)
if "compute_environment" not in config_dict:
config_dict["compute_environment"] = ComputeEnvironment.LOCAL_MACHINE
if "mixed_precision" not in config_dict:
config_dict["mixed_precision"] = "fp16" if ("fp16" in config_dict and config_dict["fp16"]) else "no"
if "fp16" in config_dict: # Convert the config to the new format.
del config_dict["fp16"]
if "use_cpu" not in config_dict:
config_dict["use_cpu"] = False
return cls(**config_dict)
def to_yaml_file(self, yaml_file):
with open(yaml_file, "w", encoding="utf-8") as f:
yaml.safe_dump(self.to_dict(), f)
def __post_init__(self):
if isinstance(self.compute_environment, str):
self.compute_environment = ComputeEnvironment(self.compute_environment)
if isinstance(self.distributed_type, str):
self.distributed_type = DistributedType(self.distributed_type)
@dataclass
class ClusterConfig(BaseConfig):
num_processes: int
machine_rank: int = 0
num_machines: int = 1
main_process_ip: Optional[str] = None
main_process_port: Optional[int] = None
main_training_function: str = "main"
# args for deepspeed_plugin
deepspeed_config: dict = None
# args for fsdp
fsdp_config: dict = None
def __post_init__(self):
if self.deepspeed_config is None:
self.deepspeed_config = {}
if self.fsdp_config is None:
self.fsdp_config = {}
return super().__post_init__()
@dataclass
class SageMakerConfig(BaseConfig):
ec2_instance_type: str
iam_role_name: str
profile: Optional[str] = None
region: str = "us-east-1"
num_machines: int = 1
base_job_name: str = f"accelerate-sagemaker-{num_machines}"
pytorch_version: str = "1.6"
transformers_version: str = "4.4"
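A small sketch of the backward-compatibility conversion performed by `from_yaml_file`/`from_json_file`: a legacy config carrying `fp16: true` is loaded with `mixed_precision` set to "fp16" and the deprecated key dropped. The file name is illustrative and the import path is assumed from this diff's layout.

import yaml
from accelerate.commands.config.config_args import ClusterConfig

legacy = {
    "compute_environment": "LOCAL_MACHINE",
    "distributed_type": "MULTI_GPU",
    "fp16": True,  # old-style flag
    "num_processes": 2,
}
with open("legacy_config.yaml", "w", encoding="utf-8") as f:
    yaml.safe_dump(legacy, f)

config = ClusterConfig.from_yaml_file("legacy_config.yaml")
print(config.mixed_precision)  # "fp16" -- the deprecated `fp16` key was converted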

View File

@ -0,0 +1,49 @@
#!/usr/bin/env python
# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from ...utils.dataclasses import ComputeEnvironment, DistributedType, SageMakerDistributedType
def _ask_field(input_text, convert_value=None, default=None, error_message=None):
ask_again = True
while ask_again:
result = input(input_text)
try:
if default is not None and len(result) == 0:
return default
return convert_value(result) if convert_value is not None else result
except:
if error_message is not None:
print(error_message)
def _convert_compute_environment(value):
value = int(value)
return ComputeEnvironment(["LOCAL_MACHINE", "AMAZON_SAGEMAKER"][value])
def _convert_distributed_mode(value):
value = int(value)
return DistributedType(["NO", "MULTI_CPU", "MULTI_GPU", "TPU"][value])
def _convert_sagemaker_distributed_mode(value):
value = int(value)
return SageMakerDistributedType(["NO", "DATA_PARALLEL", "MODEL_PARALLEL"][value])
def _convert_yes_no_to_bool(value):
return {"yes": True, "no": False}[value.lower()]

View File

@ -0,0 +1,157 @@
#!/usr/bin/env python
# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import os
from ...utils.dataclasses import ComputeEnvironment, SageMakerDistributedType
from ...utils.imports import is_boto3_available
from .config_args import SageMakerConfig
from .config_utils import _ask_field, _convert_sagemaker_distributed_mode
if is_boto3_available():
import boto3 # noqa: F401
def _create_iam_role_for_sagemaker(role_name):
iam_client = boto3.client("iam")
sagemaker_trust_policy = {
"Version": "2012-10-17",
"Statement": [
{"Effect": "Allow", "Principal": {"Service": "sagemaker.amazonaws.com"}, "Action": "sts:AssumeRole"}
],
}
try:
# create the role, associated with the chosen trust policy
iam_client.create_role(
RoleName=role_name, AssumeRolePolicyDocument=json.dumps(sagemaker_trust_policy, indent=2)
)
policy_document = {
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sagemaker:*",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage",
"ecr:BatchCheckLayerAvailability",
"ecr:GetAuthorizationToken",
"cloudwatch:PutMetricData",
"cloudwatch:GetMetricData",
"cloudwatch:GetMetricStatistics",
"cloudwatch:ListMetrics",
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:DescribeLogStreams",
"logs:PutLogEvents",
"logs:GetLogEvents",
"s3:CreateBucket",
"s3:ListBucket",
"s3:GetBucketLocation",
"s3:GetObject",
"s3:PutObject",
],
"Resource": "*",
}
],
}
# attach policy to role
iam_client.put_role_policy(
RoleName=role_name,
PolicyName=f"{role_name}_policy_permission",
PolicyDocument=json.dumps(policy_document, indent=2),
)
except iam_client.exceptions.EntityAlreadyExistsException:
print(f"role {role_name} already exists. Using existing one")
def _get_iam_role_arn(role_name):
iam_client = boto3.client("iam")
return iam_client.get_role(RoleName=role_name)["Role"]["Arn"]
def get_sagemaker_input():
credentials_configuration = _ask_field(
"How do you want to authorize? ([0] AWS Profile, [1] Credentials (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)): ",
lambda x: int(x),
)
aws_profile = None
if credentials_configuration == 0:
aws_profile = _ask_field("Enter your AWS Profile name: [default] ", default="default")
os.environ["AWS_PROFILE"] = aws_profile
else:
print(
"Note you will need to provide AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY when you launch you training script with,"
"`accelerate launch --aws_access_key_id XXX --aws_secret_access_key YYY`"
)
aws_access_key_id = _ask_field("AWS Access Key ID: ")
os.environ["AWS_ACCESS_KEY_ID"] = aws_access_key_id
aws_secret_access_key = _ask_field("AWS Secret Access Key: ")
os.environ["AWS_SECRET_ACCESS_KEY"] = aws_secret_access_key
aws_region = _ask_field("Enter your AWS Region: [us-east-1]", default="us-east-1")
os.environ["AWS_DEFAULT_REGION"] = aws_region
role_management = _ask_field(
"Do you already have an IAM Role for executing Amazon SageMaker Training Jobs? ([0] provide IAM Role name, [1] create new IAM role using credentials: ",
lambda x: int(x),
)
if role_management == 0:
iam_role_name = _ask_field("Enter your IAM role name: ")
else:
iam_role_name = "accelerate_sagemaker_execution_role"
print(f'Accelerate will create an iam role "{iam_role_name}" using the provided credentials')
_create_iam_role_for_sagemaker(iam_role_name)
distributed_type = _ask_field(
"Which type of machine are you using? ([0] No distributed training, [1] data parallelism, [2] model parallelism): ",
_convert_sagemaker_distributed_mode,
error_message="Please enter 0, 1 or 2",
)
# using the best two instances for single-gpu training or multi-gpu -> can turn into question to make it more diverse
ec2_instance_type = "ml.p3.2xlarge" if distributed_type == SageMakerDistributedType.NO else "ml.p3dn.24xlarge"
num_machines = 1
if (
distributed_type == SageMakerDistributedType.DATA_PARALLEL
or distributed_type == SageMakerDistributedType.MODEL_PARALLEL
):
raise NotImplementedError("Model or Data Parallelism is not implemented yet. We are working on it")
num_machines = _ask_field(
"How many machines do you want use? [2]: ",
lambda x: int(x),
default=2,
)
mixed_precision = _ask_field(
"Do you wish to use FP16 or BF16 (mixed precision)? [No/FP16/BF16]: ",
lambda x: str(x),
default="No",
)
return SageMakerConfig(
compute_environment=ComputeEnvironment.AMAZON_SAGEMAKER,
distributed_type=distributed_type,
ec2_instance_type=ec2_instance_type,
profile=aws_profile,
region=aws_region,
iam_role_name=iam_role_name,
mixed_precision=mixed_precision,
num_machines=num_machines,
)
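Exercising the IAM helpers above requires boto3 and valid AWS credentials, so the following is an illustrative sketch only; the role name is hypothetical and the import path is assumed from this diff's layout.

from accelerate.commands.config.sagemaker import _create_iam_role_for_sagemaker, _get_iam_role_arn

role_name = "accelerate_sagemaker_execution_role"  # hypothetical
_create_iam_role_for_sagemaker(role_name)  # prints a notice and reuses the role if it already exists
print(_get_iam_role_arn(role_name))        # an arn:aws:iam::... string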

View File

@ -0,0 +1,68 @@
import argparse
import os
import platform
import numpy as np
import torch
from accelerate import __version__ as version
from accelerate.commands.config import default_config_file, load_config_from_file
def env_command_parser(subparsers=None):
if subparsers is not None:
parser = subparsers.add_parser("env")
else:
parser = argparse.ArgumentParser("Accelerate env command")
parser.add_argument(
"--config_file", default=None, help="The config file to use for the default values in the launching script."
)
if subparsers is not None:
parser.set_defaults(func=env_command)
return parser
def env_command(args):
pt_version = torch.__version__
pt_cuda_available = torch.cuda.is_available()
accelerate_config = "Not found"
# Get the default from the config file.
if args.config_file is not None or os.path.isfile(default_config_file):
accelerate_config = load_config_from_file(args.config_file).to_dict()
info = {
"`Accelerate` version": version,
"Platform": platform.platform(),
"Python version": platform.python_version(),
"Numpy version": np.__version__,
"PyTorch version (GPU?)": f"{pt_version} ({pt_cuda_available})",
}
print("\nCopy-and-paste the text below in your GitHub issue\n")
print("\n".join([f"- {prop}: {val}" for prop, val in info.items()]))
print("- `Accelerate` default config:" if args.config_file is None else "- `Accelerate` config passed:")
accelerate_config_str = (
"\n".join([f"\t- {prop}: {val}" for prop, val in accelerate_config.items()])
if isinstance(accelerate_config, dict)
else f"\t{accelerate_config}"
)
print(accelerate_config_str)
info["`Accelerate` configs"] = accelerate_config
return info
def main() -> int:
parser = env_command_parser()
args = parser.parse_args()
env_command(args)
return 0
if __name__ == "__main__":
raise SystemExit(main())
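To show the report layout produced by `env_command`, here is the same formatting applied to made-up values; the real values come from `platform`, `numpy`, `torch` and the loaded config at runtime.

info = {
    "`Accelerate` version": "0.10.0.dev0",
    "Platform": "Linux-5.15.0-x86_64-with-glibc2.31",  # made-up example value
    "Python version": "3.8.13",                        # made-up example value
    "Numpy version": "1.22.4",                         # made-up example value
    "PyTorch version (GPU?)": "1.11.0 (True)",         # made-up example value
}
print("\nCopy-and-paste the text below in your GitHub issue\n")
print("\n".join([f"- {prop}: {val}" for prop, val in info.items()]))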

View File

@ -16,23 +16,24 @@
import argparse
import importlib
import inspect
import os
import subprocess
import sys
import warnings
from ast import literal_eval
from pathlib import Path
from typing import Optional
from typing import Dict, List
from accelerate.commands.config import LaunchConfig, default_config_file
from accelerate.state import DistributedType
class _AddOneArg():
def __init__(self, launcher):
self.launcher = launcher
def __call__(self, index):
self.launcher()
from accelerate.commands.config import default_config_file, load_config_from_file
from accelerate.commands.config.config_args import SageMakerConfig
from accelerate.utils import (
ComputeEnvironment,
DistributedType,
PrecisionType,
PrepareForLaunch,
is_sagemaker_available,
)
from accelerate.utils.versions import is_torch_version
def launch_command_parser(subparsers=None):
@ -50,9 +51,48 @@ def launch_command_parser(subparsers=None):
action="store_true",
help="Whether or not this should launch a distributed GPU training.",
)
parser.add_argument(
"--use_deepspeed",
default=False,
action="store_true",
help="Whether to use deepspeed.",
)
parser.add_argument(
"--use_fsdp",
default=False,
action="store_true",
help="Whether to use fsdp.",
)
parser.add_argument(
"--offload_params",
default="false",
type=str,
help="Decides Whether (true|false) to offload parameters and gradients to CPU. (useful only when `use_fsdp` flag is passed).",
)
parser.add_argument(
"--min_num_params",
type=int,
default=1e8,
help="FSDP's minimum number of parameters for Default Auto Wrapping. (useful only when `use_fsdp` flag is passed).",
)
parser.add_argument(
"--sharding_strategy",
type=int,
default=1,
help="FSDP's Sharding Strategy. (useful only when `use_fsdp` flag is passed).",
)
parser.add_argument(
"--tpu", default=False, action="store_true", help="Whether or not this should launch a TPU training."
)
parser.add_argument(
"--mixed_precision",
type=str,
choices=["no", "fp16", "bf16"],
help="Whether or not to use mixed precision training. "
"Choose between FP16 and BF16 (bfloat16) training. "
"BF16 training is only supported on Nvidia Ampere GPUs and PyTorch 1.10 or later.",
)
parser.add_argument(
"--fp16", default=False, action="store_true", help="Whether or not to use mixed precision training."
)
@ -63,17 +103,15 @@ def launch_command_parser(subparsers=None):
"--num_processes", type=int, default=None, help="The total number of processes to be launched in parallel."
)
parser.add_argument(
"--num_machines", type=int, default=1, help="The total number of machines used in this training."
"--num_machines", type=int, default=None, help="The total number of machines used in this training."
)
parser.add_argument(
"--machine_rank", type=int, default=0, help="The rank of the machine on which this script is launched."
)
parser.add_argument(
"--main_process_ip", type=Optional[str], default=None, help="The IP address of the machine of rank 0."
"--machine_rank", type=int, default=None, help="The rank of the machine on which this script is launched."
)
parser.add_argument("--main_process_ip", type=str, default=None, help="The IP address of the machine of rank 0.")
parser.add_argument(
"--main_process_port",
type=Optional[int],
type=int,
default=None,
help="The port to use to communicate with the machine of rank 0.",
)
@ -83,6 +121,35 @@ def launch_command_parser(subparsers=None):
default=None,
help="The name of the main function to be executed in your script (only for TPU training).",
)
parser.add_argument(
"-m",
"--module",
action="store_true",
help="Change each process to interpret the launch script as a Python module, executing with the same behavior as 'python -m'.",
)
parser.add_argument(
"--no_python",
action="store_true",
help="Skip prepending the training script with 'python' - just execute it directly. Useful when the script is not a Python script.",
)
parser.add_argument(
"--num_cpu_threads_per_process",
type=int,
default=1,
help="The number of CPU threads per process. Can be tuned for optimal performance.",
)
parser.add_argument(
"--aws_access_key_id",
type=str,
default=None,
help="The AWS_ACCESS_KEY_ID used to launch the Amazon SageMaker training job",
)
parser.add_argument(
"--aws_secret_access_key",
type=str,
default=None,
help="The AWS_SECRET_ACCESS_KEY used to launch the Amazon SageMaker training job",
)
parser.add_argument(
"training_script",
type=str,
@ -91,6 +158,25 @@ def launch_command_parser(subparsers=None):
"script."
),
)
parser.add_argument(
"--zero_stage",
default=None,
type=int,
help="DeepSpeed's ZeRO optimization stage (useful only when `use_deepspeed` flag is passed).",
)
parser.add_argument(
"--offload_optimizer_device",
default=None,
type=str,
help="Decides where (none|cpu|nvme) to offload optimizer states (useful only when `use_deepspeed` flag is passed).",
)
parser.add_argument(
"--gradient_accumulation_steps",
default=None,
type=int,
help="No of gradient_accumulation_steps used in your training script (useful only when `use_deepspeed` flag is passed).",
)
# Other arguments of the training scripts
parser.add_argument("training_script_args", nargs=argparse.REMAINDER, help="Arguments of the training script.")
@ -100,12 +186,30 @@ def launch_command_parser(subparsers=None):
def simple_launcher(args):
cmd = [sys.executable, args.training_script]
cmd = []
if args.no_python and args.module:
raise ValueError("--module and --no_python cannot be used together")
if not args.no_python:
cmd.append(sys.executable)
if args.module:
cmd.append("-m")
cmd.append(args.training_script)
cmd.extend(args.training_script_args)
current_env = os.environ.copy()
current_env["USE_CPU"] = str(args.cpu)
current_env["USE_FP16"] = str(args.fp16)
try:
mixed_precision = PrecisionType(args.mixed_precision.lower())
except ValueError:
raise ValueError(
f"Unknown mixed_precision mode: {args.mixed_precision.lower()}. Choose between {PrecisionType.list()}."
)
if args.fp16:
warnings.warn('--fp16 flag is deprecated. Use "--mixed_precision fp16" instead.', DeprecationWarning)
mixed_precision = "fp16"
current_env["MIXED_PRECISION"] = str(mixed_precision)
process = subprocess.Popen(cmd, env=current_env)
process.wait()
@ -114,8 +218,12 @@ def simple_launcher(args):
def multi_gpu_launcher(args):
cmd = [sys.executable, "-m", "torch.distributed.launch"]
cmd.extend(["--nproc_per_node", str(args.num_processes), "--use_env"])
if is_torch_version(">=", "1.10.0"):
cmd = ["torchrun"]
elif is_torch_version(">=", "1.9.0"):
cmd = [sys.executable, "-m", "torch.distributed.run"]
else:
cmd = [sys.executable, "-m", "torch.distributed.launch", "--use_env"]
if args.num_machines > 1:
cmd.extend(
[
@ -127,17 +235,95 @@ def multi_gpu_launcher(args):
str(args.machine_rank),
"--master_addr",
args.main_process_ip,
"--node_rank",
"--master_port",
str(args.main_process_port),
]
)
else:
cmd.extend(["--nproc_per_node", str(args.num_processes)])
if args.main_process_port is not None:
cmd.extend(["--master_port", str(args.main_process_port)])
if args.module and args.no_python:
raise ValueError("--module and --no_python cannot be used together")
elif args.module:
cmd.append("--module")
elif args.no_python:
cmd.append("--no_python")
cmd.append(args.training_script)
cmd.extend(args.training_script_args)
current_env = os.environ.copy()
current_env["USE_FP16"] = str(args.fp16)
try:
mixed_precision = PrecisionType(args.mixed_precision.lower())
except ValueError:
raise ValueError(
f"Unknown mixed_precision mode: {args.mixed_precision.lower()}. Choose between {PrecisionType.list()}."
)
if args.fp16:
warnings.warn('--fp16 flag is deprecated. Use "--mixed_precision fp16" instead.', DeprecationWarning)
mixed_precision = "fp16"
current_env["MIXED_PRECISION"] = str(mixed_precision)
if args.use_fsdp:
current_env["USE_FSDP"] = "true"
current_env["FSDP_OFFLOAD_PARAMS"] = str(args.offload_params).lower()
current_env["FSDP_MIN_NUM_PARAMS"] = str(args.min_num_params)
current_env["FSDP_SHARDING_STRATEGY"] = str(args.sharding_strategy)
current_env["OMP_NUM_THREADS"] = str(args.num_cpu_threads_per_process)
process = subprocess.Popen(cmd, env=current_env)
process.wait()
if process.returncode != 0:
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
def deepspeed_launcher(args):
cmd = ["deepspeed", "--no_local_rank"]
if args.num_machines > 1:
cmd.extend(
[
"--num_gpus",
str(args.num_processes // args.num_machines),
"--num_nodes",
str(args.num_machines),
"--node_rank",
str(args.machine_rank),
"--master_addr",
args.main_process_ip,
"--master_port",
str(args.main_process_port),
]
)
else:
cmd.extend(["--num_gpus", str(args.num_processes)])
if args.module and args.no_python:
raise ValueError("--module and --no_python cannot be used together")
elif args.module:
cmd.append("--module")
elif args.no_python:
cmd.append("--no_python")
cmd.append(args.training_script)
cmd.extend(args.training_script_args)
current_env = os.environ.copy()
try:
mixed_precision = PrecisionType(args.mixed_precision.lower())
except ValueError:
raise ValueError(
f"Unknown mixed_precision mode: {args.mixed_precision.lower()}. Choose between {PrecisionType.list()}."
)
if args.fp16:
warnings.warn('--fp16 flag is deprecated. Use "--mixed_precision fp16" instead.', DeprecationWarning)
mixed_precision = "fp16"
current_env["MIXED_PRECISION"] = str(mixed_precision)
current_env["USE_DEEPSPEED"] = "true"
current_env["DEEPSPEED_ZERO_STAGE"] = str(args.zero_stage)
current_env["GRADIENT_ACCUMULATION_STEPS"] = str(args.gradient_accumulation_steps)
current_env["DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE"] = str(args.offload_optimizer_device).lower()
process = subprocess.Popen(cmd, env=current_env)
process.wait()
@ -148,55 +334,201 @@ def multi_gpu_launcher(args):
def tpu_launcher(args):
import torch_xla.distributed.xla_multiprocessing as xmp
# Import training_script as a module.
script_path = Path(args.training_script)
sys.path.append(str(script_path.parent.resolve()))
mod_name = script_path.stem
if args.no_python:
raise ValueError("--no_python cannot be used with TPU launcher")
if args.module:
mod_name = args.training_script
else:
# Import training_script as a module
script_path = Path(args.training_script)
sys.path.append(str(script_path.parent.resolve()))
mod_name = script_path.stem
mod = importlib.import_module(mod_name)
if not hasattr(mod, args.main_training_function):
raise ValueError(
f"Your training script should have a function named {args.main_training_function}, or you should pass a "
"different value to `--main_training_function`."
)
main_function = getattr(mod, args.main_training_function)
# Patch sys.argv
sys.argv = [args.training_script] + args.training_script_args
sys.argv = [mod.__file__] + args.training_script_args
# If the function does not take one argument, launch will fail
launcher_sig = inspect.signature(main_function)
if len(launcher_sig.parameters) == 0:
xmp.spawn(_AddOneArg(main_function), args=(), nprocs=args.num_processes)
main_function = getattr(mod, args.main_training_function)
xmp.spawn(PrepareForLaunch(main_function), args=(), nprocs=args.num_processes)
def _convert_nargs_to_dict(nargs: List[str]) -> Dict[str, str]:
if len(nargs) == 0:
return {}
# helper function to infer type for argparse
def _infer_type(s):
try:
s = float(s)
if s // 1 == s:
return int(s)
return s
except ValueError:
return s
parser = argparse.ArgumentParser()
_, unknown = parser.parse_known_args(nargs)
for index, argument in enumerate(unknown):
if argument.startswith(("-", "--")):
action = None
if index + 1 < len(unknown): # checks if next index would be in list
if unknown[index + 1].startswith(("-", "--")): # checks if the next element is another key
# raise an error if element is store_true or store_false
raise ValueError(
"SageMaker doesnt support argparse actions for `store_true` or `store_false`. Please define explicit types"
)
else: # raise an error if last element is store_true or store_false
raise ValueError(
"SageMaker doesnt support argparse actions for `store_true` or `store_false`. Please define explicit types"
)
# adds argument to parser based on action_store true
if action is None:
parser.add_argument(argument, type=_infer_type)
else:
parser.add_argument(argument, action=action)
return {
key: (literal_eval(value) if value == "True" or value == "False" else value)
for key, value in parser.parse_args(nargs).__dict__.items()
}
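Assuming the module path `accelerate.commands.launch`, the hyperparameter conversion above can be checked directly; per the `_infer_type` logic, "1e-3" parses as a float, "3" as an int, and the literal strings "True"/"False" become booleans.

from accelerate.commands.launch import _convert_nargs_to_dict

print(_convert_nargs_to_dict(["--learning_rate", "1e-3", "--epochs", "3", "--do_eval", "True"]))
# expected, following the code above: {'learning_rate': 0.001, 'epochs': 3, 'do_eval': True}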
def sagemaker_launcher(sagemaker_config: SageMakerConfig, args):
if not is_sagemaker_available():
raise ImportError(
"Please install sagemaker to be able to launch training on Amazon SageMaker with `pip install accelerate[sagemaker]`"
)
if args.module or args.no_python:
raise ValueError(
"SageMaker requires a python training script file and cannot be used with --module or --no_python"
)
from sagemaker.huggingface import HuggingFace
# configure environment
print("Configuring Amazon SageMaker environment")
os.environ["AWS_DEFAULT_REGION"] = sagemaker_config.region
# configure credentials
if sagemaker_config.profile is not None:
os.environ["AWS_PROFILE"] = sagemaker_config.profile
elif args.aws_access_key_id is not None and args.aws_secret_access_key is not None:
os.environ["AWS_ACCESS_KEY_ID"] = args.aws_access_key_id
os.environ["AWS_SECRET_ACCESS_KEY"] = args.aws_secret_access_key
else:
xmp.spawn(main_function, args=(), nprocs=args.num_processes)
raise EnvironmentError(
"You need to provide an aws_access_key_id and aws_secret_access_key when not using aws_profile"
)
# extract needed arguments
source_dir = os.path.dirname(args.training_script)
if not source_dir: # checks if string is empty
source_dir = "."
entry_point = os.path.basename(args.training_script)
if not entry_point.endswith(".py"):
raise ValueError(f'Your training script should be a python script and not "{entry_point}"')
print("Converting Arguments to Hyperparameters")
hyperparameters = _convert_nargs_to_dict(args.training_script_args)
try:
mixed_precision = PrecisionType(args.mixed_precision.lower())
except ValueError:
raise ValueError(
f"Unknown mixed_precision mode: {args.mixed_precision.lower()}. Choose between {PrecisionType.list()}."
)
if args.fp16:
warnings.warn('--fp16 flag is deprecated. Use "--mixed_precision fp16" instead.', DeprecationWarning)
mixed_precision = "fp16"
# Environment variables to be set for use during training job
environment = {"MIXED_PRECISION": str(mixed_precision)}
# configure distribution set up
distribution = None # TODO: not yet implemented
# configure session
print("Creating Estimator")
huggingface_estimator = HuggingFace(
entry_point=entry_point,
source_dir=source_dir,
role=sagemaker_config.iam_role_name,
transformers_version="4.4",
pytorch_version="1.6",
py_version="py36",
base_job_name=sagemaker_config.base_job_name,
instance_count=sagemaker_config.num_machines,
instance_type=sagemaker_config.ec2_instance_type,
debugger_hook_config=False,
distribution=distribution,
hyperparameters=hyperparameters,
environment=environment,
)
huggingface_estimator.fit()
print(f"You can find your model data at: {huggingface_estimator.model_data}")
def launch_command(args):
# Sanity checks
if args.multi_gpu and args.tpu:
raise ValueError("You can only pick one between `--multi_gpu` and `--tpu`.")
if sum([args.multi_gpu, args.tpu, args.use_deepspeed, args.use_fsdp]) > 1:
raise ValueError("You can only pick one between `--multi_gpu`, `--use_deepspeed`, `--tpu`, `--use_fsdp`.")
defaults = None
# Get the default from the config file.
if args.config_file is not None or os.path.isfile(default_config_file) and not args.cpu:
defaults = LaunchConfig.from_json_file(json_file=args.config_file)
if not args.multi_gpu and not args.tpu:
defaults = load_config_from_file(args.config_file)
if not args.multi_gpu and not args.tpu and not args.use_deepspeed:
args.use_deepspeed = defaults.distributed_type == DistributedType.DEEPSPEED
args.multi_gpu = defaults.distributed_type == DistributedType.MULTI_GPU
args.tpu = defaults.distributed_type == DistributedType.TPU
if args.num_processes is None:
args.num_processes = defaults.num_processes
if not args.fp16:
args.fp16 = defaults.fp16
if args.main_training_function is None:
args.main_training_function = defaults.main_training_function
args.use_fsdp = defaults.distributed_type == DistributedType.FSDP
if defaults.compute_environment == ComputeEnvironment.LOCAL_MACHINE:
# Update args with the defaults
for name, attr in defaults.__dict__.items():
if isinstance(attr, dict):
for k in defaults.deepspeed_config:
if getattr(args, k) is None:
setattr(args, k, defaults.deepspeed_config[k])
for k in defaults.fsdp_config:
setattr(args, k, defaults.fsdp_config[k])
continue
# Those args are handled separately
if (
name not in ["compute_environment", "fp16", "mixed_precision", "distributed_type"]
and getattr(args, name, None) is None
):
setattr(args, name, attr)
if not args.mixed_precision:
if args.fp16:
args.mixed_precision = "fp16"
else:
args.mixed_precision = defaults.mixed_precision
else:
if args.num_processes is None:
args.num_processes = 1
# Use the proper launcher
if args.multi_gpu and not args.cpu:
if args.use_deepspeed and not args.cpu:
deepspeed_launcher(args)
elif args.use_fsdp and not args.cpu:
multi_gpu_launcher(args)
elif args.multi_gpu and not args.cpu:
multi_gpu_launcher(args)
elif args.tpu and not args.cpu:
tpu_launcher(args)
elif defaults is not None and defaults.compute_environment == ComputeEnvironment.AMAZON_SAGEMAKER:
sagemaker_launcher(defaults, args)
else:
simple_launcher(args)
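For orientation, on a single machine with 2 GPUs and PyTorch >= 1.10, `multi_gpu_launcher` boils down to roughly the subprocess call sketched below; the script name and arguments are illustrative and the FSDP/DeepSpeed branches are omitted.

import os
import subprocess

cmd = ["torchrun", "--nproc_per_node", "2", "train.py", "--lr", "1e-3"]  # illustrative script/args
env = os.environ.copy()
env["MIXED_PRECISION"] = "fp16"
env["OMP_NUM_THREADS"] = "1"  # mirrors --num_cpu_threads_per_process
# subprocess.Popen(cmd, env=env).wait()  # uncomment to actually launch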

View File

@ -30,7 +30,7 @@ def test_command_parser(subparsers=None):
"--config_file",
default=None,
help=(
"The path to use to store the config file. Will default to a file named default_config.json in the cache "
"The path to use to store the config file. Will default to a file named default_config.yaml in the cache "
"location, which is the content of the environment `HF_HOME` suffixed with 'accelerate', or if you don't have "
"such an environment variable, your cache directory ('~/.cache' or the content of `XDG_CACHE_HOME`) suffixed "
"with 'huggingface'."
@ -46,7 +46,7 @@ def test_command(args):
script_name = os.path.sep.join(__file__.split(os.path.sep)[:-2] + ["test_utils", "test_script.py"])
test_args = f"""
{script_name} --config_file={args.config_file}
--config_file={args.config_file} {script_name}
""".split()
cmd = ["accelerate-launch"] + test_args
result = execute_subprocess_async(cmd, env=os.environ.copy())
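The reordering above matters because `accelerate-launch` treats the first positional argument as the training script, so launcher flags must precede it; a small sketch with illustrative paths:

script_name = "/path/to/test_script.py"  # illustrative
config_file = "/path/to/config.yaml"     # illustrative
test_args = f"--config_file={config_file} {script_name}".split()
print(["accelerate-launch"] + test_args)  # flags first, then the positional script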

View File

@ -12,15 +12,26 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Optional
import math
from typing import List, Optional, Union
import torch
from torch.utils.data import BatchSampler, DataLoader, IterableDataset
from packaging import version
from .state import AcceleratorState, DistributedType, is_tpu_available
from .utils import send_to_device, synchronize_rng_states
from .utils import (
RNGType,
broadcast,
broadcast_object_list,
concatenate,
find_batch_size,
get_data_structure,
initialize_tensors,
is_torch_version,
send_to_device,
slice_tensors,
synchronize_rng_states,
)
if is_tpu_available():
@ -49,40 +60,40 @@ _PYTORCH_DATALOADER_ADDITIONAL_KWARGS = {
}
for v, additional_kwargs in _PYTORCH_DATALOADER_ADDITIONAL_KWARGS.items():
if version.parse(torch.__version__) >= version.parse(v):
if is_torch_version(">=", v):
_PYTORCH_DATALOADER_KWARGS.update(additional_kwargs)
class BatchSamplerShard(BatchSampler):
"""
Wraps a PyTorch :obj:`BatchSampler` to generate batches for one of the processes only. Instances of this class will
always yield a number of batches that is a round multiple of :obj:`num_processes` and that all have the same size.
Depending on the value of the :obj:`drop_last` attribute of the batch sampler passed, it will either stop the
iteration at the first batch that would be too small / not present on all processes or loop with indices from the
beginning.
Wraps a PyTorch `BatchSampler` to generate batches for one of the processes only. Instances of this class will
always yield a number of batches that is a round multiple of `num_processes` and that all have the same size.
Depending on the value of the `drop_last` attribute of the batch sampler passed, it will either stop the iteration
at the first batch that would be too small / not present on all processes or loop with indices from the beginning.
Args:
batch_sampler (:obj:`torch.utils.data.sampler.BatchSampler`):
batch_sampler (`torch.utils.data.sampler.BatchSampler`):
The batch sampler to split in several shards.
num_processes (:obj:`int`, `optional`, defaults to 1):
num_processes (`int`, *optional*, defaults to 1):
The number of processes running concurrently.
process_index (:obj:`int`, `optional`, defaults to 0):
process_index (`int`, *optional*, defaults to 0):
The index of the current process.
split_batches (:obj:`bool`, `optional`, defaults to :obj:`False`):
split_batches (`bool`, *optional*, defaults to `False`):
Whether the shards should be created by splitting a batch to give a piece of it on each process, or by
yielding different full batches on each process.
On two processes with a sampler of :obj:`[[0, 1, 2, 3], [4, 5, 6, 7]]`, this will result in:
On two processes with a sampler of `[[0, 1, 2, 3], [4, 5, 6, 7]]`, this will result in:
- the sampler on process 0 to yield :obj:`[0, 1, 2, 3]` and the sampler on process 1 to yield :obj:`[4, 5,
6, 7]` if this argument is set to :obj:`False`.
- the sampler on process 0 to yield :obj:`[0, 1]` then :obj:`[4, 5]` and the sampler on process 1 to yield
:obj:`[2, 3]` then :obj:`[6, 7]` if this argument is set to :obj:`True`.
- the sampler on process 0 to yield `[0, 1, 2, 3]` and the sampler on process 1 to yield `[4, 5, 6, 7]` if
this argument is set to `False`.
- the sampler on process 0 to yield `[0, 1]` then `[4, 5]` and the sampler on process 1 to yield `[2, 3]`
then `[6, 7]` if this argument is set to `True`.
.. warning::
<Tip warning={true}>
This does not support :obj:`BatchSampler` with varying batch size yet.
"""
This does not support `BatchSampler` with varying batch size yet.
</Tip>"""
def __init__(
self,
@ -104,6 +115,8 @@ class BatchSamplerShard(BatchSampler):
self.drop_last = batch_sampler.drop_last
def __len__(self):
if self.split_batches:
return len(self.batch_sampler)
if len(self.batch_sampler) % self.num_processes == 0:
return len(self.batch_sampler) // self.num_processes
length = len(self.batch_sampler) // self.num_processes
@ -174,35 +187,35 @@ class BatchSamplerShard(BatchSampler):
class IterableDatasetShard(IterableDataset):
"""
Wraps a PyTorch :obj:`IterableDataset` to generate samples for one of the processes only. Instances of this class
will always yield a number of samples that is a round multiple of the actual batch size (depending of the value of
:obj:`split_batches`, this is either :obj:`batch_size` or :obj:`batch_size x num_processes`). Depending on the
value of the :obj:`drop_last` attribute of the batch sampler passed, it will either stop the iteration at the first
batch that would be too small or loop with indices from the beginning.
Wraps a PyTorch `IterableDataset` to generate samples for one of the processes only. Instances of this class will
always yield a number of samples that is a round multiple of the actual batch size (depending on the value of
`split_batches`, this is either `batch_size` or `batch_size x num_processes`). Depending on the value of the
`drop_last` attribute of the batch sampler passed, it will either stop the iteration at the first batch that would
be too small or loop with indices from the beginning.
Args:
dataset (:obj:`torch.utils.data.dataset.IterableDataset`):
dataset (`torch.utils.data.dataset.IterableDataset`):
The dataset to split in several shards.
batch_size (:obj:`int`, `optional`, defaults to 1):
The size of the batches per shard (if :obj:`split_batches=False`) or the size of the batches (if
:obj:`split_batches=True`).
drop_last (:obj:`bool`, `optional`, defaults to :obj:`False`):
batch_size (`int`, *optional*, defaults to 1):
The size of the batches per shard (if `split_batches=False`) or the size of the batches (if
`split_batches=True`).
drop_last (`bool`, *optional*, defaults to `False`):
Whether or not to drop the last incomplete batch or complete the last batches by using the samples from the
beginning.
num_processes (:obj:`int`, `optional`, defaults to 1):
num_processes (`int`, *optional*, defaults to 1):
The number of processes running concurrently.
process_index (:obj:`int`, `optional`, defaults to 0):
process_index (`int`, *optional*, defaults to 0):
The index of the current process.
split_batches (:obj:`bool`, `optional`, defaults to :obj:`False`):
split_batches (`bool`, *optional*, defaults to `False`):
Whether the shards should be created by splitting a batch to give a piece of it on each process, or by
yielding different full batches on each process.
On two processes with an iterable dataset yielding of :obj:`[0, 1, 2, 3, 4, 5, 6, 7]`, this will result in:
On two processes with an iterable dataset yielding of `[0, 1, 2, 3, 4, 5, 6, 7]`, this will result in:
- the shard on process 0 to yield :obj:`[0, 1, 2, 3]` and the shard on process 1 to yield :obj:`[4, 5, 6,
7]` if this argument is set to :obj:`False`.
- the shard on process 0 to yield :obj:`[0, 1, 4, 5]` and the sampler on process 1 to yield :obj:`[2, 3, 6,
7]` if this argument is set to :obj:`True`.
- the shard on process 0 to yield `[0, 1, 2, 3]` and the shard on process 1 to yield `[4, 5, 6, 7]` if this
argument is set to `False`.
- the shard on process 0 to yield `[0, 1, 4, 5]` and the sampler on process 1 to yield `[2, 3, 6, 7]` if
this argument is set to `True`.
"""
def __init__(
@ -214,7 +227,7 @@ class IterableDatasetShard(IterableDataset):
process_index: int = 0,
split_batches: bool = False,
):
if split_batches and batch_size % num_processes != 0:
if split_batches and batch_size > 1 and batch_size % num_processes != 0:
raise ValueError(
f"To use `IterableDatasetShard` in `split_batches` mode, the batch size ({batch_size}) "
f"needs to be a round multiple of the number of processes ({num_processes})."
@ -255,23 +268,36 @@ class IterableDatasetShard(IterableDataset):
class DataLoaderShard(DataLoader):
"""
Subclass of a PyTorch :obj:`DataLoader` that will deal with device placement and current distributed setup.
Subclass of a PyTorch `DataLoader` that will deal with device placement and current distributed setup.
Args:
dataset (:obj:`torch.utils.data.dataset.Dataset`):
dataset (`torch.utils.data.dataset.Dataset`):
The dataset to use to build this datalaoder.
device (:obj:`torch.device`, `optional`):
device (`torch.device`, *optional*):
If passed, the device to put all batches on.
rng_types (list of `str` or [`~utils.RNGType`]):
The list of random number generators to synchronize at the beginning of each iteration. Should be one or
several of:
- `"torch"`: the base torch random number generator
- `"cuda"`: the CUDA random number generator (GPU only)
- `"xla"`: the XLA random number generator (TPU only)
- `"generator"`: an optional `torch.Generator`
generator (`torch.Generator`, *optional*):
A random number generator to keep synchronized across processes.
kwargs:
All other keyword arguments to pass to the regular :obj:`DataLoader` initialization.
All other keyword arguments to pass to the regular `DataLoader` initialization.
"""
def __init__(self, dataset, device=None, **kwargs):
def __init__(self, dataset, device=None, rng_types=None, generator=None, **kwargs):
super().__init__(dataset, **kwargs)
self.device = device
self.rng_types = rng_types
self.generator = generator
def __iter__(self):
synchronize_rng_states()
if self.rng_types is not None:
synchronize_rng_states(self.rng_types, self.generator)
state = AcceleratorState()
for batch in super().__iter__():
if state.distributed_type == DistributedType.TPU:
@ -279,6 +305,123 @@ class DataLoaderShard(DataLoader):
yield batch if self.device is None else send_to_device(batch, self.device)
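A sketch of building a `DataLoaderShard` by hand (normally `Accelerator.prepare` does this for you); an `Accelerator` is created first so the `AcceleratorState` used in `__iter__` is initialized, and the dataset is a stand-in.

import torch
from torch.utils.data import TensorDataset
from accelerate import Accelerator
from accelerate.data_loader import DataLoaderShard

accelerator = Accelerator()
dataset = TensorDataset(torch.arange(16))
loader = DataLoaderShard(
    dataset,
    device=accelerator.device,
    rng_types=["generator"],
    generator=torch.Generator().manual_seed(0),
    batch_size=4,
    shuffle=True,
)
for batch in loader:
    print(batch)  # batches are already on accelerator.device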
class DataLoaderDispatcher(DataLoader):
"""
Subclass of a PyTorch `DataLoader` that will iterate and preprocess on process 0 only, then dispatch on each
process their part of the batch.
Args:
split_batches (`bool`, *optional*, defaults to `False`):
Whether the resulting `DataLoader` should split the batches of the original data loader across devices or
yield full batches (in which case it will yield batches starting at the `process_index`-th and advancing of
`num_processes` batches at each iteration).
Another way to see this is that the observed batch size will be the same as the initial `dataloader` if
this option is set to `True`, the batch size of the initial `dataloader` multiplied by `num_processes`
otherwise.
Setting this option to `True` requires that the batch size of the `dataloader` is a round multiple of
`batch_size`.
"""
def __init__(self, dataset, split_batches: bool = False, **kwargs):
shuffle = False
if is_torch_version(">=", "1.11.0"):
from torch.utils.data.datapipes.iter.combinatorics import ShufflerIterDataPipe
# We need to save the shuffling state of the DataPipe
if isinstance(dataset, ShufflerIterDataPipe):
shuffle = dataset._shuffle_enabled
super().__init__(dataset, **kwargs)
self.split_batches = split_batches
if is_torch_version("<", "1.8.0"):
raise ImportError(
"Using `DataLoaderDispatcher` requires PyTorch 1.8.0 minimum. You have {torch.__version__}."
)
if shuffle:
torch.utils.data.graph_settings.apply_shuffle_settings(dataset, shuffle=shuffle)
def __iter__(self):
state = AcceleratorState()
if state.process_index == 0:
# We only iterate through the DataLoader on process 0.
main_iterator = super().__iter__()
stop_iteration = False
first_batch = None
while not stop_iteration:
# On process 0, we gather the batch to dispatch.
if state.process_index == 0:
try:
if self.split_batches:
# One batch of the main iterator is dispatched and split.
batch = next(main_iterator)
else:
# num_processes batches of the main iterator are concatenated then dispatched and split.
# We add the batches one by one so we have the remainder available when drop_last=False.
batches = []
for _ in range(state.num_processes):
batches.append(next(main_iterator))
batch = concatenate(batches, dim=0)
# In both cases, we need to get the structure of the batch that we will broadcast on other
# processes to initialize the tensors with the right shape.
# data_structure, stop_iteration
batch_info = [get_data_structure(batch), False]
except StopIteration:
batch_info = [None, True]
else:
batch_info = [None, stop_iteration]
# This is inplace, so after this instruction, every process has the same `batch_info` as process 0.
broadcast_object_list(batch_info)
stop_iteration = batch_info[1]
if stop_iteration:
# If drop_last is False and split_batches is False, we may have a remainder to take care of.
if not self.split_batches and not self.drop_last:
if state.process_index == 0 and len(batches) > 0:
batch = concatenate(batches, dim=0)
batch_info = [get_data_structure(batch), False]
else:
batch_info = [None, True]
broadcast_object_list(batch_info)
if batch_info[1]:
continue
else:
continue
if state.process_index != 0:
# Initialize tensors on other processes than process 0.
batch = initialize_tensors(batch_info[0])
batch = send_to_device(batch, state.device)
# Broadcast the batch before splitting it.
batch = broadcast(batch, from_process=0)
if not self.drop_last and first_batch is None:
# We keep at least num processes elements of the first batch to be able to complete the last batch
first_batch = slice_tensors(batch, slice(0, state.num_processes))
observed_batch_size = find_batch_size(batch)
batch_size = observed_batch_size // state.num_processes
if not self.drop_last and stop_iteration and observed_batch_size % state.num_processes != 0:
# If the last batch is not complete, let's add the first batch to it.
batch = concatenate([batch, first_batch], dim=0)
batch_size += 1
data_slice = slice(state.process_index * batch_size, (state.process_index + 1) * batch_size)
if state.distributed_type == DistributedType.TPU:
xm.mark_step()
yield slice_tensors(batch, data_slice)
def __len__(self):
state = AcceleratorState()
whole_length = super().__len__()
if self.drop_last:
return whole_length // state.num_processes
else:
return math.ceil(whole_length / state.num_processes)
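A worked example of the per-process slicing arithmetic used in `__iter__` above, in plain Python with no distributed setup: a gathered batch of 8 items dispatched to 4 processes.

num_processes, observed_batch_size = 4, 8
batch = list(range(observed_batch_size))
batch_size = observed_batch_size // num_processes  # 2 items per process
for process_index in range(num_processes):
    data_slice = slice(process_index * batch_size, (process_index + 1) * batch_size)
    print(process_index, batch[data_slice])
# process 0 -> [0, 1], process 1 -> [2, 3], process 2 -> [4, 5], process 3 -> [6, 7]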
def prepare_data_loader(
dataloader: DataLoader,
device: Optional[torch.device] = None,
@ -286,46 +429,70 @@ def prepare_data_loader(
process_index: Optional[int] = None,
split_batches: bool = False,
put_on_device: bool = False,
rng_types: Optional[List[Union[str, RNGType]]] = None,
dispatch_batches: Optional[bool] = None,
) -> DataLoader:
"""
Wraps a PyTorch :obj:`DataLoader` to generate batches for one of the processes only.
Wraps a PyTorch `DataLoader` to generate batches for one of the processes only.
Depending on the value of the :obj:`drop_last` attribute of the :obj:`dataloader` passed, it will either stop the
iteration at the first batch that would be too small / not present on all processes or loop with indices from the
beginning.
Depending on the value of the `drop_last` attribute of the `dataloader` passed, it will either stop the iteration
at the first batch that would be too small / not present on all processes or loop with indices from the beginning.
Args:
dataloader (:obj:`torch.utils.data.dataloader.DataLoader`):
dataloader (`torch.utils.data.dataloader.DataLoader`):
The data loader to split across several devices.
device (:obj:`torch.device`):
The target device for the returned :obj:`DataLoader`.
num_processes (:obj:`int`, `optional`):
device (`torch.device`):
The target device for the returned `DataLoader`.
num_processes (`int`, *optional*):
The number of processes running concurrently. Will default to the value given by
:class:`~accelerate.AcceleratorState`.
process_index (:obj:`int`, `optional`):
The index of the current process. Will default to the value given by :class:`~accelerate.AcceleratorState`.
split_batches (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether the resulting :obj:`DataLoader` should split the batches of the original data loader across devices
or yield full batches (in which case it will yield batches starting at the :obj:`process_index`-th and
advancing of :obj:`num_processes` batches at each iteration).
[`~state.AcceleratorState`].
process_index (`int`, *optional*):
The index of the current process. Will default to the value given by [`~state.AcceleratorState`].
split_batches (`bool`, *optional*, defaults to `False`):
Whether the resulting `DataLoader` should split the batches of the original data loader across devices or
yield full batches (in which case it will yield batches starting at the `process_index`-th and advancing of
`num_processes` batches at each iteration).
Another way to see this is that the observed batch size will be the same as the initial :obj:`dataloader`
if this option is set to :obj:`True`, the batch size of the initial :obj:`dataloader` multiplied by
:obj:`num_processes` otherwise.
Another way to see this is that the observed batch size will be the same as the initial `dataloader` if
this option is set to `True`, the batch size of the initial `dataloader` multiplied by `num_processes`
otherwise.
Setting this option to :obj:`True` requires that the batch size of the :obj:`dataloader` is a round
multiple of :obj:`batch_size`.
put_on_device (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to put the batches on :obj:`device` (only works if the batches are nested list, tuples or
Setting this option to `True` requires that the batch size of the `dataloader` is a round multiple of
`num_processes`.
put_on_device (`bool`, *optional*, defaults to `False`):
Whether or not to put the batches on `device` (only works if the batches are nested list, tuples or
dictionaries of tensors).
rng_types (list of `str` or [`~utils.RNGType`]):
The list of random number generators to synchronize at the beginning of each iteration. Should be one or
several of:
- `"torch"`: the base torch random number generator
- `"cuda"`: the CUDA random number generator (GPU only)
- `"xla"`: the XLA random number generator (TPU only)
- `"generator"`: the `torch.Generator` of the sampler (or batch sampler if there is no sampler in your
dataloader) or of the iterable dataset (if it exists) if the underlying dataset is of that type.
dispatch_batches (`bool`, *optional*):
If set to `True`, the prepared dataloader is only iterated through on the main process and then the batches
are split and broadcast to each process. Will default to `True` when the underlying dataset is an
`IterableDataset`, `False` otherwise.
Returns:
:obj:`torch.utils.data.dataloader.DataLoader`: A new data loader that will yield the portion of the batches
`torch.utils.data.dataloader.DataLoader`: A new data loader that will yield the portion of the batches
.. warning::
<Tip warning={true}>
This does not support :obj:`BatchSampler` with varying batch size yet.
"""
This does not support `BatchSampler` with varying batch size yet.
</Tip>"""
if dispatch_batches is None:
if is_torch_version("<", "1.8.0") or not put_on_device:
dispatch_batches = False
else:
dispatch_batches = isinstance(dataloader.dataset, IterableDataset)
if dispatch_batches and not put_on_device:
raise ValueError("Using `dispatch_batches=True` requires `put_on_device=True`.")
# Grab defaults from AcceleratorState
state = AcceleratorState()
if num_processes is None:
@ -334,17 +501,21 @@ def prepare_data_loader(
process_index = state.process_index
# Sanity check
if split_batches and dataloader.batch_size % num_processes != 0:
if split_batches and dataloader.batch_size > 1 and dataloader.batch_size % num_processes != 0:
raise ValueError(
f"Using `split_batches=True` requires that the batch size ({dataloader.batch_size}) "
f"to be a round multiple of the number of processes ({num_processes})."
f"To use a `DataLoader` in `split_batches` mode, the batch size ({dataloader.batch_size}) "
f"needs to be a round multiple of the number of processes ({num_processes})."
)
new_dataset = dataloader.dataset
new_batch_sampler = dataloader.batch_sampler
# Iterable dataset doesn't like batch_sampler, but data_loader creates a default one for it
new_batch_sampler = dataloader.batch_sampler if not isinstance(new_dataset, IterableDataset) else None
generator = getattr(dataloader, "generator", None)
# No change if no multiprocess
if num_processes != 1:
if num_processes != 1 and not dispatch_batches:
if isinstance(new_dataset, IterableDataset):
if getattr(dataloader.dataset, "generator", None) is not None:
generator = dataloader.dataset.generator
new_dataset = IterableDatasetShard(
new_dataset,
batch_size=dataloader.batch_size,
@ -355,6 +526,13 @@ def prepare_data_loader(
)
else:
# New batch sampler for the current process.
if hasattr(dataloader.sampler, "generator"):
if dataloader.sampler.generator is None:
dataloader.sampler.generator = torch.Generator()
generator = dataloader.sampler.generator
generator.manual_seed(int(torch.empty((), dtype=torch.int64).random_().item()))
elif getattr(dataloader.batch_sampler, "generator", None) is not None:
generator = dataloader.batch_sampler.generator
new_batch_sampler = BatchSamplerShard(
dataloader.batch_sampler,
num_processes=num_processes,
@ -369,16 +547,33 @@ def prepare_data_loader(
"sampler",
"batch_sampler",
"drop_last",
"generator",
]
if rng_types is not None and generator is None and "generator" in rng_types:
rng_types.remove("generator")
kwargs = {
k: getattr(dataloader, k, _PYTORCH_DATALOADER_KWARGS[k])
for k in _PYTORCH_DATALOADER_KWARGS
if k not in ignore_kwargs
}
# Need to provide batch_size as batch_sampler is None for Iterable dataset
if new_batch_sampler is None:
kwargs["drop_last"] = dataloader.drop_last
kwargs["batch_size"] = dataloader.batch_size // num_processes if split_batches else dataloader.batch_size
if dispatch_batches:
return DataLoaderDispatcher(
new_dataset, split_batches=split_batches, batch_sampler=new_batch_sampler, **kwargs
)
return DataLoaderShard(
new_dataset,
device=device if put_on_device else None,
batch_sampler=new_batch_sampler,
rng_types=rng_types,
generator=generator,
**kwargs,
)
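For reference, a minimal sketch of how `prepare_data_loader` could be called directly (inside a training script, `Accelerator.prepare` normally does this for you); the dataset and batch size below are illustrative, not taken from the diff:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from accelerate import Accelerator
from accelerate.data_loader import prepare_data_loader

accelerator = Accelerator()  # initializes the shared AcceleratorState
state = accelerator.state

dataset = TensorDataset(torch.arange(64))
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

sharded_dataloader = prepare_data_loader(
    dataloader,
    device=state.device,
    num_processes=state.num_processes,
    process_index=state.process_index,
    put_on_device=True,
    rng_types=["torch", "generator"],
    dispatch_batches=False,  # True would iterate only on the main process and broadcast slices
)

for (batch,) in sharded_dataloader:
    # Each process only sees its own slice of every batch.
    pass
```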

src/accelerate/hooks.py

@ -0,0 +1,444 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import functools
from typing import Dict, Mapping, Optional, Union
import torch
import torch.nn as nn
from .utils import PrefixedDataset, find_device, named_module_tensors, send_to_device, set_module_tensor_to_device
class ModelHook:
"""
A hook that contains callbacks to be executed just before and after the forward method of a model. The difference
with PyTorch existing hooks is that they get passed along the kwargs.
Class attribute:
- **no_grad** (`bool`, *optional*, defaults to `False`) -- Whether or not to execute the actual forward pass under
the `torch.no_grad()` context manager.
"""
no_grad = False
def init_hook(self, module):
"""
To be executed when the hook is attached to the module.
Args:
module (`torch.nn.Module`): The module attached to this hook.
"""
return module
def pre_forward(self, module, *args, **kwargs):
"""
To be executed just before the forward method of the model.
Args:
module (`torch.nn.Module`): The module whose forward pass will be executed just after this event.
args (`Tuple[Any]`): The positional arguments passed to the module.
kwargs (`Dict[Str, Any]`): The keyword arguments passed to the module.
Returns:
`Tuple[Tuple[Any], Dict[Str, Any]]`: A tuple with the treated `args` and `kwargs`.
"""
return args, kwargs
def post_forward(self, module, output):
"""
To be executed just after the forward method of the model.
Args:
module (`torch.nn.Module`): The module whose forward pass been executed just before this event.
output (`Any`): The output of the module.
Returns:
`Any`: The processed `output`.
"""
return output
def detach_hook(self, module):
"""
To be executed when the hook is detached from a module.
Args:
module (`torch.nn.Module`): The module detached from this hook.
"""
return module
class SequentialHook(ModelHook):
"""
A hook that can contain several hooks and iterates through them at each event.
"""
def __init__(self, *hooks):
self.hooks = hooks
def init_hook(self, module):
for hook in self.hooks:
module = hook.init_hook(module)
return module
def pre_forward(self, module, *args, **kwargs):
for hook in self.hooks:
args, kwargs = hook.pre_forward(module, *args, **kwargs)
return args, kwargs
def post_forward(self, module, output):
for hook in self.hooks:
output = hook.post_forward(module, output)
return output
def detach_hook(self, module):
for hook in self.hooks:
module = hook.detach_hook(module)
return module
def add_hook_to_module(module: nn.Module, hook: ModelHook):
"""
Adds a hook to a given module. This will rewrite the `forward` method of the module to include the hook, to remove
this behavior and restore the original `forward` method, use `remove_hook_from_module`.
<Tip warning={true}>
If the module already contains a hook, this will replace it with the new hook passed. To chain two hooks together,
use the `SequentialHook` class.
</Tip>
Args:
module (`torch.nn.Module`): The module to attach a hook to.
hook (`ModelHook`): The hook to attach.
Returns:
`torch.nn.Module`: The same module, with the hook attached (the module is modified in place, so the result can
be discarded).
"""
if hasattr(module, "_hf_hook") and hasattr(module, "_old_forward"):
# If we already put some hook on this module, we replace it with the new one.
old_forward = module._old_forward
else:
old_forward = module.forward
module._old_forward = old_forward
module = hook.init_hook(module)
module._hf_hook = hook
@functools.wraps(old_forward)
def new_forward(*args, **kwargs):
args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
if module._hf_hook.no_grad:
with torch.no_grad():
output = old_forward(*args, **kwargs)
else:
output = old_forward(*args, **kwargs)
return module._hf_hook.post_forward(module, output)
module.forward = new_forward
return module
def remove_hook_from_module(module: nn.Module):
"""
Removes any hook attached to a module via `add_hook_to_module`.
Args:
module (`torch.nn.Module`): The module to remove the hook from.
Returns:
`torch.nn.Module`: The same module, with the hook detached (the module is modified in place, so the result can
be discarded).
"""
if hasattr(module, "_hf_hook"):
module._hf_hook.detach_hook(module)
delattr(module, "_hf_hook")
if hasattr(module, "_old_forward"):
module.forward = module._old_forward
delattr(module, "_old_forward")
return module
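As an illustration of the hook mechanism above, here is a hedged sketch of a custom `ModelHook`; the `CastInputsHook` name and its behavior are invented for the example:

```python
import torch
import torch.nn as nn

from accelerate.hooks import ModelHook, add_hook_to_module, remove_hook_from_module


class CastInputsHook(ModelHook):
    """Casts positional tensor inputs to float32 before the wrapped forward runs."""

    def pre_forward(self, module, *args, **kwargs):
        args = tuple(a.float() if isinstance(a, torch.Tensor) else a for a in args)
        return args, kwargs


layer = nn.Linear(4, 2)
add_hook_to_module(layer, CastInputsHook())

# A float64 input would normally fail against float32 weights; the hook casts it first.
output = layer(torch.ones(1, 4, dtype=torch.float64))

remove_hook_from_module(layer)  # restores the original `forward`
```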
class AlignDevicesHook(ModelHook):
"""
A generic `ModelHook` that ensures inputs and model weights are on the same device for the forward pass of the
associated module, potentially offloading the weights after the forward pass.
Args:
execution_device (`torch.device`, *optional*):
The device on which inputs and model weights should be placed before the forward pass.
offload (`bool`, *optional*, defaults to `False`):
Whether or not the weights should be offloaded after the forward pass.
io_same_device (`bool`, *optional*, defaults to `False`):
Whether or not the output should be placed on the same device as the input was.
weights_map (`Mapping[str, torch.Tensor]`, *optional*):
When the model weights are offloaded, a (potentially lazy) map from param names to the tensor values.
offload_buffers (`bool`, *optional*, defaults to `False`):
Whether or not to include the associated module's buffers when offloading.
place_submodules (`bool`, *optional*, defaults to `False`):
Whether to place the submodules on `execution_device` during the `init_hook` event.
"""
def __init__(
self,
execution_device: Optional[Union[int, str, torch.device]] = None,
offload: bool = False,
io_same_device: bool = False,
weights_map: Optional[Mapping] = None,
offload_buffers: bool = False,
place_submodules: bool = False,
):
self.execution_device = execution_device
self.offload = offload
self.io_same_device = io_same_device
self.weights_map = weights_map
self.offload_buffers = offload_buffers
self.place_submodules = place_submodules
# Will contain the input device when `io_same_device=True`.
self.input_device = None
self.param_original_devices = {}
self.buffer_original_devices = {}
def init_hook(self, module):
if not self.offload and self.execution_device is not None:
for name, _ in named_module_tensors(module, recurse=self.place_submodules):
set_module_tensor_to_device(module, name, self.execution_device)
elif self.offload:
self.original_devices = {
name: param.device for name, param in named_module_tensors(module, recurse=self.place_submodules)
}
if self.weights_map is None:
self.weights_map = {
name: param.to("cpu")
for name, param in named_module_tensors(
module, include_buffers=self.offload_buffers, recurse=self.place_submodules
)
}
for name, _ in named_module_tensors(
module, include_buffers=self.offload_buffers, recurse=self.place_submodules
):
set_module_tensor_to_device(module, name, "meta")
if not self.offload_buffers and self.execution_device is not None:
for name, _ in module.named_buffers(recurse=self.place_submodules):
set_module_tensor_to_device(module, name, self.execution_device)
return module
def pre_forward(self, module, *args, **kwargs):
if self.io_same_device:
self.input_device = find_device([args, kwargs])
if self.offload:
for name, _ in named_module_tensors(
module, include_buffers=self.offload_buffers, recurse=self.place_submodules
):
set_module_tensor_to_device(module, name, self.execution_device, value=self.weights_map[name])
return send_to_device(args, self.execution_device), send_to_device(kwargs, self.execution_device)
def post_forward(self, module, output):
if self.offload:
for name, _ in named_module_tensors(
module, include_buffers=self.offload_buffers, recurse=self.place_submodules
):
set_module_tensor_to_device(module, name, "meta")
if self.io_same_device and self.input_device is not None:
output = send_to_device(output, self.input_device)
return output
def detach_hook(self, module):
if self.offload:
for name, device in self.original_devices.items():
if device != torch.device("meta"):
set_module_tensor_to_device(module, name, device, value=self.weights_map.get(name, None))
def attach_execution_device_hook(module: torch.nn.Module, execution_device: Union[int, str, torch.device]):
"""
Recursively attaches `AlignDevicesHook` to all submodules of a given model to make sure they have the right
execution device
Args:
module (`torch.nn.Module`):
The module where we want to attach the hooks.
execution_device (`int`, `str` or `torch.device`):
The device on which inputs and model weights should be placed before the forward pass.
"""
if not hasattr(module, "_hf_hook") and len(module.state_dict()) > 0:
add_hook_to_module(module, AlignDevicesHook(execution_device))
for child in module.children():
attach_execution_device_hook(child, execution_device)
def attach_align_device_hook(
module: torch.nn.Module,
execution_device: Optional[torch.device] = None,
offload: bool = False,
weights_map: Optional[Mapping] = None,
offload_buffers: bool = False,
module_name: str = "",
):
"""
Recursively attaches `AlignDevicesHook` to all submodules of a given model that have direct parameters and/or
buffers.
Args:
module (`torch.nn.Module`):
The module where we want to attach the hooks.
execution_device (`torch.device`, *optional*):
The device on which inputs and model weights should be placed before the forward pass.
offload (`bool`, *optional*, defaults to `False`):
Whether or not the weights should be offloaded after the forward pass.
weights_map (`Mapping[str, torch.Tensor]`, *optional*):
When the model weights are offloaded, a (potentially lazy) map from param names to the tensor values.
offload_buffers (`bool`, *optional*, defaults to `False`):
Whether or not to include the associated module's buffers when offloading.
module_name (`str`, *optional*, defaults to `""`):
The name of the module.
"""
# Attach the hook on this module if it has any direct tensor.
directs = named_module_tensors(module)
if len(list(directs)) > 0:
if weights_map is not None:
prefix = f"{module_name}." if len(module_name) > 0 else ""
prefixed_weights_map = PrefixedDataset(weights_map, prefix)
else:
prefixed_weights_map = None
hook = AlignDevicesHook(
execution_device=execution_device,
offload=offload,
weights_map=prefixed_weights_map,
offload_buffers=offload_buffers,
)
add_hook_to_module(module, hook)
# Recurse on all children of the module.
for child_name, child in module.named_children():
child_name = f"{module_name}.{child_name}" if len(module_name) > 0 else child_name
attach_align_device_hook(
child,
execution_device=execution_device,
offload=offload,
weights_map=weights_map,
offload_buffers=offload_buffers,
module_name=child_name,
)
def remove_hook_from_submodules(module: nn.Module):
"""
Recursively removes all hooks attached on the submodules of a given model.
Args:
module (`torch.nn.Module`): The module on which to remove all hooks.
"""
remove_hook_from_module(module)
for child in module.children():
remove_hook_from_submodules(child)
def attach_align_device_hook_on_blocks(
module: nn.Module,
execution_device: Optional[Union[torch.device, Dict[str, torch.device]]] = None,
offload: Union[bool, Dict[str, bool]] = False,
weights_map: Mapping = None,
offload_buffers: bool = False,
module_name: str = "",
):
"""
Attaches `AlignDevicesHook` to all blocks of a given model as needed.
Args:
module (`torch.nn.Module`):
The module where we want to attach the hooks.
execution_device (`torch.device` or `Dict[str, torch.device]`, *optional*):
The device on which inputs and model weights should be placed before the forward pass. It can be one device
for the whole module, or a dictionary mapping module name to device.
offload (`bool`, *optional*, defaults to `False`):
Whether or not the weights should be offloaded after the forward pass. It can be one boolean for the whole
module, or a dictionary mapping module name to boolean.
weights_map (`Mapping[str, torch.Tensor]`, *optional*):
When the model weights are offloaded, a (potentially lazy) map from param names to the tensor values.
offload_buffers (`bool`, *optional*, defaults to `False`):
Whether or not to include the associated module's buffers when offloading.
module_name (`str`, *optional*, defaults to `""`):
The name of the module.
"""
# If one device and one offload, we've got one hook.
if not isinstance(execution_device, Mapping) and not isinstance(offload, dict):
if not offload:
hook = AlignDevicesHook(execution_device=execution_device, io_same_device=True, place_submodules=True)
add_hook_to_module(module, hook)
else:
attach_align_device_hook(
module,
execution_device=execution_device,
offload=True,
weights_map=weights_map,
offload_buffers=offload_buffers,
module_name=module_name,
)
return
if not isinstance(execution_device, Mapping):
execution_device = {key: execution_device for key in offload.keys()}
if not isinstance(offload, Mapping):
offload = {key: offload for key in execution_device.keys()}
if module_name in execution_device and not offload[module_name]:
hook = AlignDevicesHook(
execution_device=execution_device[module_name],
offload_buffers=offload_buffers,
io_same_device=(module_name == ""),
place_submodules=True,
)
add_hook_to_module(module, hook)
attach_execution_device_hook(module, execution_device[module_name])
elif module_name in execution_device:
if weights_map is not None:
prefix = f"{module_name}." if len(module_name) > 0 else ""
prefixed_weights_map = PrefixedDataset(weights_map, prefix)
else:
prefixed_weights_map = None
hook = AlignDevicesHook(
execution_device=execution_device[module_name],
offload=True,
weights_map=prefixed_weights_map,
offload_buffers=offload_buffers,
io_same_device=(module_name == ""),
place_submodules=True,
)
attach_execution_device_hook(module, execution_device[module_name])
add_hook_to_module(module, hook)
elif module_name == "":
hook = AlignDevicesHook(io_same_device=True)
add_hook_to_module(module, hook)
for child_name, child in module.named_children():
child_name = f"{module_name}.{child_name}" if len(module_name) > 0 else child_name
attach_align_device_hook_on_blocks(
child,
execution_device=execution_device,
offload=offload,
weights_map=weights_map,
offload_buffers=offload_buffers,
module_name=child_name,
)
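Putting these helpers together, a small sketch of weight offloading; using the CPU as the execution device keeps the snippet runnable without a GPU, which is an assumption of the example rather than the typical use case:

```python
import torch
import torch.nn as nn

from accelerate.hooks import attach_align_device_hook, remove_hook_from_submodules

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))

# Weights are copied to a CPU-side map and replaced by meta tensors; each forward pass
# materializes them on the execution device, then offloads them again.
attach_align_device_hook(model, execution_device=torch.device("cpu"), offload=True)

with torch.no_grad():
    output = model(torch.randn(2, 8))

remove_hook_from_submodules(model)  # detaches the hooks and restores the original weights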

src/accelerate/launchers.py

@ -0,0 +1,177 @@
# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
import tempfile
import warnings
import torch
from .state import AcceleratorState
from .utils import PrecisionType, PrepareForLaunch, is_torch_version, patch_environment
def notebook_launcher(function, args=(), num_processes=None, use_fp16=False, mixed_precision="no", use_port="29500"):
"""
Launches a training function, using several processes if it's possible in the current environment (TPU with
multiple cores for instance).
Args:
function (`Callable`):
The training function to execute. If it accepts arguments, the first argument should be the index of the
process run.
args (`Tuple`):
Tuple of arguments to pass to the function (it will receive `*args`).
num_processes (`int`, *optional*):
The number of processes to use for training. Will default to 8 in Colab/Kaggle if a TPU is available, to
the number of GPUs available otherwise.
mixed_precision (`str`, *optional*, defaults to `"no"`):
If `fp16` or `bf16`, will use mixed precision training on multi-GPU.
use_port (`str`, *optional*, defaults to `"29500"`):
The port to use to communicate between processes when launching a multi-GPU training.
"""
# Are we in a google colab or a Kaggle Kernel?
if any(key.startswith("KAGGLE") for key in os.environ.keys()):
in_colab_or_kaggle = True
elif "IPython" in sys.modules:
in_colab_or_kaggle = "google.colab" in str(sys.modules["IPython"].get_ipython())
else:
in_colab_or_kaggle = False
if in_colab_or_kaggle:
if os.environ.get("TPU_NAME", None) is not None:
# TPU launch
import torch_xla.distributed.xla_multiprocessing as xmp
if len(AcceleratorState._shared_state) > 0:
raise ValueError(
"To train on TPU in Colab or Kaggle Kernel, the `Accelerator` should only be initialized inside "
"your training function. Restart your notebook and make sure no cells initializes an "
"`Accelerator`."
)
if num_processes is None:
num_processes = 8
launcher = PrepareForLaunch(function, distributed_type="TPU")
print(f"Launching a training on {num_processes} TPU cores.")
xmp.spawn(launcher, args=args, nprocs=num_processes, start_method="fork")
else:
# No need for a distributed launch otherwise as it's either CPU or one GPU.
if torch.cuda.is_available():
print("Launching training on one GPU.")
else:
print("Launching training on CPU.")
function(*args)
else:
if num_processes is None:
raise ValueError(
"You have to specify the number of GPUs you would like to use, add `num_processes=...` to your call."
)
if num_processes > 1:
# Multi-GPU launch
if is_torch_version("<", "1.5.0"):
raise ImportError(
"Using `notebook_launcher` for distributed training on GPUs require torch >= 1.5.0, got "
f"{torch.__version__}."
)
from torch.multiprocessing import start_processes
if len(AcceleratorState._shared_state) > 0:
raise ValueError(
"To launch a multi-GPU training from your notebook, the `Accelerator` should only be initialized "
"inside your training function. Restart your notebook and make sure no cells initializes an "
"`Accelerator`."
)
if torch.cuda.is_initialized():
raise ValueError(
"To launch a multi-GPU training from your notebook, you need to avoid running any instruction "
"using `torch.cuda` in any cell. Restart your notebook and make sure no cells use any CUDA "
"function."
)
try:
mixed_precision = PrecisionType(mixed_precision.lower())
except ValueError:
raise ValueError(
f"Unknown mixed_precision mode: {args.mixed_precision.lower()}. Choose between {PrecisionType.list()}."
)
if use_fp16:
warnings.warn('use_fp16=True is deprecated. Use mixed_precision="fp16" instead.', DeprecationWarning)
mixed_precision = "fp16"
# torch.distributed will expect a few environment variables to be here. We set the ones common to each
# process here (the other ones will be set by the launcher).
with patch_environment(
world_size=num_processes, master_addr="127.0.0.1", master_port=use_port, mixed_precision=mixed_precision
):
launcher = PrepareForLaunch(function, distributed_type="MULTI_GPU")
print(f"Launching training on {num_processes} GPUs.")
start_processes(launcher, args=args, nprocs=num_processes, start_method="fork")
else:
# No need for a distributed launch otherwise as it's either CPU or one GPU.
if torch.cuda.is_available():
print("Launching training on one GPU.")
else:
print("Launching training on CPU.")
function(*args)
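A possible way to call `notebook_launcher` from a notebook cell; the training function below is a placeholder, and `num_processes=2` assumes two GPUs are available:

```python
from accelerate import Accelerator, notebook_launcher


def training_function():
    accelerator = Accelerator()
    accelerator.print(f"Training on {accelerator.num_processes} process(es)")
    # ... build dataloaders/model/optimizer, call accelerator.prepare, run the loop ...


notebook_launcher(training_function, args=(), num_processes=2, mixed_precision="fp16")
```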
def debug_launcher(function, args=(), num_processes=2):
"""
Launches a training function using several processes on CPU for debugging purposes.
<Tip warning={true}>
This function is provided for internal testing and debugging, but it's not intended for real trainings. It will
only use the CPU.
</Tip>
Args:
function (`Callable`):
The training function to execute.
args (`Tuple`):
Tuple of arguments to pass to the function (it will receive `*args`).
num_processes (`int`, *optional*, defaults to 2):
The number of processes to use for training.
"""
if is_torch_version("<", "1.5.0"):
raise ImportError(
"Using `debug_launcher` for distributed training on GPUs require torch >= 1.5.0, got "
f"{torch.__version__}."
)
from torch.multiprocessing import start_processes
with tempfile.NamedTemporaryFile() as tmp_file:
# torch.distributed will expect a few environment variables to be here. We set the ones common to each
# process here (the other ones will be set by the launcher).
with patch_environment(
world_size=num_processes,
master_addr="127.0.01",
master_port="29500",
mixed_precision="no",
accelerate_debug_rdv_file=tmp_file.name,
use_cpu="yes",
):
launcher = PrepareForLaunch(function, debug=True)
start_processes(launcher, args=args, nprocs=num_processes, start_method="fork")
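And a short sketch of `debug_launcher`, which spins up CPU-only processes; the import path and the check function are assumptions made for the example:

```python
from accelerate import Accelerator
from accelerate.launchers import debug_launcher


def sanity_check():
    accelerator = Accelerator()
    accelerator.print(f"Running on {accelerator.num_processes} CPU process(es)")


debug_launcher(sanity_check, num_processes=2)
```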

src/accelerate/logging.py

@ -0,0 +1,63 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from .state import AcceleratorState
class MultiProcessAdapter(logging.LoggerAdapter):
"""
An adapter to assist with logging in multiprocess.
`log` takes in an additional `main_process_only` kwarg, which dictates whether it should be called on all processes
or only the main executed one. Default is `main_process_only=True`.
"""
@staticmethod
def _should_log(main_process_only):
"Check if log should be performed"
return not main_process_only or (main_process_only and AcceleratorState().local_process_index == 0)
def log(self, level, msg, *args, **kwargs):
"""
Delegates logger call after checking if we should log.
Accepts a new kwarg of `main_process_only`, which will dictate whether it will be logged across all processes
or only the main executed one. Default is `True` if not passed
"""
main_process_only = kwargs.pop("main_process_only", True)
if self.isEnabledFor(level) and self._should_log(main_process_only):
msg, kwargs = self.process(msg, kwargs)
self.logger.log(level, msg, *args, **kwargs)
def get_logger(name: str):
"""
Returns a `logging.Logger` for `name` that can handle multiprocessing.
If a log should be called on all processes, pass `main_process_only=False`
E.g.
```python
logger.info("My log", main_process_only=False)
logger.debug("My log", main_process_only=False)
```
Args:
name (`str`):
The name for the logger, such as `__file__`
"""
logger = logging.getLogger(name)
return MultiProcessAdapter(logger, {})
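A fuller usage sketch of the adapter; since it only consults `AcceleratorState`, an `Accelerator` should exist before the first log call, and the logging configuration below is illustrative:

```python
import logging

from accelerate import Accelerator
from accelerate.logging import get_logger

logging.basicConfig(level=logging.INFO)
logger = get_logger(__name__)

accelerator = Accelerator()  # initializes the state the adapter checks

logger.info("Logged once, from the main process only")
logger.info("Logged from every process", main_process_only=False)
```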


@ -0,0 +1,29 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all
import warnings
warnings.warn(
"memory_utils has been reorganized to utils.memory. Import `find_executable_batchsize` from the main `__init__`: "
"`from accelerate import find_executable_batch_size` to avoid this warning.",
FutureWarning,
)
from .utils.memory import find_executable_batch_size
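The non-deprecated import the warning points to, shown with the decorator-style usage the utility supports; the starting batch size and the `train` function are arbitrary placeholders:

```python
from accelerate import find_executable_batch_size


@find_executable_batch_size(starting_batch_size=128)
def train(batch_size):
    # Runs with batch_size=128 and is retried with a smaller value if CUDA runs out of memory.
    ...


train()
```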


@ -12,9 +12,13 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import inspect
import warnings
import torch
from .state import AcceleratorState, DistributedType, is_tpu_available
from .state import AcceleratorState
from .utils import DistributedType, honor_type, is_torch_version, is_tpu_available
if is_tpu_available():
@ -23,7 +27,7 @@ if is_tpu_available():
def move_to_device(state, device):
if isinstance(state, (list, tuple)):
return type(state)(move_to_device(t, device) for t in state)
return honor_type(state, (move_to_device(t, device) for t in state))
elif isinstance(state, dict):
return type(state)({k: move_to_device(v, device) for k, v in state.items()})
elif isinstance(state, torch.Tensor):
@ -36,56 +40,114 @@ class AcceleratedOptimizer(torch.optim.Optimizer):
Internal wrapper around a torch optimizer.
Args:
optimizer (:obj:`torch.optim.optimizer.Optimizer`):
optimizer (`torch.optim.optimizer.Optimizer`):
The optimizer to wrap.
device_placement (:obj:`bool`, `optional`, defaults to :obj:`True`):
device_placement (`bool`, *optional*, defaults to `True`):
Whether or not the optimizer should handle device placement. If so, it will place the state dictionary of
:obj:`optimizer` on the right device.
scaler (:obj:`torch.cuda.amp.grad_scaler.GradScaler`, `optional`):
`optimizer` on the right device.
scaler (`torch.cuda.amp.grad_scaler.GradScaler`, *optional*):
The scaler to use in the step function if training with mixed precision.
"""
def __init__(self, optimizer, device_placement=True, scaler=None):
self.optimizer = optimizer
self.scaler = scaler
self.state = AcceleratorState()
self.accelerator_state = AcceleratorState()
self.device_placement = device_placement
self._is_overflow = False
# Handle device placement
if device_placement:
state_dict = self.optimizer.state_dict()
if self.state.distributed_type == DistributedType.TPU:
xm.send_cpu_data_to_device(state_dict, self.state.device)
if self.accelerator_state.distributed_type == DistributedType.TPU:
xm.send_cpu_data_to_device(state_dict, self.accelerator_state.device)
else:
state_dict = move_to_device(state_dict, self.state.device)
state_dict = move_to_device(state_dict, self.accelerator_state.device)
self.optimizer.load_state_dict(state_dict)
@property
def state(self):
return self.optimizer.state
@state.setter
def state(self, state):
self.optimizer.state = state
@property
def param_groups(self):
return self.optimizer.param_groups
@param_groups.setter
def param_groups(self, param_groups):
self.optimizer.param_groups = param_groups
@property
def defaults(self):
return self.optimizer.defaults
@defaults.setter
def defaults(self, defaults):
self.optimizer.defaults = defaults
def add_param_group(self, param_group):
self.optimizer.add_param_group(param_group)
def load_state_dict(self, state_dict):
if self.state.distributed_type == DistributedType.TPU and self.device_placement:
xm.send_cpu_data_to_device(state_dict, self.state.device)
if self.accelerator_state.distributed_type == DistributedType.TPU and self.device_placement:
xm.send_cpu_data_to_device(state_dict, self.accelerator_state.device)
self.optimizer.load_state_dict(state_dict)
def state_dict(self):
return self.optimizer.state_dict()
def zero_grad(self):
self.optimizer.zero_grad()
def step(self):
if self.state.distributed_type == DistributedType.TPU:
xm.optimizer_step(self.optimizer)
elif self.scaler is not None:
self.scaler.step(self.optimizer)
self.scaler.update()
def zero_grad(self, set_to_none=None):
if is_torch_version("<", "1.7.0"):
if set_to_none is not None:
raise ValueError(
"`set_to_none` for Optimizer.zero_grad` was introduced in PyTorch 1.7.0 and can't be used for "
f"earlier versions (found version {torch.__version__})."
)
self.optimizer.zero_grad()
else:
self.optimizer.step()
accept_arg = "set_to_none" in inspect.signature(self.optimizer.zero_grad).parameters
if accept_arg:
if set_to_none is None:
set_to_none = False
self.optimizer.zero_grad(set_to_none=set_to_none)
else:
if set_to_none is not None:
raise ValueError("`set_to_none` for Optimizer.zero_grad` is not supported by this optimizer.")
self.optimizer.zero_grad()
def step(self, closure=None):
if self.accelerator_state.distributed_type == DistributedType.TPU:
optimizer_args = {"closure": closure} if closure is not None else {}
xm.optimizer_step(self.optimizer, optimizer_args=optimizer_args)
elif self.scaler is not None:
scale_before = self.scaler.get_scale()
self.scaler.step(self.optimizer, closure)
self.scaler.update()
scale_after = self.scaler.get_scale()
# If we reduced the loss scale, it means the optimizer step was skipped because of gradient overflow.
self._is_overflow = scale_after < scale_before
else:
self.optimizer.step(closure)
def _switch_parameters(self, parameters_map):
for param_group in self.optimizer.param_groups:
param_group["params"] = [parameters_map.get(p, p) for p in param_group["params"]]
@property
def is_overflow(self):
"""Whether or not the optimizer step was done, or skipped because of gradient overflow."""
warnings.warn(
"The `is_overflow` property is deprecated and will be removed in version 1.0 of Accelerate use "
"`optimizer.step_was_skipped` instead.",
FutureWarning,
)
return self._is_overflow
@property
def step_was_skipped(self):
"""Whether or not the optimizer step was skipped."""
return self._is_overflow
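For context, a hedged sketch of how the wrapped optimizer is typically obtained and how `step_was_skipped` might be consulted; the model and data are illustrative, and the skip flag is only meaningful when a gradient scaler is active (fp16 on GPU):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # pass mixed_precision="fp16" on a GPU to exercise the scaler path
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = accelerator.prepare(model, optimizer)

x = torch.randn(8, 4, device=accelerator.device)
y = torch.randn(8, 1, device=accelerator.device)

loss = torch.nn.functional.mse_loss(model(x), y)
accelerator.backward(loss)
optimizer.step()
optimizer.zero_grad(set_to_none=True)

if optimizer.step_was_skipped:
    print("Optimizer step skipped (loss scale was reduced)")
```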


@ -0,0 +1,82 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .state import AcceleratorState
class AcceleratedScheduler:
"""
A wrapper around a learning rate scheduler that will only step when the optimizer(s) have a training step. Useful
to avoid making a scheduler step too fast when:
- gradients went overflow and there was no training step (in mixed precision training)
- step was skipped because of gradient accumulation
Args:
scheduler (`torch.optim.lr_scheduler._LRScheduler`):
The scheduler to wrap.
optimizers (one or a list of `torch.optim.Optimizer`):
The optimizers used.
step_with_optimizer (`bool`, *optional*, defaults to `True`):
Whether or not the scheduler should be stepped at each optimizer step.
split_batches (`bool`, *optional*, defaults to `False`):
Whether or not the dataloaders split one batch across the different processes (so batch size is the same
regardless of the number of processes) or create batches on each process (so batch size is the original
batch size multiplied by the number of processes).
"""
def __init__(self, scheduler, optimizers, step_with_optimizer: bool = True, split_batches: bool = False):
self.scheduler = scheduler
self.optimizers = optimizers if isinstance(optimizers, (list, tuple)) else [optimizers]
self.split_batches = split_batches
self.step_with_optimizer = step_with_optimizer
def step(self, *args, **kwargs):
if not self.step_with_optimizer:
# No link between scheduler and optimizer -> just step
self.scheduler.step(*args, **kwargs)
return
# Otherwise, first make sure the optimizer was stepped.
for opt in self.optimizers:
if opt.step_was_skipped:
return
if self.split_batches:
# Split batches -> the training dataloader batch size is not changed so one step per training step
self.scheduler.step(*args, **kwargs)
else:
# Otherwise the training dataloader batch size was multiplied by `num_processes`, so we need to do
# num_processes steps per training step
num_processes = AcceleratorState().num_processes
for _ in range(num_processes):
# Special case when using OneCycle and `drop_last` was not used
if getattr(self.scheduler, "total_steps", 0) <= self.scheduler.last_epoch:
self.scheduler.step(*args, **kwargs)
# Passthroughs
def get_last_lr(self):
return self.scheduler.get_last_lr()
def state_dict(self):
return self.scheduler.state_dict()
def load_state_dict(self, state_dict):
self.scheduler.load_state_dict(state_dict)
def get_lr(self):
return self.scheduler.get_lr()
def print_lr(self, *args, **kwargs):
return self.scheduler.print_lr(*args, **kwargs)
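A minimal sketch of wrapping a scheduler by hand; the `StepLR` choice and its hyper-parameters are arbitrary, and the optimizer must be the prepared (wrapped) one so that `step_was_skipped` is available:

```python
import torch
from accelerate import Accelerator
from accelerate.scheduler import AcceleratedScheduler

accelerator = Accelerator()
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = accelerator.prepare(model, optimizer)

base_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)
scheduler = AcceleratedScheduler(base_scheduler, optimizer, step_with_optimizer=True, split_batches=False)

# In the training loop, after `optimizer.step()`:
scheduler.step()  # silently skipped if the optimizer step was dropped (fp16 overflow)
print(scheduler.get_last_lr())
```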


@ -14,21 +14,23 @@
import os
from distutils.util import strtobool
from enum import Enum
import torch
from .utils import DistributedType, is_ccl_available, is_deepspeed_available, is_tpu_available
try:
if is_tpu_available():
import torch_xla.core.xla_model as xm
_tpu_available = True
except ImportError:
_tpu_available = False
def is_tpu_available():
return _tpu_available
def get_int_from_env(env_keys, default):
"""Returns the first positive env value found in the `env_keys` list or the default."""
for e in env_keys:
val = int(os.environ.get(e, -1))
if val >= 0:
return val
return default
def parse_flag_from_env(key, default=False):
@ -36,45 +38,47 @@ def parse_flag_from_env(key, default=False):
return strtobool(value) == 1 # As its name indicates `strtobool` actually returns an int...
class DistributedType(str, Enum):
"""
Represents a type of distributed environment.
Values:
- **NO** -- Not a distributed environment, just a single process.
- **MULTI_GPU** -- Distributed on multiple GPUs.
- **TPU** -- Distributed on TPUs.
"""
# Subclassing str as well as Enum allows the `DistributedType` to be JSON-serializable out of the box.
NO = "NO"
MULTI_GPU = "MULTI_GPU"
TPU = "TPU"
def parse_choice_from_env(key, default="no"):
value = os.environ.get(key, str(default))
return value
# Inspired by Alex Martelli's 'Borg'.
class AcceleratorState:
"""
This is a variation of a `singleton class <https://en.wikipedia.org/wiki/Singleton_pattern>`__ in the sense that
all instance of :obj:`AcceleratorState` share the same state, which is initialized on the first instantiation.
This is a variation of a [singleton class](https://en.wikipedia.org/wiki/Singleton_pattern) in the sense that all
instances of `AcceleratorState` share the same state, which is initialized on the first instantiation.
Attributes
Attributes:
- **device** (:obj:`torch.device`) -- The device to use.
- **distributed_type** (:obj:`~accelerate.state.DistributedType`) -- The type of distributed environment
currently in use.
- **num_processes** (:obj:`int`) -- The number of processes currently launched in parallel.
- **process_index** (:obj:`int`) -- The index of the current process.
- **local_process_index** (:obj:`int`) -- The index of the current process on the current server.
- **use_fp16** (:obj:`bool`) -- Whether or not the current script will use mixed precision.
- **device** (`torch.device`) -- The device to use.
- **distributed_type** (`~accelerate.state.DistributedType`) -- The type of distributed environment currently
in use.
- **num_processes** (`int`) -- The number of processes currently launched in parallel.
- **process_index** (`int`) -- The index of the current process.
- **local_process_index** (`int`) -- The index of the current process on the current server.
- **mixed_precision** (`str`) -- Whether or not the current script will use mixed precision. If you are using
mixed precision, define if you want to use FP16 or BF16 (bfloat16) as the floating point.
"""
_shared_state = {}
def __init__(self, fp16: bool = None, cpu: bool = False, _from_accelerator: bool = False):
def __init__(
self,
mixed_precision: str = None,
cpu: bool = False,
deepspeed_plugin=None,
fsdp_plugin=None,
_from_accelerator: bool = False,
**kwargs,
):
self.__dict__ = self._shared_state
if parse_flag_from_env("USE_CPU"):
cpu = True
if not getattr(self, "initialized", False):
self.backend = None
self.deepspeed_plugin = None
mixed_precision = mixed_precision.lower() if mixed_precision else None
if not _from_accelerator:
raise ValueError(
"Please make sure to properly initialize your accelerator via `accelerator = Accelerator()` "
@ -86,30 +90,112 @@ class AcceleratorState:
self.process_index = xm.get_ordinal()
self.local_process_index = xm.get_local_ordinal()
self.device = xm.xla_device()
self.use_fp16 = False
elif int(os.environ.get("LOCAL_RANK", -1)) != -1 and not cpu:
self.distributed_type = DistributedType.MULTI_GPU
self.mixed_precision = "no"
elif os.environ.get("USE_DEEPSPEED", "false") == "true" and not cpu:
assert (
is_deepspeed_available()
), "DeepSpeed is not available => install it using `pip3 install deepspeed` or build it from source"
self.distributed_type = DistributedType.DEEPSPEED
if not torch.distributed.is_initialized():
torch.distributed.init_process_group(backend="nccl")
torch.distributed.init_process_group(backend="nccl", **kwargs)
self.backend = "nccl"
self.num_processes = torch.distributed.get_world_size()
self.process_index = torch.distributed.get_rank()
self.local_process_index = int(os.environ.get("LOCAL_RANK", -1))
self.device = torch.device("cuda", self.local_process_index)
self.use_fp16 = parse_flag_from_env("USE_FP16", False) if fp16 is None else fp16
torch.cuda.set_device(self.device)
self.mixed_precision = "no" # deepspeed handles mixed_precision using deepspeed_config
mixed_precision = (
parse_choice_from_env("MIXED_PRECISION", "no") if mixed_precision is None else mixed_precision
)
if mixed_precision == "fp16":
deepspeed_plugin.deepspeed_config.update({"fp16": {"enabled": True}})
elif mixed_precision == "bf16":
deepspeed_plugin.deepspeed_config.update({"bfloat16": {"enabled": True}})
self.deepspeed_plugin = deepspeed_plugin
elif int(os.environ.get("LOCAL_RANK", -1)) != -1 and not cpu:
self.distributed_type = DistributedType.MULTI_GPU
if not torch.distributed.is_initialized():
torch.distributed.init_process_group(backend="nccl", **kwargs)
self.backend = "nccl"
self.num_processes = torch.distributed.get_world_size()
self.process_index = torch.distributed.get_rank()
self.local_process_index = int(os.environ.get("LOCAL_RANK", -1))
self.device = torch.device("cuda", self.local_process_index)
torch.cuda.set_device(self.device)
self.mixed_precision = (
parse_choice_from_env("MIXED_PRECISION", "no") if mixed_precision is None else mixed_precision
)
if os.environ.get("USE_FSDP", "false") == "true":
self.distributed_type = DistributedType.FSDP
if self.mixed_precision != "no":
raise ValueError(
"Mixed precision is currently not supported for FSDP. Please set `mixed_precision` to `no`."
)
self.fsdp_plugin = fsdp_plugin
elif get_int_from_env(["PMI_SIZE", "OMPI_COMM_WORLD_SIZE", "MV2_COMM_WORLD_SIZE", "WORLD_SIZE"], 1) > 1:
self.distributed_type = DistributedType.MULTI_CPU
if is_ccl_available() and get_int_from_env(["CCL_WORKER_COUNT"], 0) > 0:
backend = "ccl"
elif torch.distributed.is_mpi_available():
backend = "mpi"
else:
backend = "gloo"
# Try to get launch configuration from environment variables set by MPI launcher - works for Intel MPI, OpenMPI and MVAPICH
rank = get_int_from_env(["RANK", "PMI_RANK", "OMPI_COMM_WORLD_RANK", "MV2_COMM_WORLD_RANK"], 0)
size = get_int_from_env(["WORLD_SIZE", "PMI_SIZE", "OMPI_COMM_WORLD_SIZE", "MV2_COMM_WORLD_SIZE"], 1)
local_rank = get_int_from_env(
["LOCAL_RANK", "MPI_LOCALRANKID", "OMPI_COMM_WORLD_LOCAL_RANK", "MV2_COMM_WORLD_LOCAL_RANK"], 0
)
local_size = get_int_from_env(
["MPI_LOCALNRANKS", "OMPI_COMM_WORLD_LOCAL_SIZE", "MV2_COMM_WORLD_LOCAL_SIZE"], 1
)
self.local_process_index = local_rank
os.environ["RANK"] = str(rank)
os.environ["WORLD_SIZE"] = str(size)
os.environ["LOCAL_RANK"] = str(local_rank)
if not os.environ.get("MASTER_PORT", None):
os.environ["MASTER_PORT"] = "29500"
if not os.environ.get("MASTER_ADDR", None):
if local_size != size and backend != "mpi":
raise ValueError(
"Looks like distributed multinode run but MASTER_ADDR env not set, "
"please try exporting rank 0's hostname as MASTER_ADDR"
)
if not torch.distributed.is_initialized():
torch.distributed.init_process_group(backend, rank=rank, world_size=size, **kwargs)
self.backend = backend
self.num_processes = torch.distributed.get_world_size()
self.process_index = torch.distributed.get_rank()
self.local_process_index = local_rank
self.device = torch.device("cpu")
self.mixed_precision = "no"
else:
self.distributed_type = DistributedType.NO
self.num_processes = 1
self.process_index = self.local_process_index = 0
self.device = torch.device("cuda" if torch.cuda.is_available() and not cpu else "cpu")
self.use_fp16 = parse_flag_from_env("USE_FP16", False) if fp16 is None else fp16
self.mixed_precision = (
parse_choice_from_env("MIXED_PRECISION", "no") if mixed_precision is None else mixed_precision
)
self.initialized = True
def __repr__(self):
return (
f"Distributed environment: {self.distributed_type}\n"
mixed_precision = self.mixed_precision
repr = (
f"Distributed environment: {self.distributed_type}{(' Backend: ' + self.backend) if self.backend else ''}\n"
f"Num processes: {self.num_processes}\n"
f"Process index: {self.process_index}\n"
f"Local process index: {self.local_process_index}\n"
f"Device: {self.device}\n"
f"Use FP16 precision: {self.use_fp16}\n"
f"Mixed precision type: {mixed_precision}\n"
)
if self.distributed_type == DistributedType.DEEPSPEED:
repr += f"ds_config: {self.deepspeed_plugin.deepspeed_config}\n"
return repr
# For backward compatibility
@property
def use_fp16(self):
return self.mixed_precision != "no"
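Reading the shared state, as a quick illustration; it is populated the first time an `Accelerator` is created, which is why one is constructed before touching it:

```python
from accelerate import Accelerator

accelerator = Accelerator()
state = accelerator.state  # the shared AcceleratorState instance

print(state.distributed_type, state.num_processes, state.process_index, state.local_process_index)
print(state.device, state.mixed_precision)
print(state.use_fp16)  # backward-compatible alias for `mixed_precision != "no"`
```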


@ -2,5 +2,5 @@
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
from .testing import are_the_same_tensors, execute_subprocess_async, require_multi_gpu, require_tpu
from .testing import are_the_same_tensors, execute_subprocess_async, require_cuda, require_multi_gpu, require_tpu, slow
from .training import RegressionDataset, RegressionModel


@ -0,0 +1,139 @@
#!/usr/bin/env python
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
A collection of utilities for comparing `examples/complete_*_example.py` scripts with the capabilities inside of each
`examples/by_feature` example. `compare_against_test` is the main function that should be used when testing, while the
others are used to either get the code that matters, or to preprocess them (such as stripping comments)
"""
import os
from typing import List
def get_function_contents_by_name(lines: List[str], name: str):
"""
Extracts a function from `lines` of segmented source code with the name `name`.
Args:
lines (`List[str]`):
Source code of a script separated by line.
name (`str`):
The name of the function to extract. Should be either `training_function` or `main`
"""
if name != "training_function" and name != "main":
raise ValueError(f"Incorrect function name passed: {name}, choose either 'main' or 'training_function'")
good_lines, found_start = [], False
for line in lines:
if not found_start and f"def {name}" in line:
found_start = True
good_lines.append(line)
continue
if found_start:
if name == "training_function" and "def main" in line:
return good_lines
if name == "main" and "if __name__" in line:
return good_lines
good_lines.append(line)
def clean_lines(lines: List[str]):
"""
Filters `lines` and removes any entries that start with a comment ('#') or are just a newline ('\n')
Args:
lines (`List[str]`):
Source code of a script separated by line.
"""
return [line for line in lines if not line.lstrip().startswith("#") and line != "\n"]
def compare_against_test(base_filename: str, feature_filename: str, parser_only: bool, secondary_filename: str = None):
"""
Tests whether the additional code inside of `feature_filename` was implemented in `base_filename`. This should be
used when testing to see if `complete_*_.py` examples have all of the implementations from each of the
`examples/by_feature/*` scripts.
It utilizes `nlp_example.py` to extract out all of the repeated training code, so that only the new additional code
is examined and checked. If something *other* than `nlp_example.py` should be used, such as `cv_example.py` for the
`complete_cv_example.py` script, it should be passed in for the `secondary_filename` parameter.
Args:
base_filename (`str` or `os.PathLike`):
The filepath of a single "complete" example script to test, such as `examples/complete_cv_example.py`
feature_filename (`str` or `os.PathLike`):
The filepath of a single feature example script. The contents of this script are checked to see if they
exist in `base_filename`
parser_only (`bool`):
Whether to compare only the `main()` sections in both files, or to compare the contents of
`training_function()`
secondary_filename (`str`, *optional*):
A potential secondary filepath that should be included in the check. This function extracts the base
functionalities off of "examples/nlp_example.py", so if `base_filename` is a script other than
`complete_nlp_example.py`, the template script should be included here. Such as `examples/cv_example.py`
"""
with open(base_filename, "r") as f:
base_file_contents = f.readlines()
with open(os.path.abspath(os.path.join("examples", "nlp_example.py")), "r") as f:
full_file_contents = f.readlines()
with open(feature_filename, "r") as f:
feature_file_contents = f.readlines()
if secondary_filename is not None:
with open(secondary_filename, "r") as f:
secondary_file_contents = f.readlines()
# This is our base, we remove all the code from here in our `full_filename` and `feature_filename` to find the new content
if parser_only:
base_file_func = clean_lines(get_function_contents_by_name(base_file_contents, "main"))
full_file_func = clean_lines(get_function_contents_by_name(full_file_contents, "main"))
feature_file_func = clean_lines(get_function_contents_by_name(feature_file_contents, "main"))
if secondary_filename is not None:
secondary_file_func = clean_lines(get_function_contents_by_name(secondary_file_contents, "main"))
else:
base_file_func = clean_lines(get_function_contents_by_name(base_file_contents, "training_function"))
full_file_func = clean_lines(get_function_contents_by_name(full_file_contents, "training_function"))
feature_file_func = clean_lines(get_function_contents_by_name(feature_file_contents, "training_function"))
if secondary_filename is not None:
secondary_file_func = clean_lines(
get_function_contents_by_name(secondary_file_contents, "training_function")
)
_dl_line = "train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)\n"
# Specific code in our script that differs from the full version, aka what is new
new_feature_code = []
passed_idxs = [] # We keep track of the idxs just in case it's a repeated statement
for i, line in enumerate(feature_file_func):
if i not in passed_idxs:
if (line not in full_file_func) and (line.lstrip() != _dl_line):
new_feature_code.append(line)
passed_idxs.append(i)
# Extract out just the new parts from the full_file_training_func
new_full_example_parts = []
passed_idxs = [] # We keep track of the idxs just in case it's a repeated statement
for i, line in enumerate(base_file_func):
if i not in passed_idxs:
if (line not in full_file_func) and (line.lstrip() != _dl_line):
new_full_example_parts.append(line)
passed_idxs.append(i)
# Finally, get the overall diff
diff_from_example = [line for line in new_feature_code if line not in new_full_example_parts]
if secondary_filename is not None:
diff_from_two = [line for line in full_file_contents if line not in secondary_file_func]
diff_from_example = [line for line in diff_from_example if line not in diff_from_two]
return diff_from_example
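A hedged sketch of driving the comparison from a test; the module path and the example file names below are assumptions based on the repository layout rather than something stated in the diff:

```python
import os

# Assumed module path for the helpers defined above.
from accelerate.test_utils.examples import compare_against_test

# Illustrative file names: a "complete" example and one `by_feature` script.
diff = compare_against_test(
    base_filename=os.path.join("examples", "complete_nlp_example.py"),
    feature_filename=os.path.join("examples", "by_feature", "checkpointing.py"),
    parser_only=False,
)

# An empty diff means every addition from the feature script exists in the complete example.
assert len(diff) == 0, "".join(diff)
```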


@ -19,9 +19,9 @@ from torch.utils.data import DataLoader
from accelerate import Accelerator
from accelerate.data_loader import prepare_data_loader
from accelerate.state import AcceleratorState, DistributedType
from accelerate.state import AcceleratorState
from accelerate.test_utils import RegressionDataset, RegressionModel, are_the_same_tensors
from accelerate.utils import gather, set_seed, synchronize_rng_states
from accelerate.utils import DistributedType, gather, is_torch_version, set_seed, synchronize_rng_states
def init_state_check():
@ -34,10 +34,16 @@ def init_state_check():
def rng_sync_check():
state = AcceleratorState()
synchronize_rng_states()
assert are_the_same_tensors(torch.get_rng_state())
synchronize_rng_states(["torch"])
assert are_the_same_tensors(torch.get_rng_state()), "RNG states improperly synchronized on CPU."
if state.distributed_type == DistributedType.MULTI_GPU:
assert are_the_same_tensors(torch.cuda.get_rng_state())
synchronize_rng_states(["cuda"])
assert are_the_same_tensors(torch.cuda.get_rng_state()), "RNG states improperly synchronized on GPU."
if is_torch_version(">=", "1.6.0"):
generator = torch.Generator()
synchronize_rng_states(["generator"], generator=generator)
assert are_the_same_tensors(generator.get_state()), "RNG states improperly synchronized in generator."
if state.local_process_index == 0:
print("All rng are properly synched.")
@ -52,7 +58,9 @@ def dl_preparation_check():
for batch in dl:
result.append(gather(batch))
result = torch.cat(result)
assert torch.equal(result.cpu(), torch.arange(0, length).long())
print(state.process_index, result, type(dl))
assert torch.equal(result.cpu(), torch.arange(0, length).long()), "Wrong non-shuffled dataloader result."
dl = DataLoader(range(length), batch_size=8)
dl = prepare_data_loader(
@ -67,7 +75,7 @@ def dl_preparation_check():
for batch in dl:
result.append(gather(batch))
result = torch.cat(result)
assert torch.equal(result.cpu(), torch.arange(0, length).long())
assert torch.equal(result.cpu(), torch.arange(0, length).long()), "Wrong non-shuffled dataloader result."
if state.process_index == 0:
print("Non-shuffled dataloader passing.")
@ -79,7 +87,7 @@ def dl_preparation_check():
result.append(gather(batch))
result = torch.cat(result).tolist()
result.sort()
assert result == list(range(length))
assert result == list(range(length)), "Wrong shuffled dataloader result."
dl = DataLoader(range(length), batch_size=8, shuffle=True)
dl = prepare_data_loader(
@ -95,19 +103,85 @@ def dl_preparation_check():
result.append(gather(batch))
result = torch.cat(result).tolist()
result.sort()
assert result == list(range(length))
assert result == list(range(length)), "Wrong shuffled dataloader result."
if state.local_process_index == 0:
print("Shuffled dataloader passing.")
def mock_training(length, batch_size):
def central_dl_preparation_check():
state = AcceleratorState()
length = 32 * state.num_processes
dl = DataLoader(range(length), batch_size=8)
dl = prepare_data_loader(
dl, state.device, state.num_processes, state.process_index, put_on_device=True, dispatch_batches=True
)
result = []
for batch in dl:
result.append(gather(batch))
result = torch.cat(result)
assert torch.equal(result.cpu(), torch.arange(0, length).long()), "Wrong non-shuffled dataloader result."
dl = DataLoader(range(length), batch_size=8)
dl = prepare_data_loader(
dl,
state.device,
state.num_processes,
state.process_index,
put_on_device=True,
split_batches=True,
dispatch_batches=True,
)
result = []
for batch in dl:
result.append(gather(batch))
result = torch.cat(result)
assert torch.equal(result.cpu(), torch.arange(0, length).long()), "Wrong non-shuffled dataloader result."
if state.process_index == 0:
print("Non-shuffled central dataloader passing.")
dl = DataLoader(range(length), batch_size=8, shuffle=True)
dl = prepare_data_loader(
dl, state.device, state.num_processes, state.process_index, put_on_device=True, dispatch_batches=True
)
result = []
for batch in dl:
result.append(gather(batch))
result = torch.cat(result).tolist()
result.sort()
assert result == list(range(length)), "Wrong shuffled dataloader result."
dl = DataLoader(range(length), batch_size=8, shuffle=True)
dl = prepare_data_loader(
dl,
state.device,
state.num_processes,
state.process_index,
put_on_device=True,
split_batches=True,
dispatch_batches=True,
)
result = []
for batch in dl:
result.append(gather(batch))
result = torch.cat(result).tolist()
result.sort()
assert result == list(range(length)), "Wrong shuffled dataloader result."
if state.local_process_index == 0:
print("Shuffled central dataloader passing.")
def mock_training(length, batch_size, generator):
set_seed(42)
generator.manual_seed(42)
train_set = RegressionDataset(length=length)
train_dl = DataLoader(train_set, batch_size=batch_size, shuffle=True)
train_dl = DataLoader(train_set, batch_size=batch_size, shuffle=True, generator=generator)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(3):
for epoch in range(3):
for batch in train_dl:
model.zero_grad()
output = model(batch["x"])
@ -119,21 +193,23 @@ def mock_training(length, batch_size):
def training_check():
state = AcceleratorState()
generator = torch.Generator()
batch_size = 8
length = batch_size * 4 * state.num_processes
train_set, old_model = mock_training(length, batch_size * state.num_processes)
assert are_the_same_tensors(old_model.a)
assert are_the_same_tensors(old_model.b)
train_set, old_model = mock_training(length, batch_size * state.num_processes, generator)
assert are_the_same_tensors(old_model.a), "Did not obtain the same model on both processes."
assert are_the_same_tensors(old_model.b), "Did not obtain the same model on both processes."
accelerator = Accelerator()
train_dl = DataLoader(train_set, batch_size=batch_size, shuffle=True)
train_dl = DataLoader(train_set, batch_size=batch_size, shuffle=True, generator=generator)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
train_dl, model, optimizer = accelerator.prepare(train_dl, model, optimizer)
set_seed(42)
for _ in range(3):
generator.manual_seed(42)
for epoch in range(3):
for batch in train_dl:
model.zero_grad()
output = model(batch["x"])
@ -142,18 +218,19 @@ def training_check():
optimizer.step()
model = accelerator.unwrap_model(model).cpu()
assert torch.allclose(old_model.a, model.a)
assert torch.allclose(old_model.b, model.b)
assert torch.allclose(old_model.a, model.a), "Did not obtain the same model on CPU or distributed training."
assert torch.allclose(old_model.b, model.b), "Did not obtain the same model on CPU or distributed training."
accelerator.print("Training yielded the same results on one CPU or distributes setup with no batch split.")
accelerator.print("Training yielded the same results on one CPU or distributed setup with no batch split.")
accelerator = Accelerator(split_batches=True)
train_dl = DataLoader(train_set, batch_size=batch_size * state.num_processes, shuffle=True)
train_dl = DataLoader(train_set, batch_size=batch_size * state.num_processes, shuffle=True, generator=generator)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
train_dl, model, optimizer = accelerator.prepare(train_dl, model, optimizer)
set_seed(42)
generator.manual_seed(42)
for _ in range(3):
for batch in train_dl:
model.zero_grad()
@ -163,19 +240,21 @@ def training_check():
optimizer.step()
model = accelerator.unwrap_model(model).cpu()
assert torch.allclose(old_model.a, model.a)
assert torch.allclose(old_model.b, model.b)
assert torch.allclose(old_model.a, model.a), "Did not obtain the same model on CPU or distributed training."
assert torch.allclose(old_model.b, model.b), "Did not obtain the same model on CPU or distributed training."
accelerator.print("Training yielded the same results on one CPU or distributes setup with batch split.")
# Mostly a test that FP16 doesn't crash as the operation inside the model is not converted to FP16
accelerator = Accelerator(fp16=True)
train_dl = DataLoader(train_set, batch_size=batch_size, shuffle=True)
print("FP16 training check.")
accelerator = Accelerator(mixed_precision="fp16")
train_dl = DataLoader(train_set, batch_size=batch_size, shuffle=True, generator=generator)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
train_dl, model, optimizer = accelerator.prepare(train_dl, model, optimizer)
set_seed(42)
generator.manual_seed(42)
for _ in range(3):
for batch in train_dl:
model.zero_grad()
@ -185,8 +264,53 @@ def training_check():
optimizer.step()
model = accelerator.unwrap_model(model).cpu()
assert torch.allclose(old_model.a, model.a)
assert torch.allclose(old_model.b, model.b)
assert torch.allclose(old_model.a, model.a), "Did not obtain the same model on CPU or distributed training."
assert torch.allclose(old_model.b, model.b), "Did not obtain the same model on CPU or distributed training."
# TEST that previous fp16 flag still works
print("Legacy FP16 training check.")
accelerator = Accelerator(fp16=True)
train_dl = DataLoader(train_set, batch_size=batch_size, shuffle=True, generator=generator)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
train_dl, model, optimizer = accelerator.prepare(train_dl, model, optimizer)
set_seed(42)
generator.manual_seed(42)
for _ in range(3):
for batch in train_dl:
model.zero_grad()
output = model(batch["x"])
loss = torch.nn.functional.mse_loss(output, batch["y"])
accelerator.backward(loss)
optimizer.step()
model = accelerator.unwrap_model(model).cpu()
assert torch.allclose(old_model.a, model.a), "Did not obtain the same model on CPU or distributed training."
assert torch.allclose(old_model.b, model.b), "Did not obtain the same model on CPU or distributed training."
# Mostly a test that BF16 doesn't crash as the operation inside the model is not converted to BF16
print("BF16 training check.")
accelerator = Accelerator(mixed_precision="bf16")
train_dl = DataLoader(train_set, batch_size=batch_size, shuffle=True, generator=generator)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
train_dl, model, optimizer = accelerator.prepare(train_dl, model, optimizer)
set_seed(42)
generator.manual_seed(42)
for _ in range(3):
for batch in train_dl:
model.zero_grad()
output = model(batch["x"])
loss = torch.nn.functional.mse_loss(output, batch["y"])
accelerator.backward(loss)
optimizer.step()
model = accelerator.unwrap_model(model).cpu()
assert torch.allclose(old_model.a, model.a), "Did not obtain the same model on CPU or distributed training."
assert torch.allclose(old_model.b, model.b), "Did not obtain the same model on CPU or distributed training."
def main():
accelerator = Accelerator()
@ -202,10 +326,16 @@ def main():
if state.local_process_index == 0:
print("\n**DataLoader integration test**")
dl_preparation_check()
central_dl_preparation_check()
# Trainings are not exactly the same in DeepSpeed and CPU mode
if state.distributed_type == DistributedType.DEEPSPEED:
return
if state.local_process_index == 0:
print("\n**Training integration test**")
training_check()
if __name__ == "__main__":
main()

View File

@ -13,13 +13,157 @@
# limitations under the License.
import asyncio
import os
import shutil
import sys
import tempfile
import unittest
from distutils.util import strtobool
from pathlib import Path
from typing import List, Union
from unittest import mock
import torch
from ..state import AcceleratorState, is_tpu_available
from ..utils import gather
from ..state import AcceleratorState
from ..utils import gather, is_comet_ml_available, is_tensorflow_available, is_tpu_available, is_wandb_available
def parse_flag_from_env(key, default=False):
try:
value = os.environ[key]
except KeyError:
# KEY isn't set, default to `default`.
_value = default
else:
# KEY is set, convert it to True or False.
try:
_value = strtobool(value)
except ValueError:
# More values are supported, but let's keep the message simple.
raise ValueError(f"If set, {key} must be yes or no.")
return _value
_run_slow_tests = parse_flag_from_env("RUN_SLOW", default=False)
def slow(test_case):
"""
Decorator marking a test as slow. Slow tests are skipped by default. Set the RUN_SLOW environment variable to a
truthy value to run them.
"""
return unittest.skipUnless(_run_slow_tests, "test is slow")(test_case)
def require_cuda(test_case):
"""
Decorator marking a test that requires CUDA. These tests are skipped when there are no GPUs available.
"""
return unittest.skipUnless(torch.cuda.is_available(), "test requires a GPU")(test_case)
def require_tpu(test_case):
"""
Decorator marking a test that requires TPUs. These tests are skipped when there are no TPUs available.
"""
return unittest.skipUnless(is_tpu_available(), "test requires TPU")(test_case)
def require_multi_gpu(test_case):
"""
Decorator marking a test that requires a multi-GPU setup. These tests are skipped on a machine without multiple
GPUs.
"""
return unittest.skipUnless(torch.cuda.device_count() > 1, "test requires multiple GPUs")(test_case)
def require_tensorflow(test_case):
"""
Decorator marking a test that requires TensorFlow installed. These tests are skipped when TensorFlow isn't
installed
"""
return unittest.skipUnless(is_tensorflow_available(), "test requires TensorFlow")(test_case)
def require_wandb(test_case):
"""
Decorator marking a test that requires wandb installed. These tests are skipped when wandb isn't installed
"""
return unittest.skipUnless(is_wandb_available(), "test requires wandb")(test_case)
def require_comet_ml(test_case):
"""
Decorator marking a test that requires comet_ml installed. These tests are skipped when comet_ml isn't installed
"""
return unittest.skipUnless(is_comet_ml_available(), "test requires comet_ml")(test_case)
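A rough sketch of how these decorators compose in a test module. The test class below is hypothetical, and the decorators are assumed to be importable from `accelerate.test_utils.testing`, where they are defined:

```python
import unittest

import torch

from accelerate.test_utils.testing import require_cuda, require_multi_gpu, slow  # assumed import path


class ExampleDeviceTests(unittest.TestCase):
    @require_cuda
    def test_tensor_on_gpu(self):
        # Skipped automatically on machines without a GPU.
        t = torch.ones(2, device="cuda")
        self.assertEqual(t.device.type, "cuda")

    @slow
    @require_multi_gpu
    def test_multi_gpu_visible(self):
        # Only runs when RUN_SLOW is truthy and at least two GPUs are present.
        self.assertGreater(torch.cuda.device_count(), 1)
```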
class TempDirTestCase(unittest.TestCase):
"""
A TestCase class that keeps a single `tempfile.TemporaryDirectory` open for the duration of the class, wipes its
data at the start of a test, and then destroys it at the end of the TestCase.
Useful for when a class or API requires a single constant folder throughout its use, such as Weights and Biases.
The temporary directory location will be stored in `self.tmpdir`.
"""
clear_on_setup = True
@classmethod
def setUpClass(cls):
"Creates a `tempfile.TemporaryDirectory` and stores it in `cls.tmpdir`"
cls.tmpdir = tempfile.mkdtemp()
@classmethod
def tearDownClass(cls):
"Remove `cls.tmpdir` after test suite has finished"
if os.path.exists(cls.tmpdir):
shutil.rmtree(cls.tmpdir)
def setUp(self):
"Destroy all contents in `self.tmpdir`, but not `self.tmpdir`"
if self.clear_on_setup:
for path in Path(self.tmpdir).glob("**/*"):
if path.is_file():
path.unlink()
elif path.is_dir():
shutil.rmtree(path)
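A minimal illustration of how `TempDirTestCase` might be subclassed; the test case below is hypothetical:

```python
import os

from accelerate.test_utils.testing import TempDirTestCase  # assumed import path


class ExampleTempDirTests(TempDirTestCase):
    def test_tmpdir_starts_empty(self):
        # setUp() wipes the contents of self.tmpdir before each test,
        # but the folder itself lives for the whole TestCase.
        self.assertEqual(os.listdir(self.tmpdir), [])

    def test_can_write_into_tmpdir(self):
        path = os.path.join(self.tmpdir, "log.txt")
        with open(path, "w") as f:
            f.write("hello")
        self.assertTrue(os.path.exists(path))
```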
class MockingTestCase(unittest.TestCase):
"""
A TestCase class designed to dynamically add various mockers that should be used in every test, mimicking the
behavior of a class-wide mock when defining one normally will not do.
Useful when a mock requires specific information that is only available after `TestCase.setUpClass`, such as
setting an environment variable with that information.
The `add_mocks` function should be run at the end of a `TestCase`'s `setUp` function, after a call to
`super().setUp()` such as:
```python
def setUp(self):
super().setUp()
mocks = mock.patch.dict(os.environ, {"SOME_ENV_VAR": "SOME_VALUE"})
self.add_mocks(mocks)
```
"""
def add_mocks(self, mocks: Union[mock.Mock, List[mock.Mock]]):
"""
Add custom mocks for tests that should be repeated on each test. Should be called during
`MockingTestCase.setUp`, after `super().setUp()`.
Args:
mocks (`mock.Mock` or list of `mock.Mock`):
Mocks that should be added to the `TestCase` after `TestCase.setUpClass` has been run
"""
self.mocks = mocks if isinstance(mocks, (tuple, list)) else [mocks]
for m in self.mocks:
m.start()
self.addCleanup(m.stop)
def are_the_same_tensors(tensor):
@ -33,28 +177,6 @@ def are_the_same_tensors(tensor):
return True
def require_tpu(test_case):
"""
Decorator marking a test that requires TPUs. These tests are skipped when there are no TPUs available.
"""
if not is_tpu_available():
return unittest.skip("test requires TPU")(test_case)
else:
return test_case
def require_multi_gpu(test_case):
"""
Decorator marking a test that requires a multi-GPU setup. These tests are skipped on a machine without multiple
GPUs.
"""
if torch.cuda.device_count() < 2:
return unittest.skip("test requires multiple GPUs")(test_case)
else:
return test_case
class _RunOutput:
def __init__(self, returncode, stdout, stderr):
self.returncode = returncode

View File

@ -14,6 +14,9 @@
import numpy as np
import torch
from torch.utils.data import DataLoader
from accelerate.utils.dataclasses import DistributedType
class RegressionDataset:
@ -36,6 +39,50 @@ class RegressionModel(torch.nn.Module):
super().__init__()
self.a = torch.nn.Parameter(torch.tensor(a).float())
self.b = torch.nn.Parameter(torch.tensor(b).float())
self.first_batch = True
def forward(self, x=None):
if self.first_batch:
print(f"Model dtype: {self.a.dtype}, {self.b.dtype}. Input dtype: {x.dtype}")
self.first_batch = False
return x * self.a + self.b
def mocked_dataloaders(accelerator, batch_size: int = 16):
from datasets import load_dataset
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
data_files = {"train": "tests/test_samples/MRPC/train.csv", "validation": "tests/test_samples/MRPC/dev.csv"}
datasets = load_dataset("csv", data_files=data_files)
label_list = datasets["train"].unique("label")
label_to_id = {v: i for i, v in enumerate(label_list)}
def tokenize_function(examples):
# max_length=None => use the model max length (it's actually the default)
outputs = tokenizer(
examples["sentence1"], examples["sentence2"], truncation=True, max_length=None, padding="max_length"
)
if "label" in examples:
outputs["labels"] = [label_to_id[l] for l in examples["label"]]
return outputs
# Apply the method we just defined to all the examples in all the splits of the dataset
tokenized_datasets = datasets.map(
tokenize_function,
batched=True,
remove_columns=["sentence1", "sentence2", "label"],
)
def collate_fn(examples):
# On TPU it's best to pad everything to the same length or training will be very slow.
if accelerator.distributed_type == DistributedType.TPU:
return tokenizer.pad(examples, padding="max_length", max_length=128, return_tensors="pt")
return tokenizer.pad(examples, padding="longest", return_tensors="pt")
# Instantiate dataloaders.
train_dataloader = DataLoader(tokenized_datasets["train"], shuffle=True, collate_fn=collate_fn, batch_size=2)
eval_dataloader = DataLoader(tokenized_datasets["validation"], shuffle=False, collate_fn=collate_fn, batch_size=1)
return train_dataloader, eval_dataloader

332
src/accelerate/tracking.py Normal file
View File

@ -0,0 +1,332 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Expectation:
# Provide a project dir name, then each type of logger gets stored in project/{`logging_dir`}
import os
from abc import ABCMeta, abstractmethod, abstractproperty
from typing import List, Optional, Union
from .logging import get_logger
from .utils import LoggerType, is_comet_ml_available, is_tensorboard_available, is_wandb_available
_available_trackers = []
if is_tensorboard_available():
from torch.utils import tensorboard
_available_trackers.append(LoggerType.TENSORBOARD)
if is_wandb_available():
import wandb
_available_trackers.append(LoggerType.WANDB)
if is_comet_ml_available():
from comet_ml import Experiment
_available_trackers.append(LoggerType.COMETML)
logger = get_logger(__name__)
def get_available_trackers():
"Returns a list of all supported available trackers in the system"
return _available_trackers
class GeneralTracker(object, metaclass=ABCMeta):
"""
A base Tracker class to be used for all logging integration implementations.
"""
@abstractproperty
def requires_logging_directory(self):
"""
Whether the logger requires a directory to store its logs. Should either return `True` or `False`.
"""
pass
@abstractmethod
def store_init_configuration(self, values: dict):
"""
Logs `values` as hyperparameters for the run. Implementations should use the experiment configuration
functionality of a tracking API.
Args:
values (Dictionary `str` to `bool`, `str`, `float` or `int`):
Values to be stored as initial hyperparameters as key-value pairs. The values need to have type `bool`,
`str`, `float`, `int`, or `None`.
"""
pass
@abstractmethod
def log(self, values: dict, step: Optional[int]):
"""
Logs `values` to the current run. Base `log` implementations of a tracking API should go in here, along with
special behavior for the `step` parameter.
Args:
values (Dictionary `str` to `str`, `float`, or `int`):
Values to be logged as key-value pairs. The values need to have type `str`, `float`, or `int`.
step (`int`, *optional*):
The run step. If included, the log will be affiliated with this step.
"""
pass
def finish(self):
"""
Should run any finalizing functions within the tracking API. If the API does not have one, just don't
override this method.
"""
pass
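A minimal sketch of a custom tracker built on this base class; `StdoutTracker` is hypothetical and only prints, but it shows the surface an implementation needs to provide:

```python
from typing import Optional


class StdoutTracker(GeneralTracker):
    # No files are written, so no logging directory is required.
    requires_logging_directory = False

    def __init__(self, run_name: str):
        self.run_name = run_name

    def store_init_configuration(self, values: dict):
        print(f"[{self.run_name}] config: {values}")

    def log(self, values: dict, step: Optional[int] = None):
        print(f"[{self.run_name}] step={step}: {values}")
```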
class TensorBoardTracker(GeneralTracker):
"""
A `Tracker` class that supports `tensorboard`. Should be initialized at the start of your script.
Args:
run_name (`str`):
The name of the experiment run
logging_dir (`str`, `os.PathLike`):
Location for TensorBoard logs to be stored.
"""
requires_logging_directory = True
def __init__(self, run_name: str, logging_dir: Optional[Union[str, os.PathLike]]):
self.run_name = run_name
self.logging_dir = os.path.join(logging_dir, run_name)
self.writer = tensorboard.SummaryWriter(self.logging_dir)
logger.info(f"Initialized TensorBoard project {self.run_name} logging to {self.logging_dir}")
logger.info(
"Make sure to log any initial configurations with `self.store_init_configuration` before training!"
)
def store_init_configuration(self, values: dict):
"""
Logs `values` as hyperparameters for the run. Should be run at the beginning of your experiment.
Args:
values (Dictionary `str` to `bool`, `str`, `float` or `int`):
Values to be stored as initial hyperparameters as key-value pairs. The values need to have type `bool`,
`str`, `float`, `int`, or `None`.
"""
self.writer.add_hparams(values, metric_dict={})
self.writer.flush()
logger.info("Stored initial configuration hyperparameters to TensorBoard")
def log(self, values: dict, step: Optional[int] = None):
"""
Logs `values` to the current run.
Args:
values (Dictionary `str` to `str`, `float`, `int` or `dict` of `str` to `float`/`int`):
Values to be logged as key-value pairs. The values need to have type `str`, `float`, `int` or `dict` of
`str` to `float`/`int`.
step (`int`, *optional*):
The run step. If included, the log will be affiliated with this step.
"""
for k, v in values.items():
if isinstance(v, (int, float)):
self.writer.add_scalar(k, v, global_step=step)
elif isinstance(v, str):
self.writer.add_text(k, v, global_step=step)
elif isinstance(v, dict):
self.writer.add_scalars(k, v, global_step=step)
self.writer.flush()
logger.info("Successfully logged to TensorBoard")
def finish(self):
"""
Closes `TensorBoard` writer
"""
self.writer.close()
logger.info("TensorBoard writer closed")
class WandBTracker(GeneralTracker):
"""
A `Tracker` class that supports `wandb`. Should be initialized at the start of your script.
Args:
run_name (`str`):
The name of the experiment run.
"""
requires_logging_directory = False
def __init__(self, run_name: str):
self.run_name = run_name
self.run = wandb.init(project=self.run_name)
logger.info(f"Initialized WandB project {self.run_name}")
logger.info(
"Make sure to log any initial configurations with `self.store_init_configuration` before training!"
)
def store_init_configuration(self, values: dict):
"""
Logs `values` as hyperparameters for the run. Should be run at the beginning of your experiment.
Args:
values (Dictionary `str` to `bool`, `str`, `float` or `int`):
Values to be stored as initial hyperparameters as key-value pairs. The values need to have type `bool`,
`str`, `float`, `int`, or `None`.
"""
wandb.config.update(values)
logger.info("Stored initial configuration hyperparameters to WandB")
def log(self, values: dict, step: Optional[int] = None):
"""
Logs `values` to the current run.
Args:
values (Dictionary `str` to `str`, `float`, `int` or `dict` of `str` to `float`/`int`):
Values to be logged as key-value pairs. The values need to have type `str`, `float`, `int` or `dict` of
`str` to `float`/`int`.
step (`int`, *optional*):
The run step. If included, the log will be affiliated with this step.
"""
self.run.log(values, step=step)
logger.info("Successfully logged to WandB")
def finish(self):
"""
Closes `wandb` writer
"""
self.run.finish()
logger.info("WandB run closed")
class CometMLTracker(GeneralTracker):
"""
A `Tracker` class that supports `comet_ml`. Should be initialized at the start of your script.
API keys must be stored in a Comet config file.
Args:
run_name (`str`):
The name of the experiment run.
"""
requires_logging_directory = False
def __init__(self, run_name: str):
self.run_name = run_name
self.writer = Experiment(project_name=run_name)
logger.info(f"Initialized CometML project {self.run_name}")
logger.info(
"Make sure to log any initial configurations with `self.store_init_configuration` before training!"
)
def store_init_configuration(self, values: dict):
"""
Logs `values` as hyperparameters for the run. Should be run at the beginning of your experiment.
Args:
values (Dictionary `str` to `bool`, `str`, `float` or `int`):
Values to be stored as initial hyperparameters as key-value pairs. The values need to have type `bool`,
`str`, `float`, `int`, or `None`.
"""
self.writer.log_parameters(values)
logger.info("Stored initial configuration hyperparameters to CometML")
def log(self, values: dict, step: Optional[int] = None):
"""
Logs `values` to the current run.
Args:
values (Dictionary `str` to `str`, `float`, `int` or `dict` of `str` to `float`/`int`):
Values to be logged as key-value pairs. The values need to have type `str`, `float`, `int` or `dict` of
`str` to `float`/`int`.
step (`int`, *optional*):
The run step. If included, the log will be affiliated with this step.
"""
if step is not None:
self.writer.set_step(step)
for k, v in values.items():
if isinstance(v, (int, float)):
self.writer.log_metric(k, v, step=step)
elif isinstance(v, str):
self.writer.log_other(k, v)
elif isinstance(v, dict):
self.writer.log_metrics(v, step=step)
logger.info("Successfully logged to CometML")
def finish(self):
"""
Closes `comet-ml` writer
"""
self.writer.end()
logger.info("CometML run closed")
LOGGER_TYPE_TO_CLASS = {"tensorboard": TensorBoardTracker, "wandb": WandBTracker, "comet_ml": CometMLTracker}
def filter_trackers(
log_with: List[Union[str, LoggerType, GeneralTracker]], logging_dir: Union[str, os.PathLike] = None
):
"""
Takes in a list of potential tracker types and:
- Checks that each requested tracker is available in the environment
- Filters out duplicate tracker types
- If `all` is in `log_with`, returns every tracker available in the environment
- If a tracker requires a `logging_dir`, ensures that `logging_dir` is not `None`
Args:
log_with (list of `str`, [`~utils.LoggerType`] or [`~tracking.GeneralTracker`], *optional*):
A list of loggers to be setup for experiment tracking. Should be one or several of:
- `"all"`
- `"tensorboard"`
- `"wandb"`
- `"comet_ml"`
If `"all`" is selected, will pick up all available trackers in the environment and intialize them. Can also
accept implementations of `GeneralTracker` for custom trackers, and can be combined with `"all"`.
logging_dir (`str`, `os.PathLike`, *optional*):
A path to a directory for storing logs of locally-compatible loggers.
"""
loggers = []
if log_with is not None:
if not isinstance(log_with, (list, tuple)):
log_with = [log_with]
logger.debug(f"{log_with}")
if "all" in log_with or LoggerType.ALL in log_with:
loggers = [o for o in log_with if issubclass(type(o), GeneralTracker)] + get_available_trackers()
else:
for log_type in log_with:
if log_type not in LoggerType and not issubclass(type(log_type), GeneralTracker):
raise ValueError(f"Unsupported logging capability: {log_type}. Choose between {LoggerType.list()}")
if issubclass(type(log_type), GeneralTracker):
loggers.append(log_type)
else:
log_type = LoggerType(log_type)
if log_type not in loggers:
if log_type in get_available_trackers():
tracker_init = LOGGER_TYPE_TO_CLASS[str(log_type)]
if getattr(tracker_init, "requires_logging_directory"):
if logging_dir is None:
raise ValueError(
f"Logging with `{str(log_type)}` requires a `logging_dir` to be passed in."
)
loggers.append(log_type)
else:
logger.info(f"Tried adding logger {log_type}, but package is unavailable in the system.")
return loggers
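An illustrative call, assuming tensorboard is installed in the environment; custom `GeneralTracker` instances can be mixed into `log_with` as well (the `my_custom_tracker` name below is hypothetical):

```python
# Returns [LoggerType.TENSORBOARD] when tensorboard is importable;
# raises if a tracker needs a logging_dir and none was given.
chosen = filter_trackers(["tensorboard"], logging_dir="runs")

# "all" expands to every available tracker, and any GeneralTracker
# implementation passed alongside it is kept as-is.
# chosen = filter_trackers(["all", my_custom_tracker], logging_dir="runs")
```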

View File

@ -1,168 +0,0 @@
# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import random
import numpy as np
import torch
from .state import AcceleratorState, DistributedType, is_tpu_available
if is_tpu_available():
import torch_xla.core.xla_model as xm
def set_seed(seed: int):
"""
Helper function for reproducible behavior to set the seed in ``random``, ``numpy``, ``torch``.
Args:
seed (:obj:`int`): The seed to set.
"""
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
# ^^ safe to call this function even if cuda is not available
def synchronize_rng_states():
"""
Helper function to synchronize the rng states in distributed training.
"""
state = AcceleratorState()
if state.distributed_type == DistributedType.TPU:
rng_state = torch.get_rng_state()
rng_state = xm.mesh_reduce("random_seed", rng_state, lambda x: x[0])
torch.set_rng_state(rng_state)
elif state.distributed_type == DistributedType.MULTI_GPU:
rng_state = torch.get_rng_state().to(state.device)
# Broadcast the state from process 0 to all the others.
torch.distributed.broadcast(rng_state, 0)
torch.set_rng_state(rng_state.cpu())
# Broadcast the state from process 0 to all the others.
rng_state = torch.cuda.get_rng_state().to(state.device)
torch.distributed.broadcast(rng_state, 0)
torch.cuda.set_rng_state(rng_state.cpu())
def send_to_device(tensor, device):
"""
Recursively sends the elements in a nested list/tuple/dictionary of tensors to a given device.
Args:
tensor (nested list/tuple/dictionary of :obj:`torch.Tensor`):
The data to send to a given device.
device (:obj:`torch.device`):
The device to send the data to
Returns:
The same data structure as :obj:`tensor` with all tensors sent to the proper device.
"""
if isinstance(tensor, (list, tuple)):
return type(tensor)(send_to_device(t, device) for t in tensor)
elif isinstance(tensor, dict):
return type(tensor)({k: send_to_device(v, device) for k, v in tensor.items()})
elif not hasattr(tensor, "to"):
raise TypeError(
f"Can't send the values of type {type(tensor)} to device {device}, only of nested list/tuple/dicts "
"of tensors or objects having a `to` method."
)
return tensor.to(device)
def extract_model_from_parallel(model):
"""
Extract a model from its distributed containers.
Args:
model (:obj:`torch.nn.Module`): The model to extract.
Returns:
:obj:`torch.nn.Module`: The extracted model.
"""
while isinstance(model, (torch.nn.parallel.DistributedDataParallel, torch.nn.DataParallel)):
model = model.module
return model
def _tpu_gather(tensor, name="tensor"):
if isinstance(tensor, (list, tuple)):
return type(tensor)(_tpu_gather(t, name=f"{name}_{i}") for i, t in enumerate(tensor))
elif isinstance(tensor, dict):
return type(tensor)({k: _tpu_gather(v, name=f"{name}_{k}") for k, v in tensor.items()})
elif not isinstance(tensor, torch.Tensor):
raise TypeError(f"Can't gather the values of type {type(tensor)}, only of nested list/tuple/dicts of tensors.")
return xm.mesh_reduce(name, tensor, torch.cat)
def _gpu_gather(tensor):
if isinstance(tensor, (list, tuple)):
return type(tensor)(_gpu_gather(t) for t in tensor)
elif isinstance(tensor, dict):
return type(tensor)({k: _gpu_gather(v) for k, v in tensor.items()})
elif not isinstance(tensor, torch.Tensor):
raise TypeError(f"Can't gather the values of type {type(tensor)}, only of nested list/tuple/dicts of tensors.")
output_tensors = [tensor.clone() for _ in range(torch.distributed.get_world_size())]
torch.distributed.all_gather(output_tensors, tensor)
return torch.cat(output_tensors, dim=0)
def gather(tensor):
"""
Recursively gather tensors in a nested list/tuple/dictionary of tensors from all devices.
Args:
tensor (nested list/tuple/dictionary of :obj:`torch.Tensor`):
The data to gather.
Returns:
The same data structure as :obj:`tensor` with all tensors sent to the proper device.
"""
if AcceleratorState().distributed_type == DistributedType.TPU:
return _tpu_gather(tensor, name="accelerate.utils.gather")
elif AcceleratorState().distributed_type == DistributedType.MULTI_GPU:
return _gpu_gather(tensor)
else:
return tensor
def wait_for_everyone():
"""
Introduces a blocking point in the script, making sure all processes have reached this point before continuing.
Warning::
Make sure all processes will reach this instruction otherwise one of your processes will hang forever.
"""
if AcceleratorState().distributed_type == DistributedType.MULTI_GPU:
torch.distributed.barrier()
elif AcceleratorState().distributed_type == DistributedType.TPU:
xm.rendezvous("accelerate.utils.wait_for_everyone")
def save(obj, f):
"""
Save the data to disk. Use in place of :obj:`torch.save()`.
Args:
obj: The data to save
f: The file (or file-like object) to use to save the data
"""
if AcceleratorState().distributed_type == DistributedType.TPU:
xm.save(obj, f)
elif AcceleratorState().local_process_index == 0:
torch.save(obj, f)

View File

@ -0,0 +1,92 @@
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all
from .constants import MODEL_NAME, OPTIMIZER_NAME, RNG_STATE_NAME, SCALER_NAME, SCHEDULER_NAME
from .dataclasses import (
ComputeEnvironment,
DeepSpeedPlugin,
DistributedDataParallelKwargs,
DistributedType,
FullyShardedDataParallelPlugin,
GradScalerKwargs,
InitProcessGroupKwargs,
KwargsHandler,
LoggerType,
PrecisionType,
RNGType,
SageMakerDistributedType,
TensorInformation,
)
from .imports import (
is_apex_available,
is_boto3_available,
is_ccl_available,
is_comet_ml_available,
is_deepspeed_available,
is_sagemaker_available,
is_tensorboard_available,
is_tensorflow_available,
is_tpu_available,
is_wandb_available,
)
from .modeling import (
check_device_map,
compute_module_sizes,
convert_file_size_to_int,
dtype_byte_size,
find_tied_parameters,
get_max_layer_size,
get_max_memory,
infer_auto_device_map,
load_checkpoint_in_model,
load_offloaded_weights,
named_module_tensors,
set_module_tensor_to_device,
)
from .offload import (
OffloadedWeightsLoader,
PrefixedDataset,
extract_submodules_state_dict,
offload_state_dict,
offload_weight,
save_offload_index,
)
from .operations import (
broadcast,
broadcast_object_list,
concatenate,
convert_outputs_to_fp32,
convert_to_fp32,
find_batch_size,
find_device,
gather,
gather_object,
get_data_structure,
honor_type,
initialize_tensors,
is_tensor_information,
is_torch_tensor,
pad_across_processes,
recursively_apply,
reduce,
send_to_device,
slice_tensors,
)
from .versions import compare_versions, is_torch_version
if is_deepspeed_available():
from .deepspeed import DeepSpeedEngineWrapper, DeepSpeedOptimizerWrapper
from .launch import PrepareForLaunch
from .memory import find_executable_batch_size
from .other import (
extract_model_from_parallel,
get_pretty_name,
patch_environment,
save,
wait_for_everyone,
write_basic_config,
)
from .random import set_seed, synchronize_rng_state, synchronize_rng_states

View File

@ -0,0 +1,24 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import operator as op
SCALER_NAME = "scaler.pt"
MODEL_NAME = "pytorch_model"
RNG_STATE_NAME = "random_states"
OPTIMIZER_NAME = "optimizer"
SCHEDULER_NAME = "scheduler"
STR_OPERATION_TO_FUNC = {">": op.gt, ">=": op.ge, "==": op.eq, "!=": op.ne, "<=": op.le, "<": op.lt}

View File

@ -0,0 +1,304 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
General namespace and dataclass related classes
"""
import copy
import enum
import functools
import os
import typing
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Callable, Iterable, Optional
import torch
class KwargsHandler:
"""
Internal mixin that implements a `to_kwargs()` method for a dataclass.
"""
def to_dict(self):
return copy.deepcopy(self.__dict__)
def to_kwargs(self):
"""
Returns a dictionary containing the attributes with values different from the default of this class.
"""
default_dict = self.__class__().to_dict()
this_dict = self.to_dict()
return {k: v for k, v in this_dict.items() if default_dict[k] != v}
@dataclass
class DistributedDataParallelKwargs(KwargsHandler):
"""
Use this object in your [`Accelerator`] to customize how your model is wrapped in a
`torch.nn.parallel.DistributedDataParallel`. Please refer to the documentation of this
[wrapper](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) for more
information on each argument.
<Tip warning={true}>
`gradient_as_bucket_view` is only available in PyTorch 1.7.0 and later versions.
</Tip>"""
dim: int = 0
broadcast_buffers: bool = True
bucket_cap_mb: int = 25
find_unused_parameters: bool = False
check_reduction: bool = False
gradient_as_bucket_view: bool = False
@dataclass
class GradScalerKwargs(KwargsHandler):
"""
Use this object in your [`Accelerator`] to customize the behavior of mixed precision, specifically how the
`torch.cuda.amp.GradScaler` used is created. Please refer to the documentation of this
[scaler](https://pytorch.org/docs/stable/amp.html?highlight=gradscaler) for more information on each argument.
<Tip warning={true}>
`GradScaler` is only available in PyTorch 1.5.0 and later versions.
</Tip>"""
init_scale: float = 65536.0
growth_factor: float = 2.0
backoff_factor: float = 0.5
growth_interval: int = 2000
enabled: bool = True
@dataclass
class InitProcessGroupKwargs(KwargsHandler):
"""
Use this object in your [`Accelerator`] to customize the initialization of the distributed processes. Please refer
to the documentation of this
[method](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) for more
information on each argument.
"""
init_method: Optional[str] = None
timeout: timedelta = timedelta(seconds=1800)
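As a sketch of how these handlers are meant to be consumed (the top-level `accelerate` re-exports and the `Accelerator(kwargs_handlers=...)` entry point are assumed here), only non-default fields survive `to_kwargs()`:

```python
from datetime import timedelta

from accelerate import Accelerator, DistributedDataParallelKwargs, InitProcessGroupKwargs

ipg = InitProcessGroupKwargs(timeout=timedelta(seconds=600))
print(ipg.to_kwargs())  # {'timeout': datetime.timedelta(seconds=600)}

ddp = DistributedDataParallelKwargs(find_unused_parameters=True)
# The handlers are forwarded when init_process_group / DistributedDataParallel
# are actually created by the Accelerator.
accelerator = Accelerator(kwargs_handlers=[ddp, ipg])
```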
class DistributedType(str, enum.Enum):
"""
Represents a type of distributed environment.
Values:
- **NO** -- Not a distributed environment, just a single process.
- **MULTI_CPU** -- Distributed on multiple CPU nodes.
- **MULTI_GPU** -- Distributed on multiple GPUs.
- **DEEPSPEED** -- Using DeepSpeed.
- **TPU** -- Distributed on TPUs.
"""
# Subclassing str as well as Enum allows the `DistributedType` to be JSON-serializable out of the box.
NO = "NO"
MULTI_CPU = "MULTI_CPU"
MULTI_GPU = "MULTI_GPU"
DEEPSPEED = "DEEPSPEED"
FSDP = "FSDP"
TPU = "TPU"
class SageMakerDistributedType(str, enum.Enum):
"""
Represents a type of distributed environment.
Values:
- **NO** -- Not a distributed environment, just a single process.
- **DATA_PARALLEL** -- using sagemaker distributed data parallelism.
- **MODEL_PARALLEL** -- using sagemaker distributed model parallelism.
"""
# Subclassing str as well as Enum allows the `SageMakerDistributedType` to be JSON-serializable out of the box.
NO = "NO"
DATA_PARALLEL = "DATA_PARALLEL"
MODEL_PARALLEL = "MODEL_PARALLEL"
class ComputeEnvironment(str, enum.Enum):
"""
Represents a type of the compute environment.
Values:
- **LOCAL_MACHINE** -- private/custom cluster hardware.
- **AMAZON_SAGEMAKER** -- Amazon SageMaker as compute environment.
"""
# Subclassing str as well as Enum allows the `ComputeEnvironment` to be JSON-serializable out of the box.
LOCAL_MACHINE = "LOCAL_MACHINE"
AMAZON_SAGEMAKER = "AMAZON_SAGEMAKER"
class EnumWithContains(enum.EnumMeta):
"A metaclass that adds the ability to check if `self` contains an item with the `in` operator"
def __contains__(cls, item):
try:
cls(item)
except ValueError:
return False
return True
class BaseEnum(enum.Enum, metaclass=EnumWithContains):
"An enum class that can get the value of an item with `str(Enum.key)`"
def __str__(self):
return self.value
@classmethod
def list(cls):
"Method to list all the possible items in `cls`"
return list(map(lambda item: str(item), cls))
class LoggerType(BaseEnum):
ALL = "all"
TENSORBOARD = "tensorboard"
WANDB = "wandb"
COMETML = "comet_ml"
class PrecisionType(BaseEnum):
NO = "no"
FP16 = "fp16"
BF16 = "bf16"
class RNGType(BaseEnum):
TORCH = "torch"
CUDA = "cuda"
XLA = "xla"
GENERATOR = "generator"
# data classes
@dataclass
class TensorInformation:
shape: torch.Size
dtype: torch.dtype
@dataclass
class DeepSpeedPlugin:
gradient_accumulation_steps: int = field(
default=None, metadata={"help": "Number of steps to accumulate gradients before updating optimizer states"}
)
zero_stage: int = field(
default=None,
metadata={"help": "Possible options are 0,1,2,3; Default will be taken from environment variable"},
)
is_train_batch_min: str = field(
default=True,
metadata={"help": "If both train & eval dataloaders are specified, this will decide the train_batch_size"},
)
auto_opt_mapping: bool = field(
default=True,
metadata={"help": "whether to map torch.adam to deepspeed optimizer version of adam based on config"},
)
offload_optimizer_device: bool = field(default=None, metadata={"help": "Possible options are none|cpu|nvme"})
def __post_init__(self):
if self.gradient_accumulation_steps is None:
self.gradient_accumulation_steps = int(os.environ.get("GRADIENT_ACCUMULATION_STEPS", 1))
if self.zero_stage is None:
self.zero_stage = int(os.environ.get("DEEPSPEED_ZERO_STAGE", 2))
if self.offload_optimizer_device is None:
self.offload_optimizer_device = os.environ.get("DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE", "none")
self.deepspeed_config = {
"train_batch_size": None,
"gradient_accumulation_steps": self.gradient_accumulation_steps,
"zero_optimization": {
"stage": self.zero_stage,
"offload_optimizer": {
"device": self.offload_optimizer_device,
},
},
"steps_per_print": float("inf"), # this will stop deepspeed from logging @ stdout
"zero_allow_untested_optimizer": True,
}
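A small sketch of the environment-variable fallbacks above; the resulting plugin would then typically be handed to `Accelerator(deepspeed_plugin=...)`:

```python
import os

# Unset fields fall back to environment variables (or their defaults).
os.environ["DEEPSPEED_ZERO_STAGE"] = "3"
plugin = DeepSpeedPlugin(gradient_accumulation_steps=4)

print(plugin.zero_stage)                                       # 3
print(plugin.offload_optimizer_device)                         # 'none' unless the env var is set
print(plugin.deepspeed_config["gradient_accumulation_steps"])  # 4
```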
@dataclass
class FullyShardedDataParallelPlugin:
"""
This plugin is used to enable fully sharded data parallelism.
"""
sharding_strategy: "typing.Any" = field(
default=None,
metadata={"help": "Possible options are [1] FULL_SHARD, [2] SHARD_GRAD_OP"},
)
backward_prefetch: "typing.Any" = field(
default=None,
metadata={"help": "Possible options are [1] BACKWARD_PRE, [2] BACKWARD_POST"},
)
auto_wrap_policy: "typing.Any" = field(
default=None,
metadata={"help": "A callable specifying a policy to recursively wrap layers with FSDP"},
)
cpu_offload: Optional[Callable] = field(
default=None,
metadata={"help": "Decides Whether to offload parameters and gradients to CPU."},
)
min_num_params: int = field(
default=None, metadata={"help": "FSDP's minimum number of parameters for Default Auto Wrapping."}
)
ignored_modules: Optional[Iterable[torch.nn.Module]] = field(
default=None,
metadata={"help": "A list of modules to ignore for FSDP."},
)
def __post_init__(self):
from torch.distributed.fsdp.fully_sharded_data_parallel import CPUOffload, ShardingStrategy
from torch.distributed.fsdp.wrap import default_auto_wrap_policy
if self.sharding_strategy is None:
self.sharding_strategy = ShardingStrategy(int(os.environ.get("FSDP_SHARDING_STRATEGY", 1)))
if self.cpu_offload is None:
if os.environ.get("FSDP_OFFLOAD_PARAMS", "false") == "true":
self.cpu_offload = CPUOffload(offload_params=True)
else:
self.cpu_offload = CPUOffload(offload_params=False)
if self.min_num_params is None:
self.min_num_params = int(os.environ.get("FSDP_MIN_NUM_PARAMS", 0))
if self.auto_wrap_policy is None:
if self.min_num_params > 0:
self.auto_wrap_policy = functools.partial(default_auto_wrap_policy, min_num_params=self.min_num_params)

View File

@ -0,0 +1,96 @@
# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from ..optimizer import AcceleratedOptimizer
from .imports import is_apex_available, is_deepspeed_available
if is_deepspeed_available():
from deepspeed import DeepSpeedEngine
if is_apex_available():
from apex import amp
class DeepSpeedEngineWrapper(DeepSpeedEngine):
"""
Wrapper over deepspeed.DeepSpeedEngine object
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
# overwriting micro_steps for user's gradient_accumulation
self.micro_steps = -1
def step(self, lr_kwargs=None):
"""DeepSpeedEngine.step() without `micro_steps` update & no profiling"""
if self.is_gradient_accumulation_boundary(): # it shouldn't matter whether we keep this line or not
if self.progressive_layer_drop:
self.progressive_layer_drop.update_state(self.global_steps)
self._take_model_step(lr_kwargs)
def backward(self, loss):
"""DeepSpeedEngine.backward() with with no loss scaling; no profiling but with `micro_steps` update"""
if self.zero_optimization():
self.optimizer.is_gradient_accumulation_boundary = self.is_gradient_accumulation_boundary()
self.optimizer.backward(loss)
elif self.amp_enabled():
# AMP requires delaying unscale when inside gradient accumulation boundaries
# https://nvidia.github.io/apex/advanced.html#gradient-accumulation-across-iterations
delay_unscale = not self.is_gradient_accumulation_boundary()
with amp.scale_loss(loss, self.optimizer, delay_unscale=delay_unscale) as scaled_loss:
scaled_loss.backward()
elif self.fp16_enabled():
self.optimizer.backward(loss)
else:
loss.backward()
if self.enable_backward_allreduce:
self.allreduce_gradients()
# this will ensure deepspeed gradient_accumulation matches user's accumulation
self.micro_steps += 1
class DeepSpeedOptimizerWrapper(AcceleratedOptimizer):
"""
Internal wrapper around a deepspeed optimizer.
Args:
optimizer (`torch.optim.optimizer.Optimizer`):
The optimizer to wrap.
"""
def __init__(self, optimizer, model: DeepSpeedEngineWrapper):
super().__init__(optimizer, device_placement=False, scaler=None)
self.model = model
def zero_grad(self, set_to_none=None):
pass # `model.step()` is doing that automatically. Therefore, its implementation is not needed
def step(self):
"""This will handle optimizer.step() & optimizer.zero_grad() with gradient_accumulation"""
self.model.step()
@property
def is_overflow(self):
"""Whether or not the optimizer step was done, or skipped because of gradient overflow."""
overflow = False
if hasattr(self.optimizer, "overflow"):
overflow = self.optimizer.overflow
return overflow

View File

@ -0,0 +1,87 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import importlib
import sys
# The package importlib_metadata is in a different place, depending on the Python version.
if sys.version_info < (3, 8):
import importlib_metadata
else:
import importlib.metadata as importlib_metadata
try:
import torch_ccl # noqa: F401
_ccl_available = True
except ImportError:
_ccl_available = False
try:
import torch_xla.core.xla_model as xm # noqa: F401
_tpu_available = True
except ImportError:
_tpu_available = False
def is_ccl_available():
return _ccl_available
def is_apex_available():
return importlib.util.find_spec("apex") is not None
def is_tpu_available():
return _tpu_available
def is_deepspeed_available():
package_exists = importlib.util.find_spec("deepspeed") is not None
# Check we're not importing a "deepspeed" directory somewhere but the actual library by trying to grab the version
# AND checking it has an author field in the metadata that is HuggingFace.
if package_exists:
try:
_ = importlib_metadata.metadata("deepspeed")
return True
except importlib_metadata.PackageNotFoundError:
return False
def is_tensorflow_available():
return importlib.util.find_spec("tensorflow") is not None
def is_tensorboard_available():
return importlib.util.find_spec("tensorboard") is not None or importlib.util.find_spec("tensorboardX") is not None
def is_wandb_available():
return importlib.util.find_spec("wandb") is not None
def is_comet_ml_available():
return importlib.util.find_spec("comet_ml") is not None
def is_boto3_available():
return importlib.util.find_spec("boto3") is not None
def is_sagemaker_available():
return importlib.util.find_spec("sagemaker") is not None

View File

@ -0,0 +1,55 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import torch
from .dataclasses import DistributedType
class PrepareForLaunch:
"""
Prepare a function that will be launched in a distributed setup.
Args:
launcher (`Callable`):
The function to launch.
distributed_type ([`~state.DistributedType`]):
The distributed type to prepare for.
debug (`bool`, *optional*, defaults to `False`):
Whether or not this is a debug launch.
"""
def __init__(self, launcher, distributed_type="NO", debug=False):
self.launcher = launcher
self.distributed_type = DistributedType(distributed_type)
self.debug = debug
def __call__(self, index, *args):
if self.debug:
world_size = int(os.environ.get("WORLD_SIZE"))
rdv_file = os.environ.get("ACCELERATE_DEBUG_RDV_FILE")
torch.distributed.init_process_group(
"gloo",
rank=index,
store=torch.distributed.FileStore(rdv_file, world_size),
world_size=world_size,
)
elif self.distributed_type == DistributedType.MULTI_GPU or self.distributed_type == DistributedType.MULTI_CPU:
# Prepare the environment for torch.distributed
os.environ["LOCAL_RANK"] = str(index)
os.environ["RANK"] = str(index)
self.launcher(*args)
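A rough sketch of how this wrapper might be driven by `torch.multiprocessing` (assuming `MASTER_ADDR`, `MASTER_PORT` and `WORLD_SIZE` are already set in the environment and two processes are wanted; the training function is hypothetical):

```python
import torch.multiprocessing as mp


def training_function():
    # Reads LOCAL_RANK / RANK, which PrepareForLaunch sets from the process index.
    ...


launcher = PrepareForLaunch(training_function, distributed_type="MULTI_GPU")
mp.start_processes(launcher, args=(), nprocs=2, start_method="fork")
```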

View File

@ -0,0 +1,88 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
A collection of utilities for ensuring that training can always occur. Heavily influenced by the
[toma](https://github.com/BlackHC/toma) library.
"""
import functools
import gc
import inspect
import torch
def should_reduce_batch_size(exception: Exception) -> bool:
"""
Checks if `exception` relates to CUDA out-of-memory, CUDNN not supported, or CPU out-of-memory
Args:
exception (`Exception`):
An exception
"""
_statements = [
"CUDA out of memory.", # CUDA OOM
"cuDNN error: CUDNN_STATUS_NOT_SUPPORTED.", # CUDNN SNAFU
"DefaultCPUAllocator: can't allocate memory", # CPU OOM
]
if isinstance(exception, RuntimeError) and len(exception.args) == 1:
return any(err in exception.args[0] for err in _statements)
return False
def find_executable_batch_size(function: callable = None, starting_batch_size: int = 128):
"""
A basic decorator that will try to execute `function`. If it fails from exceptions related to out-of-memory or
CUDNN, the batch size is cut in half and passed to `function`
`function` must take in a `batch_size` parameter as its first argument.
Args:
function (`callable`, *optional*):
A function to wrap
starting_batch_size (`int`, *optional*):
The batch size to try and fit into memory
"""
if function is None:
return functools.partial(find_executable_batch_size, starting_batch_size=starting_batch_size)
batch_size = starting_batch_size
def decorator(*args, **kwargs):
nonlocal batch_size
gc.collect()
torch.cuda.empty_cache()
params = list(inspect.signature(function).parameters.keys())
# Guard against user error
if len(params) < (len(args) + 1):
arg_str = ", ".join([f"{arg}={value}" for arg, value in zip(params[1:], args[1:])])
raise TypeError(
f"Batch size was passed into `{function.__name__}` as the first argument when called."
f"Remove this as the decorator already does so: `{function.__name__}({arg_str})`"
)
while True:
if batch_size == 0:
raise RuntimeError("No executable batch size found, reached zero.")
try:
return function(batch_size, *args, **kwargs)
except Exception as e:
if should_reduce_batch_size(e):
gc.collect()
torch.cuda.empty_cache()
batch_size //= 2
else:
raise
return decorator
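A usage sketch: `batch_size` must be the first parameter of the wrapped function and is supplied by the decorator, which halves it on every OOM-style failure. The `train` function below is hypothetical:

```python
@find_executable_batch_size(starting_batch_size=64)
def train(batch_size, num_epochs):
    print(f"Attempting training with batch size {batch_size}")
    ...  # build dataloaders/model with `batch_size` and run `num_epochs` epochs


# batch_size is injected by the decorator, so it is not passed here.
train(num_epochs=3)
```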

View File

@ -0,0 +1,614 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import gc
import json
import os
import re
import shutil
import tempfile
from collections import defaultdict
from typing import Dict, List, Optional, Tuple, Union
import numpy as np
import torch
import torch.nn as nn
from .offload import offload_weight, save_offload_index
WEIGHTS_INDEX_NAME = "pytorch_model.bin.index.json"
def convert_file_size_to_int(size: Union[int, str]):
"""
Converts a size expressed as a string with digits and a unit (like `"5MB"`) to an integer (in bytes).
Args:
size (`int` or `str`): The size to convert. Will be directly returned if an `int`.
Example:
```py
>>> convert_file_size_to_int("1MiB")
1048576
```
"""
if isinstance(size, int):
return size
if size.upper().endswith("GIB"):
return int(size[:-3]) * (2**30)
if size.upper().endswith("MIB"):
return int(size[:-3]) * (2**20)
if size.upper().endswith("KIB"):
return int(size[:-3]) * (2**10)
if size.upper().endswith("GB"):
int_size = int(size[:-2]) * (10**9)
return int_size // 8 if size.endswith("b") else int_size
if size.upper().endswith("MB"):
int_size = int(size[:-2]) * (10**6)
return int_size // 8 if size.endswith("b") else int_size
if size.upper().endswith("KB"):
int_size = int(size[:-2]) * (10**3)
return int_size // 8 if size.endswith("b") else int_size
raise ValueError("`size` is not in a valid format. Use an integer followed by the unit, e.g., '5GB'.")
def dtype_byte_size(dtype: torch.dtype):
"""
Returns the size (in bytes) occupied by one parameter of type `dtype`.
Example:
```py
>>> dtype_byte_size(torch.float32)
4
```
"""
if dtype == torch.bool:
return 1 / 8
bit_search = re.search("[^\d](\d+)$", str(dtype))
if bit_search is None:
raise ValueError(f"`dtype` is not a valid dtype: {dtype}.")
bit_size = int(bit_search.groups()[0])
return bit_size // 8
def set_module_tensor_to_device(
module: nn.Module, tensor_name: str, device: Union[int, str, torch.device], value: Optional[torch.Tensor] = None
):
"""
A helper function to set a given tensor (parameter or buffer) of a module on a specific device (note that doing
`param.to(device)` creates a new tensor not linked to the parameter, which is why we need this function).
Args:
module (`torch.nn.Module`): The module in which the tensor we want to move lives.
tensor_name (`str`): The full name of the parameter/buffer.
device (`int`, `str` or `torch.device`): The device on which to set the tensor.
value (`torch.Tensor`, *optional*): The value of the tensor (useful when going from the meta device to any
other device).
"""
# Recurse if needed
if "." in tensor_name:
splits = tensor_name.split(".")
for split in splits[:-1]:
new_module = getattr(module, split)
if new_module is None:
raise ValueError(f"{module} has no attribute {split}.")
module = new_module
tensor_name = splits[-1]
if tensor_name not in module._parameters and tensor_name not in module._buffers:
raise ValueError(f"{module} does not have a parameter or a buffer named {tensor_name}.")
is_buffer = tensor_name in module._buffers
old_value = getattr(module, tensor_name)
if old_value.device == torch.device("meta") and device not in ["meta", torch.device("meta")] and value is None:
raise ValueError(f"{tensor_name} is on the meta device, we need a `value` to put in on {device}.")
with torch.no_grad():
if value is None:
new_value = old_value.to(device)
elif isinstance(value, torch.Tensor):
new_value = value.to(device)
else:
new_value = torch.tensor(value, device=device)
if is_buffer:
module._buffers[tensor_name] = new_value
else:
new_value = nn.Parameter(new_value, requires_grad=old_value.requires_grad)
module._parameters[tensor_name] = new_value
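A small sketch using a plain `nn.Linear`: unlike `module.weight.to(...)`, the parameter registered on the module is actually replaced in place:

```python
import torch
import torch.nn as nn

linear = nn.Linear(2, 2)
set_module_tensor_to_device(linear, "weight", "cpu", value=torch.zeros(2, 2))
print(linear.weight)  # Parameter of zeros, still with requires_grad=True
```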
def named_module_tensors(module: nn.Module, include_buffers: bool = True, recurse: bool = False):
"""
A helper function that gathers all the tensors (parameters + buffers) of a given module. If `include_buffers=True`
it's the same as doing `module.named_parameters(recurse=recurse) + module.named_buffers(recurse=recurse)`.
Args:
module (`torch.nn.Module`): The module whose tensors we want to gather.
include_buffers (`bool`, *optional*, defaults to `True`): Whether or not to include the buffers in the result.
recurse (`bool`, *optional*, defaults to `False`):
Whether or not to go look in every submodule or just return the direct parameters and buffers.
"""
for named_parameter in module.named_parameters(recurse=recurse):
yield named_parameter
if include_buffers:
for named_buffer in module.named_buffers(recurse=recurse):
yield named_buffer
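# Illustrative sketch (not part of the original file): buffers such as the running
# statistics of a BatchNorm layer are included alongside the parameters by default.
def _example_named_module_tensors():
    bn = nn.BatchNorm1d(4)
    names = [name for name, _ in named_module_tensors(bn, include_buffers=True)]
    assert "weight" in names and "running_mean" in names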
def find_tied_parameters(model: nn.Module, **kwargs):
"""
Find the tied parameters in a given model.
Args:
model (`torch.nn.Module`): The model to inspect.
<Tip warning={true}>
The signature accepts keyword arguments, but they are for the recursive part of this function and you should ignore
them.
</Tip>
Example:
```py
>>> from collections import OrderedDict
>>> import torch.nn as nn
>>> model = nn.Sequential(OrderedDict([("linear1", nn.Linear(4, 4)), ("linear2", nn.Linear(4, 4))]))
>>> model.linear2.weight = model.linear1.weight
>>> find_tied_parameters(model)
{'linear1.weight': 'linear2.weight'}
```
Returns:
Dict[str, str]: A dictionary mapping tied parameter names to the name of the parameter they are tied to.
"""
# Initialize result and named_parameters before recursing.
named_parameters = kwargs.get("named_parameters", None)
prefix = kwargs.get("prefix", "")
result = kwargs.get("result", {})
if named_parameters is None:
named_parameters = {n: p for n, p in model.named_parameters()}
else:
# A tied parameter will not be in the full `named_parameters` seen above but will be in the `named_parameters`
# of the submodule it belongs to. So while recursing we track the names that are not in the initial
# `named_parameters`.
for name, parameter in model.named_parameters():
full_name = name if prefix == "" else f"{prefix}.{name}"
if full_name not in named_parameters:
# When we find one, it has to be one of the existing parameters.
for new_name, new_param in named_parameters.items():
if new_param is parameter:
result[new_name] = full_name
# Once we have treated direct parameters, we move to the child modules.
for name, child in model.named_children():
child_name = name if prefix == "" else f"{prefix}.{name}"
find_tied_parameters(child, named_parameters=named_parameters, prefix=child_name, result=result)
return result
def compute_module_sizes(model: nn.Module, dtype: Optional[Union[str, torch.dtype]] = None):
"""
Compute the size of each submodule of a given model.
"""
if isinstance(dtype, str):
# We accept "torch.float16" or just "float16"
dtype = dtype.replace("torch.", "")
dtype = getattr(torch, dtype)
if dtype is not None:
dtype_size = dtype_byte_size(dtype)
module_sizes = defaultdict(int)
for name, tensor in named_module_tensors(model, recurse=True):
if dtype is None:
size = tensor.numel() * dtype_byte_size(tensor.dtype)
else:
size = tensor.numel() * min(dtype_size, dtype_byte_size(tensor.dtype))
name_parts = name.split(".")
for idx in range(len(name_parts) + 1):
module_sizes[".".join(name_parts[:idx])] += size
return module_sizes
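# Illustrative sketch (not part of the original file): sizes are accumulated for every
# prefix, so the empty-string key holds the total size of the model in bytes.
def _example_compute_module_sizes():
    model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
    sizes = compute_module_sizes(model)
    # 30 float32 parameters in total: (4*4 + 4) + (4*2 + 2), at 4 bytes each.
    assert sizes[""] == 30 * 4
    assert sizes["0"] == 20 * 4 and sizes["1"] == 10 * 4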
def get_max_layer_size(
modules: List[Tuple[str, torch.nn.Module]], module_sizes: Dict[str, int], no_split_module_classes: List[str]
):
"""
Utility function that will scan a list of named modules and return the maximum size used by one full layer. The
definition of a layer being:
- a module with no direct children (just parameters and buffers)
- a module whose class name is in the list `no_split_module_classes`
Args:
modules (`List[Tuple[str, torch.nn.Module]]`):
The list of named modules where we want to determine the maximum layer size.
module_sizes (`Dict[str, int]`):
A dictionary mapping each layer name to its size (as generated by `compute_module_sizes`).
no_split_module_classes (`List[str]`):
A list of class names for layers we don't want to be split.
Returns:
`Tuple[int, List[str]]`: The maximum size of a layer with the list of layer names realizing that maximum size.
"""
max_size = 0
layer_names = []
modules_to_treat = modules.copy()
while len(modules_to_treat) > 0:
module_name, module = modules_to_treat.pop(0)
modules_children = list(module.named_children())
if len(modules_children) == 0 or module.__class__.__name__ in no_split_module_classes:
# No splitting this one so we compare to the max_size
size = module_sizes[module_name]
if size > max_size:
max_size = size
layer_names = [module_name]
elif size == max_size:
layer_names.append(module_name)
else:
modules_to_treat = [(f"{module_name}.{n}", v) for n, v in modules_children] + modules_to_treat
return max_size, layer_names
def get_max_memory(max_memory: Optional[Dict[Union[int, str], Union[int, str]]] = None):
"""
Get the maximum memory available per device if nothing is passed; otherwise, convert string sizes to integers (in bytes).
"""
import psutil
if max_memory is None:
if not torch.cuda.is_available():
max_memory = {}
else:
# Make sure CUDA is initialized on each GPU to have the right memory info.
for i in range(torch.cuda.device_count()):
_ = torch.tensor([0], device=i)
max_memory = {i: torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())}
max_memory["cpu"] = psutil.virtual_memory().available
return max_memory
for key in max_memory:
if isinstance(max_memory[key], str):
max_memory[key] = convert_file_size_to_int(max_memory[key])
return max_memory
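# Illustrative sketch (not part of the original file): a user-provided `max_memory` dict
# may mix integers and strings; string sizes are normalized to bytes and device keys are
# kept as given (the limits below are made up).
def _example_get_max_memory():
    limits = get_max_memory({0: "10GiB", "cpu": "30GB"})
    assert limits[0] == 10 * 2**30
    assert limits["cpu"] == 30 * 10**9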
def clean_device_map(device_map: Dict[str, Union[int, str, torch.device]], module_name: str = ""):
"""
Cleans a device_map by grouping all submodules that go on the same device together.
"""
# Get the value of the current module and if there is only one split across several keys, regroup it.
prefix = "" if module_name == "" else f"{module_name}."
values = [v for k, v in device_map.items() if k.startswith(prefix)]
if len(set(values)) == 1 and len(values) > 1:
for k in [k for k in device_map if k.startswith(prefix)]:
del device_map[k]
device_map[module_name] = values[0]
# Recurse over the children
children_modules = [k for k in device_map.keys() if k.startswith(module_name) and len(k) > len(module_name)]
idx = len(module_name.split(".")) + 1 if len(module_name) > 0 else 1
children_modules = set(".".join(k.split(".")[:idx]) for k in children_modules)
for child in children_modules:
clean_device_map(device_map, module_name=child)
return device_map
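# Illustrative sketch (not part of the original file): when every submodule of a prefix
# ends up on the same device, the entries collapse back to that prefix (module names
# below are made up).
def _example_clean_device_map():
    device_map = {"block.linear1": 0, "block.linear2": 0, "head": "cpu"}
    assert clean_device_map(device_map) == {"block": 0, "head": "cpu"}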
def load_offloaded_weights(model, index, offload_folder):
if index is None or len(index) == 0:
# Nothing to do
return
for param_name, metadata in index.items():
tensor_file = os.path.join(offload_folder, f"{param_name}.dat")
shape = tuple(metadata["shape"])
weight = np.memmap(tensor_file, dtype=metadata["dtype"], mode="r", shape=shape)
set_module_tensor_to_device(model, param_name, "cpu", value=torch.tensor(weight))
def infer_auto_device_map(
model: nn.Module,
max_memory: Optional[Dict[Union[int, str], Union[int, str]]] = None,
no_split_module_classes: Optional[List[str]] = None,
dtype: Optional[Union[str, torch.dtype]] = None,
):
"""
Compute a device map for a given model giving priority to GPUs, then offload on CPU and finally offload to disk,
such that:
- we don't exceed the memory available on any of the GPUs.
- if offload to the CPU is needed, there is always room left on GPU 0 to put back the layer offloaded on CPU that
has the largest size.
- if offload to the CPU is needed, we don't exceed the RAM available on the CPU.
- if offload to the disk is needed, there is always room left on the CPU to put back the layer offloaded on disk
that has the largest size.
<Tip>
All computation is done analyzing sizes and dtypes of the model parameters. As a result, the model can be on the
meta device (as it would if initialized within the `init_empty_weights` context manager).
</Tip>
Args:
model (`torch.nn.Module`): The model to analyze.
max_memory (`Dict`, *optional*):
A dictionary mapping device identifiers to maximum memory. Will default to the maximum memory available if unset.
no_split_module_classes (`List[str]`, *optional*):
A list of layer class names that should never be split across devices (for instance any layer that has a
residual connection).
dtype (`str` or `torch.dtype`, *optional*):
If provided, the weights will be converted to that type when loaded.
"""
# Get default / clean up max_memory
max_memory = get_max_memory(max_memory)
if no_split_module_classes is None:
no_split_module_classes = []
elif not isinstance(no_split_module_classes, (list, tuple)):
no_split_module_classes = [no_split_module_classes]
devices = list(max_memory.keys())
gpus = [device for device in devices if device != "cpu"]
if "disk" not in devices:
devices.append("disk")
# Devices that need to keep space for a potential offloaded layer.
main_devices = [gpus[0], "cpu"] if len(gpus) > 0 else ["cpu"]
module_sizes = compute_module_sizes(model, dtype=dtype)
tied_parameters = find_tied_parameters(model)
device_map = {}
current_device = 0
current_memory_used = 0
# Direct submodules and parameters
modules_to_treat = list(model.named_parameters(recurse=False)) + list(model.named_children())
# Initialize maximum largest layer, to know which space to keep in memory
max_layer_size, max_layer_names = get_max_layer_size(modules_to_treat, module_sizes, no_split_module_classes)
# Ready ? This is going to be a bit messy.
while len(modules_to_treat) > 0:
name, module = modules_to_treat.pop(0)
# Max size in the remaining layers may have changed since we took one, so we may need to update it.
max_layer_names = [n for n in max_layer_names if not n.startswith(name)]
if len(max_layer_names) == 0:
max_layer_size, max_layer_names = get_max_layer_size(
[(n, m) for n, m in modules_to_treat if isinstance(m, torch.nn.Module)],
module_sizes,
no_split_module_classes,
)
# Assess size needed
module_size = module_sizes[name]
tied_params = [v for k, v in tied_parameters.items() if name in k]
# We ignore parameters that are tied to more than one other parameter.
tied_param = tied_params[0] if len(tied_params) == 1 else None
device = devices[current_device]
current_max_size = max_memory[device] if device != "disk" else None
# Reduce max size available by the largest layer.
if devices[current_device] in main_devices:
current_max_size = current_max_size - max_layer_size
# Case 1 -> We're too big!
if current_max_size is not None and current_memory_used + module_size > current_max_size:
# Split or not split?
modules_children = list(module.named_children())
if len(modules_children) == 0 or module.__class__.__name__ in no_split_module_classes:
# -> no split, we go to the next device
current_device += 1
modules_to_treat = [(name, module)] + modules_to_treat
current_memory_used = 0
else:
# -> split, we replace the module studied by its children + parameters
modules_children = list(module.named_parameters(recurse=False)) + modules_children
modules_to_treat = [(f"{name}.{n}", v) for n, v in modules_children] + modules_to_treat
# Update the max layer size.
max_layer_size, max_layer_names = get_max_layer_size(
[(n, m) for n, m in modules_to_treat if isinstance(m, torch.nn.Module)],
module_sizes,
no_split_module_classes,
)
# Case 2, it fits! We're not entirely out of the woods though, because we may have some tied parameters.
elif tied_param is not None:
# Determine the size occupied by this module + the module containing the tied parameter
tied_module_size = module_size
tied_module_index = [i for i, (n, _) in enumerate(modules_to_treat) if n in tied_param][0]
tied_module_name, tied_module = modules_to_treat[tied_module_index]
tied_module_size += module_sizes[tied_module_name] - module_sizes[tied_param]
if current_max_size is not None and current_memory_used + tied_module_size > current_max_size:
# Split or not split?
tied_module_children = list(tied_module.named_children())
if len(tied_module_children) == 0 or tied_module.__class__.__name__ in no_split_module_classes:
# If the tied module is not split, we go to the next device
current_device += 1
modules_to_treat = [(name, module)] + modules_to_treat
current_memory_used = 0
else:
# Otherwise, we replace the tied module by its children.
tied_module_children = list(tied_module.named_parameters(recurse=False)) + tied_module_children
tied_module_children = [(f"{tied_module_name}.{n}", v) for n, v in tied_module_children]
modules_to_treat = (
[(name, module)]
+ modules_to_treat[:tied_module_index]
+ tied_module_children
+ modules_to_treat[tied_module_index + 1 :]
)
# Update the max layer size.
max_layer_size, max_layer_names = get_max_layer_size(
[(n, m) for n, m in modules_to_treat if isinstance(m, torch.nn.Module)],
module_sizes,
no_split_module_classes,
)
else:
# We really really fit!
current_memory_used += tied_module_size
device_map[name] = devices[current_device]
modules_to_treat.pop(tied_module_index)
device_map[tied_module_name] = devices[current_device]
else:
current_memory_used += module_size
device_map[name] = devices[current_device]
return clean_device_map(device_map)
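# Illustrative sketch (not part of the original file): computing a device map for a tiny
# model under artificial memory limits to show the GPU -> CPU -> disk fallback order.
# The limits are made up and device key 0 is treated as a GPU.
def _example_infer_auto_device_map():
    model = nn.Sequential(nn.Linear(100, 100), nn.Linear(100, 100), nn.Linear(100, 100))
    # Each Linear(100, 100) weighs about 40KB in float32, so not everything fits under 90KB.
    device_map = infer_auto_device_map(model, max_memory={0: "90KB", "cpu": "1GB"})
    assert set(device_map) == {"0", "1", "2"}
    assert set(device_map.values()) <= {0, "cpu", "disk"}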
def check_device_map(model: nn.Module, device_map: Dict[str, Union[int, str, torch.device]]):
"""
Checks that a device map covers everything in a given model.
Args:
model (`torch.nn.Module`): The model to check the device map against.
device_map (`Dict[str, Union[int, str, torch.device]]`): The device map to check.
"""
all_model_tensors = [name for name, _ in model.state_dict().items()]
for module_name in device_map.keys():
all_model_tensors = [name for name in all_model_tensors if not name.startswith(module_name)]
if len(all_model_tensors) > 0:
non_covered_params = ", ".join(all_model_tensors)
raise ValueError(
f"The device_map provided does not give any device for the following parameters: {non_covered_params}"
)
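# Illustrative sketch (not part of the original file): the check passes when every
# parameter prefix is covered and raises otherwise.
def _example_check_device_map():
    model = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
    check_device_map(model, {"0": "cpu", "1": "cpu"})  # covers everything, no error
    try:
        check_device_map(model, {"0": "cpu"})  # the "1.*" parameters are not covered
    except ValueError:
        pass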
def load_checkpoint_in_model(
model: nn.Module,
checkpoint: Union[str, os.PathLike],
device_map: Optional[Dict[str, Union[int, str, torch.device]]] = None,
offload_folder: Optional[Union[str, os.PathLike]] = None,
dtype: Optional[Union[str, torch.dtype]] = None,
offload_state_dict: bool = False,
):
"""
Loads a (potentially sharded) checkpoint inside a model, potentially sending weights to a given device as they are
loaded.
<Tip warning={true}>
Once loaded across devices, you still need to call [`dispatch_model`] on your model to make it able to run. To
group the checkpoint loading and dispatch in one single call, use [`load_checkpoint_and_dispatch`].
</Tip>
Args:
model (`torch.nn.Module`): The model in which we want to load a checkpoint.
checkpoint (`str` or `os.PathLike`):
The path to the checkpoint to load. It can be:
- a path to a file containing a whole model state dict
- a path to a `.json` file containing the index to a sharded checkpoint
- a path to a folder containing a unique `.index.json` file and the shards of a checkpoint.
device_map (`Dict[str, Union[int, str, torch.device]]`, *optional*):
A map that specifies where each submodule should go. It doesn't need to be refined to each parameter/buffer
name; once a given module name is inside, every submodule of it will be sent to the same device.
offload_folder (`str` or `os.PathLike`, *optional*):
If the `device_map` contains any value `"disk"`, the folder where we will offload weights.
dtype (`str` or `torch.dtype`, *optional*):
If provided, the weights will be converted to that type when loaded.
offload_state_dict (`bool`, *optional*, defaults to `False`):
If `True`, will temporarily offload the CPU state dict on the hard drive to avoid running out of CPU RAM if
the weight of the CPU state dict + the biggest shard does not fit.
"""
if offload_folder is None and device_map is not None and "disk" in device_map.values():
raise ValueError(
"At least one of the model submodule will be offloaded to disk, please pass along an `offload_folder`."
)
elif offload_folder is not None and device_map is not None and "disk" in device_map.values():
os.makedirs(offload_folder, exist_ok=True)
if isinstance(dtype, str):
# We accept "torch.float16" or just "float16"
dtype = dtype.replace("torch.", "")
dtype = getattr(torch, dtype)
checkpoint_files = None
index_filename = None
if os.path.isfile(checkpoint):
if str(checkpoint).endswith(".json"):
index_filename = checkpoint
else:
checkpoint_files = [checkpoint]
elif os.path.isdir(checkpoint):
potential_index = [f for f in os.listdir(checkpoint) if f.endswith(".index.json")]
if len(potential_index) == 0:
raise ValueError(f"{checkpoint} is not a folder containing a `.index.json` file.")
elif len(potential_index) == 1:
index_filename = os.path.join(checkpoint, potential_index[0])
else:
raise ValueError(f"{checkpoint} containing mote than one `.index.json` file, delete the irrelevant ones.")
else:
raise ValueError(
"`checkpoint` should be the path to a file containing a whole state dict, or the index of a sharded "
f"checkpoint, or a folder containing a sharded checkpoint, but got {checkpoint}."
)
if index_filename is not None:
checkpoint_folder = os.path.split(index_filename)[0]
with open(index_filename, "r") as f:
index = json.loads(f.read())
if "weight_map" in index:
index = index["weight_map"]
checkpoint_files = sorted(list(set(index.values())))
checkpoint_files = [os.path.join(checkpoint_folder, f) for f in checkpoint_files]
# Logic for missing/unexpected keys goes here.
offload_index = {}
if offload_state_dict:
state_dict_folder = tempfile.mkdtemp()
state_dict_index = {}
for checkpoint_file in checkpoint_files:
checkpoint = torch.load(checkpoint_file)
if device_map is None:
model.load_state_dict(checkpoint, strict=False)
else:
for param_name, param in checkpoint.items():
module_name = param_name
if dtype is not None:
param = param.to(dtype)
while len(module_name) > 0 and module_name not in device_map:
module_name = ".".join(module_name.split(".")[:-1])
if module_name == "" and "" not in device_map:
# TODO: group all errors and raise at the end.
raise ValueError(f"{param_name} doesn't have any device set.")
param_device = device_map[module_name]
if param_device == "disk":
set_module_tensor_to_device(model, param_name, "meta")
offload_weight(param, param_name, offload_folder, index=offload_index)
elif param_device == "cpu" and offload_state_dict:
set_module_tensor_to_device(model, param_name, "meta")
offload_weight(param, param_name, state_dict_folder, index=state_dict_index)
else:
set_module_tensor_to_device(model, param_name, param_device, value=param)
# Force Python to clean up.
del checkpoint
gc.collect()
save_offload_index(offload_index, offload_folder)
# Load back offloaded state dict on CPU
if offload_state_dict:
load_offloaded_weights(model, state_dict_index, state_dict_folder)
shutil.rmtree(state_dict_folder)
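# Illustrative sketch (not part of the original file): saving a plain state dict and
# loading it back with an explicit device map; the file name is made up and everything
# stays on CPU.
def _example_load_checkpoint_in_model():
    import os
    import tempfile

    model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
    checkpoint_path = os.path.join(tempfile.mkdtemp(), "tiny_checkpoint.bin")
    torch.save(model.state_dict(), checkpoint_path)
    load_checkpoint_in_model(model, checkpoint_path, device_map={"": "cpu"})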


@@ -0,0 +1,157 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import os
from collections.abc import Mapping
from typing import Dict, List, Optional, Union
import numpy as np
import torch
def offload_weight(weight, weight_name, offload_folder, index=None):
array = weight.numpy()
tensor_file = os.path.join(offload_folder, f"{weight_name}.dat")
if index is not None:
index[weight_name] = {"dtype": str(array.dtype), "shape": list(array.shape)}
if array.ndim == 0:
array = array[None]
file_array = np.memmap(tensor_file, dtype=array.dtype, mode="w+", shape=array.shape)
file_array[:] = array[:]
file_array.flush()
return index
def save_offload_index(index, offload_folder):
if index is None or len(index) == 0:
# Nothing to save
return
offload_index_file = os.path.join(offload_folder, "index.json")
if os.path.isfile(offload_index_file):
with open(offload_index_file, "r", encoding="utf-8") as f:
current_index = json.load(f)
else:
current_index = {}
current_index.update(index)
with open(offload_index_file, "w", encoding="utf-8") as f:
json.dump(current_index, f, indent=2)
def offload_state_dict(save_dir: Union[str, os.PathLike], state_dict: Dict[str, torch.Tensor]):
"""
Offload a state dict in a given folder.
Args:
save_dir (`str` or `os.PathLike`): The directory in which to offload the state dict.
state_dict (`Dict[str, torch.Tensor]`): The dictionary of tensors to offload.
"""
os.makedirs(save_dir, exist_ok=True)
index = {}
for name, parameter in state_dict.items():
index = offload_weight(parameter, name, save_dir, index=index)
# Update index
save_offload_index(index, save_dir)
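# Illustrative sketch (not part of the original file): offloading a state dict writes one
# memory-mapped ".dat" file per tensor plus an "index.json" describing dtypes and shapes.
def _example_offload_state_dict():
    import tempfile

    save_dir = tempfile.mkdtemp()
    offload_state_dict(save_dir, {"weight": torch.randn(2, 3)})
    with open(os.path.join(save_dir, "index.json")) as f:
        index = json.load(f)
    assert index["weight"]["shape"] == [2, 3]
    assert os.path.isfile(os.path.join(save_dir, "weight.dat"))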
class PrefixedDataset(Mapping):
"""
Will access keys in a given dataset by adding a prefix.
Args:
dataset (`Mapping`): Any map with string keys.
prefix (`str`): A prefix to add when trying to access any element in the underlying dataset.
"""
def __init__(self, dataset: Mapping, prefix: str):
self.dataset = dataset
self.prefix = prefix
def __getitem__(self, key):
return self.dataset[f"{self.prefix}{key}"]
def __iter__(self):
return iter([key for key in self.dataset if key.startswith(self.prefix)])
def __len__(self):
return len(self.dataset)
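# Illustrative sketch (not part of the original file): looking up "weight" through the
# prefixed view resolves to "decoder.weight" in the underlying mapping (names made up).
def _example_prefixed_dataset():
    weights = {"decoder.weight": torch.zeros(2), "decoder.bias": torch.zeros(2)}
    prefixed = PrefixedDataset(weights, "decoder.")
    assert torch.equal(prefixed["weight"], weights["decoder.weight"])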
class OffloadedWeightsLoader(Mapping):
"""
A collection that loads weights stored in a given state dict or memory-mapped on disk.
Args:
state_dict (`Dict[str, torch.Tensor]`, *optional*):
A dictionary mapping parameter names to tensors.
save_folder (`str` or `os.PathLike`, *optional*):
The directory in which the weights are stored (by `offload_state_dict` for instance).
index (`Dict`, *optional*):
A dictionary from weight name to their information (`dtype` and `shape`). Will default to the index saved
in `save_folder`.
"""
def __init__(
self,
state_dict: Dict[str, torch.Tensor] = None,
save_folder: Optional[Union[str, os.PathLike]] = None,
index: Mapping = None,
):
if state_dict is None and save_folder is None:
raise ValueError("Need either a `state_dict` or a `save_folder` containing offloaded weights.")
self.state_dict = {} if state_dict is None else state_dict
self.save_folder = save_folder
if index is None and save_folder is not None:
with open(os.path.join(save_folder, "index.json")) as f:
index = json.load(f)
self.index = {} if index is None else index
self.all_keys = list(self.state_dict.keys())
self.all_keys.extend([key for key in self.index if key not in self.all_keys])
def __getitem__(self, key: str):
# State dict gets priority
if key in self.state_dict:
return self.state_dict[key]
weight_info = self.index[key]
weight_file = os.path.join(self.save_folder, f"{key}.dat")
shape = tuple(weight_info["shape"])
if shape == ():
weight = np.memmap(weight_file, dtype=weight_info["dtype"], shape=(1,), mode="r")[0]
else:
weight = np.memmap(weight_file, dtype=weight_info["dtype"], shape=shape, mode="r")
return torch.tensor(weight)
def __iter__(self):
return iter(self.all_keys)
def __len__(self):
return len(self.all_keys)
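# Illustrative sketch (not part of the original file): combining an in-memory state dict
# with weights offloaded to disk; keys found in `state_dict` take priority over the index.
def _example_offloaded_weights_loader():
    import tempfile

    save_dir = tempfile.mkdtemp()
    offload_state_dict(save_dir, {"disk_weight": torch.randn(4)})
    loader = OffloadedWeightsLoader(state_dict={"ram_weight": torch.ones(2)}, save_folder=save_dir)
    assert sorted(loader) == ["disk_weight", "ram_weight"]
    assert loader["disk_weight"].shape == (4,)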
def extract_submodules_state_dict(state_dict: Dict[str, torch.Tensor], submodule_names: List[str]):
"""
Extract the sub state-dict corresponding to a list of given submodules.
Args:
state_dict (`Dict[str, torch.Tensor]`): The state dict to extract from.
submodule_names (`List[str]`): The list of submodule names we want to extract.
"""
result = {}
for module_name in submodule_names:
result.update({key: param for key, param in state_dict.items() if key.startswith(module_name)})
return result
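# Illustrative sketch (not part of the original file): keeping only the entries of a
# state dict that belong to the listed submodules (names made up).
def _example_extract_submodules_state_dict():
    state_dict = {"encoder.weight": torch.ones(1), "decoder.weight": torch.zeros(1)}
    extracted = extract_submodules_state_dict(state_dict, ["decoder"])
    assert list(extracted) == ["decoder.weight"]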


@@ -0,0 +1,511 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
A set of basic tensor ops compatible with TPU, GPU, and multi-GPU setups.
"""
from functools import update_wrapper
from typing import Any, Mapping
import torch
from torch.distributed import ReduceOp
from ..state import AcceleratorState
from .dataclasses import DistributedType, TensorInformation
from .imports import is_tpu_available
from .versions import is_torch_version
if is_tpu_available():
import torch_xla.core.xla_model as xm
def is_torch_tensor(tensor):
return isinstance(tensor, torch.Tensor)
def is_tensor_information(tensor_info):
return isinstance(tensor_info, TensorInformation)
def honor_type(obj, generator):
"""
Cast a generator to the same type as obj (list, tuple or namedtuple)
"""
try:
return type(obj)(generator)
except TypeError:
# Some objects may not be able to instantiate from a generator directly
return type(obj)(*list(generator))
def recursively_apply(func, data, *args, test_type=is_torch_tensor, error_on_other_type=False, **kwargs):
"""
Recursively apply a function on a data structure that is a nested list/tuple/dictionary of a given base type.
Args:
func (`callable`):
The function to recursively apply.
data (nested list/tuple/dictionary of `main_type`):
The data on which to apply `func`
*args:
Positional arguments that will be passed to `func` when applied on the unpacked data.
test_type (`callable`, *optional*, defaults to `is_torch_tensor`):
A function taking an object as input and returning whether `func` should be applied to it.
error_on_other_type (`bool`, *optional*, defaults to `False`):
Whether to raise an error if, after unpacking `data`, we encounter an object that does not pass `test_type`.
If `False`, the function will leave objects that don't pass `test_type` unchanged.
**kwargs:
Keyword arguments that will be passed to `func` when applied on the unpacked data.
Returns:
The same data structure as `data` with `func` applied to every object that passes `test_type`.
"""
if isinstance(data, (tuple, list)):
return honor_type(
data,
(
recursively_apply(
func, o, *args, test_type=test_type, error_on_other_type=error_on_other_type, **kwargs
)
for o in data
),
)
elif isinstance(data, Mapping):
return type(data)(
{
k: recursively_apply(
func, v, *args, test_type=test_type, error_on_other_type=error_on_other_type, **kwargs
)
for k, v in data.items()
}
)
elif test_type(data):
return func(data, *args, **kwargs)
elif error_on_other_type:
raise TypeError(
f"Can't apply {func.__name__} on object of type {type(data)}, only of nested list/tuple/dicts of objects "
f"that satisfy {test_type.__name__}."
)
return data
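# Illustrative sketch (not part of the original file): `recursively_apply` only touches
# the leaves that pass `test_type` and leaves everything else as is.
def _example_recursively_apply():
    data = {"a": torch.ones(2), "b": [torch.zeros(3), "keep me"]}
    doubled = recursively_apply(lambda t: t * 2, data)
    assert torch.equal(doubled["a"], torch.full((2,), 2.0))
    assert doubled["b"][1] == "keep me"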
def send_to_device(tensor, device):
"""
Recursively sends the elements in a nested list/tuple/dictionary of tensors to a given device.
Args:
tensor (nested list/tuple/dictionary of `torch.Tensor`):
The data to send to a given device.
device (`torch.device`):
The device to send the data to.
Returns:
The same data structure as `tensor` with all tensors sent to the proper device.
"""
def _send_to_device(t, device):
return t.to(device)
def _has_to_method(t):
return hasattr(t, "to")
return recursively_apply(_send_to_device, tensor, device, test_type=_has_to_method)
def get_data_structure(data):
"""
Recursively gathers the information needed to rebuild a nested list/tuple/dictionary of tensors.
Args:
data (nested list/tuple/dictionary of `torch.Tensor`):
The data to analyze.
Returns:
The same data structure as `data` with [`~utils.TensorInformation`] instead of tensors.
"""
def _get_data_structure(tensor):
return TensorInformation(shape=tensor.shape, dtype=tensor.dtype)
return recursively_apply(_get_data_structure, data)
def initialize_tensors(data_structure):
"""
Recursively initializes tensors from a nested list/tuple/dictionary of [`~utils.TensorInformation`].
Returns:
The same data structure as `data` with tensors instead of [`~utils.TensorInformation`].
"""
def _initialize_tensor(tensor_info):
return torch.empty(*tensor_info.shape, dtype=tensor_info.dtype)
return recursively_apply(_initialize_tensor, data_structure, test_type=is_tensor_information)
def find_batch_size(data):
"""
Recursively finds the batch size in a nested list/tuple/dictionary of lists of tensors.
Args:
data (nested list/tuple/dictionary of `torch.Tensor`): The data from which to find the batch size.
Returns:
`int`: The batch size.
"""
if isinstance(data, (tuple, list)):
return find_batch_size(data[0])
elif isinstance(data, Mapping):
for k in data.keys():
return find_batch_size(data[k])
elif not isinstance(data, torch.Tensor):
raise TypeError(f"Can only find the batch size of tensors but got {type(data)}.")
return data.shape[0]
def _tpu_gather(tensor, name="gather tensor"):
if isinstance(tensor, (list, tuple)):
return honor_type(tensor, (_tpu_gather(t, name=f"{name}_{i}") for i, t in enumerate(tensor)))
elif isinstance(tensor, Mapping):
return type(tensor)({k: _tpu_gather(v, name=f"{name}_{k}") for k, v in tensor.items()})
elif not isinstance(tensor, torch.Tensor):
raise TypeError(f"Can't gather the values of type {type(tensor)}, only of nested list/tuple/dicts of tensors.")
if tensor.ndim == 0:
tensor = tensor.clone()[None]
return xm.mesh_reduce(name, tensor, torch.cat)
def _gpu_gather(tensor):
def _gpu_gather_one(tensor):
if tensor.ndim == 0:
tensor = tensor.clone()[None]
output_tensors = [tensor.clone() for _ in range(torch.distributed.get_world_size())]
torch.distributed.all_gather(output_tensors, tensor)
return torch.cat(output_tensors, dim=0)
return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
_cpu_gather = _gpu_gather
def gather(tensor):
"""
Recursively gather tensor in a nested list/tuple/dictionary of tensors from all devices.
Args:
tensor (nested list/tuple/dictionary of `torch.Tensor`):
The data to gather.
Returns:
The same data structure as `tensor` with all the tensors gathered from all processes.
"""
if AcceleratorState().distributed_type == DistributedType.TPU:
return _tpu_gather(tensor, name="accelerate.utils.gather")
elif AcceleratorState().distributed_type in [DistributedType.DEEPSPEED, DistributedType.MULTI_GPU]:
return _gpu_gather(tensor)
elif AcceleratorState().distributed_type == DistributedType.MULTI_CPU:
return _cpu_gather(tensor)
else:
return tensor
def _gpu_gather_object(object: Any):
def _gpu_gather_object_one(object: Any):
output_objects = [None for _ in range(AcceleratorState().num_processes)]
torch.distributed.all_gather_object(output_objects, object)
return output_objects
return recursively_apply(_gpu_gather_object_one, object)
_cpu_gather_object = _gpu_gather_object
def gather_object(object: Any):
"""
Recursively gather object in a nested list/tuple/dictionary of objects from all devices.
Args:
object (nested list/tuple/dictionary of picklable object):
The data to gather.
Returns:
The same data structure as `object` with all the objects sent to every device.
"""
if AcceleratorState().distributed_type == DistributedType.TPU:
raise NotImplementedError("gather objects in TPU is not supported")
elif AcceleratorState().distributed_type in [DistributedType.DEEPSPEED, DistributedType.MULTI_GPU]:
return _gpu_gather_object(object)
elif AcceleratorState().distributed_type == DistributedType.MULTI_CPU:
return _cpu_gather_object(object)
else:
return object
def _gpu_broadcast(data, src=0):
def _gpu_broadcast_one(tensor, src=0):
torch.distributed.broadcast(tensor, src=src)
return tensor
return recursively_apply(_gpu_broadcast_one, data, error_on_other_type=True, src=src)
def _tpu_broadcast(tensor, src=0, name="broadcast tensor"):
if isinstance(tensor, (list, tuple)):
return honor_type(tensor, (_tpu_broadcast(t, name=f"{name}_{i}") for i, t in enumerate(tensor)))
elif isinstance(tensor, Mapping):
return type(tensor)({k: _tpu_broadcast(v, name=f"{name}_{k}") for k, v in tensor.items()})
return xm.mesh_reduce(name, tensor, lambda x: x[src])
def broadcast(tensor, from_process: int = 0):
"""
Recursively broadcast tensor in a nested list/tuple/dictionary of tensors to all devices.
Args:
tensor (nested list/tuple/dictionary of `torch.Tensor`):
The data to broadcast.
from_process (`int`, *optional*, defaults to 0):
The process from which to send the data.
Returns:
The same data structure as `tensor` with all tensors broadcast to the proper device.
"""
if AcceleratorState().distributed_type == DistributedType.TPU:
return _tpu_broadcast(tensor, src=from_process, name="accelerate.utils.broadcast")
elif AcceleratorState().distributed_type in [DistributedType.DEEPSPEED, DistributedType.MULTI_GPU]:
return _gpu_broadcast(tensor, src=from_process)
elif AcceleratorState().distributed_type == DistributedType.MULTI_CPU:
return _gpu_broadcast(tensor, src=from_process)
else:
return tensor
def broadcast_object_list(object_list, from_process: int = 0):
"""
Broadcast a list of picklable objects from one process to the others.
Args:
object_list (list of picklable objects):
The list of objects to broadcast. This list will be modified in place.
from_process (`int`, *optional*, defaults to 0):
The process from which to send the data.
Returns:
The same list containing the objects from process 0.
"""
if AcceleratorState().distributed_type == DistributedType.TPU:
for i, obj in enumerate(object_list):
object_list[i] = xm.mesh_reduce("accelerate.utils.broadcast_object_list", obj, lambda x: x[from_process])
elif AcceleratorState().distributed_type in [DistributedType.DEEPSPEED, DistributedType.MULTI_GPU]:
torch.distributed.broadcast_object_list(object_list, src=from_process)
elif AcceleratorState().distributed_type == DistributedType.MULTI_CPU:
torch.distributed.broadcast_object_list(object_list, src=from_process)
return object_list
def slice_tensors(data, tensor_slice):
"""
Recursively takes a slice in a nested list/tuple/dictionary of tensors.
Args:
data (nested list/tuple/dictionary of `torch.Tensor`):
The data to slice.
tensor_slice (`slice`):
The slice to take.
Returns:
The same data structure as `data` with all the tensors sliced.
"""
def _slice_tensor(tensor, tensor_slice):
return tensor[tensor_slice]
return recursively_apply(_slice_tensor, data, tensor_slice)
def concatenate(data, dim=0):
"""
Recursively concatenate the tensors in a nested list/tuple/dictionary of lists of tensors with the same shape.
Args:
data (nested list/tuple/dictionary of lists of tensors `torch.Tensor`):
The data to concatenate.
dim (`int`, *optional*, defaults to 0):
The dimension on which to concatenate.
Returns:
The same data structure as `data` with all the tensors concatenated.
"""
if isinstance(data[0], (tuple, list)):
return honor_type(data[0], (concatenate([d[i] for d in data], dim=dim) for i in range(len(data[0]))))
elif isinstance(data[0], Mapping):
return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()})
elif not isinstance(data[0], torch.Tensor):
raise TypeError(f"Can only concatenate tensors but got {type(data[0])}")
return torch.cat(data, dim=dim)
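# Illustrative sketch (not part of the original file): matching entries across several
# structures are concatenated tensor by tensor along `dim`.
def _example_concatenate():
    batches = [{"x": torch.ones(2, 3)}, {"x": torch.zeros(4, 3)}]
    merged = concatenate(batches)
    assert merged["x"].shape == (6, 3)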
def pad_across_processes(tensor, dim=0, pad_index=0, pad_first=False):
"""
Recursively pad the tensors in a nested list/tuple/dictionary of tensors from all devices to the same size so they
can safely be gathered.
Args:
tensor (nested list/tuple/dictionary of `torch.Tensor`):
The data to pad before gathering.
dim (`int`, *optional*, defaults to 0):
The dimension on which to pad.
pad_index (`int`, *optional*, defaults to 0):
The value with which to pad.
pad_first (`bool`, *optional*, defaults to `False`):
Whether to pad at the beginning or the end.
"""
def _pad_across_processes(tensor, dim=0, pad_index=0, pad_first=False):
if dim >= len(tensor.shape):
return tensor
# Gather all sizes
size = torch.tensor(tensor.shape, device=tensor.device)[None]
sizes = gather(size).cpu()
# Then pad to the maximum size
max_size = max(s[dim] for s in sizes)
if max_size == tensor.shape[dim]:
return tensor
old_size = tensor.shape
new_size = list(old_size)
new_size[dim] = max_size
new_tensor = tensor.new_zeros(tuple(new_size)) + pad_index
if pad_first:
indices = tuple(
slice(max_size - old_size[dim], max_size) if i == dim else slice(None) for i in range(len(new_size))
)
else:
indices = tuple(slice(0, old_size[dim]) if i == dim else slice(None) for i in range(len(new_size)))
new_tensor[indices] = tensor
return new_tensor
return recursively_apply(
_pad_across_processes, tensor, error_on_other_type=True, dim=dim, pad_index=pad_index, pad_first=pad_first
)
def reduce(tensor, reduction="mean"):
"""
Recursively reduce the tensors in a nested list/tuple/dictionary of lists of tensors across all processes using a
given operation.
Args:
tensor (nested list/tuple/dictionary of `torch.Tensor`):
The data to reduce.
reduction (`str`, *optional*, defaults to `"mean"`):
A reduction method. Can be one of "mean", "sum", or "none".
Returns:
The same data structure as `data` with all the tensors reduced.
"""
def _reduce_across_processes(tensor, reduction="mean"):
state = AcceleratorState()
cloned_tensor = tensor.clone()
if state.distributed_type == DistributedType.TPU:
xm.all_reduce("sum", cloned_tensor)
return cloned_tensor
elif state.distributed_type in [DistributedType.DEEPSPEED, DistributedType.MULTI_GPU]:
torch.distributed.reduce(cloned_tensor, ReduceOp.SUM)
return cloned_tensor
else:
if reduction == "sum":
return cloned_tensor.sum()
else:
return cloned_tensor.mean()
return recursively_apply(_reduce_across_processes, tensor, error_on_other_type=True, reduction=reduction)
def convert_to_fp32(tensor):
"""
Recursively converts the elements of a nested list/tuple/dictionary of tensors from FP16/BF16 precision to FP32.
Args:
tensor (nested list/tuple/dictionary of `torch.Tensor`):
The data to convert from FP16/BF16 to FP32.
Returns:
The same data structure as `tensor` with all tensors that were in FP16/BF16 precision converted to FP32.
"""
def _convert_to_fp32(tensor):
return tensor.float()
def _is_fp16_bf16_tensor(tensor):
return hasattr(tensor, "dtype") and (
tensor.dtype == torch.float16 or (is_torch_version(">=", "1.10") and tensor.dtype == torch.bfloat16)
)
return recursively_apply(_convert_to_fp32, tensor, test_type=_is_fp16_bf16_tensor)
class ConvertOutputsToFp32:
"""
Decorator to apply to a function outputting tensors (like a model forward pass) that ensures the outputs in FP16
precision will be converted back to FP32.
Use a class instead of a decorator because otherwise, the prepared model can no longer be pickled (issue #273).
Args:
model_forward (`Callable`):
The function whose outputs we want to convert.
Returns:
The same function as `model_forward` but with converted outputs.
"""
def __init__(self, model_forward):
self.model_forward = model_forward
update_wrapper(self, model_forward)
def __call__(self, *args, **kwargs):
return convert_to_fp32(self.model_forward(*args, **kwargs))
convert_outputs_to_fp32 = ConvertOutputsToFp32
def find_device(data):
"""
Finds the device on which a nested dict/list/tuple of tensors lies (assuming they are all on the same device).
Args:
data (nested list/tuple/dictionary of `torch.Tensor`): The data we want to know the device of.
"""
if isinstance(data, Mapping):
for obj in data.values():
device = find_device(obj)
if device is not None:
return device
elif isinstance(data, (tuple, list)):
for obj in data:
device = find_device(obj)
if device is not None:
return device
elif isinstance(data, torch.Tensor):
return data.device


@@ -0,0 +1,156 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from contextlib import contextmanager
from pathlib import Path
import torch
from ..commands.config.cluster import ClusterConfig
from ..commands.config.config_args import default_json_config_file
from ..state import AcceleratorState
from .dataclasses import DistributedType
from .imports import is_deepspeed_available, is_tpu_available
if is_deepspeed_available():
from deepspeed import DeepSpeedEngine
if is_tpu_available():
import torch_xla.core.xla_model as xm
def extract_model_from_parallel(model):
"""
Extract a model from its distributed containers.
Args:
model (`torch.nn.Module`): The model to extract.
Returns:
`torch.nn.Module`: The extracted model.
"""
options = (torch.nn.parallel.DistributedDataParallel, torch.nn.DataParallel)
if is_deepspeed_available():
options += (DeepSpeedEngine,)
while isinstance(model, options):
model = model.module
return model
def wait_for_everyone():
"""
Introduces a blocking point in the script, making sure all processes have reached this point before continuing.
<Tip warning={true}>
Make sure all processes will reach this instruction, otherwise one of your processes will hang forever.
</Tip>
"""
if (
AcceleratorState().distributed_type == DistributedType.MULTI_GPU
or AcceleratorState().distributed_type == DistributedType.MULTI_CPU
or AcceleratorState().distributed_type == DistributedType.DEEPSPEED
):
torch.distributed.barrier()
elif AcceleratorState().distributed_type == DistributedType.TPU:
xm.rendezvous("accelerate.utils.wait_for_everyone")
def save(obj, f):
"""
Save the data to disk. Use in place of `torch.save()`.
Args:
obj: The data to save
f: The file (or file-like object) to use to save the data
"""
if AcceleratorState().distributed_type == DistributedType.TPU:
xm.save(obj, f)
elif AcceleratorState().local_process_index == 0:
torch.save(obj, f)
@contextmanager
def patch_environment(**kwargs):
"""
A context manager that will add each keyword argument passed to it to `os.environ` and remove them when exiting.
Will convert the values in `kwargs` to strings and upper-case all the keys.
"""
for key, value in kwargs.items():
os.environ[key.upper()] = str(value)
yield
for key in kwargs:
del os.environ[key.upper()]
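# Illustrative sketch (not part of the original file): keys are upper-cased, values are
# stringified, and the variables disappear again when the context exits (this assumes
# MASTER_PORT is not already set in the environment).
def _example_patch_environment():
    with patch_environment(master_port=29501):
        assert os.environ["MASTER_PORT"] == "29501"
    assert "MASTER_PORT" not in os.environ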
def get_pretty_name(obj):
"""
Gets a pretty name from `obj`.
"""
if not hasattr(obj, "__qualname__") and not hasattr(obj, "__name__"):
obj = getattr(obj, "__class__", obj)
if hasattr(obj, "__qualname__"):
return obj.__qualname__
if hasattr(obj, "__name__"):
return obj.__name__
return str(obj)
def write_basic_config(mixed_precision="no", save_location: str = default_json_config_file):
"""
Creates and saves a basic cluster config to be used on a local machine with potentially multiple GPUs. Will also
set CPU if it is a CPU-only machine.
Args:
mixed_precision (`str`, *optional*, defaults to "no"):
Mixed Precision to use. Should be one of "no", "fp16", or "bf16"
save_location (`str`, *optional*, defaults to `default_json_config_file`):
Optional custom save location. Should be passed to `--config_file` when using `accelerate launch`. Default
location is inside the huggingface cache folder (`~/.cache/huggingface`) but can be overridden by setting the
`HF_HOME` environment variable, followed by `accelerate/default_config.yaml`.
"""
path = Path(save_location)
path.parent.mkdir(parents=True, exist_ok=True)
if path.exists():
print(
f"Configuration already exists at {save_location}, will not override. Run `accelerate config` manually or pass a different `save_location`."
)
return
mixed_precision = mixed_precision.lower()
if mixed_precision not in ["no", "fp16", "bf16"]:
raise ValueError(f"`mixed_precision` should be one of 'no', 'fp16', or 'bf16'. Received {mixed_precision}")
config = {"compute_environment": "LOCAL_MACHINE", "mixed_precision": mixed_precision}
if torch.cuda.is_available():
num_gpus = torch.cuda.device_count()
config["num_processes"] = num_gpus
config["use_cpu"] = False
if num_gpus > 1:
config["distributed_type"] = "MULTI_GPU"
else:
config["distributed_type"] = "NO"
else:
num_gpus = 0
config["use_cpu"] = True
config["num_processes"] = 1
config["distributed_type"] = "NO"
if not path.exists():
config = ClusterConfig(**config)
config.to_json_file(path)


@@ -0,0 +1,87 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import random
from typing import List, Optional, Union
import numpy as np
import torch
from ..state import AcceleratorState
from .dataclasses import DistributedType, RNGType
from .imports import is_tpu_available
if is_tpu_available():
import torch_xla.core.xla_model as xm
def set_seed(seed: int, device_specific: bool = False):
"""
Helper function for reproducible behavior to set the seed in `random`, `numpy`, `torch`.
Args:
seed (`int`): The seed to set.
device_specific (`bool`, *optional*, defaults to `False`):
Whether to offset the seed slightly on each device using `AcceleratorState().process_index`.
"""
if device_specific:
seed += AcceleratorState().process_index
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
# ^^ safe to call this function even if cuda is not available
if is_tpu_available():
xm.set_rng_state(seed)
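# Illustrative sketch (not part of the original file): seeding makes subsequent draws
# reproducible across the Python, NumPy, and torch RNGs.
def _example_set_seed():
    set_seed(42)
    first_draw = torch.rand(1)
    set_seed(42)
    assert torch.equal(first_draw, torch.rand(1))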
def synchronize_rng_state(rng_type: Optional[RNGType] = None, generator: Optional[torch.Generator] = None):
# Get the proper rng state
if rng_type == RNGType.TORCH:
rng_state = torch.get_rng_state()
elif rng_type == RNGType.CUDA:
rng_state = torch.cuda.get_rng_state()
elif rng_type == RNGType.XLA:
assert is_tpu_available(), "Can't synchronize XLA seeds on an environment without TPUs."
rng_state = torch.tensor(xm.get_rng_state())
elif rng_type == RNGType.GENERATOR:
assert generator is not None, "Need a generator to synchronize its seed."
rng_state = generator.get_state()
# Broadcast the rng state from device 0 to other devices
state = AcceleratorState()
if state.distributed_type == DistributedType.TPU:
rng_state = xm.mesh_reduce("random_seed", rng_state, lambda x: x[0])
elif state.distributed_type in [DistributedType.DEEPSPEED, DistributedType.MULTI_GPU]:
rng_state = rng_state.to(state.device)
torch.distributed.broadcast(rng_state, 0)
rng_state = rng_state.cpu()
elif state.distributed_type == DistributedType.MULTI_CPU:
torch.distributed.broadcast(rng_state, 0)
# Set the broadcast rng state
if rng_type == RNGType.TORCH:
torch.set_rng_state(rng_state)
elif rng_type == RNGType.CUDA:
torch.cuda.set_rng_state(rng_state)
elif rng_type == RNGType.XLA:
xm.set_rng_state(rng_state.item())
elif rng_type == RNGType.GENERATOR:
generator.set_state(rng_state)
def synchronize_rng_states(rng_types: List[Union[str, RNGType]], generator: Optional[torch.Generator] = None):
for rng_type in rng_types:
synchronize_rng_state(RNGType(rng_type), generator=generator)


@@ -0,0 +1,61 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
from typing import Union
from packaging.version import Version, parse
from .constants import STR_OPERATION_TO_FUNC
if sys.version_info < (3, 8):
import importlib_metadata
else:
import importlib.metadata as importlib_metadata
torch_version = parse(importlib_metadata.version("torch"))
def compare_versions(library_or_version: Union[str, Version], operation: str, requirement_version: str):
"""
Compares a library version to some requirement using a given operation.
Args:
library_or_version (`str` or `packaging.version.Version`):
A library name or a version to check.
operation (`str`):
A string representation of an operator, such as `">"` or `"<="`.
requirement_version (`str`):
The version to compare the library version against.
"""
if operation not in STR_OPERATION_TO_FUNC.keys():
raise ValueError(f"`operation` must be one of {list(STR_OPERATION_TO_FUNC.keys())}, received {operation}")
operation = STR_OPERATION_TO_FUNC[operation]
if isinstance(library_or_version, str):
library_or_version = parse(importlib_metadata.version(library_or_version))
return operation(library_or_version, parse(requirement_version))
def is_torch_version(operation: str, version: str):
"""
Compares the current PyTorch version to a given reference with an operation.
Args:
operation (`str`):
A string representation of an operator, such as `">"` or `"<="`
version (`str`):
A string version of PyTorch
"""
return compare_versions(torch_version, operation, version)
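# Illustrative sketch (not part of the original file): comparing an installed library's
# version against a requirement, with `is_torch_version` as the PyTorch shortcut.
def _example_version_checks():
    assert compare_versions("torch", ">=", "1.0.0")
    if is_torch_version(">=", "1.10"):
        pass  # e.g. bfloat16 tensors are supported here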

Some files were not shown because too many files have changed in this diff.