36 Commits

Author SHA1 Message Date
df0c1870d9 Bump to python3.10 + update linter (#3809)
* py310 and some changes

* fix

* better
2025-10-10 18:22:51 +02:00
23cf4ef8a3 Fix tests (#3722)
* fix tests

* fix skorch tests

* fix deepspeed

* pin torch as compile tests don't pass and create segmentation fault

* skip compile tests

* fix

* forgot v ...

* style
2025-08-07 16:59:29 +02:00
348aabaaaf Update Gaudi runner image to latest SynapseAI and enable previously disabled tests (#3653)
* update synapse and add tp tests

* only skip regional compile speedup check

* pass sdp test on hpu
2025-07-16 14:33:36 +02:00
14fc61eeac Bump to 1.6.0.dev0 2025-03-12 10:13:18 -04:00
d9e6af8773 HPU support (#3378)
* init

* style

* is_hpu_available

* fix

* import habana_frameworks.torch.distributed.hccl

* style

* test

* initialize dist proc group

* revert

* set backend to hccl only if hccl initialization sets a local rank

* force backend hccl and multi_hpu type when sure of distributed launch

* style

* pass accelerator tests

* pas big modeling tests with bigger atol/rtol for accelerators

* fix hpu device count and skip tests requiring hpu:x

* hpu autocast

* hpu rng_state

* hpu launch

* hpu special device placement

* hpu launch

* rng state

* distributed data loop tests

* enforce non contiguity after device memory allocation

* pass fsdp tests

* enforce pt_hpu_lazy_mode=0 when fsdp testing

* pass cli tests

* pass and document grad sync tests

* pass kwargs handler and autocast tests

* memory utils

* found source of int64 errors

* skip some modeling utils tests

* enable int64

* skip optimizer tests

* pass checkpointing tests

* pass accelerator tests with safetensors main

* more hpu stuff

* style

* remove PT_HPU_LAZY_MODE and PT_ENABLE_INT64_SUPPORT as they should be in the testing environment

* start testing on gaudi2

* support fp16 on gaudi2

* add testing order

* custom hpu fsdp env dict

* fix torch trace malloc

* test ddp half precision comm hooks

* fix

* fix

* remove lower bound for hpu

* use 0.72 as lower bound

* lower lower bound

* order deepspeed tests

* fix

* deepspeed_use_hpu

* assert non lazy mode with offloaded optimizer

* make patching torch with habana frameworks the default

* less of require_non_hpu

* skip test_multi_device_merge_fsdp_weights for now as it halts

* skip another flaky test

* format

* use habana_visible_modules

* patch torch hpu device count

* avoid setting HABANA_VISIBLE_MODULES

* don't play with habana visible devices/modules

* only with hpu

* fixes and skips

* skip

* fix device ids and add some todos

* skip offloading with generate()

* fix

* reduced atol/rtol for hpu

* fix

* tag deepspeed tests that should run first

* enable a test path that was skipped

* revert a test that was customized for gaudi1

* some patching to enable HABANA_VISIBLE_MODULES

* fix zero3 test

* misc

* test DTensor TP

* remove gaudi1

* test

* style

* comment

* pass pad_across_processes

* require_fp16

* pass memory utils test

* test_ddp_comm_hook

* skip half precision comm hooks on hpu

* fix

* is_fp16_available

* fp16

* tp as part of integration tests

* fix

* write_basic_config

* safetensors

* local sgd and masked_fill_fwd_i64

* fix num_processes in test_load_states_by_steps

* fp8 support

* test

* fix

* add a workflow

* Update src/accelerate/accelerator.py

* review comments

* ci

* style

* comments

* test

* habana_frameworks.torch

* patch device count

* fix

* fix

* require_fp8

* fix

* fix

* gaudi 1

* remove unnecessary

* fixed maskd fill error in transformers

* style

* balanced_memory pass on hpu

* remove for now

* run first

* Apply suggestions from code review

* style after merge

* Update src/accelerate/accelerator.py

Co-authored-by: Zach Mueller <muellerzr@gmail.com>

* Update src/accelerate/utils/transformer_engine.py

Co-authored-by: Zach Mueller <muellerzr@gmail.com>

* empty cache review comments

* test_scirpt.py error messages

* AccelerateTestCase for accelerator state cleanup

* test

* add gaudi1 workflow

* fp8 avilability

* fix

* reduce batch size

* concurrency

* check cuda as well

* nits and comments

* mark fsdp tests that require_fp16

* style

* mark deepspeed fp16 tests

* update image

* fix

* updated

* better msgs

* skip pippy

* test

* test on 2 device

* support up to 1% relative error in test_accelerate

* skip hpu fp16

* allow for 1 byte differene

* revert torch_device change

* style

* skip memory release since it's flaky

* add accelerator state cleanup to fixture

* fix

* atol

* fix

* more rtol

* equal grad test

* revert

* pass pippy on gaudi2 and skip on gaudi1

* enable sd 1.5 test with require fp16

* added warning on memory release

* don't log warning in memory release as it requires PartialState to be initialized

* Apply suggestions from code review

---------

Co-authored-by: Zach Mueller <muellerzr@gmail.com>
2025-03-11 11:16:57 -04:00
65356780d4 [Dev] Update release directions (#3352)
* Update release directions

* Update directions and makefile to account for testpypi fun
2025-01-21 08:59:43 -05:00
7a2feecad4 Add copyright + some ruff lint things (#2523)
* Copyright and ruff stuff

* lol
2024-03-04 09:14:31 -05:00
82a1258ffc Remove offline stuff (#2509)
* Better check

* Fully remove

* Trail
2024-02-29 09:17:37 -05:00
21b225e8d5 Check if hub down (#2506)
* Let's try it out

* Let's try this out

* Some more cases

* String

* Require hub online for estimator

* Add CI checker to alert on hub status

* Format

* Oops death by ctrl z

* Fix import
2024-02-28 18:56:37 -05:00
0e1ee4b92d Use Ruff for formatting too (#2400)
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
2024-02-06 08:18:18 -05:00
f4b411f84b Fix CI due to pytest (#2408)
* New makefile

* Big modeling, oops
2024-02-01 12:28:10 -05:00
f0029d6f60 Fix tests not being ran on multi-GPU nightly (#1558)
* Fix tests not being ran

* More tests
2023-06-07 15:14:02 -04:00
a28491bc24 Let quality yell at the user if it's a version difference (#1438)
* Let quality yell at the user if it's a version difference

* Also include in style
2023-05-16 09:30:08 -04:00
435079aafb Improve Slack Updater (#1433)
* Update log_reports to send to slack

* REVERT this change, just for testing!

* Add slack_sdk dep

* Second one

* Try now?

* Remove len

* Need secret

* Try with new version

* Right boldface

* Fix import

* New format, use tabulate

* Add tabulate to yml

* Quality

* Purposefully fail

* Working updater, now to test

* Int

* Print payload

* Append

* Change maxcolwidth

* Offset

* More offset

* Context

* No max width

* gh format

* max-col-width'

* Reduce max

* Non-working tables

* Rm md report

* Try now

* Try with just count

* Use table

* New version

* Use table

* Try with thread

* Should be working now

* Clean

* Fixup test reports fully

* Revert workflow

* Keep tabulate in workflow ci

* Update other workflows

* Use blocks for better formatting

* ONe more test

* Works as expected
2023-05-16 09:08:10 -04:00
5002e56704 Update quality tools to 2023 (#1046)
* Setup 2023 tooling for quality

* Result of styling

* Simplify inits and remove isort and flake8 from doc

* Puts back isort skip flag
2023-02-07 13:34:05 -05:00
f3f2f9e4b5 in sync with trfs, removing style_doc utils and using doc-builder instead (#988) 2023-01-19 19:24:44 +05:30
49cd8d37e6 Fix all github actions issues + depreciations (#773)
* Fix all github actions issues + depreciations
2022-10-18 12:27:05 -04:00
9114fb09d5 Regression cli tests (#772)
* New cli tests

* Add CLI testing

* Makefile + tests

* Segment out CLI in makefile better
2022-10-18 11:07:36 -04:00
efb33d67ea Update runners with report structure, adjust env variable (#704)
* Fixup rest of the runners

* Install pytest-reportlog

* Use more explicit env var

* Fixup
2022-09-20 10:10:58 -04:00
6dc429f6f7 Add in report generation for test failures and make fail-fast false (#703)
* Add logging
2022-09-19 17:24:46 -04:00
a08779f603 Fix DeepSpeed CI (#612)
* Try with integration on makefile
2022-08-08 14:54:56 -04:00
447ad0e635 Complete revamp of the docs (#495)
Completely revamp the entirety of the Accelerate documentation
2022-08-01 10:09:14 -04:00
0c6bdc2c23 enhancements and fixes for FSDP and DeepSpeed (#532)
* checkpointing enhancements and fixes for FSDP and DeepSpeed

* resolving comments

1. Adding deprecation args and warnings in launcher for FSDP
2. Handling old configs to work with new launcher args wrt FSDP.
3. Reverting changes to public methods in `checkpointing.py` and handling it in `Accelerator`
4. Explicitly writing the defaults of various FSDP options in `dataclasses` for readability.

* fixes

1. FSDP wrapped model being added to the `_models`.
2. Not passing the env variables when args are None.

* resolving comments

* adding FSDP for all the collective operations

* adding deepspeed and fsdp tests

1. Removes mrpc datafiles and directly relies on HF datasets as it was throwing `file not found` error when running from within `tests` folder. Updating `moke_dataloaders` as a result.
2. adding `test_performance.py`, `test_memory.py` and `test_checkpointing.py` for multi-gpu FSDP and DeepSpeed tests

* reverting `mocked_dataloader` changes

* adding FSDP tests

* data files revert

* excluding fsdp tests from `tests_core`

* try 2

* adding time delay to avoid `torchrun` from crashing at times leading which causing flaky behaviour

* reducing the time of tests

* fixes

* fix

* fixes and reduce time further

* reduce time further and minor fixes

* adding a deepspeed basic e2e test for single gpu setup
2022-07-26 18:14:29 +05:30
fdf471519c Add production testing + fix failing CI (#547)
* Add production testing

* Fix CI failure on transformers
2022-07-21 14:32:27 -04:00
ea0d5368bd Add benchmarks (#506)
* Add benchmarks

* Oops! Forgot one file

* Update benchmarks/README.md

Co-authored-by: Zachary Mueller <muellerzr@gmail.com>

Co-authored-by: Zachary Mueller <muellerzr@gmail.com>
2022-07-12 15:16:45 -04:00
1703b79a79 DeepSpeed Revamp (#405)
* deepspeed revamp

* Update dataclasses.py

* Update deepspeed.py

* quality

* fixing code

* quality

* FIx imports

* saving 16bit model in zero stage 3

1. Saving 16bit model in zero stage 3
2. zero init in stage 3 support using HFDeepSpeedConfig

* quality

* adding test and fixing bugs

* update makefile for deepspeed tests

* Update test.yml

* adding `deepspeed` as requirement for tests

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* quality

* addressing comments

* add example and minor updates

1. Add example to show the usage of config file with revamped deepspeed support.
2. update required deepspeed version to 0.6.5
2. reverting `reinit` change as it is not required,
3. raising Exception when using `clip_grad_value` with DeepSpeed/FSDP.

* Documentation and Zero-3 Inference Support

1. Changes to support ZeRo Stage-3 Inference support.
2. minor bug fixes.
3. Documentation.

* doc fix

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* addressing comments

* update doc to address comments and bug fixes

1. update tests and add new one testing autofill functionality of `prepare` method.
2. fix bug related to zero-3 init related to HFDeepSpeedConfig
3. Update documentation addressing comments.

* removing image and hosting it on `documentation-images` dataset

* check for hidden_size for zero_opt heurisitics

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-06-07 00:52:18 +05:30
938b8f358d Speedup main CI (#419)
* Speed up workflow
2022-06-01 10:59:01 -04:00
3b51d6e9ad Fix debug_launcher issues (#413)
* change to require_cpu only
2022-05-31 14:59:28 -04:00
b515800947 Hotfix all failing GPU tests (#401)
* Fix up makefile
2022-05-26 14:13:19 -04:00
209db19dc8 Create a testing framework for example scripts and fix current ones (#313)
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-04-13 13:24:36 -04:00
5668270de7 Add logging capabilities (#293)
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

- Added experiment tracking API, and support for Weights and Biases, TensorBoard, and CometML + Tests
- Added `tensorflow` to a new dependency list to be used during tests
- Added three new functions in `Accelerator` to interact with the API
2022-03-30 17:40:32 -04:00
fb5ed62c10 Convert documentation to the new front (#271)
* Main conversion

* Doc styling

* Style

* New front deploy

* Fixes

* Fixes

* Fix new docstrings

* Style
2022-03-10 11:13:40 -05:00
4acf728c0e Doc styling 2021-02-10 14:47:43 -05:00
18e4fc61aa A bit of cleanup 2021-02-10 14:46:31 -05:00
9c5ac0dd8c Example + README 2020-11-20 13:48:50 -05:00
eb0dd18e50 Styling 2020-11-12 14:43:23 -05:00