Commit Graph

154 Commits

Author SHA1 Message Date
ff872f5f71 bump to 1.11.0dev0 2025-08-07 12:58:08 +02:00
7ecc2d7f39 bump to v1.10.0-release 2025-07-16 16:26:03 +00:00
e2cc537db8 trackio (#3669)
* trackio

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Abubakar Abid <abubakar@huggingface.co>

* seven -> eight

* Add trackio as a real tracker instead

* Sort

* Style

* Style

* Remove step

* Disable trackio on Python < 3.10

* Update src/accelerate/tracking.py

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* More style

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Abubakar Abid <abubakar@huggingface.co>
2025-07-15 17:17:49 +02:00
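
A minimal sketch of using the trackio tracker added here, assuming `trackio` is installed and Python >= 3.10 (the commit disables the integration below that); the project name and metric are placeholders:

```python
from accelerate import Accelerator

accelerator = Accelerator(log_with="trackio")          # tracker added by this PR
accelerator.init_trackers(project_name="my-project")   # placeholder project name

for step in range(3):
    accelerator.log({"loss": 1.0 / (step + 1)})        # placeholder metric

accelerator.end_training()                             # flush and close the tracker
```
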
a16d2bb3c1 bump to v1.9.0dev 2025-06-19 15:13:41 +02:00
6597dae780 Integrate SwanLab for offline/online experiment tracking for Accelerate (#3605)
* add support for SwanLabTracker and update related documentation

* add emoji in FRAMEWORK

* apply the style corrections and quality control

* add support for SwanLabTracker in tests

* fix bug in test_tracking
2025-06-18 15:42:29 +02:00
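
A comparable hedged sketch for the SwanLabTracker integrated above, assuming `swanlab` is installed; names are placeholders:

```python
from accelerate import Accelerator

accelerator = Accelerator(log_with="swanlab")          # selects the new SwanLabTracker
accelerator.init_trackers(project_name="my-project")   # placeholder project name
accelerator.log({"train/loss": 0.42})                  # placeholder metric
accelerator.end_training()
```
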
417bc52965 bump to v1.8.0dev 2025-05-15 12:02:44 +02:00
583b26db3c Add FP8 runners + tweak building FP8 image (#3493)
* Initial test

* Try on push

* Only wf dispatch now

* keep trying

* Try again

* Try again

* source activate?

* Force bash

* Source activate accelerate to make it get the env properly

* try using nightly docker

* Try this?

* Try this?

* Try this, proper output

* Try this, proper output

* Try via full conda activate(?)

* rm conda

* te fp8 tests

* add ao

* ao in setup too

* actually include fp8 deps

* FP8 docker image, use newer version

* Update docker image to take in input

* Test

* prior month

* igpu?

* Use only last 2 digits of year

* Build rest

* Apply style fixes

---------

Co-authored-by: Zach Mueller <muellerzr@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2025-04-15 11:39:43 +02:00
9642a1ac81 bump to v1.7.0dev 2025-04-01 13:55:11 +02:00
3169339f5b Bump ruff to 0.11.2 (#3471)
* ruff format

* Bump ruff to 0.11.2
2025-04-01 11:57:06 +02:00
a0edc8dcf2 Apply ruff py39 fixes (#3461)
* Apply ruff py39 fixes

* Ruff format
2025-03-31 19:10:08 +02:00
f648feba97 Add log_artifact, log_artifacts and log_figure capabilities to the MLflowTracker. (#3419)
* Added artifacts and figure tracking to the MLflow tracker

* Added `log_artifact` to the MLFlowTracker

* Remove changes

* Added artifact, artifacts and figure tracking to the MLflow tracker

* Improved the docstring

* added require_mlflow function at test_utils

* add test for MLflowTracker

* Bit of linting

* Refactor to a more robust test

* Revised the test asserts to something more robust.

* Removed incorrect import and some linting.

* removed commented code

* initiate tracker using Accelerator

* Added mlflow and matplotlib to setup.py. Guarded and decorated the functions that require them.

* Guarded mlflow import

* added matplotlib required warning.

* ran style and quality
2025-03-12 18:11:29 +01:00
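
A hedged sketch of the new MLflowTracker capabilities, assuming the methods mirror mlflow's own `log_artifact(local_path)` and `log_figure(figure, artifact_file)` signatures as the PR title suggests; file names are placeholders:

```python
import matplotlib.pyplot as plt
from accelerate import Accelerator

accelerator = Accelerator(log_with="mlflow")
accelerator.init_trackers(project_name="my-project")   # placeholder project name

tracker = accelerator.get_tracker("mlflow")
tracker.log_artifact("config.yaml")                    # upload a single local file

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [3, 2, 1])
tracker.log_figure(fig, "loss_curve.png")              # store a matplotlib figure

accelerator.end_training()
```
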
14fc61eeac Bump to 1.6.0.dev0 2025-03-12 10:13:18 -04:00
d9e6af8773 HPU support (#3378)
* init

* style

* is_hpu_available

* fix

* import habana_frameworks.torch.distributed.hccl

* style

* test

* initialize dist proc group

* revert

* set backend to hccl only if hccl initialization sets a local rank

* force backend hccl and multi_hpu type when sure of distributed launch

* style

* pass accelerator tests

* pass big modeling tests with bigger atol/rtol for accelerators

* fix hpu device count and skip tests requiring hpu:x

* hpu autocast

* hpu rng_state

* hpu launch

* hpu special device placement

* hpu launch

* rng state

* distributed data loop tests

* enforce non contiguity after device memory allocation

* pass fsdp tests

* enforce pt_hpu_lazy_mode=0 when fsdp testing

* pass cli tests

* pass and document grad sync tests

* pass kwargs handler and autocast tests

* memory utils

* found source of int64 errors

* skip some modeling utils tests

* enable int64

* skip optimizer tests

* pass checkpointing tests

* pass accelerator tests with safetensors main

* more hpu stuff

* style

* remove PT_HPU_LAZY_MODE and PT_ENABLE_INT64_SUPPORT as they should be in the testing environment

* start testing on gaudi2

* support fp16 on gaudi2

* add testing order

* custom hpu fsdp env dict

* fix torch trace malloc

* test ddp half precision comm hooks

* fix

* fix

* remove lower bound for hpu

* use 0.72 as lower bound

* lower lower bound

* order deepspeed tests

* fix

* deepspeed_use_hpu

* assert non lazy mode with offloaded optimizer

* make patching torch with habana frameworks the default

* less of require_non_hpu

* skip test_multi_device_merge_fsdp_weights for now as it halts

* skip another flaky test

* format

* use habana_visible_modules

* patch torch hpu device count

* avoid setting HABANA_VISIBLE_MODULES

* don't play with habana visible devices/modules

* only with hpu

* fixes and skips

* skip

* fix device ids and add some todos

* skip offloading with generate()

* fix

* reduced atol/rtol for hpu

* fix

* tag deepspeed tests that should run first

* enable a test path that was skipped

* revert a test that was customized for gaudi1

* some patching to enable HABANA_VISIBLE_MODULES

* fix zero3 test

* misc

* test DTensor TP

* remove gaudi1

* test

* style

* comment

* pass pad_across_processes

* require_fp16

* pass memory utils test

* test_ddp_comm_hook

* skip half precision comm hooks on hpu

* fix

* is_fp16_available

* fp16

* tp as part of integration tests

* fix

* write_basic_config

* safetensors

* local sgd and masked_fill_fwd_i64

* fix num_processes in test_load_states_by_steps

* fp8 support

* test

* fix

* add a workflow

* Update src/accelerate/accelerator.py

* review comments

* ci

* style

* comments

* test

* habana_frameworks.torch

* patch device count

* fix

* fix

* require_fp8

* fix

* fix

* gaudi 1

* remove unnecessary

* fixed masked fill error in transformers

* style

* balanced_memory pass on hpu

* remove for now

* run first

* Apply suggestions from code review

* style after merge

* Update src/accelerate/accelerator.py

Co-authored-by: Zach Mueller <muellerzr@gmail.com>

* Update src/accelerate/utils/transformer_engine.py

Co-authored-by: Zach Mueller <muellerzr@gmail.com>

* empty cache review comments

* test_script.py error messages

* AccelerateTestCase for accelerator state cleanup

* test

* add gaudi1 workflow

* fp8 availability

* fix

* reduce batch size

* concurrency

* check cuda as well

* nits and comments

* mark fsdp tests that require_fp16

* style

* mark deepspeed fp16 tests

* update image

* fix

* updated

* better msgs

* skip pippy

* test

* test on 2 device

* support up to 1% relative error in test_accelerate

* skip hpu fp16

* allow for 1 byte difference

* revert torch_device change

* style

* skip memory release since it's flaky

* add accelerator state cleanup to fixture

* fix

* atol

* fix

* more rtol

* equal grad test

* revert

* pass pippy on gaudi2 and skip on gaudi1

* enable sd 1.5 test with require fp16

* added warning on memory release

* don't log warning in memory release as it requires PartialState to be initialized

* Apply suggestions from code review

---------

Co-authored-by: Zach Mueller <muellerzr@gmail.com>
2025-03-11 11:16:57 -04:00
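
The integration aims to be transparent to user code; a minimal sketch, assuming a Gaudi machine with `habana_frameworks` installed (which this PR patches into torch by default):

```python
from accelerate import Accelerator
from accelerate.utils import is_hpu_available   # helper introduced by this PR

if is_hpu_available():
    accelerator = Accelerator()   # resolves to an HPU device automatically
    print(accelerator.device)     # e.g. hpu:0
```
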
65356780d4 [Dev] Update release directions (#3352)
* Update release directions

* Update directions and makefile to account for testpypi fun
2025-01-21 08:59:43 -05:00
78b8126bff v1.4.0.dev0 2025-01-17 10:36:00 -05:00
b13aadcb67 Bye bye torch <2 (#3331)
* Bye bye torch <2

* Add 2.6.0 dl args

* Rm require fsdp

* Adjust imports + 2.0 specific modeling code

* Bring back is_bf16
2025-01-09 12:11:08 -05:00
5f96369161 v1.2.0.dev 2024-11-20 19:24:51 -05:00
85f35647db 🚨 🚨 🚨 Goodbye Python 3.8! 🚨 🚨 🚨 (#3194) 2024-10-24 10:16:47 -04:00
52581c3f01 Change version 2024-10-09 10:50:12 -04:00
3fd02e60dc MAINT: Upgrade ruff to v0.6.4 (#3095)
* MNT Upgrade ruff to 0.6.4

Currently used version, 0.2.1, is quite old at this point.

Not a lot needed to be changed:

- Change ruff version in setup.py
- Remove deprecated ignore-init-module-imports option for ruff
- Type comparison should use is and not ==
- Use f-string instead of % formatting
- Some line wrapping and empty lines

* Oops
2024-09-10 10:43:37 -04:00
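
The "type comparison" fix listed above looks like this in practice (illustrative only):

```python
x = 3

# Flagged by ruff: == can be fooled by subclasses and is not what is meant here.
if type(x) == int:
    pass

# Preferred after the upgrade: identity comparison against the exact type.
if type(x) is int:
    pass
```
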
b5235f21d8 0.35.0.dev 2024-09-02 18:18:42 -04:00
ad3f574a3b Add early support for torchdata.stateful_dataloader.StatefulDataLoader within the Accelerator (#2895)
* temporary commit

* checkout?

* dataloader wrapper

* tmp

* weird failing test

* trying multiple inheritance

* DataLoaderAdapter

* make style

* Some dark magic dynamic reflection (for backwards compat)

* typo

* some tests

* more mixin stuff

* maybe found broken test?

* this is a very invasive feature

* i think the feature is done?

* add xpu support (#2864)

* better tests

* discovered a bug

* maybe fixed bug?

* make style

* hopefully this is PR ready

* properly skip tests

* parameterize

* Update src/accelerate/utils/dataclasses.py

Co-authored-by: Zach Mueller <muellerzr@gmail.com>

* Update src/accelerate/data_loader.py

Co-authored-by: Zach Mueller <muellerzr@gmail.com>

* merge conflicts

* move imports

* make style

* merges are breaking tests

* fix test name

* Require safetensors>=0.4.3

* undo last commit

* minor style

* address pr comments

* Torchdata version 0.8.0 is stable now

* added docs and require torchdata>=0.8.0 for testing

* test base_dataloader attr doesn't cause infinite recursion

* address pr

* replace super().__iter__ with self.base_dataloader.__iter__

---------

Co-authored-by: Fanli Lin <fanli.lin@intel.com>
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
2024-08-22 08:43:45 -04:00
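
A minimal sketch of opting into the new wrapper, assuming torchdata >= 0.8.0 as required above; the toy dataset is a placeholder:

```python
from torch.utils.data import DataLoader
from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

dl_config = DataLoaderConfiguration(use_stateful_dataloader=True)
accelerator = Accelerator(dataloader_config=dl_config)

loader = accelerator.prepare(DataLoader(list(range(8)), batch_size=2))
state = loader.state_dict()      # snapshot mid-epoch progress
loader.load_state_dict(state)    # resume iteration from the snapshot
```
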
52fae0960c Add end_training/destroy_pg to everything and unpin numpy (#3030)
* Add end_training/destroy_pg to everything

* Carry over to AcceleratorState

* If forked, ignore

* More numpy fun

* Skip only init
2024-08-20 10:40:12 -04:00
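
After this change, a sketch of how a script should now finish, since `end_training()` also tears down the distributed process group:

```python
from accelerate import Accelerator

accelerator = Accelerator()
# ... training loop ...
accelerator.end_training()   # closes trackers and, per this PR, destroys the process group
```
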
cd5698bb32 update version to 0.34.dev0 (#3007) 2024-08-12 12:13:37 -04:00
dc3b5ad82e Fix deepspeed tests (#3003)
* Unpin deepspeed

* Include proper branch for docker image

* Properly working

* Revert all other changes
2024-08-09 15:35:25 -04:00
32f368ec3f Require safetensors>=0.4.3 (#2957) 2024-07-29 07:35:34 -04:00
3ebbe573ad Add huggingface_hub version to setup.py (#2932) 2024-07-15 10:11:41 -04:00
947f64ee62 Version update 2024-07-03 13:27:34 -04:00
1f7a79b428 Potentially fix tests (#2862)
* Potentially fix tests

* Try again with numpy sub 2
2024-06-18 11:38:30 +02:00
7141881b1f Push new release version 2024-06-07 10:05:51 -04:00
4ba436eccc Introduce shard-merging util for FSDP (#2772)
* Initial commit

* Now to test

* Store false

* Slight tweaks

* Fix naming

* Got it all working with tests

* Use not for safetensors arg

* rm change

* Add docs

* Adjust based on Marc's feedback

* Specify just weights

* Update tests to include CLI and swap namings

* Fin

* Rm unused

* Rm again
2024-05-16 13:49:50 -04:00
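
A hedged sketch of the shard-merging utility introduced here; paths are placeholders:

```python
from accelerate.utils import merge_fsdp_weights

# Merge sharded FSDP checkpoint weights into a single file
# (safetensors by default, per the "Use not for safetensors arg" tweak).
merge_fsdp_weights(
    checkpoint_dir="outputs/pytorch_model_fsdp_0",   # placeholder path
    output_path="outputs/merged",                    # placeholder path
    safe_serialization=True,
)
```

The same merge is exposed on the command line as `accelerate merge-weights <checkpoint_dir> <output_path>`.
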
5b3a7f3892 Update setup.py + test failures found during release 2024-05-03 10:40:25 -04:00
c7e5e41b8c Segment out a deepspeed docker image (#2707)
* Segment out a deepspeed docker image

* Update readme

* Keep pinned ds
2024-04-29 11:25:22 -04:00
6af157ea93 Add diffusers to req (#2711) 2024-04-25 08:31:54 -04:00
f478201c28 Pin DS...again.. (#2679) 2024-04-16 12:07:59 -04:00
16488be9a4 Update version 2024-04-05 13:11:05 -04:00
7531e8c13e Unpin hub (#2625) 2024-04-04 10:33:49 -04:00
f579d9550d Pin hub for tests (#2608) 2024-04-02 10:58:17 -04:00
dd62fc90ce Unpin deepspeed (#2570) 2024-03-21 09:42:03 -04:00
ee163b66fb Update version 2024-03-12 11:55:22 -04:00
e70e3c87de Overdue email change... (#2534) 2024-03-08 12:55:42 -05:00
f20445d4ac Fix the pytest version to be less than 8.0.1 (#2461)
* Fix the pytest version to be less than 8.0.0

We're getting errors such as:

> /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/transformers/testing_utils.py:129: in <module>
>     from _pytest.doctest import (
> E   ImportError: cannot import name 'import_path' from '_pytest.doctest' (/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/_pytest/doctest.py)

* Update setup.py

Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
2024-02-23 16:03:29 -05:00
13e79ccfab Enable more Ruff lints & fix issues (#2419)
* Remove antiquated flake8 and isort configuration

* Bump to Ruff 0.2.1

* Explain ruff options

* Autofix Ruff B010 (static `setattr`)

* Autofix Ruff B009 (static `getattr`)

* Enable Ruff UP (not UP007); auto-fix

* Fix remaining Ruff UP complaints

* Fix a couple more format calls
2024-02-14 08:59:42 -05:00
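
The two autofixes called out above rewrite static attribute access; an illustrative example (class and attribute names are hypothetical):

```python
class Config:
    hidden_size = 512

config = Config()

# B009: getattr with a constant name becomes plain attribute access.
value = getattr(config, "hidden_size")   # before the autofix
value = config.hidden_size               # after the autofix

# B010: setattr with a constant name becomes a plain assignment.
setattr(config, "hidden_size", 768)      # before the autofix
config.hidden_size = 768                 # after the autofix
```
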
5318bc7733 Dev version 2024-02-13 10:04:34 -05:00
b3d2111708 Version 0.28.0.dev 2024-02-09 10:51:07 -05:00
c3aec59b12 Migrate pippy examples over and run tests (#2424)
* Migrate examples over

* Finish updating doc

* torchpippy

* Readme review nits

* Mention gather op in examples
2024-02-09 10:01:56 -05:00
0e1ee4b92d Use Ruff for formatting too (#2400)
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
2024-02-06 08:18:18 -05:00
d8a64cb79d Unpin (#2418) 2024-02-06 08:00:33 -05:00
7ba64e632c Revert "[don't merge yet] unpin torch (#2406)" (#2407)
This reverts commit 8b770a7dabd957ae54f1abb028d1ce53db6cf4d4.
2024-02-01 10:13:15 -05:00
8b770a7dab [don't merge yet] unpin torch (#2406)
* unpin torch

* unpin torch

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2024-02-01 09:56:16 -05:00