Compare commits

...

413 Commits

Author SHA1 Message Date
f6b0a2dd43 ENH Small speedups to adapter injection (#2785)
See
https://github.com/huggingface/diffusers/issues/11816#issuecomment-3281290153

This PR implements two small improvements to the speed of adapter
injection. On a benchmark based on the linked issue, the first change
leads to a speedup of 21% and the second change of another 3%. It's not
that much, but as the changes don't make the code more complicated,
there is really no reason not to take them.

The optimizations don't add any functional change but are simply based
on not recomputing the same values multiple times. Therefore, unless I'm
missing something, they should strictly improve runtime.
2025-09-23 13:27:49 +02:00
f1b83646a6 The great deduplication (#2771)
Deduplicate a lot of redundant code from the PEFT methods' model.py files:

merge_and_unload
unload
delete_adapter
set_adapter
enable_adapter_layers
disable_adapter_layers
_replace_module
_unload_and_optionally_merge
_mark_only_adapters_as_trainable
_check_new_adapter_config
_check_target_module_exists
_prepare_adapter_config
__getattr__
get_peft_config_as_dict (fully deleted)

Related changes:

A new module, functional.py, is introduced, which contains functions
(just reimported from elsewhere) that can be useful for libraries that
want to integrate PEFT. I would suggest that we should treat them as
public API and thus guarantee backwards compatibility.

I also deduplicated almost identical
TRANSFORMERS_MODULES_TO_XXX_TARGET_MODULES_MAPPING constants by copying
them from LoRA and only overriding a few values that differ. Moreover,
some PEFT methods didn't have their own
TRANSFORMERS_MODULES_TO_XXX_TARGET_MODULES_MAPPING but used the one from
LoRA instead. They now each have their own constant, which is a copy
from the one from LoRA.
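
As a rough illustration of how an integrating library might use these helpers (a sketch only; the exact contents of functional.py may differ, and the functions shown are also importable from the top-level peft namespace):

```
import torch.nn as nn
from peft import LoraConfig, inject_adapter_in_model

# Inject LoRA layers into an arbitrary torch module without wrapping it in a
# PeftModel -- the kind of low-level entry point a library integration needs.
base = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8))
config = LoraConfig(target_modules=["0", "2"], r=4)
model = inject_adapter_in_model(config, base)
print(model)  # the targeted Linear layers are now LoRA layers
```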
2025-09-23 13:26:35 +02:00
b774fd901e TST Add missing configs to test_config.py (#2781)
The test_config.py tests were missing a few configs from recently added
PEFT methods. Those are now included. After adding those, it was
revealed that for C3A and trainable tokens, super().__post_init__() was
not being called. This is now done.
2025-09-19 17:52:58 +02:00
20a9829f76 FIX Account for rsLoRA scaling in set_scale (#2775) 2025-09-16 11:30:29 +02:00
1806c1651a CHORE Update and pin (commit hash) GitHub actions (#2779)
Some GH actions didn't have a pinned commit hash while others did
because of Zizmor. Now all actions have pinned commit hashes.
2025-09-11 11:12:23 +02:00
13fa0aea7e FIX: Wrong coupling between requires_grad and the active adapter (#2765)
Description

At the moment, we strongly couple the active adapter with
requires_grad=True. Concretely, when we call model.set_adapter(name), we
automatically assume that this adapter should not only be made active but
that its requires_grad should also be set to True.

For the purpose of training PEFT models, this is fair. However, when
loading PEFT models for inference, this is not desired. Generally, for
inference, we don't need requires_grad=True, but as is, it is enabled.

Generally, this is not a severe bug, since in the inference code, we
don't perform any updates, thus we don't inadvertently update a weight
because it wrongly has requires_grad=True -- this is probably why it
went unnoticed so far. However, it could lead to worse runtime
performance and memory overhead when PyTorch records grads for those
parameters (which it shouldn't if called with torch.inference_mode, but
some users may forget to use this). Therefore, this bug is still worth
fixing.

Example

With `modules_to_save`

A very basic example where the current PEFT fails:

import os
from transformers import AutoModelForCausalLM
from peft import LoraConfig, PeftModel, get_peft_model

model_id = "facebook/opt-125m"
path = "/tmp/peft/2759"
if not os.path.exists(path + "/adapter_model.safetensors"):
    model = AutoModelForCausalLM.from_pretrained(model_id)
    config = LoraConfig(target_modules=["q_proj", "v_proj"], modules_to_save=["lm_head"], r=8)
    model = get_peft_model(model, config)
    model.save_pretrained(path)
    del model

model = AutoModelForCausalLM.from_pretrained(model_id)
model = PeftModel.from_pretrained(model, path)

`modules_to_save` should not have grads enabled, but currently it does.

With multiple adapters

There is also an issue when loading more than one adapter:

model = PeftModel.from_pretrained(...)
assert not any(p.requires_grad for p in model.parameters())  # works

So far, so good, the first adapter does not have `requires_grad`.

model.load_adapter(...)
assert not any(p.requires_grad for p in model.parameters())  # fails

The load_adapter call inadvertently sets requires_grad=True for the
weights of the _first_ adapter. This happens because when the second
adapter is loaded, we call set_adapter with the first adapter to ensure
that it remains the active adapter. However, due to
the coupling of active adapter and requires_grad, this would result in
setting requires_grad=True for the first adapter.

The PR relaxes this coupling by allowing set_adapter to be called with an
additional argument, inference_mode. If set to True, requires_grad will
not be enabled, even if the adapter is activated.
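
A minimal sketch of the relaxed call, continuing the inference example above (argument name and placement are taken from this description, so treat it as illustrative rather than the final API):

```
# `model` is the PeftModel loaded with PeftModel.from_pretrained above.
# Activate the adapter without flipping requires_grad back on:
model.base_model.set_adapter("default", inference_mode=True)
assert not any(p.requires_grad for p in model.parameters())
```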

The example above would also fail for modules_to_save and trainable
tokens, not only for the LoRA/LoHa/... weights.

Still open bugs

The proposed solution is unfortunately not perfect. Right now, we do
pass inference_mode based on the PEFT config of the adapter being added,
which helps with the original issue described above. However, even this
is not absolutely correct, because inference_mode of the second adapter
does not necessarily have the same value as inference_mode of the first
adapter. To illustrate how this can go wrong, I added an xfailing test:

test_loading_model_requires_grad_set_correctly_switch_inference_mode

I believe that this use case is rarer than the ones described at the
beginning, so IMO it is okay to have this bug because we fix more common
bugs. However, LMK if you disagree.

Related to this, I noticed that many tests in
test_custom_models.TestRequiresGrad had code like this:

config0 = FooConfig(...)
peft_model = get_peft_model(MLP(), config0)
config1 = FooConfig(..., inference_mode=True)  # <==
peft_model.add_adapter("adapter1", config1)

This now fails because of the reason just given. I removed
inference_mode=True here and the tests pass again.

Note that the only reason why inference_mode=True was passed here is
because AdaLoRA cannot load 2 adapters in training mode and thus
requires this. Later PEFT methods without this restriction blindly
copied the AdaLoRA test. For those PEFT methods, I removed
inference_mode=True.

However, this also means that the AdaLoRA tests now fail. I thus marked
them as xfail.

To properly fix this bug, I think we would have to refactor the code to
isolate set_adapter (i.e. determining the active adapter) and setting
requires_grad into separate code paths, as they're orthogonal. Moreover,
these attributes are being set all over the place, which makes it hard
to reason about where these attributes are being changed. This should be
streamlined.

Making these changes while not breaking any existing code is not
trivial (or maybe even impossible). Therefore, I went the easier way for
the time being with this PR. Maybe a bigger refactor could be envisioned
for a version 1.0 release of PEFT.

Related changes

While working on this, I noticed that LNTuning was completely buggy when
calling set_adapter. This is now fixed.

Moreover, since I had to touch update_layer everywhere, I ensured that
they all take kwargs for consistency.
2025-09-08 19:49:29 +02:00
42db980676 Add Arrow + GenKnowSub to LoRA (#2644)
This PR adds support for Arrow, a modular routing mechanism for LoRA experts introduced here, as well as the refinement method GenKnowSub, proposed in our ACL 2025 Main Conference paper. GenKnowSub enhances Arrow by subtracting a general-domain LoRA from task-specific ones prior to routing, leading to improved generalisation and modularity.
2025-09-08 14:21:37 +02:00
ed5c6eaa1a Replace from_legacy_cache method with constructors (#2767)
Replace Cache.from_legacy_cache method with init.
2025-09-08 13:49:25 +02:00
92e15573ac CHORE Upgrade trufflehog GitHub action to 3.90.5 (#2770)
Maybe solves the trufflehog false positive, maybe not.
2025-09-08 13:47:02 +02:00
5ef8e85d1f FIX X-LoRA forward hook issue during generate (#2761)
There was an issue where forward hooks would accumulate during
generation: one hook was registered per forward step, and generate calls
forward multiple times. This is already undesirable, but to make it
worse, only the last hook was removed, so the remaining hooks kept
piling up.
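
A generic sketch (plain PyTorch, not the actual X-LoRA code) of the leak pattern and its fix:

```
import torch.nn as nn

module = nn.Linear(4, 4)
handles = []
for _ in range(3):  # stands in for repeated forward calls during generate()
    handles.append(module.register_forward_hook(lambda mod, inp, out: out))

# Buggy cleanup: only the last handle is removed, the other hooks leak.
handles[-1].remove()
assert len(module._forward_hooks) == 2

# Correct cleanup: remove every handle that was registered.
for handle in handles[:-1]:
    handle.remove()
assert len(module._forward_hooks) == 0
```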
2025-09-08 13:46:31 +02:00
c81363bd4e Support dataclass model configs (#2778)
LeRobot uses dataclasses to manage policy configs. If we want to
support LeRobot policy fine-tuning it'd be easiest to support
these configs in `get_model_config`.

While it is possible to fix this on LeRobot's side (add a to_dict implementation to the config classes) I think it'd be cleaner to support it on our side since the cost is relatively low and dataclasses are getting more popular anyway.
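
A sketch of the kind of handling this implies (the helper name is made up and this is not the exact PEFT implementation):

```
import dataclasses

def model_config_to_dict(config):
    # transformers-style configs expose to_dict(); LeRobot policy configs are
    # plain dataclasses, so fall back to dataclasses.asdict() for those.
    if hasattr(config, "to_dict"):
        return config.to_dict()
    if dataclasses.is_dataclass(config):
        return dataclasses.asdict(config)
    return dict(config)
```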

Thanks @xliu0105 for raising this issue and proposing a fix.
2025-09-08 13:35:47 +02:00
5d97453235 FIX Deprecated key_cache attribute on Cache pt 2 (#2753)
In #2737, we fixed some code that relied on the deprecated attribute, but
some was missed, as it only runs on the nightly CI with multiple
GPUs. This PR fixes this.

Note that the original transformers code that this solution was based on
no longer exists, as transformers now initializes the cache lazily, so
pre-allocating the keys and values to the correct device is not
necessary. But since prefix tuning inserts "virtual" keys/values, we
still have to ensure the correct device in PEFT.

I have tested the failing tests locally and they pass.
2025-09-04 14:47:29 +02:00
2ea5377ee3 TST FIX Failing AutoAWQ test with torch 2.8 (#2752)
There has been a failing AWQ test since torch 2.6, which was marked as xfail
for torch==2.7. However, torch 2.8 is now out and the test is still failing.
Therefore, the xfail now checks for torch>=2.7.

As AWQ is no longer being maintained, we should expect this situation to
deteriorate over time and eventually we'll have to remove it. But for
the time being, it still appears to mostly work, so I suggest we leave
it as is.
2025-09-03 19:25:05 +02:00
de60e88b6b Fix missing code start in docs (#2768)
There was a minor typo in a suggestion on PR #2609 which broke code formatting for one code sample.

This is a simple fix for that.
2025-09-03 18:37:52 +02:00
293aea5df6 Support for Activated LoRA (#2609)
This PR migrates Activated LoRA (aLoRA) support from a standalone GitHub repo (see above) to PEFT itself.

Note there is also an active PR for vLLM inference support for Activated LoRA: vllm-project/vllm#19710 . There are also collections of aLoRA models on huggingface (in the ibm-granite org); note that these preexisting models run off of the standalone GitHub repo and will be updated to work with this new PEFT feature if merged.

Description of changes: Activated LoRA is a modification of the LoRA architecture that "activates" the adapter weights only on tokens coming after a specified invocation_string. As a result, the KV values for the string coming before the activation match the KV values of the base model. This makes the KV cache for the input interchangeable between the base model and the adapter model, and allows for major speedups in inference pipelines (e.g. agentic pipelines) that want to use both base models and adapter models. See the paper for a detailed exploration of use cases and further elaboration.

Other notes:

The crux of the changes is in layer.py. Everything else simply manages the alora_offsets quantity, which defines where the weights start to be activated. This is determined by scanning input strings for the invocation_string defined in the aLoraConfig.
    
I believe that aLoRA really only makes sense for CausalLMs, hence I've only implemented this for that model type.

Merging doesn't make sense for aLoRA adapters since the weights are not universally applied to all tokens.
    
I used the LoRA code as a starting point, but did not implement various seemingly extra features in that code.

As of now, invocation_string should probably start and end with special tokens, to avoid tokenizer issues at the boundary. Open to suggestions on how to make this more general if needed.

---------

Co-authored-by: githubnemo <githubnemo@users.noreply.github.com>
2025-09-03 18:26:50 +02:00
a3197b1ec5 FIX: Multiple active adapters with auxiliary layers (#2758)
This PR fixes a few issues with the handling of active adapters for
auxiliary modules.

1. Calling set_adapter on the model.base_model

When calling peft_model.set_adapter, it is not possible to activate more
than one adapter, as not all PEFT methods support that. However, many
PEFT methods like LoRA do, in which case users should call
peft_model.base_model.set_adapter(['default', 'other']).

Now the issue was that the activation of auxiliary modules was only done
on PeftModel.set_adapter. This means that if users are calling
peft_model.base_model.set_adapter (i.e. LoraModel.set_adapter etc.), the
auxiliary adapters were not activated.

This PR fixes this issue by ensuring that even if the user activates
adapters like this, the auxiliary modules are activated. When users
activate more than one adapter, additional checks are performed to
ensure that they are not activating multiple auxiliary modules on the
same module.

Note that some existing PEFT code could start raising errors now because
of the change. However, this PEFT code is buggy right now so IMO it is
fine to raise an error.
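
To make the call pattern from point 1 concrete, a small sketch (model and adapter names are illustrative):

```
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
config = LoraConfig(target_modules=["q_proj", "v_proj"], r=8)
peft_model = get_peft_model(base, config)  # adapter "default"
peft_model.add_adapter("other", LoraConfig(target_modules=["q_proj", "v_proj"], r=8))

# PeftModel.set_adapter only accepts a single adapter; for methods like LoRA
# that support multiple active adapters, activate them on the tuner model:
peft_model.base_model.set_adapter(["default", "other"])
```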

2. Adding multiple adapters with non-overlapping auxiliary modules

Furthermore, I found an activation issue that could occur when adding
multiple adapters with non-overlapping auxiliary modules. Normally, when
the second/third/... adapter is added, it is not automatically
activated. However, when these additional adapters target new auxiliary
modules, those would be incorrectly activated (because they look like
they're the first adapter). This has also been fixed.

Right now, we don't allow users to activate multiple auxiliary adapters
on the same module. However, this limitation could be considered too
strict:

- For trainable tokens, as long as the indices don't overlap, there is no
  conflict.
- For modules_to_save, we could theoretically determine the "delta_weight"
  as new_weight - original_weight, then add up all delta_weights.

This is not implemented in the PR for now, to prevent it from becoming even
more complex.
2025-08-29 17:54:19 +02:00
e62aee44e3 feat(lokr, loha): add 1x1 Conv2d and Conv1d support (#2515)
This PR enhances the LoKr and LoHa adapter implementations within PEFT by adding proper support for:

 - 1x1 Convolutions (nn.Conv2d with kernel_size=(1,1))
 - nn.Conv1d layers (specifically including kernel_size=1).

This allows LoKr/LoHa adapters to be correctly applied to a wider range of modern architectures that heavily utilize these layer types (e.g., ResNet bottlenecks, MobileNet pointwise convolutions, various Transformer blocks). The implementation aims for optimized handling, inspired by LoRA's 1x1 optimization, while maintaining consistency with existing LyCORIS patterns in PEFT. Parts of the implementation logic, particularly for parameter factorization and layer adaptation, were adapted from the KohakuBlueleaf/LyCORIS library (e.g., lycoris/modules/loha.py), consistent with existing acknowledgements within the PEFT codebase.

This includes:

    New Conv1d adapter layer classes for both LoKr and LoHa, mirroring Conv2d.
    Updated layers_mapping in LoKrModel and LoHaModel to recognize Conv1d.
    Enhanced create_adapter_parameters methods in LoKr/LoHa to correctly initialize parameters based on Conv1d weight shapes.
    Refactored update_layer methods in LoKr/LoHa to:
        Detect Conv1d layers.
        Implement specific logic for 1x1 Conv2d and kernel_size=1 Conv1d, notably disabling use_effective_conv2d where appropriate for direct matrix handling.
        Ensure correct shape calculations for factorization.
    Added detection flags (is_1x1_conv2d, is_1_conv1d) in get_delta_weight methods as hooks for potential future computation optimizations (without altering current paths).
    Maintained backward compatibility; changes are additive and do not affect existing functionality for other layer types or kernel sizes.
    Followed established PEFT/LyCORIS coding patterns for consistency.
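
A small sketch of what this enables -- a custom model with a 1x1 Conv2d and a kernel_size=1 Conv1d, both adapted with LoKr (purely illustrative):

```
import torch.nn as nn
from peft import LoKrConfig, get_peft_model

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.pointwise = nn.Conv2d(16, 32, kernel_size=1)  # 1x1 convolution
        self.temporal = nn.Conv1d(32, 32, kernel_size=1)   # Conv1d, kernel_size=1

    def forward(self, x_img, x_seq):
        return self.pointwise(x_img), self.temporal(x_seq)

config = LoKrConfig(target_modules=["pointwise", "temporal"], r=4)
model = get_peft_model(Net(), config)
model.print_trainable_parameters()
```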


---------

Co-authored-by: Kabir Grewal <kabirgrewal@Kabirs-MacBook-Pro-5.local>
2025-08-27 13:07:05 +02:00
246fe4db7c DOC Update BOFT conceptual guide (#2744) 2025-08-26 11:23:27 +02:00
2d9b22f4c0 FIX: DynamicCache key_cache attribute deprecation (#2737)
Resolves failing CI with transformers source install.

The key_cache attribute on DynamicCache is deprecated and will be
removed in the 4.56.0 transformers release. Update the cache dict
in-place instead.
2025-08-26 10:37:12 +02:00
2a27f0e00c Bump version to 0.17.2.dev0 after release (#2748) 2025-08-21 17:58:14 +02:00
41c07f0445 FIX: DynamicCache max_cache_len attribute error (#2735)
Resolves current CI errors with prefix tuning.

Due to some recent changes in transformers (surfaced by
https://github.com/huggingface/transformers/pull/39797), checking
hasattr(cache, "max_cache_len") results in an error. This PR fixes it.

Moreover, that PR also changed the argument order to initialize
HybridCache (will probably also be reverted in transformers), which is
also taken into account in this PR by only using keyword arguments.

Finally, HybridCache will be deprecated and later removed, so move the
import inside a version guard.
2025-08-21 16:24:04 +02:00
ce5c2044f1 FEAT RoAd: 2D Rotary Adaptation (#2678)
Implements RoAd from https://arxiv.org/pdf/2409.00119

Supports mixed adapter batches.
2025-08-19 15:45:38 +02:00
b5ace6a8c4 CHORE: Clean up config kwargs in custom model tests (#2736)
Resolves #2695

For some PEFT methods, there was a bit of a mess when it comes to how
the init_weights argument was set in test_custom_models.py. The default
kwargs for the tests should be that the PEFT method is initialized as an
identity transform, and for specific tests we want to disable that. Note
that most PEFT methods are initialized by default to be identity
transforms, which is why the argument does not need to be set
explicitly, but it's not true for all PEFT methods.

With this PR, SHiRA, C3A, and FourierFT are now initialized to be
consistent with this. This made it possible to remove some extra
handling of those methods which was intermingled with certain tests.

Moreover, test_custom_models.py now uses the set_init_weights_false
helper function where appropriate.

While working on this, I also cleaned up a bit the docs for the
init_weights arguments of these PEFT methods where appropriate.

I added some clarifying comments.

For test_unload_adapter, I simplified a config type check and
rewrote it to load the base model only once.

---------

Co-authored-by: githubnemo <githubnemo@users.noreply.github.com>
2025-08-19 11:55:25 +02:00
480929537f CI: Allow CI to pass even if MacOS tests error (#2715)
Also fix Zizmor complaint about wrong check.
2025-08-19 11:53:28 +02:00
04d41cbcd0 ENH Enable TP for LoRA linear layers (#2741)
Enables tensor parallelism.
2025-08-14 20:39:33 +02:00
eb1a25abfb CHORE: Upgrade ruff to ~0.12.8 (#2734)
Subjectively, there have been more issues recently with contributor PRs
being rejected by ruff. This could possibly be caused by them using a
different ruff version (presumably: more recent). This PR upgrades ruff
to the latest version to hopefully reduce these issues.

The only change needed to make this ruff version pass was to disable
UP045. This rule requires changing code like:

x: Optional[int]

into

x: int | None

in 220 places. Personally, I don't think it's crucial. Moreover, ruff
won't fix this automatically, except with --unsafe-fixes (note that Python
3.9 needs a __future__ import for this, so that could be the reason). My
preference is thus just to disable the rule, but LMK if you disagree.
2025-08-14 18:03:38 +02:00
47961bb547 FIX Dataset download in docs and examples (#2708)
Co-authored-by: Camilo Leonel Amadio <camilo.amadio@microchip.com>
2025-08-12 20:00:06 +02:00
a2c6612b12 FIX Multiple issues with target_parameters (#2710)
There are a few issues with target_parameters that are fixed in this PR.

Existing parametrizations

When using target_parameters with LoRA, after the forward call finishes,
the LoRA parametrization is removed. However, this also used to remove
all other parametrizations on the same parameter, which is bad. With
this PR, only the LoRA parametrization is removed.

Module repr

This PR also extends the __repr__ of lora.ParamWrapper to contain the
parameter name, which makes it more useful.

Extend testing

Added a tiny gpt-oss model to the target_parameters test suite.

Multiple LoRA adapters with target_parameters

There is an issue when adding a second LoRA adapter with
target_parameters, where this second adapter would not actually be
applied correctly. The corresponding unit test was too lax to notice the
bug. This is not easy to fix, so for now we forbid adding a second
adapter with target_parameters. This is very strict but it's better than
having silent errors.

Although it was possible to fix that specific issue, the solution
resulted in ever more deeply nested adapters (i.e. with multiple
.base_layer infixes). This in turn results in those infixes being part of
the state_dict. But then we cannot load the individual adapters correctly,
except if the model is restored in the exact same order as it was
previously created. This is not normally a requirement in PEFT (e.g. I
can create a model with two adapters and later decide to load only one
of them).

In the long run, we need to think about solutions that would allow this.
It may require some form of normalization of the layers to prevent ever
deeper nesting. Also, what is ugly right now is that, given that the
LoRA lives on a module but actually targets one of possibly multiple
parameters, the LoRA weights don't actually reference said parameter in
any name. That means, purely from the state_dict, it is unclear which
parameter a LoRA weight belongs to. Ideally, this should be encoded in
the LoRA weight key.
2025-08-12 13:59:29 +02:00
95df499d87 ENH Support XPU in text gen benchmark (#2730)
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-08-12 11:08:43 +02:00
06b54d8a0d ENH Support XPU for SFT training script (#2709)
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
2025-08-11 14:35:05 +02:00
a90003f0ed ENH Make BufferDict repr accelerator agnostic (#2731)
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-08-08 12:07:46 +02:00
9b420cc9c7 ENH Support XPU for seq clf examples (#2732)
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
2025-08-08 12:07:20 +02:00
a4b41e7924 ENH Support XPU in train_memory.py script (#2729)
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-08-08 12:06:46 +02:00
e98a59ec2d DOC Make docs more device agnostic (e.g. XPU) (#2728)
Also adjusted some more examples.

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-08-08 12:06:22 +02:00
7f7463548a ENH Update token clf/NER examples, support XPU (#2727)
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-08-08 12:05:38 +02:00
a72bbaabf7 ENH Support XPU for SD dreambooth example (#2726)
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-08-08 12:05:05 +02:00
766a9776bb ENH Update bnb 8bit examples, support XPU (#2723)
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
2025-08-08 12:04:29 +02:00
a475f56c81 Updated MetaMathQA results (#2686)
- Updated results for OFT, C3A and SHiRA
- New results for trainable tokens (for completeness)

Trainable tokens wasn't tuned a lot; we could probably search for better
tokens and increase the learning rate. We can do this later.
2025-08-07 14:57:50 +02:00
ee4a2b86be FIX: Warn when using LoRA bias w/o base layer bias (#2725)
When setting lora_bias=True, a bias term is added to lora_B (#2237).
However, to merge this LoRA adapter, we need the base layer to also have
a bias. This has not been checked so far.

With this PR, we will now warn the user when we detect this situation.
Thus they can decide if they want to continue with this setting or not.
If they don't intend to merge, they're fine.

On top of this, when trying to merge in this situation, we now raise an
appropriate error that clearly explains why merging failed.
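
A minimal sketch of the situation being warned about (a base layer without a bias combined with lora_bias=True; purely illustrative):

```
import torch.nn as nn
from peft import LoraConfig, get_peft_model

base = nn.Sequential(nn.Linear(16, 16, bias=False))  # base layer has no bias
config = LoraConfig(target_modules=["0"], r=8, lora_bias=True)
model = get_peft_model(base, config)  # should now trigger the warning described above
# model.merge_and_unload()  # would raise a descriptive error explaining why merging fails
```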

About PeftWarning

This PR adds a new warning class, PeftWarning. This makes it easier for
users to add PEFT specific warning filters (say, to ignore them or to
raise an error).

There are many more warnings in PEFT that could be migrated to this new
warning class (or a subclass where appropriate). This is outside the
scope of this PR.

Alternatives

1. We considered raising an error instead of warning when encountering
said situation. Many users miss warnings, so an error would be a
stronger signal. This would, however, be too harsh, as it could break
existing user code that is working perfectly fine.

2. We considered adding a bias term to the base layer when it is missing
during the merge. However, this requires careful bookkeeping (e.g. when
unmerging all adapters, the bias needs to be removed again). Moreover,
when calling merge_and_unload, users expect the original model
architecture to be returned. Suddenly adding a bias term would be
unexpected and could lead to errors down the line.
2025-08-07 14:50:13 +02:00
8876664cfe CI: Fix Windows error for low CPU mem usage tests (#2724)
Add tolerances (still quite strict)
2025-08-07 14:49:40 +02:00
6673609479 ENH Support XPU for image clf example (#2722)
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
2025-08-07 11:33:33 +02:00
52cc71df9f ENH Support XPU for semantic-segmentation example (#2721)
Also fixing a few issues in the example.

---------

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-08-07 11:32:26 +02:00
78bf27dd42 ENH Support XPU for RandLoRA example (#2720)
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-08-07 11:31:42 +02:00
5ef4362e12 ENH Support XPU for QALoRA example (#2719)
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-08-07 11:30:49 +02:00
a7781aa5e0 ENH Support XPU for OFT dreambooth example (#2718)
Also fixing a couple of issues like wrong argument name.

---------

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-08-07 11:30:22 +02:00
VED
ec5a1c67b0 FEAT Text generation benchmark (#2525)
Similar to #2395, this benchmark serves to compare different PEFT
methods on an equal basis. This time, the goal is to measure metrics
related to text generation, most notably speed and memory usage. The
results should be easy to reproduce and compare.

The actual experimental settings and results have yet to be added.
2025-08-07 10:17:32 +02:00
d7194f869a ENH Support XPU bnb 4bit example (#2714)
---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
2025-08-06 16:28:56 +02:00
154ef37561 ENH Support XPU for causal LM examples (#2680)
---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
2025-08-06 16:27:57 +02:00
6a33744cc2 ENH Support XPU for HRA dreambooth example (#2717)
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
2025-08-06 16:27:26 +02:00
db5c00fad2 FIX Poly issue with returned base model (#2702)
Also, add XPU support for Poly example.

---------

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-08-06 12:16:49 +02:00
e3d8fc98f1 ENH XPU support for conditional generation examples (#2684)
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
2025-08-06 12:15:28 +02:00
6d531c77a4 FIX Issue with XPU for face alignment example (#2713)
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
2025-08-06 12:14:30 +02:00
2d49c6798d ENH Support XPU for MLP LoRA example (#2712)
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-08-06 12:14:03 +02:00
d6ed90e8e2 ENH Support XPU for multi_adapter examples (#2711)
---------

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-08-06 12:13:31 +02:00
e0b2ca7977 Bump version to 0.17.1.dev0 after release (#2707) 2025-08-05 13:05:21 +02:00
44f001c695 Use hub_online_once in trainable token tests (#2701)
Also fix a minor import nit where `TrainableTokensWrapper` was not
added to `utils/__init__.py`. Fixed the corresponding imports as well.

Another housekeeping job is to move hub_online_once to testing_utils.py since it has 
grown to be used in a lot of places and testing_utils.py is the better place to keep 
such utilities.
2025-08-05 12:58:55 +02:00
ff12d13be6 FIX Bug in semantic search example (#2706)
Also updated requirements.

---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
2025-08-05 11:49:00 +02:00
2518ceeb15 FIX Deprecations in MiSS example (#2704)
Also, was validated on XPU.

---------

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-08-05 11:46:28 +02:00
ec7dee024f FIX Small issue in PISSA example (#2703)
Also validated it with XPU.

---------

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-08-05 11:45:34 +02:00
86feb8c4f9 ENH Support XPU for CPT, EVA, GPU offload (#2694)
---------

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-08-05 11:43:53 +02:00
daee6367aa ENH Support XPU for CorDA example (#2687)
---------

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-08-05 11:41:42 +02:00
207b27ec2c ENH Support XPU for LoRA-FA example (#2697)
---------

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-08-05 11:38:44 +02:00
68265a1583 ENH XPU support for training dreambooth (#2696)
---------

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-08-04 11:42:45 +02:00
be8f824d93 ENH XPU support for dna_language_model example (#2689)
---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
2025-08-04 11:32:25 +02:00
951e720081 ENH XPU support for boft_dreambooth example (#2679)
---------

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-08-04 11:17:10 +02:00
49b29c1d1a ENH XPU support for boft/controlnet example (#2674)
---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
2025-08-04 11:15:36 +02:00
48f6493f94 Release 0.17.0 (#2691)
- Bump versions
- Fix a few TODO comments
- A bit of clean up in test_target_paramters.py
2025-08-01 18:44:24 +02:00
337be05f03 ENH: Adapter injection based on state_dict (#2637)
Make it possible to inject the PEFT adapters based on a state_dict
instead of the PEFT config.

See https://github.com/huggingface/diffusers/issues/11874 for context.

Description

Right now, when creating a PEFT adapter like LoRA, the adapter layers
are injected based on the PEFT config, most notably the entries in
`target_modules`, but other arguments also play into this. Generally,
this is a good approach, but it breaks down in some situations. For
instance, in diffusers, we often have the situation that the checkpoint
was created without PEFT/diffusers, thus there is no PEFT config, only
the `state_dict`. To load these checkpoints in diffusers, the current
approach is to reverse-engineer a valid PEFT config based on the keys in
the `state_dict`.

Unfortunately, this is error prone. Moreover, not every combination of
`state_dict` keys can be easily expressed in a PEFT config through a
combination of `target_modules`, `exclude_modules`, etc. Yes, in theory
everything can be expressed by passing `target_module=<regex_pattern>`,
but reverse-engineering such a regex correctly and efficiently is very
hard (and thus currently not done).

This PR implements a completely different approach to inject adapters.
Instead of relying on the PEFT config to determine which layers to
target, it takes the `state_dict` directly as the source of truth. This
should make it possible to exactly match what is desired.

Implementation details

I took care to implement this change in a way that if no `state_dict` is
passed, the exact same code path as previously is taken. The risk of
breaking anything should thus be minimized.

Technically, it is not necessary to pass the `state_dict`; we are only
interested in the keys. I still called the argument `state_dict`, since
that is typically what we have at this point, but this can be easily
changed.

I thought it might be a good idea, if the `state_dict` is used, to still
check what modules would have been targeted if we had used the PEFT
config. Then, the results are compared and a warning is given if they
differ. This allows the user to see if the PEFT config is not correctly
specified. While running some diffusers tests, I never encountered this
warning, which is good. However, if we plan, for instance, to get rid of
all the reverse engineering of the PEFT config in diffusers, it would
make more sense to not give this warning.

Caveats

When the original LoRA model was using `target_parameters`, injecting
from `state_dict` will not work correctly. The problem is that the
`state_dict` looks the same, whether the module or a parameter was
targeted. Therefore, we cannot correctly determine the user's intent.

For now, what I decided to do is:

1. Always assume that `target_modules` is meant, as it's the far more
   common occurrence.
2. When we detect `target_parameters` while using `state_dict` for
   injection, we raise an error.
3. If we don't detect this, injection might just slip through, resulting
   in modules being targeted (if they are valid modules) instead of
   parameters.
4. Document that these two features don't work together.

I think overall, this is not too concerning, as both features are rather
niche and thus unlikely to be used in conjunction.

Related changes

While working on this PR, I made a couple of related, though not
strictly necessary, changes:

- Refactor tests in `test_low_level_api.py` to use pytest instead of
  unittest
- Add default target modules for LoHa and LoKr (just copying LoRA)
- Most PEFT method's model classes like `LoraModel` had an `__init__`
  that effectively just called `super()` with the same arguments. I
  removed these `__init__` methods.
2025-08-01 18:39:53 +02:00
J.L
bb4fb50e2b FEAT Add MiSS as a replacement for Bone. (#2604)
Add MiSS, an evolution of Bone, from https://arxiv.org/abs/2409.15371.

MiSS will replace Bone, which is now deprecated. A script to convert Bone
checkpoints to MiSS checkpoints is included.
2025-08-01 18:37:20 +02:00
a91ec33fc5 Fix not detecting regex-targeted embedding layer (#2649)
This issue was found in PR #2638 and is defined thusly:

> When calling `get_peft_model_state_dict(..., save_embedding_layers="auto")` we check if the
> embedding layer is targeted to determine if the embedding layers need saving. This is not
> done when `PeftConfig.target_modules` is a regex-string, potentially missing to save embeddings.

This is fixed by adding a check similar to the existing query of whether `EMBEDDING_LAYER_NAMES` is
a subset of the defined target modules, only that the regex matching from `BaseTuner.inject_adapter`
is used. To avoid code duplication, the matching was moved to its own utility function
`match_target_against_key`.

The main complication was defining the test cases, as it was non-trivial to figure out what
`save_embedding_layers="auto"` entails. I've assembled a list of cases that I think are correct
in the corresponding unit test.
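
A sketch of the scenario this covers (regex target_modules matching the embedding layer; model and regex are illustrative):

```
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, get_peft_model_state_dict

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
# target_modules as a regex string instead of a list of module names
config = LoraConfig(target_modules=r".*(q_proj|embed_tokens)$", r=8)
model = get_peft_model(model, config)

# With the fix, "auto" also detects the regex-targeted embedding layer and
# includes the embedding weights in the returned state_dict.
state_dict = get_peft_model_state_dict(model, save_embedding_layers="auto")
```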
2025-07-31 16:08:32 +02:00
25e5c6b25c FIX Missing device map for facebook/opt-125m (#2675)
Fixes the failing EETQ test in the nightly multi device CI.

In #2612, fixed device_maps were added for multi-GPU training as we
could not rely on device_map="auto". While doing this change, one
device_map was missing, namely for facebook/opt-125m, which is used in
the EETQ multi device test. This device_map was now added. This makes
the test pass locally.
2025-07-30 20:02:22 +02:00
5e00266e85 TST: Add more HF Hub model caching (#2682)
A bunch of tests in test_tuners_utils.py didn't use the decorator so
far, which is now fixed. This should hopefully help reduce timeouts.

Moreover, the iris dataset loading is now moved to a module-scoped
fixture (before, it was just loaded on module level). This doesn't help
with caching, but it prevents loading of this dataset when the
corresponding tests are not even run.
2025-07-30 20:02:07 +02:00
46ae69ac29 FIX Small fixes to target_parameters (#2677)
1. Better error message when same layer targeted twice
2. Remove unused attribute num_experts from _LoraParameterProxy
2025-07-30 14:34:04 +02:00
1c853eaaad Fix trainable tokens with fsdp (#2681)
When using FSDP with trainable tokens, there was an error when
retrieving the state_dict of the TrainableTokensWrapper. The reason is
that for the state_dict that is passed to get_peft_model_state_dict, the
FSDP wrapper was already unwrapped, which means the keys don't have the
FSDP-specific prefix. However, in the PEFT code, when looking up keys
from said state_dict, the prefix was not removed. Now it is removed,
making the lookup succeed. The same logic applies to
set_peft_model_state_dict.

I could successfully start training with FSDP and trainable tokens
locally by adjusting the examples/sft script to include trainable
tokens. Checkpoints could be successfully created and resumed from. The
only change I needed to make was to configure use_orig_params=True for
FSDP.
2025-07-30 14:33:53 +02:00
c11a9dfeaa FIX Failing target_parameters param usage count (#2676)
For testing target_parameters, we use a tiny Llama4 model. This model
was refactored in
https://github.com/huggingface/transformers/pull/39501, resulting in one
parameter being accessed an additional time:

https://github.com/huggingface/transformers/pull/39501/files#diff-e668ec07f78afdb2cb805d939e47453757f0b9437436cb860fcb7cb2431c9cf5R69

Therefore, a unit test that relied on how often this parameter was
accessed started failing. This PR updates the count to the correct
number.

Additionally debug print statements that were accidentally left over are
now removed.
2025-07-30 12:29:51 +02:00
92d65cafa5 Update extending vocab docs (#2669)
- Recommends trainable tokens as first measure
- Clarifies a few things about saving embeddings
- Adds full-finetuning as an option of last resort

---------

Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
2025-07-25 13:09:00 +02:00
434651346c ENH: Targeting multiple parameters on the same module (#2665)
When the target_parameters feature for LoRA was introduced in #2638,
there was one gap, namely the possibility to target multiple
nn.Parameters on the same module (there was only a workaround involving
multiple adapters, but that is not user friendly). With this PR, it is
now possible to achieve this.

The mechanism to enable this is a bit crude, namely allowing multiple
ParamWrappers to be nested. This should generally be fine as long as there
are only a couple of nn.Parameters being targeted on the same module. When
there are dozens or hundreds, this approach could lead to slowdowns or
other issues.

A side effect of this implementation is that the ParamWrapper, when it
removes the parametrization, now only removes its own parametrization.
When using nn.utils.parametrize.remove_parametrizations, it removes all
parametrizations, which is bad when we have nested parametrizations.

Alternative approaches

Some alternative approaches were discussed internally but the chosen one
was considered most practical.

1. Allow more than one adapted parameter per LoRA layer. This would
   require nested dicts for the LoRA parameters, something like
   self.lora_A[adapter_name][parameter_name]. We don't have this anywhere
   so far and it would probably break implicit assumptions about PEFT
   layers in many places (like parsing of state_dict keys), requiring many
   adjustments.
2. Have an auxiliary module that contains the individual LoRA layers that
   target the individual parameters. This could be the cleanest solution
   and would probably be more efficient if there is a huge number of
   targeted parameters per module. However, it also brings extra
   complexity, as it requires implementing the logic to route the
   information to the right parameter, and it may be a solution to a
   problem that is irrelevant in practice (a large number of targets per
   module).
2025-07-24 19:42:19 +02:00
43845f9b14 Method Comparison: Improve formatting/layout of table (#2670)
* Method Comparison: Improve formatting/layout of table

Quick improvement to reduce the dominance of columns like `{peft,train}_config` and make
numbers a bit more readable through proper decimal/thousands formatting.

* Bump gradio version to accommodate required fixes
2025-07-24 19:02:09 +02:00
663b1209fd ENH Llama-Adapters support for GPT2 (#2643)
aka "adaption prompt"
2025-07-24 14:51:16 +02:00
04a5ed7b2f DOC Fix error in code example (#2666) 2025-07-24 12:13:41 +02:00
a795199ffa Update tokenizer parameter in sfttrainer across multiple examples (#2664)
* REFAC Update tokenizer parameter to processing_class in SFTTrainer instances across multiple examples

* REFAC Replace tokenizer parameter with processing_class in Trainer instances across documentation and examples

* Refactor tokenizer parameter to processing_class in various examples

- Updated the Trainer initialization in corda_finetuning.py to use processing_class instead of tokenizer.
- Changed the execution_count to null in image_classification_peft_lora.ipynb.
- Modified the tokenizer parameter to processing_class in image_classification_peft_lora.ipynb.
- Adjusted the tokenizer parameter to processing_class in peft_bnb_whisper_large_v2_training.ipynb.
- Updated the README.md in lorafa_finetune to reflect the change from tokenizer to processing_class in Trainer initialization.

* REFAC Update tokenizer parameter to processing_class in Seq2SeqTrainer instantiation

* REFAC Replace tokenizer parameter with processing_class in README and notebook examples
2025-07-23 15:30:28 +02:00
f650b08abb make method comparison device agnostic, so it can expand to more accelerators like XPU (#2610)
make method comparison device agnostic, so it can expand to more
accelerators like XPU

---------

Signed-off-by: YAO Matrix <matrix.yao@intel.com>
2025-07-22 15:25:56 +02:00
e77924563a FIX Prefix tuning after transformers PR 38635 (#2662)
Due to https://github.com/huggingface/transformers/pull/38635, several
tests involving prefix tuning broke:

https://github.com/huggingface/peft/actions/runs/16417140904/job/46385751329

This PR fixes this by resolving two issues:

1. The _supports_cache_class attribute was removed; we can now assume
that it is True if the attribute does not exist.

2. We had special handling of past_key_values for GPTBigCodeForCausalLM
which is no longer required (nor valid) after that PR, so it is removed
depending on the transformers version.
2025-07-22 13:59:34 +02:00
fa85d10a7f Update README.md (#2659)
Update bibtex entry.
2025-07-21 14:36:02 +02:00
f3b97c3704 FEAT Allow LoRA to target nn.Parameter (#2638)
Normally, nn.Parameter cannot be targeted with LoRA adapters. This can
be problematic, e.g. when there are MoE layers that use nn.Parameter
directly, or when there is nn.Linear but the weight is passed directly
instead of calling forward (e.g. MHA).

It would be possible to craft a solution involving a special LoRA layer
for each of the modules that use nn.Parameter directly (e.g. lora.MHA)
but that doesn't scale. This PR implements a direct way to target
nn.Parameter, making use of torch.nn.utils.parametrize.

Using the feature requires passing target_parameters to the LoraConfig.
During the forward pass, when the parameter is accessed, the LoRA
weights are added to the weights while still ensuring that gradients
flow correctly to the LoRA weights.

Right now, only LoRA supports this feature. Moreover, it is not possible
to target multiple parameters of the same module with the same adapter.
A workaround is to use multiple adapters (i.e. with different names).
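
A sketch of the intended usage (the parameter path below is purely illustrative; real use cases are things like MoE expert weights stored as nn.Parameter):

```
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
config = LoraConfig(
    r=8,
    # target an nn.Parameter by (a suffix of) its name instead of a module
    target_parameters=["self_attn.v_proj.weight"],
)
model = get_peft_model(model, config)
```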

---------

Co-authored-by: githubnemo <githubnemo@users.noreply.github.com>
2025-07-15 16:18:46 +02:00
22506a8e42 FIX Deploy method comp app: error in workflow file (#2645)
Fixing the error:

permissions:
  contents: {}
 Check failure on line 11 in .github/workflows/deploy_method_comparison_app.yml

GitHub Actions
/ Deploy "method_comparison" Gradio to Spaces
Invalid workflow file

The workflow is not valid.
.github/workflows/deploy_method_comparison_app.yml (Line: 11, Col: 13):
A mapping was not expected
2025-07-14 14:48:06 +02:00
1c75d96aca FIX: Prompt learning methods modules_to_save issue (#2646)
When using prompt learning methods, modules_to_save was not correctly
set automatically. This is really bad when using, for instance, sequence
classification tasks, which require the classifier layer to be added to
modules_to_save.

The issue was introduced in #2220 where it is wrongly assumed that the
PEFT config always has a modules_to_save attribute, which is not true
for prompt learning. In #2481, this was partly fixed by using getattr to
avoid an error. However, this did not resolve the fundamental issue that
for prompt learning, there is no such attribute, resulting in
modules_to_save not being applied.

This PR proposes to fix this by adding modules_to_save to the prompt
learning configs.
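
A sketch of the affected setup (model and hyperparameters are illustrative):

```
from transformers import AutoModelForSequenceClassification
from peft import PromptTuningConfig, TaskType, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained("facebook/opt-125m", num_labels=2)
config = PromptTuningConfig(task_type=TaskType.SEQ_CLS, num_virtual_tokens=10)
model = get_peft_model(base, config)
# With the fix, the classification head is registered under modules_to_save
# again, so it is trained and saved along with the prompt embeddings.
model.print_trainable_parameters()
```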
2025-07-14 13:57:33 +02:00
a4f9334f12 FEAT Add SHiRA Adapters (#2584)
Implements: Sparse High Rank Adapters

Paper: https://arxiv.org/abs/2406.13175
2025-07-14 11:16:10 +02:00
35000fda88 Fix #2634: Allow peft_method to be a string (#2635)
The auto-tagging code assumed that every `PeftConfig.peft_type` value is an Enum value but
when adding custom types without modifying the enum it is possible to have strings as well
(and the interface supports that).

This change allows for string values of `PeftConfig.peft_type` in the auto-tagging code.
2025-07-08 11:13:06 +02:00
0755ab93f6 FIX Faulty OFT parameter device test (#2630)
There is an error in an OFT test because .cpu() is called on a parameter
instead of a module. Calling it on a parameter is not an in-place
operation, so it has no effect.
2025-07-07 15:57:06 +02:00
fa9e429e93 FIX Correctly skip AWQ test based on torch version (#2631)
There is currently an issue with a multi-GPU test using AutoAWQ. Thus,
PR #2529 introduced an unconditional skip for this test. In #2596, a
condition was added to only skip with torch 2.7, as other torch versions
are not affected. However, the is_torch_version function does not
actually match minor and patch versions, so

is_torch_version("==", "2.7")

returns False when using version 2.7.1.

This PR fixes that by checking both "2.7.0" and "2.7.1" explicitly. This
is not very robust in case that there are further patch releases of
PyTorch. However, that is unlikely, and introducing a more general
solution is IMO not worth it just for this instance.
2025-07-07 15:55:37 +02:00
d76f3fe98c FIX Create mask function signature change (#2633)
We use create_mask_for_generate from transformers. It was introduced in
v4.53.0 but in v4.53.1, the function signature was changed to include
position_ids as mandatory argument:

https://github.com/huggingface/transformers/pull/39194

This breaks our function call in PEFT. This PR fixes the function call
by passing position_ids. This in turn would break the function call with
transformers v4.53.0, thus a strict version check is being used for >=
v4.53.1.
2025-07-07 11:46:57 +02:00
b960d259e8 ENH Enable FSDP example for GPTQ quantized model (#2626)
Besides fixes, includes an example script that uses
`hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4`

---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
2025-07-07 11:08:03 +02:00
9f01809e70 FEAT: Add GH action to deploy method comparison app (#2625)
* FEAT Add GH action to deploy method comparison app

* Add to git credentials

* Different approach

* More fixes

* Fix for requirements

* Another approach

* Bah

* Change trigger to changes in method_comparison/

Manual trigger still possible

* Update method_comparison/README.md

* Satisfy Zizmor
2025-07-04 14:46:59 +02:00
4ad953aefb Bump version to 0.16.1.dev0 after release (#2632) 2025-07-04 14:46:48 +02:00
45996a1d6e Release 0.16.0 (#2629)
- Bump versions
- Update a comment to point to the new PR
- Remove a test skip that is obsolete after #2579
2025-07-03 17:24:25 +02:00
79955723d8 Auto-tagging of PEFT models (#2599)
Features like inference need correctly set tags on the repo / the model card
in order to be available. Also the Hub uses tags to index the models and make
them searchable.

With this change, PEFT automatically tags models as lora if they happen to
be trained with LoRA, with the base model, and with a custom
`peft:method:<the method>` tag.

* Base model tags were never supported; they are now

Before PEFT simply ignored tags provided by the base model. Now the
base model tags are added to the PEFT-specific model tags.

* Tag 'transformers' and add pipeline tag if possible

We remove the `peft:method:*` tag because this change needs more discussion
and is partially unrelated to this change. It is replaced by the necessary
`transformers` tag if the model is based on transformers.

We're also trying to resolve the pipeline tag automatically if it isn't set.
While there is the `transformers.pipelines.base.SUPPORTED_PEFT_TASKS` mapping
it is not sufficient to resolve the pipeline tag automatically since it is
not a 1:1 mapping. Only the causal LM case is a unique mapping.

---------

Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
2025-07-03 11:45:26 +02:00
180777ea97 TST Update diffusers hotswap tests (#2619)
When the diffusers hotswap tests were added to PEFT in #2120, the
diffusers test was marked as xfail because hotswapping was not yet
implemented in diffusers. This has long been achieved but the test was
not updated.

This PR now updates the diffusers test in PEFT and removes the xfail.
The new test is basically a copy of the corresponding test in diffusers.
Moreover, I enhanced the test according to #2611 to also ensure that
there are no CUDA graph re-records.
2025-07-02 16:56:55 +02:00
ce3b995f5b FIX CI Multi-GPU tests require device_map (#2612)
As discussed internally, since
https://github.com/huggingface/transformers/pull/37982, some multi-GPU
tests started failing because all parameters are loaded onto a single
GPU. This should now be fixed by providing an explicit device_map
instead of relying on "auto".

Furthermore, for an unknown reason, the HQQ test started failing as the
correlation dipped below 0.97 -- to 0.9696 actually. I think this is
close enough to not warrant further investigation. Therefore, I only
decreased the threshold.
2025-07-02 16:56:18 +02:00
05395fb2de FIX Type annotation error in method comparison (#2628)
Resolves an issue introduced by #2617
2025-07-02 16:33:22 +02:00
2bc97c02b7 FIX Improved handling of conv groups (#2567)
More generalized handling of groups argument in LoRA/DoRA conv layers
(previous solution: #2403).
2025-06-30 16:49:09 +02:00
e6577076bf FEAT Add C3A (Circular Convolution Adaptation) (#2577)
Add new PEFT method C³A (Circular Convolution Adaptation).

From "Parameter-Efficient Fine-Tuning via Circular Convolution":
https://arxiv.org/abs/2407.19342
2025-06-30 14:17:11 +02:00
456292649a FIX Update signature for resolve_lora_variant (#2618)
The function signature was missing **kwargs, which results in a failure
after merging #2571.
2025-06-27 16:57:05 +02:00
87703ba0e5 TST Skip (more) failing MacOS tests (#2620)
We have new MacOS tests that are failing, presumably due to the old
torch version used for MacOS GH CI runners. It's just a handful of tests
related to prefix tuning, IMO not worth trying to fix, as the error is
deep within transformers. Therefore, just skip these tests.
2025-06-27 16:56:51 +02:00
171da8ed60 FIX Attention mask dict issue, generate w/ gemma (#2579)
Resolves CI errors such as this one:

https://github.com/huggingface/peft/actions/runs/15481482956/job/43588020111#step:5:53182

After resolving that error, other errors can occur, but they're
unrelated and investigated independently.

After the transformers change in
https://github.com/huggingface/transformers/pull/37866, it can happen
that:

> Models using different types of attention in different layers (i.e.
gemma3) will now have a dict returned by
prepare_inputs_for_generation (one dict entry per attention type)

As PEFT operates on the attention mask for prompt learning methods, we
need to adjust the code for the possibility of attention_mask being a
dict. Right now, I simply extract the single value if the dict is just
one element. For other sizes, I just raise an error, as I don't know how
to deal with that. For our tests, this is enough but we might need to
find a better solution in the future.
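
A generic sketch of the handling described (not the exact PEFT code):

```
import torch

def normalize_attention_mask(attention_mask):
    # Newer transformers may return a dict with one attention mask per
    # attention type (e.g. for gemma3); unwrap it if there is a single entry.
    if isinstance(attention_mask, dict):
        if len(attention_mask) != 1:
            raise ValueError("Cannot handle more than one attention mask type")
        attention_mask = next(iter(attention_mask.values()))
    return attention_mask

mask = normalize_attention_mask({"full_attention": torch.ones(1, 8, dtype=torch.long)})
```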
2025-06-27 13:40:09 +02:00
bbc9f5dc8b FIX Avoid CUDA Graph re-record with hotswap (#2611) 2025-06-27 11:33:09 +02:00
d26f332543 ENH Method comparison: temp result files with ts (#2617)
In #2593, the timestamp was removed from the file name of result files.
This makes sense for the proper results, as those should have unique
file names and are tracked in git. However, for temporary and cancelled
results, this is not true. Therefore, the timestamp is added back in.

Moreover, I applied ruff to the MetaMathQA/ directory (it's not applied
automatically) and fixed some imports. Ruff seems to get confused about
local modules, thus the data and utils import are treated differently,
but IMO no big deal.
2025-06-26 16:48:10 +02:00
5af0cbe4ee FIX: Trainable tokens error with DeepSpeed ZeRO3 (#2605)
Resolves #2603

Trainable tokens are erroring when using DS Z3 because the embedding
weights are not available on all ranks. This solution fixes this in an
efficient way that collects these weights on a single rank, initializes
them, and then broadcasts only the slice that is affected.
2025-06-26 16:47:58 +02:00
d936478f07 ENH Make OFT faster and more memory efficient (#2575)
Make OFT faster and more memory efficient. This new version of OFT is
not backwards compatible with older checkpoints and vice versa. To load
older checkpoints, downgrade PEFT to 0.15.2 or lower.
2025-06-26 14:27:03 +02:00
e34852f7b6 ENH Support Quantization-Aware LoRA with GPTQ (#2571)
Support for Quantization-Aware Low-Rank Adaptation (QALoRA) for GPTQ.
2025-06-26 11:51:38 +02:00
bda9665bc9 Results with number of parameters + full fine tuning (#2602)
This change updates all results with their respective number of
parameters (trained + absolute) and adds the newly introduced
full-finetuning.

In addition to these results there was also an issue with the
Makefile as it didn't consider the possibility of having experiments
that don't have an adapter config (e.g., full fine-tuning).
2025-06-24 18:00:46 +02:00
d67d03439c TST XPU regression tests with deterministic (#2600)
---------

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
2025-06-24 15:42:03 +02:00
59ef3b93c8 FIX: Transformers VLM architecture changes (#2574)
FIX Transformers VLM architecture changes

Follow up to #2554
See discussion in https://github.com/huggingface/transformers/pull/38627

To quote:

> transformers PR #37033 re-arranges the way visual language models are
built by moving the LM head from the language model to the top-level
VLM (among other things).

A consequence of this is that the keys in the PEFT state_dict now also
follow the new architecture. This means that:

1. If a PEFT checkpoint was saved with the old architecture but is
   loaded with the new architecture, loading fails.
2. If a PEFT checkpoint was saved with the new architecture but is
   loaded with the old architecture, loading fails.

1. can be addressed by making use of the newly added
_checkpoint_conversion_mapping attribute for models with the new
architecture. In transformers, this is used to map old model state_dicts
to the new state_dict format. In PEFT, with some fiddling, we can use
the same mapping to make old PEFT state_dicts compatible with the new
architecture (backwards compatibility).

However, 2. is not easily addressed. We would need a reverse mapping for
this. This could be easily derived from _checkpoint_conversion_mapping,
but since this attribute doesn't exist on old models, we cannot do that.
Therefore, new checkpoints created with PEFT on these models won't load
successfully when users use old transformers (forward compatibility).

These cases are covered by the added unit tests; the tests covering case 2
are marked as xfail.

If we could reliably detect that we are in case 2, we could warn the
user and advise them to upgrade transformers, but I don't know if it's
possible to figure this out.

We also allow users to pass their own key_mapping to from_pretrained and
load_adapter, though the documentation advises against it. This argument
could theoretically be used as a workaround in case there is indeed an
issue with prompt learning state_dicts.
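As a rough usage sketch (the model id, checkpoint path, and mapping pattern below are hypothetical; the actual mapping depends on the architecture change):

```
from transformers import AutoModelForImageTextToText
from peft import PeftModel

base = AutoModelForImageTextToText.from_pretrained("org/some-vlm")  # hypothetical model id
# Hypothetical regex-based mapping from old-style PEFT keys to the new VLM layout.
key_mapping = {r"^base_model\.model\.language_model\.lm_head": "base_model.model.lm_head"}
model = PeftModel.from_pretrained(base, "path/to/old-peft-checkpoint", key_mapping=key_mapping)
```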

Apart from these changes, I also made a small change to account for
https://github.com/huggingface/transformers/issues/38017#issuecomment-2935889679.
2025-06-23 17:39:40 +02:00
bd893a8a36 TST Enable some further XPU tests to pass (#2596)
---------

Signed-off-by: YAO Matrix <matrix.yao@intel.com>
2025-06-23 14:51:49 +02:00
5fe7f8f8ab ENH: Method comparison allow full finetuning (#2597)
- Allow full fine-tuning
- Add an experiment for full fine-tuning
- Rename some columns that had wrong names
- Remove redundant metric
- Factor out file size calculation (estimate for FT)
2025-06-19 18:10:20 +02:00
179e29a756 Tracking of (trainable) parameters for MetaMathQA (#2598)
This change adds tracking for the number of (trainable) parameters for each experiment

Tracking the number of parameters, trainable and total, will make the results
much more transparent regarding model capacity. If a method was accidentally
trained with a lot more or less trainable parameters it would make for unfair
results. Having these numbers will also make benchmarking parameter efficiency
easier.
2025-06-19 18:08:25 +02:00
4721213828 Add Makefile + results for MetaMathQA task (#2593)
These are the first results for the MetaMathQA task and also the first
test of the Makefile used to run these tests.

The Makefile offers the functionality to run individual experiments by
specifying the result you want to have, e.g.
`make results/adalora--llama-3.2-3B-rank32[...].json`. Alternatively
you can simply run `make` for `make all` which runs all experiments
that don't have a result yet or which have outdated configs (comparing
result timestamp and config timestamp).

The results are from the main branch. No errors happened during the run.
There were errors with a compute instance that used an A10G 24GB because
of OOM. An L40S with 48GB of memory was fine.


* Make sure to use original batch size for OFT

This was not done previously because of runner memory constraints.

* Remove timestamp from result files

We're tracking the results in git for now which makes
looking back easy enough (`git restore -s <rev> results`).
This makes it easier for `make` to track the results that
are already computed and which need to change.
2025-06-19 17:41:51 +02:00
6bcefb02c6 Input sanitizer for benchmark result renderer (#2594)
Since `DataFrame.query` is potentially vulnerable we limit the possible
filter input to a fixed grammar that is roughly like this:

```
expr = left op right
left = ( expr ) | literal
right = ( expr ) | literal
op = in | >= | < | <= | == | and | or
```

This will give us boolean operations and basic comparisons. Note that
`literal` can be arbitrary Python literals (strings, tuples, ...).
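As a minimal sketch of how such a whitelist could be enforced before the string reaches `DataFrame.query` (illustrative only; the renderer's actual grammar check may differ):

```
import ast

# AST node types roughly corresponding to the grammar above.
_ALLOWED = (
    ast.Expression, ast.BoolOp, ast.Compare, ast.Name, ast.Load,
    ast.Constant, ast.Tuple, ast.And, ast.Or, ast.In,
    ast.GtE, ast.Lt, ast.LtE, ast.Eq,
)

def validate_filter(expr: str) -> str:
    """Raise ValueError if the filter uses anything outside the allowed grammar."""
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, _ALLOWED):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
    return expr

validate_filter("(peft_method == 'lora') and (rank >= 8)")  # passes
# validate_filter("__import__('os').remove('x')")           # raises ValueError
```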
2025-06-19 11:45:43 +02:00
1f4143a7ca DOC Update README, contributing.md, GH templates (#2588)
- Use a more up to date example code in the README
- A section on transformers integration
- Update devs to tag
- Simplify issue template (did not seem useful in practice)
- Update contribution guideline

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-06-18 18:11:59 +02:00
d6dbbc9195 ENH: Method comparison improve logging (#2591)
- Print early how the experiment is categorized
- Last resort save_dir so that results are not lost
- Catch errors in general, not only OOM
- Log error message
- Catch checkpoint saving in try ... except, just in case (otherwise,
  if it fails, no logs are written)
2025-06-17 12:14:56 +02:00
a27406c26d ENH Orthogonal LoRA layer initialization (2) (#2498)
Continuation of, and supersedes, #2389

Check discussion there for further info.

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-06-16 18:48:26 +02:00
fc254e39d9 FIX Correctly determine no_split_modules (#2570)
See discussion in https://github.com/huggingface/transformers/pull/38141
for context.

In the PEFT fsdp_auto_wrap policy, we determine the _no_split_modules.
However, this currently neglects to visit the children of the model,
which can be required for some architectures. This PR fixes that.

Note that the _get_no_split_modules function is largely copied from
transformers. One change is that it doesn't take the device_map
argument. That argument is used in transformers inside an error message
but not for the logic proper. I think it's safe to remove.

Moreover, I made an unrelated change to fsdp_auto_wrap_policy, namely
making local imports global (there was no reason for them to be local).
2025-06-16 17:21:06 +02:00
759bb70ace FIX: Generation nightly CI failing due to gemma (#2580)
For a month now, nightly CI has failed with dozens of tests causing this
error:

> RuntimeError: Offset increment outside graph capture encountered
unexpectedly.

(link: https://github.com/huggingface/peft/actions/runs/14850392078/job/41692748031)

It turns out that https://github.com/huggingface/peft/pull/2458, which
added a gemma model to the test suite, is most likely the culprit. Since
that commit, on nightly CI (with GPU), when transformers generates with
gemma, which uses torch.compile, an error can be triggered. For some
reason, this has a side effect on other tests that then results in the
error quoted above.

As is, there is no solution for the gemma issue. To still allow the
tests to run and help discover potential issues, this PR skips the
corresponding gemma tests, which should allow the other tests to pass
again.

I could confirm locally that these tests only fail when the gemma tests
are run in the same session. Hopefully, this generalizes to the CI
environment.

---------

Co-authored-by: githubnemo <githubnemo@users.noreply.github.com>
2025-06-11 18:01:13 +02:00
a8b9a6cecc ENH Optimize LoraParallelLinear initialization (#2576) 2025-06-11 13:48:51 +02:00
e67052b18c Make prepare_model_for_gradient_checkpointing public (#2569)
Expose `_prepare_model_for_gradient_checkpointing`
2025-06-10 16:12:53 +02:00
cc38f09d1d Simple variant application test (#2572)
This change adds a place for future lora variant tests to linger and also adds a basic
test that checks whether the variant was correctly applied to the lora layer if so
requested. This does not cover quantized layers yet, but it should be simple to add
thanks to the mapping.

Since we're expecting a lot more variants to be added in the future it is probably
sensible to start early in establishing a place for testing.
2025-06-06 15:42:24 +02:00
f53dd491eb TST Refactor unittest to pytest style custom tests (#2573)
This is a follow up to #2462, #2478, and #2491.

Finish the refactor from unittest-style tests to pytest-style tests to
now also include the last big file to still use the old style,
test_custom_models.py. This file was already mostly written with pytest
in mind, so the changes were rather minimal.

With this class refactored, we can finally remove ClassInstantier, which
made understanding test parametrization much more difficult.
2025-06-06 12:21:31 +02:00
62c9cf3031 ENH Check target modules for Mamba architecture (#2562)
Ensure that for Mamba, incompatible layers are not targeted. If they
are, raise an error.
2025-06-04 19:51:21 +02:00
5c956a479b FIX Inconsistent argument name in load_adapter (#2553) 2025-06-04 17:37:53 +02:00
f122d2cc8d FIX Reset rank/alpha pattern in add_weighted_adapter (#2550) 2025-06-04 15:37:58 +02:00
6d133307ad align xpu behavior w/ cuda (#2551)
* align xpu behavior w/ CUDA in lorafa

For LoRA-FA and RandLoRA: PEFT requires torch >= 1.13, and since torch 1.13 there is a device-agnostic torch.autocast, so we switch to the device-agnostic API to also cover XPU.

Clean up code in the tests folder to use the device-agnostic cache-clearing API. Before this PR, some test cases used the device-agnostic API while others used torch.cuda.xx; after this PR, all use the device-agnostic API.

Enable the gptqmodel multi-device test case on XPU and enable the torchao test cases on XPU.

* randlora default dtype to bfloat16, align CUDA behavior
* refine randlora&vblora test, refine bnb test skip message
* enable torchao tests on XPU, all passed on torchao 0.11.0
* use accelerate utils
2025-06-02 17:23:42 +02:00
2d74950b52 Fix zizmor warnings about unpinned docker images (#2565)
These images are OK to be unpinned as they are supposed to run
tests on the latest versions and are not used for doing pipeline
work such as releasing or building user artifacts.
2025-06-02 16:45:31 +02:00
e3710e0602 CI: Handle errors with MacOS and transformers (#2561)
CI Handle error with MacOS and transformers

A change in transformers introduced an error in the MacOS CI, which is
handled in this PR.

Context

For context on why we use torch 2.2 for MacOS, check #2431.
Unfortunately, as of today, the available GH workers for MacOS still
haven't improved.

Description

The error was introduced by
https://github.com/huggingface/transformers/pull/37785, which results in
torch.load failing when using torch < 2.6.

The proposed solution is to plug into pytest, intercept the test report,
check for the specific error, and mark the test as skipped instead.

Alternative solutions

The proposed solution is obviously an ugly hack. However, these are
errors we cannot fix directly, as they're caused by a dependency in
combination with the old torch version we're forced to use (thus fixing
them in transformers is probably not an option).

Instead of altering the test report, the individual tests that fail
could get an explicit skip marker when MacOS is detected. However, since
the number of affected tests is several hundred, this is very
impractical and leads to a lot of noise in the tests.

Alternatively, we could move forward with the proposal in #2431 and
remove MacOS completely from the CI. I do, however, still have the faint
hope that GH will provide arm64 workers with more RAM in the future,
allowing us to switch.
2025-06-02 15:07:10 +02:00
5a42bb773f Address changes in transformers VLM architecture (#2554)
[transformers PR #37033](https://github.com/huggingface/transformers/pull/37033) re-arranges
the way visual language models are built by moving the LM head from the language model to
the top-level VLM (among other things).

This breaks the following test:

```
peft_config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20)
model.language_model = get_peft_model(model.language_model, peft_config)
```

Reason being that all soft-prompting methods need a task type since each task type has specific
handling of the soft prompt (e.g., padding the labels according to the number of virtual tokens for causal LM).
We also can't simply use `task_type='FEATURE_EXTRACTION'` as this would not deal with `labels` either.

Luckily the VLM almost behaves like an LM (e.g., `get_input_embeddings` refers to the underlying LM),
therefore we can target the VLM itself and need to have the soft prompt methods detect if we're fine-tuning
a VLM so that we take the respective config variables from the `base_model.text_config` instead of `base_model`
directly.
2025-06-02 11:53:44 +02:00
b3130c9edb Use HF Papers (#2542)
Replaced all arxiv.org/pdf links with HF papers.
2025-05-27 13:48:53 +02:00
d5776f605d fix typos (#2544) 2025-05-26 17:35:55 +02:00
ea07d9d9b4 Fix #2535: Prevent adapters targeting themselves (#2539)
`inject_adapter` currently does not check whether the targeted layer
belongs to an already existing adapter or not. This can lead to the
situation that wildcard patterns (e.g., `o_proj.*`) will attempt to
add adapters to existing adapter layers which naturally falls apart.

This fix attempts to check which keys are already assumed to belong
to adapters by checking for `BaseTunerLayer` instances and doing a
prefix check.

---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2025-05-22 00:03:02 +02:00
8af29c6468 added support for Conv1d for DoRA (#2531)
DoRA now supports Conv1d layers. Notably, the check for how to handle non-linear layers was relaxed from requiring 4 weight dimensions to 3, since `Conv1d` weights have 3 dimensions instead of 4.
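A minimal usage sketch on a toy model (layer names and hyperparameters are illustrative):

```
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model

# Toy network with Conv1d layers; custom (non-transformers) models work with PEFT too.
base = nn.Sequential(nn.Conv1d(16, 32, kernel_size=3), nn.ReLU(), nn.Conv1d(32, 16, kernel_size=3))
config = LoraConfig(target_modules=["0", "2"], r=8, lora_alpha=16, use_dora=True)
model = get_peft_model(base, config)

x = torch.randn(4, 16, 64)  # (batch, channels, length)
print(model(x).shape)       # torch.Size([4, 16, 60])
```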
2025-05-12 20:33:58 +02:00
6c48949930 Randlora documentation and some example usage (#2524)
This is a follow up to #2464 and issue #2441.

Entails documentation for RandLora and slightly updated example usage in the model.py docstring.

Also adds RandLoRA to method comparison.

---------

Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
2025-05-07 14:40:55 +02:00
6c054d0ff2 Method comparison: Support more options for the optimizer (#2479)
Allow setting a different optimizer, including PEFT specific ones like
LoRA+.
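For example, with the LoRA+ helper from PEFT (a hedged sketch; the toy model and hyperparameters are illustrative):

```
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model
from peft.optimizers import create_loraplus_optimizer

base = nn.Sequential(nn.Linear(16, 16))
peft_model = get_peft_model(base, LoraConfig(target_modules=["0"], r=8))

# LoRA+ assigns a higher learning rate to the LoRA B matrices than to the A matrices.
optimizer = create_loraplus_optimizer(
    model=peft_model,
    optimizer_cls=torch.optim.AdamW,
    lr=5e-5,
    loraplus_lr_ratio=16,
)
```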

Add experiment for LoRA-FA

Update param name, rm obsolete directories
2025-05-05 15:41:43 +02:00
eb5e9bcbdf FIX Use correct argument name in MHA forward (#2510)
The arguments of the forward method of MultiheadAttention are called
query etc. PEFT used x. Therefore, if a caller uses keywords only, the
argument is not assigned, resulting in an error.

This was initially reported here:

https://github.com/huggingface/peft/issues/761#issuecomment-2818029500

Note: Other layers' forward method (like Linear) also uses incorrect
names, like x instead of input, but so far no issues were reported, so
I'll leave it as is for now.
2025-05-05 15:40:12 +02:00
1fb98f164b FIX Prompt learning issue with 4d attention mask (#2458)
Resolves #2452

Some causal language models in transformers have 4d attention masks at
the input preparation stage. So far, we have assumed 2d attention masks,
which results in an error in that case. This PR fixes the situation.

The test suite has been extended to include a tiny gemma model. To
prevent the test suite from ballooning, I removed another model.
Specifically, this was GPT neox, which from HF download stats seems to
be one of the least popular architectures from our test suite.

Notes:

My first attempt was to transform the 2d prefix attention mask (from the
virtual tokens) into a 4d attention mask before concatenating them.
However, this was error prone and I was unsure if my approach would
generalize to other model architectures than the one tested (gemma), as
it involved using private transformers methods. The simpler approach was
thus to just create a 2d attention mask and let the model handle it.
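Roughly, the simpler approach boils down to something like this (shapes and names are illustrative, not the actual PEFT internals):

```
import torch

batch_size, seq_len, num_virtual_tokens = 2, 8, 20
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)

# Prepend a 2d mask for the virtual tokens and let the model build its own 4d mask.
prefix_mask = torch.ones(batch_size, num_virtual_tokens, dtype=attention_mask.dtype)
extended_mask = torch.cat([prefix_mask, attention_mask], dim=1)
print(extended_mask.shape)  # torch.Size([2, 28])
```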

---------

Co-authored-by: githubnemo <githubnemo@users.noreply.github.com>
2025-05-05 14:22:26 +02:00
62ee666055 TST Mark AutoAWQ as xfail for now (#2529)
The AutoAWQ multi GPU test is currently failing on CI. This is most
likely an issue of AutoAWQ with PyTorch 2.7. The issue has been reported
but there is no reaction so far. Thus let's skip the test for the time
being.

Since the PR marks the test as strictly x-failing, we will know when
there is a new release with a fix.
2025-05-02 18:42:22 +02:00
f54571223d ENH: Add tests, docs, types for scaling methods (#2526)
For the LoRA methods

- set_scale
- scale_layer
- unscale_layer

unit tests, docstrings, and type annotations were added.
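A hedged usage sketch of these methods on the LoRA layers of a model (the toy model, adapter name, and factors are illustrative):

```
import torch.nn as nn
from peft import LoraConfig, get_peft_model
from peft.tuners.lora.layer import LoraLayer

base = nn.Sequential(nn.Linear(16, 16))
peft_model = get_peft_model(base, LoraConfig(target_modules=["0"], r=4, lora_alpha=8))

for module in peft_model.modules():
    if isinstance(module, LoraLayer):
        module.scale_layer(0.5)           # multiply the current scaling by 0.5
        module.unscale_layer()            # revert scale_layer calls
        module.set_scale("default", 2.0)  # scaling of "default" becomes 2.0 * lora_alpha / r
```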

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2025-05-02 12:31:34 +02:00
6c8c3c386e TST: Refactor remaining common tests to use pytest (#2491)
* Refactor test_adaption_prompt.py

- Did not really use PeftCommonTester, thus removed it
- Removed skip if llama or mistral not available
- Parametrized tests instead of duplicating
- Use small models from Hub instead of creating new ones
- Test coverage misses 3 more lines around loading checkpoint, most
  likely unrelated to adaption prompt but instead due to using hub models
  instead of creating new ones

* Refactor test_feature_extraction.py

Pretty straightforward, test coverage is 100% identical.

* Refactor test_multitask_prompt_tuning

Same arguments apply as for test_adaption_prompt.py

* Refactor test_stablediffusion.py

This was pretty straightforward. After refactoring, the test coverage
was 100% the same.

I noticed, however, that these tests did not cover LoKr, they only
pretended to:

37f8dc3458/tests/test_stablediffusion.py (L113-L114)

Thus I added LoKr to the test matrix, after which the test coverage is
of course different, but that is fine.

* Skip LoKr merging tests when not CUDA

For some reason, the outputs differ after merging. However, I locally
verified that this is already true before this refactor, so let's just
skip for now, as it is out of scope.
2025-05-02 11:19:32 +02:00
cf75e4aed1 MNT Pin GitHub action hashes for security (#2521)
Make Zizmor happy again.
2025-04-30 16:49:58 +02:00
6383a6bba4 ENH Add default Qwen3 target modules (#2522) 2025-04-29 22:18:58 +08:00
003cf20bcd FEAT Add LoRA INC support (#2499)
Add LoRA support for Intel Neural Compressor (INC) quantization.

---------

Signed-off-by: Daniel Socek <daniel.socek@intel.com>
2025-04-28 18:39:37 +02:00
453a6ff336 DOC Fix links in Corda docstring (#2517) 2025-04-28 12:52:00 +02:00
70e737bbdf FIX TST Incorrect CUDA skipping logic (#2519)
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
2025-04-28 12:50:51 +02:00
352a230085 FIX add_weighted_adapter with rank_pattern (#2512)
When using add_weighted_adapter with the 'cat' option, it's possible that the
new merged adapter won't allocate enough space when a patterned rank is larger
than config.r. This is now fixed.
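A hedged sketch of the affected code path (toy model, adapter names, and weights are illustrative):

```
import torch.nn as nn
from peft import LoraConfig, get_peft_model

base = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16))
# The first adapter uses a rank_pattern so one layer has a larger rank than config.r.
cfg_a = LoraConfig(target_modules=["0", "1"], r=4, rank_pattern={"1": 16})
cfg_b = LoraConfig(target_modules=["0", "1"], r=4)
model = get_peft_model(base, cfg_a, adapter_name="adapter_a")
model.add_adapter("adapter_b", cfg_b)

# 'cat' concatenates the LoRA factors; the merged adapter must allocate enough rank.
model.add_weighted_adapter(
    adapters=["adapter_a", "adapter_b"],
    weights=[0.5, 0.5],
    adapter_name="merged",
    combination_type="cat",
)
model.set_adapter("merged")
```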
2025-04-28 12:50:00 +02:00
d3ff1334a7 FEAT Add RandLoRA to PEFT (#2464)
Implements "RandLoRA: Full-rank parameter-efficient fine-tuning of large
models", https://arxiv.org/abs/2502.00987.
2025-04-25 14:56:24 +02:00
9fdb21e9de Update Docker image builds for torch 2.7+cu126 (#2514)
* Update Docker image builds for torch 2.7+cu126

* Remove bnb multi-source dockerfile
2025-04-24 11:49:05 -04:00
2d90569c5d FIX: CPT should not be tested with sequence classification (#2507)
PR #2481 added sequence classification tests to PEFT. The test matrix
included CPT. However, CPT only supports the task type CAUSAL_LM. These
tests still passed but now started failing with:

> AttributeError: object has no attribute 'prepare_inputs_for_generation'

This is probably due to a change in transformers, but since CPT was
never meant to work for sequence classification, the actual fix is to
remove CPT from the seq cls test matrix.

Since CPT automatically changes the task type to CAUSAL_LM, this mistake
can be hard to spot. Therefore, this PR also adds a warning if users
pass the wrong task type. In the future, this will raise an error.
2025-04-23 18:46:27 +02:00
2e39c89b5b TST Make 3 flaky tests pass on XPU (#2503)
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
2025-04-23 12:14:14 +02:00
6a7046c35b TST AQLM test no longer x-fails (#2506)
There has been an AQLM release with the fix to the issue that originally
broke the test.
2025-04-22 12:35:31 +02:00
f7cda1f924 TST Make DoRA tests pass on XPU (#2493) 2025-04-17 12:38:06 +02:00
36160a5c00 Fix #2477: Regression accessing modules_to_save (#2481)
* Fix #2477: Regression accessing `modules_to_save`

Commit ed3c82866ab3fb introduced adapter-local modules_to_save initialization which prevented
needless initialization but also broke prompt tuning methods as they don't have the `modules_to_save`
attribute.

This change also introduces a sequence classification test suite that also tests prompt tuning methods.
While not comprehensive it is sufficient to catch this error and can be extended over time.

While working on this and testing RoBERTa there was also an issue with the default target of `AdaLoRA`
as it defaults to `dense` (among other modules). This is problematic for `PeftModelForSequenceClassification`
as they mark `classification.*` as `modules_to_save`. But since the classification layer is also a dense layer
it will be targeted by `AdaLoRA`. To prevent such situations in the future, a general exemption was made in
`check_target_module_exists` to always avoid keys in `modules_to_save`. For this to work the config modification
done in `PeftModelForSequenceClassification` needed changing.

* Remove presumably superfluous code from inject_adapter

This code was *probably* for dealing with modules_to_save when calling
inject_adapter directly. However, since the only place that does this is
the PEFT mixed module which already deals with modules_to_save this
code is deemed superfluous.

This also makes ignoring `modules_to_save` during targeting easier,
since we can use the code in `check_target_module_exists` for every
case (targeting a nested layer in a modules_to_save module + direct targeting of
a modules_to_save module).

* Move `set_additional_trainable_modules`

Move `set_additional_trainable_modules` to `inject_adapter` in case of adapters such as LoRAs, or,
in case of prompt tuning adapters, to their respective initialization point (while keeping the order
of operations intact).

Before this change a significant portion of `modules_to_save` initialization was removed from
`check_target_layer_exists` (called from `inject_adapter`) which only handled the `modules_to_save`
parameter in cases where this function was called directly (e.g., via `LoraModel.add_weighted_adapter`).
This also meant that trainable tokens was completely ignored in these cases. It also copied code from
`_set_trainable`.

The removal prompted the need to find a replacement which is this change: on adapter injection we will
now always check if there need to be additional trainable modules, not only during `PeftModel` init.
2025-04-17 12:25:05 +02:00
4c82bfff76 FIX Multi GPU tests: explicit device map (#2484)
Some multi GPU tests had device_map="auto" but some recent changes in
accelerate resulted in parameters being moved to a single device. Now
set the device map explicitly to avoid that. Add a more rigorous check
to ensure that the parameters are really on multiple devices.
2025-04-11 18:06:38 +02:00
87cffd5041 MNT Update HF Hub download kwargs (#2492) 2025-04-11 18:06:17 +02:00
1083964862 Testing common uses situational HF_HUB_OFFLINE (#2490)
Employ offline mode when the model was already accessed once from the hub in order to speed up the CI and make the process less prone to rate limiting.

The idea here is that once a context has been visited for a specific model id, we can assume that the artifacts are cached locally and set HF_HUB_OFFLINE=1 for that context. This PR tests this concept for testing_common, which already covers a big chunk of the tests and probably has the biggest gain given the amount of change.

We already saw that the assumption does not always hold: for the prompt tuning tests (_test_prepare_input_for_generation) there is a case where the tokenizer is not used for model X the first time but is used the next time - since the hub is set to offline the second time, the tokenizer from_pretrained call will fail. This problem is alleviated by adding the tokenizer name to the model id as a cache identifier.
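A rough sketch of the idea (a purely illustrative helper, not the actual test utility, whose name and behavior may differ):

```
import os
from contextlib import contextmanager

_already_cached = set()

@contextmanager
def hub_offline_after_first_use(cache_key: str):
    """After the first visit for a given cache key, set HF_HUB_OFFLINE=1 for the context."""
    previous = os.environ.get("HF_HUB_OFFLINE")
    if cache_key in _already_cached:
        os.environ["HF_HUB_OFFLINE"] = "1"
    try:
        yield
    finally:
        _already_cached.add(cache_key)
        if previous is None:
            os.environ.pop("HF_HUB_OFFLINE", None)
        else:
            os.environ["HF_HUB_OFFLINE"] = previous

# Include the tokenizer name in the cache key so tokenizer downloads are not missed:
# with hub_offline_after_first_use("facebook/opt-125m+tokenizer"):
#     ... from_pretrained calls ...
```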
2025-04-11 18:05:25 +02:00
3a67a442e6 FIX Deprecated evaluation_strategy argument (#2487)
---------

Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-04-11 12:38:10 +02:00
dc2ea5a766 FIX X-LoRA error when targeting different modules (#2488)
Resolves #2485

Fixes an issue with X-LoRA that resulted in an error when the individual
LoRA adapters were targeting different modules.
2025-04-11 11:00:35 +02:00
37f8dc3458 FIX: Error when merging LoRA bias with scale != 1 (#2489)
When merging with LoRA bias (i.e. setting lora_bias=True), the scaling
was not considered, leading to incorrect results when scaling is not 1.
This is now fixed.
2025-04-10 16:19:22 +02:00
5e9ee26e79 TST Refactor (continued) of encoder tests (#2478) 2025-04-10 10:53:44 +02:00
0c2bdbb11a FEAT Add LoRA-FA to PEFT (#2468)
Adds LoRA with frozen A (LoRA-FA) to PEFT.

Paper: https://arxiv.org/abs/2308.03303
2025-04-10 10:53:19 +02:00
13c81df843 ENH Add default target_modules for Llama4 (#2480)
The architecture is different to the previous Llama models but the
target module names are the same.
2025-04-09 11:18:52 +02:00
896b51548b SFT example: Use correct source for max_seq_length (#2474)
When using Unsloth the SFT example used the wrong source for the `max_seq_length` attribute.
The attribute originates from TRL/TrainingArguments.

---

Co-authored-by: gufengke <gufengke@pinduoduo.com>
Co-authored-by: githubnemo <githubnemo@users.noreply.github.com>
2025-04-08 17:41:37 +02:00
21fc8bd715 FIX Deleting adapters on auxiliary modules (#2466)
Resolves #2381

Deleting adapters now supported by trainable tokens and modules to save
2025-04-08 16:46:22 +02:00
cb65a0dd56 MNT Use Python 3.9 as RUFF target version (#2483)
---------

Signed-off-by: cyy <cyyever@outlook.com>
2025-04-08 16:44:55 +02:00
8feea90319 TST Refactor tests to make them simpler (#2462)
Other test files have yet to follow.
2025-04-04 12:08:09 +02:00
dfd82f73f0 Fix: Multiple PEFT methods have issues with models loaded in float16 or bfloat16 (#2433)
As a user, it should be possible to manually cast the base model to a
lower precision dtype, float16 or bfloat16, and still have the different
PEFT methods work correctly. Currently, this is not the case for many
PEFT methods, as can be replicated by the added tests.

To understand the problem, it helps to take a step back. By default,
PEFT will treat the adapter weights with high precision, i.e. with
float32. When the base model is lower precision, the user needs to pass
inputs in lower precision too, as otherwise self.base_layer(x) would
fail. However, this low precision input clashes with the high precision
adapter weights.

The solution implemented in this PR is to cast the input to a higher
dtype [1]. That way, the whole adapter operation is conducted in high
precision. Only once that has finished will the final result be cast to
the original dtype. This should lead to better results, but it may
require more memory. Note that this is how LoRA is implemented, so the
changes in this PR bring the other methods more in line with what LoRA
does.

If the user does not want the adapter to be in float32, they can always
pass autocast_adapter_dtype=False when calling get_peft_model or
PeftModel.from_pretrained. This is also tested.
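For example (a hedged sketch; the model id and rank are illustrative):

```
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16)
config = LoraConfig(task_type="CAUSAL_LM", r=8)

# Keep the adapter weights in bfloat16 instead of autocasting them to float32.
model = get_peft_model(base, config, autocast_adapter_dtype=False)
print(next(p for n, p in model.named_parameters() if "lora_A" in n).dtype)  # torch.bfloat16
```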

Besides adjusting the forward method to account for these changes, the
merge and unmerge methods also often had to be adjusted, as they did not
correctly account for the base model dtype. Now, those methods should
always conserve the original dtype of the base model.

Note that if, for whatever reason, the input casting in [1] is not
desired, users can use the disable_input_dtype_casting context manager
to disable it (more context information on this feature can be found in
PR #2353). I updated the corresponding code to be agnostic to the
specific PEFT method (beforehand, it was only for LoRA).

Note that model.merge_adapter(safe_merge=True) did not work so far, even
though the argument was documented it was not actually there. This is
now fixed.
2025-04-04 12:06:17 +02:00
0083f9c859 TST Increase tolerance in some tests for xpu (#2475)
Numerical stability is lower, increase tolerances.
2025-04-03 17:16:10 +02:00
1cf886b792 TST Skip some GPU tests on non-CUDA devices (#2473) 2025-04-03 17:15:09 +02:00
82a2a0bb6e TST Skip some GPU tests for XPU (#2471)
Avoid issue with numerical instability.
2025-04-03 17:13:50 +02:00
8c8b529b31 CI: More caching in tests to avoid 429 (#2472) 2025-04-02 18:09:31 +02:00
7dcdf7b311 DOC Update of Bone/Bat/DiSHA docs (#2312) 2025-04-02 12:18:52 +02:00
2ee02af9d4 Bump version to reflect patch release (#2461) 2025-03-27 17:52:03 +01:00
41921013f5 Method comparison evaluation suite (#2395)
Introduction of a method evaluation suite.

We generally face the problem that there is little knowledge about which PEFT methods perform best. To this end we decided to build an evaluation suite that has defined tasks, shared hyper-parameters and can be extended with new tasks and new method configurations over time.

To keep results comparable, we've decided not to incorporate user-submitted results, but we encourage users to inspect the results, suggest new experiments and improve the configuration of methods if they're deemed unfavorable.

As of now there's only one task based on the MetaMathQA dataset which has the benefit of being complex while still fitting on a consumer GPU.

Notable changes in this squash:

* Add default training params

The experiment specific training params use the default training params
but can override any parameter from it if needed. However, this way it's
easier to make a change to all experiments (say, I want to change the
base model, I don't need to change each individual
training_parameters.json).

* Add possibility to change attn implementation

However, both flash attention 2 and flex attention are slower on my
system. Thus, stay with default None (-> SDPA).

* Refactor to use GenerationConfig

Allows to more easily use, say, static cache, which is the new default,
as it's faster (apart from the first pass)

* Better parsing of answers

E.g. 1/2 == 0.5

* Keep adapter file by default after train run

But add --clean to delete it.

Keeping the adapter can be useful if the user wants to run further tests
with the trained model.

---------

Co-authored-by: Benjamin Bossan <benjamin.bossan@gmail.com>
2025-03-27 17:00:38 +01:00
7279a9ff2e Fix #2450: Revamp adapter_state_dict_* methods (#2456)
`AuxiliaryTrainingWrapper.adapter_state_dict` now utilizes an external state dict for the
computation of the module state dict to avoid problems with DeepSpeed (or FSDP) when dealing
with distributed parameters.

It is not possible to simply wrap everything in `GatheredParameters` context managers since
doing that leads to a deadlock when running on more than one process (reasons unclear).
Since transformers, or more specifically, accelerate already handles state dict fetching
for the whole model, it is more economical to use that state dict and rewrite the methods
that before depended on `state_dict()` calls.
2025-03-27 14:08:10 +01:00
911da6f356 LoRA variant init now also receives kwargs (#2455)
The kwargs might be required for some LoRA variants for proper
initialization.
2025-03-27 11:58:05 +01:00
986b77c213 Fix sft example script trl and env var (#2454)
There were 2 minor issues:

1. Using DeepSpeed, an env var is not set, leading to an error.
2. TRL renamed the tokenizer argument to processing_class in
   v0.12.0.
2025-03-26 14:32:33 +01:00
e2262d29a9 FIX Faulty test that results in nan weights (#2448)
This specific test used a learning rate that is too high, resulting in
nan weights. Then, when weights are compared to assert that they're
different, the test passes trivially because nan != nan. The lr is now
reduced and there is a sanity check that none of the weights contain
non-finite values.

See discussion in
https://github.com/huggingface/peft/pull/2433#issuecomment-2747800312
ff.
2025-03-26 11:09:01 +01:00
8d935a63c2 CI Enable 5 test cases on XPU (#2442)
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-03-25 13:43:14 +01:00
15106f53b7 Refactor to better support LoRA variants (#2443)
Provide a framework to add LoRA variants like DoRA. This way, it will be
easier in the future to add variants to LoRA without the need to either
copy all the LoRA code and make small changes, or clutter the existing
LoRA code with countless if statements. Adding more LoRA variants in the
future will not balloon the size of the proper LoRA implementation.

The new approach is to add LoraVariant subclass to
peft/tuners/lora/variants.py. Typically, this will require one subclass
per supported layer type. The subclass should basically be stateless by
only implementing static methods (which will facilitate composition,
e.g. if a new variant can be combined with DoRA). The subclass needs to
provide a set of methods (init, merge_safe, merge_unsafe, unmerge,
forward). In the LoRA code itself, these methods will be called if the
corresponding adapter uses the subclass.
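A purely illustrative, stateless sketch of the rough shape such a variant class could take (this is not PEFT's actual base class, and the exact signatures may differ):

```
class MyLoraVariant:
    """Stateless collection of hooks for a hypothetical LoRA variant."""

    @staticmethod
    def init(module, adapter_name):
        # Register any extra parameters/buffers the variant needs on the LoRA layer.
        ...

    @staticmethod
    def forward(module, active_adapter, x, result):
        # Adjust the LoRA contribution of the active adapter and return the new result.
        return result

    @staticmethod
    def merge_safe(module, active_adapter, orig_weight):
        # Return the merged weight without modifying orig_weight in place.
        return orig_weight

    @staticmethod
    def merge_unsafe(module, active_adapter, orig_weight):
        # Merge into orig_weight in place.
        ...

    @staticmethod
    def unmerge(module, active_adapter, orig_weight):
        # Undo the merge and return the restored weight.
        return orig_weight
```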

The choice which variant to dispatch to is determined by the
resolve_lora_variant method. It is called during update_layer and can be
overridden by each LoRA layer (so e.g. lora.Linear dispatches to another
class than lora.Embedding, or bnb LoRA layers could theoretically
dispatch to a different class than normal lora.Linear).

For now, the only LoRA variant is DoRA. This has been refactored to use
the new approach.
2025-03-25 13:42:13 +01:00
e5e7b73fcf Fix typos (#2447) 2025-03-24 11:36:32 +01:00
42bb6b55cc DOC Fix incorrect link in DeepSpeed docs (#2444) 2025-03-24 11:23:37 +01:00
e79fdd78f6 DOC: Tip on how to merge with DeepSpeed ZeRO-3 (#2446)
---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-03-21 13:58:23 +01:00
5b601548b9 ENH LoRA ConvNd layers using the groups argument. (#2403)
Conv layers with groups > 1 are supported, but merging them is not.
2025-03-20 12:08:36 +01:00
93eea9c786 Bump version and minor instruction fix (#2439) 2025-03-19 19:15:08 +01:00
b34d8a2ca1 Release 0.15.0 (#2435)
- bump versions
- remove piece of code required for torch <= 1.12
- Small adjustments to release instructions regarding 
  versions
2025-03-19 15:27:10 +01:00
48e0c5de71 Fix #2422: Modules to save with multiple adapters (#2430)
Using multiple adapters with different `modules_to_save` values leads to a scenario where
it is implicitly assumed that each `ModulesToSaveWrapper` has a module for every loaded adapter.
Since the adapters have different `modules_to_save` values this is not the case and retrieving
the state dict fails with a key lookup error.

In addition to that, after disabling a `ModulesToSaveWrapper`, setting the adapter as active does not
re-enable said adapter.

---------

Co-authored-by: Saeid Ghafouri <s.ghafouri@qub.ac.uk>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
2025-03-19 10:57:58 +01:00
b2b34fd658 FIX Minimal target module optimization bug w/ IA³ (#2432)
Fixes #2429

During PEFT model initialization, we have an optimization/compression
step where we check the target_modules attribute and, if it's very long,
try to find a minimal subset that targets the same modules. If we find
it, we reduce the target_modules to that minimal set. This is done
mostly to prevent some cases (e.g. in diffusers) that result in hundreds
of target_modules being checked against thousands of module names,
slowing down initialization.

There is an issue with this when using IA³. There, we additionally have
the feedforward_modules attribute, which must be a subset of
target_modules. When target_modules is shrunk, the subset check will
fail. This PR fixes this by simply skipping the compression step for
IA³.

It would be possible to adjust the logic to also shrink
feedforward_modules, but it's not quite as straightforward, since the latter may
not be identical to target_modules, so there would have to be extra
logic to account for that. At the end of the day, this is too much
effort for what's pretty much an edge case, so the simple solution is
implemented.
2025-03-17 16:31:09 +01:00
7320bb94a0 FIX AutoPeftModels never reduce embedding size (#2427)
Resolves #2415

There was a bug in AutoPeftModels where the embedding was always resized
to the vocab size of the tokenizer when the tokenizer was found. This
makes sense if the vocabulary was extended, but some models like Qwen
already start out with "spare" embeddings, i.e. the embedding size is
larger than the vocab size. This could result in the embedding being
shrunk, which in turn resulted in an error when loading the weights.
2025-03-14 14:17:31 +01:00
2f063e6342 ENH: Extend the regex for rank/alpha pattern (#2419)
Supersedes #2382

Right now, the regex used to match the keys passed for rank_pattern and
alpha_pattern requires that either:

1. The module name is identical to the key
2. The module name having a prefix and then ending on the key

This is restrictive, since it doesn't allow to disambiguate between all
cases. E.g. if we have a model with these attributes:

- model.foo
- model.bar.foo

We cannot currently target just model.foo. (We can already target only
model.bar.foo by passing "bar.foo" as a key to the rank_pattern /
alpha_pattern dict).

This PR makes it possible to pass "^foo" as a key. This way,
model.bar.foo is not targeted, as the key does not start with "foo".

As a general rule for users, if they intend to have a full match, they
should pass the full name of the module preceded by a ^. This is the
least ambiguous way.
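For example (the module names come from the description above; the rest is an illustrative sketch):

```
from peft import LoraConfig

config = LoraConfig(
    target_modules=["foo"],         # targets both model.foo and model.bar.foo
    rank_pattern={"^foo": 16},      # applies only to the top-level model.foo
    alpha_pattern={"bar.foo": 32},  # applies only to the nested model.bar.foo
)
```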

When running the test case with the old code, all the test cases with ^
will fail, which is fine, since ^ was not working anyway. At the same
time, all test cases not using ^ pass, which means they are backwards
compatible.
2025-03-13 12:53:27 +01:00
37266c1bab FIX Revert optimization for LoRA scaling == 1 (#2416)
The PR #2404 introduced an optimization for LoRA in case that scaling ==
1 (see
https://github.com/huggingface/peft/pull/2404#discussion_r1975145200).
This unfortunately leads to recompilation when the model is compiled, as
witnessed by the failing CI here:

https://github.com/huggingface/peft/actions/runs/13755365121/job/38461837691#step:6:157

For now, let's revert the optimization. If we have concrete numbers that
show that the optimization makes a significant difference, we can
start thinking about how to optimize this code path in a
compile-friendly way.
2025-03-11 17:19:01 +01:00
8edaae9460 TST Add missing .eval() calls to inference tests (#2408) 2025-03-07 16:59:19 +01:00
e1c7e8c8dc FIX Reset the FP32 matmul precision in tests (#2411)
Fixes currently failing hotswap+compile tests that fail because outputs
are not close enough before vs after compilation.

In test_gpu_examples.py, some tests run torchao, which sets the float32
matmul precision to "high". This in turn results in some models
producing different outputs when compiled (but only for some seeds).
Therefore, we need to ensure that the precision is reset to "highest",
which is the default.
2025-03-07 12:45:12 +01:00
24150d0e41 TST Enable BNB tests on XPU (#2396) 2025-03-06 16:18:47 +01:00
461f6426ef Trainable Tokens: Support for Weight Tying (#2399)
This is a follow-up PR of #2376 to add support for weight-tying.

Some models, such as gpt2, tie the weights between the LM head and the input embeddings for various reasons. If we use the trainable tokens adapter, we're changing the result of the forward() of the input embeddings but we do not change the weights (unless we merge()). This means that the changes are not reflected in the tied weights, such as the LM head, leading to wrong results when training.

The current approach searches for tied layers and puts TrainableTokensLayer adapters on them as well, initialized to use the parameters from the embedding layer's TrainableTokensLayer. This is done via the tied_adapter argument of TrainableTokensLayer.__init__().

Notable other changes:

* Implement weight-tying for encoder-decoder models

Notably we are removing the duplication filter of `named_modules` when searching for
the (tied) target modules since tied weights are by definition duplicates.

* Implement embedding name inference

It's now possible to let the adapter decide which is the input embedding layer based on the output
of `model.get_input_embeddings()`. If that fails, the default is still `embed_tokens`.

* Refactor getattr in AuxiliaryTrainingWrapper

Before this change only the selection of the module that was supposed to have the queried
attribute was given to the wrapper implementation (via `_{has,get}attr_wrapped`). Now the full
`getattr()` call is done by the implementation.

This change is motivated by the need for access to `embedding.weight` at certain times which,
for `ModulesToSaveWrapper` is not a problem - but it is for `TrainableTokensWrapper` since
the original module's weights differ from the current weights, at least potentially.

What we do now is to merge the weights and return those when `embedding.weight` is accessed.
No other attributes are currently forwarded.

* Initialization from buffers was broken since the `persistent` flag was set too late
  (update() is called before setting the flag).

* Updating from another BufferDict was broken since it was assumed that BufferDict was
  a mapping collection object. We cannot simply change it to a Mapping since that
  would break pytorch code which assumes that modules are hashable.

---------

Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
2025-03-06 14:09:01 +01:00
0a695d55be Use new slack secret token name (#2409) 2025-03-05 19:37:06 +01:00
dbae76979b ENH Add simple script to estimate train memory (#2378) 2025-03-04 17:16:54 +01:00
d5f5e35cc6 FIX Bug with PeftConfig.from_pretrained (#2397)
In #2038, we added a change to PEFT to make PEFT configs forward
compatible. To recap, when we add a new config value, say foo, for the
LoraConfig, normally users of older PEFT versions would get an error
when trying to load it because LoraConfig would not accept a foo
argument. Now, we remove this unknown arg and just give a warning.

In general, this worked well, but there was a bug when using
PeftConfig.from_pretrained instead of the more specific
LoraConfig.from_pretrained etc. In that case, we would check the known
arguments from the PeftConfig type, which are only a few. This means
that we would ignore parameters like the rank for LoRA.

With this PR, that bug is fixed. As we know the specific PEFT config, we
can use that instead of the PeftConfig super type to determine the
unknown parameters. Therefore, PeftConfig.from_pretrained will work the
same as LoraConfig.from_pretrained.

Note that when a user uses PeftModel.from_pretrained, under the hood it
will use the more specific PEFT config, i.e. LoraConfig etc. Therefore,
the described bug would not occur there. It is thus very unlikely that
this bug affected many (or any) users in the wild.
2025-03-04 17:16:37 +01:00
1dc1416984 FIX Model with nested all-linear target modules (#2391)
Resolves #2390

There was a bug in PEFT when adding a LoRA adapter with
target_modules='all-linear' (e.g. via add_adapter) to a model that
already had LoRA adapters applied. The resolution of 'all-linear' would
result in, for instance, lora_A and lora_B being targeted, leading to
nested LoRA adapters. With this fix, this is prevented and the correct
layers will be targeted.
2025-03-04 17:16:14 +01:00
8c8bf8f1c8 FIX GPTQModel Lora implementation (#2404)
Requires gptqmodel 2.0+, optimum 1.24.0+
2025-03-04 17:15:56 +01:00
f51203f3e4 Standalone Custom Tokens Tuner and integrated into LoRA (#2376)
This change is based on the nifty addition of @marcusinthesky from #1541.

When adding tokens or fine-tuning the representation of specific tokens we currently have little choice but to retrain the whole embedding matrix which can be huge and adds to the memory footprint (in RAM but also on disk). This method creates a sparse matrix of shape (n, embed_dim) where n is the number of tokens to be customized and only trains these few values.

This change introduces two ways of using it:

```
peft_config = TrainableTokensConfig(target_modules=['embed_tokens'], token_indices=[0, 1, 2])
peft_model = get_peft_model(model, peft_config)
```

and with LoRA

```
peft_config = LoraConfig(
    target_modules='all-linear',
    trainable_token_indices={'embed_tokens': [0, 1, 2]},
)
peft_model = get_peft_model(model, peft_config)
```

Adding this feature to adapters other than LoRA should be relatively easy, mostly adding the `trainable_token_indices` config option and some debugging.

To make this change it was necessary to change the `modules_to_save` infrastructure as combining this feature with LoRA is quite similar. This refactoring entailed moving most of the basic functionality of `ModulesToSave` to the `AuxiliaryTrainingWrapper` class. This also changes the logic how `modules_to_save` is loaded/saved from from the state dict, so there could still be bugs here.

This implementation does not entail support for weight-tied layers yet. This will follow in a future change.

---

Notable commits in this squash:

* Use unload_and_optionally_merge_module protocol

With `AuxiliaryTrainingWrapper` as abstraction it is probably a good idea to
have support for `unload_and_optionally_merge_module`.

Since the wrapper is more akin to a PEFT layer than a model the name semantics
are fine and it does basically the same job.

* trainable tokens is also trained in certain adapters

Before, the assumption was that modules_to_save was the only thing that
is trained alongside an adapter's parameters. Now there's also the
token_adapter delta tokens via `NewTokensWrapper`.

* Remove old modules_to_save handling

This is now all handled via the `AuxiliaryTrainingWrapper`.

* Fix modules_to_save module overwriting

The state dict implementation of ModulesToSaveWrapper was incorrect in that
it did not include its own parameters, just the parameters it needs to overwrite
in the end. I.e. if layer `lin1` is modules to save wrapped,
`lin1.{weight,bias}` is saved and overwritten but `lin1.modules_to_save.<adapter_name>.[...]`
is not saved.

* Introduce a load key map for aux. train wrapper

Before this change it was only possible to remove a key prefix from the wrapper's
state dict (e.g., `modules_to_save.default.weight` -> `weight`); now it is possible
to restore such reduced value by mapping the key back
(i.e., `weight` -> `modules_to_save.default.weight`).

* Replace sparse matrix with dense + index_copy

This change is mostly because sparse matrices are not that beneficial in this case
(at least not from what we can see right now) and they do not solve the problem
of having to change the new tokens in-place to avoid outdated deltas when new token
vectors are initialized randomly after loading the deltas.

* Make peft_config.layers_to_transform optional

Before this change the base tuner class was forcing this attribute
to be present on the config class even though the attribute is not
specified in the base config.

* Implement missing key logic in `_set_trainable`

Before this, it was not checked whether the module targeted by `modules_to_save` or `trainable_token_indices` existed
(when used in conjunction with a PEFT method). Now an error message similar to the `inject_adapter`
error is raised when no module is found.

---------

Co-authored-by: Marcus Gawronsky <marcus.g@myrunway.co.za>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
2025-02-26 16:51:45 +01:00
3dd26682f4 ENH Make hotswap error on compile optional (#2393)
Users may actually want to call prepare_model_for_compiled_hotswap with
a compiled model, even though that leads to recompilation. To allow
this, we give the option to only warn, or even ignore, this fact when we
detect the model to be compiled. By default, we still error.
2025-02-24 12:06:08 +01:00
bf186edc5b FIX Failing single GPU tests related to hotswap (#2385)
After unblocking single GPU tests with #2380, a couple of tests related
to hotswapping failed. This PR should (hopefully) address this.

1. Wrong error type caught with xfail

I set the wrong error type for the xfailing compiled hotswap diffusers
tests. This was because I hadn't checked out diffusers main when I was
checking locally.

2. Loosen tolerance

Some tests fail because an allclose does not match even though the
numbers in the logs look pretty much identical:

https://github.com/huggingface/peft/actions/runs/13404117333/job/37440752790#step:6:1929

This is most likely a problem with tolerances being too strict.
Unfortunately, I can't reproduce the error locally, so I have to guess
that moving from 1e-5 to 1e-4 will fix the issue.
2025-02-19 16:15:36 +01:00
c118a6e564 SEC Bump transformers version used in examples (#2374)
There were approximately 322 dependabot security advisories for this, so
let's bump the transformers version used in the requirements.txt of a
couple of examples. Note that this is not a real security issue, as that
issue is with another model that's not being used in the examples.
2025-02-19 16:15:14 +01:00
e8babb1063 CI Skip audio test on single GPU CI (#2380)
It appears that the single GPU tests are always failing at this
test ("The operation was canceled"), probably because it is
hanging (after more than 5h). Let's try to debug by skipping this test.

Moreover, remove a superfluous step in the CI workflow.
2025-02-18 17:29:58 +01:00
1793a95310 FIX: Avoid caching in X-LoRA generate (#2384)
X-LoRA tests started failing after this transformers PR:

https://github.com/huggingface/transformers/pull/35724

The solution appears to be to disable caching completely when calling
generate on the X-LoRA model. This also makes some previously xfail-ing
tests pass.

I tested this locally with transformers checked out before and after the
mentioned PR and the tests pass in both circumstances. I also tested
changing the base model from "facebook/opt-125m" to
"trl-internal-testing/tiny-random-LlamaForCausalLM" and the tests passed
with both.

Also, mark X-LoRA save_load_function test as flaky.
It was marked as xfail beforehand, but it is in fact just flaky.
2025-02-18 17:29:40 +01:00
1e2d6b5832 FIX Load checkpoint from custom cache dir (#2373) 2025-02-14 17:52:54 +01:00
94be64dd19 ENH Hotswap preparation raises when no adapter (#2375)
When erroneously calling prepare_model_for_compiled_hotswap before
loading any LoRA adapter, right now, nothing happens. Later, users will
run into an error because the model was not prepared. Therefore, we now
raise an error when the function did not detect any adapter layers and
give an appropriate error message.
2025-02-13 17:21:24 +01:00
6d033600e7 FIX: Small fixes to hotswapping (#2366)
A couple of smaller issues that surfaced when working on the diffusers
integration are now fixed.

- Better detection if model is compiled in
  prepare_model_for_compiled_hotswap
- Fix handling of models that are compiled but where compilation is not
  detected (from "inside" the model)
- Handle device of swapped in adapter weights.
- Wrong adapter name in compiled diffusion model test
- Add hotswap test for different alphas and ranks but model not being
  compiled (linear and conv2d)
- Make _check_hotswap_configs_compatible "public"
- Don't import diffusers in test root
- Add support for compiled Conv2d
2025-02-12 18:20:02 +01:00
363c14e673 ENH DoRA optimization for ConvNd if dropout=0. (#2371) 2025-02-11 15:31:17 +01:00
5e03d058b8 DOC: Explain uninitialized weights warning (#2369)
Users sometimes get confused by the warning from transformers that some
weights are uninitialized and need to be trained when they use models
for classification. A recent example is #2367.

Even though the warning does not come from PEFT, let's add a section to
the docs to explain this warning, as the situation is a bit different
here.
---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-02-10 12:00:58 +01:00
40fe166446 DOC Fix links to BOFT in docs (#2365)
Fixes #2364
2025-02-07 11:14:42 +01:00
eaab05e18d Hotswap allow different alpha scalings and ranks (#2177)
Hotswapping of LoRA adapters is already implemented, but when alpha
scalings or ranks differ, this triggers recompilation if the model is
compiled, which is inefficient. Users can now call
prepare_model_for_compiled_hotswap to prevent recompilation in many
cases (see the doc update for caveats).
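A hedged usage sketch (the model id, adapter paths, and target rank are illustrative):

```
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel
from peft.utils.hotswap import hotswap_adapter, prepare_model_for_compiled_hotswap

base = AutoModelForCausalLM.from_pretrained("gpt2")
model = PeftModel.from_pretrained(base, "path/to/adapter_a")

# Pad LoRA weights up to target_rank so later swaps with different ranks/alphas
# do not trigger recompilation.
prepare_model_for_compiled_hotswap(model, target_rank=64)
model = torch.compile(model)

# Later: swap in another adapter's weights in place, without recompiling.
hotswap_adapter(model, "path/to/adapter_b", adapter_name="default")
```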
2025-02-05 18:04:06 +01:00
db9dd3f4db ENH Allow disabling input dtype casting for LoRA (#2353)
Provides the disable_input_dtype_casting context manager to prevent the
input dtype from being cast during the forward call of a PEFT layer.

Normally, the dtype of the weight and input need to match, which is why
the dtype is cast. However, in certain circumstances, this is handled
by forward hooks, e.g. when using layerwise casting in diffusers. In
that case, PEFT casting the dtype interferes with the layerwise casting,
which is why the option to disable it is given.

Right now, this only supports LoRA. LoKr and LoHa don't cast the input
dtype anyway. Therefore, the PEFT methods most relevant for diffusers
are covered.
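A rough usage sketch (the toy model and input are illustrative):

```
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model
from peft.helpers import disable_input_dtype_casting

base = nn.Sequential(nn.Linear(16, 16))
model = get_peft_model(base, LoraConfig(target_modules=["0"], r=4))

x = torch.randn(2, 16)
# Inside the context, PEFT layers do not cast the input to the adapter weight dtype,
# leaving dtype handling to external hooks (e.g. layerwise casting in diffusers).
with disable_input_dtype_casting(model):
    out = model(x)
```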
2025-02-04 17:32:29 +01:00
2825774d2d DOC Rename link to PEFT Quicktour (#2358)
The "Get started" link currently points to the "Quicktour" article,
while "Get started" is also the first title in the TOC, causing
confusion.

Rename the "Get started" link to "Quicktour" to match the article and
ensure consistency.
2025-02-03 17:36:28 +01:00
57126d5bdd DOC Fix links to PEFT guides (#2357) 2025-02-03 12:48:10 +01:00
0facdebf62 Use locked install for zizmor (#2350)
To be on the safe side when installing zizmor and its dependencies
we're using a locked installation, meaning that the dependencies
and their versions are taken from the Cargo.lock file.

This will hopefully reduce the chances of having the pipeline
randomly fail due to updated dependencies down the line.
2025-01-29 15:47:46 +01:00
7af5adec29 TST Use different diffusion model for testing (#2345)
So far, tests are using hf-internal-testing/tiny-stable-diffusion-torch
for testing diffusion models. However, this model has some issues:

- still uses pickle (.bin) instead of safetensors
- there is a FutureWarning because of the config

Now, using hf-internal-testing/tiny-sd-pipe instead which doesn't have
those issues.
2025-01-28 12:31:32 +01:00
6e1a248d50 ENH Improve invalid peft config error message (#2346) 2025-01-28 11:34:14 +01:00
a8e94b69a5 FIX Failing AdaLoRA GPU test (#2349)
PR #2341 added more rigorous checks for AdaLoRA and adjusted the tests
to take that into account. However, one GPU test was missed. This test
is now updated too, fixing the failing nightly CI (I ran it locally on
GPU to verify).

On top of that, I adjusted some numbers on the tests so that each
AdaLoRA phase runs for 2 steps, leading to 6 steps total. This means
that tests run a little bit longer but I think it's acceptable for
better test coverage.
2025-01-27 16:15:57 +01:00
f4176a9e1f ENH Add LoRA implementation for nn.Conv1d (#2333) 2025-01-27 11:48:20 +01:00
53d8115212 DOC Better document init_lora_weights=False option (#2347)
Resolves #2212

The documentation for the LoraConfig option init_lora_weights is
ambiguous. This PR updates the docstring and help to make it clearer
what this option does.

I also changed the order of the options (True -> False -> Gaussian ->
rest, instead of True -> Gaussian -> False -> rest), as that made more
sense to me.

The remaining parts of the docstring were left untouched, except for
changing line breaks (to better adhere to the 120 chars limit) and
adding missing spaces at the end of a few lines.
2025-01-27 11:08:56 +01:00
9c25d9411a Documentation & error checking for AdaLoRA timing (#2341)
The documentation about how AdaLoRA works was a bit unclear, especially
that `tfinal` is not a point in time but a duration.

It was also possible to build schedules that never enter the budgeting
phase and therefore lead to an exception because the code does not expect
this case (which is OK). We now prevent such a scenario by treating
this configuration as invalid. (Issue #2337)

We also check for `total_step` != None since this is also a guaranteed error in the code.
2025-01-24 18:54:17 +01:00
6538e56e13 TST: Update torch.compile tests and docs (#2332)
We have tests to check if torch.compile works for various PEFT methods
and "advanced" features (QLoRA, merging, ...). These tests are not run
on a regular basis, but are triggered manually. As such, it was time to
revisit them.

So far, a few of these tests were marked as xfailing. All these tests
are passing now. The reasons for this:

- Presumably: New PyTorch version (I haven't checked older)
- Loosening some tolerances
- Remove a spurious argument added by torch.compile
- Slightly adjust order of when torch.compile is called

The docs have been updated to reflect these new findings.
2025-01-24 15:21:28 +01:00
bbb112841b MNT Update ruff to v0.9.2 (#2343)
We use ruff for linting. The version is pinned because otherwise
formatting changes would creep into random PRs. Thus far, the version
was ~0.6.1 but that's already quite old by now, thus moving to ~v0.9.2.

The ruff changes themselves are all about:

1. Other line breaking logic for asserts with messages
2. More aggressive string normalization

Comment

Making these changes is always a bit annoying since existing PRs might
need to be updated, but there is never a really good time to do it.
2025-01-24 11:28:38 +01:00
6e30991e97 FEAT Add gptqmodel support (#2247)
Add support for gptqmodel quantization. This is a replacement for
auto-gptq.

For now, both packages are supported, but since auto-gptq is no longer
being developed, it will be deprecated and removed at some point in the
future.

---------

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Co-authored-by: LRL-ModelCloud <165116337+LRL-ModelCloud@users.noreply.github.com>
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
Co-authored-by: ZX-ModelCloud <165115237+ZX-ModelCloud@users.noreply.github.com>
Co-authored-by: LRL <lrl@lbx.dev>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-01-23 14:00:11 +01:00
1b9bcb200b DOC Add entry to solve unknown config argument (#2340)
There have been multiple issues and forum posts in the past asking about
errors like:

TypeError: LoraConfig.__init__() got an unexpected keyword argument ...

This error can occur when the adapter that is being loaded is trained
with a more recent PEFT version than the one currently being used. I
thus added a section to the Troubleshooting part of our docs to describe
the solutions.

Note that we already added changes to PEFT in #2038 to make configs
forward compatible. But since users who encounter this problem have, by
definition, older PEFT versions, they don't benefit from this.
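
For users stuck on an older PEFT version, a hypothetical workaround sketch (the file path and the offending key are placeholders) is to strip the unknown entry from the saved config:

import json

config_path = "path/to/adapter/adapter_config.json"  # placeholder path
with open(config_path) as f:
    config = json.load(f)
# Remove a key that was added in a newer PEFT version (hypothetical key name).
config.pop("some_new_option", None)
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)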

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-01-23 12:41:51 +01:00
ed3c82866a FIX: Avoid needless copy from modules_to_save (#2220)
Resolves #2206

The problem is that we keep a "global" modules_to_save on the model
which contains all possible modules_to_save for each adapter. When the
first adapter targets layer "foo" with modules_to_save and the second
adapter targets "bar", then "foo" will create a copy of the original
module for the second adapter, even though it's not needed.

This does not change the result but is unnecessary and takes up memory.
Thus it should be avoided.
2025-01-23 12:41:29 +01:00
9c11a3e59a Attempt at adding a cache for models (#2327)
This change introduces CI caching for datasets and hub artifacts across runner operating systems, with the goal of minimizing the number of test runs that fail because of network faults. As an additional bonus, it might make the CI a bit faster.

The following artifacts are cached: ${HF_HOME}/hub/**

Note that we're avoiding .lock files as well as *.pyc files. We're not simply caching $HF_HOME since it also contains the datasets and modules directories: the former was acting up when testing (no details, it was just dropped; we may explore this later, but we're not using that many datasets), and the latter is just code, which is probably not a good idea to cache anyway.

There is a post-process step for the cache action which uploads new data to the cache; only one runner can access the cache for uploading. This is done because GitHub Actions locks cache creation, so if two caches are created concurrently, both may fail. This runner is currently set to the Ubuntu runner of the Python 3.10 run.

If this modification turns out to be ineffective, we can move to forbidding access to the hub in general (HF_HUB_OFFLINE=1) and updating the cache once per day, but let's first see whether this is already enough to decrease the failure rate.
2025-01-23 10:54:25 +01:00
93d80465a5 First attempt at fixing zizmor warnings (#2338)
Zizmor now supports auditing token permissions for each workflow run and
reports that we almost never remove the default permissions (which seem
relatively permissive). As a precaution it does not hurt to revoke all
token permissions by default and see what breaks on the way.
2025-01-22 16:21:33 +01:00
83028178ec FIX Add missing attributes to MultiheadAttention (#2335)
See initial report here:
https://github.com/huggingface/peft/issues/761#issuecomment-2600936330.

For MHA to work in all circumstances, for instance in eval mode, we
need to expose a couple more attributes that we had missed so far.
These have now been added.
2025-01-20 18:28:33 +01:00
da998c8f1e FIX Bug with modules_to_save loading if substring (#2334)
Fixes #2289

This bug was the result of an error in the logic of modifying the
state_dict for modules_to_save in set_peft_model_state_dict. The error
in the logic was that it was checked if an entry from modules_to_save (a
set of strings) is a substring of a key of the state_dict. If it was, a
new name was assigned to that key in the state_dict, which would allow
to load the weight later.

The issue that stems from the substring check occurs if there are
multiple modules_to_save, and one of them has a name that is a substring
of another. So e.g. if one is named "classifier" and the other is named
"classifier2", there could be a false match.

This PR fixes the issue by enclosing the string with ".", i.e. we now
check whether ".classifier." is a substring instead, which avoids false
matches.

What made this bug even harder to debug was that modules_to_save is a
set and therefore has no predetermined order, so the bug would be
flaky. To address this, modules_to_save is now sorted before iterating
over it. That doesn't contribute to resolving the bug, but it makes the
bug deterministic, which makes future debugging easier.
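
A toy sketch of the fixed check (not the actual code in set_peft_model_state_dict): wrapping both strings in dots restricts matches to whole path components.

def matches_module_to_save(module_name: str, state_dict_key: str) -> bool:
    # Old, buggy check: "classifier" would also match "classifier2.weight".
    #   return module_name in state_dict_key
    # Fixed check: only whole path components match.
    return f".{module_name}." in f".{state_dict_key}."

assert matches_module_to_save("classifier", "model.classifier.weight")
assert not matches_module_to_save("classifier", "model.classifier2.weight")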
2025-01-20 18:28:15 +01:00
aa3f41f752 FIX: Generating with mixed adapter batches and with beam search enabled (#2287)
See #2283

Right now, using mixed adapter batches with beam search generations does
not work. This is because users need to pass the adapter names
associated with each sample, i.e. the number of adapter names should be
identical to the number of samples in the input.

When applying beam search, transformers internally repeats the samples
once per beam (or so it looks like). Therefore, we have more samples
during generation than samples in the input. Consequently, the adapter
names have to be extended accordingly. This is now taken care of.
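
Roughly, the expansion looks like the following sketch (the exact repetition pattern mirrors how transformers expands the batch during beam search and is an assumption here):

# One adapter name per input sample, as passed by the user.
adapter_names = ["adapter_a", "adapter_b"]
num_beams = 4
# During beam search each sample is repeated once per beam, so the adapter
# names are expanded to one entry per (sample, beam) pair.
expanded_adapter_names = [name for name in adapter_names for _ in range(num_beams)]
assert len(expanded_adapter_names) == len(adapter_names) * num_beams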
2025-01-17 18:17:48 +01:00
f973b28ffa TST Make cuda-only tests device-agnostic (#2323) 2025-01-17 15:09:25 +01:00
9481d2ee80 React on new zizmor version findings (#2331)
Zizmor detected a potential cache poisoning attack via `setup-docker-buildx`.
There is an argument for this (an attacker with a valid GitHub token could
modify the cache, change the buildx binary and tamper with Docker build releases),
but there is also an argument against it: the buildx cache would prevent
general information leaks when a new buildx release is tampered with. Since
there is no obvious benefit on either side, we ignore this hint and deem it
non-critical.

We also change the trigger of zizmor runs to pushes on main, regardless of
whether workflow files are changed or not to catch new audits from more
recent zizmor versions.
2025-01-16 15:08:56 +01:00
63ae263644 FIX: Reduce CorDA memory consumption + docs (#2324) 2025-01-15 12:29:21 +01:00
0ab9711c24 FIX: Reinstate PEFT_TYPE_TO_MODEL_MAPPING variable with deprecation (#2328)
This is for backwards compatibility: In #2282,
PEFT_TYPE_TO_MODEL_MAPPING was removed as it was redundant with
PEFT_TYPE_TO_TUNER_MAPPING. However, third party code could still use
this mapping, e.g.:

6689349625/auto_gptq/utils/peft_utils.py (L8)

Therefore, it is reinstated here, but a DeprecationWarning will be given
if it's used.
2025-01-14 17:48:17 +01:00
3289134524 FIX low_cpu_mem_usage=True with 8bit bitsandbytes (#2325)
There was a bug in PEFT that occurred when trying to use the
low_cpu_mem_usage=True option with 8bit bitsandbytes quantized models.
This bug is fixed now.
2025-01-14 10:45:52 +01:00
1e8bc60492 Refactor: PEFT method registration function (#2282)
Goal

The goal of this refactor is the following: Right now, when a new PEFT
method is added, a new directory is created in src/peft/tuners/<name>
with a config, model, etc. This is fine and self-contained.

However, in addition to that, a couple of other places in the PEFT code
base need to be touched for this new PEFT method to become usable.

As an example, take the recently added Bone method (#2172). Ignoring
tests, docs, and examples, we have the additions to
src/peft/tuners/bone, but also need to:

1. Add an entry to PEFT_TYPE_TO_CONFIG_MAPPING in mapping.py.
2. Add an entry to PEFT_TYPE_TO_TUNER_MAPPING in mapping.py.
3. Add an entry to PEFT_TYPE_TO_MODEL_MAPPING in peft_model.py
4. Add an entry to PEFT_TYPE_TO_PREFIX_MAPPING in utils/constants.py
5. Add some code to get_peft_model_state_dict in utils.save_and_load.py

With the changes in this PR, all these steps can be omitted.

On top of that, we also have the re-imports to peft/__init__.py and
peft/tuners/__init__.py but those are still required (I'm hesitant to
mess with the import system). Furthermore, it's still required to add an
entry to PeftType in utils.peft_types.py. Since this is an enum, it
can't be easily generated automatically. Therefore, adding a new PEFT
method is still not 100% self-contained.

Changes in this PR

With this PR, less book-keeping is required. Instead of the 5 steps
described above, contributors now only need to call

# example for the Bone method

register_peft_method(
    name="bone", config_cls=BoneConfig, model_cls=BoneModel
)

in the __init__.py of their PEFT method. In addition to registering the
method, this also performs a couple of sanity checks (e.g. no duplicate
names, method name and method prefix being identical).

Moreover, since so much bookkeeping is removed, this PR reduces the
number of lines of code overall (at the moment +317, -343).

Implementation

The real difficulty of this task is that the module structure in PEFT is
really messy, easily resulting in circular imports. This has been an
issue in the past but has been especially painful here. For this reason,
some stuff had to be moved around:

- MODEL_TYPE_TO_PEFT_MODEL_MAPPING is now in auto.py instead of
  mapping.py
- PEFT_TYPE_TO_PREFIX_MAPPING has been moved to mapping.py from
  constants.py
- get_peft_model had to be moved out of mapping.py and is now in its own
  module, func.py (better name suggestions welcome). This should be
  safe, as the function is re-imported to the main PEFT namespace, which
  all examples use.

The PEFT_TYPE_TO_MODEL_MAPPING dict could be completely removed, as it
was basically redundant with PEFT_TYPE_TO_TUNER_MAPPING. The
get_peft_model_state_dict could be simplified, as a lot of code was
almost duplicated.

There were a few instances in peft_model.py like:

        elif config.peft_type == PeftType.P_TUNING:
            prompt_encoder = PromptEncoder(config)

Now, instead of hard-coding the model, I just do model_cls =
PEFT_TYPE_TO_TUNER_MAPPING[config.peft_type].
2025-01-13 15:07:42 +01:00
b345a6e415 FIX Package checks for torchao, EETQ (#2320)
Torchao

Under some unknown circumstances, it can happen that even though
importlib.util.find_spec("torchao") is not None,
importlib_metadata.version("torchao") still fails. This error is now
caught. This error was noticed in the diffusers CI.

EETQ

This is basically a revert of #2226. That PR had to add a check to the
EETQ import as EETQ was broken for some time with latest
transformers (see https://github.com/NetEase-FuXi/EETQ/issues/34 for
context) but that has been fixed.
2025-01-10 16:33:27 +01:00
4cdcaf95fa FIX Adaption prompt error after transformers 35235 (#2314)
The changes in https://github.com/huggingface/transformers/pull/35235
resulted in a couple of adaption prompt tests to fail. This PR fixes
these failures while maintaining compatibility with older transformers
versions.

Required changes:

- hidden_size attribute removed from model, now config.hidden_size
- num_heads attribute removed from model, now config.num_attention_heads
- forward now returns 2 outputs instead of 3, rewritten to be agnostic
  towards the number of outputs
2025-01-10 16:32:51 +01:00
0b0ff9a2e8 FIX Prefix tuning test w/ rotary emb on multi GPU (#2311)
See
https://github.com/huggingface/transformers/pull/35235#issuecomment-2575500996
for context.

There has been a refactor in transformers that resulted in the rotary
embedding of Mistral (and probably others) moving to the model level.
This led to the device map used in one of the tests being incorrect.
This PR fixes the device map.

Note that this fix doesn't really have anything to do with prefix
tuning, the error occurred even before prefix tuning is used.
2025-01-10 15:27:03 +01:00
af637acc5b DOC In-place modification through get_peft_model (#2313) 2025-01-09 15:05:41 +01:00
8d3039b6cb ENH Add LoRA multihead attention module (#1324)
For now, only works with _qkv_same_embed_dim=True.

---------

Co-authored-by: Wang, Yi <yi.a.wang@intel.com>
Co-authored-by: keakon <keakon@gmail.com>
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
Co-authored-by: Saeid Ghafouri <sdghafouri@gmail.com>
Co-authored-by: Fanli Lin <fanli.lin@intel.com>
Co-authored-by: githubnemo <githubnemo@users.noreply.github.com>
2025-01-08 17:35:43 +01:00
29ba7b85e2 Add zizmor for CI (security) linting (#2288)
To add a bit of a safety net to our CI jobs it might make sense to add a CI security linting tool such as zizmor.

The linting run should be green at the moment since I fixed all reported issues:

- setting persist-credentials: false in all checkout runs
- changing template substitutions to environment variable substitutions

I added an ignore rule for dangerous-triggers to ignore the upload_pr_to_documentation workflow, as our actions are configured to only run such steps on approval, which means the changes should already have been seen by at least one maintainer and have passed the zizmor run.
2025-01-08 17:30:31 +01:00
c207885195 ENH Extend usage for OLoRA finetune script (#2308)
- allow DDP
- make it work on CPU
- set seed and dtype

Related: dequantize_bnb_weight is updated not to move weights to CUDA
if CUDA is not available.
---------

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
2025-01-08 17:15:52 +01:00
3d2bf9a8b2 FIX #2295: Warn when user reloads modified model (#2306)
When modifying a model with `get_peft_model` that was already modified
in the same way, even specifying a different config may not change
the trainable parameter count, e.g. when specifying target modules that
are only a subset of the previous target modules.

With this patch a warning will be issued with a hint to `.unload()`
when calling `get_peft_model` on an already modified model.
2025-01-07 18:10:07 +01:00
d967f6394c FIX Make CorDA example work (#2300) 2025-01-07 16:52:51 +01:00
fdf36d28da DOC FIX Add resize_token_embeddings (#2290) 2025-01-07 12:20:45 +01:00
Nil
ad1ff5c338 DOC Extend prepare_model_for_kbit_training docstring (#2305)
Co-authored-by: NIL <nilbiescas@gmail.com>
2025-01-06 17:06:28 +01:00
f0fd2eabc7 FIX: Typo in lora config.py type annotations (#2297) 2025-01-06 17:04:37 +01:00
6d458b300f FEAT Adding CorDA as an optional initialization method of LoRA (#2231)
Implements the paper "CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning" (https://arxiv.org/abs/2406.05223)

This initialization method can be used for building task-aware LoRA adapters from weight decomposition oriented by the context of the task using examples from data.

---------

Co-authored-by: 5eqn <491100866@qq.com>
2024-12-19 13:33:37 +01:00
c1fe8105a5 FIX Int8 check for torchao v0.7.0 (#2284)
At one point, we need to perform a check for the quantization dtype.
This used to rely on the layout_tensor attribute, which was renamed to
tensor_impl. The code now checks both attributes.
2024-12-18 12:19:00 +01:00
ae55fdcc5c FIX Adaption prompt: New way to obtain pos emb (#2276)
This PR resolves the failing adaption prompt tests in the CI using
transformers installed from source.

In this transformers PR:

https://github.com/huggingface/transformers/pull/34858/

the module.rotary_emb attribute has been removed, which adaption prompt
so far assumed was present. Instead, the position_embeddings are now
already computed and can be taken directly from the kwargs.
2024-12-13 18:19:08 +01:00
cf0dfe5695 ENH Apply sorting of imports (#2279) 2024-12-13 15:49:40 +01:00
8bafdb1126 MNT apply sorting of exported symbols in __all__ (#2280) 2024-12-13 15:48:41 +01:00
a217507105 ENH FIX Allow "all-linear" to target custom models (#2267)
Description

When the option to specify target_modules="all-linear" was introduced in
PEFT (#1295), the restriction was added to only allow it for instances
of PreTrainedModel. This was because we want to exclude the output layer
from being targeted, even if it is a linear layer, and we can't
determine this layer well except by convention.

This PR removes the restriction to PreTrainedModel instances. Thus, users can now
target other models like diffusers models or custom models. The caveat
is to use this "at your own risk", since all linear layers will be
targeted, whether they be output layers or not.

Bugfix

While working on this, I found a potential bug. The logic for updating
target_modules was that only the last part of the linear module's name
was used. So e.g. if the module was named "foo.bar.baz", then "baz" was
added to target_modules. This will lead to problems if there is another
"baz" module that is not a linear layer.

This bug was fixed by adding the full name ("foo.bar.baz" in this
example) to the updated target_modules. This can potentially lead to big
target_modules with a lot of near-repetitions, but it's worth it to
avoid targeting the wrong module.

It is not clear to me why only the last part was added. The PR that
added this to PEFT copied that part from here:

7f4e95a68d/qlora.py (L248)

but it's not clear why that repo did it that way. Maybe it was just to
keep the set size smaller.

The bug was uncovered by the unet test that is already present. Still, I
extended this test, as well as another one, to better cover this
potential issue, by ensuring that the number of target layers is as
expected.

Backwards compatibility

Technically, this change is breaking backwards compatibility. To go back
to the previous example, let's say we have a module that is called
"conv.baz" and that is a Conv2d layer. With the old behavior, since
"baz" is added to the target_modules, we would now also target this
Conv2d layer, which is supported by LoRA. After merging this PR, the
Conv2d layer would no longer be targeted.

I'd argue this is the correct behavior and thus worth changing. Also,
note that since we override target_modules, this is reflected in the
adapter_config.json. Therefore, if a user loads an adapter that had this
"baz" target, it will still work as it did previously.
2024-12-13 11:29:28 +01:00
5cdade973e ENH Warn when adapter name contains prefix (#2254)
Warn when adapter_name contains the tuner_prefix, which can cause
weight reinitialization during model loading.
2024-12-11 15:23:18 +01:00
3c61b3e880 ENH Typing: fix library interface (#2265)
Improve typing (re-export) in __init__.py files.
2024-12-11 15:20:27 +01:00
b516cee509 Bump version to 0.14.1.dev0 (#2263) 2024-12-11 15:19:32 +01:00
ec92cdcc41 FIX: Failing BOFT tests due to device (#2242)
This pull request resolves the above issue regarding BOFT forward/merging with CUDA
by ensuring that all relevant tensors and models are moved to the correct
device. This change is necessary to prevent issues such as zero matrices and
test failures when using CUDA.

Also fixed the fbd_cuda deprecation warning.
2024-12-09 11:56:39 +01:00
de88c70306 Prepare for PEFT release of v0.14.0 (#2258)
- Bump versions
- Remove deprecated convert_pissa_to_lora argument
- Remove a pytest skip for older transformers versions
- Adjust some comments, docstrings
2024-12-06 12:19:42 +01:00
860f7838c8 ENH: Updates for upcoming BNB Int8 release (#2245)
* Updates to prepare for bitsandbytes release
2024-12-05 11:09:56 -05:00
15712db4a0 FIX Prevent CUDA context initialization due to AWQ (#2230)
Importing from AWQ triggers CUDA context initialization, which can be
problematic in some circumstances (see #1877). This PR moves the import
so that it's local, preventing this issue.
2024-12-05 14:00:45 +01:00
f86522e011 FIX Correctly determine word embeddings on Deberta (#2257)
After a recent change in
transformers (https://github.com/huggingface/transformers/pull/22105),
PEFT could no longer determine the word embeddings from Deberta. This PR
provides a very minimal fix that correctly determines the word
embeddings again.

Details

Previously, the word embeddings were determined in the following manner:

1. Find the transformers_backbone by checking the base model's children
for PreTrainedModel instances
2. If not found, the model itself is considered the transformers
backbone.
3. On the backbone, check for modules whose weight has the same size as
the vocab size. This module is now assumed to be the word embeddings.

Before the mentioned transformers PR, 1. did not find anything, so 2.
was applied. After the PR, however, the DebertaEncoder is now an
instance of PreTrainedModel (asked internally, this is intended).
Therefore, the encoder is now considered the transformer backbone. But
the encoder does not have the word embeddings attribute, therefore step
3. fails.

The fix of this PR is to first explicitly check for
model.embeddings.word_embeddings and if this attribute is found, use it
as the word embeddings. Only when it's not found do we use the other
method described above. This way, we can successfully determine the word
embeddings on models like Deberta.

This whole code is a bit messy and could probably be improved. However,
changing the logic too much could inadvertently break for some existing
models that are not included in the tests. Therefore, I chose this
method which leaves the existing logic mostly intact.
2024-12-04 15:34:45 +01:00
c05758989d FIX Correctly pass low_cpu_mem_usage argument (#2253)
There was a bug that when creating a PEFT model with the task_type
argument, the low_cpu_mem_usage argument was not passed along. This is
now fixed and unit tests for this were added.

This is a very niche bug because there is typically no need to pass
low_cpu_mem_usage=True when calling get_peft_model. Moreover, as the
option for this was only added recently (#2142) and is unreleased, few
if any users should be affected by the bug.
2024-12-03 17:04:09 +01:00
3f9ce553e2 DOC Update CPT docs, add example (#2229)
Update CPT docs and add example notebook.
2024-11-29 12:50:59 +01:00
131efba5d4 FIX TST Small regression in BNB LoRA output (#2238)
Our regression tests reveal that the 8bit LoRA BNB regression test is
failing. To reproduce, run:

pytest tests/regression/test_regression.py -s --regression -k
test_lora_8bit

The regression was introduced in #2122. We didn't notice this earlier
because of other failing tests in the nightly CI.

The cause of the error is subtle. In the original code, we would
calculate the LoRA output, convert the dtype if necessary, then add it
to the base output. After the mentioned PR, we calculate the LoRA
output, add it to the base output, then convert the dtype if necessary.
The difference is very small on a per layer basis, but it can accumulate
over the layers, leading to a significant difference in outputs, as
witnessed by the regression test.

This PR rolls back this specific part of the PR (both for 8bit and 4bit)
while leaving the main change of that PR intact.
2024-11-28 11:25:00 +01:00
943daf1de4 ENH Argument to enable bias for LoRA B (#2237)
This PR adds the argument lora_bias which, if set to True (default:
False), adds a bias term to the LoRA B module.

Typically, this should be disabled. The main use case is when the LoRA
weights were extracted from fully fine-tuned parameters, so the bias of
those parameters can be taken into account.

Merging is supported for this argument when using vanilla LoRA layers or
bitsandbytes LoRA layers. Other types of LoRA layers don't support
merging.

This option is also disabled for non-standard LoRA weight initialization
like LoftQ, as well as for embedding layers (since they use
nn.Parameter).
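
A minimal sketch of the new option (target modules are placeholders):

from peft import LoraConfig

config = LoraConfig(
    r=16,
    target_modules=["q_proj", "v_proj"],  # placeholder target modules
    lora_bias=True,  # adds a trainable bias term to the LoRA B module (default: False)
)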
2024-11-27 18:37:10 +01:00
J.L
60978d759b ENH Improvements to Bone method (#2233)
New Bone is more memory efficient and faster, but at the cost of
slightly worse performance. The old Bone implementation can still be
used by passing init_weights="bat" to the config.
2024-11-27 13:00:00 +01:00
34e15be828 Bump version of MacOS from 12 to 13 (#2235)
Version 12 will be deprecated in the coming month, and there are
already some problems with it, so we might just as well upgrade.
2024-11-27 12:57:05 +01:00
d13d7a401c TST: Skip test on multi-GPU as DataParallel fails (#2234)
This test fails in multi-GPU setting because transformers.Trainer
switches to DataParallel. As this is not a commonly used parallelization
strategy, it should be okay to just skip this.
2024-11-26 16:40:39 +01:00
ca1b3b1730 TST Update Llava model id in test (#2236)
Currently PEFT tests are failing because 2 trl internal models that we
relied on for testing were moved (one has also been changed). The new
models have now been copied to peft-internal-testing to avoid this in
the future.

I have updated peft-internal-testing/tiny_T5ForSeq2SeqLM-lora to use the
new copied model in peft-internal-testing as the base model (no change
in PEFT code necessary). This model also required the LoRA adapter to be
updated, as the shapes of the base model were changed.

This PR updates the used Llava model id to now use the copy of that
model that is inside of peft-internal-testing.
2024-11-26 15:26:45 +01:00
6a533b783d CI: Fix failing torchao test (#2232)
The test failure is caused by a dangling accelerate environment variable
indirectly set by previous tests through TrainingArguments, which
results in torchao using fp16, which breaks.

The current fix is to delete the env var for torchao. In the future, we
may use an accelerate function to clean up these env vars.

---------

Co-authored-by: githubnemo <githubnemo@users.noreply.github.com>
2024-11-25 16:16:14 +01:00
eaaf03c127 TST: Eva: Speed up consistency tests (#2224)
The EVA consistency tests are currently among the slowest tests overall,
taking roughly 4x 50 sec with an overall test runtime of 15-20 min, so
they make up a significant fraction of that runtime.

With this PR, the number of iterations until convergence is reduced by
passing a lower tau value and by reducing the number of tested seeds.

Overall, this cuts the runtime down to ~20 sec or less.

Besides this change, I made some smaller adjustments to EVA:

- break long lines
- hide progress bar in tests
- move an abs call for better consistency in test
2024-11-22 11:39:40 +01:00
029faf6eea ENH: EVA: Deterministic behavior of SVD on multi gpu setups (#2225)
Also: Some improvements to IncrementalPCA.
2024-11-21 16:32:17 +01:00
04437347da ENH Validation for task_type in PEFT config (#2210)
Raises an error when invalid task type is provided.
2024-11-21 16:29:23 +01:00
0155fa814a CI: Skip EETQ tests while broken (#2226)
EETQ tries to import the shard_checkpoint function from transformers but
it has been removed in the latest version. Therefore, trying to use EETQ
currently results in an import error that crashes all the tests. This
fix results in EETQ tests being skipped if there is an import error.

The issue has been reported to EETQ:

https://github.com/NetEase-FuXi/EETQ/issues/34
2024-11-21 14:05:15 +01:00
d9aa0898e4 FIX Correctly set device of input data in bnb test (#2227)
Fixes failing CI bitsandbytes CI tests.

The input data was on CPU but the model is on GPU, resulting in an
error. Note that this mismatch was not an issue previously because
accelerate added an AlignDevicesHook that took care of this. With the
latest accelerate, this is no longer the case, whereas the hook is
present for v1.1.1.

In any case, it's better that we set the device of the data explicitly,
so I think this is a safe change. But it would be good to know what
happened that caused this change.
2024-11-21 10:58:53 +01:00
8874ab5ed8 CI Update AutoAWQ version to fix CI (#2222)
Currently, the CI fails because awq tries to import a non-existing
function from transformers (presumably it was there at one point but no
longer is):

>     from transformers.modeling_utils import shard_checkpoint
> E   ImportError: cannot import name 'shard_checkpoint' from 'transformers.modeling_utils' (/opt/conda/envs/peft/lib/python3.11/site-packages/transformers/modeling_utils.py)

This has been fixed in awq v0.2.7. Therefore, this newer version is now
used in CI.
2024-11-19 16:53:04 +01:00
f8dbeb385a TST Move slow compile tests to nightly CI (#2223)
These tests use torch.compile on semi-realistic models and are thus slow
to execute. In sum, they take ~3 min to finish, with an overall CI
runtime of ~15 min, so it's significant.

As these tests are very unlikely to be affected by most code changes, it
should be fine to move them to the nightly CI instead of running them on
each PR. Also, the presence of GPUs might speed the tests up.
2024-11-19 16:25:42 +01:00
b297a169ad FIX Checks for loftq_config attribute in LoraConfig (#2215)
Improve logic in LoftQ checks during init.
2024-11-19 13:59:04 +01:00
a1d0fc7e79 FEAT: Add Context-aware Prompt Tuning (#2168)
Adds CPT: "Context-aware Prompt Tuning: Advancing In-Context Learning
with Adversarial Methods" from https://arxiv.org/abs/2410.17222.
2024-11-19 13:57:48 +01:00
3a8afbe2aa [FIX] EVA meta device fix, multi-gpu functionality (#2218)
- important bugfix for meta device check
- add multi gpu functionality and example
- update docs
2024-11-18 16:31:48 +01:00
221965b7e1 FEAT Add EVA initialization method to LoRA
Implements the paper "One Initialization to Rule them All: Fine-tuning
via Explained Variance Adaptation" (https://arxiv.org/abs/2410.07170).

This LoRA initialization results in better initial values for the LoRA
weights and better distribution of LoRA ranks.
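
A hedged, config-only sketch (the full EVA workflow additionally requires running the initialization on a dataloader, which is omitted here; argument names reflect my understanding and may differ in detail):

from peft import EvaConfig, LoraConfig

config = LoraConfig(
    r=16,
    target_modules=["q_proj", "v_proj"],  # placeholder target modules
    init_lora_weights="eva",
    # rho controls how much ranks may be redistributed across layers (assumed default: 2.0).
    eva_config=EvaConfig(rho=2.0),
)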
2024-11-12 23:24:48 +01:00
162d7e57ee FIX Dataset revision in example (#2207) 2024-11-09 22:15:24 +01:00
b1fd97dc3e FIX Several bugs in LoKr (#2180)
- Added rank_dropout_scale parameter 
- Fix scale related corrections
- Added lycoris weight initialization
2024-11-05 15:31:12 +01:00
J.L
13fb29f0cb FEAT Add Bone method (#2172)
Implements the method: "Block Affine Transformation as Parameter
Efficient Fine-tuning Methods for Large Language Models" described in
https://arxiv.org/abs/2409.15371.
2024-11-05 13:44:42 +01:00
7295b332d9 ENH Add notebook using lm-eval-harness toolkit (#2190) 2024-11-04 15:12:01 +01:00
a4f35971cd FIX Issue with rank_pattern and alpha_pattern (#2195) 2024-11-04 12:58:15 +01:00
4e57aa5b08 FIX Dora finetuning example collate fn (#2197) 2024-11-04 12:57:27 +01:00
b5b902368d FIX Check for prefix tuning + grad checkpointing (#2191)
See #869

Since transformers is moving to the new cache implementation, we had to
change prefix tuning to use this too. However, caching does not work
with gradient checkpointing. Therefore, this currently runs into an
error about size mismatches.

Now, PEFT checks for gradient checkpointing and raises a helpful error.
2024-11-01 10:48:13 +01:00
5cda3a883c FIX: Prefix tuning with model on multiple devices (#2189)
See #2134

After introducing the usage of DynamicCache for prefix tuning, a bug
could now occur if the model is dispatched to different devices. This is
because we need to move the key and value cache for each layer to that
layer's respective device.

The new code mostly consists of code copied from transformers to be
consistent with how transformers solves this.
2024-11-01 10:48:00 +01:00
8eeae0a63f TST: Skip AQLM test that is incompatible with torch 2.5 (#2187)
See: https://github.com/Vahe1994/AQLM/pull/139

It is unclear if/when AQLM will fix this and if there might not be other
issues with torch 2.5.
2024-10-30 14:04:25 +01:00
ff6dd9ed7f ENH: Warn when loading PiSSA/OLoRA together with other adapters (#2186)
Resolves #2184

Since PiSSA/OLoRA modifies the base weights, it should not be combined
with other adapters. We now warn users about that and tell them how to
mitigate this.
2024-10-30 10:16:37 +01:00
214345ee47 ENH Check layers to transforms and layer pattern (#2159) 2024-10-29 15:13:56 +01:00
9c730d7544 DOC: fix broken link in the README of loftq (#2183) 2024-10-29 11:50:28 +01:00
b3176eff49 FIX: Import location of HF hub errors (#2178)
Resolves #2097

Import errors from huggingface_hub.errors

Also set min version to 0.25.0
2024-10-28 11:49:02 +01:00
28a5ba1127 FIX VeRA failure on multiple GPUs (#2163)
The shared buffers vera_A and vera_B could be on the wrong device when
using multiple GPUs, resulting in an error. This PR moves them to the
correct device to fix the error.

Since these buffers are shared, I chose *not* to move the whole buffer
to the device. Instead, when we create the slices from those buffers
during forward, I move only those slices to the correct device. This could be inefficient
in terms of runtime, but IIUC, the alternative would be to create new
copies of these buffers per device, using more memory.

The failing tests were introduced in #2076 but the error was already
there beforehand.

I did not discover these failing tests earlier because we had a
concurrent error caused by a transformers issue which looked very
similar and I wrongly assumed that the VeRA error was caused by the same
issue. But now that the issue has been fixed, the error still persists,
prompting me to investigate.
2024-10-25 15:08:17 +02:00
8d545c6c3b DOC: Extend modules_to_save doc with pooler example (#2175)
See #2171

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2024-10-25 11:09:31 +02:00
004143422f MNT Update docker nvidia base image to 12.4.1 (#2176) 2024-10-24 14:36:50 -04:00
b5db9c9350 MNT: Enable Python 3.12 on CI (#2173)
Python 3.8 was removed recently, adding 3.12 now.
2024-10-24 16:59:47 +02:00
fb6108a78e Fix to prefix tuning to fit transformers (#2096)
See #869, #1962

Fix several issues caused by changes to cache in transformers. In
particular, past_key_values for prefix tuning is now converted to a
transformers Cache instance.

---------

Co-authored-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz>
2024-10-24 10:10:15 +02:00
cff2a454ad FEAT Add hotswapping functionality (#2120)
The idea of hotswapping an adapter is the following: We can already load
multiple adapters, e.g. two LoRAs, at the same time. But sometimes, we
want to load one LoRA and then replace its weights in-place with the
LoRA weights of another adapter. This is now possible with the
hotswap_adapter function.

In general, this should be faster than deleting one adapter and loading
the adapter in its place, which would be the current way to achieve the
same final outcome. Another advantage of hotswapping is that it prevents
re-compilation in case the PEFT model is already compiled. This can save
quite a lot of time.

There are some caveats for hotswapping:

- It only works for the same PEFT method, so no swapping LoRA and LoHa.
- Right now, only LoRA is properly supported.
- The adapters must be compatible (e.g. same LoRA alpha, same target
  modules).
- To avoid recompilation, ranks must be identical

See also https://github.com/huggingface/diffusers/pull/9453
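
A hedged usage sketch (import path and signature follow my reading of the PR and may have changed; model id and adapter paths are placeholders):

from transformers import AutoModelForCausalLM
from peft import PeftModel
from peft.utils.hotswap import hotswap_adapter  # assumed import path

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # placeholder model
model = PeftModel.from_pretrained(base_model, "path/to/lora_a")  # placeholder adapter
# Optionally compile here; hotswapping avoids triggering recompilation.
# Replace the loaded LoRA weights in-place with those of a compatible adapter
# (same PEFT method, same alpha/target modules, same rank to avoid recompilation).
hotswap_adapter(model, "path/to/lora_b", adapter_name="default")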
2024-10-23 13:33:55 +02:00
0d5894271b ENH Improve err msg for target modules (#2169) 2024-10-22 17:25:48 +02:00
095e86c036 MNT Remove Python 3.8 since it's end of life (#2135)
The end of life of Python 3.8 has arrived:

https://devguide.python.org/versions/

Therefore, Python 3.8 is removed from CI.

By default, Python 3.11 is now used.

Python 3.12 should be added to the CI matrix now, but that's for a
separate PR.

Also fixed:

The workflow tried to build on top of docker/README.md because globbing
was too broadly defined.

Reduce unnecessary steps to hopefully get disk space usage down, as
GitHub action currently fails with not enough disk space.
2024-10-22 16:43:53 +02:00
7717550a08 FIX fsdp_auto_wrap_policy for some models (#2167)
Some transformers models and custom models would throw an error when
used with PEFT's fsdp_auto_wrap_policy. This is problematic because
Trainer applies the policy automatically when PEFT and FSDP are
detected. Now there is no error.
2024-10-22 11:57:07 +02:00
d5f4e6dfe5 ENH Improve HRA speed and docs (#2160) 2024-10-21 17:12:47 +02:00
e8259ff7bc DOC Improve docs for layers_pattern argument (#2157)
Addresses part of #2155.

Also fix type annotations where appropriate.
2024-10-18 14:23:41 +02:00
57a452ac11 MNT Remove version pin of diffusers (#2162)
The pin was added way back in #936 and then we forgot to ever remove it.

This is now causing trouble, as the old diffusers version still uses
cached_download, which was removed from huggingface_hub:

> ImportError: cannot import name 'cached_download' from 'huggingface_hub'
2024-10-18 12:31:27 +02:00
58a9976284 FIX Missing low_cpu_mem_usage argument (#2156)
The newly introduced low_cpu_mem_usage argument was not propagated to
the add_adapter method of all PeftModel task types. This is now fixed
and tests were added.
2024-10-18 10:43:48 +02:00
338aeff38a ENH Faster DoRA when no dropout is used or in eval mode (#2122) 2024-10-16 19:18:17 +02:00
62f71e335f FIX: Sft train script FSDP QLoRA embedding mean resizing error (#2151)
Resizing the embedding layer with mean_resizing=True, which has been
introduced in transformers > 4.45, will result in an error. This is
because for FSDP + QLoRA the embedding matrix can be on meta device, in
which case mean resizing fails. Therefore, if these conditions are
detected, the script will set mean_resizing=False.

Also updated the recommended package versions to newer versions that I
have checked to be working.
2024-10-16 17:37:59 +02:00
93ddb1015a FIX Use SFTConfig instead of SFTTrainer keyword args (#2150)
Update training script using trl to fix deprecations in argument usage.
2024-10-15 11:26:42 +02:00
c039b00358 FIX Don't assume past_key_values for encoder models (#2149)
Don't assume that past_key_values is part of the model_kwargs.

This fix is similar to #2140 but for encoder-decoder models. It became
necessary after https://github.com/huggingface/transformers/pull/34048
was merged into transformers.
2024-10-14 12:36:12 +02:00
749b924562 Bump version to 0.13.3.dev0 (#2145)
After the patch release of PEFT v0.13.2, let's bump the dev version of
PEFT to v0.13.3.dev0 so that it stays ahead (the bugfix from the patch
release is already contained in the main branch).
2024-10-12 00:20:05 +05:30
c925d0ae25 FIX Bug in target module optimization if suffix (#2144)
Solves the following bug:

https://github.com/huggingface/diffusers/pull/9622#issuecomment-2404789721

The cause for the bug is as follows: When we have, say, a module called
"bar.0.query" that we want to target and another module called
"foo_bar.0.query" that we don't want to target, there was potential for
an error. This is not caused by _find_minimal_target_modules directly,
but rather the bug was inside of BaseTuner.inject_adapter and how the
names_no_target were chosen. Those used to be chosen based on suffix. In
our example, however, "bar.0.query" is a suffix of "foo_bar.0.query",
therefore "foo_bar.0.query" was *not* added to names_no_target when it
should have. As a consequence, during the optimization, it looks like
"query" is safe to use as target_modules because we don't see that it
wrongly matches "foo_bar.0.query".
2024-10-10 16:43:28 +02:00
0aa7e3a221 FIX TST NaN issue with HQQ GPU test (#2143)
This test calculates the correlation coefficient of HQQ model outputs.
Although the model outputs are finite, the resulting matrix contains
NaNs. Casting the outputs from 16 to 32 bit precision resolves the
issue.
2024-10-10 14:40:54 +02:00
5758a7eb1c ENH LoRA notebook for NER task (#2126) 2024-10-10 11:04:16 +02:00
1eab9bd10f FIX Prompt learning with latest transformers error (#2140)
The error in PEFT is occurring after this transformers change:

https://github.com/huggingface/transformers/pull/33870

Now, in our tests, some model_kwargs no longer necessarily contain
past_key_values, resulting in a KeyError. We now account for this
possibility. Affected models were opt and gpt2.
2024-10-09 17:21:03 +02:00
8efa0cb735 FIX Raise mixed adapter infer with missing adapter (#2090)
PEFT allows mixed batch adapter inference, i.e. when predicting, the
same batch can use different adapters by passing the adapter_names
argument. However, when users pass an adapter name that does not
correspond to any of the existing adapters, these samples are currently
being ignored (i.e. just the base model output is used). This is
unexpected and can easily lead to errors, e.g. when users mistype the
name of an adapter.

This PR fixes this issue by checking all the existing adapter names
first and comparing them to the adapter_names that the user passed. If
there are unexpected entries, an error is raised.

Due to this fix, an error in the test
test_mixed_adapter_batches_lora_merged_raises was discovered and
promptly fixed.
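
For context, a hedged sketch of mixed adapter batch inference; with this fix, a misspelled entry in adapter_names raises instead of silently falling back to the base model (to my knowledge, "__base__" is the reserved name for the base model):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "facebook/opt-125m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
base_model = AutoModelForCausalLM.from_pretrained(model_id)

model = get_peft_model(base_model, LoraConfig(r=4, task_type="CAUSAL_LM"), adapter_name="adapter_a")
model.add_adapter("adapter_b", LoraConfig(r=4, task_type="CAUSAL_LM"))
model.eval()  # adapter_names is an inference-only feature

inputs = tokenizer(["hello", "world", "peft"], return_tensors="pt", padding=True)
# One adapter name per sample; "__base__" requests the plain base model for that sample.
outputs = model(**inputs, adapter_names=["adapter_a", "__base__", "adapter_b"])
# Passing a name that matches no adapter (e.g. the typo "adaptr_b") now raises an error.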
2024-10-09 15:53:28 +02:00
85e3202a00 ENH Make PEFT configs forward compatible (#2038)
Right now, loading a PEFT config saved with a more recent PEFT version
than is currently installed will lead to errors when new arguments are
added to the config in the newer PEFT version. The current workaround is
for users to manually edit the adapter_config.json to remove those
entries.

With this PR, PEFT will make an attempt at removing these unknown keys
by inspecting the signature. The user will be warned about these removed
keys. This should generally be a safe measure because we typically do
not introduce new config settings that change the default behavior.
However, if a non-default is used, this could lead to wrong results.
This is mentioned in the warning.

While working on the tests, I also converted the unittest.TestCase to a
normal pytest test in order to be able to use pytest fixtures.

I also plan on adding the PEFT version to the adapter_config.json in the
future. This will allow us to better handle compatibility issues in the
future. As adding that new key to all PEFT configs could cause a lot of
disruption, I want to get this PR in first to ensure forward
compatibility.

Note that this new mechanism will not help anyone using a PEFT version
< 0.14.0, so this will be a slow transition.
2024-10-09 12:37:49 +02:00
3b314cc98b FIX Type annotations in vera/bnb.py (#2139)
The file was missing the from __future__ import annotations part. As
this code is only running nightly with GPU, the normal CI missed this
omission.
2024-10-09 15:46:46 +05:30
a724834ac4 FIX: PiSSA now works with Conv1D layers (#2103) (#2104)
Transpose weight matrix based on fan_in_fan_out condition in PiSSA
initialization.

Co-authored-by: Yang Su <suyang360@gmail.com>
2024-10-08 18:44:22 +02:00
9918977ecf FEAT: Support torchao (#2062)
Supports torch AO quantization. Currently supported:

- int8_weight_only
- int8_dynamic_activation_int8_weight
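
A hedged sketch of combining this with LoRA (the exact TorchAoConfig arguments may differ between transformers versions; the model id is a placeholder):

from transformers import AutoModelForCausalLM, TorchAoConfig
from peft import LoraConfig, get_peft_model

quant_config = TorchAoConfig("int8_weight_only")  # one of the supported quant types
base_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",  # placeholder model
    quantization_config=quant_config,
)
peft_model = get_peft_model(base_model, LoraConfig(r=8, task_type="CAUSAL_LM"))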

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
2024-10-08 18:10:19 +02:00
5e91b54635 Bump version to 0.13.2.dev0 (#2137) 2024-10-08 16:36:34 +02:00
859fd880e6 FEAT: VeRA quantization using bitsandbytes (#2070) (#2076)
VeRA can now be used with 4bit and 8bit bnb quantization.
2024-10-07 15:00:42 +02:00
e6f927bfec FIX BC breaking change to boft conv2d scaling variable (#2127) 2024-10-07 11:44:38 +02:00
8d9ecbed08 FEAT: Adding exclude modules param(#2044) (#2102)
Allows to exclude target modules.
2024-10-03 13:08:08 +02:00
d9d3059e94 ENH: Warn when from_pretrained misses PEFT keys (#2118)
After merging #2084, we now clean up the missing_keys when loading a
PEFT adapter to remove all but the relevant keys (the fact that base
model keys are missing is expected when loading a PEFT adapter).

Since the presence of missing_keys now really means that something might
have gone wrong during loading, we can now warn the user if they call
PeftModel.from_pretrained.

Note that load_adapter still does not warn, as here we return the
load_result and users can already check, but for from_pretrained, they
don't have that possibility.
2024-10-02 18:52:00 +02:00
534d361e7c TST Mark flaky X-LoRA test as xfail (#2114)
Currently, CI is failing constantly because one of the X-LoRA tests has
become flaky lately, most likely caused by the transformers 4.45.0
release. Therefore, this test is now marked as a non-strict xfail.

I cannot reproduce this error locally, neither on CPU nor GPU. It is
thus unclear how to fix this test.
2024-10-02 18:31:01 +02:00
ca8462bb68 FIX low_cpu_mem_usage consolidates devices (#2113)
See: https://github.com/huggingface/diffusers/pull/9510#issuecomment-2378316687

Right now, the low_cpu_mem_usage=True option does not consolidate the
devices. E.g. when the model is on GPU and the state_dict on CPU, the
adapter weight will be on CPU after loading, when it should be GPU. This
fix ensures that the devices are consolidated.
2024-10-02 17:27:26 +02:00
ae297f0799 ENH: Improved attribute access for modules_to_save (#2117)
Resolves #2099

So far, if a module was wrapped due to modules_to_save, we handled
access to the weight and bias attributes (albeit incorrectly in the case
of disabled adapters!). However, accessing any other attribute resulted
in an error.

Instead of special properties, we now implement a generic __getattr__
method that can deal with any attribute. The implementation is a bit
complex to take into account the way that torch.nn.Module handles
__getattr__.
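
A generic illustration of the pattern (not the actual ModulesToSaveWrapper code): fall back to the wrapped module for any attribute the wrapper itself does not define.

import torch.nn as nn

class Wrapper(nn.Module):
    def __init__(self, module: nn.Module):
        super().__init__()
        self.original_module = module

    def __getattr__(self, name: str):
        try:
            # nn.Module.__getattr__ resolves parameters, buffers and submodules.
            return super().__getattr__(name)
        except AttributeError:
            # Delegate everything else to the wrapped module.
            return getattr(self.original_module, name)

wrapped = Wrapper(nn.Linear(4, 4))
print(wrapped.in_features)  # resolved via the wrapped nn.Linear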
2024-10-02 12:43:05 +02:00
2a807359bd FIX Refactor OFT, small changes to BOFT (#1996)
The previous OFT implementation contained a few errors, which are fixed now.

Unfortunately, this makes previous OFT checkpoints invalid, which is why an
error will be raised. Users are instructed to either retrain the OFT adapter or
switch to an old PEFT version.
2024-10-01 16:51:18 +02:00
aa3bd8fbf6 DOC Update source install instruction (#2110) 2024-09-30 11:03:41 +02:00
c29810bad2 FIX: Change check if past_key_values is empty (#2106)
After transformers merged this PR:

https://github.com/huggingface/transformers/pull/33703

The bool of past_key_values (a Cache instance) would change from False
to True in one of our checks. Use get_seq_length() method instead, which
is consistent before and after that commit.

I checked the tests with the new change for both transformers before and
after that commit and they passed, so this change should be backwards
compatible.

Unrelated change: Mark X-LoRA scaling test as xfail-ing for now.

This should be addressed in a separate PR. Marking it to xfail for now
to get the original fix through CI.
2024-09-27 16:17:39 +02:00
ccc350151f FIX Reduce false positive missing keys when loading adapter (#2084)
When loading a PEFT adapter, a lot of missing keys are reported, because the
base model weights are not loaded. However, this is totally fine. Therefore,
those missing keys can be safely ignored.

When using from_pretrained, the missing keys won't be returned to the user,
thus there is no room for confusion. But when using load_adapter, the missing
keys (and unexpected keys) are returned and can cause confusion. With this PR,
the missing keys are filtered to remove keys that are unrelated to the adapter.

A small gap is VB-LoRA which reports missing keys because the vector bank
parameters are actually only loaded once and then shared.
2024-09-25 15:35:16 +02:00
0f9bdad7fa ENH Support Conv3d layer in LoRA and IA3 (#2082) 2024-09-25 15:05:22 +02:00
58ca0ad26f Bump version to 0.13.1.dev0 (#2094) 2024-09-25 18:26:38 +05:30
f0b066eae8 Release v0.13.0 (#2093) 2024-09-25 13:09:08 +02:00
8f39708650 ENH: Better DoRA check in mixed adapter batch inference (#2089)
This is a bit of an edge case, but I noticed this while working on
something else.

PEFT allows mixed batch adapter inference, i.e. when predicting, the
same batch can use different adapters by passing the adapter_names
argument. However, this is not supported for DoRA (yet), so there is a
check that raises an error if DoRA is used.

Previously, this check would check all adapters for DoRA, even if those
adapters are not being used in adapter_names. This was unnecessarily
strict and with this PR, we only check the adapters that are actually
being used.
2024-09-24 10:16:31 +02:00
f4cf170a9c DOC Docstring of load_adapter, type annotation (#2087) 2024-09-23 11:18:24 +02:00
b67c9b64fd FIX: Bug in find_minimal_target_modules (#2083)
This bug was reported by Sayak and would occur if a required suffix itself
ended with a string that was already determined to be required, in which
case this required suffix would not be added.

The fix consists of prefixing a "." to the suffix before checking if it is
required or not.

On top of this, the algorithm has been changed to be deterministic.
Previously, it was not deterministic because a dictionary that was
looped over was built from a set, and sets don't guarantee order. This
would result in the loop being in arbitrary order.

As long as the algorithm is 100% correct, the order should not matter.
But in case we find bugs like this, the order does matter. We don't want
bugs to be flaky, therefore it is best to sort the dict and remove
randomness from the function.

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2024-09-23 11:16:29 +02:00
5efeba1856 ENH: Add default target layers for gemma2 architecture (#2078)
Google's gemma 2 models have a slightly different architecture than
gemma 1 and thus a different model_type attribute. This PR adds default
target_layer for gemma 2 that correspond to the default target_layer of
gemma 1.

LayerNorm tuning adds one more LN layer.
2024-09-23 11:15:08 +02:00
af275d2d42 ENH: Allow empty initialization of adapter weight (#1961)
This PR allows initializing the adapter weights as empty, i.e. on the
meta device, by passing low_cpu_mem_usage=True.

Why would this be useful? For PEFT training, it is indeed not useful, as
we need the real weights in order to train the model. However, when
loading a trained PEFT adapter, it is unnecessary to initialize the
adapters for real, as we override them with the loaded weights later.

In the grand scheme of things, loading the base model will typically be
much slower, but if the user loads, say, dozens of adapters, the
overhead could add up. Of course, besides loading the model, this has no
performance impact and is thus not a high priority feature.

For the time being, this is completely opt-in. However, it should be safe to
make this the default when loading adapters. Therefore, in the future we may
change the default there.
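
A hedged sketch of the opt-in usage when loading a trained adapter (model id and adapter path are placeholders):

from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # placeholder model
# Adapter weights are first created on the meta device and then overwritten by the
# loaded checkpoint, skipping the unnecessary real initialization.
model = PeftModel.from_pretrained(base_model, "path/to/adapter", low_cpu_mem_usage=True)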

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2024-09-23 11:13:51 +02:00
9bc670eafb MNT Update author email in setup.py (#2086) 2024-09-23 10:43:57 +02:00
5d944589d2 ENH Expose bias of ModulesToSaveWrapper (#2081) 2024-09-20 19:35:24 +02:00
152ed70b00 ENH PiSSA/OLoRA: Preserve original config on save (#2077)
Resolves #2075

When saving PiSSA or OLoRA with the option to convert to normal LoRA,
the LoRA weight shapes change, which means that some values like r and
alpha need to be adjusted in the saved PEFT config. However, these
modifications should be limited to the saved config, while the loaded
config should stay the same.

This PR implements this change by creating a copy of the config before
modifying it.
2024-09-20 12:11:24 +02:00
f5dd2acfed TST Skip some quantization tests on XPU (#2074)
Eetq/hqq/aqlm don't support XPU yet.
2024-09-18 11:27:19 +02:00
3b2ebf1ba1 FIX Bug that prevents BOFT from loading 2 adapters (#2068)
There was a bug in BOFT that made it impossible in some circumstances to
load more than one adapter (creating more than 1 adapter was possible
though). This was because a code path that adjusts
boft_n_butterfly_factor was only visited when creating a fresh adapter,
but not when updating with the 2nd adapter. This was fixed by moving
this code path from the BOFT layer's __init__ method to update_layer.

A test for loading multiple adapters was added. Since this was a gap in
our test suite, this test will be applied to all appropriate PEFT
methods, not only BOFT, but the others methods are all passing without
needing further changes.

For good measure, I also added BOFT to the test suite that checks
multiple active adapters. These tests would have also passed without the
fix in this PR, since these tests do not load multiple adapters but
instead create them, which always worked. Still it's better to have
these tests as well.
2024-09-18 11:19:16 +02:00
adf0a1dc96 ENH Multi adapters in same batch: modules_to_save (#1990)
Extend the functionality of having different adapters in the same batch to also
work with `modules_to_save`.
2024-09-17 13:50:47 +02:00
18f3efe5c0 MNT Update deprecated evaluation_strategy (#1664)
In docs and examples, use eval_strategy instead of evaluation_strategy, which is
deprecated.
2024-09-13 18:01:26 +02:00
4a8dedb2a7 FIX Command line args in PiSSA preprocess (#2053)
Fix bug in parsing command line arguments in the PiSSA preprocess.py script from
the PiSSA example.
2024-09-13 13:59:27 +02:00
25202271bc ENH BOFT don't save boft_P buffer (#2050)
The buffer does not need to be part of the checkpoint; by making it
non-persistent, the file size can be greatly reduced.
2024-09-13 13:56:47 +02:00
214f891cd2 MAINT: Give stale bot permissions for PRs too (#2064) 2024-09-12 12:18:20 -04:00
7868d0372b MNT Permission for GH token in stale.yml (#2061) 2024-09-11 12:36:25 +02:00
734ea9a014 TST Make X-LoRA tests faster (#2059)
After some recent optimizations, the X-LoRA tests are now the slowest
ones. Part of that is that the lora adapters are re-created for each
test. By changing the fixture scope, they're now only created once. I
think this should be safe, as these files are not modified in the tests.

I also enabled test_scalings_logging_methods with the latest
transformers to ensure that this test also passes.
2024-09-11 12:13:24 +02:00
54be5a3db6 TST Speed up vision model tests (#2058)
The HRA vision model test is extremely slow on CI (> 600 sec, 50% of
total time). This change speeds up the test by using a smaller ResNet
model to run the tests.

It's still not clear why HRA was so slow specifically -- LoRA is 40x
faster -- but that can be fixed separately.
2024-09-10 16:15:51 +02:00
b180ae46f8 TST Fewer inference steps for stable diffusion (#2051)
Reduce the number of inference steps for stable diffusion tests. These
tests are the slowest ones on CI, this should help (~3 min on average).
2024-09-06 09:57:56 +02:00
31fbbd2203 FIX TST Scalings logging test latest transformers (#2042)
Fix test for latest transformers, skip for earlier versions.
2024-09-05 14:50:46 +02:00
c9f7240afc FEAT Add VB-LoRA (#2039)
Implements "VB-LoRA: Extreme Parameter Efficient Fine-Tuning with Vector
Banks"

https://arxiv.org/abs/2405.15179
2024-09-04 11:02:34 +02:00
95b39642fb FIX: Small numerical discrepancy for p-tuning after loading the model (#2047)
There is a small numerical discrepancy between the outputs of a p-tuning
model before and after loading. Even though it is small, it can still
affect generations, so this PR eliminates it.

As an example, without the fix, this is the difference in logits for
opt-125m:

>       torch.testing.assert_close(output_loaded, output_peft)
E       AssertionError: Tensor-likes are not close!
E
E       Mismatched elements: 30 / 10557120 (0.0%)
E       Greatest absolute difference: 1.1086463928222656e-05 at index (0, 9, 9314) (up to 1e-05 allowed)
E       Greatest relative difference: 0.00021288332936819643 at index (0, 9, 9314) (up to 1.3e-06 allowed)

Details about how this comes about are explained here:

https://github.com/huggingface/peft/issues/2043#issuecomment-2321522577

The gist of it is that if we take a single sample, repeat it X times,
and then forward it through a model (which is the training path in
p-tuning), we would expect the same output as if we forwarded this
sample only once and repeated the output X times (the inference path for
p-tuning). However, for sufficiently large models, the two approaches
can have tiny differences.

With the fixed approach, there is no difference between training and
inference code paths when it comes to this. The new code should also be
slightly more compute efficient, but in practice will not make a
noticeable difference.
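
A toy illustration of the two code paths (not PEFT code); for sufficiently large models the two results can differ by tiny floating point amounts, which is what accumulated into the discrepancy above:

import torch

torch.manual_seed(0)
layer = torch.nn.Linear(16, 16)
sample = torch.randn(1, 16)

# Training-style path for p-tuning: repeat the input, then forward.
out_repeat_first = layer(sample.repeat(4, 1))
# Inference-style path: forward once, then repeat the output.
out_forward_first = layer(sample).repeat(4, 1)

# For this tiny layer the two paths typically agree; in large models small
# differences can appear and accumulate across layers.
print(torch.allclose(out_repeat_first, out_forward_first))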
2024-09-03 16:52:06 +02:00
37b9c5c74b FIX: Error with OLoRA init when using bnb (#2011) 2024-09-03 14:08:25 +02:00
01275b4cb3 ENH: Faster adapter loading if there are a lot of target modules (#2045)
This is an optimization to reduce the number of entries in the
target_modules list. The reason is that in some circumstances,
target_modules can contain hundreds of entries. Since each target module
is checked against each module of the net (which can be thousands), this
can become quite expensive when many adapters are being added. Often,
the target_modules can be condensed in such a case, which speeds up the
process.

A context in which this can happen is when diffusers loads non-PEFT
LoRAs. As there is no meta info on target_modules in that case, they are
just inferred by listing all keys from the state_dict, which can be
quite a lot. See: https://github.com/huggingface/diffusers/issues/9297

As shown there the speed improvements for loading many diffusers LoRAs
can be substantial. When loading 30 adapters, the time would go up from
0.6 sec per adapter to 3 sec per adapter. With this fix, the time goes
up from 0.6 sec per adapter to 1 sec per adapter.

As there is a small chance for undiscovered bugs, we apply this
optimization only if the list of target_modules is sufficiently big.
2024-09-02 12:59:51 +02:00
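As a rough sketch of the idea (not PEFT's actual algorithm), hundreds of fully qualified names inferred from a state_dict can often be collapsed into a handful of suffixes that match the same modules:

```python
def condense_target_modules(target_modules):
    """Toy illustration: collapse fully qualified module names down to their
    final component, so matching is done against a few suffixes instead of
    hundreds of full paths. PEFT's real optimization is more careful about
    not accidentally matching additional modules."""
    return sorted({name.rsplit(".", 1)[-1] for name in target_modules})

# Roughly what diffusers ends up with when target_modules is inferred from a state_dict:
verbose = [
    f"transformer.blocks.{i}.attn.{proj}"
    for i in range(100)
    for proj in ("to_q", "to_k", "to_v")
]
print(len(verbose), "->", condense_target_modules(verbose))  # 300 -> ['to_k', 'to_q', 'to_v']
```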
679bcd8777 ENH Warn if using tied target modules (#2025)
When users are targeting tied weights (e.g. embedding and LM head),
merging the adapter will lead to errors. Now users are warned about the
possibility when they create such a PEFT model and also when they try to
merge.
2024-08-29 10:51:13 +02:00
850eeb5c3a FIX Pre-commit version in config (#2034) 2024-08-26 11:50:02 +02:00
5996d39408 TST Enable more tests in XPU (#2036) 2024-08-26 11:49:18 +02:00
900f96c40d [Add] DoRA Embedding (#2006) 2024-08-23 20:20:42 +02:00
c3b63ce2c4 ENH Test and DoRA compatibility with XPU 2024-08-23 16:01:50 +02:00
1a5d0f8151 FIX: Don't target the classification head when using target_modules="all-linear" (#2033)
Fixes #2027

When using a transformers sequence classification model,
target_modules="all-linear" should not wrap the classification head with
LoRA. This is because it is already wrapped with ModulesToSave, i.e. it
will be fully fine-tuned, which is the generally desired behavior.

Before this bug fix, the classification head would be double-wrapped.
With #2028, this now raises an error. With this PR, it is avoided
completely. Still, keeping #2028 is good because it helps prevent other
situations where double-wrapping might occur due to misconfiguration.

Note that there is no foolproof method to detect the classification
head; we have to rely on the transformers convention.
2024-08-23 16:00:43 +02:00
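For context, this is the kind of setup the fix concerns (a minimal sketch; the base model is just an example):

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
config = LoraConfig(task_type=TaskType.SEQ_CLS, target_modules="all-linear", r=8, lora_alpha=16)
model = get_peft_model(base, config)

# The classification head is handled via modules_to_save (full fine-tuning);
# with this fix, "all-linear" no longer additionally wraps it with LoRA.
model.print_trainable_parameters()
```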
f3c7c6e5c1 ENH Raise error when applying modules_to_save on tuner layer (#2028)
Relates to #2027

Normally, when selecting the layers for fine-tuning, PEFT already
ensures that the same layer is not targeted for both parameter-efficient
fine-tuning (e.g. LoRA layer) and full fine-tuning (via
modules_to_save), as that makes no sense.

However, there is a loophole when modules_to_save is applied ex
post. This happens, for instance, with a task type like sequence
classification, where PEFT automatically adds the classification head
to modules_to_save for the user. This loophole is now closed by adding a
check to ModulesToSaveWrapper that validates that the targeted layer is
not a tuner layer.

This does not fully resolve #2027 but will raise an early error in the
future to avoid confusion.

On top of this, the error message inside of
ModulesToSaveWrapper.check_module has been slightly adjusted.
Previously, the class name would be used, which can be confusing. E.g.
for LoRA, the class name of the linear LoRA layer is just "Linear",
which looks the same as nn.Linear. Therefore, the full name is now
shown.
2024-08-22 17:10:39 +02:00
8fcb1951a5 MAINT: Update ruff version to ~0.6.1 (#1965)
Moving to ruff ~0.6.1. Changes:

- type comparisons now require `is`: `str is str`
- remove overridden class attribute active_adapter
- remove secondary import of fbd_cuda

Omit jupyter notebooks for now. We can think about adding that in a
separate PR.
2024-08-22 15:23:23 +02:00
fa218e1942 TST test_mixed_adapter_batches_lora_opt_timing on XPU (#2021) 2024-08-21 15:10:19 +02:00
6c832c1dd4 TST Make TestModelAndLayerStatus device-agnostic (#2026) 2024-08-21 12:43:35 +02:00
95821e5ce4 ENH: Better error msg for replace_lora_weights_loftq when using a local model. (#2022)
Resolves #2020

If users want to use a local model, they need to pass the model_path
argument. The error message now says so.
2024-08-21 10:10:54 +02:00
25ab6c9bb2 TST Enable regression tests on XPU (#2019) 2024-08-20 16:13:59 +02:00
b4cf1b3c46 CI Remove regression tests from BNB CI (#2024)
This is a test to see if the BNB CI for multi-backend single-GPU passes
if regression tests are disabled.
2024-08-20 14:15:37 +02:00
eb5eb6efb5 TST Enable test_vera_dtypes on XPU with bf16 (#2017) 2024-08-20 11:25:44 +02:00
f71e89f771 FIX Deprecated params/funcs in X-LoRA example (#2010) 2024-08-20 11:24:38 +02:00
e8ba7de573 CI Activate single core multi backend bnb tests (#2008)
See #1866 for context.

Let's check if this issue has resolved itself by now.
2024-08-16 17:19:20 +02:00
0222450f44 TST: Potentially Skip 8bit bnb regression test if compute capability is too low (#1998)
* TST Potentially Skip 8bit bnb regression test

The 8bit bnb LoRA regression test results are dependent on the
underlying compute capability. The logits are slightly different
depending on the version (up to 0.5 abs diff). Therefore, we now check
the compute capability for this test and skip it if it's too low. This
check may require updating if the hardware of the CI worker is updated.

Note that I have already invalidated the old regression artifacts and
created a new one.

* Fix pytest skip to work without cuda

* Instead of skipping, add a comment to explain

After internal discussion, we think this is the most practical solution
for the time being.
2024-08-16 17:18:25 +02:00
4c3a76fa68 FIX DOC Update X-LoRA docs, some bugfixes (#2002)
Bugs with dtype and loading of LoRA adapters.
2024-08-15 15:29:32 +02:00
670d0fac31 FIX CI Correctly report outcome of bnb import test (#2007) 2024-08-14 20:14:15 +02:00
22f042a107 ENH: Warn when a user-provided model name in the config is renamed (#2004)
Resolves #2001

In PEFT, users can provide a custom base_model_name_or_path argument to
the PEFT config. However, this value is overridden by the model's
name_or_path attribute. This can be surprising for users. Therefore,
there is now a warning about this.

To see why that can be relevant, check the original issue.
2024-08-14 15:42:58 +02:00
d6e772f192 TST Add LNTuningConfig and LoKrConfig to tests (#2005)
These two configs were missing in test_config.py. Also, reordered the
list of all config classes to be sorted, which makes it easier to spot
missing configs.
2024-08-14 15:42:32 +02:00
042123465c DOC Fix typos in lora.md (#2003) 2024-08-13 15:15:03 +02:00
41c274ecac FIX Import error in BOFT half precision test (#1995) 2024-08-08 15:15:47 +02:00
9988cb9d00 FIX BOFT, OFT saving merged layers (#1994)
Error occurred with safetensors when weights are not contiguous.
2024-08-07 19:26:33 +02:00
fcac30bef5 MAINT Default to loading weights_only for torch (#1993)
The torch.load function allows passing weights_only=True, which is more
secure but may break code that loads more than just weights. For PEFT,
this should not be the case, so the switch should just work.

By making the switch now, we can find out early if there are any
problems, as torch.load will default to True in the future.
2024-08-07 19:16:55 +02:00
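Concretely, the switch is just the weights_only flag on the standard torch API (toy checkpoint shown for illustration):

```python
import torch

# Save a toy "adapter" state dict, then load it back with weights_only=True,
# which only deserializes tensors and plain containers instead of arbitrary
# pickled objects.
torch.save({"lora_A.weight": torch.zeros(8, 16)}, "toy_adapter.bin")
state_dict = torch.load("toy_adapter.bin", map_location="cpu", weights_only=True)
print(state_dict["lora_A.weight"].shape)
```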
2a5d3132e9 ENH Small updates to helper.rescale_adapter_scale (#1989)
Some renaming, better docs.
2024-08-07 14:51:35 +02:00
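For orientation, the helper above is the context manager for temporarily scaling LoRA adapters (see #1951 below). Usage looks roughly like this sketch; the import path and the multiplier argument name are assumptions, and the tiny test model is only for illustration:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
from peft.helpers import rescale_adapter_scale  # assumed location of the helper

base = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gpt2")
model = get_peft_model(base, LoraConfig(task_type="CAUSAL_LM"))

input_ids = torch.randint(0, 100, (1, 8))
# Temporarily rescale the LoRA contribution; the original scaling is restored
# when the `with` block exits.
with rescale_adapter_scale(model, multiplier=0.5):
    scaled_logits = model(input_ids).logits
default_logits = model(input_ids).logits
```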
c869664891 FIX BOFT mixed precision (#1925) 2024-08-07 14:12:34 +02:00
4611034ff8 FIX: Adjust transformers version check for bloom (#1992)
The fix to the bloom architecture was not actually released in
transformers 4.43.3, which makes the version check invalid. Instead, we
now check an attribute on BloomPreTrainedModel.
2024-08-06 13:40:14 +02:00
b9260305e3 FIX Docker build CI (#1987)
Signed-off-by: Adrien <adrien@huggingface.co>
2024-08-02 16:51:48 +02:00
f51428313f DOC Docs and examples for X-LoRA (#1970) 2024-08-02 12:35:14 +02:00
9a087823c6 DOC Small fixes for HQQ and section title (#1986)
Changed:

- Helper section had placeholder title
- `device` is not a valid argument to `from_pretrained`
- Excess empty lines
- Move helpers section
2024-08-02 12:33:29 +02:00
46f78978f1 FEAT Context manager for scaling LoRA (#1951) 2024-08-01 17:21:55 +02:00
269aba5303 ENH AdaLoRA: warn when user uses the r argument (#1981)
For AdaLoRA, init_r is the correct one to use.
2024-08-01 12:24:42 +02:00
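In other words, AdaLoRA's rank schedule is configured through init_r and target_r rather than LoRA's r (a minimal sketch; other arguments omitted):

```python
from peft import AdaLoraConfig

# AdaLoRA starts every adapted layer at init_r and prunes the rank budget down
# towards target_r during training; passing LoRA's `r` instead now triggers a warning.
config = AdaLoraConfig(init_r=12, target_r=4, total_step=1000)
```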
52a4ac9c2f ENH Faster bf16 merging on CPU (#1978)
Cast to fp32, as bf16 can be very slow on some CPUs.

This is already done for fp16.
2024-07-31 17:51:46 +02:00
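The trick itself is generic and easy to see in plain PyTorch (a LoRA-style merge with made-up shapes; timings depend on the CPU):

```python
import time
import torch

# LoRA-style merge on CPU: delta_W = B @ A is added onto the base weight.
A = torch.randn(16, 4096, dtype=torch.bfloat16)
B = torch.randn(4096, 16, dtype=torch.bfloat16)
W = torch.randn(4096, 4096, dtype=torch.bfloat16)

t0 = time.perf_counter()
merged_bf16 = W + B @ A  # pure bf16 math, very slow on CPUs without native bf16 support
t1 = time.perf_counter()
merged_fp32 = (W.float() + B.float() @ A.float()).to(torch.bfloat16)  # cast, compute, cast back
t2 = time.perf_counter()

print(f"bf16: {t1 - t0:.3f}s  via fp32: {t2 - t1:.3f}s")
```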
c874ba3f1b CHORE Update CI configuration for workflows (#1985)
Signed-off-by: Adrien <adrien@huggingface.co>
2024-07-31 16:08:58 +02:00
f13d860e9f FIX Loading adapter honors offline mode (#1976)
HF_HUB_OFFLINE=1 was not honored when trying to load an adapter. This is
now fixed.
2024-07-30 16:11:27 +02:00
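For reference, offline mode is driven by an environment variable that has to be set before the Hub libraries are imported; the model and adapter below are examples of previously cached ones:

```python
import os

os.environ["HF_HUB_OFFLINE"] = "1"  # set before importing transformers/peft

from transformers import AutoModelForCausalLM
from peft import PeftModel

# With the fix, loading an adapter that is already in the local cache works
# offline instead of erroring out while trying to reach the Hub.
base = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
model = PeftModel.from_pretrained(base, "ybelkada/opt-350m-lora")
```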
f6d3e38601 FIX active_adapters for transformers models (#1975)
Fixes the error reported here:

https://github.com/huggingface/transformers/pull/30790#issuecomment-2253808249

Unfortunately, transformers models have an active_adapters method but
it's 1) not a property and 2) calling it fails because the base
model (usually) has no loaded adapter. The base model can be a
transformers model for prompt learning, where the base model is not
wrapped in a LoraModel or similar. Therefore, this special case needs to
be handled separately.
2024-07-30 15:14:28 +02:00
7e7b55880e FIX: lora+: include lr in optimizer kwargs (#1973) 2024-07-30 14:20:04 +02:00
1b16753a6a ENH Update VeRA preconfigured models (#1941)
Some pre-configured models like mistral previously did not work with VeRA
because the weight shapes were not identical. However, since #1817, this
is no longer a requirement. Therefore, this commented code can now be
uncommented.

I have tested mistral and gemma and they worked. I haven't tested btlm
and mixtral but with the update, I'm pretty sure they will work too.
2024-07-30 08:15:53 +05:30
27833a2e60 FIX: New bloom changes breaking prompt learning (#1969)
Bloom had two dimensions of the attention layer transposed (compared to
all other transformers models), which was fixed by:

https://github.com/huggingface/transformers/pull/31445

Therefore, for future transformers versions, skip the special handling
in PEFT.

There was also an issue where prompt injection did not take place when
past_key_values was an empty Cache object. This should now
hopefully work as expected.
2024-07-29 18:25:41 +02:00
273acf059e FEAT: Add LoRA+ (#1915)
Add LoRA+: Efficient Low Rank Adaptation of Large Models

https://arxiv.org/abs/2402.12354

Call create_loraplus_optimizer to initialize an optimizer with parameter
groups and learning rates that are especially effective for LoRA training.

Builds upon this code base:

https://github.com/nikhil-ghosh-berkeley/loraplus

---------

Co-authored-by: moghadas76 <s.m.moghadas2012@gmail.com>
Co-authored-by: Chris Hua <stillmatic@users.noreply.github.com>
2024-07-29 12:50:30 +02:00
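A minimal sketch of the entry point (the import path and argument names follow the PEFT/loraplus docs and should be treated as assumptions; the tiny test model is only for illustration):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
from peft.optimizers import create_loraplus_optimizer  # assumed import location

base = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gpt2")
model = get_peft_model(base, LoraConfig(task_type="CAUSAL_LM"))

# LoRA+ assigns a higher learning rate to the lora_B matrices than to lora_A;
# loraplus_lr_ratio controls that ratio.
optimizer = create_loraplus_optimizer(
    model=model,
    optimizer_cls=torch.optim.AdamW,
    lr=5e-5,
    loraplus_lr_ratio=16,
)
```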
296fbcde3e FIX Prefix tuning if past_key_values is passed (#1942)
There was an error with prefix tuning when some models like Llava passed
past_key_values explicitly, even if it was None, because that argument
ended up being passed twice (once explicitly, once via kwargs). This is now
fixed.
2024-07-29 12:46:54 +02:00
f2b6d13f1d CI Fix Windows permission error on merge test (#1952)
For some reason, Windows CI suddenly started throwing permission
errors on test_merge_layers. These errors occur when using the
tempfile.TemporaryDirectory() context manager, which raises a
PermissionError on Windows when it tries to clean up after itself.
Therefore, this context manager is now avoided in favor of manual cleanup.

More context:

I investigated this issue first in #1947. My suspicion that this could
be caused by a new pytest version was not confirmed. Maybe the reason is
that GH rolled out a new Windows worker, not sure.

Also note that this is not the first time that this workaround is
required, e.g. also here:

e6cd24c907/tests/test_custom_models.py (L1465)
2024-07-25 14:02:34 +02:00
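The workaround amounts to replacing the context manager with explicit, error-tolerant cleanup (stdlib only; the test body is elided):

```python
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
try:
    ...  # save/merge/load adapters inside tmp_dir as the test did before
finally:
    # On Windows, cleanup can hit a PermissionError if a handle is still open;
    # ignore_errors=True keeps leftover temp files from failing the test.
    shutil.rmtree(tmp_dir, ignore_errors=True)
```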
8aacb993e7 Bump version to 0.12.1.dev0 (#1950) 2024-07-25 13:39:39 +02:00
514 changed files with 90682 additions and 16582 deletions

View File

@ -23,30 +23,14 @@ body:
Please tag fewer than 3 people.
Library: @benjaminbossan @sayakpaul
Library: @benjaminbossan @githubnemo
diffusers integration: @benjaminbossan @sayakpaul
Documentation: @stevhliu
placeholder: "@Username ..."
- type: checkboxes
id: information-scripts-examples
attributes:
label: Information
description: 'The problem arises when using:'
options:
- label: "The official example scripts"
- label: "My own modified scripts"
- type: checkboxes
id: information-tasks
attributes:
label: Tasks
description: "The tasks I am working on are:"
options:
- label: "An officially supported task in the `examples` folder"
- label: "My own task or dataset (give details below)"
- type: textarea
id: reproduction
validations:

View File

@ -11,15 +11,6 @@ body:
description: |
A clear and concise description of the feature proposal. Please provide a link to the paper and code in case they exist.
- type: textarea
id: motivation
validations:
required: true
attributes:
label: Motivation
description: |
Please outline the motivation for the proposal. Is your feature request related to a problem?
- type: textarea
id: contribution
validations:
@ -27,4 +18,4 @@ body:
attributes:
label: Your contribution
description: |
Is there any way that you could help, e.g. by submitting a PR?
Is there any way that you could help, e.g. by submitting a PR?

View File

@ -10,36 +10,31 @@ concurrency:
group: docker-image-builds
cancel-in-progress: false
permissions: {}
env:
CI_SLACK_CHANNEL: ${{ secrets.CI_DOCKER_CHANNEL }}
jobs:
latest-cpu:
name: "Latest Peft CPU [dev]"
runs-on: ubuntu-latest
runs-on:
group: aws-general-8-plus
steps:
- name: Cleanup disk
run: |
sudo ls -l /usr/local/lib/
sudo ls -l /usr/share/
sudo du -sh /usr/local/lib/
sudo du -sh /usr/share/
sudo rm -rf /usr/local/lib/android
sudo rm -rf /usr/share/dotnet
sudo du -sh /usr/local/lib/
sudo du -sh /usr/share/
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v1
uses: docker/setup-buildx-action@b5ca514318bd6ebac0fb2aedd5d36ec1b5c232a2 # v3.10.0
- name: Check out code
uses: actions/checkout@v3
uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
persist-credentials: false
- name: Login to DockerHub
uses: docker/login-action@v2
uses: docker/login-action@74a5d142397b4f367a81961eba4e8cd7edddf772 # v3.4.0
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_PASSWORD }}
- name: Build and Push CPU
uses: docker/build-push-action@v4
uses: docker/build-push-action@14487ce63c7a62a4a324b0bfb37086795e31c6c1 # v6.16.0
with:
context: ./docker/peft-cpu
push: true
@ -47,171 +42,109 @@ jobs:
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea # main from Feb 2025-02-24
with:
slack_channel: ${{ env.CI_SLACK_CHANNEL }}
title: 🤗 Results of the PEFT-CPU docker build
title: 🤗 Results of the PEFT-CPU docker build
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
latest-cuda:
name: "Latest Peft GPU [dev]"
runs-on: ubuntu-latest
runs-on:
group: aws-general-8-plus
steps:
- name: Cleanup disk
run: |
sudo ls -l /usr/local/lib/
sudo ls -l /usr/share/
sudo du -sh /usr/local/lib/
sudo du -sh /usr/share/
sudo rm -rf /usr/local/lib/android
sudo rm -rf /usr/share/dotnet
sudo du -sh /usr/local/lib/
sudo du -sh /usr/share/
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v1
uses: docker/setup-buildx-action@b5ca514318bd6ebac0fb2aedd5d36ec1b5c232a2 # v3.10.0
- name: Check out code
uses: actions/checkout@v3
uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
persist-credentials: false
- name: Login to DockerHub
uses: docker/login-action@v1
uses: docker/login-action@74a5d142397b4f367a81961eba4e8cd7edddf772 # v3.4.0
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_PASSWORD }}
- name: Build and Push GPU
uses: docker/build-push-action@v4
uses: docker/build-push-action@14487ce63c7a62a4a324b0bfb37086795e31c6c1 # v6.16.0
with:
context: ./docker/peft-gpu
push: true
tags: huggingface/peft-gpu
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea # main from Feb 2025-02-24
with:
slack_channel: ${{ env.CI_SLACK_CHANNEL }}
title: 🤗 Results of the PEFT-GPU docker build
title: 🤗 Results of the PEFT-GPU docker build
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
latest-cuda-bnb-source:
name: "Latest Peft GPU + bnb source [dev]"
runs-on: ubuntu-latest
runs-on:
group: aws-general-8-plus
steps:
- name: Cleanup disk
run: |
sudo ls -l /usr/local/lib/
sudo ls -l /usr/share/
sudo du -sh /usr/local/lib/
sudo du -sh /usr/share/
sudo rm -rf /usr/local/lib/android
sudo rm -rf /usr/share/dotnet
sudo du -sh /usr/local/lib/
sudo du -sh /usr/share/
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v1
uses: docker/setup-buildx-action@b5ca514318bd6ebac0fb2aedd5d36ec1b5c232a2 # v3.10.0
- name: Check out code
uses: actions/checkout@v3
uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
persist-credentials: false
- name: Login to DockerHub
uses: docker/login-action@v1
uses: docker/login-action@74a5d142397b4f367a81961eba4e8cd7edddf772 # v3.4.0
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_PASSWORD }}
- name: Build and Push GPU
uses: docker/build-push-action@v4
uses: docker/build-push-action@14487ce63c7a62a4a324b0bfb37086795e31c6c1 # v6.16.0
with:
context: ./docker/peft-gpu-bnb-source
push: true
tags: huggingface/peft-gpu-bnb-source
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea # main from Feb 2025-02-24
with:
slack_channel: ${{ env.CI_SLACK_CHANNEL }}
title: 🤗 Results of the PEFT-GPU (bnb source / HF latest) docker build
title: 🤗 Results of the PEFT-GPU (bnb source / HF latest) docker build
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
latest-cuda-bnb-source-latest:
name: "Latest Peft GPU + bnb source [accelerate / peft / transformers latest]"
runs-on: ubuntu-latest
runs-on:
group: aws-general-8-plus
steps:
- name: Cleanup disk
run: |
sudo ls -l /usr/local/lib/
sudo ls -l /usr/share/
sudo du -sh /usr/local/lib/
sudo du -sh /usr/share/
sudo rm -rf /usr/local/lib/android
sudo rm -rf /usr/share/dotnet
sudo du -sh /usr/local/lib/
sudo du -sh /usr/share/
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v1
uses: docker/setup-buildx-action@b5ca514318bd6ebac0fb2aedd5d36ec1b5c232a2 # v3.10.0
- name: Check out code
uses: actions/checkout@v3
uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
persist-credentials: false
- name: Login to DockerHub
uses: docker/login-action@v1
uses: docker/login-action@74a5d142397b4f367a81961eba4e8cd7edddf772 # v3.4.0
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_PASSWORD }}
- name: Build and Push GPU
uses: docker/build-push-action@v4
uses: docker/build-push-action@14487ce63c7a62a4a324b0bfb37086795e31c6c1 # v6.16.0
with:
context: ./docker/peft-gpu-bnb-latest
push: true
tags: huggingface/peft-gpu-bnb-latest
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea # main from Feb 2025-02-24
with:
slack_channel: ${{ env.CI_SLACK_CHANNEL }}
title: 🤗 Results of the PEFT-GPU (bnb source / HF source) docker build
title: 🤗 Results of the PEFT-GPU (bnb source / HF source) docker build
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
latest-cuda-bnb-source-multi:
name: "Latest Peft GPU + bnb (multi-backend) source [accelerate / peft / transformers source]"
runs-on: ubuntu-latest
steps:
- name: Cleanup disk
run: |
sudo ls -l /usr/local/lib/
sudo ls -l /usr/share/
sudo du -sh /usr/local/lib/
sudo du -sh /usr/share/
sudo rm -rf /usr/local/lib/android
sudo rm -rf /usr/share/dotnet
sudo du -sh /usr/local/lib/
sudo du -sh /usr/share/
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v1
- name: Check out code
uses: actions/checkout@v3
- name: Login to DockerHub
uses: docker/login-action@v1
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_PASSWORD }}
- name: Build and Push GPU
uses: docker/build-push-action@v4
with:
context: ./docker/peft-gpu-bnb-multi-source
push: true
tags: huggingface/peft-gpu-bnb-multi-source
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
slack_channel: ${{ env.CI_SLACK_CHANNEL }}
title: 🤗 Results of the PEFT-GPU (bnb source multi-backend / HF latest) docker build
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}

View File

@ -7,9 +7,11 @@ on:
- doc-builder*
- v*-release
permissions: {}
jobs:
build:
uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@ba4b74d11c46d884a4cf6497687c090f55f027d9 # main from 2025-09-05
with:
commit_sha: ${{ github.sha }}
package: peft

View File

@ -7,9 +7,11 @@ concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true
permissions: {}
jobs:
build:
uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@ba4b74d11c46d884a4cf6497687c090f55f027d9 # main from 2025-09-05
with:
commit_sha: ${{ github.event.pull_request.head.sha }}
pr_number: ${{ github.event.number }}

View File

@ -0,0 +1,41 @@
name: Deploy "method_comparison" Gradio to Spaces
on:
push:
branches: [ main ]
paths:
- "method_comparison/**"
workflow_dispatch:
permissions: {}
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
fetch-depth: 0 # full history needed for subtree
persist-credentials: false
- name: Authenticate via ~/.netrc
env:
HF_TOKEN: ${{ secrets.PEFT_INTERNAL_REPO_READ_WRITE }}
run: |
# netrc needs BOTH login and password entries
printf "machine huggingface.co\nlogin hf\npassword ${HF_TOKEN}\n" >> ~/.netrc
chmod 600 ~/.netrc
- name: Deploy method_comparison app to HF Spaces
run: |
cd method_comparison
git init
# Spaces expect requirements.txt
mv requirements-app.txt requirements.txt
git config user.name "github-actions[bot]"
git config user.email "github-actions[bot]@users.noreply.github.com"
git remote add gradio-app https://huggingface.co/spaces/peft-internal-testing/PEFT-method-comparison
git add .
git commit -m "🚀 Deploy method comparison app from GH action"
git push -f gradio-app HEAD:main

View File

@ -7,6 +7,8 @@ on:
description: 'Branch to test on'
required: true
permissions: {}
jobs:
run_transformers_integration_tests:
strategy:
@ -15,20 +17,21 @@ jobs:
transformers-version: ['main', 'latest']
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
ref: ${{ github.event.inputs.branch }}
repository: ${{ github.event.pull_request.head.repo.full_name }}
persist-credentials: false
- name: Set up Python
uses: actions/setup-python@v4
uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6.0.0
with:
python-version: "3.10"
cache: "pip"
cache-dependency-path: "setup.py"
- name: print environment variables
run: |
echo "env.CI_BRANCH = ${{ env.CI_BRANCH }}"
echo "env.CI_SHA = ${{ env.CI_SHA }}"
echo "env.CI_BRANCH = ${CI_BRANCH}"
echo "env.CI_SHA = ${CI_SHA}"
- name: Install dependencies
run: |
python -m pip install --upgrade pip
@ -51,25 +54,26 @@ jobs:
diffusers-version: ['main']
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
ref: ${{ github.event.inputs.branch }}
repository: ${{ github.event.pull_request.head.repo.full_name }}
persist-credentials: false
- name: Set up Python
uses: actions/setup-python@v4
uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6.0.0
with:
python-version: "3.10"
cache: "pip"
cache-dependency-path: "setup.py"
- name: print environment variables
run: |
echo "env.CI_BRANCH = ${{ env.CI_BRANCH }}"
echo "env.CI_SHA = ${{ env.CI_SHA }}"
echo "env.CI_BRANCH = ${CI_BRANCH}"
echo "env.CI_SHA = ${CI_SHA}"
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install .[test]
if [ "${{ matrix.diffusers-version }}" == "main" ]; then
pip install -U git+https://github.com/huggingface/diffusers.git
else

View File

@ -10,8 +10,9 @@ env:
IS_GITHUB_CI: "1"
# To be able to run tests on CUDA 12.2
NVIDIA_DISABLE_REQUIRE: "1"
SLACK_API_TOKEN: ${{ secrets.SLACK_API_TOKEN }}
SLACK_API_TOKEN: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
permissions: {}
jobs:
run_all_tests_single_gpu:
@ -19,8 +20,9 @@ jobs:
strategy:
fail-fast: false
matrix:
docker-image-name: ["huggingface/peft-gpu-bnb-source:latest", "huggingface/peft-gpu-bnb-latest:latest", "huggingface/peft-gpu-bnb-multi-source:latest"]
runs-on: [self-hosted, single-gpu, nvidia-gpu, t4, ci]
docker-image-name: ["huggingface/peft-gpu-bnb-source:latest", "huggingface/peft-gpu-bnb-latest:latest"]
runs-on:
group: aws-g6-4xlarge-plus
env:
CUDA_VISIBLE_DEVICES: "0"
TEST_TYPE: "single_gpu_${{ matrix.docker-image-name }}"
@ -31,7 +33,9 @@ jobs:
run:
shell: bash
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
persist-credentials: false
- name: Pip install
run: |
source activate peft
@ -45,7 +49,7 @@ jobs:
echo "Checking out tag for Transformers version: v$transformers_version"
git fetch --tags
git checkout tags/v$transformers_version
cd ..
cd ..
fi
- name: Test bnb import
@ -58,29 +62,28 @@ jobs:
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea # main from Feb 2025-02-24
with:
slack_channel: ${{ secrets.BNB_SLACK_CHANNEL_ID }}
title: 🤗 Results of bitsandbytes import
status: ${{ steps.examples_tests.outcome }}
status: ${{ steps.import.outcome }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
# TODO: uncomment this block if error is solved or bnb multi backend branch is merged
# - name: Run examples on single GPU
# id: examples_tests
# if: always()
# run: |
# source activate peft
# make tests_examples_single_gpu_bnb
- name: Run examples on single GPU
id: examples_tests
if: always()
run: |
source activate peft
make tests_examples_single_gpu_bnb
# - name: Post to Slack
# if: always()
# uses: huggingface/hf-workflows/.github/actions/post-slack@main
# with:
# slack_channel: ${{ secrets.BNB_SLACK_CHANNEL_ID }}
# title: 🤗 Results of bitsandbytes examples tests - single GPU
# status: ${{ steps.examples_tests.outcome }}
# slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea # main from Feb 2025-02-24
with:
slack_channel: ${{ secrets.BNB_SLACK_CHANNEL_ID }}
title: 🤗 Results of bitsandbytes examples tests - single GPU
status: ${{ steps.examples_tests.outcome }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
- name: Run core tests on single GPU
id: core_tests
@ -91,28 +94,29 @@ jobs:
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea # main from Feb 2025-02-24
with:
slack_channel: ${{ secrets.BNB_SLACK_CHANNEL_ID }}
title: 🤗 Results of bitsandbytes core tests - single GPU
status: ${{ steps.core_tests.outcome }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
- name: Run BNB regression tests on single GPU
id: regression_tests
if: always()
run: |
source activate peft
make tests_gpu_bnb_regression
# TODO: this is a test to see if BNB multi-backend single-GPU tests succeed w/o regression tests
# - name: Run BNB regression tests on single GPU
# id: regression_tests
# if: always()
# run: |
# source activate peft
# make tests_gpu_bnb_regression
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
slack_channel: ${{ secrets.BNB_SLACK_CHANNEL_ID }}
title: 🤗 Results of bitsandbytes regression tests - single GPU
status: ${{ steps.regression_tests.outcome }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
# - name: Post to Slack
# if: always()
# uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea # main from Feb 2025-02-24
# with:
# slack_channel: ${{ secrets.BNB_SLACK_CHANNEL_ID }}
# title: 🤗 Results of bitsandbytes regression tests - single GPU
# status: ${{ steps.regression_tests.outcome }}
# slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
- name: Run transformers tests on single GPU
id: transformers_tests
@ -123,13 +127,13 @@ jobs:
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea # main from Feb 2025-02-24
with:
slack_channel: ${{ secrets.BNB_SLACK_CHANNEL_ID }}
title: 🤗 Results of bitsandbytes transformers tests - single GPU
status: ${{ steps.transformers_tests.outcome }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
- name: Generate Report
if: always()
run: |
@ -141,8 +145,9 @@ jobs:
strategy:
fail-fast: false
matrix:
docker-image-name: ["huggingface/peft-gpu-bnb-source:latest", "huggingface/peft-gpu-bnb-latest:latest", "huggingface/peft-gpu-bnb-multi-source:latest"]
runs-on: [self-hosted, multi-gpu, nvidia-gpu, t4, ci]
docker-image-name: ["huggingface/peft-gpu-bnb-source:latest", "huggingface/peft-gpu-bnb-latest:latest"]
runs-on:
group: aws-g6-12xlarge-plus
env:
CUDA_VISIBLE_DEVICES: "0,1"
TEST_TYPE: "multi_gpu_${{ matrix.docker-image-name }}"
@ -153,7 +158,9 @@ jobs:
run:
shell: bash
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
persist-credentials: false
- name: Pip install
run: |
source activate peft
@ -168,7 +175,7 @@ jobs:
git fetch --tags
git checkout tags/v$transformers_version
cd ..
fi
fi
- name: Test bnb import
id: import
@ -180,18 +187,13 @@ jobs:
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea # main from Feb 2025-02-24
with:
slack_channel: ${{ secrets.BNB_SLACK_CHANNEL_ID }}
title: 🤗 Results of bitsandbytes import
status: ${{ steps.import.outcome }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
- name: Run core GPU tests on multi-gpu
if: always()
run: |
source activate peft
- name: Run examples on multi GPU
id: examples_tests
if: always()
@ -201,13 +203,13 @@ jobs:
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea # main from Feb 2025-02-24
with:
slack_channel: ${{ secrets.BNB_SLACK_CHANNEL_ID }}
title: 🤗 Results of bitsandbytes examples tests - multi GPU
status: ${{ steps.examples_tests.outcome }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
- name: Run core tests on multi GPU
id: core_tests
if: always()
@ -217,7 +219,7 @@ jobs:
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea # main from Feb 2025-02-24
with:
slack_channel: ${{ secrets.BNB_SLACK_CHANNEL_ID }}
title: 🤗 Results of bitsandbytes core tests - multi GPU
@ -233,13 +235,13 @@ jobs:
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea # main from Feb 2025-02-24
with:
slack_channel: ${{ secrets.BNB_SLACK_CHANNEL_ID }}
title: 🤗 Results of bitsandbytes transformers tests - multi GPU
status: ${{ steps.transformers_tests.outcome }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
- name: Generate Report
if: always()
run: |

View File

@ -10,14 +10,16 @@ env:
IS_GITHUB_CI: "1"
# To be able to run tests on CUDA 12.2
NVIDIA_DISABLE_REQUIRE: "1"
SLACK_API_TOKEN: ${{ secrets.SLACK_API_TOKEN }}
SLACK_API_TOKEN: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
permissions: {}
jobs:
run_all_tests_single_gpu:
strategy:
fail-fast: false
runs-on: [self-hosted, single-gpu, nvidia-gpu, t4, ci]
runs-on:
group: aws-g6-4xlarge-plus
env:
CUDA_VISIBLE_DEVICES: "0"
TEST_TYPE: "single_gpu"
@ -28,13 +30,15 @@ jobs:
run:
shell: bash
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
persist-credentials: false
- name: Pip install
run: |
source activate peft
pip install -e . --no-deps
pip install pytest-reportlog
- name: Run common tests on single GPU
run: |
source activate peft
@ -44,7 +48,7 @@ jobs:
run: |
source activate peft
make tests_examples_single_gpu
- name: Run core tests on single GPU
run: |
source activate peft
@ -54,7 +58,7 @@ jobs:
run: |
source activate peft
make tests_regression
- name: Generate Report
if: always()
run: |
@ -64,7 +68,8 @@ jobs:
run_all_tests_multi_gpu:
strategy:
fail-fast: false
runs-on: [self-hosted, multi-gpu, nvidia-gpu, t4, ci]
runs-on:
group: aws-g6-12xlarge-plus
env:
CUDA_VISIBLE_DEVICES: "0,1"
TEST_TYPE: "multi_gpu"
@ -75,7 +80,9 @@ jobs:
run:
shell: bash
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
persist-credentials: false
- name: Pip install
run: |
source activate peft
@ -85,22 +92,22 @@ jobs:
- name: Run core GPU tests on multi-gpu
run: |
source activate peft
- name: Run common tests on multi GPU
run: |
source activate peft
make tests_common_gpu
- name: Run examples on multi GPU
run: |
source activate peft
make tests_examples_multi_gpu
- name: Run core tests on multi GPU
run: |
source activate peft
make tests_core_multi_gpu
- name: Generate Report
if: always()
run: |

View File

@ -4,24 +4,31 @@ on:
schedule:
- cron: "0 15 * * *"
permissions: {}
jobs:
close_stale_issues:
name: Close Stale Issues
if: github.repository == 'huggingface/peft'
runs-on: ubuntu-latest
permissions:
issues: write
pull-requests: write
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
persist-credentials: false
- name: Setup Python
uses: actions/setup-python@v4
uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6.0.0
with:
python-version: 3.8
python-version: 3.11
- name: Install requirements
run: |
pip install PyGithub
- name: Close stale issues
run: |
python scripts/stale.py
python scripts/stale.py

View File

@ -4,7 +4,10 @@ on:
pull_request:
paths:
# Run only when DockerFile files are modified
- "docker/**"
- "docker/*/Dockerfile"
permissions: {}
jobs:
get_changed_files:
name: "Build all modified docker images"
@ -13,12 +16,14 @@ jobs:
matrix: ${{ steps.set-matrix.outputs.matrix }}
steps:
- name: Check out code
uses: actions/checkout@v3
uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
persist-credentials: false
- name: Get changed files
id: changed-files
uses: tj-actions/changed-files@1c8e6069583811afb28f97afeaf8e7da80c6be5c #v42
with:
files: docker/**
files: docker/*/Dockerfile
json: "true"
- name: Run step if only the files listed above change
if: steps.changed-files.outputs.any_changed == 'true'
@ -26,12 +31,12 @@ jobs:
env:
ALL_CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
run: |
echo "matrix=${{ steps.changed-files.outputs.all_changed_files}}" >> $GITHUB_OUTPUT
echo "matrix=${ALL_CHANGED_FILES}" >> $GITHUB_OUTPUT
build_modified_files:
needs: get_changed_files
name: Build Docker images on modified files
runs-on: ubuntu-latest
if: ${{ needs.get_changed_files.outputs.matrix }} != ''
if: ${{ needs.get_changed_files.outputs.matrix != '[]' }}
strategy:
fail-fast: false
matrix:
@ -48,11 +53,13 @@ jobs:
sudo du -sh /usr/local/lib/
sudo du -sh /usr/share/
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v1
uses: docker/setup-buildx-action@b5ca514318bd6ebac0fb2aedd5d36ec1b5c232a2 # v3.10.0
- name: Check out code
uses: actions/checkout@v3
uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
persist-credentials: false
- name: Build Docker image
uses: docker/build-push-action@v4
uses: docker/build-push-action@14487ce63c7a62a4a324b0bfb37086795e31c6c1 # v6.16.0
with:
file: ${{ matrix.docker-file }}
context: .

View File

@ -6,13 +6,20 @@ on:
paths-ignore:
- 'docs/**'
env:
TRANSFORMERS_IS_CI: 1
permissions: {}
jobs:
tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
persist-credentials: false
- name: Set up Python 3.11
uses: actions/setup-python@v4
uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6.0.0
with:
python-version: 3.11
cache: "pip"
@ -28,7 +35,7 @@ jobs:
make test
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea # main from Feb 2025-02-24
with:
slack_channel: ${{ secrets.SLACK_CHANNEL_ID }}
title: 🤗 Results of transformers main tests

View File

@ -9,15 +9,23 @@ on:
paths-ignore:
- 'docs/**'
env:
HF_HOME: .cache/huggingface
TRANSFORMERS_IS_CI: 1
permissions: {}
jobs:
check_code_quality:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
- uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
python-version: "3.8"
persist-credentials: false
- name: Set up Python
uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6.0.0
with:
python-version: "3.11"
cache: "pip"
cache-dependency-path: "setup.py"
- name: Install dependencies
@ -31,16 +39,36 @@ jobs:
tests:
needs: check_code_quality
strategy:
# TODO: remove 'fail-fast' line once timeout issue from the Hub is solved
fail-fast: false
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11"]
os: ["ubuntu-latest", "macos-12", "windows-latest"]
python-version: ["3.9", "3.10", "3.11", "3.12"]
os: ["ubuntu-latest", "macos-13", "windows-latest"]
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
persist-credentials: false
- name: Model cache
uses: actions/cache/restore@0400d5f644dc74513175e3cd8d07132dd4860809 # v4.2.4
with:
# Avoid caching HF_HOME/modules and Python cache files to prevent interoperability
# issues and potential cache poisioning. We also avoid lock files to prevent runs
# avoiding re-download because they see a lock file.
path: |
${{ env.HF_HOME }}/hub/**
!${{ env.HF_HOME }}/**/*.pyc
key: model-cache-${{ github.run_id }}
restore-keys: model-cache-
enableCrossOsArchive: true
- name: Dump cache content
# TODO: remove this step after 2025-02-15
if: matrix.os != 'windows-latest'
run: |
SHASUM=sha256sum
[ -f "$(which shasum)" ] && SHASUM=shasum
find "${{ env.HF_HOME }}/hub" -type f -exec "$SHASUM" {} \; > cache_content_initial || true
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6.0.0
with:
python-version: ${{ matrix.python-version }}
cache: "pip"
@ -48,14 +76,59 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install setuptools
# cpu version of pytorch
pip install -e .[test]
- name: Downgrade numpy on MacOS and Windows
# TODO: remove numpy downgrade on MacOS & Windows once torch fixes numpy 2.0 issue
shell: bash
if: matrix.os == 'windows-latest' || matrix.os == 'macos-12'
if: matrix.os == 'windows-latest' || matrix.os == 'macos-13'
run: |
pip install --force-reinstall -U "numpy<2.0.0"
- name: Test with pytest
# MacOS tests are currently too flaky and will fail almost each time. Thus, continue (green checkmark) even if
# they fail, but add a notice so that the failure is not completely silent
continue-on-error: ${{ matrix.os == 'macos-13' }}
shell: bash
run: |
set +e
make test
status=$?
# Post a notice only if this is macOS AND tests failed
if [ "$status" -ne 0 ] && [ "${{ matrix.os }}" = "macos-13" ]; then
{
echo "## ⚠️ macOS tests failed"
echo ""
echo "- OS: ${{ matrix.os }}"
echo "- Python: ${{ matrix.python-version }}"
echo ""
echo "Check the logs from this step for details."
} >> "$GITHUB_STEP_SUMMARY"
fi
# Return the real status. On macOS this won't fail the job because of continue-on-error.
exit $status
- name: Dump cache content and diff
# This is just debug info so that we can monitor if the model cache diverges substantially
# over time and what the diverging model is.
# TODO: remove after 2025-02-15
if: matrix.os != 'windows-latest'
run: |
SHASUM=sha256sum
[ -f "$(which shasum)" ] && SHASUM=shasum
find "${{ env.HF_HOME }}/hub" -type f -exec "$SHASUM" {} \; > cache_content_after || true
diff -udp cache_content_initial cache_content_after || true
- name: Delete old model cache entries
run: |
# make sure that cache cleaning doesn't break the pipeline
python scripts/ci_clean_cache.py -d || true
- name: Update model cache
uses: actions/cache/save@0400d5f644dc74513175e3cd8d07132dd4860809 # v4.2.4
# Only let one runner (preferably the one that covers most tests) update the model cache
# after *every* run. This way we make sure that our cache is never outdated and we don't
# have to keep track of hashes.
if: always() && matrix.os == 'ubuntu-latest' && matrix.python-version == '3.10'
with:
path: |
${{ env.HF_HOME }}/hub/**
!${{ env.HF_HOME }}/**/*.pyc
key: model-cache-${{ github.run_id }}

View File

@ -17,13 +17,17 @@ env:
# To be able to run tests on CUDA 12.2
NVIDIA_DISABLE_REQUIRE: "1"
permissions: {}
jobs:
run_tests_with_compile:
runs-on: [self-hosted, single-gpu, nvidia-gpu, a10, ci]
runs-on:
group: aws-g6-4xlarge-plus
env:
PEFT_DEBUG_WITH_TORCH_COMPILE: 1
CUDA_VISIBLE_DEVICES: "0"
TEST_TYPE: "single_gpu_huggingface/peft-gpu-bnb-latest:latest"
USE_PYTORCH_NIGHTLY: "${{ github.event.inputs.pytorch_nightly }}"
container:
image: "huggingface/peft-gpu-bnb-latest:latest"
options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
@ -31,17 +35,18 @@ jobs:
run:
shell: bash
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
ref: ${{ github.event.inputs.branch }}
repository: ${{ github.event.pull_request.head.repo.full_name }}
persist-credentials: false
- name: Pip install
run: |
source activate peft
pip install -e . --no-deps
pip install pytest-cov pytest-reportlog parameterized datasets scipy einops
pip install "pytest>=7.2.0,<8.0.0" # see: https://github.com/huggingface/transformers/blob/ce4fff0be7f6464d713f7ac3e0bbaafbc6959ae5/setup.py#L148C6-L148C26
if [ "${{ github.event.inputs.pytorch_nightly }}" = "true" ]; then
if [ "${USE_PYTORCH_NIGHTLY}" = "true" ]; then
python -m pip install --upgrade --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
fi
- name: Test compile with pytest

View File

@ -3,13 +3,16 @@ on:
name: Secret Leaks
permissions: {}
jobs:
trufflehog:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
fetch-depth: 0
persist-credentials: false
- name: Secret Scanning
uses: trufflesecurity/trufflehog@main
uses: trufflesecurity/trufflehog@0f58ae7c5036094a1e3e750d18772af92821b503 # v3.90.5

View File

@ -6,11 +6,13 @@ on:
types:
- completed
permissions: {}
jobs:
build:
uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@ba4b74d11c46d884a4cf6497687c090f55f027d9 # main from 2025-09-05
with:
package_name: peft
secrets:
hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}

.github/workflows/zizmor.yaml (new file, 28 lines)
View File

@ -0,0 +1,28 @@
name: CI security linting
on:
push:
branches: ["main"]
pull_request:
branches: ["*"]
paths:
- '.github/**'
permissions: {}
jobs:
zizmor:
name: zizmor latest via Cargo
runs-on: ubuntu-latest
permissions:
contents: read
security-events: write
steps:
- name: Checkout repository
uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
persist-credentials: false
- name: Install zizmor
run: cargo install --locked zizmor
- name: Run zizmor
run: zizmor .github/workflows

.github/zizmor.yml (new file, 24 lines)
View File

@ -0,0 +1,24 @@
rules:
dangerous-triggers:
ignore:
# this workflow is only triggered after maintainer approval
- upload_pr_documentation.yml:3:1
cache-poisoning:
ignore:
# the docker buildx binary is cached and zizmor warns about a cache poisoning attack.
# OTOH this cache would make us more resilient against an intrusion on docker-buildx' side.
# There is no obvious benefit so we leave it as it is.
- build_docker_images.yml:37:9
- build_docker_images.yml:70:9
- build_docker_images.yml:103:9
- build_docker_images.yml:136:9
- build_docker_images.yml:169:9
unpinned-images:
ignore:
# We want to test these images with the latest version and we're not using them
# to deploy anything so we deem it safe to use those, even if they are unpinned.
- nightly-bnb.yml:30:7
- nightly-bnb.yml:155:7
- nightly.yml:27:7
- nightly.yml:77:7
- torch_compile_tests.yml:32:7

.gitignore (4 changed lines)
View File

@ -139,3 +139,7 @@ dmypy.json
# More test things
wandb
# method_comparison logs
method_comparison/MetaMathQA/cancelled_results/
method_comparison/MetaMathQA/temporary_results/

View File

@ -1,13 +1,13 @@
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.2.1
rev: v0.12.8
hooks:
- id: ruff
args:
- --fix
- id: ruff-format
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.5.0
rev: v4.6.0
hooks:
- id: check-merge-conflict
- id: check-yaml

View File

@ -31,9 +31,14 @@ tests_core_multi_gpu:
tests_core_single_gpu:
python -m pytest -m single_gpu_tests tests/test_common_gpu.py $(if $(IS_GITHUB_CI),--report-log "core_single_gpu.log",)
# exclude gemma tests, as generation fails with torch.compile, these failures
# trigger side effects that make other tests fail with 'RuntimeError: Offset
# increment outside graph capture encountered unexpectedly.'
# TODO re-enable gemma once/if it is fixed
tests_common_gpu:
python -m pytest tests/test_decoder_models.py $(if $(IS_GITHUB_CI),--report-log "common_decoder.log",)
python -m pytest tests/test_decoder_models.py -k "not gemma" $(if $(IS_GITHUB_CI),--report-log "common_decoder.log",)
python -m pytest tests/test_encoder_decoder_models.py $(if $(IS_GITHUB_CI),--report-log "common_encoder_decoder.log",)
python -m pytest tests/test_gptqmodel.py $(if $(IS_GITHUB_CI),--report-log "gptqmodel_gpu.log",)
tests_examples_multi_gpu_bnb:
python -m pytest -m "multi_gpu_tests and bitsandbytes" tests/test_gpu_examples.py $(if $(IS_GITHUB_CI),--report-log "multi_gpu_examples.log",)

View File

@ -39,38 +39,43 @@ pip install peft
Prepare a model for training with a PEFT method such as LoRA by wrapping the base model and PEFT configuration with `get_peft_model`. For the bigscience/mt0-large model, you're only training 0.19% of the parameters!
```python
from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_config, get_peft_model, LoraConfig, TaskType
model_name_or_path = "bigscience/mt0-large"
tokenizer_name_or_path = "bigscience/mt0-large"
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model
device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
model_id = "Qwen/Qwen2.5-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device)
peft_config = LoraConfig(
task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
r=16,
lora_alpha=32,
task_type=TaskType.CAUSAL_LM,
# target_modules=["q_proj", "v_proj", ...] # optionally indicate target modules
)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
"trainable params: 2359296 || all params: 1231940608 || trainable%: 0.19151053100118282"
# prints: trainable params: 3,686,400 || all params: 3,089,625,088 || trainable%: 0.1193
# now perform training on your dataset, e.g. using transformers Trainer, then save the model
model.save_pretrained("qwen2.5-3b-lora")
```
To load a PEFT model for inference:
```py
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
import torch
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
model = AutoPeftModelForCausalLM.from_pretrained("ybelkada/opt-350m-lora").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
model_id = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device)
model = PeftModel.from_pretrained(model, "qwen2.5-3b-lora")
model.eval()
inputs = tokenizer("Preheat the oven to 350 degrees and place the cookie dough", return_tensors="pt")
outputs = model.generate(**inputs.to(device), max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=50)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
"Preheat the oven to 350 degrees and place the cookie dough in the center of the oven. In a large bowl, combine the flour, baking powder, baking soda, salt, and cinnamon. In a separate bowl, combine the egg yolks, sugar, and vanilla."
# prints something like: Preheat the oven to 350 degrees and place the cookie dough in a baking dish [...]
```
## Why you should use PEFT
@ -124,6 +129,32 @@ The iterative diffusion process consumes a lot of memory which can make it diffi
> [!TIP]
> Take a look at the [examples/lora_dreambooth/train_dreambooth.py](examples/lora_dreambooth/train_dreambooth.py) training script to try training your own Stable Diffusion model with LoRA, and play around with the [smangrul/peft-lora-sd-dreambooth](https://huggingface.co/spaces/smangrul/peft-lora-sd-dreambooth) Space which is running on a T4 instance. Learn more about the PEFT integration in Diffusers in this [tutorial](https://huggingface.co/docs/peft/main/en/tutorial/peft_integrations#diffusers).
### Transformers
PEFT is directly integrated with [Transformers](https://huggingface.co/docs/transformers/main/en/peft). After loading a model, call `add_adapter` to add a new PEFT adapter to the model:
```python
from peft import LoraConfig
model = ... # transformers model
peft_config = LoraConfig(...)
model.add_adapter(lora_config, adapter_name="lora_1")
```
To load a trained PEFT adapter, call `load_adapter`:
```python
model = ... # transformers model
model.load_adapter(<path-to-adapter>, adapter_name="lora_1")
```
And to switch between different adapters, call `set_adapter`:
```python
model.set_adapter("lora_2")
```
The Transformers integration doesn't include all the functionalities offered in PEFT, such as methods for merging the adapter into the base model.
### Accelerate
[Accelerate](https://huggingface.co/docs/accelerate/index) is a library for distributed training and inference on various training setups and hardware (GPUs, TPUs, Apple Silicon, etc.). PEFT models work with Accelerate out of the box, making it really convenient to train really large models or use them for inference on consumer hardware with limited resources.
@ -150,9 +181,9 @@ To use 🤗 PEFT in your publication, please cite it by using the following BibT
```bibtex
@Misc{peft,
title = {PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods},
title = {{PEFT}: State-of-the-art Parameter-Efficient Fine-Tuning methods},
author = {Sourab Mangrulkar and Sylvain Gugger and Lysandre Debut and Younes Belkada and Sayak Paul and Benjamin Bossan},
howpublished = {\url{https://github.com/huggingface/peft}},
year = {2022}
}
```
```

View File

@ -1,11 +1,8 @@
# PEFT Docker images
Here we store all PEFT Docker images used in our testing infrastructure. We use python 3.8 for now on all our images.
Here we store all PEFT Docker images used in our testing infrastructure. We use python 3.11 for now on all our images.
- `peft-cpu`: PEFT compiled on CPU with all other HF libraries installed on main branch
- `peft-gpu`: PEFT complied for NVIDIA GPUs wih all other HF libraries installed on main branch
- `peft-gpu`: PEFT complied for NVIDIA GPUs with all other HF libraries installed on main branch
- `peft-gpu-bnb-source`: PEFT complied for NVIDIA GPUs with `bitsandbytes` and all other HF libraries installed from main branch
- `peft-gpu-bnb-latest`: PEFT complied for NVIDIA GPUs with `bitsandbytes` complied from main and all other HF libraries installed from latest PyPi
- `peft-gpu-bnb-multi-source`: PEFT complied for NVIDIA GPUs with `bitsandbytes` complied from `multi-backend` branch and all other HF libraries installed from main branch
`peft-gpu-bnb-source` and `peft-gpu-bnb-multi-source` are essentially the same, with the only difference being `bitsandbytes` compiled on another branch. Make sure to propagate the changes you applied on one file to the other!

View File

@ -4,7 +4,7 @@
# Use base conda image to reduce time
FROM continuumio/miniconda3:latest AS compile-image
# Specify py version
ENV PYTHON_VERSION=3.8
ENV PYTHON_VERSION=3.11
# Install apt libs - copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
RUN apt-get update && \
apt-get install -y curl git wget software-properties-common git-lfs && \

View File

@ -4,7 +4,7 @@
# Use base conda image to reduce time
FROM continuumio/miniconda3:latest AS compile-image
# Specify py version
ENV PYTHON_VERSION=3.8
ENV PYTHON_VERSION=3.11
# Install apt libs - copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
RUN apt-get update && \
apt-get install -y curl git wget software-properties-common git-lfs && \
@ -31,7 +31,7 @@ RUN chsh -s /bin/bash
SHELL ["/bin/bash", "-c"]
# Stage 2
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 AS build-image
FROM nvidia/cuda:12.6.3-devel-ubuntu22.04 AS build-image
COPY --from=compile-image /opt/conda /opt/conda
ENV PATH /opt/conda/bin:$PATH
@ -56,7 +56,7 @@ RUN source activate peft && \
peft \
optimum \
auto-gptq && \
git clone https://github.com/TimDettmers/bitsandbytes && cd bitsandbytes && \
git clone https://github.com/bitsandbytes-foundation/bitsandbytes && cd bitsandbytes && \
cmake -B . -DCOMPUTE_BACKEND=cuda -S . && \
cmake --build . && \
pip install -e . && \

View File

@ -1,68 +0,0 @@
# Builds GPU docker image of PyTorch
# Uses multi-staged approach to reduce size
# Stage 1
# Use base conda image to reduce time
FROM continuumio/miniconda3:latest AS compile-image
# Specify py version
ENV PYTHON_VERSION=3.8
# Install apt libs - copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
RUN apt-get update && \
apt-get install -y curl git wget software-properties-common git-lfs && \
apt-get clean && \
rm -rf /var/lib/apt/lists*
# Install audio-related libraries
RUN apt-get update && \
apt install -y ffmpeg
RUN apt install -y libsndfile1-dev
RUN git lfs install
# Create our conda env - copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
RUN conda create --name peft python=${PYTHON_VERSION} ipython jupyter pip
RUN python3 -m pip install --no-cache-dir --upgrade pip
# Below is copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
# We don't install pytorch here yet since CUDA isn't available
# instead we use the direct torch wheel
ENV PATH /opt/conda/envs/peft/bin:$PATH
# Activate our bash shell
RUN chsh -s /bin/bash
SHELL ["/bin/bash", "-c"]
# Stage 2
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 AS build-image
COPY --from=compile-image /opt/conda /opt/conda
ENV PATH /opt/conda/bin:$PATH
RUN chsh -s /bin/bash
SHELL ["/bin/bash", "-c"]
# Install apt libs
RUN apt-get update && \
apt-get install -y curl git wget cmake && \
apt-get clean && \
rm -rf /var/lib/apt/lists*
# Activate the conda env and install transformers + accelerate from source
# Also clone BNB and build it from source.
RUN source activate peft && \
python3 -m pip install -U --no-cache-dir \
librosa \
"soundfile>=0.12.1" \
scipy \
git+https://github.com/huggingface/transformers \
git+https://github.com/huggingface/accelerate \
peft[test]@git+https://github.com/huggingface/peft \
optimum \
auto-gptq && \
git clone https://github.com/TimDettmers/bitsandbytes && cd bitsandbytes && git checkout multi-backend-refactor && \
cmake -B . -DCOMPUTE_BACKEND=cuda -S . && \
cmake --build . && \
pip install -e . && \
pip freeze | grep bitsandbytes
RUN echo "source activate peft" >> ~/.profile
# Activate the virtualenv
CMD ["/bin/bash"]

View File

@ -4,7 +4,7 @@
# Use base conda image to reduce time
FROM continuumio/miniconda3:latest AS compile-image
# Specify py version
ENV PYTHON_VERSION=3.8
ENV PYTHON_VERSION=3.11
# Install apt libs - copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
RUN apt-get update && \
apt-get install -y curl git wget software-properties-common git-lfs && \
@ -31,7 +31,7 @@ RUN chsh -s /bin/bash
SHELL ["/bin/bash", "-c"]
# Stage 2
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 AS build-image
FROM nvidia/cuda:12.6.3-devel-ubuntu22.04 AS build-image
COPY --from=compile-image /opt/conda /opt/conda
ENV PATH /opt/conda/bin:$PATH
@ -56,7 +56,7 @@ RUN source activate peft && \
peft[test]@git+https://github.com/huggingface/peft \
optimum \
auto-gptq && \
git clone https://github.com/TimDettmers/bitsandbytes && cd bitsandbytes && \
git clone https://github.com/bitsandbytes-foundation/bitsandbytes && cd bitsandbytes && \
cmake -B . -DCOMPUTE_BACKEND=cuda -S . && \
cmake --build . && \
pip install -e . && \

View File

@ -4,23 +4,18 @@
# Use base conda image to reduce time
FROM continuumio/miniconda3:latest AS compile-image
# Specify py version
ENV PYTHON_VERSION=3.8
ENV PYTHON_VERSION=3.11
# Install apt libs - copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
# Install audio-related libraries
RUN apt-get update && \
apt-get install -y curl git wget software-properties-common git-lfs && \
apt-get install -y curl git wget software-properties-common git-lfs ffmpeg libsndfile1-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists*
# Install audio-related libraries
RUN apt-get update && \
apt install -y ffmpeg
RUN apt install -y libsndfile1-dev
RUN git lfs install
# Create our conda env - copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
RUN conda create --name peft python=${PYTHON_VERSION} ipython jupyter pip
RUN python3 -m pip install --no-cache-dir --upgrade pip
# Below is copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
# We don't install pytorch here yet since CUDA isn't available
@ -31,29 +26,24 @@ RUN chsh -s /bin/bash
SHELL ["/bin/bash", "-c"]
# Stage 2
FROM nvidia/cuda:12.2.2-devel-ubuntu22.04 AS build-image
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS build-image
COPY --from=compile-image /opt/conda /opt/conda
ENV PATH /opt/conda/bin:$PATH
RUN chsh -s /bin/bash
SHELL ["/bin/bash", "-c"]
RUN source activate peft && \
python3 -m pip install --no-cache-dir bitsandbytes optimum auto-gptq
# Add autoawq for quantization testing
RUN source activate peft && \
python3 -m pip install --no-cache-dir https://github.com/casper-hansen/AutoAWQ/releases/download/v0.2.4/autoawq-0.2.4-cp38-cp38-linux_x86_64.whl
RUN source activate peft && \
python3 -m pip install --no-cache-dir https://github.com/casper-hansen/AutoAWQ_kernels/releases/download/v0.0.6/autoawq_kernels-0.0.6-cp38-cp38-linux_x86_64.whl
# Install apt libs
RUN apt-get update && \
apt-get install -y curl git wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists*
# Add eetq for quantization testing
RUN source activate peft && \
RUN chsh -s /bin/bash
SHELL ["/bin/bash", "-c"]
RUN source activate peft && \
python3 -m pip install --no-cache-dir bitsandbytes optimum auto-gptq && \
# Add autoawq for quantization testing
python3 -m pip install --no-cache-dir https://github.com/casper-hansen/AutoAWQ/releases/download/v0.2.7.post2/autoawq-0.2.7.post2-py3-none-any.whl && \
python3 -m pip install --no-cache-dir https://github.com/casper-hansen/AutoAWQ_kernels/releases/download/v0.0.9/autoawq_kernels-0.0.9-cp311-cp311-linux_x86_64.whl && \
# Add eetq for quantization testing
python3 -m pip install git+https://github.com/NetEase-FuXi/EETQ.git
# Activate the conda env and install transformers + accelerate from source
@ -62,19 +52,16 @@ RUN source activate peft && \
librosa \
"soundfile>=0.12.1" \
scipy \
torchao \
git+https://github.com/huggingface/transformers \
git+https://github.com/huggingface/accelerate \
peft[test]@git+https://github.com/huggingface/peft
peft[test]@git+https://github.com/huggingface/peft \
# Add aqlm for quantization testing
aqlm[gpu]>=1.0.2 \
# Add HQQ for quantization testing
hqq
# Add aqlm for quantization testing
RUN source activate peft && \
pip install aqlm[gpu]>=1.0.2
# Add HQQ for quantization testing
RUN source activate peft && \
pip install hqq
RUN source activate peft && \
pip freeze | grep transformers
RUN echo "source activate peft" >> ~/.profile

View File

@ -45,8 +45,6 @@
title: Troubleshooting
- local: developer_guides/checkpoint
title: PEFT checkpoint format
- local: package_reference/helpers
title: Helpers
- title: 🤗 Accelerate integrations
sections:
@ -92,6 +90,8 @@
title: LoKr
- local: package_reference/lora
title: LoRA
- local: package_reference/xlora
title: X-LoRA
- local: package_reference/adapter_utils
title: LyCORIS
- local: package_reference/multitask_prompt_tuning
@ -114,11 +114,36 @@
title: VeRA
- local: package_reference/fourierft
title: FourierFT
- local: package_reference/vblora
title: VB-LoRA
- local: package_reference/hra
title: HRA
- local: package_reference/cpt
title: CPT
- local: package_reference/bone
title: Bone
- local: package_reference/trainable_tokens
title: Trainable Tokens
- local: package_reference/randlora
title: RandLora
- local: package_reference/shira
title: SHiRA
- local: package_reference/c3a
title: C3A
- local: package_reference/miss
title: MiSS
- local: package_reference/road
title: RoAd
title: Adapters
- sections:
- local: package_reference/merge_utils
title: Model merge
- local: package_reference/helpers
title: Helpers
- local: package_reference/hotswap
title: Hotswapping adapters
- local: package_reference/functional
title: Functions for PEFT integration
title: Utilities
title: API reference

View File

@ -94,7 +94,7 @@ accelerate launch --config_file "configs/deepspeed_config.yaml" train.py \
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--evaluation_strategy "epoch" \
--eval_strategy "epoch" \
--save_strategy "epoch" \
--push_to_hub \
--hub_private_repo True \
@ -128,24 +128,17 @@ Notice that we are using LoRA with rank=8, alpha=16 and targeting all linear la
Let's dive a little deeper into the script so you can see what's going on, and understand how it works.
The first thing to know is that the script uses DeepSpeed for distributed training as the DeepSpeed config has been passed. The `SFTTrainer` class handles all the heavy lifting of creating the PEFT model using the peft config that is passed. After that, when you call `trainer.train()`, `SFTTrainer` internally uses 🤗 Accelerate to prepare the model, optimizer and trainer using the DeepSpeed config to create DeepSpeed engine which is then trained. The main code snippet is below:
The first thing to know is that the script uses DeepSpeed for distributed training as the DeepSpeed config has been passed. The [`~trl.SFTTrainer`] class handles all the heavy lifting of creating the PEFT model using the peft config that is passed. After that, when you call `trainer.train()`, [`~trl.SFTTrainer`] internally uses 🤗 Accelerate to prepare the model, optimizer and trainer using the DeepSpeed config to create the DeepSpeed engine, which is then trained. The main code snippet is below:
```python
# trainer
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
processing_class=tokenizer,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
peft_config=peft_config,
packing=data_args.packing,
dataset_kwargs={
"append_concat_token": data_args.append_concat_token,
"add_special_tokens": data_args.add_special_tokens,
},
dataset_text_field=data_args.dataset_text_field,
max_seq_length=data_args.max_seq_length,
)
trainer.accelerator.print(f"{trainer.model}")
@ -175,7 +168,7 @@ You can also refer this blog post [Falcon 180B Finetuning using 🤗 PEFT and De
# Use PEFT QLoRA and DeepSpeed with ZeRO3 for finetuning large models on multiple GPUs
In this section, we will look at how to use QLoRA and DeepSpeed Stage-3 for finetuning 70B llama model on 2X40GB GPUs.
For this, we first need `bitsandbytes>=0.43.0`, `accelerate>=0.28.0`, `transformers>4.38.2`, `trl>0.7.11` and `peft>0.9.0`. We need to set `zero3_init_flag` to true when using Accelerate config. Below is the config which can be found at [deepspeed_config_z3_qlora.yaml](https://github.com/huggingface/peft/blob/main/examples/sft/configs/deepspeed_config_z3_qlora.yaml):
For this, we first need `bitsandbytes>=0.43.3`, `accelerate>=1.0.1`, `transformers>4.44.2`, `trl>0.11.4` and `peft>0.13.0`. We need to set `zero3_init_flag` to true when using Accelerate config. Below is the config which can be found at [deepspeed_config_z3_qlora.yaml](https://github.com/huggingface/peft/blob/main/examples/sft/configs/deepspeed_config_z3_qlora.yaml):
```yml
compute_environment: LOCAL_MACHINE
@ -202,7 +195,7 @@ tpu_use_sudo: false
use_cpu: false
```
Launch command is given below which is available at [run_peft_qlora_deepspeed_stage3.sh](https://github.com/huggingface/peft/blob/main/examples/sft/run_peft_deepspeed.sh):
Launch command is given below which is available at [run_peft_qlora_deepspeed_stage3.sh](https://github.com/huggingface/peft/blob/main/examples/sft/run_peft_qlora_deepspeed_stage3.sh):
```
accelerate launch --config_file "configs/deepspeed_config_z3_qlora.yaml" train.py \
--seed 100 \
@ -217,7 +210,7 @@ accelerate launch --config_file "configs/deepspeed_config_z3_qlora.yaml" train.
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--evaluation_strategy "epoch" \
--eval_strategy "epoch" \
--save_strategy "epoch" \
--push_to_hub \
--hub_private_repo True \
@ -445,3 +438,21 @@ dataset['train'][label_column][:10]=['no complaint', 'no complaint', 'complaint'
1. Merging when using PEFT and DeepSpeed is currently unsupported and will raise an error.
2. When using CPU offloading, the major gains from using PEFT to shrink the optimizer states and gradients to those of the adapter weights are realized on CPU RAM; there are no savings with respect to GPU memory.
3. DeepSpeed Stage 3 and QLoRA, when used with CPU offloading, lead to more GPU memory usage compared to disabling CPU offloading.
<Tip>
💡 When you have code that requires merging (and unmerging) of weights, try to manually collect the parameters with DeepSpeed Zero-3 beforehand:
```python
import deepspeed
is_ds_zero_3 = ... # check if Zero-3
with deepspeed.zero.GatheredParameters(list(model.parameters()), enabled=is_ds_zero_3):
    model.merge_adapter()
    # do whatever is needed, then unmerge in the same context if unmerging is required
    ...
    model.unmerge_adapter()
```
</Tip>

View File

@ -74,7 +74,7 @@ accelerate launch --config_file "configs/fsdp_config.yaml" train.py \
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--evaluation_strategy "epoch" \
--eval_strategy "epoch" \
--save_strategy "epoch" \
--push_to_hub \
--hub_private_repo True \
@ -108,24 +108,17 @@ Notice that we are using LoRA with rank=8, alpha=16 and targeting all linear la
Let's dive a little deeper into the script so you can see what's going on, and understand how it works.
The first thing to know is that the script uses FSDP for distributed training as the FSDP config has been passed. The `SFTTrainer` class handles all the heavy lifting of creating PEFT model using the peft config that is passed. After that when you call `trainer.train()`, Trainer internally uses 🤗 Accelerate to prepare model, optimizer and trainer using the FSDP config to create FSDP wrapped model which is then trained. The main code snippet is below:
The first thing to know is that the script uses FSDP for distributed training as the FSDP config has been passed. The [`~trl.SFTTrainer`] class handles all the heavy lifting of creating the PEFT model using the peft config that is passed. After that, when you call `trainer.train()`, Trainer internally uses 🤗 Accelerate to prepare the model, optimizer and trainer using the FSDP config to create the FSDP-wrapped model, which is then trained. The main code snippet is below:
```python
# trainer
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
processing_class=tokenizer,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
peft_config=peft_config,
packing=data_args.packing,
dataset_kwargs={
"append_concat_token": data_args.append_concat_token,
"add_special_tokens": data_args.add_special_tokens,
},
dataset_text_field=data_args.dataset_text_field,
max_seq_length=data_args.max_seq_length,
)
trainer.accelerator.print(f"{trainer.model}")
if model_args.use_peft_lora:
@ -173,7 +166,7 @@ In the above example, the memory consumed per GPU is 72-80 GB (90-98%) as seen
In this section, we will look at how to use QLoRA and FSDP for finetuning 70B llama model on 2X24GB GPUs. [Answer.AI](https://www.answer.ai/) in collaboration with bitsandbytes and Hugging Face 🤗 open sourced code enabling the usage of FSDP+QLoRA and explained the whole process in their insightful blogpost [You can now train a 70b language model at home](https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html). This is now integrated in Hugging Face ecosystem.
For this, we first need `bitsandbytes>=0.43.0`, `accelerate>=0.28.0`, `transformers>4.38.2`, `trl>0.7.11` and `peft>0.9.0`. We need to set `fsdp_cpu_ram_efficient_loading=true`, `fsdp_use_orig_params=false` and `fsdp_offload_params=true`(cpu offloading) when using Accelerate config. When not using accelerate launcher, you can alternately set the environment variable `export FSDP_CPU_RAM_EFFICIENT_LOADING=true`. Here, we will be using accelerate config and below is the config which can be found at [fsdp_config_qlora.yaml](https://github.com/huggingface/peft/blob/main/examples/sft/configs/fsdp_config_qlora.yaml):
For this, we first need `bitsandbytes>=0.43.3`, `accelerate>=1.0.1`, `transformers>4.44.2`, `trl>0.11.4` and `peft>0.13.0`. We need to set `fsdp_cpu_ram_efficient_loading=true`, `fsdp_use_orig_params=false` and `fsdp_offload_params=true` (CPU offloading) when using Accelerate config. When not using accelerate launcher, you can alternatively set the environment variable `export FSDP_CPU_RAM_EFFICIENT_LOADING=true`. Here, we will be using accelerate config and below is the config which can be found at [fsdp_config_qlora.yaml](https://github.com/huggingface/peft/blob/main/examples/sft/configs/fsdp_config_qlora.yaml):
```yml
compute_environment: LOCAL_MACHINE
@ -218,7 +211,7 @@ accelerate launch --config_file "configs/fsdp_config_qlora.yaml" train.py \
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--evaluation_strategy "epoch" \
--eval_strategy "epoch" \
--save_strategy "epoch" \
--push_to_hub \
--hub_private_repo True \

View File

@ -50,6 +50,18 @@ In principle, LoRA can be applied to any subset of weight matrices in a neural n
</div>
<small><a href="https://hf.co/papers/2103.10385">Navigating Text-To-Image Customization: From LyCORIS Fine-Tuning to Model Evaluation</a></small>
## Mixture of LoRA Experts (X-LoRA)
[X-LoRA](https://huggingface.co/papers/2402.07148) is a mixture of experts method for LoRA which works by using dense or sparse gating to dynamically activate LoRA experts. The LoRA experts as well as the base model are frozen during training, resulting in a low parameter count as only the gating layers must be trained. In particular, the gating layers output scalings which (depending on config) are granular on the layer and token level. Additionally, during inference, X-LoRA dynamically activates LoRA adapters to recall knowledge and effectively mix them:
The below graphic demonstrates how the scalings change for different prompts for each token. This highlights the activation of different adapters as the generation progresses and the sequence creates new context.
![Token-by-token scalings](https://github.com/EricLBuehler/xlora/raw/master/res/token_by_token_scalings.gif)
For each step, X-LoRA requires the base model to be run twice: first, to get hidden states without any LoRA adapters; second, the hidden states are used to calculate scalings which are applied to the LoRA adapters, and the model is run again. The output of the second run is the result of the model step.
Ultimately, X-LoRA allows the model to reflect upon its knowledge because of the dual forward pass scheme, and dynamically reconfigure the architecture.
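As a rough sketch of how this looks in PEFT, an X-LoRA model is built by pointing `XLoraConfig` at a set of already trained LoRA adapters. The model name, adapter paths, and `xlora_depth` value below are illustrative assumptions, not recommendations:

```python
# A minimal sketch, assuming two pretrained LoRA adapters saved on disk;
# the model name, adapter paths and xlora_depth are illustrative placeholders.
from transformers import AutoConfig, AutoModelForCausalLM
from peft import XLoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("model_name")
model_config = AutoConfig.from_pretrained("model_name")

config = XLoraConfig(
    task_type="CAUSAL_LM",
    hidden_size=model_config.hidden_size,
    xlora_depth=4,
    adapters={
        "adapter_1": "./path/to/the/checkpoint_1/",
        "adapter_2": "./path/to/the/checkpoint_2/",
    },
)
xlora_model = get_peft_model(base_model, config)
```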
## Low-Rank Hadamard Product (LoHa)
Low-rank decomposition can impact performance because the weight updates are limited to the low-rank space, which can constrain a model's expressiveness. However, you don't necessarily want to use a larger rank because it increases the number of trainable parameters. To address this, [LoHa](https://huggingface.co/papers/2108.06098) (a method originally developed for computer vision) was applied to diffusion models where the ability to generate diverse images is an important consideration. LoHa should also work with general model types, but the embedding layers aren't currently implemented in PEFT.
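Configuring LoHa in PEFT is similar to LoRA; a minimal sketch (the model name, rank, alpha, and target module names are illustrative assumptions) could look like this:

```python
# A minimal sketch, assuming a causal LM whose attention projections are named q_proj/v_proj.
from transformers import AutoModelForCausalLM
from peft import LoHaConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("model_name")
config = LoHaConfig(
    r=8,
    alpha=16,
    target_modules=["q_proj", "v_proj"],
)
loha_model = get_peft_model(base_model, config)
```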
@ -73,19 +85,23 @@ OFT preserves the hyperspherical energy by learning an orthogonal transformation
## Orthogonal Butterfly (BOFT)
[BOFT](https://hf.co/papers/2311.06243) is a method that primarily focuses on preserving a pretrained model's generative performance in the finetuned model. It tries to maintain the same cosine similarity (hyperspherical energy) between all pairwise neurons in a layer because this better captures the semantic information among neurons. This means OFT is more capable at preserving the subject and it is better for controllable generation (similar to [ControlNet](https://huggingface.co/docs/diffusers/using-diffusers/controlnet)).
[BOFT](https://hf.co/papers/2311.06243) is an improved orthogonal finetuning method that focuses on preserving a pretrained model's generative capabilities while being significantly more parameter-efficient than standard OFT. Like OFT, BOFT maintains the same cosine similarity (hyperspherical energy) between all pairwise neurons in a layer by applying an orthogonal transformation to the pretrained weight matrix, ensuring the semantic relationships among neurons are preserved.
OFT preserves the hyperspherical energy by learning an orthogonal transformation for neurons to keep the cosine similarity between them unchanged. In practice, this means taking the matrix product of an orthogonal matrix with the pretrained weight matrix. However, to be parameter-efficient, the orthogonal matrix is represented as a block-diagonal matrix with rank `r` blocks. Whereas LoRA reduces the number of trainable parameters with low-rank structures, OFT reduces the number of trainable parameters with a sparse block-diagonal matrix structure.
Instead of using a block-diagonal orthogonal matrix, BOFT factorizes the orthogonal transformation into a product of **sparse butterfly matrices** (originally introduced in the [Cooley-Tukey FFT](https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm)). Unlike OFT's block-diagonal rotations, which only mix inputs within each block, the butterfly structure guarantees that every input can influence every output, producing a **dense connectivity** with just `O(d log d)` parameters. This factorization preserves expressivity while drastically reducing the parameter count compared to OFT (at the expense of computation time).
In practice, BOFT multiplies each pretrained weight matrix by a sequence of butterfly-structured orthogonal factors, enabling efficient and expressive neuron rotations. This makes BOFT well-suited for controllable generation and tasks where maintaining the pretrained model's subject representation is critical, while also scaling to larger models with lower memory and compute overhead.
## Adaptive Low-Rank Adaptation (AdaLoRA)
[AdaLoRA](https://hf.co/papers/2303.10512) manages the parameter budget introduced from LoRA by allocating more parameters - in other words, a higher rank `r` - for important weight matrices that are better adapted for a task and pruning less important ones. The rank is controlled by a method similar to singular value decomposition (SVD). The ∆W is parameterized with two orthogonal matrices and a diagonal matrix which contains singular values. This parametrization method avoids iteratively applying SVD which is computationally expensive. Based on this method, the rank of ∆W is adjusted according to an importance score. ∆W is divided into triplets and each triplet is scored according to its contribution to model performance. Triplets with low importance scores are pruned and triplets with high importance scores are kept for finetuning.
Training with AdaLoRA has three phases: the init phase, the budgeting phase and the final phase. In the init phase, no budgeting is applied, so the ranks are not touched. During the budgeting phase, the process described above is applied and the rank is redistributed according to a budget, aiming to give more important adapters more rank and less important layers less. In the final phase, budgeting has ended and the ranks have been redistributed, but training may continue for a while with the redistributed ranks to further improve performance.
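A minimal configuration sketch that maps onto these three phases might look as follows; the model name, schedule values, and target module names are illustrative assumptions, not tuned recommendations:

```python
# A minimal sketch: tinit/tfinal delimit the init, budgeting and final phases described above.
from transformers import AutoModelForCausalLM
from peft import AdaLoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("model_name")
config = AdaLoraConfig(
    init_r=12,         # starting rank before budgeting
    target_r=8,        # average rank targeted by the budget
    tinit=200,         # steps of the init phase (no budgeting yet)
    tfinal=1000,       # steps of the final phase (ranks fixed, training continues)
    total_step=10000,  # total number of training steps
    target_modules=["q_proj", "v_proj"],
)
adalora_model = get_peft_model(base_model, config)
```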
## Llama-Adapter
[Llama-Adapter](https://hf.co/papers/2303.16199) is a method for adapting Llama into a instruction-following model. To help adapt the model for instruction-following, the adapter is trained with a 52K instruction-output dataset.
[Llama-Adapter](https://hf.co/papers/2303.16199) is a method for adapting Llama into an instruction-following model. To help adapt the model for instruction-following, the adapter is trained with a 52K instruction-output dataset.
A set of of learnable adaption prompts are prefixed to the input instruction tokens. These are inserted into the upper layers of the model because it is better to learn with the higher-level semantics of the pretrained model. The instruction-output tokens prefixed to the input guide the adaption prompt to generate a contextual response.
A set of learnable adaption prompts are prefixed to the input instruction tokens. These are inserted into the upper layers of the model because it is better to learn with the higher-level semantics of the pretrained model. The instruction-output tokens prefixed to the input guide the adaption prompt to generate a contextual response.
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/llama-adapter.png"/>
@ -93,3 +109,31 @@ A set of of learnable adaption prompts are prefixed to the input instruction tok
<small><a href="https://hf.co/papers/2303.16199">LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention</a></small>
To avoid adding noise to the tokens, the adapter uses zero-initialized attention. On top of this, the adapter adds a learnable gating factor (initialized with zeros) to progressively add information to the model during training. This prevents overwhelming the model's pretrained knowledge with the newly learned instructions.
## Householder Reflection Adaptation (HRA)
[HRA](https://huggingface.co/papers/2405.17484) provides a new perspective connecting LoRA to OFT, which means it can harness the advantages of both strategies, reduce parameters and computation costs while penalizing the loss of pre-training knowledge.
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/hra.png"/>
</div>
<small><a href="https://huggingface.co/papers/2405.17484">Bridging The Gap between Low-rank and Orthogonal Adaptation via Householder Reflection Adaptation</a></small>
HRA constructs a chain of `r` trainable Householder reflections (HRs). Because the Householder reflection matrix is an orthogonal matrix and the product of orthogonal matrices is also an orthogonal matrix, HRA satisfies the theoretical guarantee of Orthogonal Finetuning (OFT). Meanwhile, HRA can also be viewed as a low-rank fine-tuning adapter by rewriting its formula.
The higher `r`, the more trainable parameters, resulting in a larger model capacity and better performance. Besides, due to the chain structure, the orthogonality of HR planes impacts the capacity and regularity of HRA. To achieve a trade-off between the model capacity and regularity, an orthogonality regularizer of the HR planes is added to the loss function. The weight \\(\lambda\\) can control the strength of the regularizer.
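In PEFT, the length of the Householder reflection chain is controlled by `r` in `HRAConfig`; a minimal sketch (the model name and target module names are illustrative assumptions) could look like this:

```python
# A minimal sketch: r controls the number of chained Householder reflections per targeted layer.
from transformers import AutoModelForCausalLM
from peft import HRAConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("model_name")
config = HRAConfig(
    r=8,
    target_modules=["q_proj", "v_proj"],
)
hra_model = get_peft_model(base_model, config)
```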
## Bone
Bone has been superseded by [MiSS](https://huggingface.co/papers/2409.15371), the new version of the paper (MiSS: Balancing LoRA Performance and Efficiency with Simple Shard Sharing).
If you already have a Bone checkpoint, you can use `/scripts/convert-bone-to-miss.py` to convert it into a MiSS checkpoint and proceed with training using MiSS.
## MiSS
[MiSS](https://huggingface.co/papers/2409.15371) (Matrix Shard Sharing) is a novel Parameter-Efficient Fine-Tuning (PEFT) method designed to address the trade-off between adaptability and efficiency in Large Language Models. The core approach of MiSS involves a simple shard-sharing mechanism. It achieves low-rank adaptation by decomposing a weight matrix into multiple fragments and then utilizing a shared, trainable "common fragment." The final low-rank update matrix is constructed by replicating these shared, partitioned shards. In short, MiSS adopts a low-rank structure, requires only a single trainable matrix, and introduces an update mechanism distinct from LoRA, achieving a good balance between performance and efficiency.
<small><a href="https://huggingface.co/papers/2409.15371">MiSS: Balancing LoRA Performance and Efficiency with Simple Shard Sharing</a></small>
Intuitively, the shape of the single trainable matrix in MiSS is consistent with `lora_B`, so for the same `r`, MiSS uses `in_features * r` fewer trainable parameters than LoRA.
Note: Bat's r (b) is special and requires that weight W satisfies the conditions `in_features % r == 0` and `out_features % r == 0`. Additionally, when `in_features == out_features` and MiSS-r equals LoRA-r, MiSS's number of trainable parameters is only half that of LoRA.
Although the nonlinear updates of Bat bring some performance improvements, they also increase computational overhead. Its main purpose is to provide researchers with a direction for improvement. Therefore, we recommend fine-tuning the comprehensive MiSS model instead.

View File

@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
# IA3
This conceptual guide gives a brief overview of [IA3](https://arxiv.org/abs/2205.05638), a parameter-efficient fine tuning technique that is
This conceptual guide gives a brief overview of [IA3](https://huggingface.co/papers/2205.05638), a parameter-efficient fine tuning technique that is
intended to improve over [LoRA](./lora).
To make fine-tuning more efficient, IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations)

View File

@ -16,9 +16,9 @@ rendered properly in your Markdown viewer.
# Orthogonal Finetuning (OFT and BOFT)
This conceptual guide gives a brief overview of [OFT](https://arxiv.org/abs/2306.07280) and [BOFT](https://arxiv.org/abs/2311.06243), a parameter-efficient fine-tuning technique that utilizes orthogonal matrix to multiplicatively transform the pretrained weight matrices.
This conceptual guide gives a brief overview of [OFT](https://huggingface.co/papers/2306.07280), [OFTv2](https://www.arxiv.org/abs/2506.19847) and [BOFT](https://huggingface.co/papers/2311.06243), parameter-efficient fine-tuning techniques that utilize an orthogonal matrix to multiplicatively transform the pretrained weight matrices.
To achieve efficient fine-tuning, OFT represents the weight updates with an orthogonal transformation. The orthogonal transformation is parameterized by an orthogonal matrix multiplied to the pretrained weight matrix. These new matrices can be trained to adapt to the new data while keeping the overall number of changes low. The original weight matrix remains frozen and doesnt receive any further adjustments. To produce the final results, both the original and the adapted weights are multiplied togethor.
To achieve efficient fine-tuning, OFT represents the weight updates with an orthogonal transformation. The orthogonal transformation is parameterized by an orthogonal matrix multiplied with the pretrained weight matrix. These new matrices can be trained to adapt to the new data while keeping the overall number of changes low. The original weight matrix remains frozen and doesn't receive any further adjustments. To produce the final results, both the original and the adapted weights are multiplied together.
Orthogonal Butterfly (BOFT) generalizes OFT with Butterfly factorization and further improves its parameter efficiency and finetuning flexibility. In short, OFT can be viewed as a special case of BOFT. Different from LoRA that uses additive low-rank weight updates, BOFT uses multiplicative orthogonal weight updates. The comparison is shown below.
@ -30,7 +30,7 @@ Orthogonal Butterfly (BOFT) generalizes OFT with Butterfly factorization and fur
BOFT has some advantages compared to LoRA:
* BOFT proposes a simple yet generic way to finetune pretrained models to downstream tasks, yielding a better preservation of pretraining knowledge and a better parameter efficiency.
* Through the orthogonality, BOFT introduces a structural constraint, i.e., keeping the [hyperspherical energy](https://arxiv.org/abs/1805.09298) unchanged during finetuning. This can effectively reduce the forgetting of pretraining knowledge.
* Through the orthogonality, BOFT introduces a structural constraint, i.e., keeping the [hyperspherical energy](https://huggingface.co/papers/1805.09298) unchanged during finetuning. This can effectively reduce the forgetting of pretraining knowledge.
* BOFT uses the butterfly factorization to efficiently parameterize the orthogonal matrix, which yields a compact yet expressive learning space (i.e., hypothesis class).
* The sparse matrix decomposition in BOFT brings in additional inductive biases that are beneficial to generalization.
@ -58,13 +58,25 @@ As with other methods supported by PEFT, to fine-tune a model using OFT or BOFT,
4. Train the `PeftModel` as you normally would train the base model.
### BOFT-specific paramters
### OFT-specific parameters
`BOFTConfig` allows you to control how OFT/BOFT is applied to the base model through the following parameters:
`OFTConfig` allows you to control how OFT is applied to the base model through the following parameters:
- `boft_block_size`: the BOFT matrix block size across different layers, expressed in `int`. Smaller block size results in sparser update matrices with fewer trainable paramters. **Note**, please choose `boft_block_size` to be divisible by most layer's input dimension (`in_features`), e.g., 4, 8, 16. Also, please only
- `r`: OFT rank, i.e. the number of OFT blocks per injected layer. A **bigger** `r` results in sparser update matrices with **fewer** trainable parameters. **Note**: You can only specify either `r` or `oft_block_size`, but not both simultaneously, because `r` × `oft_block_size` = layer dimension. For simplicity, we let the user specify either `r` or `oft_block_size` and infer the other one. The default is `r = 0`; the user is advised to set `oft_block_size` instead for better clarity.
- `oft_block_size`: OFT block size across different layers. A **bigger** `oft_block_size` results in denser update matrices with **more** trainable parameters. **Note**: Please choose `oft_block_size` to be divisible by the layer's input dimension (`in_features`), e.g., 4, 8, 16. You can only specify either `r` or `oft_block_size`, but not both simultaneously, because `r` × `oft_block_size` = layer dimension. For simplicity, we let the user specify either `r` or `oft_block_size` and infer the other one. The default is `oft_block_size = 32`.
- `use_cayley_neumann`: Specifies whether to use the Cayley-Neumann parameterization (efficient but approximate) or the vanilla Cayley parameterization (exact but computationally expensive because of the matrix inverse). We recommend setting it to `True` for better efficiency, but performance may be slightly worse because of the approximation error. Please test both settings (`True` and `False`) depending on your needs. The default is `False`.
- `module_dropout`: The multiplicative dropout probability, by setting OFT blocks to identity during training, similar to the dropout layer in LoRA.
- `bias`: specify if the `bias` parameters should be trained. Can be `"none"`, `"all"` or `"oft_only"`.
- `target_modules`: The modules (for example, attention blocks) to inject the OFT matrices.
- `modules_to_save`: List of modules apart from OFT matrices to be set as trainable and saved in the final checkpoint. These typically include model's custom head that is randomly initialized for the fine-tuning task.
### BOFT-specific parameters
`BOFTConfig` allows you to control how BOFT is applied to the base model through the following parameters:
- `boft_block_size`: the BOFT matrix block size across different layers, expressed in `int`. **Bigger** `boft_block_size` results in more dense update matrices with **more** trainable parameters. **Note**, please choose `boft_block_size` to be divisible by most layer's input dimension (`in_features`), e.g., 4, 8, 16. Also, please only
specify either `boft_block_size` or `boft_block_num`, but not both simultaneously or leaving both to 0, because `boft_block_size` x `boft_block_num` must equal the layer's input dimension.
- `boft_block_num`: the number of BOFT matrix blocks across different layers, expressed in `int`. Fewer blocks result in sparser update matrices with fewer trainable paramters. **Note**, please choose `boft_block_num` to be divisible by most layer's input dimension (`in_features`), e.g., 4, 8, 16. Also, please only
- `boft_block_num`: the number of BOFT matrix blocks across different layers, expressed in `int`. **Bigger** `boft_block_num` result in sparser update matrices with **fewer** trainable parameters. **Note**, please choose `boft_block_num` to be divisible by most layer's input dimension (`in_features`), e.g., 4, 8, 16. Also, please only
specify either `boft_block_size` or `boft_block_num`, but not both simultaneously or leaving both to 0, because `boft_block_size` x `boft_block_num` must equal the layer's input dimension.
- `boft_n_butterfly_factor`: the number of butterfly factors. **Note**: for `boft_n_butterfly_factor=1`, BOFT is the same as vanilla OFT; for `boft_n_butterfly_factor=2`, the effective block size of OFT becomes twice as big and the number of blocks is halved.
- `bias`: specify if the `bias` parameters should be trained. Can be `"none"`, `"all"` or `"boft_only"`.
@ -74,13 +86,59 @@ specify either `boft_block_size` or `boft_block_num`, but not both simultaneousl
## OFT Example Usage
To use OFT for quantized fine-tuning with [TRL](https://github.com/huggingface/trl) for `SFT`, `PPO`, or `DPO`, follow this outline:
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTTrainer
from peft import OFTConfig
if use_quantization:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_storage=torch.bfloat16,
    )
model = AutoModelForCausalLM.from_pretrained(
    "model_name",
    quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained("model_name")
# Configure OFT
peft_config = OFTConfig(
    oft_block_size=32,
    use_cayley_neumann=True,
    target_modules="all-linear",
    bias="none",
    task_type="CAUSAL_LM",
)
trainer = SFTTrainer(
    model=model,
    train_dataset=ds['train'],
    peft_config=peft_config,
    processing_class=tokenizer,
    args=training_arguments,
    data_collator=collator,
)
trainer.train()
```
## BOFT Example Usage
For an example of the BOFT method application to various downstream tasks, please refer to the following guides:
Take a look at the following step-by-step guides on how to finetune a model with BOFT:
- [Dreambooth finetuning with BOFT](../task_guides/boft_dreambooth)
- [Controllable generation finetuning with BOFT (ControlNet)](../task_guides/boft_controlnet)
- [Dreambooth finetuning with BOFT](https://github.com/huggingface/peft/blob/main/examples/boft_dreambooth/boft_dreambooth.md)
- [Controllable generation finetuning with BOFT (ControlNet)](https://github.com/huggingface/peft/blob/main/examples/boft_controlnet/boft_controlnet.md)
For the task of image classification, one can initialize the BOFT config for a DinoV2 model as follows:
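The corresponding snippet is not shown in this view; a minimal sketch, assuming a DinoV2 checkpoint and illustrative target module names and hyperparameters, could look like this:

```python
# A minimal sketch for image classification with BOFT; checkpoint name,
# target modules and hyperparameters are illustrative assumptions.
from transformers import AutoModelForImageClassification
from peft import BOFTConfig, get_peft_model

model = AutoModelForImageClassification.from_pretrained(
    "facebook/dinov2-large",
    num_labels=100,
)
config = BOFTConfig(
    boft_block_size=4,
    boft_n_butterfly_factor=2,
    target_modules=["query", "key", "value", "dense"],
    boft_dropout=0.1,
    bias="boft_only",
    modules_to_save=["classifier"],
)
boft_model = get_peft_model(model, config)
```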

View File

@ -75,3 +75,19 @@ Take a look at [P-tuning for sequence classification](../task_guides/ptuning-seq
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/mpt-decomposition.png"/>
</div>
<small><a href="https://hf.co/papers/2103.10385">Prompt decomposition</a>.</small>
## Context-Aware Prompt Tuning (CPT)
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/cpt.png"/>
</div>
<small>CPT optimizing only specific token embeddings while keeping the rest of the model frozen <a href="https://huggingface.co/papers/2410.17222">(image source)</a>.</small>
[Context-Aware Prompt Tuning (CPT)](https://huggingface.co/papers/2410.17222) is designed to enhance few-shot classification by refining only context embeddings.
This approach combines ideas from In-Context Learning (ICL), Prompt Tuning (PT), and adversarial optimization, focusing on making model adaptation both parameter-efficient and effective.
In CPT, only specific context token embeddings are optimized, while the rest of the model remains frozen.
To prevent overfitting and maintain stability, CPT uses controlled perturbations to limit the allowed changes to context embeddings within a defined range.
Additionally, to address the phenomenon of recency bias—where examples near the end of the context tend to be prioritized over earlier ones—CPT applies a decay loss factor.
Take a look at [Example](https://github.com/huggingface/peft/blob/main/examples/cpt_finetuning/README.md) for a step-by-step guide on how to train a model with CPT.

View File

@ -49,17 +49,17 @@ $ pip install pre-commit
$ pre-commit install
```
Running all the tests can take a couple of minutes, so during development it can be more efficient to only run tests specific to your change:
Running all the tests can take a while, so during development it can be more efficient to only [run tests specific to your change](https://docs.pytest.org/en/6.2.x/usage.html#specifying-tests-selecting-tests), e.g. via:
```sh
pytest tests/ -k <name-of-test>
pytest tests/<test-file-name> -k <name-of-test>
```
This should finish much quicker and allow for faster iteration. However, you should still run the whole test suite before creating a PR because your change can inadvertently break tests that at first glance are unrelated.
This should finish much quicker and allow for faster iteration.
If your change is specific to a hardware setting (e.g., it requires CUDA), take a look at [tests/test_gpu_examples.py](https://github.com/huggingface/peft/blob/1c1c7fdaa6e6abaa53939b865dee1eded82ad032/tests/test_gpu_examples.py) and [tests/test_common_gpu.py](https://github.com/huggingface/peft/blob/1c1c7fdaa6e6abaa53939b865dee1eded82ad032/tests/test_common_gpu.py) to see if it makes sense to add tests there. If your change could have an effect on saving and loading models, please run the tests with the `--regression` flag to trigger regression tests.
It can happen that while you're working on your PR, the underlying code base changes due to other changes being merged. If that happens, especially when there is a merge conflict, please update your branch with the latest changes. This can be a merge or a rebase, and we'll squash and merge the PR once it's ready.
It can happen that while you're working on your PR, the underlying code base changes due to other changes being merged. If that happens, especially when there is a merge conflict, please update your branch with the latest changes. This can be a merge or a rebase, and we'll squash and merge the PR once it's ready. If possible, avoid force pushes to make reviews easier.
## PR description
@ -77,10 +77,14 @@ Ideally when a bugfix is provided, it should be accompanied by a test for the bu
New parameter-efficient fine-tuning methods are developed all the time. If you would like to add a new and promising method to PEFT, please follow these steps.
1. Before you start to implement the new method, please open a GitHub issue with your proposal. This way, the maintainers can give you some early feedback.
2. Please add a link to the source (usually a paper) of the method. Some evidence should be provided there is general interest in using the method. We will not add new methods that are freshly published, but there is no evidence of demand for it.
1. Before you start to implement the new method, please open a [GitHub issue](https://github.com/huggingface/peft/issues) with your proposal. This way, the maintainers can give you some early feedback.
2. Please add a link to the source (usually a paper) of the method. The paper should be in a final state to avoid changing requirements during development (e.g. due to reviewer feedback).
3. When implementing the method, it makes sense to look for existing implementations to use as a guide. Moreover, when you structure your code, please take inspiration from the other PEFT methods. For example, if your method is similar to LoRA, it makes sense to structure your code similarly or even reuse some functions or classes where it makes sense (some code duplication is okay, but don't overdo it).
4. Ideally, in addition to the implementation of the new method, there should also be examples (notebooks, scripts), documentation, and an extensive test suite that proves the method works with a variety of tasks. However, this can be more challenging so it is acceptable to only provide the implementation and at least one working example. Documentation and tests can be added in follow up PRs.
4. Ideally, in addition to the implementation of the new method, there should also be
- [examples](https://github.com/huggingface/peft/tree/main/examples) (notebooks, scripts)
- [documentation](https://github.com/huggingface/peft/tree/main/docs/source)
- [extensive test suite](https://github.com/huggingface/peft/tree/main/tests) that proves the method correctly integrates with PEFT
- [experimental setup](https://github.com/huggingface/peft/tree/main/method_comparison#creating-new-experiments) to run benchmarks
5. Once you have something that seems to be working, don't hesitate to create a draft PR even if it's not in a mergeable state yet. The maintainers are happy to give you feedback and guidance along the way.
## Add other features

View File

@ -204,7 +204,7 @@ For a complete example, check out [this notebook](https://github.com/huggingface
When new popular transformers architectures are released, we do our best to quickly add them to PEFT. If you come across a transformers model that is not supported out of the box, don't worry, it will most likely still work if the config is set correctly. Specifically, you have to identify the layers that should be adapted and set them correctly when initializing the corresponding config class, e.g. `LoraConfig`. Here are some tips to help with this.
As a first step, it is a good idea is to check the existing models for inspiration. You can find them inside of [constants.py](https://github.com/huggingface/peft/blob/main/src/peft/utils/constants.py) in the PEFT repository. Often, you'll find a similar architecture that uses the same names. For example, if the new model architecture is a variation of the "mistral" model and you want to apply LoRA, you can see that the entry for "mistral" in `TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING` contains `["q_proj", "v_proj"]`. This tells you that for "mistral" models, the `target_modules` for LoRA should be `["q_proj", "v_proj"]`:
As a first step, it is a good idea to check the existing models for inspiration. You can find them inside of [constants.py](https://github.com/huggingface/peft/blob/main/src/peft/utils/constants.py) in the PEFT repository. Often, you'll find a similar architecture that uses the same names. For example, if the new model architecture is a variation of the "mistral" model and you want to apply LoRA, you can see that the entry for "mistral" in `TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING` contains `["q_proj", "v_proj"]`. This tells you that for "mistral" models, the `target_modules` for LoRA should be `["q_proj", "v_proj"]`:
```python
from peft import LoraConfig, get_peft_model
@ -219,7 +219,7 @@ peft_model = get_peft_model(my_mistral_model, config)
If that doesn't help, check the existing modules in your model architecture with the `named_modules` method and try to identify the attention layers, especially the key, query, and value layers. Those will often have names such as `c_attn`, `query`, `q_proj`, etc. The key layer is not always adapted, and ideally, you should check whether including it results in better performance.
Additionally, linear layers are common targets to be adapted (e.g. in [QLoRA paper](https://arxiv.org/abs/2305.14314), authors suggest to adapt them as well). Their names will often contain the strings `fc` or `dense`.
Additionally, linear layers are common targets to be adapted (e.g. in [QLoRA paper](https://huggingface.co/papers/2305.14314), authors suggest to adapt them as well). Their names will often contain the strings `fc` or `dense`.
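For example, a quick way to list candidate layer names is to iterate over `named_modules` and print the linear layers (`my_mistral_model` refers to the model loaded in the snippet above; this is just an inspection helper, not part of the PEFT API):

```python
import torch.nn as nn

# Print linear-layer names to identify candidate target_modules.
for name, module in my_mistral_model.named_modules():
    if isinstance(module, nn.Linear):
        print(name)
```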
If you want to add a new model to PEFT, please create an entry in [constants.py](https://github.com/huggingface/peft/blob/main/src/peft/utils/constants.py) and open a pull request on the [repository](https://github.com/huggingface/peft/pulls). Don't forget to update the [README](https://github.com/huggingface/peft#models-support-matrix) as well.

View File

@ -9,7 +9,7 @@ Unless required by applicable law or agreed to in writing, software distributed
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
@ -41,7 +41,7 @@ config = LoraConfig(init_lora_weights=False, ...)
```
### PiSSA
[PiSSA](https://arxiv.org/abs/2404.02948) initializes the LoRA adapter using the principal singular values and singular vectors. This straightforward modification allows PiSSA to converge more rapidly than LoRA and ultimately attain superior performance. Moreover, PiSSA reduces the quantization error compared to QLoRA, leading to further enhancements.
[PiSSA](https://huggingface.co/papers/2404.02948) initializes the LoRA adapter using the principal singular values and singular vectors. This straightforward modification allows PiSSA to converge more rapidly than LoRA and ultimately attain superior performance. Moreover, PiSSA reduces the quantization error compared to QLoRA, leading to further enhancements.
Configure the initialization method to "pissa", which may take several minutes to execute SVD on the pre-trained model:
```python
@ -50,12 +50,43 @@ config = LoraConfig(init_lora_weights="pissa", ...)
```
Alternatively, execute fast SVD, which takes only a few seconds. The number of iterations determines the trade-off between the error and computation time:
```python
lora_config = LoraConfig(init_lora_weights="pissa_niter_[number of iters]", ...)
lora_config = LoraConfig(init_lora_weights="pissa_niter_[number of iters]", ...)
```
For detailed instruction on using PiSSA, please follow [these instructions](https://github.com/fxmeng/peft/tree/main/examples/pissa_finetuning).
For detailed instruction on using PiSSA, please follow [these instructions](https://github.com/huggingface/peft/tree/main/examples/pissa_finetuning).
### CorDA
[CorDA](https://huggingface.co/papers/2406.05223) builds task-aware LoRA adapters from weight decomposition oriented by the context of downstream task to learn (instruction-previewed mode, IPM) or world knowledge to maintain (knowledge-preserved mode, KPM).
The KPM not only achieves better performance than LoRA on fine-tuning tasks, but also mitigates the catastrophic forgetting of pre-trained world knowledge.
When preserving pre-trained knowledge is not a concern,
the IPM is favored because it can further accelerate convergence and enhance the fine-tuning performance.
You need to configure the initialization method to "corda", and specify the mode of IPM or KPM and the dataset to collect covariance matrices.
```py
import torch

@torch.no_grad()
def run_model():
    # Assume `model` and `dataset` are in context...
    model.eval()
    for batch in dataset:
        model(**batch)

corda_config = CordaConfig(
    corda_method="kpm",
)
lora_config = LoraConfig(
    init_lora_weights="corda",
    corda_config=corda_config,
)
preprocess_corda(model, lora_config, run_model=run_model)
peft_model = get_peft_model(model, lora_config)
```
For detailed instruction on using CorDA, please follow [these instructions](https://github.com/huggingface/peft/tree/main/examples/corda_finetuning).
### OLoRA
[OLoRA](https://arxiv.org/abs/2406.01775) utilizes QR decomposition to initialize the LoRA adapters. OLoRA translates the base weights of the model by a factor of their QR decompositions, i.e., it mutates the weights before performing any training on them. This approach significantly improves stability, accelerates convergence speed, and ultimately achieves superior performance.
[OLoRA](https://huggingface.co/papers/2406.01775) utilizes QR decomposition to initialize the LoRA adapters. OLoRA translates the base weights of the model by a factor of their QR decompositions, i.e., it mutates the weights before performing any training on them. This approach significantly improves stability, accelerates convergence speed, and ultimately achieves superior performance.
You just need to pass a single additional option to use OLoRA:
```python
@ -63,15 +94,46 @@ from peft import LoraConfig
config = LoraConfig(init_lora_weights="olora", ...)
```
For more advanced usage, please refer to our [documentation](https://github.com/huggingface/peft/tree/main/examples/olora_finetuning).
### EVA
[EVA](https://huggingface.co/papers/2410.07170) performs SVD on the input activations of each layer and uses the right-singular vectors to initialize LoRA weights. It is therefore a data-driven initialization scheme. Furthermore EVA adaptively allocates ranks across layers based on their "explained variance ratio" - a metric derived from the SVD analysis.
You can use EVA by setting `init_lora_weights="eva"` and defining [`EvaConfig`] in [`LoraConfig`]:
```python
from peft import LoraConfig, EvaConfig
peft_config = LoraConfig(
    init_lora_weights="eva",
    eva_config=EvaConfig(rho=2.0),
    ...
)
```
The parameter `rho` (≥ 1.0) determines how much redistribution is allowed. When `rho=1.0` and `r=16`, LoRA adapters are limited to exactly 16 ranks, preventing any redistribution from occurring. A recommended value for EVA with redistribution is 2.0, meaning the maximum rank allowed for a layer is 2r.
It is recommended to perform EVA initialization on an accelerator (e.g. a CUDA GPU or Intel XPU) as it is much faster. To optimize the amount of available memory for EVA, you can use the `low_cpu_mem_usage` flag in [`get_peft_model`]:
```python
peft_model = get_peft_model(model, peft_config, low_cpu_mem_usage=True)
```
Then, call [`initialize_lora_eva_weights`] to initialize the EVA weights (in most cases the dataloader used for eva initialization can be the same as the one used for finetuning):
```python
initialize_lora_eva_weights(peft_model, dataloader)
```
EVA works out of the box with bitsandbytes. Simply initialize the model with `quantization_config` and call [`initialize_lora_eva_weights`] as usual.
<Tip>
For further instructions on using EVA, please refer to our [documentation](https://github.com/huggingface/peft/tree/main/examples/eva_finetuning).
</Tip>
### LoftQ
#### Standard approach
When quantizing the base model for QLoRA training, consider using the [LoftQ initialization](https://arxiv.org/abs/2310.08659), which has been shown to improve performance when training quantized models. The idea is that the LoRA weights are initialized such that the quantization error is minimized. To use LoftQ, follow [these instructions](https://github.com/huggingface/peft/tree/main/examples/loftq_finetuning).
When quantizing the base model for QLoRA training, consider using the [LoftQ initialization](https://huggingface.co/papers/2310.08659), which has been shown to improve performance when training quantized models. The idea is that the LoRA weights are initialized such that the quantization error is minimized. To use LoftQ, follow [these instructions](https://github.com/huggingface/peft/tree/main/examples/loftq_finetuning).
In general, for LoftQ to work best, it is recommended to target as many layers with LoRA as possible, since those not targeted cannot have LoftQ applied. This means that passing `LoraConfig(..., target_modules="all-linear")` will most likely give the best results. Also, you should use `nf4` as quant type in your quantization config when using 4bit quantization, i.e. `BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")`.
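A minimal sketch of the standard approach is shown below; the model name and bit width are illustrative assumptions, and note that the base model is loaded without quantization here, since LoftQ computes the quantization error itself:

```python
# A minimal sketch: LoftQ-initialized LoRA; model name and bit width are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoftQConfig, LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("model_name")  # do not quantize here
loftq_config = LoftQConfig(loftq_bits=4)
lora_config = LoraConfig(
    init_lora_weights="loftq",
    loftq_config=loftq_config,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base_model, lora_config)
```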
#### A more convienient way
#### A more convenient way
An easier but more limited way to apply LoftQ initialization is to use the convenience function `replace_lora_weights_loftq`. This takes the quantized PEFT model as input and replaces the LoRA weights in-place with their LoftQ-initialized counterparts.
@ -89,7 +151,7 @@ replace_lora_weights_loftq(peft_model)
`replace_lora_weights_loftq` also allows you to pass a `callback` argument to give you more control over which layers should be modified or not, which empirically can improve the results quite a lot. To see a more elaborate example of this, check out [this notebook](https://github.com/huggingface/peft/blob/main/examples/loftq_finetuning/LoftQ_weight_replacement.ipynb).
`replace_lora_weights_loftq` implements only one iteration step of LoftQ. This means that only the LoRA weights are updated, instead of iteratevily updating LoRA weights and quantized base model weights. This may lead to lower performance but has the advantage that we can use the original quantized weights derived from the base model, instead of having to keep an extra copy of modified quantized weights. Whether this tradeoff is worthwhile depends on the use case.
`replace_lora_weights_loftq` implements only one iteration step of LoftQ. This means that only the LoRA weights are updated, instead of iteratively updating LoRA weights and quantized base model weights. This may lead to lower performance but has the advantage that we can use the original quantized weights derived from the base model, instead of having to keep an extra copy of modified quantized weights. Whether this tradeoff is worthwhile depends on the use case.
At the moment, `replace_lora_weights_loftq` has these additional limitations:
@ -111,10 +173,115 @@ from peft import LoraConfig
config = LoraConfig(use_rslora=True, ...)
```
### Activated LoRA (aLoRA)
Activated LoRA (aLoRA) is a low rank adapter architecture for Causal LMs that allows for reusing existing base model KV cache for more efficient inference. This approach is best suited for inference pipelines which rely on the base model for most tasks/generations, but use aLoRA adapter(s) to perform specialized task(s) within the chain. For example, checking or correcting generated outputs of the base model. In these settings, inference times can be sped up by an order of magnitude or more. For more information on aLoRA and many example use cases, see https://huggingface.co/papers/2504.12397.
This technique scans for the last occurrence of an invocation sequence (`alora_invocation_tokens`) in each input (this can be as short as 1 token), and activates the adapter weights on tokens starting with the beginning of the invocation sequence (any inputs after the invocation sequence are also adapted, and all generated tokens will use the adapted weights). Weights on prior tokens are left un-adapted -- making the cache for those tokens interchangeable with base model cache due to the causal attention mask in Causal LMs. Usage is very similar to standard LoRA, with the key difference that this invocation sequence must be specified when the adapter is created:
```py
from peft import LoraConfig
config = LoraConfig(alora_invocation_tokens=alora_invocation_tokens, task_type="CAUSAL_LM", ...)
```
where `alora_invocation_tokens` is a list of integer token ids. Given a desired invocation string, this can be obtained as
```py
invocation_string = "placeholder"
alora_invocation_tokens = tokenizer.encode(invocation_string, add_special_tokens=False)
```
where the tokenizer is the tokenizer of the base model. Note that we pass `add_special_tokens=False` to avoid adding BOS/EOS tokens to the search string, which would most likely cause the invocation sequence not to be found.
**Notes**
* aLoRA is only supported for `task_type=CAUSAL_LM` tasks due to its focus on cache reuse.
* Since the weights are adapted on fewer tokens, often (not always) aLoRA requires higher rank (`r`) than LoRA. `r=32` can be a good starting point.
* aLoRA weights cannot be merged into the base model by definition, since the adapter weights are selectively applied to a subset of tokens. Attempts to merge will throw errors.
* Beam search is not yet supported.
* It is generally not recommended to add new tokens to the tokenizer that are not present in the base model, as this can complicate the target use case of both the base model and adapter model operating on overlapping context. That said, there is a possible workaround by first efficiently adding [trainable tokens](https://huggingface.co/docs/peft/en/package_reference/trainable_tokens) to the base model prior to training the adapter.
#### Choice of invocation sequence and SFT design
Each input must contain the `alora_invocation_tokens` sequence; it is not added automatically. To maximize model performance without compromising cache reuse, it is recommended to activate the adapter weights early, i.e. at the start of any adapter-specific prompting, but after any long inputs such as prior generations or documents. As with any model, formatting should be consistent between train and test.
Consider the following example, where the base model has a chat template,
and the goal is to train the adapter to generate a desired output.
* Option 1: If there is no task-specific prompt, i.e. the input is a chat history with the `assistant` prompt, then the chat template's `assistant` prompt (e.g. `<|start_of_role|>assistant<|end_of_role|>`) is a natural choice for the invocation string. See the model's chat template to find the prompt for the model.
* Option 2: If there is a task-specific prompt for the adapter that describes the task the adapter is learning, and that prompt is put as a `user` turn immediately prior to the generation, then the chat template's `user` prompt (e.g. `<|start_of_role|>user<|end_of_role|>`) is a natural choice for the invocation string.
Once you have decided on an invocation string, get the model tokenizer and obtain `alora_invocation_tokens` as
```py
alora_invocation_tokens = tokenizer.encode(invocation_string, add_special_tokens=False)
```
An example inference setup is at [alora finetuning](https://github.com/huggingface/peft/blob/main/examples/alora_finetuning/alora_finetuning.py).
**Note** If using custom strings for the invocation string, make sure that the start and end of the string are special tokens to avoid issues with tokenization at the boundaries.
To see why, imagine that 'a', 'b', 'c', and 'ab' are tokens in your tokenizer (numbers 1, 2, 3, 4 respectively). Suppose that your `alora_invocation_tokens = [2, 3]`. Now imagine your input string is "abc". Because "ab" is a token, this will get tokenized as `[4, 3]`. So the `alora_invocation_tokens` will fail to be found, despite the string "bc" being part of the input. If the start and end of the invocation string are special tokens, however, this failure case will never happen, since special tokens are never merged into the same token as other characters.
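The toy sketch below (with a made-up vocabulary and a greedy longest-match tokenizer, not a real one) illustrates this failure mode:

```python
# hypothetical vocabulary: 'a' -> 1, 'b' -> 2, 'c' -> 3, 'ab' -> 4
vocab = {"a": 1, "b": 2, "c": 3, "ab": 4}

def toy_encode(text):
    # greedy longest-match tokenization, similar in spirit to BPE-style tokenizers
    tokens, i = [], 0
    while i < len(text):
        piece = text[i:i + 2] if text[i:i + 2] in vocab else text[i]
        tokens.append(vocab[piece])
        i += len(piece)
    return tokens

alora_invocation_tokens = [2, 3]  # the invocation string "bc"
print(toy_encode("abc"))          # [4, 3] -- "ab" was merged into a single token,
                                  # so [2, 3] is never found even though "bc" is in the text
```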
#### Using (and reusing) cache for generation
The main purpose of Activated LoRA is to make KV cache interchangeable between the base model and aLoRA adapter models **prior to the invocation sequence** since base and adapted KV values are not compatible. Specifically, keys and values stored during one model generation can be used in subsequent generations to avoid expensive prefill operations for context tokens. When sharing cache between the base model and aLoRA adapters, there are 2 main patterns:
1. The base model has generated something, and an aLoRA adapter is then called to do a followup generation. Example: the base model answers a question, and an aLoRA trained to detect hallucinations checks the base model response.
2. An aLoRA adapter has generated something, and the base model or a different aLoRA adapter is called to do a followup generation where there is partial context overlap with the original aLoRA. Example: The user provides a query, and an aLoRA rewrites the query to be more self-contained and improve retrieval in a RAG system. Then, documents are retrieved and loaded into context, an aLoRA checks if these documents are indeed relevant to the question, and then the base model generates an answer.
To demonstrate the above behaviors when using caching, we're using [DynamicCache](https://huggingface.co/docs/transformers/en/kv_cache) from `transformers`. Care must be taken to ensure that adapted cache values are not mixed with base cache values. In particular, an extra step is required for sharing the cache when there is partial context overlap (pattern 2).
**Pattern 1: Base model followed by aLoRA** Here, the entire input and generation from the base model is input into the aLoRA adapter, along with the invocation sequence:
```py
from transformers import DynamicCache
...
cache = DynamicCache()
inputs_base = tokenizer(prompt_base, return_tensors="pt")
# Generate from base model and save cache
with model_alora.disable_adapter():
output = model_alora.generate(inputs_base["input_ids"].to(device), attention_mask=inputs_base["attention_mask"].to(device), past_key_values=cache, return_dict_in_generate=True)
output_text_base = tokenizer.decode(output.sequences[0])
cache = output.past_key_values
# Generate with aLoRA adapter from cache
prompt_alora = output_text_base + INVOCATION_STRING
inputs_alora = tokenizer(prompt_alora, return_tensors="pt").to(device)
output = model_alora.generate(**inputs_alora, past_key_values=cache)
output_text_alora = tokenizer.decode(output[0])
# Note: cache is now tainted with adapter values and cannot be used in base model from here on!
```
**Pattern 2: aLoRA generation followed by base model (or another aLoRA) with partial context overlap** Here, we prefill the shared context using the base model, and then generate.
```py
from transformers import DynamicCache
import copy
...
cache = DynamicCache()
inputs_shared = tokenizer(prompt_shared, return_tensors="pt").to(device)
# Prefill from base model and save cache
with model_alora.disable_adapter():
with torch.no_grad():
model_alora(**inputs_shared, past_key_values=cache)
cache_copy = copy.deepcopy(cache)
# Generate from aLoRA using prefilled cache
prompt_alora = prompt_shared + INVOCATION_STRING
inputs_alora = tokenizer(prompt_alora, return_tensors="pt").to(device)
output = model_alora.generate(**inputs_alora, past_key_values=cache)
output_text_alora = tokenizer.decode(output[0])
# Generate from base model using saved cache not tainted by aLoRA KV values
prompt_base = prompt_shared
inputs_base = tokenizer(prompt_base, return_tensors="pt").to(device)
with model_alora.disable_adapter():
output = model_alora.generate(**inputs_base, past_key_values=cache_copy)
output_text_base = tokenizer.decode(output[0])
```
### Weight-Decomposed Low-Rank Adaptation (DoRA)
This technique decomposes the updates of the weights into two parts, magnitude and direction. Direction is handled by normal LoRA, whereas the magnitude is handled by a separate learnable parameter. This can improve the performance of LoRA, especially at low ranks. For more information on DoRA, see https://huggingface.co/papers/2402.09353.
```py
from peft import LoraConfig
@ -138,10 +305,22 @@ from peft import PeftModel
model = PeftModel.from_pretrained(base_model, peft_model_id, ephemeral_gpu_offload=True)
```
DoRA is optimized (it computes faster and takes less memory) for models in evaluation mode or when dropout is set to 0, because the base result can be reused in those cases to obtain the speedup.
Running [dora finetuning](https://github.com/huggingface/peft/blob/main/examples/dora_finetuning/dora_finetuning.py)
with `CUDA_VISIBLE_DEVICES=0 ZE_AFFINITY_MASK=0 time python examples/dora_finetuning/dora_finetuning.py --quantize --lora_dropout 0 --batch_size 16 --eval_step 2 --use_dora`
on a 4090 with gradient accumulation set to 2 and max steps set to 20 resulted in the following observations:
| | Without Optimization | With Optimization |
| :--: | :--: | :--: |
| train_runtime | 359.7298 | **279.2676** |
| train_samples_per_second | 1.779 | **2.292** |
| train_steps_per_second | 0.056 | **0.072** |
#### Caveats
- DoRA only supports embedding, linear, and Conv2d layers at the moment.
- DoRA introduces a bigger overhead than pure LoRA, so it is recommended to merge weights for inference, see [`LoraModel.merge_and_unload`].
- DoRA should work with weights quantized with bitsandbytes ("QDoRA"). However, issues have been reported when using QDoRA with DeepSpeed Zero2.
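A minimal QDoRA sketch, assuming a bitsandbytes 4-bit base model (the model name is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", quantization_config=bnb_config
)
# use_dora=True enables the DoRA decomposition on top of the quantized base weights
config = LoraConfig(use_dora=True, target_modules="all-linear", lora_dropout=0.0)
model = get_peft_model(base_model, config)
```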
### QLoRA-style training
@ -154,17 +333,171 @@ config = LoraConfig(target_modules="all-linear", ...)
### Memory efficient Layer Replication with LoRA
An approach used to improve the performance of models is to expand a model by duplicating layers in the model to build a larger model from a pretrained model of a given size. For example increasing a 7B model to a 10B model as described in the [SOLAR](https://huggingface.co/papers/2312.15166) paper. PEFT LoRA supports this kind of expansion in a memory efficient manner that supports further fine-tuning using LoRA adapters attached to the layers post replication of the layers. The replicated layers do not take additional memory as they share the underlying weights so the only additional memory required is the memory for the adapter weights. To use this feature you would create a config with the `layer_replication` argument.
```py
config = LoraConfig(layer_replication=[[0,4], [2,5]], ...)
```
Assuming the original model had 5 layers `[0, 1, 2 ,3, 4]`, this would create a model with 7 layers arranged as `[0, 1, 2, 3, 2, 3, 4]`. This follows the [mergekit](https://github.com/arcee-ai/mergekit) pass through merge convention where sequences of layers specified as start inclusive and end exclusive tuples are stacked to build the final model. Each layer in the final model gets its own distinct set of LoRA adapters.
[Fewshot-Metamath-OrcaVicuna-Mistral-10B](https://huggingface.co/abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B) is an example of a model trained using this method on Mistral-7B expanded to 10B. The
[adapter_config.json](https://huggingface.co/abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B/blob/main/adapter_config.json) shows a sample LoRA adapter config applying this method for fine-tuning.
### Fine grained control over ranks and alpha (scaling)
By default, all layers targeted with LoRA will have the same rank `r` and the same `lora_alpha` (which determines the LoRA scaling), depending on what was specified in the [`LoraConfig`]. In some cases, however, you may want to indicate different values for different layers. This is possible by passing the `rank_pattern` and `alpha_pattern` arguments to [`LoraConfig`]. These arguments should be dictionaries with the key being the layer name and the value being the rank/alpha value. The keys can be [regular expressions](https://docs.python.org/3/library/re.html) (regex). All LoRA layers that are not explicitly mentioned in `rank_pattern` and `alpha_pattern` will take the default `r` and `lora_alpha` values.
To give an example, let's assume that we have a model with the following structure:
```python
>>> print(model)
Outer(
(foo): Linear(...)
(module): Middle(
(foo): Linear(...)
(foobar): Linear(...)
(module): Inner(
(foo): Linear(...)
(barfoo): Linear(...)
)
)
)
```
- `rank_pattern={"foo": 42}` will match all 3 `foo` layers. Neither `foobar` nor `barfoo` are matched.
- `rank_pattern={"^foo": 42}` will only match the `foo` layer of the model, but neither `module.foo` nor `module.module.foo`. This is because the `^` means "start of string" when using regular expressions, and only `foo` starts with `"foo"`, the other layer names have prefixes.
- `rank_pattern={"^module.foo": 42}` matches only `module.foo`, but not `module.module.foo`, for the same reason.
- `rank_pattern={"module.foo": 42}` matches both `module.foo` and `module.module.foo`, but not `foo`.
- `rank_pattern={"^foo": 42, "^module.module.foo": 55}` matches `foo` and `module.module.foo`, respectively, but not `module.foo`.
- There is no need to indicate `$` to mark the end of the match, as this is added automatically by PEFT.
The same logic applies to `alpha_pattern`. If you're in doubt, don't try to get fancy with regular expressions -- just pass the full name for each module with a different rank/alpha, preceded by the `^` prefix, and you should be good.
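To make this concrete, here is a sketch for the example model above (the default rank and the pattern values are arbitrary):

```python
from peft import LoraConfig

config = LoraConfig(
    r=4,                    # default rank for all targeted layers
    lora_alpha=8,           # default alpha
    target_modules=["foo", "foobar", "barfoo"],
    rank_pattern={"^module.module.foo": 8, "^module.foo": 16},
    alpha_pattern={"^module.module.foo": 16},
)
# `module.module.foo` gets rank 8 and alpha 16, `module.foo` gets rank 16,
# and every other targeted layer falls back to r=4 and lora_alpha=8
```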
### Targeting `nn.Parameter` directly
> [!WARNING]
> This feature is experimental and subject to change.
Generally, you should use `target_modules` to target the module (e.g. `nn.Linear`). However, in some circumstances, this is not possible. E.g., in many mixture of expert (MoE) layers in HF Transformers, instead of using `nn.Linear`, an `nn.Parameter` is used. PEFT normally overwrites the `forward` method for LoRA, but for `nn.Parameter`, there is none. Therefore, to apply LoRA to that parameter, it needs to be targeted with `target_parameters`. As an example, for [Llama4](https://huggingface.co/collections/meta-llama/llama-4-67f0c30d9fe03840bc9d0164), you can pass: `target_parameters=['feed_forward.experts.gate_up_proj', 'feed_forward.experts.down_proj']`.
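A hedged sketch of what this can look like (the parameter names follow the Llama4 example above and may differ for other architectures; `base_model` is assumed to be a matching MoE model):

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    target_parameters=[
        "feed_forward.experts.gate_up_proj",
        "feed_forward.experts.down_proj",
    ],
)
model = get_peft_model(base_model, config)  # base_model: a Llama4-style MoE model
```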
#### Caveats
- At the moment, this argument allows targeting 2-dim or 3-dim `nn.Parameter`s. It is assumed that in the case of a 3-dim parameter, the 0th dimension is the expert dimension.
- It is currently not possible to add multiple LoRA adapters (via `model.add_adapter` or `model.load_adapter`) that use `target_parameters` at the same time.
## Optimizers
LoRA training can optionally include special purpose optimizers. Currently PEFT supports LoRA-FA and LoRA+.
### LoRA-FA Optimizer
LoRA training can be more effective and efficient using LoRA-FA, as described in [LoRA-FA](https://huggingface.co/papers/2308.03303). LoRA-FA reduces activation memory consumption by fixing the matrix A and only tuning the matrix B. During training, the gradient of B is optimized to approximate the full parameter fine-tuning gradient. Moreover, the memory consumption of LoRA-FA is not sensitive to the rank (since the activations of $A$ do not need to be stored), so it can improve performance by enlarging the LoRA rank without increasing memory consumption.
```py
from peft import LoraConfig, get_peft_model
from peft.optimizers import create_lorafa_optimizer
from transformers import AutoModelForCausalLM, Trainer, get_cosine_schedule_with_warmup
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
config = LoraConfig(...)
model = get_peft_model(base_model, config)
optimizer = create_lorafa_optimizer(
model=model,
r=128,
lora_alpha=32,
lr=7e-5,
)
scheduler = get_cosine_schedule_with_warmup(
optimizer,
num_warmup_steps=100,
num_training_steps=1000,
)
trainer = Trainer(
...,
optimizers=(optimizer, scheduler),
)
```
### LoRA+ optimized LoRA
LoRA training can be optimized using [LoRA+](https://huggingface.co/papers/2402.12354), which uses different learning rates for the adapter matrices A and B, shown to increase finetuning speed by up to 2x and performance by 1-2%.
```py
from peft import LoraConfig, get_peft_model
from peft.optimizers import create_loraplus_optimizer
from transformers import Trainer
import bitsandbytes as bnb
base_model = ...
config = LoraConfig(...)
model = get_peft_model(base_model, config)
optimizer = create_loraplus_optimizer(
model=model,
optimizer_cls=bnb.optim.Adam8bit,
lr=5e-5,
loraplus_lr_ratio=16,
)
scheduler = None
...
trainer = Trainer(
...,
optimizers=(optimizer, scheduler),
)
```
## Efficiently train tokens alongside LoRA
Sometimes it is necessary not only to change some layers' weights but also to add new tokens. With larger models this can be a memory-costly endeavour. PEFT LoRA adapters support the `trainable_token_indices` parameter which allows tuning specific tokens alongside fine-tuning of specific layers with LoRA. This method only trains the tokens you specify and leaves all other tokens untouched. This saves memory and, in contrast to training the whole embedding matrix, doesn't discard the learned context of existing token embeddings. Under the hood this method uses the layer from [`TrainableTokensModel`].
```py
# for layer 'embed_tokens'
config = LoraConfig(trainable_token_indices=[idx_1, idx_2, ...], ...)
# specific embedding layer
config = LoraConfig(trainable_token_indices={'emb_tokens': [idx_1, idx_2, ...]}, ...)
```
In the snippet below we show how to add new tokens to the model and how to train it alongside the other layers in the model.
```py
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import get_peft_model, LoraConfig
base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
# we define our new tokens and add them to the tokenizer as special tokens
special_tokens = ['<|start_think|>', '<|stop_think|>']
tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})
# make room for new tokens in the embedding matrix if it isn't big enough already
base_model.resize_token_embeddings(max(len(tokenizer), base_model.model.embed_tokens.num_embeddings))
# typical LoRA config with `trainable_token_indices` targeting embedding layer `embed_tokens`
# and specifically our new tokens we just added
lora_config = LoraConfig(
target_modules='all-linear',
trainable_token_indices={'embed_tokens': tokenizer.convert_tokens_to_ids(special_tokens)},
)
peft_model = get_peft_model(base_model, lora_config)
# proceed to train the model like normal
[...]
```
The token weights are part of your adapter state dict and saved alongside the LoRA weights.
If we had used full fine-tuning with `modules_to_save=['embed_tokens']`, we would have stored the full embedding matrix in the checkpoint, leading to a much bigger file.
To give a bit of an indication how much VRAM can be saved, a rudimentary comparison of the above example was made between training the embedding matrix fully (`modules_to_save=["embed_tokens"]`), using a LoRA for the embedding matrix (`target_modules=[..., "embed_tokens"]`, rank 32) and trainable tokens (`trainable_token_indices=[...]`, 6 tokens). Trainable tokens used about as much VRAM (15,562MB vs. 15,581MB) as LoRA while being specific to the tokens and saved ~1GB of VRAM over fully training the embedding matrix.
## Merge LoRA weights into the base model
While LoRA is significantly smaller and faster to train, you may encounter latency issues during inference due to separately loading the base model and the LoRA adapter. To eliminate latency, use the [`~LoraModel.merge_and_unload`] function to merge the adapter weights with the base model. This allows you to use the newly merged model as a standalone model. The [`~LoraModel.merge_and_unload`] function doesn't keep the adapter weights in memory.
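A minimal sketch, assuming a base model and a saved LoRA adapter (the identifiers are placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model = PeftModel.from_pretrained(base_model, "path/to/lora-adapter")
model = model.merge_and_unload()  # returns the base model with the LoRA weights merged in
```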
@ -216,7 +549,7 @@ base_model = AutoModelForCausalLM.from_pretrained(
)
```
Then we load the first adapter:
```python
peft_model_id = "alignment-handbook/zephyr-7b-sft-lora"
@ -246,11 +579,13 @@ There are several supported methods for `combination_type`. Refer to the [docume
Now, perform inference:
```python
device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
generate_ids = model.generate(**inputs, max_length=30)
@ -291,7 +626,7 @@ model.delete_adapter("dpo")
Normally, each inference batch has to use the same adapter(s) in PEFT. This can sometimes be annoying, because we may have batches that contain samples intended to be used with different LoRA adapters. For example, we could have a base model that works well in English and two more LoRA adapters, one for French and one for German. Usually, we would have to split our batches such that each batch only contains samples of one of the languages, we cannot combine different languages in the same batch.
Thankfully, it is possible to mix different LoRA adapters in the same batch using the `adapter_name` argument. Below, we show an example of how this works in practice. First, let's load the base model, English, and the two adapters, French and German, like this:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
@ -336,16 +671,164 @@ output = peft_model.generate(**inputs, adapter_names=adapter_names, max_new_toke
Note that the order does not matter here, i.e. the samples in the batch don't need to be grouped by adapter as in the example above. We just need to ensure that the `adapter_names` argument is aligned correctly with the samples.
Additionally, the same approach also works with the `modules_to_save` feature, which allows for saving and reusing specific neural network layers, such as custom heads for classification tasks, across different LoRA adapters.
### Caveats
Using this feature has some drawbacks, namely:
- It only works for inference, not for training.
- Disabling adapters using the `with model.disable_adapter()` context takes precedence over `adapter_names`.
- You cannot pass `adapter_names` when some adapter weights were merged with base weight using the `merge_adapter` method. Please unmerge all adapters first by calling `model.unmerge_adapter()`.
- For obvious reasons, this cannot be used after calling `merge_and_unload()`, since all the LoRA adapters will be merged into the base weights in this case.
- This feature does not currently work with DoRA, so set `use_dora=False` in your `LoraConfig` if you want to use it.
- The `modules_to_save` feature is currently only supported for the layers of types `Linear`, `Embedding`, `Conv2d` and `Conv1d`.
- There is an expected overhead for inference with `adapter_names`, especially if the amount of different adapters in the batch is high. This is because the batch size is effectively reduced to the number of samples per adapter. If runtime performance is your top priority, try the following:
- Increase the batch size.
  - Try to avoid having a large number of different adapters in the same batch, prefer homogeneous batches. This can be achieved by buffering samples with the same adapter and only performing inference with a small handful of different adapters.
- Take a look at alternative implementations such as [LoRAX](https://github.com/predibase/lorax), [punica](https://github.com/punica-ai/punica), or [S-LoRA](https://github.com/S-LoRA/S-LoRA), which are specialized to work with a large number of different adapters.
## Composing and Reusing LoRA Adapters
### Arrow
[Arrow](https://huggingface.co/papers/2405.11157) is a modular routing algorithm designed to combine multiple pre-trained task-specific LoRA adapters to solve a given task. Rather than merging all adapters naively, Arrow introduces a **gradient-free, token-wise mixture-of-experts (MoE) routing mechanism**. At inference time, it first computes a _prototype_ for each LoRA by extracting the top right singular vector from its SVD decomposition. Each token representation is then compared to these prototypes via cosine similarity to obtain routing coefficients. Tokens are assigned to the top-k most relevant LoRA adapters, with the coefficients normalized through softmax, and their outputs linearly combined. This allows effective reuse of existing LoRA modules for new tasks and leads to stronger zero-shot generalization.
In PEFT, Arrow is enabled through ```ArrowConfig``` and ```create_arrow_model```. You can also configure parameters such as ```top_k``` (the number of LoRA adapters combined per token), ```router_temperature``` (the softmax temperature applied to the routing coefficients), and ```rng_seed``` (for reproducibility).
```py
from peft import create_arrow_model, ArrowConfig
from transformers import AutoModelForCausalLM
# Loading the model
base_model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
# Creating the Arrow config
arrow_config = ArrowConfig(
top_k=3,
router_temperature=1.0,
rng_seed=42,
)
# The LoRA adapters below were trained on a clustered FLAN dataset.
# Task clustering was performed using the Model-Based Clustering (MBC) method,
# as described in the Arrow paper.
# While one could train a separate LoRA for each task and let Arrow route tokens among them,
# training LoRAs on clusters of tasks instead provides an indirect optimization for
# transfer across the multi-task dataset.
task_specific_adapter_paths = [
f"TahaBa/phi3-mini-clustered-flan/ts_expert_{i}" for i in range(10)
]
# Creating the Arrow model
model = create_arrow_model(
base_model=base_model,
task_specific_adapter_paths=task_specific_adapter_paths,
arrow_config=arrow_config,
)
# Now the forward path could be called on this model, like a normal PeftModel.
```
Furthermore, you can add or remove adapters after calling ```create_arrow_model```—for example, to fine-tune a new adapter or discard an unnecessary one. Once the adapters are in place, you can activate the ```"arrow_router"``` for inference to use Arrow. Note that if you add a new LoRA adapter after ```create_arrow_model``` and want to fine-tune it, you must explicitly set the new adapter as active, since ```"arrow_router"``` is activated by default in ```create_arrow_model```.
```py
from trl import SFTTrainer, SFTConfig
# Adding a new adapter and activating it
model.add_adapter(adapter_name='new_adapter')
model.set_adapter('new_adapter')
# Now the model could be trained along the `new_adapter`.
trainer = SFTTrainer(
model=model,
args=SFTConfig(...),
...
)
# Once the training is done, you can activate `arrow_router` and use it in inference
model.set_adapter('arrow_router') # Model is ready to be used at inference time now
```
### GenKnowSub
[GenKnowSub](https://aclanthology.org/2025.acl-short.54/) augments Arrow by purifying task-specific LoRA adapters before routing. The key idea is to subtract general knowledge encoded in LoRA space, based on the [forgetting-via-negation principle](https://huggingface.co/papers/2212.04089), so that task adapters become more isolated and focused on task-relevant signals. Concretely, GenKnowSub estimates a low-dimensional "general" subspace from a set of general (non-task-specific) LoRA adapters and removes this component from each task adapter's LoRA update prior to Arrow's token-wise routing. This typically improves compositionality and reduces interference when combining many task adapters.
In PEFT, enable GenKnowSub by setting ```use_gks=True``` in ArrowConfig, and providing ```general_adapter_paths``` in ```create_arrow_model```:
```py
from peft import create_arrow_model, ArrowConfig
from transformers import AutoModelForCausalLM
# Loading the model
base_model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
# Creating the Arrow config
arrow_config = ArrowConfig(
top_k=3,
router_temperature=1.0,
use_gks=True,
rng_seed=42,
)
# Paths to task-specific adapters trained on the clustered FLAN dataset (as explained before)
task_specific_adapter_paths = [
f"TahaBa/phi3-mini-clustered-flan/ts_expert_{i}" for i in range(10)
]
# These general adapters are trained on English, German, and French Wikipedia data
# with a causal language modelling objective; each pair is like (507-token sentence, 5-token completion), with the loss computed on the completion
general_adapter_paths = [
"TahaBa/phi3-mini-general-adapters/cluster0_batch16_prop1.0_langen/checkpoint-17",
"TahaBa/phi3-mini-general-adapters/cluster0_batch16_prop1.0_langfr/checkpoint-35",
"TahaBa/phi3-mini-general-adapters/cluster0_batch16_prop1.0_langger/checkpoint-17"
]
# Creating the Arrow model
model = create_arrow_model(
base_model=base_model,
task_specific_adapter_paths=task_specific_adapter_paths,
general_adapter_paths=general_adapter_paths,
arrow_config=arrow_config,
)
# Now the forward path could be called on this model, like a normal PeftModel.
```
To encode general knowledge, GenKnowSub subtracts the average of the provided general adapters from each task-specific adapter once, before routing begins. Furthermore, the ability to add or remove adapters after calling ```create_arrow_model``` (as described in the Arrow section) is still supported in this case.
<Tip>
**Things to keep in mind when using Arrow + GenKnowSub:**
- All LoRA adapters (task-specific and general) must share the same ```rank``` and ```target_modules```.
- Any inconsistency in these settings will raise an error in ```create_arrow_model```.
- Having different scaling factors (```lora_alpha```) across task adapters is supported — Arrow handles them automatically.
- Merging the ```"arrow_router"``` is not supported, due to its dynamic routing behavior.
- In create_arrow_model, task adapters are loaded as ```task_i``` and general adapters as ```gks_j``` (where ```i``` and ```j``` are indices). The function ensures consistency of ```target_modules```, ```rank```, and whether adapters are applied to ```Linear``` or ```Linear4bit``` layers. It then adds the ```"arrow_router"``` module and activates it. Any customization of this process requires overriding ```create_arrow_model```.
- This implementation is compatible with 4-bit quantization (via bitsandbytes):
```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
# Quantisation config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=False,
)
# Loading the model
base_model = AutoModelForCausalLM.from_pretrained(
"microsoft/Phi-3-mini-4k-instruct",
torch_dtype=torch.bfloat16,
device_map="auto",
quantization_config=bnb_config,
)
# Now call create_arrow_model() as we explained before.
```
</Tip>
View File
@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
# Adapter injection
With PEFT, you can inject trainable adapters into any `torch` module which allows you to use adapter methods without relying on the modeling classes in PEFT. This works for all adapters except for those based on prompt learning (e.g. prefix tuning or p-tuning).
Check the table below to see when you should inject adapters.
@ -25,6 +25,8 @@ Check the table below to see when you should inject adapters.
| the model is modified inplace, keeping all the original attributes and methods | manually write the `from_pretrained` and `save_pretrained` utility functions from Hugging Face to save and load adapters |
| works for any `torch` module and modality | doesn't work with any of the utility methods provided by `PeftModel` such as disabling and merging adapters |
## Creating a new PEFT model
To perform the adapter injection, use the [`inject_adapter_in_model`] method. This method takes 3 arguments, the PEFT config, the model, and an optional adapter name. You can also attach multiple adapters to the model if you call [`inject_adapter_in_model`] multiple times with different adapter names.
For example, to inject LoRA adapters into the `linear` submodule of the `DummyModel` module:
@ -85,6 +87,30 @@ DummyModel(
)
```
### Injection based on a `state_dict`
Sometimes, it is possible that there is a PEFT adapter checkpoint but the corresponding PEFT config is not known for whatever reason. To inject the PEFT layers for this checkpoint, you would usually have to reverse-engineer the corresponding PEFT config, most notably the `target_modules` argument, based on the `state_dict` from the checkpoint. This can be cumbersome and error prone. To avoid this, it is also possible to call [`inject_adapter_in_model`] and pass the loaded `state_dict` as an argument:
```python
from safetensors.torch import load_file
model = ...
state_dict = load_file(<path-to-safetensors-file>)
lora_config = LoraConfig(...)
model = inject_adapter_in_model(lora_config, model, state_dict=state_dict)
```
In this case, PEFT will use the `state_dict` as reference for which layers to target instead of using the PEFT config. As a user, you don't have to set the exact `target_modules` of the PEFT config for this to work. However, you should still pass a PEFT config of the right type, in this example `LoraConfig`; you can leave `target_modules` as `None`.
Be aware that this still only creates the uninitialized PEFT layers, the values from the `state_dict` are not used to populate the model weights. To populate the weights, proceed with calling [`set_peft_model_state_dict`] as described below.
⚠️ Note that if there is a mismatch between what is configured in the PEFT config and what is found in the `state_dict`, PEFT will warn you about this. You can ignore the warning if you know that the PEFT config is not correctly specified.
> [!WARNING]
> If the original PEFT adapter was using `target_parameters` instead of `target_modules`, injecting from a `state_dict` will not work correctly. In this case, it is mandatory to use the correct PEFT config for injection.
## Saving the model
To only save the adapter, use the [`get_peft_model_state_dict`] function:
```python
@ -95,3 +121,28 @@ print(peft_state_dict)
```
Otherwise, `model.state_dict()` returns the full state dict of the model.
## Loading the model
After loading the saved `state_dict`, it can be applied using the [`set_peft_model_state_dict`] function:
```python
from peft import set_peft_model_state_dict
model = DummyModel()
model = inject_adapter_in_model(lora_config, model)
outcome = set_peft_model_state_dict(model, peft_state_dict)
# check that there were no wrong keys
print(outcome.unexpected_keys)
```
If injecting the adapter is slow or you need to load a large number of adapters, you may use an optimization that creates an "empty" adapter on the meta device and only fills in the real weights when [`set_peft_model_state_dict`] is called. To do this, pass `low_cpu_mem_usage=True` to both [`inject_adapter_in_model`] and [`set_peft_model_state_dict`].
```python
model = DummyModel()
model = inject_adapter_in_model(lora_config, model, low_cpu_mem_usage=True)
print(model.linear.lora_A["default"].weight.device.type == "meta") # should be True
set_peft_model_state_dict(model, peft_state_dict, low_cpu_mem_usage=True)
print(model.linear.lora_A["default"].weight.device.type == "cpu") # should be True
```
View File
@ -50,6 +50,9 @@ config = PeftConfig.from_pretrained("smangrul/tinyllama_lora_norobots")
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, load_in_4bit=True, device_map="auto").eval()
tokenizer = AutoTokenizer.from_pretrained("smangrul/tinyllama_lora_norobots")
model.config.vocab_size = 32005
model.resize_token_embeddings(32005)
model = PeftModel.from_pretrained(model, "smangrul/tinyllama_lora_norobots", adapter_name="norobots")
_ = model.load_adapter("smangrul/tinyllama_lora_sql", adapter_name="sql")
_ = model.load_adapter("smangrul/tinyllama_lora_adcopy", adapter_name="adcopy")
@ -96,12 +99,13 @@ Now you can use the merged model as an instruction-tuned model to write ad copy
<hfoption id="instruct">
```py
device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
messages = [
{"role": "user", "content": "Write an essay about Generative AI."},
]
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.95, temperature=0.2, repetition_penalty=1.2, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))
```
@ -110,13 +114,14 @@ print(tokenizer.decode(outputs[0]))
<hfoption id="ad copy">
```py
device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
messages = [
{"role": "system", "content": "Create a text ad given the following product and description."},
{"role": "user", "content": "Product: Sony PS5 PlayStation Console\nDescription: The PS5 console unleashes new gaming possibilities that you never anticipated."},
]
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.95, temperature=0.2, repetition_penalty=1.2, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))
```
@ -125,13 +130,15 @@ print(tokenizer.decode(outputs[0]))
<hfoption id="SQL">
```py
device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
text = """Table: 2-11365528-2
Columns: ['Team', 'Head Coach', 'President', 'Home Ground', 'Location']
Natural Query: Who is the Head Coach of the team whose President is Mario Volarevic?
SQL Query:"""
inputs = tokenizer(text, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1, eos_token_id=tokenizer("</s>").input_ids[-1])
print(tokenizer.decode(outputs[0]))
```
View File
@ -21,7 +21,7 @@ Quantization represents data with fewer bits, making it a useful technique for r
* optimizing which model weights are quantized with the [AWQ](https://hf.co/papers/2306.00978) algorithm
* independently quantizing each row of a weight matrix with the [GPTQ](https://hf.co/papers/2210.17323) algorithm
* quantizing to 8-bit and 4-bit precision with the [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) library
* quantizing to as low as 2-bit precision with the [AQLM](https://huggingface.co/papers/2401.06118) algorithm
However, after a model is quantized it isn't typically further trained for downstream tasks because training can be unstable due to the lower precision of the weights and activations. But since PEFT methods only add *extra* trainable parameters, this allows you to train a quantized model with a PEFT adapter on top! Combining quantization with PEFT can be a good strategy for training even the largest models on a single GPU. For example, [QLoRA](https://hf.co/papers/2305.14314) is a method that quantizes a model to 4-bits and then trains it with LoRA. This method allows you to finetune a 65B parameter model on a single 48GB GPU!
@ -107,11 +107,37 @@ QLoRA adds trainable weights to all the linear layers in the transformer archite
config = LoraConfig(target_modules="all-linear", ...)
```
## GPTQ quantization
You can learn more about GPTQ-based `[2, 3, 4, 8]`-bit quantization at [GPTQModel](https://github.com/ModelCloud/GPTQModel) and in the Transformers [GPTQ](https://huggingface.co/docs/transformers/quantization/gptq) doc. For post-quantization training, PEFT can use either the [GPTQModel](https://github.com/ModelCloud/GPTQModel) or the [AutoGPTQ](https://github.com/autogptq/autogptq) library, but we recommend GPTQModel because AutoGPTQ will be deprecated in a future release.
```bash
# gptqmodel install
pip install gptqmodel --no-build-isolation
```
```py
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="wikitext2", tokenizer=tokenizer)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config)
# save quantized model
quantized_model.save_pretrained("./opt-125m-gptq")
tokenizer.save_pretrained("./opt-125m-gptq")
```
Once quantized, you can post-train GPTQ models with PEFT APIs.
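For instance, a minimal sketch of attaching a LoRA adapter to the GPTQ-quantized checkpoint saved above (the target modules are illustrative choices for OPT):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# load the GPTQ-quantized checkpoint saved above
model = AutoModelForCausalLM.from_pretrained("./opt-125m-gptq", device_map="auto")
config = LoraConfig(task_type="CAUSAL_LM", target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, config)
model.print_trainable_parameters()
```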
## AQLM quantization
Additive Quantization of Language Models ([AQLM](https://huggingface.co/papers/2401.06118)) is a Large Language Models compression method. It quantizes multiple weights together and takes advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes. This allows it to compress models down to as low as 2-bit with considerably low accuracy losses.
Since the AQLM quantization process is computationally expensive, the use of prequantized models is recommended. A partial list of available models can be found in the official aqlm [repository](https://github.com/Vahe1994/AQLM).
The models support LoRA adapter tuning. To tune the quantized model you'll need to install the `aqlm` inference library: `pip install aqlm>=1.0.2`. Finetuned LoRA adapters must be saved separately, as merging them with AQLM quantized weights is not possible.
@ -166,15 +192,15 @@ model = get_peft_model(model, config)
## HQQ quantization
The models that are quantized using Half-Quadratic Quantization of Large Machine Learning Models ([HQQ](https://mobiusml.github.io/hqq_blog/)) support LoRA adapter tuning. To tune the quantized model, you'll need to install the `hqq` library with: `pip install hqq`.
```python
from hqq.engine.hf import HQQModelForCausalLM
device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
quantized_model = HQQModelForCausalLM.from_quantized(save_dir_or_hfhub, device=device)
peft_config = LoraConfig(...)
quantized_model = get_peft_model(quantized_model, peft_config)
```
@ -184,17 +210,85 @@ Or using transformers version that is compatible with HQQ (e.g. by installing it
from transformers import HqqConfig, AutoModelForCausalLM
quant_config = HqqConfig(nbits=4, group_size=64)
quantized_model = AutoModelForCausalLM.from_pretrained(save_dir_or_hfhub, device_map=device_map, quantization_config=quant_config)
peft_config = LoraConfig(...)
quantized_model = get_peft_model(quantized_model, peft_config)
```
## torchao (PyTorch Architecture Optimization)
PEFT supports models quantized with [torchao](https://github.com/pytorch/ao) ("ao") for int8 quantization.
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TorchAoConfig
model_id = ...
quantization_config = TorchAoConfig(quant_type="int8_weight_only")
base_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
peft_config = LoraConfig(...)
model = get_peft_model(base_model, peft_config)
```
### Caveats:
- Use the most recent versions of torchao (>= v0.4.0) and transformers (> 4.42).
- Only linear layers are currently supported.
- `quant_type = "int4_weight_only"` is currently not supported.
- `NF4` is not implemented in transformers as of yet and is thus also not supported.
- DoRA only works with `quant_type = "int8_weight_only"` at the moment.
- There is explicit support for torchao when used with LoRA. However, when torchao quantizes a layer, its class does not change, only the type of the underlying tensor. For this reason, PEFT methods other than LoRA will generally also work with torchao, even if not explicitly supported. Be aware, however, that **merging only works correctly with LoRA and with `quant_type = "int8_weight_only"`**. If you use a different PEFT method or dtype, merging will likely result in an error, and even if it doesn't, the results will still be incorrect.
## INC quantization
Intel Neural Compressor ([INC](https://github.com/intel/neural-compressor)) enables model quantization for various devices,
including Intel Gaudi accelerators (also known as HPU devices). You can perform LoRA fine-tuning on models that have been
quantized using INC. To use INC with PyTorch models, install the library with: `pip install neural-compressor[pt]`.
Quantizing a model to FP8 precision for HPU devices can be done with the following single-step quantization workflow:
```python
import torch
from neural_compressor.torch.quantization import FP8Config, convert, finalize_calibration, prepare
quant_configs = {
...
}
config = FP8Config(**quant_configs)
```
Pass the config to the `prepare` method, run inference to gather calibration statistics, and call the `finalize_calibration` and `convert` methods to quantize the model to FP8 precision:
```python
model = prepare(model, config)
# Run inference to collect calibration statistics
...
# Finalize calibration and convert the model to FP8 precision
finalize_calibration(model)
model = convert(model)
# Load PEFT LoRA adapter as usual
...
```
An example demonstrating how to load a PEFT LoRA adapter into an INC-quantized FLUX text-to-image model for HPU
devices is provided [here](https://github.com/huggingface/peft/blob/main/examples/stable_diffusion/inc_flux_lora_hpu.py).
### Caveats:
- `merge()` and `unmerge()` methods are currently not supported for INC-quantized models.
- Currently, only **Linear** INC-quantized layers are supported when loading PEFT adapters.
## Other Supported PEFT Methods
Besides LoRA, the following PEFT methods also support quantization:
- **VeRA** (supports bitsandbytes quantization)
- **AdaLoRA** (supports both bitsandbytes and GPTQ quantization)
- **(IA)³** (supports bitsandbytes quantization)
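As one hedged example of such a combination, (IA)³ can be applied on top of a bitsandbytes 8-bit model much like LoRA; the other methods follow the same `get_peft_model` pattern with their respective configs. The model name and module names below are illustrative for OPT:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import IA3Config, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
# target/feedforward module names are illustrative and depend on the architecture
config = IA3Config(
    task_type="CAUSAL_LM",
    target_modules=["k_proj", "v_proj", "fc2"],
    feedforward_modules=["fc2"],
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
```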
## Next steps
If you're interested in learning more about quantization, the following may be helpful:
* Learn more details about QLoRA and check out some benchmarks on its impact in the [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes) blog post.
* Read more about different quantization schemes in the Transformers [Quantization](https://hf.co/docs/transformers/main/quantization) guide.
View File
@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.
In PEFT, [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) works for some but not all features. The reason why it won't always work is because PEFT is highly dynamic in certain places (loading and switching between multiple adapters, for instance), which can cause trouble for `torch.compile`. In other places, `torch.compile` may work, but won't be as fast as expected because of graph breaks.
If you don't see an error, it doesn't necessarily mean that `torch.compile` worked correctly. It might give you an output, but the output is incorrect. This guide describes what works with `torch.compile` and what doesn't. For your own testing, we recommend using the latest PyTorch version, as `torch.compile` is constantly being improved.
> [!TIP]
> Unless indicated otherwise, the default `torch.compile` settings were used.
@ -36,20 +36,18 @@ The following adapters were tested successfully:
- AdaLoRA
- BOFT
- Bone
- IA³
- Layer Norm Tuning
- LoHa
- LoKr
- LoRA
- LoRA + DoRA
- LoRA applied to embedding layers
- OFT
- VeRA
- HRA
## Advanced PEFT features with `torch.compile`
Below are some of the more advanced PEFT features that **work**. They were all tested with LoRA.
@ -57,17 +55,14 @@ Below are some of the more advanced PEFT features that **work**. They were all t
- `modules_to_save` (i.e. `config = LoraConfig(..., modules_to_save=...)`)
- Merging adapters (one or multiple)
- Merging multiple adapters into one adapter (i.e. calling `model.add_weighted_adapter(...)`)
- Using PEFT adapters with quantization (bitsandbytes)
- Disabling adapters (i.e. using `with model.disable_adapter()`)
- Unloading (i.e. calling `model.merge_and_unload()`)
- Mixed adapter batches (i.e. calling `model(batch, adapter_names=["__base__", "default", "other", ...])`)
- Inference with multiple adapters (i.e. using `model.add_adapter` or `model.load_adapter` to load more than 1 adapter); for this, only call `torch.compile` _after_ loading all adapters
Generally, we can expect that if a feature works correctly with LoRA and is also supported by other adapter types, it should also work for that adapter type.
## Test cases
All the use cases listed above are tested inside of [`peft/tests/test_torch_compile.py`](https://github.com/huggingface/peft/blob/main/tests/test_torch_compile.py). If you want to check in more detail how we tested a certain feature, please go to that file and check the test that corresponds to your use case.
View File
@ -39,7 +39,9 @@ Installing PEFT from source is useful for keeping up with the latest development
python -m pip install git+https://github.com/huggingface/peft
```
## Dtype-related issues
### ValueError: Attempting to unscale FP16 gradients
This error probably occurred because the model was loaded with `torch_dtype=torch.float16` and then used in an automatic mixed precision (AMP) context, e.g. by setting `fp16=True` in the [`~transformers.Trainer`] class from 🤗 Transformers. The reason is that when using AMP, trainable weights should never use fp16. To make this work without loading the whole model in fp32, add the following to your code:
@ -71,10 +73,27 @@ trainer.train()
<Tip>
Starting from PEFT version v0.12.0, PEFT automatically promotes the dtype of adapter weights from `torch.float16` and `torch.bfloat16` to `torch.float32` where appropriate. To _prevent_ this behavior, you can pass `autocast_adapter_dtype=False` to [`~get_peft_model`], to [`~PeftModel.from_pretrained`], and to [`~PeftModel.load_adapter`].
</Tip>
### Selecting the dtype of the adapter
Most PEFT methods, like LoRA, work by adding trainable adapter weights. By default, those weights are stored in float32 dtype (fp32), i.e. at a relatively high precision. Therefore, even if the base model is loaded in float16 (fp16) or bfloat16 (bf16), the adapter weights are float32. When the adapter results are calculated during the forward pass, the input will typically be in the dtype of the base model, thus it will be upcast to float32 if necessary, then cast back to the original dtype.
If you prefer to have the adapter weights in the lower precision of the base model, i.e. in float16 or bfloat16, you can pass `autocast_adapter_dtype=False` when creating the model ([`~get_peft_model`]) or loading the model ([`~PeftModel.from_pretrained`]). There are some advantages and disadvantages to this:
Advantages of half precision adapter:
- computation slightly faster
- slightly less memory
- smaller file size of checkpoint (half the size)
Disadvantages of half precision adapter:
- slightly worse loss
- higher risk of overflow or underflow
Note that for most use cases, overall runtime and memory cost will be determined by the size of the base model and by the dataset, while the dtype of the PEFT adapter will only have a small impact.
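As a hedged illustration of this option (the model name is a placeholder):
```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", torch_dtype=torch.bfloat16)  # placeholder model
config = LoraConfig(task_type="CAUSAL_LM")
# keep the adapter weights in the base model's bf16 instead of the default fp32 upcast
model = get_peft_model(base_model, config, autocast_adapter_dtype=False)
```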
## Bad results from a loaded PEFT model
There can be several reasons for getting a poor result from a loaded PEFT model which are listed below. If you're still unable to troubleshoot the problem, see if anyone else had a similar [issue](https://github.com/huggingface/peft/issues) on GitHub, and if you can't find any, open a new issue.
@ -118,11 +137,45 @@ You should probably TRAIN this model on a down-stream task to be able to use it
The mentioned layers should be added to `modules_to_save` in the config to avoid the described problem.
<Tip>
As an example, when loading a model that is using the DeBERTa architecture for sequence classification, you'll see a warning that the following weights are newly initialized: `['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']`. From this, it follows that the `classifier` and `pooler` layers should be added to: `modules_to_save=["classifier", "pooler"]`.
</Tip>
### Extending the vocabulary
For many language fine-tuning tasks, extending the model's vocabulary is necessary since new tokens are being introduced. This requires extending the embedding layer to account for the new tokens and also storing the embedding layer in addition to the adapter weights when saving the adapter.
For many language fine-tuning tasks, extending the model's vocabulary is necessary since new tokens are being introduced. This requires extending the embedding layer to account for the new tokens and, depending on the fine-tuning method, also storing the embedding layer in addition to the adapter weights when saving the adapter. There are a few ways of achieving this, ordered from the most to the least parameter-efficient:
Save the embedding layer by adding it to the `target_modules` of the config. The embedding layer name must follow the standard naming scheme from Transformers. For example, the Mistral config could look like this:
- [trainable tokens](../package_reference/trainable_tokens), train only the specified tokens, optionally store only the updated values
- training an adapter on the embedding matrix, optionally store only the updated values
- full-finetuning of the embedding layer
#### Using trainable tokens
Let's start with trainable tokens, in this case its [LoRA integration](../developer_guides/lora#efficiently-train-tokens-alongside-lora). If you're interested in only training the new embeddings and nothing else, refer to the [standalone documentation](../package_reference/trainable_tokens).
To enable selective token training of the embedding layer, you'll need to supply the token ids of your newly added tokens via the `trainable_token_indices` parameter. Optionally, you can specify which layer to target if there is more than one embedding layer. For a Mistral model, this could look as follows:
```python
new_tokens = ['<think>', '</think>']
tokenizer.add_tokens(new_tokens)
base_model.resize_token_embeddings(len(tokenizer))
lora_config = LoraConfig(
    ...,
    trainable_token_indices={'embed_tokens': tokenizer.convert_tokens_to_ids(new_tokens)},
)
```
If your model uses tied weights (such as the `lm_head`), trainable tokens will try to resolve those and keep them updated as well, so in that case there should be no need for adding `modules_to_save=["lm_head"]`. This only works if the model uses the Transformers convention for tying weights.
Saving the model with `model.save_pretrained` may save the full embedding matrix instead of only the difference as a precaution because the embedding matrix was resized. To save space you can disable this behavior by setting `save_embedding_layers=False` when calling `save_pretrained`. This is safe to do as long as you don't modify the embedding matrix through other means as well, as such changes will not be tracked by trainable tokens.
#### Using an adapter, e.g. LoRA
Prepare the embedding layer by adding it to the `target_modules` of your adapter config. For example, the Mistral config could look like this:
```python
config = LoraConfig(..., target_modules=["embed_tokens", "lm_head", "q_proj", "v_proj"])
@ -130,7 +183,7 @@ config = LoraConfig(..., target_modules=["embed_tokens", "lm_head", "q_proj", "v
Once added to `target_modules`, PEFT automatically stores the embedding layer when saving the adapter if the model has the [`~transformers.PreTrainedModel.get_input_embeddings`] and [`~transformers.PreTrainedModel.get_output_embeddings`] methods. This is generally the case for Transformers models.
If the model's embedding layer doesn't follow the Transformer's naming scheme, you can still save it by manually passing `save_embedding_layers=True` when saving the adapter:
If the model's embedding layer doesn't follow the Transformer's naming scheme but nevertheless implements `get_input_embeddings`, you can still save it by manually passing `save_embedding_layers=True` when saving the adapter:
```python
model = get_peft_model(...)
@ -142,6 +195,42 @@ For inference, load the base model first and resize it the same way you did befo
For a complete example, please check out [this notebook](https://github.com/huggingface/peft/blob/main/examples/causal_language_modeling/peft_lora_clm_with_additional_tokens.ipynb).
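The notebook contains the complete code; a minimal sketch of the inference-time flow described above (the adapter path and base model are placeholders, and the tokenizer with the added tokens is assumed to be saved alongside the adapter) could look like this:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

adapter_path = "path/to/adapter"                         # placeholder
tokenizer = AutoTokenizer.from_pretrained(adapter_path)  # tokenizer that already contains the new tokens
base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # placeholder base model
base_model.resize_token_embeddings(len(tokenizer))       # same resize as before training
model = PeftModel.from_pretrained(base_model, adapter_path)
```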
#### Full fine-tuning
Fully fine-tuning the embedding layer is more costly in terms of VRAM and storage space, but if all else fails, you can fall back to this and see if it works for you. Achieve it by adding the name of the embedding layer to `modules_to_save`. Note that you need to add tied layers as well, e.g. `lm_head`. Example for a Mistral model with LoRA:
```python
config = LoraConfig(..., modules_to_save=["embed_tokens", "lm_head"], target_modules=["q_proj", "v_proj"])
```
### Getting a warning about "weights not being initialized from the model checkpoint"
When you load your PEFT model which has been trained on a task (for example, classification), you may get a warning like:
> Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at meta-llama/Llama-3.2-1B and are newly initialized: ['score.weight']. You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Although this looks scary, it is most likely nothing to worry about. This warning comes from Transformers, and it isn't a PEFT-specific warning. It lets you know that a randomly initialized classification head (`score`) is attached to the base model, and the head must be trained to produce sensible predictions.
When you get this warning _before_ training the model, PEFT automatically takes care of making the classification head trainable if you correctly passed the `task_type` argument to the PEFT config.
```python
from peft import LoraConfig, TaskType
lora_config = LoraConfig(..., task_type=TaskType.SEQ_CLS)
```
If your classification head does not follow the usual naming conventions from Transformers (which is rare), you have to explicitly tell PEFT the name of the head in `modules_to_save`.
```python
lora_config = LoraConfig(..., modules_to_save=["name-of-classification-head"])
```
To check the name of the classification head, print the model and it should be the last module.
If you get this warning from your inference code, i.e. _after_ training the model, remember that when you load the PEFT model, you always have to load the Transformers model first. Since Transformers does not know that you will load PEFT weights afterwards, it still emits the warning.
As always, it is best practice to ensure the model works correctly for inference by running some validation on it.
### Check layer and model status
Sometimes a PEFT model can end up in a bad state, especially when handling multiple adapters. There can be some confusion around what adapters exist, which one is active, which one is merged, etc. To help investigate this issue, call the [`~peft.PeftModel.get_layer_status`] and the [`~peft.PeftModel.get_model_status`] methods.
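For example, assuming `model` is a `PeftModel` that was created earlier:
```python
layer_status = model.get_layer_status()   # per-layer view: which adapters exist, are active, are merged
model_status = model.get_model_status()   # aggregated view over the whole model
print(model_status.available_adapters)
print(model_status.merged_adapters)
```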
@ -250,6 +339,19 @@ TunerModelStatus(
)
```
## Speed
### Loading adapter weights is slow
Loading adapters like LoRA weights should generally be fast compared to loading the base model. However, there can be use cases where the adapter weights are quite large or where users need to load a large number of adapters -- the loading time can add up in this case. The reason for this is that the adapter weights are first initialized and then overridden by the loaded weights, which is wasteful. To speed up the loading time, you can pass the `low_cpu_mem_usage=True` argument to [`~PeftModel.from_pretrained`] and [`~PeftModel.load_adapter`].
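For instance (the adapter paths are placeholders):
```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # placeholder model
# skip the redundant initialization of adapter weights that would be overwritten anyway
model = PeftModel.from_pretrained(base_model, "path/to/adapter-0", low_cpu_mem_usage=True)
model.load_adapter("path/to/adapter-1", adapter_name="adapter-1", low_cpu_mem_usage=True)
```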
<Tip>
If this option works well across different use cases, it may become the default for adapter loading in the future.
</Tip>
## Reproducibility
### Models using batch norm
@ -271,3 +373,31 @@ config = LoraConfig(
```
Depending on the type of model you use, the batch norm layers could have different names than `"normalization"`, so please ensure that the name matches your model architecture.
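One way to find the right names is to print all batch norm modules of the base model; this is just a sketch and assumes `base_model` is already loaded:
```python
import torch

# list candidate batch norm layers to add to modules_to_save
for name, module in base_model.named_modules():
    if isinstance(module, (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, torch.nn.BatchNorm3d)):
        print(name)
```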
## Version mismatch
### Error while loading the config because of an unexpected keyword argument
When you encounter an error like the one shown below, it means the adapter you're trying to load was trained with a more recent version of PEFT than the version you have installed on your system.
```
TypeError: LoraConfig.__init__() got an unexpected keyword argument <argument-name>
```
The best way to resolve this issue is to install the latest PEFT version:
```sh
python -m pip install -U peft
```
If the adapter was trained from a source install of PEFT (an unreleased version of PEFT), then you also need to install PEFT from source.
```sh
python -m pip install -U git+https://github.com/huggingface/peft.git
```
If it is not possible for you to upgrade PEFT, there is a workaround you can try.
Assume the error message says that the unknown keyword argument is named `foobar`. Search inside the `adapter_config.json` of this PEFT adapter for the `foobar` entry and delete it from the file. Then save the file and try loading the model again.
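For example, assuming a local copy of the adapter, a small script along these lines could remove the offending entry (`foobar` stands in for the actual argument name):
```python
import json

path = "path/to/adapter/adapter_config.json"  # local copy of the adapter config
with open(path) as f:
    config = json.load(f)
config.pop("foobar", None)  # drop the argument unknown to the installed PEFT version
with open(path, "w") as f:
    json.dump(config, f, indent=2)
```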
This solution works most of the time. As long as the adapter uses the default value for `foobar`, the entry can be safely ignored. However, when it is set to some other value, you will get incorrect results. Upgrading PEFT is the recommended solution.

View File

@ -23,14 +23,14 @@ PEFT is integrated with the Transformers, Diffusers, and Accelerate libraries to
<div class="mt-10">
<div class="w-full flex flex-col space-y-4 md:space-y-0 md:grid md:grid-cols-2 md:gap-y-4 md:gap-x-5">
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="quicktour"
><div class="w-full text-center bg-gradient-to-br from-blue-400 to-blue-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Get started</div>
><div class="w-full text-center bg-gradient-to-br from-blue-400 to-blue-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Quicktour</div>
<p class="text-gray-700">Start here if you're new to 🤗 PEFT to get an overview of the library's main features, and how to train a model with a PEFT method.</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="./task_guides/image_classification_lora"
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="./task_guides/prompt_based_methods"
><div class="w-full text-center bg-gradient-to-br from-indigo-400 to-indigo-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">How-to guides</div>
<p class="text-gray-700">Practical guides demonstrating how to apply various PEFT methods across different types of tasks like image classification, causal language modeling, automatic speech recognition, and more. Learn how to use 🤗 PEFT with the DeepSpeed and Fully Sharded Data Parallel scripts.</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="./conceptual_guides/lora"
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="./conceptual_guides/adapter"
><div class="w-full text-center bg-gradient-to-br from-pink-400 to-pink-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Conceptual guides</div>
<p class="text-gray-700">Get a better theoretical understanding of how LoRA and various soft prompting methods help reduce the number of trainable parameters to make training more efficient.</p>
</a>

View File

@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
# Installation
Before you start, you will need to setup your environment, install the appropriate packages, and configure 🤗 PEFT. 🤗 PEFT is tested on **Python 3.8+**.
Before you start, you will need to setup your environment, install the appropriate packages, and configure 🤗 PEFT. 🤗 PEFT is tested on **Python 3.9+**.
🤗 PEFT is available on PyPI, as well as GitHub:
@ -43,5 +43,5 @@ repository:
```bash
git clone https://github.com/huggingface/peft
cd peft
pip install -e .
pip install -e .[test]
```

View File

@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
# BOFT
[Orthogonal Butterfly (BOFT)](https://hf.co/papers/2311.06243) is a generic method designed for finetuning foundation models. It improves the paramter efficiency of the finetuning paradigm -- Orthogonal Finetuning (OFT), by taking inspiration from Cooley-Tukey fast Fourier transform, showing favorable results across finetuning different foundation models, including large vision transformers, large language models and text-to-image diffusion models.
[Orthogonal Butterfly (BOFT)](https://hf.co/papers/2311.06243) is a generic method designed for finetuning foundation models. It improves the parameter efficiency of the finetuning paradigm -- Orthogonal Finetuning (OFT), by taking inspiration from Cooley-Tukey fast Fourier transform, showing favorable results across finetuning different foundation models, including large vision transformers, large language models and text-to-image diffusion models.
The abstract from the paper is:

View File

@ -0,0 +1,33 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Bone
Dimension-Sharding Adaptation ([DiSHA](https://huggingface.co/papers/2409.15371)) expands the PEFT design space to unlock lower intrinsic ranks and faster convergence by default. Building on DiSHA, the paper proposes Block-Affine Adaptation (Bone), an efficient structure, and Block Affine Transformation Adaptation (BAT), a non-linear update method.
The abstract from the paper is:
Low-Rank Adaptation (LoRA) leverages the low intrinsic rank of weight updates in Large Language Models (LLMs), establishing a Parameter-Efficient Fine-Tuning (PEFT) paradigm. However, LoRA suffers from slow convergence. We introduce Dimension-Sharding Adaptation (DiSHA), which expands the PEFT design space to unlock lower intrinsic ranks and faster convergence by default. Within DiSHA's design space, we propose Block Affine Adaptation (Bone), a computationally efficient structure that delivers both high performance and efficiency. While certain DiSHA configurations may result in colinear updates to weight shards, we address this with Block Affine Transformation Adaptation (BAT), a nonlinear variant of DiSHA. BAT introduces nonlinearity by combining trainable matrices with original weight shards in a nonlinear manner, inducing nonlinearity in matrix updates without introducing additional parameters. Empirical results show that Bone, under the DiSHA framework, consistently outperforms LoRA variants in both NLG and NLU tasks, with significantly improved computational efficiency. Further analysis demonstrates that BAT enhances model capabilities by leveraging its nonlinear design.
## BoneConfig
[[autodoc]] tuners.bone.config.BoneConfig
## BoneModel
[[autodoc]] tuners.bone.model.BoneModel

View File

@ -0,0 +1,43 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# C3A: Parameter-Efficient Fine-Tuning via Circular Convolution
[C3A](https://huggingface.co/papers/2407.19342) is a parameter-efficient fine-tuning technique that leverages Circular Convolution to achieve high rank adaptation within reasonable resource limits.
Note that you should use a much larger learning rate (LR) for C3A than for other methods. For example, an LR of 1e-1 for C3A is a good starting point. In addition, a much smaller weight decay should be used. You can refer to the `method_comparison` folder for more details.
The `block_size` determines the number of tunable parameters and affects performance. As a starting point, you can choose a common divisor of the target layer's input and output dimensions $d_1$ and $d_2$ that is close to $\frac{\sqrt{d_1\times d_2}}{r}$, where $r$ is the rank you would use for LoRA on this task.
C3A currently has the following constraints:
- Only `nn.Linear` layers are supported.
- Quantized layers are not supported.
- The block size should be a common divisor of both the input and output sizes of target layers.
If these constraints don't work for your use case, consider other methods instead.
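A minimal sketch putting these recommendations together (the model, block size, and target modules are placeholders, and `C3AConfig` is assumed to be importable from `peft` like other method configs):
```python
import torch
from transformers import AutoModelForCausalLM
from peft import C3AConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # placeholder model

# block_size must be a common divisor of the target layers' input and output sizes
config = C3AConfig(block_size=64, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base_model, config)

# much higher learning rate and much smaller weight decay than for LoRA-like methods
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-1, weight_decay=0.0)
```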
The abstract from the paper is:
> Low-Rank Adaptation (LoRA) has gained popularity for fine-tuning large foundation models, leveraging low-rank matrices $\mathbf{A}$ and $\mathbf{B}$ to represent weight changes (i.e., $\Delta \mathbf{W} = \mathbf{B} \mathbf{A}$). This method reduces trainable parameters and mitigates heavy memory consumption associated with full delta matrices by sequentially multiplying $\mathbf{A}$ and $\mathbf{B}$ with the activation. Despite its success, the intrinsic low-rank characteristic may limit its performance. Although several variants have been proposed to address this issue, they often overlook the crucial computational and memory efficiency brought by LoRA. In this paper, we propose Circular Convolution Adaptation (C3A), which not only achieves high-rank adaptation with enhanced performance but also excels in both computational power and memory utilization. Extensive experiments demonstrate that C3A consistently outperforms LoRA and its variants across various fine-tuning tasks.
## C3AConfig
[[autodoc]] tuners.c3a.config.C3AConfig
## C3AModel
[[autodoc]] tuners.c3a.model.C3AModel

View File

@ -0,0 +1,34 @@
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Context-aware Prompt Tuning: Advancing In-Context Learning with Adversarial Methods
[CPT](https://huggingface.co/papers/2410.17222) combines In-Context Learning (ICL), Prompt Tuning (PT), and adversarial optimization to improve few-shot learning by refining context embeddings. CPT updates the context tokens by optimizing both the context and the training examples, encapsulating them into a novel loss design that minimizes overfitting, enables more effective optimization, and drives significant improvements in classification tasks.
[//]: # ([CPT]&#40;https://huggingface.co/papers/2410.17222&#41; for the paper)
The abstract from the paper is:
> Large Language Models (LLMs) can perform few-shot learning using either optimization-based approaches or In-Context Learning (ICL). Optimization-based methods often suffer from overfitting, as they require updating a large number of parameters with limited data. In contrast, ICL avoids overfitting but typically underperforms compared to optimization-based methods and is highly sensitive to the selection, order, and format of demonstration examples. To overcome these challenges, we introduce Context-aware Prompt Tuning (CPT), a method inspired by ICL, Prompt Tuning (PT), and adversarial attacks. CPT builds on the ICL strategy of concatenating examples before the input, extending it by incorporating PT-like learning to refine the context embedding through iterative optimization, extracting deeper insights from the training examples. Our approach carefully modifies specific context tokens, considering the unique structure of the examples within the context. In addition to updating the context with PT-like optimization, CPT draws inspiration from adversarial attacks, adjusting the input based on the labels present in the context while preserving the inherent value of the user-provided data. To ensure robustness and stability during optimization, we employ a projected gradient descent algorithm, constraining token embeddings to remain close to their original values and safeguarding the quality of the context. Our method has demonstrated superior accuracy across multiple classification tasks using various LLM models, outperforming existing baselines and effectively addressing the overfitting challenge in few-shot learning.
Take a look at [Example](https://github.com/huggingface/peft/blob/main/examples/cpt_finetuning/README.md) for a step-by-step guide on how to train a model with CPT.
## CPTConfig
[[autodoc]] tuners.cpt.config.CPTConfig
## CPTEmbedding
[[autodoc]] tuners.cpt.model.CPTEmbedding

View File

@ -0,0 +1,33 @@
<!--⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Functions for PEFT integration
A collection of functions that can be useful when integrating PEFT into models that are not `PeftModel` instances, e.g. for Transformers or Diffusers integrations.
The functions provided here can be considered "public API" of PEFT and hence are safe to be used by packages that provide PEFT integrations.
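As a rough sketch of how an integration might use these functions to add and extract an adapter on a plain `torch.nn.Module` (the toy model is made up for illustration):
```python
import torch
from peft import LoraConfig
from peft.functional import get_peft_model_state_dict, inject_adapter_in_model

class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 16)

    def forward(self, x):
        return self.linear(x)

model = MLP()
config = LoraConfig(target_modules=["linear"])
model = inject_adapter_in_model(config, model)  # add LoRA layers in place, no PeftModel wrapper
state_dict = get_peft_model_state_dict(model)   # contains only the adapter weights
print(sorted(state_dict))
```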
## Cast the adapter weight dtypes
[[autodoc]] functional.cast_adapter_dtype
- all
## Delete the PEFT adapter from model
[[autodoc]] functional.delete_adapter
- all
## Get the state dict of the PEFT adapter
[[autodoc]] functional.get_peft_model_state_dict
- all
## Inject a PEFT adapter into the model based on a PEFT config
[[autodoc]] functional.inject_adapter_in_model
- all
## Set the active PEFT adapter(s) of the model
[[autodoc]] functional.set_adapter
- all
## Load the weights of the PEFT state dict into the model
[[autodoc]] functional.set_peft_model_state_dict
- all

View File

@ -2,7 +2,7 @@
rendered properly in your Markdown viewer.
-->
# Document Title
# Helper methods
A collection of helper functions for PEFT.
@ -10,3 +10,13 @@ A collection of helper functions for PEFT.
[[autodoc]] helpers.check_if_peft_model
- all
## Temporarily Rescaling Adapter Scale in LoraLayer Modules
[[autodoc]] helpers.rescale_adapter_scale
- all
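For example, a hedged usage sketch (assuming `model` is a LoRA model and `inputs` is a prepared batch):
```python
from peft.helpers import rescale_adapter_scale

# temporarily halve the LoRA contribution for this forward pass only
with rescale_adapter_scale(model, multiplier=0.5):
    outputs = model(**inputs)
```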
## Context manager to disable input dtype casting in the `forward` method of LoRA layers
[[autodoc]] helpers.disable_input_dtype_casting
- all

View File

@ -0,0 +1,76 @@
<!--⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Hotswapping adapters
The idea of hotswapping an adapter is the following: We can already load multiple adapters, e.g. two LoRAs, at the same time. But sometimes, we want to load one LoRA and then replace its weights in-place with the LoRA weights of another adapter. This is now possible with the `hotswap_adapter` function.
In general, this should be faster than deleting one adapter and loading the new adapter in its place, which is how you would achieve the same final outcome without hotswapping. Another advantage of hotswapping is that it prevents re-compilation in case the PEFT model is already compiled using `torch.compile`. This can save quite a lot of time.
## Example without `torch.compile`
```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel
from peft.utils.hotswap import hotswap_adapter
model_id = ...
inputs = ...
device = ...
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
# load lora 0
model = PeftModel.from_pretrained(model, <path-adapter-0>)
with torch.inference_mode():
    output_adapter_0 = model(inputs).logits
# replace the "default" lora adapter with the new one
hotswap_adapter(model, <path-adapter-1>, adapter_name="default", torch_device=device)
with torch.inference_mode():
    output_adapter_1 = model(inputs).logits
```
## Example with `torch.compile`
```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel
from peft.utils.hotswap import hotswap_adapter, prepare_model_for_compiled_hotswap
model_id = ...
inputs = ...
device = ...
max_rank = ... # maximum rank among all LoRA adapters that will be used
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
# load lora 0
model = PeftModel.from_pretrained(model, <path-adapter-0>)
# Prepare the model to allow hotswapping even if ranks/scalings of 2nd adapter differ.
# You can skip this step if all ranks and scalings are identical.
prepare_model_for_compiled_hotswap(model, target_rank=max_rank)
model = torch.compile(model)
with torch.inference_mode():
    output_adapter_0 = model(inputs).logits
# replace the "default" lora adapter with the new one
hotswap_adapter(model, <path-adapter-1>, adapter_name="default", torch_device=device)
with torch.inference_mode():
    output_adapter_1 = model(inputs).logits
```
## Caveats
Hotswapping works with transformers models and diffusers models. However, there are some caveats:
- Right now, only LoRA is properly supported.
- It only works for the same PEFT method, so no swapping LoRA and LoHa, for example.
- The adapter that is being swapped in must target the same layers as the previous adapter or a subset of those layers. It cannot target new layers. Therefore, if possible, start with the adapter that targets most layers.
[[autodoc]] utils.hotswap.hotswap_adapter
- all
[[autodoc]] utils.hotswap.hotswap_adapter_from_state_dict
- all

View File

@ -0,0 +1,32 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Bridging The Gap between Low-rank and Orthogonal Adaptation via Householder Reflection Adaptation (HRA)
[HRA](https://huggingface.co/papers/2405.17484) is a simple but effective adapter-based fine-tuning method by leveraging Householder reflections. This method harnesses the advantages of both strategies, reducing parameters and computation costs while penalizing the loss of pre-training knowledge. It consistently achieves better performance with fewer trainable parameters and outperforms state-of-the-art adapters across different models, including large language models (LLMs) and conditional image generators.
The abstract from the paper is:
> While following different technical routes, both low-rank and orthogonal adaptation techniques can efficiently adapt large-scale pre-training models in specific tasks or domains based on a small piece of trainable parameters. In this study, we bridge the gap between these two techniques, proposing a simple but effective adaptation method based on Householder reflections. Given a pre-trained model, our method fine-tunes its layers by multiplying each frozen weight matrix with an orthogonal matrix constructed by a chain of learnable Householder reflections (HRs). This HR-based orthogonal fine-tuning is equivalent to an adaptive low-rank adaptation. Moreover, we show that the orthogonality of the reflection planes corresponding to the HRs impacts the model capacity and regularity. The analysis motivates us to regularize the orthogonality of the HRs, leading to different implementations of the proposed Householder reflection adaptation (HRA) method. Compared with state-of-the-art methods, HRA achieves superior performance with fewer learnable parameters when adapting large language models and conditional image generators. The code is available at [peft](https://github.com/huggingface/peft/tree/main/src/peft/tuners/hra) and [HRA](https://github.com/DaShenZi721/HRA).
## HRAConfig
[[autodoc]] tuners.hra.config.HRAConfig
## HRAModel
[[autodoc]] tuners.hra.model.HRAModel

View File

@ -32,4 +32,24 @@ The abstract from the paper is:
## Utility
### ArrowConfig
[[autodoc]] tuners.lora.config.ArrowConfig
### LoftQ
[[autodoc]] utils.loftq_utils.replace_lora_weights_loftq
### Eva
#### EvaConfig
[[autodoc]] tuners.lora.config.EvaConfig
#### initialize_lora_eva_weights
[[autodoc]] tuners.lora.eva.initialize_lora_eva_weights
#### get_eva_state_dict
[[autodoc]] tuners.lora.eva.get_eva_state_dict

View File

@ -0,0 +1,32 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# MiSS
MiSS: Balancing LoRA Performance and Efficiency with Simple Shard Sharing ([MiSS](https://huggingface.co/papers/2409.15371)) is a novel PEFT method that adopts a low-rank structure, requires only a single trainable matrix, and introduces a new update mechanism distinct from LoRA, achieving an excellent balance between performance and efficiency.
The abstract from the paper is:
*Parameter-Efficient Fine-Tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), effectively reduce the number of trainable parameters in Large Language Models (LLMs). However, as model scales continue to grow, the demand for computational resources remains a significant challenge. Existing LoRA variants often struggle to strike an optimal balance between adaptability (model performance and convergence speed) and efficiency (computational overhead, memory usage, and initialization time). This paper introduces MiSS(Matrix Shard Sharing ), a novel PEFT approach that addresses this trade-off through a simple shard-sharing mechanism. MiSS leverages the insight that a low-rank adaptation can be achieved by decomposing the weight matrix into multiple fragment matrices and utilizing a shared, trainable common fragment. This method constructs the low-rank update matrix through the replication of these shared, partitioned shards. We also propose a hardware-efficient and broadly applicable implementation for MiSS. Extensive experiments conducted on a range of tasks, alongside a systematic analysis of computational performance, demonstrate MiSS's superiority. The results show that MiSS significantly outperforms standard LoRA and its prominent variants in both model performance metrics and computational efficiency, including initialization speed and training throughput. By effectively balancing expressive power and resource utilization, MiSS offers a compelling solution for efficiently adapting large-scale models*.
## MissConfig
[[autodoc]] tuners.miss.config.MissConfig
## MissModel
[[autodoc]] tuners.miss.model.MissModel

View File

@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
# OFT
[Orthogonal Finetuning (OFT)](https://hf.co/papers/2306.07280) is a method developed for adapting text-to-image diffusion models. It works by reparameterizing the pretrained weight matrices with it's orthogonal matrix to preserve information in the pretrained model. To reduce the number of parameters, OFT introduces a block-diagonal structure in the orthogonal matrix.
[Orthogonal Finetuning (OFT)](https://hf.co/papers/2306.07280) is a method developed for adapting text-to-image diffusion models. It works by reparameterizing the pretrained weight matrices with its orthogonal matrix to preserve information in the pretrained model. To reduce the number of parameters, OFT introduces a block-diagonal structure in the orthogonal matrix.
The abstract from the paper is:

View File

@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
# Polytropon
[Polytropon](https://hf.co/papers/2202.13914) is a multitask model with a number of different LoRA adapters in it's "inventory". The model learns the correct combination of adapters from the inventory with a routing function to choose the best subset of modules for a specific task. PEFT also supports [Multi-Head Adapter Routing (MHR)](https://hf.co/papers/2211.03831) for Polytropon which builds on and improves the routing function by combining the adapter heads more granularly. The adapter heads are separated into disjoint blocks and a different routing function is learned for each one, allowing for more expressivity.
[Polytropon](https://hf.co/papers/2202.13914) is a multitask model with a number of different LoRA adapters in its "inventory". The model learns the correct combination of adapters from the inventory with a routing function to choose the best subset of modules for a specific task. PEFT also supports [Multi-Head Adapter Routing (MHR)](https://hf.co/papers/2211.03831) for Polytropon which builds on and improves the routing function by combining the adapter heads more granularly. The adapter heads are separated into disjoint blocks and a different routing function is learned for each one, allowing for more expressivity.
<hfoptions id="paper">
<hfoption id="Combining Modular Skills in Multitask Learning">

View File

@ -0,0 +1,45 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# RandLora: Full-rank parameter-efficient fine-tuning of large models
[RandLora](https://huggingface.co/papers/2502.00987) is a parameter-efficient fine-tuning technique that is similar to [LoRA](https://huggingface.co/papers/2106.09685) and [VeRA](https://huggingface.co/papers/2310.11454) but performs full-rank updates to improve performance. RandLora can be particularly useful when adapting large models to hard tasks that require complex updates while preserving the parameter efficiency of LoRA. The full-rank update of RandLora is achieved by linearly scaling random bases. The random bases are a collection of multiple low-rank matrices such that the sum of their ranks is greater than or equal to the full rank of the parameter matrices. The trainable parameters of RandLora are two diagonal matrices (vectors) that are multiplied with the right-hand low-rank random bases, in a similar way to VeRA's update. To maintain low memory usage, RandLora uses a custom function that avoids storing unnecessary bases in memory for backpropagation.
Contrary to other LoRA-like PEFT algorithms, the rank of RandLora's random bases affects the number of trainable parameters in the opposite direction: because the number of bases times the base rank is constant in RandLora, reducing the rank increases the number of random bases and hence the number of base-specific trainable diagonal matrices.
Because reducing the rank of RandLora's random bases will increase their number, RandLora can become slower to train than LoRA for very small ranks; typically, ranks below 4 will result in a large increase in training time. This does not affect inference, though, as the RandLora adapters can be merged into the pretrained weight matrices.
RandLora additionally supports training with sparse, ternary random bases (only containing -1, 0 and 1). These bases are as described in [Bingham et al.](https://cs-people.bu.edu/evimaria/cs565/kdd-rp.pdf) and [Ping et al.](https://hastie.su.domains/Papers/Ping/KDD06_rp.pdf) and could theoretically be used to reduce compute needs by performing aggregations instead of matrix multiplications to create the weight update. This is not currently supported. Although it does not currently reduce compute, using sparse random bases in RandLora can reduce overfitting in some cases. For users interested in using sparse ternary bases, the `sparse` option is recommended over the `very_sparse` one, which can reduce performance.
Similarly to VeRA, when saving RandLora's parameters, it's possible to eschew storing the low-rank matrices by setting `save_projection=False` on the `RandLoraConfig`. In that case, these matrices will be restored based on the fixed random seed from the `projection_prng_key` argument. This cuts down on the size of the checkpoint, but we cannot guarantee reproducibility on all devices and for all future versions of PyTorch. If you want to ensure reproducibility, set `save_projection=True` (which is the default).
As in VeRA, to handle different shapes of adapted layers, RandLora initializes shared A and B matrices with the largest required size for each dimension. During the forward pass, submatrices A and B for a given layer are sliced out from these shared matrices and used as described in the paper. For example, adapting two linear layers of shapes (100, 20) and (80, 50) will create A and B matrices of shapes (rank, 50) and (100, rank) respectively. Then, to adapt a layer of shape (100, 20), submatrices A and B of shapes (rank, 20) and (100, rank) will be extracted.
RandLora currently has the following constraint:
- Only `nn.Linear` layers are supported.
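A minimal configuration sketch reflecting the notes above (the model and target modules are placeholders, and the `sparse` argument is assumed to be the config counterpart of the sparse ternary bases mentioned earlier):
```python
from transformers import AutoModelForCausalLM
from peft import RandLoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # placeholder model

# very small ranks increase the number of bases and slow down training;
# sparse ternary bases can help against overfitting in some cases
config = RandLoraConfig(r=32, sparse=True, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base_model, config)
```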
The abstract from the paper is:
> Low-Rank Adaptation (LoRA) and its variants have shown impressive results in reducing the number of trainable parameters and memory requirements of large transformer networks while maintaining fine-tuning performance. The low-rank nature of the weight update inherently limits the representation power of fine-tuned models, however, thus potentially compromising performance on complex tasks. This raises a critical question: when a performance gap between LoRA and standard fine-tuning is observed, is it due to the reduced number of trainable parameters or the rank deficiency?
This paper aims to answer this question by introducing RandLora, a parameter-efficient method that performs full-rank updates using a learned linear combinations of low-rank, non-trainable random matrices. Our method limits the number of trainable parameters by restricting optimization to diagonal scaling matrices applied to the fixed random matrices. This allows us to effectively overcome the low-rank limitations while maintaining parameter and memory efficiency during training. Through extensive experimentation across vision, language, and vision-language benchmarks, we systematically evaluate the limitations of LoRA and existing random basis methods. Our findings reveal that full-rank updates are beneficial across vision and language tasks individually, and even more so for vision-language tasks, where RandLora significantly reduces---and sometimes eliminates---the performance gap between standard fine-tuning and LoRA, demonstrating its efficacy.
## RandLoraConfig
[[autodoc]] tuners.randlora.config.RandLoraConfig
## RandLoraModel
[[autodoc]] tuners.randlora.model.RandLoraModel

View File

@ -0,0 +1,31 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# RoAd
[RoAd](https://arxiv.org/pdf/2409.00119) is a parameter-efficient fine-tuning technique that adapts large language models by learning a small set of 2×2 rotation matrices (and optional scaling factors) applied to pairs of hidden dimensions. RoAd achieves competitive or superior performance compared to other PEFT methods with under 0.1% trainable parameters. Unlike LoRA's batched low-rank updates, RoAd's sparse rotations reformulate to simple element-wise operations, yielding significantly higher serving throughput when handling heterogeneous requests in the same batch, i.e. serving multiple adapters simultaneously. Moreover, RoAd integrates seamlessly into a distributed interchange intervention framework, interpreting its sparse 2D rotations as task-specific interventions within learned subspaces of hidden representations. These orthogonal subspaces can be composed to merge multiple task-specific behaviors, such as multilingual capabilities or instruction following, without additional fine-tuning, enabling modular, interpretable adaptations in LLMs.
Fine-tuning with RoAd typically requires a higher learning rate than LoRA or similar methods, around 1e-3. Currently, RoAd only supports linear layers, and it can be used on models quantized with bitsandbytes (4-bit or 8-bit).
For running inference with different RoAd adapters in the same batch see [Inference with different LoRA adapters in the same batch](../developer_guides/lora#inference-with-different-lora-adapters-in-the-same-batch).
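A short setup sketch with a placeholder model and target modules, using the higher learning rate mentioned above (and assuming `RoadConfig` is exported from `peft` like other method configs):
```python
import torch
from transformers import AutoModelForCausalLM
from peft import RoadConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # placeholder model
config = RoadConfig(target_modules=["q_proj", "v_proj"])
model = get_peft_model(base_model, config)

# RoAd typically needs a higher learning rate than LoRA, around 1e-3
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)
```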
## RoadConfig
[[autodoc]] tuners.road.config.RoadConfig
## RoadModel
[[autodoc]] tuners.road.model.RoadModel

View File

@ -0,0 +1,35 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Sparse High Rank Adapters
Sparse High Rank Adapters or [SHiRA](https://arxiv.org/abs/2406.13175) is an alternate type of adapter that has been found to have significant advantages over low-rank adapters. Specifically, SHiRA achieves better accuracy than LoRA for a variety of vision and language tasks. It also offers simpler and higher-quality multi-adapter fusion by significantly reducing concept loss, a common problem with low-rank adapters. SHiRA achieves this by directly fine-tuning only a small number of the base model's parameters for the adaptation task.
SHiRA currently has the following constraint:
- Only `nn.Linear` layers are supported.
The abstract from the paper is:
> Low Rank Adaptation (LoRA) has gained massive attention in the recent generative AI research. One of the main advantages of LoRA is its ability to be fused with pretrained models, adding no overhead during inference. However, from a mobile deployment standpoint, we can either avoid inference overhead in the fused mode but lose the ability to switch adapters rapidly, or suffer significant (up to 30% higher) inference latency while enabling rapid switching in the unfused mode. LoRA also exhibits concept-loss when multiple adapters are used concurrently. In this paper, we propose Sparse High Rank Adapters (SHiRA), a new paradigm which incurs no inference overhead, enables rapid switching, and significantly reduces concept-loss. Specifically, SHiRA can be trained by directly tuning only 1-2% of the base model weights while leaving others unchanged. This results in a highly sparse adapter which can be switched directly in the fused mode. We further provide theoretical and empirical insights on how high sparsity in SHiRA can aid multi-adapter fusion by reducing concept loss. Our extensive experiments on LVMs and LLMs demonstrate that finetuning only a small fraction of the parameters in the base model significantly outperforms LoRA while enabling both rapid switching and multi-adapter fusion. Finally, we provide a latency- and memory-efficient SHiRA implementation based on Parameter-Efficient Finetuning (PEFT) Library which trains at nearly the same speed as LoRA while consuming up to 16% lower peak GPU memory, thus making SHiRA easy to adopt for practical use cases. To demonstrate rapid switching benefits during inference, we show that loading SHiRA on a base model can be 5x-16x faster than LoRA fusion on a CPU.
## ShiraConfig
[[autodoc]] tuners.shira.config.ShiraConfig
## ShiraModel
[[autodoc]] tuners.shira.model.ShiraModel

View File

@ -0,0 +1,50 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Trainable Tokens
The Trainable Tokens method provides a way to target specific token embeddings for fine-tuning without resorting to
training the full embedding matrix or using an adapter on the embedding matrix. It is based on the initial implementation from
[here](https://github.com/huggingface/peft/pull/1541).
The method only targets specific tokens and selectively trains the token indices you specify. Consequently, the required RAM is lower, and the disk space needed is also significantly smaller than when storing the full fine-tuned embedding matrix.
Some preliminary benchmarks acquired with [this script](https://github.com/huggingface/peft/blob/main/scripts/train_memory.py)
suggest that for `gemma-2-2b` (which has a rather large embedding matrix) you can save ~4 GiB VRAM with Trainable Tokens
over fully fine-tuning the embedding matrix. While LoRA will use comparable amounts of VRAM it might also target
tokens you don't want to be changed. Note that these are just indications and varying embedding matrix sizes might skew
these numbers a bit.
Note that this method does not add tokens for you; you have to add the new tokens to the tokenizer yourself and resize the model's embedding matrix accordingly. This method will only re-train the embeddings for the tokens you specify.
This method can also be used in conjunction with LoRA layers! See [the LoRA developer guide](../developer_guides/lora#efficiently-train-tokens-alongside-lora).
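A standalone usage sketch (the model and the new tokens are placeholders):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import TrainableTokensConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # placeholder model
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

new_tokens = ["<think>", "</think>"]  # hypothetical newly added tokens
tokenizer.add_tokens(new_tokens)
base_model.resize_token_embeddings(len(tokenizer))

# train only the embeddings of the newly added tokens
config = TrainableTokensConfig(token_indices=tokenizer.convert_tokens_to_ids(new_tokens))
model = get_peft_model(base_model, config)
```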
> [!TIP]
> Saving the model with [`~PeftModel.save_pretrained`] or retrieving the state dict using
> [`get_peft_model_state_dict`] when adding new tokens may save the full embedding matrix instead of only the difference
> as a precaution because the embedding matrix was resized. To save space you can disable this behavior by setting
> `save_embedding_layers=False` when calling `save_pretrained`. This is safe to do as long as you don't modify the
> embedding matrix through other means as well, as such changes will not be tracked by trainable tokens.
## TrainableTokensConfig
[[autodoc]] tuners.trainable_tokens.config.TrainableTokensConfig
## TrainableTokensModel
[[autodoc]] tuners.trainable_tokens.model.TrainableTokensModel

View File

@ -0,0 +1,40 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# VB-LoRA: Extreme Parameter Efficient Fine-Tuning with Vector Banks
## Overview
[VB-LoRA](https://huggingface.co/papers/2405.15179) is a parameter-efficient fine-tuning technique that extends LoRA by learning a fine-grained parameter-sharing scheme at the sub-vector level, achieving significantly higher parameter efficiency. This makes VB-LoRA especially useful in scenarios where storage and transmission costs are critical. It works by decomposing low-rank matrices—from different layers and modules such as K, Q, V, and FFN—into sub-vectors, which are then globally shared through a vector bank.
The abstract from the paper is:
*As the adoption of large language models increases and the need for per-user or per-task model customization grows, the parameter-efficient fine-tuning (PEFT) methods, such as low-rank adaptation (LoRA) and its variants, incur substantial storage and transmission costs. To further reduce stored parameters, we introduce a "divide-and-share" paradigm that breaks the barriers of low-rank decomposition across matrix dimensions, modules and layers by sharing parameters globally via a vector bank. As an instantiation of the paradigm to LoRA, our proposed VB-LoRA composites all the low-rank matrices of LoRA from a shared vector bank with a differentiable top-k admixture module. VB-LoRA achieves extreme parameter efficiency while maintaining comparable or better performance compared to state-of-the-art PEFT methods. Extensive experiments demonstrate the effectiveness of VB-LoRA on natural language understanding, natural language generation, and instruction tuning tasks. When fine-tuning the Llama2-13B model, VB-LoRA only uses 0.4% of LoRA's stored parameters, yet achieves superior results.*
## Usage Tips
- VB-LoRA utilizes a sparse top-k module to learn the sharing mechanism. When saving adapter parameters, you can either save only the top-k weights and their indices by setting `save_only_topk_weights = True` in `VBLoRAConfig`, or save all the trainable logits by setting it to `False`. Enabling `save_only_topk_weights = True` significantly reduces storage space; for instance, in Llama2-7B, the storage file size decreases from 308MB to 2.5MB. Note that models saved with `save_only_topk_weights = True` are intended for merging or inference only and cannot be used to resume training.
- VB-LoRA has two sets of training parameters: vector bank parameters and logit parameters. In practice, we found that logit parameters require a higher learning rate, while vector bank parameters require a lower learning rate. When using the AdamW optimizer, typical learning rates are 0.01 for logits and 0.001 for vector bank parameters.
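Below is a hedged sketch combining these two tips; the model and target modules are placeholders, and the parameter name filters (`logits`, `vector_bank`) are assumptions about how VB-LoRA names its weights:
```python
import torch
from transformers import AutoModelForCausalLM
from peft import VBLoRAConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # placeholder model
config = VBLoRAConfig(
    target_modules=["q_proj", "v_proj"],
    save_only_topk_weights=True,  # much smaller checkpoints; intended for merging/inference, not resumed training
)
model = get_peft_model(base_model, config)

# separate learning rates for the logits and the vector bank, as suggested above
logit_params = [p for n, p in model.named_parameters() if p.requires_grad and "logits" in n]
bank_params = [p for n, p in model.named_parameters() if p.requires_grad and "vector_bank" in n]
optimizer = torch.optim.AdamW([
    {"params": logit_params, "lr": 1e-2},
    {"params": bank_params, "lr": 1e-3},
])
```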
## VBLoRAConfig
[[autodoc]] tuners.vblora.config.VBLoRAConfig
## VBLoRAModel
[[autodoc]] tuners.vblora.model.VBLoRAModel

View File

@ -22,12 +22,9 @@ When saving the adapter parameters, it's possible to eschew storing the low rank
To handle different shapes of adapted layers, VeRA initializes shared A and B matrices with the largest required size for each dimension. During the forward pass, submatrices A and B for a given layer are sliced out from these shared matrices and used as described in the paper. For example, adapting two linear layers of shapes (100, 20) and (80, 50) will create A and B matrices of shapes (rank, 50) and (100, rank) respectively. Then, to adapt a layer of shape (100, 20), submatrices A and B of shapes (rank, 20) and (100, rank) will be extracted.
VeRA currently has the following constraints:
VeRA currently has the following constraint:
- Only `nn.Linear` layers are supported.
- Quantized layers are not supported.
If these constraints don't work for your use case, use LoRA instead.
The abstract from the paper is:

View File

@ -0,0 +1,56 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# X-LoRA
Mixture of LoRA Experts ([X-LoRA](https://huggingface.co/papers/2402.07148)) is a PEFT method enabling sparse or dense mixture of LoRA experts based on a high granularity (token, layer, sequence) scalings matrix. This leverages frozen LoRA adapters and a frozen base model to drastically reduce the number of parameters that need to be fine-tuned.
A unique aspect of X-LoRA is its versatility: it can be applied to any `transformers` base model with LoRA adapters. This means that, despite the mixture of experts strategy, no changes to the model code need to be made.
The below graphic demonstrates how the scalings change for different prompts for each token. This highlights the activation of different adapters as the generation progresses and the sequence creates new context.
![Token-by-token scalings](https://github.com/EricLBuehler/xlora/raw/master/res/token_by_token_scalings.gif)
The abstract from the paper is:
*We report a mixture of expert strategy to create fine-tuned large language models using a deep layer-wise token-level approach based on low-rank adaptation (LoRA). Starting with a set of pre-trained LoRA adapters, our gating strategy uses the hidden states to dynamically mix adapted layers, allowing the resulting X-LoRA model to draw upon different capabilities and create never-before-used deep layer-wise combinations to solve tasks. The design is inspired by the biological principles of universality and diversity, where neural network building blocks are reused in different hierarchical manifestations. Hence, the X-LoRA model can be easily implemented for any existing large language model (LLM) without a need for modifications of the underlying structure. We develop a tailored X-LoRA model that offers scientific capabilities including forward/inverse analysis tasks and enhanced reasoning capability, focused on biomaterial analysis, protein mechanics and design. The impact of this work include access to readily expandable and adaptable models with strong domain knowledge and the capability to integrate across areas of knowledge. Featuring experts in biology, mathematics, reasoning, bio-inspired materials, mechanics and materials, chemistry, protein biophysics, mechanics and quantum-mechanics based molecular properties, we conduct a series of physics-focused case studies. We examine knowledge recall, protein mechanics forward/inverse tasks, protein design, adversarial agentic modeling including ontological knowledge graph construction, as well as molecular design. The model is capable not only of making quantitative predictions of nanomechanical properties of proteins or quantum mechanical molecular properties, but also reasons over the results and correctly predicts likely mechanisms that explain distinct molecular behaviors.*.
Please cite X-LoRA as:
```bibtex
@article{10.1063/5.0203126,
author = {Buehler, Eric L. and Buehler, Markus J.},
title = "{X-LoRA: Mixture of low-rank adapter experts, a flexible framework for large language models with applications in protein mechanics and molecular design}",
journal = {APL Machine Learning},
volume = {2},
number = {2},
pages = {026119},
year = {2024},
month = {05},
abstract = "{We report a mixture of expert strategy to create fine-tuned large language models using a deep layer-wise token-level approach based on low-rank adaptation (LoRA). Starting with a set of pre-trained LoRA adapters, our gating strategy uses the hidden states to dynamically mix adapted layers, allowing the resulting X-LoRA model to draw upon different capabilities and create never-before-used deep layer-wise combinations to solve tasks. The design is inspired by the biological principles of universality and diversity, where neural network building blocks are reused in different hierarchical manifestations. Hence, the X-LoRA model can be easily implemented for any existing large language model without a need for modifications of the underlying structure. We develop a tailored X-LoRA model that offers scientific capabilities, including forward/inverse analysis tasks and enhanced reasoning capability, focused on biomaterial analysis, protein mechanics, and design. The impact of this work includes access to readily expandable and adaptable models with strong domain knowledge and the capability to integrate across areas of knowledge. Featuring experts in biology, mathematics, reasoning, bio-inspired materials, mechanics and materials, chemistry, protein biophysics, mechanics, and quantum-mechanics based molecular properties, we conduct a series of physics-focused case studies. We examine knowledge recall, protein mechanics forward/inverse tasks, protein design, adversarial agentic modeling including ontological knowledge graph construction, and molecular design. The model is capable not only of making quantitative predictions of nanomechanical properties of proteins or quantum mechanical molecular properties but also reasoning over the results and correctly predicting likely mechanisms that explain distinct molecular behaviors.}",
issn = {2770-9019},
doi = {10.1063/5.0203126},
url = {https://doi.org/10.1063/5.0203126},
eprint = {https://pubs.aip.org/aip/aml/article-pdf/doi/10.1063/5.0203126/19964043/026119\_1\_5.0203126.pdf},
}
```
## XLoraConfig
[[autodoc]] tuners.xlora.config.XLoraConfig
## XLoraModel
[[autodoc]] tuners.xlora.model.XLoraModel

View File

@ -76,7 +76,7 @@ training_args = TrainingArguments(
per_device_eval_batch_size=32,
num_train_epochs=2,
weight_decay=0.01,
evaluation_strategy="epoch",
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
)
@ -90,7 +90,7 @@ trainer = Trainer(
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["test"],
tokenizer=tokenizer,
processing_class=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics,
)

View File

@ -92,7 +92,7 @@ processed_ds = ds.map(
)
```
Create a training and evaluation [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), and set `pin_memory=True` to speed up data transfer to the GPU during training if your dataset samples are on a CPU.
Create a training and evaluation [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), and set `pin_memory=True` to speed up data transfer to the accelerator during training if your dataset samples are on a CPU.
```py
from torch.utils.data import DataLoader
@ -159,12 +159,12 @@ lr_scheduler = get_linear_schedule_with_warmup(
)
```
Move the model to the GPU and create a training loop that reports the loss and perplexity for each epoch.
Move the model to the accelerator and create a training loop that reports the loss and perplexity for each epoch.
```py
from tqdm import tqdm
device = "cuda"
device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
model = model.to(device)
for epoch in range(num_epochs):
@ -219,7 +219,9 @@ To load the model for inference, use the [`~AutoPeftModelForSeq2SeqLM.from_pretr
```py
from peft import AutoPeftModelForSeq2SeqLM
model = AutoPeftModelForSeq2SeqLM.from_pretrained("<your-hf-account-name>/mt0-large-ia3").to("cuda")
device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
model = AutoPeftModelForSeq2SeqLM.from_pretrained("<your-hf-account-name>/mt0-large-ia3").to(device)
tokenizer = AutoTokenizer.from_pretrained("bigscience/mt0-large")
i = 15

View File

@ -20,6 +20,8 @@ A popular way to efficiently train large models is to insert (typically in the a
There are several different ways to express the weight matrix as a low-rank decomposition, but [Low-Rank Adaptation (LoRA)](../conceptual_guides/adapter#low-rank-adaptation-lora) is the most common method. The PEFT library supports several other LoRA variants, such as [Low-Rank Hadamard Product (LoHa)](../conceptual_guides/adapter#low-rank-hadamard-product-loha), [Low-Rank Kronecker Product (LoKr)](../conceptual_guides/adapter#low-rank-kronecker-product-lokr), and [Adaptive Low-Rank Adaptation (AdaLoRA)](../conceptual_guides/adapter#adaptive-low-rank-adaptation-adalora). You can learn more about how these methods work conceptually in the [Adapters](../conceptual_guides/adapter) guide. If you're interested in applying these methods to other tasks and use cases like semantic segmentation, token classification, take a look at our [notebook collection](https://huggingface.co/collections/PEFT/notebooks-6573b28b33e5a4bf5b157fc1)!
Additionally, PEFT supports the [X-LoRA](../conceptual_guides/adapter#mixture-of-lora-experts-x-lora) Mixture of LoRA Experts method.
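To give a concrete picture, the corresponding configs might be set up as in this hedged sketch; the ranks and the `query`/`value` module names are assumptions that depend on the backbone, not tuned recommendations:
```py
from peft import AdaLoraConfig, LoHaConfig, LoKrConfig, LoraConfig

target_modules = ["query", "value"]  # attention projections of a ViT-style backbone

lora_config = LoraConfig(r=16, lora_alpha=16, target_modules=target_modules, modules_to_save=["classifier"])
loha_config = LoHaConfig(r=16, alpha=16, target_modules=target_modules, modules_to_save=["classifier"])
lokr_config = LoKrConfig(r=16, alpha=16, target_modules=target_modules, modules_to_save=["classifier"])
adalora_config = AdaLoraConfig(init_r=12, target_r=8, total_step=1000, target_modules=target_modules, modules_to_save=["classifier"])
```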
This guide will show you how to quickly train an image classification model - with a low-rank decomposition method - to identify the class of food shown in an image.
<Tip>
@ -257,7 +259,7 @@ batch_size = 128
args = TrainingArguments(
peft_model_id,
remove_unused_columns=False,
evaluation_strategy="epoch",
eval_strategy="epoch",
save_strategy="epoch",
learning_rate=5e-3,
per_device_train_batch_size=batch_size,
@ -279,7 +281,7 @@ trainer = Trainer(
args,
train_dataset=train_ds,
eval_dataset=val_ds,
tokenizer=image_processor,
processing_class=image_processor,
data_collator=collate_fn,
)
trainer.train()

View File

@ -43,7 +43,13 @@ Use the [`~datasets.load_dataset`] function to load the dataset and create a new
```py
from datasets import load_dataset
ds = load_dataset("ought/raft", "twitter_complaints")
ds = load_dataset(
"parquet",
data_files={
"train": "hf://datasets/ought/raft@refs/convert/parquet/twitter_complaints/train/0000.parquet",
"test": "hf://datasets/ought/raft@refs/convert/parquet/twitter_complaints/test/0000.parquet"
}
)
classes = [k.replace("_", " ") for k in ds["train"].features["Label"].names]
ds = ds.map(

View File

@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
# PEFT configurations and models
The sheer size of today's large pretrained models - which commonly have billions of parameters - present a significant training challenge because they require more storage space and more computational power to crunch all those calculations. You'll need access to powerful GPUs or TPUs to train these large pretrained models which is expensive, not widely accessible to everyone, not environmentally friendly, and not very practical. PEFT methods address many of these challenges. There are several types of PEFT methods (soft prompting, matrix decomposition, adapters), but they all focus on the same thing, reduce the number of trainable parameters. This makes it more accessible to train and store large models on consumer hardware.
The sheer size of today's large pretrained models - which commonly have billions of parameters - presents a significant training challenge because they require more storage space and more computational power to crunch all those calculations. You'll need access to powerful GPUs or TPUs to train these large pretrained models which is expensive, not widely accessible to everyone, not environmentally friendly, and not very practical. PEFT methods address many of these challenges. There are several types of PEFT methods (soft prompting, matrix decomposition, adapters), but they all focus on the same thing, reduce the number of trainable parameters. This makes it more accessible to train and store large models on consumer hardware.
The PEFT library is designed to help you quickly train large models on free or low-cost GPUs, and in this tutorial, you'll learn how to setup a configuration to apply a PEFT method to a pretrained base model for training. Once the PEFT configuration is setup, you can use any training framework you like (Transformer's [`~transformers.Trainer`] class, [Accelerate](https://hf.co/docs/accelerate), a custom PyTorch training loop).
@ -135,6 +135,9 @@ lora_model.print_trainable_parameters()
"trainable params: 1,572,864 || all params: 332,769,280 || trainable%: 0.472659014678278"
```
> [!WARNING]
> When calling [`get_peft_model`], the base model will be modified *in-place*. That means, when calling [`get_peft_model`] on a model that was already modified in the same way before, this model will be further mutated. Therefore, if you would like to modify your PEFT configuration after having called [`get_peft_model()`] before, you would first have to unload the model with [`~LoraModel.unload`] and then call [`get_peft_model()`] with your new configuration. Alternatively, you can re-initialize the model to ensure a fresh, unmodified state before applying a new PEFT configuration.
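For example, a minimal sketch of switching configurations (the model name and ranks are placeholders):
```py
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
peft_model = get_peft_model(base_model, LoraConfig(r=8))   # modifies base_model in-place
base_model = peft_model.unload()                           # remove the injected adapter layers again
peft_model = get_peft_model(base_model, LoraConfig(r=16))  # now apply the new configuration
```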
Now you can train the [`PeftModel`] with your preferred training framework! After training, you can save your model locally with [`~PeftModel.save_pretrained`] or upload it to the Hub with the [`~transformers.PreTrainedModel.push_to_hub`] method.
```py

View File

@ -0,0 +1,76 @@
# Activated LoRA (aLoRA)
## Introduction
Activated LoRA (aLoRA) is an adapter that selectively activates its weights only after a given invocation sequence, ensuring that hidden states match the base model prior to this point. This allows reusing the base model KVs (stored in the KV cache) for tokens before the invocation,
enabling much faster real-world inference (e.g. vLLM) when switching between generation with the base model and generation with adapters.
See the [paper](https://huggingface.co/papers/2504.12397) for more details.
## Quick start (shown for Mistral 7B)
```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3", device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
dataset = load_dataset("Lots-of-LoRAs/task1660_super_glue_question_generation", split="train")
invocation_string = "[/INST]" # End of user turn in Mistral chat template
invocation_tokens = tokenizer.encode(invocation_string, add_special_tokens=False)
lora_config = LoraConfig(
task_type="CAUSAL_LM",
alora_invocation_tokens=invocation_tokens,
r=32,
target_modules=["q_proj", "k_proj", "v_proj"],
)
peft_model = get_peft_model(model, lora_config)
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
# Note: tokenize `dataset` for causal LM first (as done in examples/alora_finetuning/alora_finetuning.py)
trainer = Trainer(
model=peft_model,
train_dataset=dataset,
tokenizer=tokenizer,
data_collator=data_collator,
)
trainer.train()
peft_model.save_pretrained("alora-mistral-7b")
```
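Once saved, the adapter loads like any other PEFT adapter. The following is a rough inference sketch (the prompt is a placeholder) that mirrors the `model_inference` function in the full example script below; the adapter weights only take effect after the `[/INST]` invocation sequence, so the hidden states and KV cache before it match the base model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3", device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
alora_model = PeftModel.from_pretrained(base, "alora-mistral-7b")

chat = [{"role": "user", "content": "Generate a question for the answer: Paris"}]
text = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(base.device)

outputs = alora_model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```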
### Use the training example script directly
Pass the invocation string with `--invocation_string` when running the training example
script. For Mistral 7B, do:
```bash
python examples/alora_finetuning/alora_finetuning.py --base_model mistralai/Mistral-7B-Instruct-v0.3 --data_path Lots-of-LoRAs/task1660_super_glue_question_generation --invocation_string "[/INST]"
```
and similarly for Llama-3.2-3B-Instruct:
```bash
python examples/alora_finetuning/alora_finetuning.py --base_model meta-llama/Llama-3.2-3B-Instruct --data_path Lots-of-LoRAs/task1660_super_glue_question_generation --invocation_string "<|start_header_id|>assistant<|end_header_id|>"
```
### Full example of the script
```bash
python alora_finetuning.py \
--base_model "PATH_TO_MODEL" \
--data_path "PATH_TO_DATASET" \
--output_dir "PATH_TO_OUTPUT_DIR" \
--batch_size 1 \
--num_epochs 3 \
--learning_rate 3e-4 \
--cutoff_len 512 \
--val_set_size 500 \
--invocation_string "[/INST]" \
--quantize \
--eval_step 10 \
--save_step 100 \
--device "cuda:0" \
--lora_r 32 \
--lora_alpha 32 \
--lora_dropout 0.05 \
--lora_target_modules "q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj" \
--hub_model_id "YOUR_HF_REPO" \
--push_to_hub
```

View File

@ -0,0 +1,251 @@
import os
import torch
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
DataCollatorForLanguageModeling,
Trainer,
TrainingArguments,
)
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
def train_model(
base_model: str,
data_path: str,
output_dir: str,
batch_size: int,
num_epochs: int,
learning_rate: float,
cutoff_len: int,
val_set_size: int,
invocation_string: str,
quantize: bool,
eval_step: int,
save_step: int,
device: str,
lora_r: int,
lora_alpha: int,
lora_dropout: float,
lora_target_modules: str,
hub_model_id: str,
push_to_hub: bool,
):
os.environ["TOKENIZERS_PARALLELISM"] = "false"
hf_token = os.getenv("HF_TOKEN")
device = torch.device(device)
print(f"Using device: {device}")
tokenizer = AutoTokenizer.from_pretrained(base_model, token=hf_token)
tokenizer.pad_token = tokenizer.unk_token
invocation_tokens = tokenizer.encode(invocation_string, add_special_tokens=False)
if quantize:
model = AutoModelForCausalLM.from_pretrained(
base_model,
token=hf_token,
quantization_config=BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=(
torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float16
),
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
),
)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
else:
model = AutoModelForCausalLM.from_pretrained(base_model, token=hf_token)
lora_config = LoraConfig(
task_type="CAUSAL_LM",
alora_invocation_tokens=invocation_tokens,
r=lora_r,
lora_alpha=lora_alpha,
target_modules=(lora_target_modules.split(",") if lora_target_modules else ["q_proj", "k_proj", "v_proj"]),
lora_dropout=lora_dropout,
bias="none",
)
model = get_peft_model(model, lora_config)
model.to(device)
tokenizer.pad_token = tokenizer.eos_token
dataset = load_dataset(data_path)
def tokenize_function(examples):
formatted_texts = [
tokenizer.apply_chat_template(
[
{"role": "user", "content": user_msg},
{"role": "assistant", "content": assistant_msg},
],
tokenize=False, # get plain text first
add_generation_prompt=False,
)
for user_msg, assistant_msg in zip(examples["input"], examples["output"])
]
# Tokenize the formatted chat texts
model_inputs = tokenizer(
formatted_texts,
padding="max_length",
truncation=True,
max_length=cutoff_len,
)
labels = []
for ids in model_inputs["input_ids"]:
labels.append([(token_id if token_id != tokenizer.pad_token_id else -100) for token_id in ids])
model_inputs["labels"] = labels
return model_inputs
# Tokenize the dataset and prepare for training
tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=dataset["train"].column_names)
# Data collator to dynamically pad the batched examples
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
training_args = TrainingArguments(
output_dir=output_dir,
num_train_epochs=num_epochs,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
warmup_steps=100,
weight_decay=0.01,
logging_dir="./logs",
logging_steps=eval_step,
save_steps=save_step,
save_total_limit=2,
push_to_hub=push_to_hub,
hub_model_id=hub_model_id,
gradient_accumulation_steps=16,
fp16=True,
learning_rate=learning_rate,
hub_token=hf_token,
)
torch.cuda.empty_cache()
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["test"],
data_collator=data_collator,
)
trainer.train()
if push_to_hub:
trainer.push_to_hub(commit_message="Fine-tuned model")
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
def model_inference(model_path: str, adapter_path: str, prompt: str = None, data_path: str = None):
"""
Simple inference with the tuned aLoRA adapter. Note that an aLoRA adapter can (but does not
need to) reuse the KV cache created by the base model, perhaps during a prior generation turn.
Purely for demonstration purposes. See the [paper](https://huggingface.co/papers/2504.12397)
for realistic multiturn cache reuse examples.
"""
if prompt is None:
# Use first row of test data
dataset = load_dataset(data_path)
prompt = dataset["test"][0]["input"]
tokenizer = AutoTokenizer.from_pretrained(model_path)
base_model = AutoModelForCausalLM.from_pretrained(model_path)
alora_model = PeftModel.from_pretrained(base_model, adapter_path)
chat = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(base_model.device)
# Generate answer with adapter
output_dict = alora_model.generate(**inputs, return_dict_in_generate=True, max_new_tokens=20)
alora_outputs = output_dict.sequences
# Print results
print(f"Prompt: {text}")
response = tokenizer.decode(alora_outputs[0][inputs["input_ids"].shape[1] :], skip_special_tokens=True)
print(f"Trained adapter response: {response}")
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="Fine-tune Mistral with Activated LoRA")
parser.add_argument(
"--base_model", type=str, default="mistralai/Mistral-7B-Instruct-v0.3", help="Base model path or name"
)
parser.add_argument(
"--data_path",
type=str,
default="Lots-of-LoRAs/task1660_super_glue_question_generation",
help="Dataset path or name",
)
parser.add_argument(
"--output_dir", type=str, default="path/to/output", help="Output directory for the fine-tuned model"
)
parser.add_argument("--batch_size", type=int, default=2, help="Batch size")
parser.add_argument("--num_epochs", type=int, default=1, help="Number of training epochs")
parser.add_argument("--learning_rate", type=float, default=1e-4, help="Learning rate")
parser.add_argument("--cutoff_len", type=int, default=2048, help="Cutoff length for tokenization")
parser.add_argument("--val_set_size", type=int, default=500, help="Validation set size")
parser.add_argument(
"--invocation_string",
type=str,
default="[/INST]",
help="String that activates the aLoRA adapter. Model dependent.",
)
parser.add_argument("--quantize", action="store_true", help="Use quantization")
parser.add_argument("--eval_step", type=int, default=10, help="Evaluation step interval")
parser.add_argument("--save_step", type=int, default=100, help="Save step interval")
parser.add_argument("--device", type=str, default="cuda:0", help="Device to use for training")
parser.add_argument("--lora_r", type=int, default=32, help="LoRA rank")
parser.add_argument("--lora_alpha", type=int, default=32, help="LoRA alpha")
parser.add_argument("--lora_dropout", type=float, default=0.05, help="LoRA dropout rate")
parser.add_argument(
"--lora_target_modules", type=str, default=None, help="Comma-separated list of target modules for LoRA"
)
parser.add_argument(
"--hub_model_id",
type=str,
default="path/to/repo",
help="Repository name to push the model on the Hugging Face Hub",
)
parser.add_argument("--push_to_hub", action="store_true", help="Whether to push the model to Hugging Face Hub")
args = parser.parse_args()
train_model(
base_model=args.base_model,
data_path=args.data_path,
output_dir=args.output_dir,
batch_size=args.batch_size,
num_epochs=args.num_epochs,
learning_rate=args.learning_rate,
cutoff_len=args.cutoff_len,
val_set_size=args.val_set_size,
invocation_string=args.invocation_string,
quantize=args.quantize,
eval_step=args.eval_step,
save_step=args.save_step,
device=args.device,
lora_r=args.lora_r,
lora_alpha=args.lora_alpha,
lora_dropout=args.lora_dropout,
lora_target_modules=args.lora_target_modules,
hub_model_id=args.hub_model_id,
push_to_hub=args.push_to_hub,
)
print("Model trained. Running test inference.")
model_inference(model_path=args.base_model, adapter_path=args.output_dir, data_path=args.data_path)

View File

@ -0,0 +1,375 @@
# Copyright 2025-present the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This script provides a simple evaluation pipeline for multiple-choice reasoning datasets
(e.g., BoolQ, HellaSwag, ARC, OpenBookQA, Winogrande) with different composition strategies.
Usage examples:
python arrow_phi3_mini.py --strategy base --ds_name arc-challenge
python arrow_phi3_mini.py --strategy arrow --ds_name boolq
python arrow_phi3_mini.py --strategy gks --ds_name hswag
Key features:
- Supports three strategies:
"base" → Evaluate the quantized base model directly
"arrow" → Use Arrow modular routing with task-specific adapters
"gks" → Use Arrow + GenKnowSub (subtracting general-domain knowledge)
- Loads evaluation datasets from the Hugging Face Hub
- Implements a batched evaluation loop that computes per-option likelihoods and selects
the answer with the lowest average loss
- Reports simple accuracy
Implementation details:
- The base model is quantized to 4-bit using `BitsAndBytesConfig` (nf4, bf16 compute).
- For Arrow and GKS, task-specific adapters are loaded from the Hugging Face Hub:
TahaBa/phi3-mini-clustered-flan/ts_expert_i
- Task-specific adapters were trained on 10 clusters of FLAN tasks.
- The clusters were created using Model-Based Clustering (MBC):
1. Train a LoRA adapter for each individual task.
2. Apply k-means clustering to group tasks based on these adapters.
3. Train a LoRA adapter for each resulting cluster.
For more details, see the Arrow paper: https://huggingface.co/papers/2405.11157
- For GKS, general adapters are loaded from:
TahaBa/phi3-mini-general-adapters/...
- These adapters were trained on English, French, and German Wikipedia data
using a causal language modeling objective with (507-token context → 5-token completion) pairs.
- This setup encodes general knowledge into the LoRA space, which can then be
subtracted from task-specific adapters during inference to isolate and purify them.
For more details, see the GenKnowSub paper: https://huggingface.co/papers/2505.10939
- `evaluate_on_multi_choice_batched` handles tokenization, masking context tokens,
and computing per-choice log-likelihoods for fair comparison.
- Accuracy is printed at the end for the selected dataset.
This script is mainly meant for demonstration purposes and lightweight evaluation,
not full-scale benchmarking (batch size / max length can be tuned).
=======================================================================================
Results (evaluated with microsoft/Phi-3-mini-4k-instruct, 4-bit quantization):
| Dataset | Base Acc. | Arrow Acc. | Arrow+GKS Acc. |
|--------------|-----------|------------|----------------|
| ARC-Challenge| 0.4515 | 0.5418 | 0.5585 |
| ARC-Easy | 0.6894 | 0.8404 | 0.8473 |
| Winogrande | 0.5769 | 0.6550 | 0.6724 |
| BoolQ | 0.8146 | 0.8030 | 0.8247 |
| OpenBookQA | 0.43 | 0.448 | 0.472 |
| HellaSwag | 0.7318 | 0.7150 | 0.7376 |
Observations:
- Arrow generally improves over the base model by routing tokens to the most relevant task adapters.
- Applying GKS (general knowledge subtraction) consistently gives further gains compared to Arrow and Base.
These numbers are not meant as leaderboard results, but as a sanity check
to verify that the implementation works as expected and demonstrates
the benefits of Arrow and GenKnowSub.
"""
import argparse
import random
import numpy as np
import torch
from datasets import load_dataset
from sklearn.metrics import accuracy_score
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import ArrowConfig, create_arrow_model
MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
MODEL_MAX_LEN = 2048
def parse_args():
parser = argparse.ArgumentParser(description="Training script with strategy selection")
parser.add_argument(
"--strategy",
type=str,
choices=["base", "arrow", "gks"],
default="base",
help="Training strategy to use: base, arrow, or gks",
)
parser.add_argument(
"--ds_name",
type=str,
choices=["boolq", "hswag", "arc-easy", "arc-challenge", "oqa", "wg"],
default="arc-challenge",
help="Dataset to use: boolq, hswag, arc-easy, arc-challenge, oqa, wg",
)
return parser.parse_args()
def read_test_dataset(ds_name):
if ds_name == "boolq":
ds = load_dataset("google/boolq", split="validation", trust_remote_code=True)
elif ds_name == "hswag":
ds = load_dataset("Rowan/hellaswag", split="validation", trust_remote_code=True)
elif ds_name == "arc-challenge":
ds = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="validation", trust_remote_code=True)
elif ds_name == "arc-easy":
ds = load_dataset("allenai/ai2_arc", "ARC-Easy", split="validation", trust_remote_code=True)
elif ds_name == "oqa":
ds = load_dataset("allenai/openbookqa", split="validation", trust_remote_code=True)
elif ds_name == "wg":
ds = load_dataset("allenai/winogrande", "winogrande_xl", split="validation", trust_remote_code=True)
else:
raise f"Dataset {ds_name} is not supported yet."
return ds
def extract_input_content(ds_name, row):
if ds_name == "boolq":
return f"[passage]{row['passage']}[question]{row['question']}"
if ds_name == "hswag":
return row["ctx"]
if (ds_name == "arc-challenge") or (ds_name == "arc-easy"):
return row["question"]
if ds_name == "oqa":
return row["question_stem"]
if ds_name == "wg":
return row["sentence"]
def create_multi_choice_options(row, ds_name):
options_texts = []
content = extract_input_content(ds_name, row)
if ds_name == "boolq":
choices = ["true", "false"]
if ds_name == "hswag":
choices = row["endings"]
if (ds_name == "arc-challenge") or (ds_name == "arc-easy"):
choices = row["choices"]["text"]
if ds_name == "wg":
choices = [row["option1"], row["option2"]]
if ds_name == "oqa":
choices = row["choices"]["text"]
for choice in choices:
options_texts.append(f"<|user|>\n{content}<|end|>\n<|assistant|>{choice}<|end|>\n")
return options_texts
def extract_multi_choice_target_index(row, ds_name):
if ds_name == "boolq":
return 0 if row["answer"] is True else 1
if ds_name == "hswag":
return int(row["label"])
if (ds_name == "arc-challenge") or (ds_name == "arc-easy"):
return row["choices"]["label"].index(row["answerKey"])
if ds_name == "wg":
return int(row["answer"]) - 1
if ds_name == "oqa":
return row["choices"]["label"].index(row["answerKey"])
def set_seed(seed: int):
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
def compute_loglike_loss(logits, labels, reduction="none"):
bs = logits.size(0)
vocab_size = logits.size(-1)
labels = labels.squeeze(-1)
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss_fct = torch.nn.CrossEntropyLoss(reduction=reduction)
shift_logits = shift_logits.view(-1, vocab_size)
shift_labels = shift_labels.view(-1)
shift_labels = shift_labels.to(shift_logits.device)
loss = loss_fct(shift_logits, shift_labels)
# reshape back
if reduction == "none":
loss = loss.view((bs, -1))
non_zero_loss = (loss != 0).sum(dim=-1)
non_zero_loss[non_zero_loss == 0] = 1
loss = loss.sum(dim=-1) / non_zero_loss
return loss.float() # Convert to float32 before returning
def evaluate_on_multi_choice_batched(
eval_dataset, model, tokenizer, ds_name, labels, predictions, args, batch_size=32, max_length=512, device="cuda"
):
# Score every answer option in batches; the option with the lowest average loss becomes the prediction
model.eval()
for start in tqdm(
range(0, len(eval_dataset), batch_size), total=(len(eval_dataset) + batch_size - 1) // batch_size
):
rows = [eval_dataset[i] for i in range(start, min(start + batch_size, len(eval_dataset)))]
# Build the flattened option texts for this batch
all_texts = []
options_per_sample = [] # number of options for each sample
ctx_lens_per_option = [] # context length replicated per option
for row in rows:
# options: ["<|user|>...<|assistant|>choiceA<|end|>", ...]
options = create_multi_choice_options(row, ds_name)
options_per_sample.append(len(options))
# compute context length once per sample (minus 1 to account for the label shift in compute_loglike_loss)
content = extract_input_content(ds_name, row)
context_prompt = f"<|user|>\n{content}<|end|>\n<|assistant|>"
ctx_len = len(tokenizer.encode(context_prompt)) - 1
all_texts.extend(options)
ctx_lens_per_option.extend([ctx_len] * len(options))
# collect gold label
labels.append(extract_multi_choice_target_index(row, ds_name))
# Tokenize all options in one go
tokenized = tokenizer(
all_texts,
return_tensors="pt",
padding=True,
truncation=True,
max_length=max_length,
)
tokenized = {k: v.to(device) for k, v in tokenized.items()}
# Create masked labels: ignore context and padding
masked_labels = tokenized["input_ids"].clone()
for i, ctx_len in enumerate(ctx_lens_per_option):
masked_labels[i, :ctx_len] = -100
masked_labels[tokenized["attention_mask"] == 0] = -100
with torch.no_grad():
logits = model(input_ids=tokenized["input_ids"], attention_mask=tokenized["attention_mask"]).logits
# per-sequence losses
losses = compute_loglike_loss(logits, masked_labels, reduction="none").detach().cpu()
# Reduce per sample (argmin across its options)
idx = 0
for n_opt in options_per_sample:
pred = torch.argmin(losses[idx : idx + n_opt]).item()
predictions.append(pred)
idx += n_opt
print(
f"Accuracy for dataset {args.ds_name} and strategy {args.strategy} is: {accuracy_score(labels, predictions)}"
)
if __name__ == "__main__":
args = parse_args()
print(f"Selected strategy: {args.strategy}")
print(f"Dataset name: {args.ds_name}")
# Loading the tokeniser
tokenizer = AutoTokenizer.from_pretrained(
MODEL_NAME,
use_fast=True,
padding_side="right",
model_max_length=MODEL_MAX_LEN,
)
# Quantisation config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=False,
)
# Loading the model
base_model = AutoModelForCausalLM.from_pretrained(
MODEL_NAME,
torch_dtype=torch.bfloat16,
device_map="auto",
quantization_config=bnb_config,
)
# Loading the test dataset
test_dataset = read_test_dataset(args.ds_name)
print(f"{args.ds_name} is loaded with size: {len(test_dataset)}.")
labels, predictions = [], []
if args.strategy == "base":
# Batch-wise inference
with torch.no_grad():
evaluate_on_multi_choice_batched(
test_dataset,
base_model,
tokenizer,
args.ds_name,
labels,
predictions,
args,
batch_size=64, # tune this
max_length=512, # tune if options are long
device="cuda",
)
else:
general_adapter_paths = []
if args.strategy == "gks":
arrow_config = ArrowConfig(
top_k=3,
router_temperature=1.0,
use_gks=True,
)
# General adapter paths from the hub
general_adapter_paths = [
"TahaBa/phi3-mini-general-adapters/cluster0_batch16_prop1.0_langen/checkpoint-17",
"TahaBa/phi3-mini-general-adapters/cluster0_batch16_prop1.0_langfr/checkpoint-35",
"TahaBa/phi3-mini-general-adapters/cluster0_batch16_prop1.0_langger/checkpoint-17",
]
else:
arrow_config = ArrowConfig(
top_k=3,
router_temperature=1.0,
)
# Task-specific adapter paths from the hub
task_specific_adapter_paths = [f"TahaBa/phi3-mini-clustered-flan/ts_expert_{i}" for i in range(10)]
# Creating the Arrow model
model = create_arrow_model(
base_model=base_model,
task_specific_adapter_paths=task_specific_adapter_paths,
general_adapter_paths=general_adapter_paths,
arrow_config=arrow_config,
)
# Batch-wise inference
with torch.no_grad():
evaluate_on_multi_choice_batched(
test_dataset,
model,
tokenizer,
args.ds_name,
labels,
predictions,
args,
batch_size=32, # tune this
max_length=512, # tune if options are long
device="cuda",
)

View File

@ -0,0 +1,8 @@
torch
transformers
accelerate
datasets
scikit-learn
tqdm
numpy
bitsandbytes

View File

@ -19,9 +19,9 @@ rendered properly in your Markdown viewer.
This guide demonstrates how to use BOFT, an orthogonal fine-tuning method, to fine-tune Stable Diffusion with either `stabilityai/stable-diffusion-2-1` or `runwayml/stable-diffusion-v1-5` model for controllable generation.
By using BOFT from 🤗 PEFT, we can significantly reduce the number of trainable parameters while still achieving impressive results in various fine-tuning tasks across different foundation models. BOFT enhances model efficiency by integrating full-rank orthogonal matrices with a butterfly structure into specific model blocks, such as attention blocks, mirroring the approach used in LoRA. During fine-tuning, only these inserted matrices are trained, leaving the original model parameters untouched. During inference, the trainable BOFT paramteres can be merged into the original model, eliminating any additional computational costs.
By using BOFT from 🤗 PEFT, we can significantly reduce the number of trainable parameters while still achieving impressive results in various fine-tuning tasks across different foundation models. BOFT enhances model efficiency by integrating full-rank orthogonal matrices with a butterfly structure into specific model blocks, such as attention blocks, mirroring the approach used in LoRA. During fine-tuning, only these inserted matrices are trained, leaving the original model parameters untouched. During inference, the trainable BOFT parameters can be merged into the original model, eliminating any additional computational costs.
As a member of the **orthogonal finetuning** class, BOFT presents a systematic and principled method for fine-tuning. It possesses several unique properties and has demonstrated superior performance compared to LoRA in a variety of scenarios. For further details on BOFT, please consult the [PEFT's GitHub repo's concept guide OFT](https://https://huggingface.co/docs/peft/index), the [original BOFT paper](https://arxiv.org/abs/2311.06243) and the [original OFT paper](https://arxiv.org/abs/2306.07280).
As a member of the **orthogonal finetuning** class, BOFT presents a systematic and principled method for fine-tuning. It possesses several unique properties and has demonstrated superior performance compared to LoRA in a variety of scenarios. For further details on BOFT, please consult the [OFT concept guide in the PEFT documentation](https://huggingface.co/docs/peft/index), the [original BOFT paper](https://huggingface.co/papers/2311.06243) and the [original OFT paper](https://huggingface.co/papers/2306.07280).
In this guide we provide a controllable generation (ControlNet) fine-tuning script that is available in [PEFT's GitHub repo examples](https://github.com/huggingface/peft/tree/main/examples/boft_controlnet). This implementation is adapted from [diffusers's ControlNet](https://github.com/huggingface/diffusers/tree/main/examples/controlnet) and [Hecong Wu's ControlLoRA](https://github.com/HighCWu/ControlLoRA). You can try it out and finetune on your custom images.
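To sketch what this looks like with PEFT (a hedged, minimal example; the block size and the `to_q`/`to_k`/`to_v` module names are assumptions that depend on the UNet implementation, not the exact settings used by the script):
```python
from diffusers import UNet2DConditionModel
from peft import BOFTConfig, get_peft_model

unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")
boft_config = BOFTConfig(
    boft_block_size=8,                     # must divide the targeted layers' in_features
    boft_n_butterfly_factor=1,             # 1 corresponds to vanilla OFT blocks
    target_modules=["to_q", "to_k", "to_v"],
    boft_dropout=0.05,
    bias="boft_only",
)
unet = get_peft_model(unet, boft_config)
unet.print_trainable_parameters()
```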
@ -58,7 +58,7 @@ export DATASET_NAME="oftverse/control-celeba-hq"
## Train controllable generation (ControlNet) with BOFT
Start with setting some hyperparamters for BOFT:
Start with setting some hyperparameters for BOFT:
```bash
PEFT_TYPE="boft"
BLOCK_NUM=8
@ -174,4 +174,4 @@ accelerate launch eval.py \
--output_dir=$OUTPUT_DIR \
--dataset_name=$DATASET_NAME \
--vis_overlays \
```
```

View File

@ -13,7 +13,7 @@
# limitations under the License.
# The implementation is based on "Parameter-Efficient Orthogonal Finetuning
# via Butterfly Factorization" (https://arxiv.org/abs/2311.06243) in ICLR 2024.
# via Butterfly Factorization" (https://huggingface.co/papers/2311.06243) in ICLR 2024.
import glob
import os
@ -32,8 +32,14 @@ from utils.args_loader import parse_args
from utils.dataset import make_dataset
detect_model = face_alignment.FaceAlignment(face_alignment.LandmarksType.TWO_D, device="cuda:0", flip_input=False)
# Determine the best available device
if torch.cuda.is_available():
device = "cuda:0"
else:
# TODO: xpu support in face_alignment will be ready after this PR is merged: https://github.com/1adrianb/face-alignment/pull/371
device = "cpu"
detect_model = face_alignment.FaceAlignment(face_alignment.LandmarksType.TWO_D, device=device, flip_input=False)
# with open('./data/celebhq-text/prompt_val_blip_full.json', 'rt') as f: # fill50k, COCO
# for line in f:
# val_data = json.loads(line)

View File

@ -1,8 +1,10 @@
datasets==2.16.1
diffusers==0.17.1
transformers==4.36.2
accelerate==0.25.0
diffusers==0.34.0
transformers==4.54.0
accelerate==1.9.0
wandb==0.16.1
scikit-image==0.22.0
opencv-python==4.9.0.80
face-alignment==1.4.1
git+https://github.com/1adrianb/face-alignment.git
huggingface_hub==0.34.3
numpy<2.0.0

View File

@ -13,7 +13,7 @@
# limitations under the License.
# The implementation is based on "Parameter-Efficient Orthogonal Finetuning
# via Butterfly Factorization" (https://arxiv.org/abs/2311.06243) in ICLR 2024.
# via Butterfly Factorization" (https://huggingface.co/papers/2311.06243) in ICLR 2024.
import os
import sys
@ -42,7 +42,12 @@ from peft import PeftModel # noqa: E402
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.10.0.dev0")
device = torch.device("cuda:0")
if torch.xpu.is_available():
device = "xpu:0"
elif torch.cuda.is_available():
device = "cuda:0"
else:
device = "cpu"
def main(args):

View File

@ -13,7 +13,7 @@ export DATASET_NAME="oftverse/control-celeba-hq"
export CKPT_NAME="checkpoint-${ITER_NUM}"
export OUTPUT_DIR="./output/${DATASET_NAME}/${RUN_NAME}/${CKPT_NAME}"
export CONTROLNET_PATH="${OUTPUT_DIR}/controlnet/model.safetensors"
export UNET_PATH="${OUTPUT_DIR}/unet/${RUN_NAME}"
export UNET_PATH="${OUTPUT_DIR}/unet"
export RESULTS_PATH="${OUTPUT_DIR}/results"

View File

@ -14,7 +14,7 @@
# limitations under the License.
# The implementation is based on "Parameter-Efficient Orthogonal Finetuning
# via Butterfly Factorization" (https://arxiv.org/abs/2311.06243) in ICLR 2024.
# via Butterfly Factorization" (https://huggingface.co/papers/2311.06243) in ICLR 2024.
import itertools
import logging
@ -215,7 +215,9 @@ def main(args):
text_encoder.to(accelerator.device, dtype=weight_dtype)
if args.enable_xformers_memory_efficient_attention:
if is_xformers_available():
if accelerator.device.type == "xpu":
logger.warning("XPU doesn't support xformers yet, xformers is not applied.")
elif is_xformers_available():
import xformers
xformers_version = version.parse(xformers.__version__)
@ -513,11 +515,17 @@ def main(args):
break
# Printing the GPU memory usage details such as allocated memory, peak memory, and total memory usage
accelerator.print(f"GPU Memory before entering the train : {b2mb(tracemalloc.begin)}")
accelerator.print(f"GPU Memory consumed at the end of the train (end-begin): {tracemalloc.used}")
accelerator.print(f"GPU Peak Memory consumed during the train (max-begin): {tracemalloc.peaked}")
accelerator.print(
f"GPU Total Peak Memory consumed during the train (max): {tracemalloc.peaked + b2mb(tracemalloc.begin)}"
f"{accelerator.device.type.upper()} Memory before entering the train : {b2mb(tracemalloc.begin)}"
)
accelerator.print(
f"{accelerator.device.type.upper()} Memory consumed at the end of the train (end-begin): {tracemalloc.used}"
)
accelerator.print(
f"{accelerator.device.type.upper()} Peak Memory consumed during the train (max-begin): {tracemalloc.peaked}"
)
accelerator.print(
f"{accelerator.device.type.upper()} Total Peak Memory consumed during the train (max): {tracemalloc.peaked + b2mb(tracemalloc.begin)}"
)
accelerator.print(f"CPU Memory before entering the train : {b2mb(tracemalloc.cpu_begin)}")

View File

@ -14,13 +14,13 @@
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union
from typing import Optional, Union
import torch
from diffusers.configuration_utils import ConfigMixin, register_to_config
from diffusers.models.attention_processor import AttentionProcessor, AttnProcessor
from diffusers.models.modeling_utils import ModelMixin
from diffusers.models.unet_2d_blocks import (
from diffusers.models.unets.unet_2d_blocks import (
CrossAttnDownBlock2D,
DownBlock2D,
)
@ -34,13 +34,13 @@ logger = logging.get_logger(__name__) # pylint: disable=invalid-name
@dataclass
class ControlNetOutput(BaseOutput):
down_block_res_samples: Tuple[torch.Tensor]
down_block_res_samples: tuple[torch.Tensor]
mid_block_res_sample: torch.Tensor
class ControlNetConditioningEmbedding(nn.Module):
"""
Quoting from https://arxiv.org/abs/2302.05543: "Stable Diffusion uses a pre-processing method similar to VQ-GAN
Quoting from https://huggingface.co/papers/2302.05543: "Stable Diffusion uses a pre-processing method similar to VQ-GAN
[11] to convert the entire dataset of 512 × 512 images into smaller 64 × 64 “latent images” for stabilized
training. This requires ControlNets to convert image-based conditions to 64 × 64 feature space to match the
convolution size. We use a tiny network E(·) of four convolution layers with 4 × 4 kernels and 2 × 2 strides
@ -52,7 +52,7 @@ class ControlNetConditioningEmbedding(nn.Module):
self,
conditioning_embedding_channels: int,
conditioning_channels: int = 3,
block_out_channels: Tuple[int] = (16, 32, 96, 256),
block_out_channels: tuple[int] = (16, 32, 96, 256),
):
super().__init__()
@ -92,7 +92,7 @@ class ControlNetModel(ModelMixin, ConfigMixin):
in_channels: int = 4,
out_channels: int = 320,
controlnet_conditioning_channel_order: str = "rgb",
conditioning_embedding_out_channels: Optional[Tuple[int]] = (16, 32, 96, 256),
conditioning_embedding_out_channels: Optional[tuple[int]] = (16, 32, 96, 256),
):
super().__init__()
@ -104,7 +104,7 @@ class ControlNetModel(ModelMixin, ConfigMixin):
@property
# Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.attn_processors
def attn_processors(self) -> Dict[str, AttentionProcessor]:
def attn_processors(self) -> dict[str, AttentionProcessor]:
r"""
Returns:
`dict` of attention processors: A dictionary containing all attention processors used in the model with
@ -113,7 +113,7 @@ class ControlNetModel(ModelMixin, ConfigMixin):
# set recursively
processors = {}
def fn_recursive_add_processors(name: str, module: torch.nn.Module, processors: Dict[str, AttentionProcessor]):
def fn_recursive_add_processors(name: str, module: torch.nn.Module, processors: dict[str, AttentionProcessor]):
if hasattr(module, "set_processor"):
processors[f"{name}.processor"] = module.processor
@ -128,7 +128,7 @@ class ControlNetModel(ModelMixin, ConfigMixin):
return processors
# Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.set_attn_processor
def set_attn_processor(self, processor: Union[AttentionProcessor, Dict[str, AttentionProcessor]]):
def set_attn_processor(self, processor: Union[AttentionProcessor, dict[str, AttentionProcessor]]):
r"""
Parameters:
`processor (`dict` of `AttentionProcessor` or `AttentionProcessor`):
@ -220,7 +220,7 @@ class ControlNetModel(ModelMixin, ConfigMixin):
# Recursively walk through all the children.
# Any children which exposes the set_attention_slice method
# gets the message
def fn_recursive_set_attention_slice(module: torch.nn.Module, slice_size: List[int]):
def fn_recursive_set_attention_slice(module: torch.nn.Module, slice_size: list[int]):
if hasattr(module, "set_attention_slice"):
module.set_attention_slice(slice_size.pop())
@ -238,7 +238,7 @@ class ControlNetModel(ModelMixin, ConfigMixin):
def forward(
self,
controlnet_cond: torch.FloatTensor,
) -> Union[ControlNetOutput, Tuple]:
) -> Union[ControlNetOutput, tuple]:
# check channel order
channel_order = self.config.controlnet_conditioning_channel_order

View File

@ -13,14 +13,14 @@
# limitations under the License.
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Union
from typing import Any, Callable, Optional, Union
import numpy as np
import PIL.Image
import torch
from diffusers.pipelines.controlnet.multicontrolnet import MultiControlNetModel
from diffusers.pipelines.controlnet.pipeline_controlnet import StableDiffusionControlNetPipeline
from diffusers.utils import BaseOutput, is_compiled_module, logging
from diffusers.utils import BaseOutput, logging
from torch.nn import functional as F
from utils.light_controlnet import ControlNetModel
@ -42,8 +42,8 @@ class LightControlNetPipelineOutput(BaseOutput):
(nsfw) content, or `None` if safety checking could not be performed.
"""
images: Union[List[PIL.Image.Image], np.ndarray]
nsfw_content_detected: Optional[List[bool]]
images: Union[list[PIL.Image.Image], np.ndarray]
nsfw_content_detected: Optional[list[bool]]
class LightControlNetPipeline(StableDiffusionControlNetPipeline):
@ -164,23 +164,23 @@ class LightControlNetPipeline(StableDiffusionControlNetPipeline):
@torch.no_grad()
def __call__(
self,
prompt: Union[str, List[str]] = None,
prompt: Union[str, list[str]] = None,
image: Union[
torch.FloatTensor,
PIL.Image.Image,
np.ndarray,
List[torch.FloatTensor],
List[PIL.Image.Image],
List[np.ndarray],
list[torch.FloatTensor],
list[PIL.Image.Image],
list[np.ndarray],
] = None,
height: Optional[int] = None,
width: Optional[int] = None,
num_inference_steps: int = 50,
guidance_scale: float = 7.5,
negative_prompt: Optional[Union[str, List[str]]] = None,
negative_prompt: Optional[Union[str, list[str]]] = None,
num_images_per_prompt: Optional[int] = 1,
eta: float = 0.0,
generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
generator: Optional[Union[torch.Generator, list[torch.Generator]]] = None,
latents: Optional[torch.FloatTensor] = None,
prompt_embeds: Optional[torch.FloatTensor] = None,
negative_prompt_embeds: Optional[torch.FloatTensor] = None,
@ -188,8 +188,8 @@ class LightControlNetPipeline(StableDiffusionControlNetPipeline):
return_dict: bool = True,
callback: Optional[Callable[[int, int, torch.FloatTensor], None]] = None,
callback_steps: int = 1,
cross_attention_kwargs: Optional[Dict[str, Any]] = None,
controlnet_conditioning_scale: Union[float, List[float]] = 1.0,
cross_attention_kwargs: Optional[dict[str, Any]] = None,
controlnet_conditioning_scale: Union[float, list[float]] = 1.0,
guess_mode: bool = False,
):
r"""
@ -215,9 +215,9 @@ class LightControlNetPipeline(StableDiffusionControlNetPipeline):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://huggingface.co/papers/2207.12598).
`guidance_scale` is defined as `w` of equation 2. of [Imagen
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
Paper](https://huggingface.co/papers/2205.11487). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
negative_prompt (`str` or `List[str]`, *optional*):
@ -227,7 +227,7 @@ class LightControlNetPipeline(StableDiffusionControlNetPipeline):
num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to
Corresponds to parameter eta (η) in the DDIM paper: https://huggingface.co/papers/2010.02502. Only applies to
[`schedulers.DDIMScheduler`], will be ignored for others.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
@ -298,11 +298,11 @@ class LightControlNetPipeline(StableDiffusionControlNetPipeline):
device = self._execution_device
# here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
# of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
# of the Imagen paper: https://huggingface.co/papers/2205.11487 . `guidance_scale = 1`
# corresponds to doing no classifier free guidance.
do_classifier_free_guidance = guidance_scale > 1.0
controlnet = self.controlnet._orig_mod if is_compiled_module(self.controlnet) else self.controlnet
controlnet = self.controlnet._orig_mod if hasattr(self.controlnet, "_orig_mod") else self.controlnet
if isinstance(controlnet, MultiControlNetModel) and isinstance(controlnet_conditioning_scale, float):
controlnet_conditioning_scale = [controlnet_conditioning_scale] * len(controlnet.nets)
@ -426,7 +426,10 @@ class LightControlNetPipeline(StableDiffusionControlNetPipeline):
if hasattr(self, "final_offload_hook") and self.final_offload_hook is not None:
self.unet.to("cpu")
self.controlnet.to("cpu")
torch.cuda.empty_cache()
if torch.cuda.is_available():
torch.cuda.empty_cache()
elif torch.xpu.is_available():
torch.xpu.empty_cache()
if not output_type == "latent":
image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]

View File

@ -13,10 +13,12 @@ def b2mb(x):
# This context manager is used to track the peak memory usage of the process
class TorchTracemalloc:
def __enter__(self):
self.device_type = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
self.device_module = getattr(torch, self.device_type, torch.cuda)
gc.collect()
torch.cuda.empty_cache()
torch.cuda.reset_max_memory_allocated() # reset the peak gauge to zero
self.begin = torch.cuda.memory_allocated()
self.device_module.empty_cache()
self.device_module.reset_peak_memory_stats() # reset the peak gauge to zero
self.begin = self.device_module.memory_allocated()
self.process = psutil.Process()
self.cpu_begin = self.cpu_mem_used()
@ -46,9 +48,9 @@ class TorchTracemalloc:
self.peak_monitoring = False
gc.collect()
torch.cuda.empty_cache()
self.end = torch.cuda.memory_allocated()
self.peak = torch.cuda.max_memory_allocated()
self.device_module.empty_cache()
self.end = self.device_module.memory_allocated()
self.peak = self.device_module.max_memory_allocated()
self.used = b2mb(self.end - self.begin)
self.peaked = b2mb(self.peak - self.begin)

View File

@ -13,7 +13,7 @@
# limitations under the License.
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple, Union
from typing import Any, Optional, Union
import torch
from diffusers.models import UNet2DConditionModel
@ -44,13 +44,13 @@ class UNet2DConditionNewModel(UNet2DConditionModel):
class_labels: Optional[torch.Tensor] = None,
timestep_cond: Optional[torch.Tensor] = None,
attention_mask: Optional[torch.Tensor] = None,
cross_attention_kwargs: Optional[Dict[str, Any]] = None,
added_cond_kwargs: Optional[Dict[str, torch.Tensor]] = None,
down_block_additional_residuals: Optional[Tuple[torch.Tensor]] = None,
cross_attention_kwargs: Optional[dict[str, Any]] = None,
added_cond_kwargs: Optional[dict[str, torch.Tensor]] = None,
down_block_additional_residuals: Optional[tuple[torch.Tensor]] = None,
mid_block_additional_residual: Optional[torch.Tensor] = None,
encoder_attention_mask: Optional[torch.Tensor] = None,
return_dict: bool = True,
) -> Union[UNet2DConditionOutput, Tuple]:
) -> Union[UNet2DConditionOutput, tuple]:
r"""
Args:
sample (`torch.FloatTensor`): (batch, channel, height, width) noisy inputs tensor

View File

@ -18,9 +18,9 @@ rendered properly in your Markdown viewer.
This guide demonstrates how to use BOFT, an orthogonal fine-tuning method, to fine-tune Dreambooth with either `stabilityai/stable-diffusion-2-1` or `runwayml/stable-diffusion-v1-5` model.
By using BOFT from 🤗 PEFT, we can significantly reduce the number of trainable parameters while still achieving impressive results in various fine-tuning tasks across different foundation models. BOFT enhances model efficiency by integrating full-rank orthogonal matrices with a butterfly structure into specific model blocks, such as attention blocks, mirroring the approach used in LoRA. During fine-tuning, only these inserted matrices are trained, leaving the original model parameters untouched. During inference, the trainable BOFT paramteres can be merged into the original model, eliminating any additional computational costs.
By using BOFT from 🤗 PEFT, we can significantly reduce the number of trainable parameters while still achieving impressive results in various fine-tuning tasks across different foundation models. BOFT enhances model efficiency by integrating full-rank orthogonal matrices with a butterfly structure into specific model blocks, such as attention blocks, mirroring the approach used in LoRA. During fine-tuning, only these inserted matrices are trained, leaving the original model parameters untouched. During inference, the trainable BOFT parameters can be merged into the original model, eliminating any additional computational costs.
As a member of the **orthogonal finetuning** class, BOFT presents a systematic and principled method for fine-tuning. It possesses several unique properties and has demonstrated superior performance compared to LoRA in a variety of scenarios. For further details on BOFT, please consult the [PEFT's GitHub repo's concept guide OFT](https://https://huggingface.co/docs/peft/index), the [original BOFT paper](https://arxiv.org/abs/2311.06243) and the [original OFT paper](https://arxiv.org/abs/2306.07280).
As a member of the **orthogonal finetuning** class, BOFT presents a systematic and principled method for fine-tuning. It possesses several unique properties and has demonstrated superior performance compared to LoRA in a variety of scenarios. For further details on BOFT, please consult the [OFT concept guide in the PEFT documentation](https://huggingface.co/docs/peft/index), the [original BOFT paper](https://huggingface.co/papers/2311.06243) and the [original OFT paper](https://huggingface.co/papers/2306.07280).
In this guide we provide a Dreambooth fine-tuning script that is available in [PEFT's GitHub repo examples](https://github.com/huggingface/peft/tree/main/examples/boft_dreambooth). This implementation is adapted from [peft's lora_dreambooth](https://github.com/huggingface/peft/tree/main/examples/lora_dreambooth). You can try it out and finetune on your custom images.
@ -40,6 +40,7 @@ cd peft/examples/boft_dreambooth
Set up your environment: install PEFT, and all the required libraries. At the time of writing this guide we recommend installing PEFT from source. The following environment setup should work on A100 and H100:
### CUDA
```bash
conda create --name peft python=3.10
conda activate peft
@ -48,6 +49,16 @@ conda install xformers -c xformers
pip install -r requirements.txt
pip install git+https://github.com/huggingface/peft
```
The following environment setup is validated to work on Intel XPU:
### Intel XPU
```bash
conda create --name peft python=3.10
conda activate peft
pip install torch==2.8.0.dev20250615+xpu torchvision==0.23.0.dev20250615+xpu torchaudio==2.8.0.dev20250615+xpu --index-url https://download.pytorch.org/whl/nightly/xpu --no-cache-dir
pip install -r requirements.txt
pip install git+https://github.com/huggingface/peft
```
## Download the data
@ -92,10 +103,10 @@ To learn more about DreamBooth fine-tuning with prior-preserving loss, check out
Launch the training script with `accelerate` and pass hyperparameters, as well as LoRa-specific arguments to it such as:
- `use_boft`: Enables BOFT in the training script.
- `boft_block_size`: the BOFT matrix block size across different layers, expressed in `int`. Smaller block size results in sparser update matrices with fewer trainable paramters. **Note**, please choose it to be dividable to most layer `in_features` dimension, e.g., 4, 8, 16. Also, you can only specify either `boft_block_size` or `boft_block_num`, but not both simultaneously, because `boft_block_size` x `boft_block_num` = layer dimension.
- `boft_block_num`: the number of BOFT matrix blocks across different layers, expressed in `int`. Fewer blocks result in sparser update matrices with fewer trainable paramters. **Note**, please choose it to be dividable to most layer `in_features` dimension, e.g., 4, 8, 16. Also, you can only specify either `boft_block_size` or `boft_block_num`, but not both simultaneously, because `boft_block_size` x `boft_block_num` = layer dimension.
- `boft_n_butterfly_factor`: the number of butterfly factors. **Note**, for `boft_n_butterfly_factor=1`, BOFT is the same as vanilla OFT, for `boft_n_butterfly_factor=2`, the effective block size of OFT becomes twice as big and the number of blocks become half.
- `bias`: specify if the `bias` paramteres should be traind. Can be `none`, `all` or `boft_only`.
- `boft_block_size`: the BOFT matrix block size across different layers, expressed in `int`. Smaller block size results in sparser update matrices with fewer trainable parameters. **Note**, please choose it to be dividable to most layer `in_features` dimension, e.g., 4, 8, 16. Also, you can only specify either `boft_block_size` or `boft_block_num`, but not both simultaneously, because `boft_block_size` x `boft_block_num` = layer dimension.
- `boft_block_num`: the number of BOFT matrix blocks across different layers, expressed in `int`. Fewer blocks result in sparser update matrices with fewer trainable parameters. **Note**, please choose it to be dividable to most layer `in_features` dimension, e.g., 4, 8, 16. Also, you can only specify either `boft_block_size` or `boft_block_num`, but not both simultaneously, because `boft_block_size` x `boft_block_num` = layer dimension.
- `boft_n_butterfly_factor`: the number of butterfly factors. **Note**, for `boft_n_butterfly_factor=1`, BOFT is the same as vanilla OFT, for `boft_n_butterfly_factor=2`, the effective block size of OFT becomes twice as big and the number of blocks becomes half.
- `bias`: specify if the `bias` parameters should be trained. Can be `none`, `all` or `boft_only`.
- `boft_dropout`: specify the probability of multiplicative dropout.
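For orientation, the flags above map directly onto PEFT's `BOFTConfig`. The following is a minimal sketch rather than the training script itself; the `target_modules` values are illustrative assumptions for a Stable Diffusion UNet and may need adjusting for your model:
```python
import torch
from diffusers import UNet2DConditionModel
from peft import BOFTConfig, get_peft_model

# Load the UNet that Dreambooth fine-tunes (model id taken from this guide)
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet", torch_dtype=torch.float32
)

boft_config = BOFTConfig(
    boft_block_size=8,          # or set boft_block_num instead -- never both
    boft_n_butterfly_factor=1,  # 1 recovers vanilla OFT
    boft_dropout=0.1,           # multiplicative dropout probability
    bias="boft_only",           # "none", "all" or "boft_only"
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # illustrative assumption
)

unet = get_peft_model(unet, boft_config)
unet.print_trainable_parameters()
```
After training, the BOFT weights can be folded back into the UNet by calling `merge_and_unload()` on the returned PEFT model, which is what makes inference free of extra cost.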
Here's what the full set of script arguments may look like:

View File

@ -44,8 +44,10 @@
"outputs": [],
"source": [
"def get_boft_sd_pipeline(\n",
" ckpt_dir, base_model_name_or_path=None, epoch=int, dtype=torch.float32, device=\"cuda\", adapter_name=\"default\"\n",
" ckpt_dir, base_model_name_or_path=None, epoch=int, dtype=torch.float32, device=\"auto\", adapter_name=\"default\"\n",
"):\n",
" if device == \"auto\":\n",
" device = torch.accelerator.current_accelerator().type if hasattr(torch, \"accelerator\") else \"cuda\"\n",
"\n",
" if base_model_name_or_path is None:\n",
" raise ValueError(\"Please specify the base model name or path\")\n",
@ -152,14 +154,6 @@
"image = pipe(prompt, num_inference_steps=50, guidance_scale=7, negative_prompt=negative_prompt).images[0]\n",
"image"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f534eca2-94a4-432b-b092-7149ac44b12f",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {

View File

@ -1,13 +1,13 @@
transformers==4.36.2
accelerate==0.25.0
transformers==4.54.0
accelerate==1.9.0
evaluate
tqdm
datasets==2.16.1
diffusers==0.17.1
datasets==4.0.0
diffusers==0.34.0
Pillow
huggingface_hub
safetensors
nb_conda_kernels
ipykernel
ipywidgets
wandb==0.16.1
wandb==0.21.0

View File

@ -14,7 +14,7 @@
# limitations under the License.
# The implementation is based on "Parameter-Efficient Orthogonal Finetuning
# via Butterfly Factorization" (https://arxiv.org/abs/2311.06243) in ICLR 2024.
# via Butterfly Factorization" (https://huggingface.co/papers/2311.06243) in ICLR 2024.
import hashlib
import itertools
@ -139,7 +139,7 @@ def main(args):
cur_class_images = len(list(class_images_dir.iterdir()))
if cur_class_images < args.num_class_images:
torch_dtype = torch.float16 if accelerator.device.type == "cuda" else torch.float32
torch_dtype = torch.float16 if accelerator.device.type in ["cuda", "xpu"] else torch.float32
if args.prior_generation_precision == "fp32":
torch_dtype = torch.float32
elif args.prior_generation_precision == "fp16":
@ -176,6 +176,8 @@ def main(args):
del pipeline
if torch.cuda.is_available():
torch.cuda.empty_cache()
elif torch.xpu.is_available():
torch.xpu.empty_cache()
# Handle the repository creation
if accelerator.is_main_process:
@ -263,7 +265,9 @@ def main(args):
text_encoder.to(accelerator.device, dtype=weight_dtype)
if args.enable_xformers_memory_efficient_attention:
if is_xformers_available():
if accelerator.device.type == "xpu":
logger.warn("XPU hasn't support xformers yet, ignore it.")
elif is_xformers_available():
unet.enable_xformers_memory_efficient_attention()
else:
raise ValueError("xformers is not available. Make sure it is installed correctly")
@ -276,7 +280,7 @@ def main(args):
# Enable TF32 for faster training on Ampere GPUs,
# cf https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices
if args.allow_tf32:
if args.allow_tf32 and torch.cuda.is_available():
torch.backends.cuda.matmul.allow_tf32 = True
if args.scale_lr:
@ -581,18 +585,27 @@ def main(args):
)
del pipeline
torch.cuda.empty_cache()
if torch.cuda.is_available():
torch.cuda.empty_cache()
elif torch.xpu.is_available():
torch.xpu.empty_cache()
if global_step >= args.max_train_steps:
break
# Printing the GPU memory usage details such as allocated memory, peak memory, and total memory usage
# Printing the accelerator memory usage details such as allocated memory, peak memory, and total memory usage
if not args.no_tracemalloc:
accelerator.print(f"GPU Memory before entering the train : {b2mb(tracemalloc.begin)}")
accelerator.print(f"GPU Memory consumed at the end of the train (end-begin): {tracemalloc.used}")
accelerator.print(f"GPU Peak Memory consumed during the train (max-begin): {tracemalloc.peaked}")
accelerator.print(
f"GPU Total Peak Memory consumed during the train (max): {tracemalloc.peaked + b2mb(tracemalloc.begin)}"
f"{accelerator.device.type.upper()} Memory before entering the train : {b2mb(tracemalloc.begin)}"
)
accelerator.print(
f"{accelerator.device.type.upper()} Memory consumed at the end of the train (end-begin): {tracemalloc.used}"
)
accelerator.print(
f"{accelerator.device.type.upper()} Peak Memory consumed during the train (max-begin): {tracemalloc.peaked}"
)
accelerator.print(
f"{accelerator.device.type.upper()} Total Peak Memory consumed during the train (max): {tracemalloc.peaked + b2mb(tracemalloc.begin)}"
)
accelerator.print(f"CPU Memory before entering the train : {b2mb(tracemalloc.cpu_begin)}")

View File

@ -13,10 +13,12 @@ def b2mb(x):
# This context manager is used to track the peak memory usage of the process
class TorchTracemalloc:
def __enter__(self):
self.device_type = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
self.device_module = getattr(torch, self.device_type, torch.cuda)
gc.collect()
torch.cuda.empty_cache()
torch.cuda.reset_max_memory_allocated() # reset the peak gauge to zero
self.begin = torch.cuda.memory_allocated()
self.device_module.empty_cache()
self.device_module.reset_peak_memory_stats() # reset the peak gauge to zero
self.begin = self.device_module.memory_allocated()
self.process = psutil.Process()
self.cpu_begin = self.cpu_mem_used()
@ -46,9 +48,9 @@ class TorchTracemalloc:
self.peak_monitoring = False
gc.collect()
torch.cuda.empty_cache()
self.end = torch.cuda.memory_allocated()
self.peak = torch.cuda.max_memory_allocated()
self.device_module.empty_cache()
self.end = self.device_module.memory_allocated()
self.peak = self.device_module.max_memory_allocated()
self.used = b2mb(self.end - self.begin)
self.peaked = b2mb(self.peak - self.begin)
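Taken out of the diff, the device-agnostic bookkeeping introduced here reduces to a small reusable pattern. A minimal sketch, assuming a PyTorch build that exposes `torch.accelerator` (falling back to CUDA otherwise); `run_workload` is a placeholder for the measured code:
```python
import gc
import torch

def get_device_module():
    # Prefer the generic accelerator API when available, mirroring the change above
    device_type = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
    return device_type, getattr(torch, device_type, torch.cuda)

def measure_peak_memory(run_workload):
    device_type, device_module = get_device_module()
    gc.collect()
    device_module.empty_cache()
    device_module.reset_peak_memory_stats()  # reset the peak gauge to zero
    begin = device_module.memory_allocated()
    run_workload()
    peak = device_module.max_memory_allocated()
    return device_type, (peak - begin) // 2**20  # peak delta in MiB
```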

View File

@ -0,0 +1,96 @@
# DiSHA: Dimension-Sharding Adaptation with Fast Convergence and Fast Computation
## Introduction ([Paper](https://huggingface.co/papers/2409.15371), [code](https://github.com/JL-er/DiSHA))
Low-Rank Adaptation (LoRA) leverages the low intrinsic rank of weight updates in Large Language Models (LLMs), establishing a Parameter-Efficient Fine-Tuning (PEFT) paradigm. However, LoRA suffers from slow convergence. We introduce Dimension-Sharding Adaptation (DiSHA), which expands the PEFT design space to unlock lower intrinsic ranks and faster convergence by default. Within DiSHA's design space, we propose Block Affine Adaptation (Bone), a computationally efficient structure that delivers both high performance and efficiency. While certain DiSHA configurations may result in colinear updates to weight shards, we address this with Block Affine Transformation Adaptation (BAT), a nonlinear variant of DiSHA. BAT introduces nonlinearity by combining trainable matrices with original weight shards in a nonlinear manner, inducing nonlinearity in matrix updates without introducing additional parameters. Empirical results show that Bone, under the DiSHA framework, consistently outperforms LoRA variants in both NLG and NLU tasks, with significantly improved computational efficiency. Further analysis demonstrates that BAT enhances model capabilities by leveraging its nonlinear design.
## Quick Start
```python
import torch
from peft import BoneConfig, get_peft_model
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token_id = tokenizer.eos_token_id
bone_config = BoneConfig(
r = 64
)
#Bat performs better than Bone, but it uses more memory and is twice as slow. If you want to use the Bat method, you only need to add the parameter init_weights="bat".
# bone_config = BoneConfig(
# r = 64,
# init_weights="bat"
# )
peft_model = get_peft_model(model, bone_config)
peft_model.print_trainable_parameters()
dataset = load_dataset("imdb", split="train[:1%]")
training_args = SFTConfig(dataset_text_field="text", max_seq_length=128)
trainer = SFTTrainer(
model=peft_model,
args=training_args,
train_dataset=dataset,
processing_class=tokenizer,
)
trainer.train()
peft_model.save_pretrained("bone-llama-2-7b")
```
To utilize the fine-tuned Bone modules, simply run the following command:
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16, device_map="auto"
)
peft_model = PeftModel.from_pretrained(model, "bone-llama-2-7b")
```
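If you would rather serve the model without keeping PEFT in the inference stack, the adapter can optionally be folded back into the base weights. A minimal sketch, assuming the `bone-llama-2-7b` checkpoint produced above; the output directory name is illustrative:
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16, device_map="auto"
)
peft_model = PeftModel.from_pretrained(base, "bone-llama-2-7b")

# merge_and_unload folds the Bone weights into the base model and returns a plain transformers model
merged = peft_model.merge_and_unload()
merged.save_pretrained("bone-llama-2-7b-merged")
AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf").save_pretrained("bone-llama-2-7b-merged")
```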
## Advanced Usage
### Fine-tune
```shell
#Bat performs better than Bone, but it uses more memory and is twice as slow. If you want to use the Bat method, you only need to add the parameter init_weights="bat".
python bone_finetuning.py \
--base_model_name_or_path meta-llama/Llama-2-7b-hf \
--output_dir output/bone-llama-2-7b-metamath-10k \
--bone_r 64 \
--init_weights True \
--bits bf16 \
--data_path meta-math/MetaMathQA \
--dataset_split train[:100000] \
--dataset_field query response \
--bf16 True \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 1 \
--logging_steps 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--tf32 True \
--report_to none
```
# Citation
```bib
@misc{kang2025dishadimensionshardingadaptationlarge,
title={DiSHA: Dimension-Sharding Adaptation of Large Language Models with Fast Convergence and Fast Computation},
author={Jiale Kang},
year={2025},
eprint={2409.15371},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://huggingface.co/papers/2409.15371},
}
```

View File

@ -0,0 +1,105 @@
# Copyright 2023-present the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from dataclasses import dataclass, field
from typing import Literal, Optional
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser
from trl import SFTConfig, SFTTrainer
from peft import BoneConfig, get_peft_model
@dataclass
class ScriptArguments(SFTConfig):
# model configs
base_model_name_or_path: Optional[str] = field(
default=None, metadata={"help": "The name or path of the fp32/16 base model."}
)
bits: str = field(default="bf16", metadata={"help": "(`['bf16', 'fp16', fp32]`)"})
init_weights: Literal[True, "bat"] = field(
default=True,
metadata={
"help": ("True -> Bone; `bat` -> Bat"),
},
)
bone_r: int = field(default=16)
merge_and_save: bool = field(default=False)
# dataset configs
data_path: str = field(default="imdb", metadata={"help": "Path to the training data."})
dataset_split: str = field(default="train[:1%]", metadata={"help": "(`['train', 'test', 'eval']`):"})
dataset_field: list[str] = field(default=None, metadata={"help": "Fields of dataset input and output."})
parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]
print(script_args)
print(f"Load pre-processed residual model in {script_args.bits} bits.")
if script_args.bits in ["nf4", "fp4", "int8"]:
print("Bone currently does not support quantization.")
elif script_args.base_model_name_or_path is not None:
print(f"No available pre-processed model, manually initialize a Bone using {script_args.base_model_name_or_path}.")
model = AutoModelForCausalLM.from_pretrained(
script_args.base_model_name_or_path,
torch_dtype=(
torch.float16
if script_args.bits == "fp16"
else (torch.bfloat16 if script_args.bits == "bf16" else torch.float32)
),
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(script_args.base_model_name_or_path)
tokenizer.pad_token_id = tokenizer.eos_token_id
bone_config = BoneConfig(
r=script_args.bone_r,
target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
bias="none",
task_type="CAUSAL_LM",
init_weights=script_args.init_weights,
)
peft_model = get_peft_model(model, bone_config)
print(peft_model)
peft_model.print_trainable_parameters()
print(f"Training Bone with trl on the {script_args.data_path}[{script_args.dataset_split}] dataset.")
dataset = load_dataset(script_args.data_path, split=script_args.dataset_split)
dataset = dataset.map(
lambda example: {
"text": f"### USER: {example[script_args.dataset_field[0]]}\n### ASSISTANT: {example[script_args.dataset_field[1]]}"
}
)
trainer = SFTTrainer(
model=peft_model,
args=script_args,
train_dataset=dataset,
processing_class=tokenizer,
)
trainer.train()
trainer.save_state()
peft_model.save_pretrained(
os.path.join(script_args.output_dir, "bone_ft"),
)
if script_args.merge_and_save:
model = peft_model.merge_and_unload()
model.save_pretrained(os.path.join(script_args.output_dir, "bone_merged"))
tokenizer.save_pretrained(os.path.join(script_args.output_dir, "bone_merged"))

View File

@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"id": "71fbfca2",
"metadata": {},
"outputs": [],
@ -16,10 +16,9 @@
"from torch.utils.data import DataLoader\n",
"from transformers import default_data_collator, get_linear_schedule_with_warmup\n",
"from tqdm import tqdm\n",
"from datasets import load_dataset\n",
"\n",
"# Hyper-parameters\n",
"device = \"cuda\"\n",
"device = torch.accelerator.current_accelerator().type if hasattr(torch, \"accelerator\") else \"cuda\"\n",
"model_name_or_path = \"bigscience/bloomz-560m\"\n",
"tokenizer_name_or_path = \"bigscience/bloomz-560m\"\n",
"peft_config = LNTuningConfig(\n",
@ -48,7 +47,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"id": "e1a3648b",
"metadata": {},
"outputs": [
@ -84,9 +83,13 @@
}
],
"source": [
"from datasets import load_dataset\n",
"\n",
"dataset = load_dataset(\"ought/raft\", dataset_name)\n",
"dataset = load_dataset(\n",
" \"parquet\",\n",
" data_files={\n",
" \"train\": f\"hf://datasets/ought/raft@refs/convert/parquet/{dataset_name}/train/0000.parquet\",\n",
" \"test\": f\"hf://datasets/ought/raft@refs/convert/parquet/{dataset_name}/test/0000.parquet\"\n",
" }\n",
")\n",
"\n",
"classes = [k.replace(\"_\", \" \") for k in dataset[\"train\"].features[\"Label\"].names]\n",
"print(classes)\n",

View File

@ -1,481 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "71fbfca2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"===================================BUG REPORT===================================\n",
"Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues\n",
"For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link\n",
"================================================================================\n",
"CUDA SETUP: CUDA runtime path found: /home/sourab/miniconda3/envs/ml/lib/libcudart.so\n",
"CUDA SETUP: Highest compute capability among GPUs detected: 7.5\n",
"CUDA SETUP: Detected CUDA version 117\n",
"CUDA SETUP: Loading binary /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...\n"
]
}
],
"source": [
"from transformers import AutoModelForCausalLM\n",
"from peft import PeftModel, PeftConfig\n",
"import torch\n",
"from datasets import load_dataset\n",
"import os\n",
"from transformers import AutoTokenizer\n",
"from torch.utils.data import DataLoader\n",
"from transformers import default_data_collator, get_linear_schedule_with_warmup\n",
"from tqdm import tqdm\n",
"from datasets import load_dataset\n",
"\n",
"device = \"cuda\"\n",
"model_name_or_path = \"bigscience/bloomz-7b1\"\n",
"tokenizer_name_or_path = \"bigscience/bloomz-7b1\"\n",
"dataset_name = \"twitter_complaints\"\n",
"text_column = \"Tweet text\"\n",
"label_column = \"text_label\"\n",
"max_length = 64\n",
"lr = 1e-3\n",
"num_epochs = 50\n",
"batch_size = 8"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e1a3648b",
"metadata": {},
"outputs": [],
"source": [
"from datasets import load_dataset\n",
"\n",
"dataset = load_dataset(\"ought/raft\", dataset_name)\n",
"\n",
"classes = [k.replace(\"_\", \" \") for k in dataset[\"train\"].features[\"Label\"].names]\n",
"print(classes)\n",
"dataset = dataset.map(\n",
" lambda x: {\"text_label\": [classes[label] for label in x[\"Label\"]]},\n",
" batched=True,\n",
" num_proc=1,\n",
")\n",
"print(dataset)\n",
"dataset[\"train\"][0]"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "fe12d4d3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "10cabeec92ab428f9a660ebaecbaf865",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Running tokenizer on dataset: 0%| | 0/1 [00:00<?, ?ba/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "8a344e989ab34c71b230acee68b477e8",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Running tokenizer on dataset: 0%| | 0/4 [00:00<?, ?ba/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# data preprocessing\n",
"tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)\n",
"if tokenizer.pad_token_id is None:\n",
" tokenizer.pad_token_id = tokenizer.eos_token_id\n",
"target_max_length = max([len(tokenizer(class_label)[\"input_ids\"]) for class_label in classes])\n",
"print(target_max_length)\n",
"\n",
"\n",
"def preprocess_function(examples):\n",
" batch_size = len(examples[text_column])\n",
" inputs = [f\"{text_column} : {x} Label : \" for x in examples[text_column]]\n",
" targets = [str(x) for x in examples[label_column]]\n",
" model_inputs = tokenizer(inputs)\n",
" labels = tokenizer(targets, add_special_tokens=False) # don't add bos token because we concatenate with inputs\n",
" for i in range(batch_size):\n",
" sample_input_ids = model_inputs[\"input_ids\"][i]\n",
" label_input_ids = labels[\"input_ids\"][i] + [tokenizer.eos_token_id]\n",
" # print(i, sample_input_ids, label_input_ids)\n",
" model_inputs[\"input_ids\"][i] = sample_input_ids + label_input_ids\n",
" labels[\"input_ids\"][i] = [-100] * len(sample_input_ids) + label_input_ids\n",
" model_inputs[\"attention_mask\"][i] = [1] * len(model_inputs[\"input_ids\"][i])\n",
" # print(model_inputs)\n",
" for i in range(batch_size):\n",
" sample_input_ids = model_inputs[\"input_ids\"][i]\n",
" label_input_ids = labels[\"input_ids\"][i]\n",
" model_inputs[\"input_ids\"][i] = [tokenizer.pad_token_id] * (\n",
" max_length - len(sample_input_ids)\n",
" ) + sample_input_ids\n",
" model_inputs[\"attention_mask\"][i] = [0] * (max_length - len(sample_input_ids)) + model_inputs[\n",
" \"attention_mask\"\n",
" ][i]\n",
" labels[\"input_ids\"][i] = [-100] * (max_length - len(sample_input_ids)) + label_input_ids\n",
" model_inputs[\"input_ids\"][i] = torch.tensor(model_inputs[\"input_ids\"][i][:max_length])\n",
" model_inputs[\"attention_mask\"][i] = torch.tensor(model_inputs[\"attention_mask\"][i][:max_length])\n",
" labels[\"input_ids\"][i] = torch.tensor(labels[\"input_ids\"][i][:max_length])\n",
" model_inputs[\"labels\"] = labels[\"input_ids\"]\n",
" return model_inputs\n",
"\n",
"\n",
"processed_datasets = dataset.map(\n",
" preprocess_function,\n",
" batched=True,\n",
" num_proc=1,\n",
" remove_columns=dataset[\"train\"].column_names,\n",
" load_from_cache_file=False,\n",
" desc=\"Running tokenizer on dataset\",\n",
")\n",
"\n",
"train_dataset = processed_datasets[\"train\"]\n",
"\n",
"\n",
"train_dataloader = DataLoader(\n",
" train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2795b9d0",
"metadata": {},
"outputs": [],
"source": [
"def test_preprocess_function(examples):\n",
" batch_size = len(examples[text_column])\n",
" inputs = [f\"{text_column} : {x} Label : \" for x in examples[text_column]]\n",
" model_inputs = tokenizer(inputs)\n",
" # print(model_inputs)\n",
" for i in range(batch_size):\n",
" sample_input_ids = model_inputs[\"input_ids\"][i]\n",
" model_inputs[\"input_ids\"][i] = [tokenizer.pad_token_id] * (\n",
" max_length - len(sample_input_ids)\n",
" ) + sample_input_ids\n",
" model_inputs[\"attention_mask\"][i] = [0] * (max_length - len(sample_input_ids)) + model_inputs[\n",
" \"attention_mask\"\n",
" ][i]\n",
" model_inputs[\"input_ids\"][i] = torch.tensor(model_inputs[\"input_ids\"][i][:max_length])\n",
" model_inputs[\"attention_mask\"][i] = torch.tensor(model_inputs[\"attention_mask\"][i][:max_length])\n",
" return model_inputs\n",
"\n",
"\n",
"processed_datasets = dataset.map(\n",
" test_preprocess_function,\n",
" batched=True,\n",
" num_proc=1,\n",
" remove_columns=dataset[\"train\"].column_names,\n",
" load_from_cache_file=False,\n",
" desc=\"Running tokenizer on dataset\",\n",
")\n",
"\n",
"eval_dataset = processed_datasets[\"train\"]\n",
"test_dataset = processed_datasets[\"test\"]\n",
"\n",
"eval_dataloader = DataLoader(eval_dataset, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True)\n",
"test_dataloader = DataLoader(test_dataset, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True)\n",
"print(next(iter(eval_dataloader)))\n",
"print(next(iter(test_dataloader)))"
]
},
{
"cell_type": "markdown",
"id": "42b14a11",
"metadata": {},
"source": [
"You can load model from hub or local\n",
"\n",
"- Load model from Hugging Face Hub, you can change to your own model id\n",
"```python\n",
"peft_model_id = \"username/twitter_complaints_bigscience_bloomz-7b1_LORA_CAUSAL_LM\"\n",
"```\n",
"- Or load model form local\n",
"```python\n",
"peft_model_id = \"twitter_complaints_bigscience_bloomz-7b1_LORA_CAUSAL_LM\"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "9caac014",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/sourab/pet/src/peft/tuners/lora.py:143: UserWarning: fan_in_fan_out is set to True but the target module is not a Conv1D. Setting fan_in_fan_out to False.\n",
" warnings.warn(\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "bc38030106a14173a1363eb1ee388eda",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| | 0.00/15.8M [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from peft import PeftModel, PeftConfig\n",
"\n",
"max_memory = {0: \"1GIB\", 1: \"1GIB\", 2: \"2GIB\", 3: \"10GIB\", \"cpu\": \"30GB\"}\n",
"peft_model_id = \"smangrul/twitter_complaints_bigscience_bloomz-7b1_LORA_CAUSAL_LM\"\n",
"config = PeftConfig.from_pretrained(peft_model_id)\n",
"model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, device_map=\"auto\", max_memory=max_memory)\n",
"model = PeftModel.from_pretrained(model, peft_model_id, device_map=\"auto\", max_memory=max_memory)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "6fac10b5",
"metadata": {},
"outputs": [],
"source": [
"# model"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "2a08ee6d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'base_model.model.transformer.word_embeddings': 3,\n",
" 'base_model.model.lm_head': 3,\n",
" 'base_model.model.transformer.word_embeddings_layernorm': 3,\n",
" 'base_model.model.transformer.h.0': 3,\n",
" 'base_model.model.transformer.h.1': 3,\n",
" 'base_model.model.transformer.h.2': 3,\n",
" 'base_model.model.transformer.h.3': 3,\n",
" 'base_model.model.transformer.h.4': 3,\n",
" 'base_model.model.transformer.h.5': 3,\n",
" 'base_model.model.transformer.h.6': 3,\n",
" 'base_model.model.transformer.h.7': 3,\n",
" 'base_model.model.transformer.h.8': 'cpu',\n",
" 'base_model.model.transformer.h.9': 'cpu',\n",
" 'base_model.model.transformer.h.10': 'cpu',\n",
" 'base_model.model.transformer.h.11': 'cpu',\n",
" 'base_model.model.transformer.h.12': 'cpu',\n",
" 'base_model.model.transformer.h.13': 'cpu',\n",
" 'base_model.model.transformer.h.14': 'cpu',\n",
" 'base_model.model.transformer.h.15': 'cpu',\n",
" 'base_model.model.transformer.h.16': 'cpu',\n",
" 'base_model.model.transformer.h.17': 'cpu',\n",
" 'base_model.model.transformer.h.18': 'cpu',\n",
" 'base_model.model.transformer.h.19': 'cpu',\n",
" 'base_model.model.transformer.h.20': 'cpu',\n",
" 'base_model.model.transformer.h.21': 'cpu',\n",
" 'base_model.model.transformer.h.22': 'cpu',\n",
" 'base_model.model.transformer.h.23': 'cpu',\n",
" 'base_model.model.transformer.h.24': 'cpu',\n",
" 'base_model.model.transformer.h.25': 'cpu',\n",
" 'base_model.model.transformer.h.26': 'cpu',\n",
" 'base_model.model.transformer.h.27': 'cpu',\n",
" 'base_model.model.transformer.h.28': 'cpu',\n",
" 'base_model.model.transformer.h.29': 'cpu',\n",
" 'base_model.model.transformer.ln_f': 'cpu'}"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.hf_device_map"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "b33be5e6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"@HondaCustSvc Your customer service has been horrible during the recall process. I will never purchase a Honda again.\n",
"{'input_ids': tensor([[227985, 5484, 915, 2566, 216744, 38, 1316, 54, 42705,\n",
" 32465, 52166, 9440, 1809, 3784, 88483, 9411, 368, 84342,\n",
" 4451, 17, 473, 2152, 11705, 82406, 267, 51591, 5734,\n",
" 17, 77658, 915, 210]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 1, 1, 1, 1, 1, 1, 1]])}\n",
"tensor([[227985, 5484, 915, 2566, 216744, 38, 1316, 54, 42705,\n",
" 32465, 52166, 9440, 1809, 3784, 88483, 9411, 368, 84342,\n",
" 4451, 17, 473, 2152, 11705, 82406, 267, 51591, 5734,\n",
" 17, 77658, 915, 210, 16449, 5952, 3, 3, 3,\n",
" 3, 3, 3, 3, 3]])\n",
"['Tweet text : @HondaCustSvc Your customer service has been horrible during the recall process. I will never purchase a Honda again. Label : complaint']\n"
]
}
],
"source": [
"model.eval()\n",
"i = 89\n",
"inputs = tokenizer(f'{text_column} : {dataset[\"test\"][i][\"Tweet text\"]} Label : ', return_tensors=\"pt\")\n",
"print(dataset[\"test\"][i][\"Tweet text\"])\n",
"print(inputs)\n",
"\n",
"with torch.no_grad():\n",
" outputs = model.generate(input_ids=inputs[\"input_ids\"], max_new_tokens=10)\n",
" print(outputs)\n",
" print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "b6d6cd5b",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [01:42<00:00, 14.70s/it]\n"
]
}
],
"source": [
"model.eval()\n",
"eval_preds = []\n",
"for _, batch in enumerate(tqdm(eval_dataloader)):\n",
" batch = {k: v for k, v in batch.items() if k != \"labels\"}\n",
" with torch.no_grad():\n",
" outputs = model.generate(**batch, max_new_tokens=10)\n",
" preds = outputs[:, max_length:].detach().cpu().numpy()\n",
" eval_preds.extend(tokenizer.batch_decode(preds, skip_special_tokens=True))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "61264abe",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"accuracy=100.0\n",
"eval_preds[:10]=['no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint', 'no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint']\n",
"dataset['train'][label_column][:10]=['no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint', 'no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint']\n"
]
}
],
"source": [
"correct = 0\n",
"total = 0\n",
"for pred, true in zip(eval_preds, dataset[\"train\"][label_column]):\n",
" if pred.strip() == true.strip():\n",
" correct += 1\n",
" total += 1\n",
"accuracy = correct / total * 100\n",
"print(f\"{accuracy=}\")\n",
"print(f\"{eval_preds[:10]=}\")\n",
"print(f\"{dataset['train'][label_column][:10]=}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a70802a3",
"metadata": {},
"outputs": [],
"source": [
"model.eval()\n",
"test_preds = []\n",
"\n",
"for _, batch in enumerate(tqdm(test_dataloader)):\n",
" batch = {k: v for k, v in batch.items() if k != \"labels\"}\n",
" with torch.no_grad():\n",
" outputs = model.generate(**batch, max_new_tokens=10)\n",
" preds = outputs[:, max_length:].detach().cpu().numpy()\n",
" test_preds.extend(tokenizer.batch_decode(preds, skip_special_tokens=True))\n",
" if len(test_preds) > 100:\n",
" break\n",
"test_preds"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e1c4ad9c",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
},
"vscode": {
"interpreter": {
"hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@ -61,9 +61,11 @@ def b2mb(x):
class TorchTracemalloc:
def __enter__(self):
gc.collect()
torch.cuda.empty_cache()
torch.cuda.reset_max_memory_allocated() # reset the peak gauge to zero
self.begin = torch.cuda.memory_allocated()
self.device_type = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
self.device_module = getattr(torch, self.device_type, torch.cuda)
self.device_module.empty_cache()
self.device_module.reset_peak_memory_stats() # reset the peak gauge to zero
self.begin = self.device_module.memory_allocated()
self.process = psutil.Process()
self.cpu_begin = self.cpu_mem_used()
@ -93,9 +95,9 @@ class TorchTracemalloc:
self.peak_monitoring = False
gc.collect()
torch.cuda.empty_cache()
self.end = torch.cuda.memory_allocated()
self.peak = torch.cuda.max_memory_allocated()
self.device_module.empty_cache()
self.end = self.device_module.memory_allocated()
self.peak = self.device_module.max_memory_allocated()
self.used = b2mb(self.end - self.begin)
self.peaked = b2mb(self.peak - self.begin)
@ -120,7 +122,13 @@ def main():
do_test = False
set_seed(seed)
dataset = load_dataset("ought/raft", dataset_name)
dataset = load_dataset(
"parquet",
data_files={
"train": f"hf://datasets/ought/raft@refs/convert/parquet/{dataset_name}/train/0000.parquet",
"test": f"hf://datasets/ought/raft@refs/convert/parquet/{dataset_name}/test/0000.parquet",
},
)
classes = [k.replace("_", " ") for k in dataset["train"].features["Label"].names]
dataset = dataset.map(
lambda x: {"text_label": [classes[label] for label in x["Label"]]},
@ -162,7 +170,6 @@ def main():
batch_size = len(examples[text_column])
inputs = [f"{text_column} : {x} Label : " for x in examples[text_column]]
model_inputs = tokenizer(inputs)
# print(model_inputs)
for i in range(batch_size):
sample_input_ids = model_inputs["input_ids"][i]
model_inputs["input_ids"][i] = [tokenizer.pad_token_id] * (
@ -248,12 +255,18 @@ def main():
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
# Printing the GPU memory usage details such as allocated memory, peak memory, and total memory usage
accelerator.print(f"GPU Memory before entering the train : {b2mb(tracemalloc.begin)}")
accelerator.print(f"GPU Memory consumed at the end of the train (end-begin): {tracemalloc.used}")
accelerator.print(f"GPU Peak Memory consumed during the train (max-begin): {tracemalloc.peaked}")
# Printing the memory usage details such as allocated memory, peak memory, and total memory usage
accelerator.print(
f"GPU Total Peak Memory consumed during the train (max): {tracemalloc.peaked + b2mb(tracemalloc.begin)}"
f"{accelerator.device.type.upper()} Memory before entering the train : {b2mb(tracemalloc.begin)}"
)
accelerator.print(
f"{accelerator.device.type.upper()} Memory consumed at the end of the train (end-begin): {tracemalloc.used}"
)
accelerator.print(
f"{accelerator.device.type.upper()} Peak Memory consumed during the train (max-begin): {tracemalloc.peaked}"
)
accelerator.print(
f"{accelerator.device.type.upper()} Total Peak Memory consumed during the train (max): {tracemalloc.peaked + b2mb(tracemalloc.begin)}"
)
accelerator.print(f"CPU Memory before entering the train : {b2mb(tracemalloc.cpu_begin)}")
@ -280,12 +293,18 @@ def main():
preds = preds[:, max_length:].detach().cpu().numpy()
eval_preds.extend(tokenizer.batch_decode(preds, skip_special_tokens=True))
# Printing the GPU memory usage details such as allocated memory, peak memory, and total memory usage
accelerator.print(f"GPU Memory before entering the eval : {b2mb(tracemalloc.begin)}")
accelerator.print(f"GPU Memory consumed at the end of the eval (end-begin): {tracemalloc.used}")
accelerator.print(f"GPU Peak Memory consumed during the eval (max-begin): {tracemalloc.peaked}")
# Printing the memory usage details such as allocated memory, peak memory, and total memory usage
accelerator.print(
f"GPU Total Peak Memory consumed during the eval (max): {tracemalloc.peaked + b2mb(tracemalloc.begin)}"
f"{accelerator.device.type.upper()} Memory before entering the eval : {b2mb(tracemalloc.begin)}"
)
accelerator.print(
f"{accelerator.device.type.upper()} Memory consumed at the end of the eval (end-begin): {tracemalloc.used}"
)
accelerator.print(
f"{accelerator.device.type.upper()} Peak Memory consumed during the eval (max-begin): {tracemalloc.peaked}"
)
accelerator.print(
f"{accelerator.device.type.upper()} Total Peak Memory consumed during the eval (max): {tracemalloc.peaked + b2mb(tracemalloc.begin)}"
)
accelerator.print(f"CPU Memory before entering the eval : {b2mb(tracemalloc.cpu_begin)}")
@ -297,9 +316,9 @@ def main():
correct = 0
total = 0
assert len(eval_preds) == len(
dataset["train"][label_column]
), f"{len(eval_preds)} != {len(dataset['train'][label_column])}"
assert len(eval_preds) == len(dataset["train"][label_column]), (
f"{len(eval_preds)} != {len(dataset['train'][label_column])}"
)
for pred, true in zip(eval_preds, dataset["train"][label_column]):
if pred.strip() == true.strip():
correct += 1

View File

@ -26,14 +26,13 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"id": "6f864c90",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"3\"\n",
"os.environ[\"WANDB_PROJECT\"] = \"PeftExamples\"\n",
"import transformers\n",
"from peft import (\n",
@ -168,7 +167,7 @@
"model = AutoModelForCausalLM.from_pretrained(\n",
" model_name,\n",
" low_cpu_mem_usage=True\n",
" # use_flash_attention_2=True, # leading to an error\n",
" # attn_implementation =\"flash_attention_2\", # leading to an error\n",
")\n",
"model.resize_token_embeddings(len(tokenizer))"
]
@ -740,7 +739,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": null,
"id": "71851793",
"metadata": {},
"outputs": [
@ -763,7 +762,8 @@
"context = dataset[\"test\"][i][\"context\"]\n",
"\n",
"batch = tokenizer(context, return_tensors=\"pt\")\n",
"batch = {k: v.to(\"cuda\") for k, v in batch.items()}\n",
"device = torch.accelerator.current_accelerator().type if hasattr(torch, \"accelerator\") else \"cuda\"\n",
"batch = {k: v.to(device) for k, v in batch.items()}\n",
"model.eval()\n",
"output_tokens = model.generate(\n",
" **batch,\n",
@ -892,7 +892,7 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": null,
"id": "589c46d7-d567-40b4-ab7d-e0a9e1cab40e",
"metadata": {},
"outputs": [
@ -956,12 +956,12 @@
"inference_model = AutoModelForCausalLM.from_pretrained(\n",
" model_name,\n",
" low_cpu_mem_usage=True,\n",
" # use_flash_attention_2=True,\n",
" # attn_implementation =\"flash_attention_2\",\n",
")\n",
"inference_model.resize_token_embeddings(len(tokenizer))\n",
"\n",
"inference_model = PeftModel.from_pretrained(inference_model, \"smangrul/mistral_lora_clm_with_added_tokens\")\n",
"inference_model.to(\"cuda\")\n",
"inference_model.to(device)\n",
"inference_model.eval()\n",
"\n",
"output_tokens = inference_model.generate(\n",

View File

@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"id": "71fbfca2",
"metadata": {},
"outputs": [],
@ -16,9 +16,8 @@
"from torch.utils.data import DataLoader\n",
"from transformers import default_data_collator, get_linear_schedule_with_warmup\n",
"from tqdm import tqdm\n",
"from datasets import load_dataset\n",
"\n",
"device = \"cuda\"\n",
"device = torch.accelerator.current_accelerator().type if hasattr(torch, \"accelerator\") else \"cuda\"\n",
"model_name_or_path = \"bigscience/bloomz-560m\"\n",
"tokenizer_name_or_path = \"bigscience/bloomz-560m\"\n",
"peft_config = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=30)\n",
@ -37,7 +36,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"id": "e1a3648b",
"metadata": {},
"outputs": [
@ -102,9 +101,14 @@
}
],
"source": [
"from datasets import load_dataset\n",
"dataset = load_dataset(\n",
" \"parquet\",\n",
" data_files={\n",
" \"train\": f\"hf://datasets/ought/raft@refs/convert/parquet/{dataset_name}/train/0000.parquet\",\n",
" \"test\": f\"hf://datasets/ought/raft@refs/convert/parquet/{dataset_name}/test/0000.parquet\"\n",
" }\n",
")\n",
"\n",
"dataset = load_dataset(\"ought/raft\", dataset_name)\n",
"\n",
"classes = [k.replace(\"_\", \" \") for k in dataset[\"train\"].features[\"Label\"].names]\n",
"print(classes)\n",
@ -318,24 +322,6 @@
"model.print_trainable_parameters()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "bd419634",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"trainable params: 1474560 || all params: 560689152 || trainable%: 0.26299064191632515\n"
]
}
],
"source": [
"model.print_trainable_parameters()"
]
},
{
"cell_type": "code",
"execution_count": null,
@ -1276,7 +1262,7 @@
"metadata": {},
"outputs": [],
"source": [
"ckpt = f\"{peft_model_id}/adapter_model.bin\"\n",
"ckpt = f\"{peft_model_id}/adapter_model.safetensors\"\n",
"!du -h $ckpt"
]
},

View File

@ -16,9 +16,8 @@
"from torch.utils.data import DataLoader\n",
"from transformers import default_data_collator, get_linear_schedule_with_warmup\n",
"from tqdm import tqdm\n",
"from datasets import load_dataset\n",
"\n",
"device = \"cuda\"\n",
"device = torch.accelerator.current_accelerator().type if hasattr(torch, \"accelerator\") else \"cuda\"\n",
"model_name_or_path = \"bigscience/bloomz-560m\"\n",
"tokenizer_name_or_path = \"bigscience/bloomz-560m\"\n",
"peft_config = PromptTuningConfig(\n",
@ -48,9 +47,13 @@
"metadata": {},
"outputs": [],
"source": [
"from datasets import load_dataset\n",
"\n",
"dataset = load_dataset(\"ought/raft\", dataset_name)\n",
"dataset = load_dataset(\n",
" \"parquet\",\n",
" data_files={\n",
" \"train\": f\"hf://datasets/ought/raft@refs/convert/parquet/{dataset_name}/train/0000.parquet\",\n",
" \"test\": f\"hf://datasets/ought/raft@refs/convert/parquet/{dataset_name}/test/0000.parquet\"\n",
" }\n",
")\n",
"\n",
"classes = [k.replace(\"_\", \" \") for k in dataset[\"train\"].features[\"Label\"].names]\n",
"print(classes)\n",
@ -1115,24 +1118,12 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": null,
"id": "4928c7f1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"To disable this warning, you can either:\n",
"\t- Avoid using `tokenizers` before the fork if possible\n",
"\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
"36K\tbigscience/bloomz-560m_PROMPT_TUNING_CAUSAL_LM/adapter_model.bin\n"
]
}
],
"outputs": [],
"source": [
"ckpt = f\"{peft_model_id}/adapter_model.bin\"\n",
"ckpt = f\"{peft_model_id}/adapter_model.safetensors\"\n",
"!du -h $ckpt"
]
},

View File

@ -1,6 +1,7 @@
transformers
transformers<4.54.0
accelerate
evaluate
deepspeed
tqdm
datasets
dataclass-csv
datasets==3.6.0

View File

@ -9,12 +9,13 @@
},
"outputs": [],
"source": [
"import torch\n",
"from datasets import load_dataset\n",
"from transformers import set_seed, AutoModelForSeq2SeqLM, AutoTokenizer\n",
"from peft import get_peft_model, MultitaskPromptTuningConfig, TaskType, MultitaskPromptTuningInit\n",
"\n",
"set_seed(42)\n",
"\n",
"device = torch.accelerator.current_accelerator().type if hasattr(torch, \"accelerator\") else \"cuda\"\n",
"model_name = \"google/flan-t5-base\"\n",
"\n",
"peft_config = MultitaskPromptTuningConfig(\n",
@ -31,18 +32,18 @@
"model = AutoModelForSeq2SeqLM.from_pretrained(model_name)\n",
"model = get_peft_model(model, peft_config)\n",
"\n",
"model = model.cuda()\n",
"model = model.to(device)\n",
"\n",
"\n",
"def send_to_device(batch):\n",
" for i in batch:\n",
" batch[i] = batch[i].cuda()\n",
" batch[i] = batch[i].to(device)\n",
" return batch"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 9,
"id": "eb112bc1-ffaf-49fa-a216-0d601ec304ee",
"metadata": {
"tags": []
@ -86,7 +87,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 10,
"id": "e5a16ec4-8fef-4ba9-95b6-a661eb51e50c",
"metadata": {
"tags": []
@ -159,7 +160,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 11,
"id": "cceecc94-f43a-4f62-8d45-926f2f02f36d",
"metadata": {
"tags": []
@ -293,7 +294,7 @@
" num_tasks=1,\n",
" task_type=TaskType.SEQ_2_SEQ_LM,\n",
" prompt_tuning_init=MultitaskPromptTuningInit.EXACT_SOURCE_TASK,\n",
" prompt_tuning_init_state_dict_path=\"checkpoints_source/50000/adapter_model.bin\",\n",
" prompt_tuning_init_state_dict_path=\"checkpoints_source/50000/adapter_model.safetensors\",\n",
" num_virtual_tokens=50,\n",
" num_transformer_submodules=1,\n",
")\n",
@ -302,7 +303,7 @@
"model = AutoModelForSeq2SeqLM.from_pretrained(model_name)\n",
"model = get_peft_model(model, peft_config)\n",
"\n",
"model = model.cuda()"
"model = model.to(device)"
]
},
{
@ -360,8 +361,9 @@
"source": [
"# load last checkpoint for now\n",
"from peft import set_peft_model_state_dict\n",
"from safetensors.torch import load_file\n",
"\n",
"sd_6000 = torch.load(\"checkpoints_target/6000/adapter_model.bin\")\n",
"sd_6000 = load_file(\"checkpoints_target/6000/adapter_model.safetensors\")\n",
"set_peft_model_state_dict(model, sd_6000)\n",
"\n",
"# evaluate val\n",
@ -382,6 +384,22 @@
"f1 = {f1}\"\"\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1d18325c-9607-4cb5-a5b0-5b44dfee2a75",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "43988e92-af42-45cb-8bca-f19c193ad04f",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@ -400,7 +418,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
"version": "3.11.13"
}
},
"nbformat": 4,

View File

@ -11,7 +11,7 @@ from peft import AdaLoraConfig, PeftConfig, PeftModel, TaskType, get_peft_model
os.environ["TOKENIZERS_PARALLELISM"] = "false"
device = "cuda"
device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
model_name_or_path = "facebook/bart-base"
tokenizer_name_or_path = "facebook/bart-base"
@ -24,6 +24,20 @@ num_epochs = 8
batch_size = 8
# loading dataset
dataset = load_dataset("financial_phrasebank", "sentences_allagree")
dataset = dataset["train"].train_test_split(test_size=0.1)
dataset["validation"] = dataset["test"]
del dataset["test"]
classes = dataset["train"].features["label"].names
dataset = dataset.map(
lambda x: {"text_label": [classes[label] for label in x["label"]]},
batched=True,
num_proc=1,
)
# creating model
peft_config = AdaLoraConfig(
init_r=12,
@ -37,6 +51,7 @@ peft_config = AdaLoraConfig(
lora_dropout=0.1,
task_type=TaskType.SEQ_2_SEQ_LM,
inference_mode=False,
total_step=len(dataset["train"]) * num_epochs,
)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
@ -44,20 +59,6 @@ model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# loading dataset
dataset = load_dataset("financial_phrasebank", "sentences_allagree")
dataset = dataset["train"].train_test_split(test_size=0.1)
dataset["validation"] = dataset["test"]
del dataset["test"]
classes = dataset["train"].features["label"].names
dataset = dataset.map(
lambda x: {"text_label": [classes[label] for label in x["label"]]},
batched=True,
num_proc=1,
)
# data preprocessing
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
@ -159,7 +160,7 @@ peft_model_id = f"{model_name_or_path}_{peft_config.peft_type}_{peft_config.task
model.save_pretrained(peft_model_id)
ckpt = f"{peft_model_id}/adapter_model.bin"
ckpt = f"{peft_model_id}/adapter_model.safetensors"
# get_ipython().system('du -h $ckpt')

View File

@ -2,7 +2,8 @@
"cells": [
{
"cell_type": "code",
"execution_count": 12,
"execution_count": null,
"id": "0c152fc8",
"metadata": {
"id": "5f93b7d1"
},
@ -22,7 +23,7 @@
"from tqdm import tqdm\n",
"from datasets import load_dataset\n",
"\n",
"device = \"cuda\"\n",
"device = torch.accelerator.current_accelerator().type if hasattr(torch, \"accelerator\") else \"cuda\"\n",
"model_name_or_path = \"bigscience/mt0-large\"\n",
"tokenizer_name_or_path = \"bigscience/mt0-large\"\n",
"\n",
@ -37,7 +38,8 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 2,
"id": "4e23624f",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@ -49,10 +51,10 @@
{
"data": {
"text/plain": [
"<module 'peft' from '/usr/local/lib/python3.10/dist-packages/peft/__init__.py'>"
"<module 'peft' from '/usr/local/lib/python3.11/dist-packages/peft/__init__.py'>"
]
},
"execution_count": 13,
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
@ -65,7 +67,8 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": null,
"id": "da74b569",
"metadata": {
"id": "8d0850ac"
},
@ -79,7 +82,8 @@
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 4,
"id": "df33fce2",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@ -233,7 +237,7 @@
")"
]
},
"execution_count": 15,
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
@ -244,7 +248,8 @@
},
{
"cell_type": "code",
"execution_count": 16,
"execution_count": 5,
"id": "63d7bc2d",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@ -257,7 +262,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"trainable params: 282,624 || all params: 1,229,863,936 || trainable%: 0.022980103060766553\n"
"trainable params: 282,624 || all params: 1,229,863,936 || trainable%: 0.0230\n"
]
},
{
@ -276,11 +281,11 @@
" (SelfAttention): MT5Attention(\n",
" (q): Linear(in_features=1024, out_features=1024, bias=False)\n",
" (k): Linear(\n",
" in_features=1024, out_features=1024, bias=False\n",
" (base_layer): Linear(in_features=1024, out_features=1024, bias=False)\n",
" (ia3_l): ParameterDict( (default): Parameter containing: [torch.FloatTensor of size 1024x1])\n",
" )\n",
" (v): Linear(\n",
" in_features=1024, out_features=1024, bias=False\n",
" (base_layer): Linear(in_features=1024, out_features=1024, bias=False)\n",
" (ia3_l): ParameterDict( (default): Parameter containing: [torch.FloatTensor of size 1024x1])\n",
" )\n",
" (o): Linear(in_features=1024, out_features=1024, bias=False)\n",
@ -293,7 +298,7 @@
" (DenseReluDense): MT5DenseGatedActDense(\n",
" (wi_0): Linear(in_features=1024, out_features=2816, bias=False)\n",
" (wi_1): Linear(\n",
" in_features=1024, out_features=2816, bias=False\n",
" (base_layer): Linear(in_features=1024, out_features=2816, bias=False)\n",
" (ia3_l): ParameterDict( (default): Parameter containing: [torch.FloatTensor of size 2816x1])\n",
" )\n",
" (wo): Linear(in_features=2816, out_features=1024, bias=False)\n",
@ -311,11 +316,11 @@
" (SelfAttention): MT5Attention(\n",
" (q): Linear(in_features=1024, out_features=1024, bias=False)\n",
" (k): Linear(\n",
" in_features=1024, out_features=1024, bias=False\n",
" (base_layer): Linear(in_features=1024, out_features=1024, bias=False)\n",
" (ia3_l): ParameterDict( (default): Parameter containing: [torch.FloatTensor of size 1024x1])\n",
" )\n",
" (v): Linear(\n",
" in_features=1024, out_features=1024, bias=False\n",
" (base_layer): Linear(in_features=1024, out_features=1024, bias=False)\n",
" (ia3_l): ParameterDict( (default): Parameter containing: [torch.FloatTensor of size 1024x1])\n",
" )\n",
" (o): Linear(in_features=1024, out_features=1024, bias=False)\n",
@ -327,7 +332,7 @@
" (DenseReluDense): MT5DenseGatedActDense(\n",
" (wi_0): Linear(in_features=1024, out_features=2816, bias=False)\n",
" (wi_1): Linear(\n",
" in_features=1024, out_features=2816, bias=False\n",
" (base_layer): Linear(in_features=1024, out_features=2816, bias=False)\n",
" (ia3_l): ParameterDict( (default): Parameter containing: [torch.FloatTensor of size 2816x1])\n",
" )\n",
" (wo): Linear(in_features=2816, out_features=1024, bias=False)\n",
@ -352,11 +357,11 @@
" (SelfAttention): MT5Attention(\n",
" (q): Linear(in_features=1024, out_features=1024, bias=False)\n",
" (k): Linear(\n",
" in_features=1024, out_features=1024, bias=False\n",
" (base_layer): Linear(in_features=1024, out_features=1024, bias=False)\n",
" (ia3_l): ParameterDict( (default): Parameter containing: [torch.FloatTensor of size 1024x1])\n",
" )\n",
" (v): Linear(\n",
" in_features=1024, out_features=1024, bias=False\n",
" (base_layer): Linear(in_features=1024, out_features=1024, bias=False)\n",
" (ia3_l): ParameterDict( (default): Parameter containing: [torch.FloatTensor of size 1024x1])\n",
" )\n",
" (o): Linear(in_features=1024, out_features=1024, bias=False)\n",
@ -369,11 +374,11 @@
" (EncDecAttention): MT5Attention(\n",
" (q): Linear(in_features=1024, out_features=1024, bias=False)\n",
" (k): Linear(\n",
" in_features=1024, out_features=1024, bias=False\n",
" (base_layer): Linear(in_features=1024, out_features=1024, bias=False)\n",
" (ia3_l): ParameterDict( (default): Parameter containing: [torch.FloatTensor of size 1024x1])\n",
" )\n",
" (v): Linear(\n",
" in_features=1024, out_features=1024, bias=False\n",
" (base_layer): Linear(in_features=1024, out_features=1024, bias=False)\n",
" (ia3_l): ParameterDict( (default): Parameter containing: [torch.FloatTensor of size 1024x1])\n",
" )\n",
" (o): Linear(in_features=1024, out_features=1024, bias=False)\n",
@ -385,7 +390,7 @@
" (DenseReluDense): MT5DenseGatedActDense(\n",
" (wi_0): Linear(in_features=1024, out_features=2816, bias=False)\n",
" (wi_1): Linear(\n",
" in_features=1024, out_features=2816, bias=False\n",
" (base_layer): Linear(in_features=1024, out_features=2816, bias=False)\n",
" (ia3_l): ParameterDict( (default): Parameter containing: [torch.FloatTensor of size 2816x1])\n",
" )\n",
" (wo): Linear(in_features=2816, out_features=1024, bias=False)\n",
@ -403,11 +408,11 @@
" (SelfAttention): MT5Attention(\n",
" (q): Linear(in_features=1024, out_features=1024, bias=False)\n",
" (k): Linear(\n",
" in_features=1024, out_features=1024, bias=False\n",
" (base_layer): Linear(in_features=1024, out_features=1024, bias=False)\n",
" (ia3_l): ParameterDict( (default): Parameter containing: [torch.FloatTensor of size 1024x1])\n",
" )\n",
" (v): Linear(\n",
" in_features=1024, out_features=1024, bias=False\n",
" (base_layer): Linear(in_features=1024, out_features=1024, bias=False)\n",
" (ia3_l): ParameterDict( (default): Parameter containing: [torch.FloatTensor of size 1024x1])\n",
" )\n",
" (o): Linear(in_features=1024, out_features=1024, bias=False)\n",
@ -419,11 +424,11 @@
" (EncDecAttention): MT5Attention(\n",
" (q): Linear(in_features=1024, out_features=1024, bias=False)\n",
" (k): Linear(\n",
" in_features=1024, out_features=1024, bias=False\n",
" (base_layer): Linear(in_features=1024, out_features=1024, bias=False)\n",
" (ia3_l): ParameterDict( (default): Parameter containing: [torch.FloatTensor of size 1024x1])\n",
" )\n",
" (v): Linear(\n",
" in_features=1024, out_features=1024, bias=False\n",
" (base_layer): Linear(in_features=1024, out_features=1024, bias=False)\n",
" (ia3_l): ParameterDict( (default): Parameter containing: [torch.FloatTensor of size 1024x1])\n",
" )\n",
" (o): Linear(in_features=1024, out_features=1024, bias=False)\n",
@ -435,7 +440,7 @@
" (DenseReluDense): MT5DenseGatedActDense(\n",
" (wi_0): Linear(in_features=1024, out_features=2816, bias=False)\n",
" (wi_1): Linear(\n",
" in_features=1024, out_features=2816, bias=False\n",
" (base_layer): Linear(in_features=1024, out_features=2816, bias=False)\n",
" (ia3_l): ParameterDict( (default): Parameter containing: [torch.FloatTensor of size 2816x1])\n",
" )\n",
" (wo): Linear(in_features=2816, out_features=1024, bias=False)\n",
@@ -457,7 +462,7 @@
")"
]
},
"execution_count": 16,
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
@@ -470,7 +475,8 @@
},
{
"cell_type": "code",
"execution_count": 17,
"execution_count": 6,
"id": "155b8728",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -519,27 +525,14 @@
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING:datasets.builder:Found cached dataset financial_phrasebank (/root/.cache/huggingface/datasets/financial_phrasebank/sentences_allagree/1.0.0/550bde12e6c30e2674da973a55f57edde5181d53f5a5a34c1531c53f93b7e141)\n"
"Using the latest cached version of the dataset since financial_phrasebank couldn't be found on the Hugging Face Hub\n",
"Found the latest cached dataset configuration 'sentences_allagree' at /root/.cache/huggingface/datasets/financial_phrasebank/sentences_allagree/1.0.0/550bde12e6c30e2674da973a55f57edde5181d53f5a5a34c1531c53f93b7e141 (last modified on Thu Jul 31 03:15:41 2025).\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "bbfb7533b5ca459194e171df56b79566",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/1 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "9e12d97af6124a5a8c6627708b300c1e",
"model_id": "43b03e9b6de94bf0921228482d7be1e5",
"version_major": 2,
"version_minor": 0
},
@@ -553,7 +546,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "0c561dab67914ea9b6e1aab803600551",
"model_id": "d08de1efca67472781017b806f33870c",
"version_major": 2,
"version_minor": 0
},
@@ -567,12 +560,12 @@
{
"data": {
"text/plain": [
"{'sentence': 'It will be operated by Nokia , and supported by its Nokia NetAct network and service management system .',\n",
"{'sentence': 'SCOPI Chief Business Excellence Officer , Eng .',\n",
" 'label': 1,\n",
" 'text_label': 'neutral'}"
]
},
"execution_count": 17,
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
@@ -596,7 +589,8 @@
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 7,
"id": "723fb67d",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -633,7 +627,63 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "e1e80a68a9e7429397cafc96c3c11f80",
"model_id": "7e08a312e5454c188f52fc2ca902c463",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"tokenizer_config.json: 0%| | 0.00/430 [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "25d5de12709748c9959cd011c5c641de",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"spiece.model: 0%| | 0.00/4.31M [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "5b39c130813843c18e7f9187ffec37df",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"tokenizer.json: 0%| | 0.00/16.3M [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "de27076e123243fd89dbad1c9e1f0596",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"special_tokens_map.json: 0%| | 0.00/74.0 [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "1b55669bf13a4e2886f34c12d5f50354",
"version_major": 2,
"version_minor": 0
},
@@ -647,7 +697,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "21f582e1208a4a38ae3c0cdce87e5c14",
"model_id": "f914229f180b4188925d9e804b92475c",
"version_major": 2,
"version_minor": 0
},
@@ -695,7 +745,8 @@
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": 8,
"id": "36d56ea7",
"metadata": {
"id": "f733a3c6"
},
@@ -712,7 +763,8 @@
},
{
"cell_type": "code",
"execution_count": 20,
"execution_count": 9,
"id": "6b0a0536",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -725,45 +777,45 @@
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 255/255 [02:33<00:00, 1.67it/s]\n",
"100%|██████████| 29/29 [00:08<00:00, 3.48it/s]\n"
"100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 255/255 [00:52<00:00, 4.86it/s]\n",
"100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 29/29 [00:02<00:00, 12.67it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch=0: train_ppl=tensor(1.4939, device='cuda:0') train_epoch_loss=tensor(0.4014, device='cuda:0') eval_ppl=tensor(1.0514, device='cuda:0') eval_epoch_loss=tensor(0.0501, device='cuda:0')\n"
"epoch=0: train_ppl=tensor(1.4686, device='xpu:0') train_epoch_loss=tensor(0.3843, device='xpu:0') eval_ppl=tensor(1.0421, device='xpu:0') eval_epoch_loss=tensor(0.0412, device='xpu:0')\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 255/255 [02:32<00:00, 1.67it/s]\n",
"100%|██████████| 29/29 [00:08<00:00, 3.43it/s]\n"
"100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 255/255 [00:49<00:00, 5.20it/s]\n",
"100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 29/29 [00:02<00:00, 13.62it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch=1: train_ppl=tensor(1.0523, device='cuda:0') train_epoch_loss=tensor(0.0510, device='cuda:0') eval_ppl=tensor(1.0383, device='cuda:0') eval_epoch_loss=tensor(0.0376, device='cuda:0')\n"
"epoch=1: train_ppl=tensor(1.0683, device='xpu:0') train_epoch_loss=tensor(0.0661, device='xpu:0') eval_ppl=tensor(1.0264, device='xpu:0') eval_epoch_loss=tensor(0.0261, device='xpu:0')\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 255/255 [02:32<00:00, 1.68it/s]\n",
"100%|██████████| 29/29 [00:08<00:00, 3.44it/s]"
"100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 255/255 [00:49<00:00, 5.20it/s]\n",
"100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 29/29 [00:02<00:00, 13.63it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch=2: train_ppl=tensor(1.0397, device='cuda:0') train_epoch_loss=tensor(0.0389, device='cuda:0') eval_ppl=tensor(1.0392, device='cuda:0') eval_epoch_loss=tensor(0.0385, device='cuda:0')\n"
"epoch=2: train_ppl=tensor(1.0451, device='xpu:0') train_epoch_loss=tensor(0.0441, device='xpu:0') eval_ppl=tensor(1.0191, device='xpu:0') eval_epoch_loss=tensor(0.0190, device='xpu:0')\n"
]
},
{
@@ -814,6 +866,7 @@
{
"cell_type": "code",
"execution_count": 21,
"id": "761b90e4",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -849,6 +902,7 @@
{
"cell_type": "code",
"execution_count": 22,
"id": "8e0658ac",
"metadata": {
"id": "a8de6005"
},
@@ -861,7 +915,8 @@
},
{
"cell_type": "code",
"execution_count": 23,
"execution_count": null,
"id": "ef7fbf9c",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -874,18 +929,19 @@
"name": "stdout",
"output_type": "stream",
"text": [
"1.2M\tbigscience/mt0-large_IA3_SEQ_2_SEQ_LM/adapter_model.bin\n"
"1.2M\tbigscience/mt0-large_IA3_SEQ_2_SEQ_LM/adapter_model.safetensors\n"
]
}
],
"source": [
"ckpt = f\"{peft_model_id}/adapter_model.bin\"\n",
"ckpt = f\"{peft_model_id}/adapter_model.safetensors\"\n",
"!du -h $ckpt"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "4774d931",
"metadata": {
"id": "76c2fc29"
},
@ -903,6 +959,7 @@
{
"cell_type": "code",
"execution_count": 25,
"id": "996ddf0a",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@ -946,6 +1003,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "701eda1b",
"metadata": {
"id": "66c65ea4"
},
@ -955,6 +1013,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "7d7718c5",
"metadata": {
"id": "65e71f78"
},
@ -970,7 +1029,7 @@
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -984,7 +1043,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
"version": "3.11.13"
},
"vscode": {
"interpreter": {

Some files were not shown because too many files have changed in this diff.