Implements DeLoRA: "Decoupling Angles and Strength in Low-rank
Adaptation" (https://huggingface.co/papers/2503.18225).
Similar to DoRA, DeLoRA decouples the angular learning from the
adaptation strength, but it additionally allows limiting the norm of the
weight change. This way, DeLoRA promises to reduce the risk of
catastrophic forgetting and to be more robust to hyperparameter settings
such as the learning rate.
The "LoRA Without Regret" blog
post (https://thinkingmachines.ai/blog/lora/) mentions that targeting
the MLP part of the transformer is more effective than targeting the
attention modules. This experiment tests this by targeting:
["gate_proj", "up_proj", "down_proj"]
instead of the default layers (["q_proj", "v_proj"]).
To match the parameter count we would get when targeting the attention
modules with rank 32, I chose rank 10. Testing on my machine, there is
indeed a nice improvement in the test score:
| metric | target attention | target MLP |
|----------------------|------------------|------------|
| test accuracy | 48.2% | 51.3% |
| # trainable params | 9175040 | 9461760 |
| peak memory reserved | 20.74 GB | 23.02 GB |
There is, however, also a marked increase in memory usage, despite
matching parameter count. Since the operations are different, this may
not be a surprise, but let's wait for the final verdict once this
experiment runs on our AWS instance.
Note: I also tested higher and lower ranks when targeting the MLP. The
effect on memory usage was negligible, but increasing the rank did
improve the score:
| metric | rank 8 | rank 10 | rank 12 | rank 32 |
|--------------------|---------|---------|----------|----------|
| test accuracy | 50.3% | 51.3% | 52.2% | 54.8% |
| # trainable params | 7569408 | 9461760 | 11354112 | 30277632 |
In the end, I chose only to add the rank 10 experiment to match the
number of trainable parameters.
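For reference, here is a minimal sketch of the MLP-targeting setup. The config class name (DeloraConfig) and the small example model are assumptions for illustration, not the exact benchmark configuration:

```python
from transformers import AutoModelForCausalLM
from peft import get_peft_model
from peft import DeloraConfig  # assumed class name, analogous to LoraConfig

# Small stand-in model with a Llama-style MLP (gate/up/down projections).
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
config = DeloraConfig(
    r=10,  # rank used in the experiment (chosen to match r=32 on attention for the benchmark model)
    target_modules=["gate_proj", "up_proj", "down_proj"],  # MLP instead of q/v projections
)
model = get_peft_model(base, config)
model.print_trainable_parameters()
```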
Resolves #2809
Some models like Gemma3 apply a scalar to the embedding output. This
scalar needs to be taken into account when using trainable tokens or
when applying LoRA to the embedding layer.
A new initialization method was added to prompt tuning in #2815. This PR
adds an experiment config for this method to the MetaMathQA benchmark.
Testing locally, this got a test accuracy of 36%, compared to 25% with
random initialization.
This PR adds the set_requires_grad method to PEFT models (both PeftModel
and BaseTuner). As the name suggests, this is a method to set the
requires_grad attribute of the specified PEFT adapters.
For more general context, this is mostly relevant when dealing with
multiple adapters. As is, users can already set the active adapter(s)
with set_adapter, which automatically adjusts the requires_grad
attribute too, so that only the active adapters have grads enabled.
However, there can be situations where the activity status and
requires_grad should differ. Right now, users would need to set
requires_grad manually to deal with that, which is error prone (e.g.
forgetting modules_to_save).
This PR closes this gap in the API.
As this functionality is quite general purpose, I added a
set_requires_grad function to functional.py for easier integration.
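A hedged usage sketch (the exact signature of set_requires_grad may differ from the final API; the adapter names are placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model = get_peft_model(base, LoraConfig(task_type="CAUSAL_LM"))
model.add_adapter("other", LoraConfig(task_type="CAUSAL_LM"))

# "default" stays the active adapter, but we also want gradients for "other",
# independently of which adapter is active (assumed signature shown here).
model.set_requires_grad("other", requires_grad=True)
```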
Note: The set_requires_grad method will raise an error when called with
prompt learning methods like prompt tuning. This is because these
methods don't share a common base class (BaseTuner and BaseTunerLayer)
that would allow adding this API. Moreover, they only support a single
adapter at a time, so there is not much need for this method in the
first place.
A side effect of not supporting prompt learning is that, on the
PeftModel, we are free to let set_requires_grad accept more than one
adapter, which would otherwise be difficult because prompt learning only
allows a single adapter.
While memory usage correlates with the number of trainable parameters, reporting this number directly
makes it easier to check that methods use similar numbers of trainable parameters, and outliers
can be spotted and inspected easily.
So far, when using add_weighted_adapter, there was an implicit
assumption that all weights are positive. This PR allows negative
weights to be passed.
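An illustrative sketch, assuming a PeftModel `model` that already has two LoRA adapters loaded (adapter names are placeholders):

```python
# Combine two adapters, subtracting half of the second one's contribution;
# negative weights like -0.5 are what this PR newly permits.
model.add_weighted_adapter(
    adapters=["style", "artifacts"],
    weights=[1.0, -0.5],
    adapter_name="style_minus_artifacts",
    combination_type="linear",
)
model.set_adapter("style_minus_artifacts")
```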
---------
Co-authored-by: Valentin Teutschbein <valentin.teutschbein@student.hpi.uni-potsdam.de>
Implements the paper "Exploring Sparsity for Parameter Efficient Fine
Tuning Using Wavelets" (https://arxiv.org/abs/2505.12532).
WaveFT enables fine-grained control over the number of trainable
parameters by directly learning a sparse set of coefficients in the
wavelet domain of residual matrices. Experiments show that it works well
in the text-to-image generation space.
The reset_sessions function is removed, but it is also no longer
necessary to call it for the purpose we used it for.
Moreover, the deprecated use_auth_token argument is now fully removed:
everywhere we used to pass it, it has been dropped, unless a user passes
it explicitly.
Also, the deprecated local_dir_use_symlinks argument is removed.
Resolves #2772
Fixes several edge cases with unusual layer names or target modules.
1. As #2772 stated, if "weight" is part of a layer name, it would be
treated incorrectly when creating the PEFT state_dict.
2. Similarly, when the adapter name itself is part of a layer name.
Some of these errors would pass silently, which is especially bad (e.g.
a weight not being loaded but no error raised).
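To make the first edge case concrete, here is a hedged toy illustration (the module name is made up; it is not one of the originally reported cases):

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model, get_peft_model_state_dict

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        # The substring "weight" in the module name is what used to trip up
        # the PEFT state_dict key handling.
        self.weight_proj = nn.Linear(16, 16)

    def forward(self, x):
        return self.weight_proj(x)

model = get_peft_model(Toy(), LoraConfig(target_modules=["weight_proj"]))
print(list(get_peft_model_state_dict(model).keys()))
```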
I also added some tests that were not failing before, in order to cover
some previously uncovered cases and to lay out some basic functionality.
While working on this, I also noticed that it was possible to target a
BaseTunerLayer with modules_to_save and trainable_token_indices (e.g.
the lora_A and lora_B nn.Linear would be replaced with
ModulesToSaveWrapper). I don't think this is ever desired, so we now
raise an error if this is detected.
This PR adds the PEFT version to the adapter_config.json. This can be
useful in the future -- for instance when we change the state dict
format of a PEFT method, we can convert it in a backwards compatible way
based on the PEFT version being used. It can also be useful for
debugging by providing an easy way to see the PEFT version that was used
to train a PEFT adapter.
Notes:
In #2038, we made a change to PEFT configs to make it so that even if
new arguments are added to a config, it can still be loaded with older
PEFT versions (forward compatibility). Before that change, adding the
PEFT version would have been quite disruptive, as it would make all PEFT
configs incompatible with older PEFT versions. Said PR was included in
the 0.14.0 release from Dec 2024, so we can expect the vast majority of
PEFT users to use this version or a more recent one.
If the PEFT version is a dev version, the version tag is ambiguous.
Therefore, I added some code to try to determine the commit hash. This
works if users installed PEFT with git+...@<HASH>. Unit testing that the
function to determine the hash works with these types of installs is not
trivial. Therefore, I just patched the function to return a fixed hash.
I did, however, test it locally and it works:
```sh
python -m pip install git+https://github.com/huggingface/diffusers.git@5e181eddfe7e44c1444a2511b0d8e21d177850a0
python -c "from peft.config import _get_commit_hash; print(_get_commit_hash('diffusers'))"
```
Also note that I tried to make the retrieval of the hash super robust by
adding a broad try ... except. If there is an error there, e.g. due to a
busted install path, we never want this to fail, but rather just accept
that the hash cannot be determined (we add @UNKNOWN in this case).
If users installed a dev version of PEFT in a different way, e.g. using
git clone && pip install ., the commit hash will not be detected. I
think this is fine; I really don't want to start shelling out with git
just for this purpose.
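For illustration, this is roughly what reading the new field could look like (the key name and value format are assumptions based on the description above; the path is a placeholder):

```python
import json

with open("/tmp/peft/my-adapter/adapter_config.json") as f:
    config = json.load(f)

# Release installs: a plain version string; dev installs: the version plus the
# detected commit hash, or "@UNKNOWN" if the hash could not be determined.
print(config.get("peft_version"))
```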
Right now, get_model_status() and get_layer_status() only report on
BaseTunerLayers, but it would be helpful if they could also report on
auxiliary modules. This PR now includes those.
To facilitate this, a few attributes and methods were added to
AuxiliaryTrainingWrapper and subclasses to make them more similar to
BaseTunerLayer (e.g. the adapter_layer_names attribute). These
attributes and methods were assumed to be present in the code that
determines the model and layer status.
Resolves #2783.
Most PEFT layers (BaseTunerLayers) expose the in_features and
out_features attributes. Therefore, other packages like diffusers may
expect this attribute to exist. However, there were a few PEFT methods
where these attributes were missing:
- LoHa
- LoKr
- LN Tuning
- Trainable Tokens
The layers of these methods now also expose the attributes.
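A quick, hedged sanity check of what this change enables, using a toy model and LoHa as an example:

```python
import torch.nn as nn
from peft import LoHaConfig, get_peft_model

base = nn.Sequential(nn.Linear(16, 32))
model = get_peft_model(base, LoHaConfig(target_modules=["0"]))

loha_layer = model.base_model.model[0]  # the LoHa layer wrapping the Linear
print(loha_layer.in_features, loha_layer.out_features)  # now exposed: 16 32
```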
Implementation
To avoid code duplication, I factored out the whole code block in LoRA
layers that extracts these attributes, since LoRA has the most
exhaustive list of checks. The new utility function has the exact same
functionality and can now be used by other PEFT methods.
I updated the four PEFT methods mentioned above to use this new
function, but I did not update PEFT methods that already handled it, as
there wasn't really a need (they check one or two layer types at most,
so there is little duplication).
Explain how to use multiple adapters (e.g. 2 LoRA adapters) at the same
time, as the API is not quite intuitive and there are some footguns
around trainable parameters.
This question has come up multiple times in the past (for recent
examples, check #2749 and #2756). Thus it's a good idea to properly
document this.
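The gist of the new guide, as a hedged sketch (paths and adapter names are placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model = PeftModel.from_pretrained(base, "path/to/adapter_a", adapter_name="a")
model.load_adapter("path/to/adapter_b", adapter_name="b")

# Activate both adapters at once; note that activating adapters can also flip
# their requires_grad flags, which is one of the footguns the guide covers.
model.base_model.set_adapter(["a", "b"])
```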
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
- The warning message was missing spaces between sentences.
- Added ' around strings for clarity.
- For one warning that extended another warning, the addition is now
  placed at the start instead of the end, because the other warning can
  be quite long, which could lead to users missing the addition.
For more context on this warning, see #2254
See
https://github.com/huggingface/diffusers/issues/11816#issuecomment-3281290153
This PR implements two small improvements to the speed of adapter
injection. On a benchmark based on the linked issue, the first change
leads to a speedup of 21% and the second change of another 3%. It's not
that much, but as the changes don't make the code more complicated,
there is really no reason not to take them.
The optimizations don't add any functional change but are simply based
on not recomputing the same values multiple times. Therefore, unless I'm
missing something, they should strictly improve runtime.
Deduplicate a lot of redundant code from the PEFT methods' model.py files:
merge_and_unload
unload
delete_adapter
set_adapter
enable_adapter_layers
disable_adapter_layers
_replace_module
_unload_and_optionally_merge
_mark_only_adapters_as_trainable
_check_new_adapter_config
_check_target_module_exists
_prepare_adapter_config
__getattr__
get_peft_config_as_dict (fully deleted)
Related changes:
A new module, functional.py, is introduced, which contains functions
(just reimported from elsewhere) that can be useful for libraries that
want to integrate PEFT. I would suggest that we should treat them as
public API and thus guarantee backwards compatibility.
I also deduplicated almost identical
TRANSFORMERS_MODULES_TO_XXX_TARGET_MODULES_MAPPING constants by copying
them from LoRA and only overriding a few values that differ. Moreover,
some PEFT methods didn't have their own
TRANSFORMERS_MODULES_TO_XXX_TARGET_MODULES_MAPPING but used the one from
LoRA instead. They now each have their own constant, which is a copy of
the LoRA one.
The test_config.py tests were missing a few configs from recently added
PEFT methods. Those are now included. After adding those, it was
revealed that for C3A and trainable tokens, super().__post_init__() was
not being called. This is now done.
Description
At the moment, we strongly couple the active adapter with
requires_grad=True. Concretely, when we call model.set_adapter(name), we
automatically assume that this adapter should not only be made active,
but that its requires_grad should also be set to True.
For the purpose of training PEFT models, this is fair. However, when
loading PEFT models for inference, this is not desired. Generally, for
inference, we don't need requires_grad=True, but as is, it is enabled.
This is generally not a severe bug, since in the inference code, we
don't perform any updates, thus we don't inadvertently update a weight
because it wrongly has requires_grad=True -- this is probably why it
went unnoticed so far. However, it could lead to worse runtime
performance and memory overhead when PyTorch records grads for those
parameters (which it shouldn't if called with torch.inference_mode, but
some users may forget to use this). Therefore, this bug is still worth
fixing.
Example

### With `modules_to_save`
A very basic example where the current PEFT fails:
```python
import os

from transformers import AutoModelForCausalLM
from peft import LoraConfig, PeftModel, get_peft_model

model_id = "facebook/opt-125m"
path = "/tmp/peft/2759"
if not os.path.exists(path + "/adapter_model.safetensors"):
    model = AutoModelForCausalLM.from_pretrained(model_id)
    config = LoraConfig(target_modules=["q_proj", "v_proj"], modules_to_save=["lm_head"], r=8)
    model = get_peft_model(model, config)
    model.save_pretrained(path)
    del model

model = AutoModelForCausalLM.from_pretrained(model_id)
model = PeftModel.from_pretrained(model, path)
```
`modules_to_save` should not have grads enabled, but currently it does.
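A quick way to see the problem with the example above (the attribute path is specific to this OPT model):

```python
# lm_head is wrapped via modules_to_save; after loading for inference, its
# parameters should have requires_grad=False, but before this fix they were True.
print(any(p.requires_grad for p in model.base_model.model.lm_head.parameters()))
```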
### With multiple adapters
There is also an issue when loading more than one adapter:
```python
model = PeftModel.from_pretrained(...)
assert not any(p.requires_grad for p in model.parameters())  # works
```
So far, so good, the first adapter does not have `requires_grad`.
```python
model.load_adapter(...)
assert not any(p.requires_grad for p in model.parameters())  # fails
```
The load_adapter call inadvertently sets requires_grad=True for the
weights of the _first_ adapter. The reason is that when the second
adapter is loaded, we call set_adapter with the first adapter to ensure
that it remains the active adapter. However, due to the coupling of
active adapter and requires_grad, this results in setting
requires_grad=True for the first adapter.
The PR relaxes this coupling by allowing set_adapter to be called with
an additional argument, inference_mode. If it is set to True,
requires_grad will not be enabled, even if the adapter is activated.
Note that the examples above would also fail for modules_to_save and
trainable tokens, not only for the LoRA/LoHa/... weights.
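A hedged sketch of the new escape hatch; where exactly the argument is exposed may differ, and the adapter name is a placeholder:

```python
# Make "other" the active adapter without flipping its requires_grad to True.
model.base_model.set_adapter("other", inference_mode=True)
assert not any(p.requires_grad for p in model.parameters())
```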
Still open bugs
The proposed solution is unfortunately not perfect. Right now, we do
pass inference_mode based on the PEFT config of the adapter being added,
which helps with the original issue described above. However, even this
is not absolutely correct, because inference_mode of the second adapter
does not necessarily have the same value as inference_mode of the first
adapter. To illustrate how this can go wrong, I added an xfailing test:
test_loading_model_requires_grad_set_correctly_switch_inference_mode
I believe that this use case is rarer than the ones described at the
beginning, so IMO it is okay to have this bug because we fix more common
bugs. However, LMK if you disagree.
Related to this, I noticed that many tests in
test_custom_models.TestRequiresGrad had code like this:
```python
config0 = FooConfig(...)
peft_model = get_peft_model(MLP(), config0)
config1 = FooConfig(..., inference_mode=True)  # <==
peft_model.add_adapter("adapter1", config1)
```
This now fails because of the reason just given. I removed
inference_mode=True here and the tests pass again.
Note that the only reason inference_mode=True was passed here is that
AdaLoRA cannot load 2 adapters in training mode and thus requires this.
Later PEFT methods without this restriction blindly
copied the AdaLoRA test. For those PEFT methods, I removed
inference_mode=True.
However, this also means that the AdaLoRA tests now fail. I thus marked
them as xfail.
To properly fix this bug, I think we would have to refactor the code to
isolate set_adapter (i.e. determining the active adapter) and setting
requires_grad into separate code paths, as they're orthogonal. Moreover,
these attributes are being set all over the place, which makes it hard
to reason about where these attributes are being changed. This should be
streamlined.
Making these changes while not breaking any existing code is not
trivial (or maybe even impossible). Therefore, I went the easier way for
the time being with this PR. Maybe a bigger refactor could be envisioned
for a version 1.0 release of PEFT.
Related changes
While working on this, I noticed that LNTuning was completely buggy when
calling set_adapter. This is now fixed.
Moreover, since I had to touch update_layer everywhere, I ensured that
they all take kwargs for consistency.
This PR adds support for Arrow, a modular routing mechanism for LoRA experts introduced here, as well as the refinement method GenKnowSub, proposed in our ACL 2025 Main Conference paper. GenKnowSub enhances Arrow by subtracting a general-domain LoRA from task-specific ones prior to routing, leading to improved generalisation and modularity.
There was an issue where forward hooks would accumulate during
generation: one hook was registered per forward step, and generate calls
forward multiple times. This is already undesirable, but to make matters
worse, only the last hook was removed afterwards, so the hooks piled up
across calls.
LeRobot uses dataclasses to manage policy configs. If we want to
support LeRobot policy fine-tuning, it would be easiest to support
these configs in `get_model_config`.
While it is possible to fix this on LeRobot's side (by adding a to_dict implementation to the config classes), I think it is cleaner to support it on our side, since the cost is relatively low and dataclasses are getting more popular anyway.
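A sketch of the idea only; the actual change lives inside PEFT's get_model_config and may look different:

```python
import dataclasses

def config_to_dict(config):
    # Fall back to dataclasses.asdict for dataclass-based configs (as used by
    # LeRobot) instead of requiring a transformers-style to_dict method.
    if dataclasses.is_dataclass(config) and not isinstance(config, type):
        return dataclasses.asdict(config)
    return config.to_dict()
```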
Thanks @xliu0105 for raising this issue and proposing a fix.
In #2737, we fixed some code that relied on the deprecated attribute,
but some of it was missed, as it only runs on the nightly CI with
multiple GPUs. This PR fixes the remaining occurrences.
Note that the original transformers code that this solution was based on
no longer exists, as transformers now initializes the cache lazily, so
pre-allocating the keys and values to the correct device is not
necessary. But since prefix tuning inserts "virtual" keys/values, we
still have to ensure the correct device in PEFT.
I have tested the failing tests locally and they pass.
There is an AWQ test that has been failing since torch 2.6 and was
marked as xfail for torch==2.7. However, torch 2.8 is now out and the
test is still failing. Therefore, the xfail now checks for torch>=2.7.
As AWQ is no longer being maintained, we should expect this situation to
deteriorate over time and eventually we'll have to remove it. But for
the time being, it still appears to mostly work, so I suggest we leave
it as is.
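Schematically, the adjusted marker looks something like this (the real test and its condition helper in the PEFT test suite differ; this only illustrates the torch>=2.7 check):

```python
import pytest
import torch
from packaging import version

@pytest.mark.xfail(
    version.parse(torch.__version__) >= version.parse("2.7.0"),
    reason="Known AWQ failure on torch >= 2.7",
)
def test_awq_lora_output():
    ...  # the actual AWQ + LoRA forward check lives in the test suite
```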
This PR migrates Activated LoRA (aLoRA) support from a standalone GitHub repository (see above) to PEFT itself.
Note there is also an active PR for vLLM inference support for Activated LoRA: vllm-project/vllm#19710. There are also collections of aLoRA models on the Hugging Face Hub (in the ibm-granite org); note that these preexisting models run off of the standalone GitHub repo and will be updated to work with this new PEFT feature if merged.
Description of changes: Activated LoRA is a modification of the LoRA architecture that "activates" the adapter weights only on tokens coming after a specified invocation_string. As a result, the KV values for the tokens coming before the activation match the KV values of the base model. This allows the KV cache for the input to be shared between the base model and the adapter model, enabling major speedups in inference pipelines (e.g. agentic pipelines) that want to use both base models and adapter models. See the paper for a detailed exploration of use cases and further elaboration.
Other notes:
The crux of the changes is in layer.py. Everything else simply manages the alora_offsets quantity, which defines where the weights start to be activated. This is determined by scanning the input strings for the invocation_string defined in the aLoraConfig.
I believe that aLoRA really only makes sense for CausalLMs, hence I've only implemented this for that model type.
Merging doesn't make sense for aLoRA adapters since the weights are not universally applied to all tokens.
I used the LoRA code as a starting point, but did not implement various features of that code that seemed non-essential here.
As of now, invocation_string should probably start and end with special tokens, to avoid tokenizer issues at the boundary. Open to suggestions on how to make this more general if needed.
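To give an idea of the intended usage, here is a hedged configuration sketch; the import name and every argument except invocation_string are assumptions based on this description, and the invocation string itself is a placeholder:

```python
from peft import aLoraConfig  # assumed import, as referenced above

config = aLoraConfig(
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj"],
    # The adapter weights only "activate" on tokens after this string; ideally
    # it starts and ends with special tokens to avoid tokenizer boundary issues.
    invocation_string="<|assistant|>",
)
```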
---------
Co-authored-by: githubnemo <githubnemo@users.noreply.github.com>