Compare commits

...

2 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
|  | fa16f85cef | make style | 2025-05-26 20:58:17 +00:00 |
|  | b1563c5f65 | Use HF Papers | 2025-05-17 03:30:40 +00:00 |
60 changed files with 110 additions and 109 deletions

View File

@ -52,7 +52,7 @@ In principle, LoRA can be applied to any subset of weight matrices in a neural n
## Mixture of LoRA Experts (X-LoRA)
[X-LoRA](https://arxiv.org/abs/2402.07148) is a mixture of experts method for LoRA which works by using dense or sparse gating to dynamically activate LoRA experts. The LoRA experts as well as the base model are frozen during training, resulting in a low parameter count as only the gating layers must be trained. In particular, the gating layers output scalings which (depending on config) are granular on the layer and token level. Additionally, during inference, X-LoRA dynamically activates LoRA adapters to recall knowledge and effectively mix them:
[X-LoRA](https://huggingface.co/papers/2402.07148) is a mixture of experts method for LoRA which works by using dense or sparse gating to dynamically activate LoRA experts. The LoRA experts as well as the base model are frozen during training, resulting in a low parameter count as only the gating layers must be trained. In particular, the gating layers output scalings which (depending on config) are granular on the layer and token level. Additionally, during inference, X-LoRA dynamically activates LoRA adapters to recall knowledge and effectively mix them:
The below graphic demonstrates how the scalings change for different prompts for each token. This highlights the activation of different adapters as the generation progresses and the sequence creates new context.

View File

@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
# IA3
This conceptual guide gives a brief overview of [IA3](https://arxiv.org/abs/2205.05638), a parameter-efficient fine tuning technique that is
This conceptual guide gives a brief overview of [IA3](https://huggingface.co/papers/2205.05638), a parameter-efficient fine tuning technique that is
intended to improve over [LoRA](./lora).
To make fine-tuning more efficient, IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations)

View File

@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
# Orthogonal Finetuning (OFT and BOFT)
This conceptual guide gives a brief overview of [OFT](https://arxiv.org/abs/2306.07280) and [BOFT](https://arxiv.org/abs/2311.06243), parameter-efficient fine-tuning techniques that use orthogonal matrices to multiplicatively transform the pretrained weight matrices.
This conceptual guide gives a brief overview of [OFT](https://huggingface.co/papers/2306.07280) and [BOFT](https://huggingface.co/papers/2311.06243), parameter-efficient fine-tuning techniques that use orthogonal matrices to multiplicatively transform the pretrained weight matrices.
To achieve efficient fine-tuning, OFT represents the weight updates with an orthogonal transformation. The orthogonal transformation is parameterized by an orthogonal matrix multiplied with the pretrained weight matrix. These new matrices can be trained to adapt to the new data while keeping the overall number of changes low. The original weight matrix remains frozen and doesn't receive any further adjustments. To produce the final results, both the original and the adapted weights are multiplied together.
@ -30,7 +30,7 @@ Orthogonal Butterfly (BOFT) generalizes OFT with Butterfly factorization and fur
BOFT has some advantages compared to LoRA:
* BOFT proposes a simple yet generic way to finetune pretrained models on downstream tasks, yielding better preservation of pretraining knowledge and better parameter efficiency.
* Through the orthogonality, BOFT introduces a structural constraint, i.e., keeping the [hyperspherical energy](https://arxiv.org/abs/1805.09298) unchanged during finetuning. This can effectively reduce the forgetting of pretraining knowledge.
* Through the orthogonality, BOFT introduces a structural constraint, i.e., keeping the [hyperspherical energy](https://huggingface.co/papers/1805.09298) unchanged during finetuning. This can effectively reduce the forgetting of pretraining knowledge.
* BOFT uses the butterfly factorization to efficiently parameterize the orthogonal matrix, which yields a compact yet expressive learning space (i.e., hypothesis class).
* The sparse matrix decomposition in BOFT brings in additional inductive biases that are beneficial to generalization.
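
As an illustrative sketch (the model id, target modules and hyperparameter values below are placeholder choices, not recommendations), applying BOFT through PEFT looks roughly like this:

```python
# Minimal sketch: attach a BOFT adapter to a causal LM with PEFT.
# All values below are illustrative placeholders.
from transformers import AutoModelForCausalLM
from peft import BOFTConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
config = BOFTConfig(
    boft_block_size=4,          # size of the orthogonal blocks
    boft_n_butterfly_factor=2,  # number of butterfly factors
    target_modules=["q_proj", "v_proj"],
    boft_dropout=0.1,
)
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()  # only the BOFT parameters are trainable
```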

View File

@ -219,7 +219,7 @@ peft_model = get_peft_model(my_mistral_model, config)
If that doesn't help, check the existing modules in your model architecture with the `named_modules` method and try to identify the attention layers, especially the key, query, and value layers. Those will often have names such as `c_attn`, `query`, `q_proj`, etc. The key layer is not always adapted, and ideally, you should check whether including it results in better performance.
Additionally, linear layers are common targets to be adapted (e.g. in the [QLoRA paper](https://arxiv.org/abs/2305.14314), the authors suggest adapting them as well). Their names will often contain the strings `fc` or `dense`.
Additionally, linear layers are common targets to be adapted (e.g. in the [QLoRA paper](https://huggingface.co/papers/2305.14314), the authors suggest adapting them as well). Their names will often contain the strings `fc` or `dense`.
If you want to add a new model to PEFT, please create an entry in [constants.py](https://github.com/huggingface/peft/blob/main/src/peft/utils/constants.py) and open a pull request on the [repository](https://github.com/huggingface/peft/pulls). Don't forget to update the [README](https://github.com/huggingface/peft#models-support-matrix) as well.
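
For example, a quick way to list candidate linear-layer names is to filter `named_modules`; the model id below is only an example:

```python
# Sketch: inspect a model's linear layers to choose sensible `target_modules`.
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # example model
linear_names = sorted({name.split(".")[-1] for name, module in model.named_modules() if isinstance(module, nn.Linear)})
print(linear_names)  # typically names like 'q_proj', 'k_proj', 'v_proj', 'fc1', 'fc2'
```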

View File

@ -41,7 +41,7 @@ config = LoraConfig(init_lora_weights=False, ...)
```
### PiSSA
[PiSSA](https://arxiv.org/abs/2404.02948) initializes the LoRA adapter using the principal singular values and singular vectors. This straightforward modification allows PiSSA to converge more rapidly than LoRA and ultimately attain superior performance. Moreover, PiSSA reduces the quantization error compared to QLoRA, leading to further enhancements.
[PiSSA](https://huggingface.co/papers/2404.02948) initializes the LoRA adapter using the principal singular values and singular vectors. This straightforward modification allows PiSSA to converge more rapidly than LoRA and ultimately attain superior performance. Moreover, PiSSA reduces the quantization error compared to QLoRA, leading to further enhancements.
Configure the initialization method to "pissa", which may take several minutes to execute SVD on the pre-trained model:
```python
@ -56,7 +56,7 @@ For detailed instruction on using PiSSA, please follow [these instructions](http
### CorDA
[CorDA](https://arxiv.org/pdf/2406.05223) builds task-aware LoRA adapters from weight decomposition oriented by the context of the downstream task to learn (instruction-previewed mode, IPM) or the world knowledge to maintain (knowledge-preserved mode, KPM).
[CorDA](https://huggingface.co/papers/2406.05223) builds task-aware LoRA adapters from weight decomposition oriented by the context of the downstream task to learn (instruction-previewed mode, IPM) or the world knowledge to maintain (knowledge-preserved mode, KPM).
The KPM not only achieves better performance than LoRA on fine-tuning tasks, but also mitigates the catastrophic forgetting of pre-trained world knowledge.
When preserving pre-trained knowledge is not a concern,
the IPM is favored because it can further accelerate convergence and enhance the fine-tuning performance.
@ -86,7 +86,7 @@ peft_model = get_peft_model(model, lora_config)
For detailed instruction on using CorDA, please follow [these instructions](https://github.com/huggingface/peft/tree/main/examples/corda_finetuning).
### OLoRA
[OLoRA](https://arxiv.org/abs/2406.01775) utilizes QR decomposition to initialize the LoRA adapters. OLoRA translates the base weights of the model by a factor of their QR decompositions, i.e., it mutates the weights before performing any training on them. This approach significantly improves stability, accelerates convergence speed, and ultimately achieves superior performance.
[OLoRA](https://huggingface.co/papers/2406.01775) utilizes QR decomposition to initialize the LoRA adapters. OLoRA translates the base weights of the model by a factor of their QR decompositions, i.e., it mutates the weights before performing any training on them. This approach significantly improves stability, accelerates convergence speed, and ultimately achieves superior performance.
You just need to pass a single additional option to use OLoRA:
```python
@ -96,7 +96,7 @@ config = LoraConfig(init_lora_weights="olora", ...)
For more advanced usage, please refer to our [documentation](https://github.com/huggingface/peft/tree/main/examples/olora_finetuning).
### EVA
[EVA](https://arxiv.org/pdf/2410.07170) performs SVD on the input activations of each layer and uses the right-singular vectors to initialize LoRA weights. It is therefore a data-driven initialization scheme. Furthermore, EVA adaptively allocates ranks across layers based on their "explained variance ratio" - a metric derived from the SVD analysis.
[EVA](https://huggingface.co/papers/2410.07170) performs SVD on the input activations of each layer and uses the right-singular vectors to initialize LoRA weights. It is therefore a data-driven initialization scheme. Furthermore, EVA adaptively allocates ranks across layers based on their "explained variance ratio" - a metric derived from the SVD analysis.
You can use EVA by setting `init_lora_weights="eva"` and defining [`EvaConfig`] in [`LoraConfig`]:
```python
@ -129,7 +129,7 @@ For further instructions on using EVA, please refer to our [documentation](https
#### Standard approach
When quantizing the base model for QLoRA training, consider using the [LoftQ initialization](https://arxiv.org/abs/2310.08659), which has been shown to improve performance when training quantized models. The idea is that the LoRA weights are initialized such that the quantization error is minimized. To use LoftQ, follow [these instructions](https://github.com/huggingface/peft/tree/main/examples/loftq_finetuning).
When quantizing the base model for QLoRA training, consider using the [LoftQ initialization](https://huggingface.co/papers/2310.08659), which has been shown to improve performance when training quantized models. The idea is that the LoRA weights are initialized such that the quantization error is minimized. To use LoftQ, follow [these instructions](https://github.com/huggingface/peft/tree/main/examples/loftq_finetuning).
In general, for LoftQ to work best, it is recommended to target as many layers with LoRA as possible, since those not targeted cannot have LoftQ applied. This means that passing `LoraConfig(..., target_modules="all-linear")` will most likely give the best results. Also, you should use `nf4` as quant type in your quantization config when using 4bit quantization, i.e. `BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")`.
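
A minimal configuration sketch for this approach (the `LoftQConfig` helper is PEFT's; the bit width and targets are example values):

```python
# Sketch: ask PEFT to initialize the LoRA weights with LoftQ so that the
# quantization error is minimized. Values are illustrative.
from peft import LoftQConfig, LoraConfig

loftq_config = LoftQConfig(loftq_bits=4)   # 4-bit quantization during initialization
lora_config = LoraConfig(
    init_lora_weights="loftq",
    loftq_config=loftq_config,
    target_modules="all-linear",           # target as many layers as possible, as noted above
)
```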
@ -176,7 +176,7 @@ config = LoraConfig(use_rslora=True, ...)
### Weight-Decomposed Low-Rank Adaptation (DoRA)
This technique decomposes the updates of the weights into two parts, magnitude and direction. Direction is handled by normal LoRA, whereas the magnitude is handled by a separate learnable parameter. This can improve the performance of LoRA, especially at low ranks. For more information on DoRA, see https://arxiv.org/abs/2402.09353.
This technique decomposes the updates of the weights into two parts, magnitude and direction. Direction is handled by normal LoRA, whereas the magnitude is handled by a separate learnable parameter. This can improve the performance of LoRA, especially at low ranks. For more information on DoRA, see https://huggingface.co/papers/2402.09353.
```py
from peft import LoraConfig
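# Minimal sketch (placeholder values): DoRA is enabled through the `use_dora` flag.
config = LoraConfig(use_dora=True, ...)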
@ -228,7 +228,7 @@ config = LoraConfig(target_modules="all-linear", ...)
### Memory efficient Layer Replication with LoRA
An approach used to improve the performance of models is to expand a model by duplicating layers in the model to build a larger model from a pretrained model of a given size. For example, increasing a 7B model to a 10B model as described in the [SOLAR](https://arxiv.org/abs/2312.15166) paper. PEFT LoRA supports this kind of expansion in a memory-efficient manner that allows further fine-tuning using LoRA adapters attached to the layers after replication. The replicated layers do not take additional memory as they share the underlying weights, so the only additional memory required is the memory for the adapter weights. To use this feature you would create a config with the `layer_replication` argument.
An approach used to improve the performance of models is to expand a model by duplicating layers in the model to build a larger model from a pretrained model of a given size. For example, increasing a 7B model to a 10B model as described in the [SOLAR](https://huggingface.co/papers/2312.15166) paper. PEFT LoRA supports this kind of expansion in a memory-efficient manner that allows further fine-tuning using LoRA adapters attached to the layers after replication. The replicated layers do not take additional memory as they share the underlying weights, so the only additional memory required is the memory for the adapter weights. To use this feature you would create a config with the `layer_replication` argument.
```py
config = LoraConfig(layer_replication=[[0,4], [2,5]], ...)
@ -275,7 +275,7 @@ LoRA training can optionally include special purpose optimizers. Currently PEFT
### LoRA-FA Optimizer
LoRA training can be more effective and efficient using LoRA-FA, as described in [LoRA-FA](https://arxiv.org/abs/2308.03303). LoRA-FA reduces activation memory consumption by fixing the matrix A and only tuning the matrix B. During training, the gradient of B is optimized to approximate the full parameter fine-tuning gradient. Moreover, the memory consumption of LoRA-FA is not sensitive to the rank (since it erases the activation of $A$), so it can improve performance by enlarging the LoRA rank without increasing memory consumption.
LoRA training can be more effective and efficient using LoRA-FA, as described in [LoRA-FA](https://huggingface.co/papers/2308.03303). LoRA-FA reduces activation memory consumption by fixing the matrix A and only tuning the matrix B. During training, the gradient of B is optimized to approximate the full parameter fine-tuning gradient. Moreover, the memory consumption of LoRA-FA is not sensitive to the rank (since it erases the activation of $A$), so it can improve performance by enlarging the LoRA rank without increasing memory consumption.
```py
from peft import LoraConfig, get_peft_model
@ -308,7 +308,7 @@ trainer = Trainer(
### LoRA+ optimized LoRA
LoRA training can be optimized using [LoRA+](https://arxiv.org/abs/2402.12354), which uses different learning rates for the adapter matrices A and B, shown to increase finetuning speed by up to 2x and performance by 1-2%.
LoRA training can be optimized using [LoRA+](https://huggingface.co/papers/2402.12354), which uses different learning rates for the adapter matrices A and B, shown to increase finetuning speed by up to 2x and performance by 1-2%.
```py
from peft import LoraConfig, get_peft_model
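# Rough sketch: LoRA+ uses a dedicated optimizer that gives the B matrices a higher
# learning rate. The parameter names below (optimizer_cls, lr, loraplus_lr_ratio) are
# assumed from PEFT's `create_loraplus_optimizer` helper; check the docs for the exact
# signature. Values are placeholders.
import torch
from peft.optimizers import create_loraplus_optimizer

optimizer = create_loraplus_optimizer(
    model=peft_model,                 # a model already wrapped with get_peft_model
    optimizer_cls=torch.optim.AdamW,
    lr=2e-4,
    loraplus_lr_ratio=16,             # ratio of the B-matrix learning rate to the base lr
)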

View File

@ -21,7 +21,7 @@ Quantization represents data with fewer bits, making it a useful technique for r
* optimizing which model weights are quantized with the [AWQ](https://hf.co/papers/2306.00978) algorithm
* independently quantizing each row of a weight matrix with the [GPTQ](https://hf.co/papers/2210.17323) algorithm
* quantizing to 8-bit and 4-bit precision with the [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) library
* quantizing to as low as 2-bit precision with the [AQLM](https://arxiv.org/abs/2401.06118) algorithm
* quantizing to as low as 2-bit precision with the [AQLM](https://huggingface.co/papers/2401.06118) algorithm
However, after a model is quantized it isn't typically further trained for downstream tasks because training can be unstable due to the lower precision of the weights and activations. But since PEFT methods only add *extra* trainable parameters, this allows you to train a quantized model with a PEFT adapter on top! Combining quantization with PEFT can be a good strategy for training even the largest models on a single GPU. For example, [QLoRA](https://hf.co/papers/2305.14314) is a method that quantizes a model to 4-bits and then trains it with LoRA. This method allows you to finetune a 65B parameter model on a single 48GB GPU!
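
A bare-bones sketch of this combination (the model id and hyperparameters are placeholders):

```python
# Sketch: load a 4-bit quantized base model and attach a LoRA adapter on top.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", quantization_config=bnb_config)
peft_model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM"))
peft_model.print_trainable_parameters()  # only the LoRA parameters are trainable
```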
@ -135,7 +135,7 @@ Once quantized, you can post-train GPTQ models with PEFT APIs.
## AQLM quantization
Additive Quantization of Language Models ([AQLM](https://arxiv.org/abs/2401.06118)) is a compression method for Large Language Models. It quantizes multiple weights together and takes advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes. This allows it to compress models down to as low as 2-bit with considerably low accuracy losses.
Additive Quantization of Language Models ([AQLM](https://huggingface.co/papers/2401.06118)) is a compression method for Large Language Models. It quantizes multiple weights together and takes advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes. This allows it to compress models down to as low as 2-bit with considerably low accuracy losses.
Since the AQLM quantization process is computationally expensive, using prequantized models is recommended. A partial list of available models can be found in the official AQLM [repository](https://github.com/Vahe1994/AQLM).
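
A sketch of how this typically looks (the repository id is a placeholder for a prequantized AQLM checkpoint, and the `aqlm` package must be installed):

```python
# Sketch: attach a LoRA adapter to an already AQLM-quantized checkpoint.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("<prequantized-aqlm-model-id>")  # placeholder id
peft_model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM"))
```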

View File

@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.
## Overview
[VB-LoRA](https://arxiv.org/abs/2405.15179) is a parameter-efficient fine-tuning technique that extends LoRA by learning a fine-grained parameter-sharing scheme at the sub-vector level, achieving significantly higher parameter efficiency. This makes VB-LoRA especially useful in scenarios where storage and transmission costs are critical. It works by decomposing low-rank matrices—from different layers and modules such as K, Q, V, and FFN—into sub-vectors, which are then globally shared through a vector bank.
[VB-LoRA](https://huggingface.co/papers/2405.15179) is a parameter-efficient fine-tuning technique that extends LoRA by learning a fine-grained parameter-sharing scheme at the sub-vector level, achieving significantly higher parameter efficiency. This makes VB-LoRA especially useful in scenarios where storage and transmission costs are critical. It works by decomposing low-rank matrices—from different layers and modules such as K, Q, V, and FFN—into sub-vectors, which are then globally shared through a vector bank.
The abstract from the paper is:

View File

@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
# X-LoRA
Mixture of LoRA Experts ([X-LoRA](https://arxiv.org/abs/2402.07148)) is a PEFT method enabling sparse or dense mixture of LoRA experts based on a high granularity (token, layer, sequence) scalings matrix. This leverages frozen LoRA adapters and a frozen base model to drastically reduce the number of parameters that need to be fine-tuned.
Mixture of LoRA Experts ([X-LoRA](https://huggingface.co/papers/2402.07148)) is a PEFT method enabling sparse or dense mixture of LoRA experts based on a high granularity (token, layer, sequence) scalings matrix. This leverages frozen LoRA adapters and a frozen base model to drastically reduce the number of parameters that need to be fine-tuned.
A unique aspect of X-LoRA is its versatility: it can be applied to any `transformers` base model with LoRA adapters. This means that, despite the mixture of experts strategy, no changes to the model code need to be made.
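
A rough usage sketch (adapter paths, the base model id, and config values are placeholders; see the X-LoRA API reference for the full set of options):

```python
# Sketch: wrap a base model with X-LoRA on top of existing, frozen LoRA adapters.
from transformers import AutoModelForCausalLM
from peft import XLoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # example base model
config = XLoraConfig(
    task_type="CAUSAL_LM",
    hidden_size=model.config.hidden_size,
    adapters={                               # frozen LoRA experts to be mixed by the gating layers
        "expert_math": "./adapters/math",    # placeholder adapter paths
        "expert_code": "./adapters/code",
    },
)
xlora_model = get_peft_model(model, config)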

View File

@ -21,7 +21,7 @@ This guide demonstrates how to use BOFT, an orthogonal fine-tuning method, to fi
By using BOFT from 🤗 PEFT, we can significantly reduce the number of trainable parameters while still achieving impressive results in various fine-tuning tasks across different foundation models. BOFT enhances model efficiency by integrating full-rank orthogonal matrices with a butterfly structure into specific model blocks, such as attention blocks, mirroring the approach used in LoRA. During fine-tuning, only these inserted matrices are trained, leaving the original model parameters untouched. During inference, the trainable BOFT parameters can be merged into the original model, eliminating any additional computational costs.
As a member of the **orthogonal finetuning** class, BOFT presents a systematic and principled method for fine-tuning. It possesses several unique properties and has demonstrated superior performance compared to LoRA in a variety of scenarios. For further details on BOFT, please consult the [PEFT concept guide on OFT](https://huggingface.co/docs/peft/index), the [original BOFT paper](https://arxiv.org/abs/2311.06243) and the [original OFT paper](https://arxiv.org/abs/2306.07280).
As a member of the **orthogonal finetuning** class, BOFT presents a systematic and principled method for fine-tuning. It possesses several unique properties and has demonstrated superior performance compared to LoRA in a variety of scenarios. For further details on BOFT, please consult the [PEFT concept guide on OFT](https://huggingface.co/docs/peft/index), the [original BOFT paper](https://huggingface.co/papers/2311.06243) and the [original OFT paper](https://huggingface.co/papers/2306.07280).
In this guide we provide a controllable generation (ControlNet) fine-tuning script that is available in [PEFT's GitHub repo examples](https://github.com/huggingface/peft/tree/main/examples/boft_controlnet). This implementation is adapted from [diffusers's ControlNet](https://github.com/huggingface/diffusers/tree/main/examples/controlnet) and [Hecong Wu's ControlLoRA](https://github.com/HighCWu/ControlLoRA). You can try it out and finetune on your custom images.

View File

@ -13,7 +13,7 @@
# limitations under the License.
# The implementation is based on "Parameter-Efficient Orthogonal Finetuning
# via Butterfly Factorization" (https://arxiv.org/abs/2311.06243) in ICLR 2024.
# via Butterfly Factorization" (https://huggingface.co/papers/2311.06243) in ICLR 2024.
import glob
import os

View File

@ -13,7 +13,7 @@
# limitations under the License.
# The implementation is based on "Parameter-Efficient Orthogonal Finetuning
# via Butterfly Factorization" (https://arxiv.org/abs/2311.06243) in ICLR 2024.
# via Butterfly Factorization" (https://huggingface.co/papers/2311.06243) in ICLR 2024.
import os
import sys

View File

@ -14,7 +14,7 @@
# limitations under the License.
# The implementation is based on "Parameter-Efficient Orthogonal Finetuning
# via Butterfly Factorization" (https://arxiv.org/abs/2311.06243) in ICLR 2024.
# via Butterfly Factorization" (https://huggingface.co/papers/2311.06243) in ICLR 2024.
import itertools
import logging

View File

@ -40,7 +40,7 @@ class ControlNetOutput(BaseOutput):
class ControlNetConditioningEmbedding(nn.Module):
"""
Quoting from https://arxiv.org/abs/2302.05543: "Stable Diffusion uses a pre-processing method similar to VQ-GAN
Quoting from https://huggingface.co/papers/2302.05543: "Stable Diffusion uses a pre-processing method similar to VQ-GAN
[11] to convert the entire dataset of 512 × 512 images into smaller 64 × 64 “latent images” for stabilized
training. This requires ControlNets to convert image-based conditions to 64 × 64 feature space to match the
convolution size. We use a tiny network E(·) of four convolution layers with 4 × 4 kernels and 2 × 2 strides

View File

@ -215,9 +215,9 @@ class LightControlNetPipeline(StableDiffusionControlNetPipeline):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://huggingface.co/papers/2207.12598).
`guidance_scale` is defined as `w` of equation 2. of [Imagen
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
Paper](https://huggingface.co/papers/2205.11487). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
negative_prompt (`str` or `List[str]`, *optional*):
@ -227,7 +227,7 @@ class LightControlNetPipeline(StableDiffusionControlNetPipeline):
num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to
Corresponds to parameter eta (η) in the DDIM paper: https://huggingface.co/papers/2010.02502. Only applies to
[`schedulers.DDIMScheduler`], will be ignored for others.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
@ -298,7 +298,7 @@ class LightControlNetPipeline(StableDiffusionControlNetPipeline):
device = self._execution_device
# here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
# of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
# of the Imagen paper: https://huggingface.co/papers/2205.11487 . `guidance_scale = 1`
# corresponds to doing no classifier free guidance.
do_classifier_free_guidance = guidance_scale > 1.0

View File

@ -20,7 +20,7 @@ This guide demonstrates how to use BOFT, an orthogonal fine-tuning method, to fi
By using BOFT from 🤗 PEFT, we can significantly reduce the number of trainable parameters while still achieving impressive results in various fine-tuning tasks across different foundation models. BOFT enhances model efficiency by integrating full-rank orthogonal matrices with a butterfly structure into specific model blocks, such as attention blocks, mirroring the approach used in LoRA. During fine-tuning, only these inserted matrices are trained, leaving the original model parameters untouched. During inference, the trainable BOFT parameters can be merged into the original model, eliminating any additional computational costs.
As a member of the **orthogonal finetuning** class, BOFT presents a systematic and principled method for fine-tuning. It possesses several unique properties and has demonstrated superior performance compared to LoRA in a variety of scenarios. For further details on BOFT, please consult the [PEFT concept guide on OFT](https://huggingface.co/docs/peft/index), the [original BOFT paper](https://arxiv.org/abs/2311.06243) and the [original OFT paper](https://arxiv.org/abs/2306.07280).
As a member of the **orthogonal finetuning** class, BOFT presents a systematic and principled method for fine-tuning. It possesses several unique properties and has demonstrated superior performance compared to LoRA in a variety of scenarios. For further details on BOFT, please consult the [PEFT concept guide on OFT](https://huggingface.co/docs/peft/index), the [original BOFT paper](https://huggingface.co/papers/2311.06243) and the [original OFT paper](https://huggingface.co/papers/2306.07280).
In this guide we provide a Dreambooth fine-tuning script that is available in [PEFT's GitHub repo examples](https://github.com/huggingface/peft/tree/main/examples/boft_dreambooth). This implementation is adapted from [peft's lora_dreambooth](https://github.com/huggingface/peft/tree/main/examples/lora_dreambooth). You can try it out and finetune on your custom images.

View File

@ -14,7 +14,7 @@
# limitations under the License.
# The implementation is based on "Parameter-Efficient Orthogonal Finetuning
# via Butterfly Factorization" (https://arxiv.org/abs/2311.06243) in ICLR 2024.
# via Butterfly Factorization" (https://huggingface.co/papers/2311.06243) in ICLR 2024.
import hashlib
import itertools

View File

@ -1,5 +1,5 @@
# DiSHA: Dimension-Sharding Adaptation with Fast Convergence and Fast Computation
## Introduction ([Paper](https://arxiv.org/pdf/2409.15371), [code](https://github.com/JL-er/DiSHA))
## Introduction ([Paper](https://huggingface.co/papers/2409.15371), [code](https://github.com/JL-er/DiSHA))
Low-Rank Adaptation (LoRA) leverages the low intrinsic rank of weight updates in Large Language Models (LLMs), establishing a Parameter-Efficient Fine-Tuning (PEFT) paradigm. However, LoRA suffers from slow convergence. We introduce Dimension-Sharding Adaptation (DiSHA), which expands the PEFT design space to unlock lower intrinsic ranks and faster convergence by default. Within DiSHA's design space, we propose Block Affine Adaptation (Bone), a computationally efficient structure that delivers both high performance and efficiency. While certain DiSHA configurations may result in colinear updates to weight shards, we address this with Block Affine Transformation Adaptation (BAT), a nonlinear variant of DiSHA. BAT introduces nonlinearity by combining trainable matrices with original weight shards in a nonlinear manner, inducing nonlinearity in matrix updates without introducing additional parameters. Empirical results show that Bone, under the DiSHA framework, consistently outperforms LoRA variants in both NLG and NLU tasks, with significantly improved computational efficiency. Further analysis demonstrates that BAT enhances model capabilities by leveraging its nonlinear design.
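
As a quick sketch of what fine-tuning with Bone through PEFT can look like (the model id and config values are illustrative; selecting the nonlinear BAT variant via the config is an assumption to verify against the PEFT docs):

```python
# Sketch: fine-tune with Bone (DiSHA) via PEFT. Values are placeholders.
from transformers import AutoModelForCausalLM
from peft import BoneConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # example model
config = BoneConfig(r=64)                   # block size of the dimension-sharded update
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()
```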
@ -92,5 +92,5 @@ python bone_finetuning.py \
eprint={2409.15371},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.15371},
url={https://huggingface.co/papers/2409.15371},
}

View File

@ -1,6 +1,6 @@
# Context-aware Prompt Tuning: Advancing In-Context Learning with Adversarial Methods
## Introduction ([Paper](https://arxiv.org/abs/2410.17222), [Code](https://github.com/tsachiblau/Context-aware-Prompt-Tuning-Advancing-In-Context-Learning-with-Adversarial-Methods), [Notebook](cpt_train_and_inference.ipynb), [Colab](https://colab.research.google.com/drive/1UhQDVhZ9bDlSk1551SuJV8tIUmlIayta?usp=sharing))
## Introduction ([Paper](https://huggingface.co/papers/2410.17222), [Code](https://github.com/tsachiblau/Context-aware-Prompt-Tuning-Advancing-In-Context-Learning-with-Adversarial-Methods), [Notebook](cpt_train_and_inference.ipynb), [Colab](https://colab.research.google.com/drive/1UhQDVhZ9bDlSk1551SuJV8tIUmlIayta?usp=sharing))
> Large Language Models (LLMs) can perform few-shot learning using either optimization-based approaches or In-Context Learning (ICL). Optimization-based methods often suffer from overfitting, as they require updating a large number of parameters with limited data. In contrast, ICL avoids overfitting but typically underperforms compared to optimization-based methods and is highly sensitive to the selection, order, and format of demonstration examples. To overcome these challenges, we introduce Context-aware Prompt Tuning (CPT), a method inspired by ICL, Prompt Tuning (PT), and adversarial attacks. CPT builds on the ICL strategy of concatenating examples before the input, extending it by incorporating PT-like learning to refine the context embedding through iterative optimization, extracting deeper insights from the training examples. Our approach carefully modifies specific context tokens, considering the unique structure of the examples within the context. In addition to updating the context with PT-like optimization, CPT draws inspiration from adversarial attacks, adjusting the input based on the labels present in the context while preserving the inherent value of the user-provided data. To ensure robustness and stability during optimization, we employ a projected gradient descent algorithm, constraining token embeddings to remain close to their original values and safeguarding the quality of the context. Our method has demonstrated superior accuracy across multiple classification tasks using various LLM models, outperforming existing baselines and effectively addressing the overfitting challenge in few-shot learning.

View File

@ -4,7 +4,7 @@
## Introduction
[DoRA](https://arxiv.org/abs/2402.09353) is a novel approach that leverages low rank adaptation through weight decomposition analysis to investigate the inherent differences between full fine-tuning and LoRA. DoRA initially decomposes the pretrained weight into its magnitude and directional components and finetunes both of them. Because the directional component is large in terms of parameter numbers, we further decompose it with LoRA for efficient finetuning. This results in enhancing both the learning capacity and training stability of LoRA while avoiding any additional inference overhead.
[DoRA](https://huggingface.co/papers/2402.09353) is a novel approach that leverages low rank adaptation through weight decomposition analysis to investigate the inherent differences between full fine-tuning and LoRA. DoRA initially decomposes the pretrained weight into its magnitude and directional components and finetunes both of them. Because the directional component is large in terms of parameter numbers, we further decompose it with LoRA for efficient finetuning. This results in enhancing both the learning capacity and training stability of LoRA while avoiding any additional inference overhead.
## Quick start
```python

View File

@ -1,5 +1,5 @@
# EVA: Explained Variance Adaptation
## Introduction ([Paper](https://arxiv.org/abs/2410.07170), [code](https://github.com/ml-jku/EVA))
## Introduction ([Paper](https://huggingface.co/papers/2410.07170), [code](https://github.com/ml-jku/EVA))
Explained Variance Adaptation (EVA) is a novel initialization method for LoRA style adapters which initializes adapter weights in a data driven manner and adaptively allocates ranks according to the variance they explain. EVA improves average performance on a multitude of tasks across various domains, such as Language generation and understanding, Image classification, and Decision Making.
The abstract from the paper is:

View File

@ -22,7 +22,7 @@ HRA provides a new perspective connecting LoRA to OFT and achieves encouraging p
HRA adapts a pre-trained model by multiplying each frozen weight matrix with a chain of r learnable Householder reflections (HRs).
HRA can be interpreted as either an OFT adapter or an adaptive LoRA.
Consequently, it harnesses the advantages of both strategies, reducing parameters and computation costs while penalizing the loss of pre-training knowledge.
For further details on HRA, please consult the [original HRA paper](https://arxiv.org/abs/2405.17484).
For further details on HRA, please consult the [original HRA paper](https://huggingface.co/papers/2405.17484).
In this guide we provide a Dreambooth fine-tuning script that is available in [PEFT's GitHub repo examples](https://github.com/huggingface/peft/tree/main/examples/hra_dreambooth). This implementation is adapted from [peft's boft_dreambooth](https://github.com/huggingface/peft/tree/main/examples/boft_dreambooth).
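
Outside of the Dreambooth script, a minimal sketch of applying HRA to a `transformers` model looks like this (the model id, `r`, and target modules are illustrative; `r` is the length of the Householder reflection chain per adapted weight):

```python
# Sketch: wrap a model with an HRA adapter via PEFT. Values are illustrative.
from transformers import AutoModelForCausalLM
from peft import HRAConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
config = HRAConfig(r=8, target_modules=["q_proj", "v_proj"])
peft_model = get_peft_model(model, config)
```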

View File

@ -14,7 +14,7 @@
# limitations under the License.
# The implementation is based on "Bridging The Gap between Low-rank and Orthogonal
# Adaptation via Householder Reflection Adaptation" (https://arxiv.org/abs/2405.17484).
# Adaptation via Householder Reflection Adaptation" (https://huggingface.co/papers/2405.17484).
import hashlib
import itertools

View File

@ -4,9 +4,9 @@
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/peft/blob/main/examples/image_classification/image_classification_peft_lora.ipynb)
We provide a notebook (`image_classification_peft_lora.ipynb`) where we learn how to use [LoRA](https://arxiv.org/abs/2106.09685) from 🤗 PEFT to fine-tune an image classification model by ONLY using **0.7%** of the original trainable parameters of the model.
We provide a notebook (`image_classification_peft_lora.ipynb`) where we learn how to use [LoRA](https://huggingface.co/papers/2106.09685) from 🤗 PEFT to fine-tune an image classification model by ONLY using **0.7%** of the original trainable parameters of the model.
LoRA adds low-rank "update matrices" to certain blocks in the underlying model (in this case the attention blocks) and ONLY trains those matrices during fine-tuning. During inference, these update matrices are _merged_ with the original model parameters. For more details, check out the [original LoRA paper](https://arxiv.org/abs/2106.09685).
LoRA adds low-rank "update matrices" to certain blocks in the underlying model (in this case the attention blocks) and ONLY trains those matrices during fine-tuning. During inference, these update matrices are _merged_ with the original model parameters. For more details, check out the [original LoRA paper](https://huggingface.co/papers/2106.09685).
## PoolFormer model from timm

View File

@ -8,9 +8,9 @@
"source": [
"## Introduction\n",
"\n",
"In this notebook, we will learn how to use [LoRA](https://arxiv.org/abs/2106.09685) from 🤗 PEFT to fine-tune an image classification model by ONLY using **0.77%** of the original trainable parameters of the model. \n",
"In this notebook, we will learn how to use [LoRA](https://huggingface.co/papers/2106.09685) from 🤗 PEFT to fine-tune an image classification model by ONLY using **0.77%** of the original trainable parameters of the model. \n",
"\n",
"LoRA adds low-rank \"update matrices\" to certain blocks in the underlying model (in this case the attention blocks) and ONLY trains those matrices during fine-tuning. During inference, these update matrices are _merged_ with the original model parameters. For more details, check out the [original LoRA paper](https://arxiv.org/abs/2106.09685). \n",
"LoRA adds low-rank \"update matrices\" to certain blocks in the underlying model (in this case the attention blocks) and ONLY trains those matrices during fine-tuning. During inference, these update matrices are _merged_ with the original model parameters. For more details, check out the [original LoRA paper](https://huggingface.co/papers/2106.09685). \n",
"\n",
"Let's get started by installing the dependencies. \n",
"\n",

View File

@ -13,7 +13,7 @@
"id": "d041ecb4-6957-467e-8f3e-d4a12c674e9f",
"metadata": {},
"source": [
"This notebook shows how to apply [LoftQ](https://arxiv.org/abs/2310.08659) initialization on our QLoRA model.\n",
"This notebook shows how to apply [LoftQ](https://huggingface.co/papers/2310.08659) initialization on our QLoRA model.\n",
"\n",
"In short, the idea behind LoftQ is the following. When we use QLoRA, i.e. we quantize the base model with bitsandbytes to save memory, and then train LoRA weights on top of this base model, we expect a certain performance gap. This is partly due to the fact that quantization is onyl an approximation of the \"real\" weights and thus introduces a quantization error. By default, LoRA weights are initialized such that they are a no-op at the start of the training. However, we can instead initialize them so that they minimize the quantization error. This is the idea behind LoftQ.\n",
"\n",

View File

@ -46,7 +46,7 @@ peft_model = PeftModel.from_pretrained(
### Apply LoftQ and save
We provide [quantize_save_load.py](quantize_save_load.py) as an example to apply LoftQ with
different bits (`--bits`), ranks (`--rank`), and alternating steps (`--iter`, a hyper-parameter in LoftQ, see Algorithm 1 in the [LoftQ paper](https://arxiv.org/abs/2310.08659)). Currently, this example supports
different bits (`--bits`), ranks (`--rank`), and alternating steps (`--iter`, a hyper-parameter in LoftQ, see Algorithm 1 in the [LoftQ paper](https://huggingface.co/papers/2310.08659)). Currently, this example supports
`llama-2`, `falcon`, `mistral`, `bart`, `t5`, `deberta`, `bert`, `roberta`.
Below is an example of obtaining 4bit LLAMA-2-7b with 16-rank LoRA adapters by 5 alternating steps.

View File

@ -2,7 +2,7 @@
## Introduction
[LoRA-FA](https://arxiv.org/abs/2308.03303) is a novel parameter-efficient fine-tuning method which freezes the projection-down layer (matrix A) during the LoRA training process and thus leads to lower GPU memory consumption by eliminating the need to store the activations of the input tensors (X). Furthermore, LoRA-FA narrows the gap between the weight updates of the low-rank fine-tuning method and those of the full fine-tuning method. In conclusion, LoRA-FA reduces memory consumption and leads to superior performance compared to vanilla LoRA.
[LoRA-FA](https://huggingface.co/papers/2308.03303) is a novel parameter-efficient fine-tuning method which freezes the projection-down layer (matrix A) during the LoRA training process and thus leads to lower GPU memory consumption by eliminating the need to store the activations of the input tensors (X). Furthermore, LoRA-FA narrows the gap between the weight updates of the low-rank fine-tuning method and those of the full fine-tuning method. In conclusion, LoRA-FA reduces memory consumption and leads to superior performance compared to vanilla LoRA.
## Quick start
@ -110,6 +110,6 @@ Despite its advantages, LoRA-FA is inherently limited by its low-rank approximat
eprint={2308.03303},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2308.03303},
url={https://huggingface.co/papers/2308.03303},
}
```

View File

@ -1,7 +1,7 @@
# OLoRA: Orthonormal Low Rank Adaptation of Large Language Models
## Introduction
[OLoRA](https://arxiv.org/abs/2406.01775) is a novel approach that leverages orthonormal low rank adaptation through QR decomposition. Unlike the default LoRA implementation, OLoRA decomposes original weights into their $\mathbf{Q}$ and $\mathbf{R}$ parts, and then uses the first `rank` rows of $\mathbf{R}$ and the first `rank` columns of $\mathbf{Q}$ to initialize $\mathbf{A}$ and $\mathbf{B}$, respectively. This results in significantly faster convergence, more stable training, and superior performance.
[OLoRA](https://huggingface.co/papers/2406.01775) is a novel approach that leverages orthonormal low rank adaptation through QR decomposition. Unlike the default LoRA implementation, OLoRA decomposes original weights into their $\mathbf{Q}$ and $\mathbf{R}$ parts, and then uses the first `rank` rows of $\mathbf{R}$ and the first `rank` columns of $\mathbf{Q}$ to initialize $\mathbf{A}$ and $\mathbf{B}$, respectively. This results in significantly faster convergence, more stable training, and superior performance.
## Quick start
```python

View File

@ -1,5 +1,5 @@
# PiSSA: Principal Singular values and Singular vectors Adaptation
## Introduction ([Paper](https://arxiv.org/abs/2404.02948), [code](https://github.com/GraphPKU/PiSSA))
## Introduction ([Paper](https://huggingface.co/papers/2404.02948), [code](https://github.com/GraphPKU/PiSSA))
PiSSA represents a matrix $W\in\mathbb{R}^{m\times n}$ within the model by the product of two trainable matrices $A \in \mathbb{R}^{m\times r}$ and $B \in \mathbb{R}^{r\times n}$, where $r \ll \min(m, n)$, plus a residual matrix $W^{res}\in\mathbb{R}^{m\times n}$ for error correction. Singular value decomposition (SVD) is employed to factorize $W$, and the principal singular values and vectors of $W$ are utilized to initialize $A$ and $B$. The residual singular values and vectors initialize the residual matrix $W^{res}$, which remains frozen during fine-tuning. This straightforward modification allows PiSSA to converge more rapidly than LoRA and ultimately attain superior performance. Moreover, PiSSA reduces the quantization error compared to QLoRA, leading to further enhancements.
## Quick Start

View File

@ -2,6 +2,6 @@
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/peft/blob/main/examples/semantic_segmentation/semantic_segmentation_peft_lora.ipynb)
We provide a notebook (`semantic_segmentation_peft_lora.ipynb`) where we learn how to use [LoRA](https://arxiv.org/abs/2106.09685) from 🤗 PEFT to fine-tune a semantic segmentation model by ONLY using **14%** of the original trainable parameters of the model.
We provide a notebook (`semantic_segmentation_peft_lora.ipynb`) where we learn how to use [LoRA](https://huggingface.co/papers/2106.09685) from 🤗 PEFT to fine-tune a semantic segmentation model by ONLY using **14%** of the original trainable parameters of the model.
LoRA adds low-rank "update matrices" to certain blocks in the underlying model (in this case the attention blocks) and ONLY trains those matrices during fine-tuning. During inference, these update matrices are _merged_ with the original model parameters. For more details, check out the [original LoRA paper](https://arxiv.org/abs/2106.09685).
LoRA adds low-rank "update matrices" to certain blocks in the underlying model (in this case the attention blocks) and ONLY trains those matrices during fine-tuning. During inference, these update matrices are _merged_ with the original model parameters. For more details, check out the [original LoRA paper](https://huggingface.co/papers/2106.09685).

View File

@ -8,9 +8,9 @@
"source": [
"## Introduction\n",
"\n",
"In this notebook, we will learn how to use [LoRA](https://arxiv.org/abs/2106.09685) from 🤗 PEFT to fine-tune a SegFormer model variant for semantic segmentation by ONLY using **14%** of the original trainable parameters of the model. \n",
"In this notebook, we will learn how to use [LoRA](https://huggingface.co/papers/2106.09685) from 🤗 PEFT to fine-tune a SegFormer model variant for semantic segmentation by ONLY using **14%** of the original trainable parameters of the model. \n",
"\n",
"LoRA adds low-rank \"update matrices\" to certain blocks in the underlying model (in this case the attention blocks) and ONLY trains those matrices during fine-tuning. During inference, these update matrices are _merged_ with the original model parameters. For more details, check out the [original LoRA paper](https://arxiv.org/abs/2106.09685). \n",
"LoRA adds low-rank \"update matrices\" to certain blocks in the underlying model (in this case the attention blocks) and ONLY trains those matrices during fine-tuning. During inference, these update matrices are _merged_ with the original model parameters. For more details, check out the [original LoRA paper](https://huggingface.co/papers/2106.09685). \n",
"\n",
"Let's get started by installing the dependencies. "
]

View File

@ -20,9 +20,9 @@
"\n",
"In this notebook, we are going to fine-tune the LayoutLM model by Microsoft Research on the [FUNSD](https://guillaumejaume.github.io/FUNSD/) dataset, which is a collection of annotated form documents. The goal of our model is to learn the annotations of a number of labels (\"question\", \"answer\", \"header\" and \"other\") on those forms, such that it can be used to annotate unseen forms in the future.\n",
"\n",
"* Original LayoutLM paper: https://arxiv.org/abs/1912.13318\n",
"* Original LayoutLM paper: https://huggingface.co/papers/1912.13318\n",
"\n",
"* Original FUNSD paper: https://arxiv.org/abs/1905.13538\n"
"* Original FUNSD paper: https://huggingface.co/papers/1905.13538\n"
]
},
{

View File

@ -162,9 +162,9 @@ def rescale_adapter_scale(model, multiplier):
transformers and diffusers models that have directly loaded LoRA adapters.
For LoRA, applying this context manager with multiplier in [0, 1] is strictly equivalent to applying
[wise-ft](https://arxiv.org/abs/2109.01903) (see [#1940](https://github.com/huggingface/peft/issues/1940) for
details). It can improve the performance of the model if there is a distribution shift between the training data
used for fine-tuning and the test data used during inference.
[wise-ft](https://huggingface.co/papers/2109.01903) (see [#1940](https://github.com/huggingface/peft/issues/1940)
for details). It can improve the performance of the model if there is a distribution shift between the training
data used for fine-tuning and the test data used during inference.
Warning: It has been reported that when using Apple's MPS backend for PyTorch, it is necessary to add a short sleep
time after exiting the context before the scales are fully restored.
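
A usage sketch based on the signature above (assuming the helper is importable from `peft.helpers`; `model` and `inputs` are placeholders for an adapter-equipped model and tokenized inputs):

```python
# Sketch: temporarily scale a loaded LoRA adapter at inference time (wise-ft style).
from peft.helpers import rescale_adapter_scale

with rescale_adapter_scale(model, multiplier=0.5):  # 0.5 interpolates between base and fine-tuned behavior
    outputs = model.generate(**inputs)
```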

View File

@ -49,7 +49,7 @@ class LoraFAOptimizer(Optimizer):
closure (Callable, optional): A closure that reevaluates the model and returns the loss.
Reference:
- LoRA-FA: https://arxiv.org/abs/2308.03303
- LoRA-FA: https://huggingface.co/papers/2308.03303
"""
def __init__(

View File

@ -35,7 +35,7 @@ def create_loraplus_optimizer(
"""
Creates a LoraPlus optimizer.
Efficient Low Rank Adaptation of Large Models: https://arxiv.org/abs/2402.12354
Efficient Low Rank Adaptation of Large Models: https://huggingface.co/papers/2402.12354
Reference: https://github.com/nikhil-ghosh-berkeley/loraplus/

View File

@ -24,7 +24,7 @@ from .utils import is_adaption_prompt_trainable
class AdaptionPromptModel(nn.Module):
"""
Implements adaption prompts as described in https://arxiv.org/pdf/2303.16199.pdf.
Implements adaption prompts as described in https://huggingface.co/papers/2303.16199.
The top L attention modules are replaced with AdaptedAttention modules that wrap the original ones, but insert
trainable prompts with gates (for zero init).

View File

@ -13,7 +13,7 @@
# limitations under the License.
# The implementation is based on "Parameter-Efficient Orthogonal Finetuning
# via Butterfly Factorization" (https://arxiv.org/abs/2311.06243) in ICLR 2024.
# via Butterfly Factorization" (https://huggingface.co/papers/2311.06243) in ICLR 2024.
from __future__ import annotations

View File

@ -13,7 +13,7 @@
# limitations under the License.
# The implementation is based on "Parameter-Efficient Orthogonal Finetuning
# via Butterfly Factorization" (https://arxiv.org/abs/2311.06243) in ICLR 2024.
# via Butterfly Factorization" (https://huggingface.co/papers/2311.06243) in ICLR 2024.
from __future__ import annotations

View File

@ -13,7 +13,7 @@
# limitations under the License.
# The implementation is based on "Parameter-Efficient Orthogonal Finetuning
# via Butterfly Factorization" (https://arxiv.org/abs/2311.06243) in ICLR 2024.
# via Butterfly Factorization" (https://huggingface.co/papers/2311.06243) in ICLR 2024.
import warnings
from dataclasses import asdict
@ -42,8 +42,8 @@ from .layer import BOFTLayer, Conv2d, Linear
class BOFTModel(BaseTuner):
"""
Creates BOFT and OFT model from a pretrained transformers model. Paper: https://arxiv.org/abs/2311.06243
https://arxiv.org/abs/2306.07280
Creates BOFT and OFT model from a pretrained transformers model. Paper: https://huggingface.co/papers/2311.06243
https://huggingface.co/papers/2306.07280
Args:
model ([`transformers.PreTrainedModel`]): The model to be adapted.

View File

@ -35,7 +35,7 @@ from .layer import BoneLayer, BoneLinear
class BoneModel(BaseTuner):
"""
Creates a Block Affine Adaptation (Bone) model from a pretrained model. The method is described in
https://arxiv.org/abs/2409.15371
https://huggingface.co/papers/2409.15371
Args:
model (`torch.nn.Module`): The model to which the adapter tuner layers will be attached.

View File

@ -31,7 +31,7 @@ class CPTConfig(PromptLearningConfig):
- Loss weighting
- Projection settings
For more details, see the paper: https://arxiv.org/abs/2410.17222
For more details, see the paper: https://huggingface.co/papers/2410.17222
"""
# Token-related configurations

View File

@ -39,7 +39,7 @@ class FourierFTModel(BaseTuner):
"""
Creates FourierFT model from a pretrained transformers model.
The method is described in detail in https://arxiv.org/abs/2405.03003.
The method is described in detail in https://huggingface.co/papers/2405.03003.
Args:
model ([`torch.nn.Module`]): The model to be adapted.

View File

@ -35,7 +35,7 @@ from .layer import HRAConv2d, HRALayer, HRALinear
class HRAModel(BaseTuner):
"""
Creates Householder reflection adaptation (HRA) model from a pretrained model. The method is described in
https://arxiv.org/abs/2405.17484
https://huggingface.co/papers/2405.17484
Args:
model (`torch.nn.Module`): The model to which the adapter tuner layers will be attached.

View File

@ -39,7 +39,7 @@ from .layer import Conv2d, Conv3d, IA3Layer, Linear
class IA3Model(BaseTuner):
"""
Creates an Infused Adapter by Inhibiting and Amplifying Inner Activations ((IA)^3) model from a pretrained
transformers model. The method is described in detail in https://arxiv.org/abs/2205.05638
transformers model. The method is described in detail in https://huggingface.co/papers/2205.05638
Args:
model ([`~transformers.PreTrainedModel`]): The model to be adapted.

View File

@ -31,7 +31,7 @@ class LNTuningModel(BaseTuner):
"""
Creates LayerNorm tuning from a pretrained transformer model.
The method is described in detail in https://arxiv.org/abs/2312.11420.
The method is described in detail in https://huggingface.co/papers/2312.11420.
Args:
model ([`torch.nn.Module`]): The model to be adapted.

View File

@ -26,7 +26,7 @@ from .layer import Conv2d, Linear, LoHaLayer
class LoHaModel(LycorisTuner):
"""
Creates Low-Rank Hadamard Product model from a pretrained model. The method is partially described in
https://arxiv.org/abs/2108.06098. Current implementation heavily borrows from
https://huggingface.co/papers/2108.06098. Current implementation heavily borrows from
https://github.com/KohakuBlueleaf/LyCORIS/blob/eb460098187f752a5d66406d3affade6f0a07ece/lycoris/modules/loha.py
Args:

View File

@ -26,8 +26,8 @@ from .layer import Conv2d, Linear, LoKrLayer
class LoKrModel(LycorisTuner):
"""
Creates Low-Rank Kronecker Product model from a pretrained model. The original method is partially described in
https://arxiv.org/abs/2108.06098 and in https://arxiv.org/abs/2309.14859. Current implementation heavily borrows
from
https://huggingface.co/papers/2108.06098 and in https://huggingface.co/papers/2309.14859. Current implementation
heavily borrows from
https://github.com/KohakuBlueleaf/LyCORIS/blob/eb460098187f752a5d66406d3affade6f0a07ece/lycoris/modules/lokr.py
Args:

View File

@ -73,7 +73,7 @@ class LoftQConfig:
class EvaConfig:
"""
This is the sub-configuration class to store the configuration for a data-driven initialization via EVA. EVA was
introduced in <a href='https://arxiv.org/abs/2410.07170'>Explained Variance Adaptation</a>.
introduced in <a href='https://huggingface.co/papers/2410.07170'>Explained Variance Adaptation</a>.
Args:
rho (`float`):
@ -228,9 +228,9 @@ class LoraConfig(PeftConfig):
will be updated during training. Be aware that this means that, even when disabling the adapters, the model
will not produce the same output as the base model would have without adaptation.
use_rslora (`bool`):
When set to True, uses <a href='https://doi.org/10.48550/arXiv.2312.03732'>Rank-Stabilized LoRA</a> which
sets the adapter scaling factor to `lora_alpha/math.sqrt(r)`, since it was proven to work better.
Otherwise, it will use the original default value of `lora_alpha/r`.
When set to True, uses [Rank-Stabilized LoRA](https://huggingface.co/papers/2312.03732) which sets the
adapter scaling factor to `lora_alpha/math.sqrt(r)`, since it was proven to work better. Otherwise, it will
use the original default value of `lora_alpha/r`.
modules_to_save (`List[str]`):
List of modules apart from adapter layers to be set as trainable and saved in the final checkpoint.
init_lora_weights (`bool` | `Literal["gaussian", "eva", "olora", "pissa", "pissa_niter_[number of iters]", "corda", "loftq"]`):
@ -240,19 +240,20 @@ class LoraConfig(PeftConfig):
False leads to random initialization of LoRA A and B, meaning that LoRA is not a no-op before training;
this setting is intended for debugging purposes. Passing 'gaussian' results in Gaussian initialization
scaled by the LoRA rank for linear and layers. Pass `'loftq'` to use LoftQ initialization. Passing `'eva'`
results in a data-driven initialization of <a href='https://arxiv.org/abs/2410.07170' >Explained Variance
Adaptation</a>. EVA initializes LoRA based on the SVD of layer input activations and achieves SOTA
results in a data-driven initialization of <a href='https://huggingface.co/papers/2410.07170' >Explained
Variance Adaptation</a>. EVA initializes LoRA based on the SVD of layer input activations and achieves SOTA
performance due to its ability to adapt to the finetuning data. Pass `'olora'` to use OLoRA initialization.
Passing `'pissa'` results in the initialization of <a href='https://arxiv.org/abs/2404.02948' >Principal
Singular values and Singular vectors Adaptation (PiSSA)</a>, which converges more rapidly than LoRA and
ultimately achieves superior performance. Moreover, PiSSA reduces the quantization error compared to QLoRA,
leading to further enhancements. Passing `'pissa_niter_[number of iters]'` initiates Fast-SVD-based PiSSA
initialization, where `[number of iters]` indicates the number of subspace iterations to perform FSVD, and
must be a nonnegative integer. When `[number of iters]` is set to 16, it can complete the initialization of
a 7B model within seconds, and the training effect is approximately equivalent to using SVD. Passing
`'corda'` results in the initialization of <a href='https://arxiv.org/abs/2406.05223' >Context-Oriented
Decomposition Adaptation</a>, which converges even more rapidly than PiSSA in Instruction-Previewed Mode,
and preserves world knowledge better than LoRA in Knowledge-Preserved Mode.
Passing `'pissa'` results in the initialization of <a href='https://huggingface.co/papers/2404.02948'
>Principal Singular values and Singular vectors Adaptation (PiSSA)</a>, which converges more rapidly than
LoRA and ultimately achieves superior performance. Moreover, PiSSA reduces the quantization error compared
to QLoRA, leading to further enhancements. Passing `'pissa_niter_[number of iters]'` initiates
Fast-SVD-based PiSSA initialization, where `[number of iters]` indicates the number of subspace iterations
to perform FSVD, and must be a nonnegative integer. When `[number of iters]` is set to 16, it can complete
the initialization of a 7B model within seconds, and the training effect is approximately equivalent to
using SVD. Passing `'corda'` results in the initialization of <a
href='https://huggingface.co/papers/2406.05223' >Context-Oriented Decomposition Adaptation</a>, which
converges even more rapidly than PiSSA in Instruction-Previewed Mode, and preserves world knowledge better
than LoRA in Knowledge-Preserved Mode.
layers_to_transform (`Union[List[int], int]`):
The layer indices to transform. If a list of ints is passed, it will apply the adapter to the layer indices
that are specified in this list. If a single integer is passed, it will apply the transformations on the
@ -296,7 +297,7 @@ class LoraConfig(PeftConfig):
handled by a separate learnable parameter. This can improve the performance of LoRA especially at low
ranks. Right now, DoRA only supports linear and Conv2D layers. DoRA introduces a bigger overhead than pure
LoRA, so it is recommended to merge weights for inference. For more information, see
https://arxiv.org/abs/2402.09353.
https://huggingface.co/papers/2402.09353.
layer_replication (`List[Tuple[int, int]]`):
Build a new stack of layers by stacking the original model layers according to the ranges specified. This
allows expanding (or shrinking) the model without duplicating the base model weights. The new layers will
@ -340,7 +341,7 @@ class LoraConfig(PeftConfig):
default=False,
metadata={
"help": (
"When set to True, uses <a href='https://doi.org/10.48550/arXiv.2312.03732'>Rank-Stabilized LoRA</a>"
"When set to True, uses [Rank-Stabilized LoRA](https://huggingface.co/papers/2312.03732)"
" which sets the adapter scaling factor to `lora_alpha/math.sqrt(r)`, since it"
" was proven to work better. Otherwise, it will use the original default"
" value of `lora_alpha/r`."
@ -485,7 +486,7 @@ class LoraConfig(PeftConfig):
default=False,
metadata={
"help": (
"Enable <a href='https://arxiv.org/abs/2402.09353'>'Weight-Decomposed Low-Rank Adaptation' (DoRA)</a>. This technique decomposes the updates of the "
"Enable <a href='https://huggingface.co/papers/2402.09353'>'Weight-Decomposed Low-Rank Adaptation' (DoRA)</a>. This technique decomposes the updates of the "
"weights into two parts, magnitude and direction. Direction is handled by normal LoRA, whereas the "
"magnitude is handled by a separate learnable parameter. This can improve the performance of LoRA, "
"especially at low ranks. Right now, DoRA only supports linear and Conv2D layers. DoRA introduces a bigger"

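A hedged usage sketch for the `init_lora_weights` options enumerated in the `LoraConfig` docstring above; the target module names are placeholders, not a recommendation.

```python
# Selecting one of the documented initialization schemes via LoraConfig.
# Module names are illustrative; use ones that exist in your model.
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    init_lora_weights="pissa_niter_16",   # Fast-SVD PiSSA with 16 subspace iterations
)
```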
View File

@ -13,7 +13,7 @@
# limitations under the License.
# Reference code: https://github.com/iboing/CorDA/blob/main/cordalib/decomposition.py
# Reference paper: https://arxiv.org/abs/2406.05223
# Reference paper: https://huggingface.co/papers/2406.05223
import os
from collections.abc import Iterable

View File

@ -76,7 +76,7 @@ class DoraLinearLayer(nn.Module):
weight = dequantize_module_weight(base_layer)
weight = weight.to(x.dtype)
weight_norm = self.get_weight_norm(weight, lora_weight.detach(), scaling)
# see section 4.3 of DoRA (https://arxiv.org/abs/2402.09353)
# see section 4.3 of DoRA (https://huggingface.co/papers/2402.09353)
# "[...] we suggest treating ||V +∆V ||_c in
# Eq. (5) as a constant, thereby detaching it from the gradient
# graph. This means that while ||V + ∆V ||_c dynamically
@ -114,7 +114,7 @@ class DoraEmbeddingLayer(DoraLinearLayer):
magnitude = self.weight
weight = base_layer.weight
weight_norm = self.get_weight_norm(weight, lora_weight.detach(), scaling)
# see section 4.3 of DoRA (https://arxiv.org/abs/2402.09353)
# see section 4.3 of DoRA (https://huggingface.co/papers/2402.09353)
# "[...] we suggest treating ||V +∆V ||_c in
# Eq. (5) as a constant, thereby detaching it from the gradient
# graph. This means that while ||V + ∆V ||_c dynamically
@ -149,7 +149,7 @@ class _DoraConvNdLayer(DoraLinearLayer):
lora_weight = lora_weight.reshape(weight.shape)
magnitude = self.weight
weight_norm = self.get_weight_norm(weight, lora_weight.detach(), scaling)
# see section 4.3 of DoRA (https://arxiv.org/abs/2402.09353)
# see section 4.3 of DoRA (https://huggingface.co/papers/2402.09353)
# "[...] we suggest treating ||V +∆V ||_c in
# Eq. (5) as a constant, thereby detaching it from the gradient
# graph. This means that while ||V + ∆V ||_c dynamically

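A minimal sketch of the magnitude/direction split that the DoRA comments above quote, with the column norm detached as in Sec. 4.3 of the paper; names are illustrative and do not mirror `DoraLinearLayer` exactly.

```python
# DoRA-style update (https://huggingface.co/papers/2402.09353): direction from
# the LoRA-adapted weight, magnitude from a separate learnable vector, with
# ||V + dV||_c detached from the gradient graph as quoted above.
import torch

d_out, d_in, r, scaling = 8, 4, 2, 1.0
weight = torch.randn(d_out, d_in)                    # frozen base weight V
lora_B = torch.randn(d_out, r, requires_grad=True)
lora_A = torch.randn(r, d_in, requires_grad=True)
magnitude = torch.ones(d_out, requires_grad=True)    # learnable magnitude m

adapted = weight + scaling * (lora_B @ lora_A)       # V + dV
weight_norm = adapted.norm(p=2, dim=1).detach()      # treated as a constant
new_weight = magnitude.unsqueeze(1) * adapted / weight_norm.unsqueeze(1)
```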
View File

@ -68,7 +68,7 @@ class LoraModel(BaseTuner):
"""
Creates Low Rank Adapter (LoRA) model from a pretrained transformers model.
The method is described in detail in https://arxiv.org/abs/2106.09685.
The method is described in detail in https://huggingface.co/papers/2106.09685.
Args:
model ([`torch.nn.Module`]): The model to be adapted.

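For reference alongside the variants above, a minimal sketch of the plain low-rank update the `LoraModel` docstring points to; names are illustrative, not peft's internal layout.

```python
# Plain LoRA update (https://huggingface.co/papers/2106.09685): a frozen weight
# plus a scaled rank-r product of two trainable matrices.
import torch

d_out, d_in, r, lora_alpha = 64, 32, 8, 16
base_weight = torch.randn(d_out, d_in)    # frozen pretrained weight
lora_A = torch.randn(r, d_in) * 0.01      # trainable down-projection
lora_B = torch.zeros(d_out, r)            # trainable up-projection, zero-initialized

effective_weight = base_weight + (lora_alpha / r) * (lora_B @ lora_A)
```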
View File

@ -21,7 +21,7 @@ from peft.utils.save_and_load import torch_load
from .config import MultitaskPromptTuningConfig, MultitaskPromptTuningInit
# This code is adapted for the paper: https://arxiv.org/abs/2303.02861 and
# This code is adapted for the paper: https://huggingface.co/papers/2303.02861 and
# constitutes the work done at MIT-IBM Watson Research Lab.

View File

@ -40,7 +40,7 @@ from .layer import Conv2d, Linear, OFTLayer
class OFTModel(BaseTuner):
"""
Creates Orthogonal Finetuning model from a pretrained model. The method is described in
https://arxiv.org/abs/2306.07280
https://huggingface.co/papers/2306.07280
Args:
model (`torch.nn.Module`): The model to which the adapter tuner layers will be attached.

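A minimal sketch of a multiplicative orthogonal weight transform in the spirit of the `OFTModel` docstring above, assuming a Cayley parametrization; it does not mirror peft's block-diagonal `OFTLayer` layout.

```python
# Orthogonal finetuning sketch (https://huggingface.co/papers/2306.07280):
# rotate the frozen weight with an orthogonal matrix built from a trainable
# skew-symmetric generator (Cayley transform).
import torch

d_out, d_in = 16, 8
weight = torch.randn(d_out, d_in)                        # frozen pretrained weight

s = torch.randn(d_out, d_out, requires_grad=True)
skew = s - s.T                                           # skew-symmetric generator
eye = torch.eye(d_out)
rotation = (eye - skew) @ torch.linalg.inv(eye + skew)   # orthogonal by construction

adapted_weight = rotation @ weight                       # multiplicative update
```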
View File

@ -25,8 +25,8 @@ from peft.utils import PeftType
class PolyConfig(PeftConfig):
"""
This is the configuration class to store the configuration of a [`PolyModel`].
- [Polytropon (Poly)](https://arxiv.org/abs/2202.13914)
- [Multi-Head Routing (MHR)](https://arxiv.org/abs/2211.03831)
- [Polytropon (Poly)](https://huggingface.co/papers/2202.13914)
- [Multi-Head Routing (MHR)](https://huggingface.co/papers/2211.03831)
Args:
r (`int`): Attention dimension of each Lora in Poly.

View File

@ -25,7 +25,7 @@ class RandLoraConfig(PeftConfig):
"""
This is the configuration class to store the configuration of a [`RandLoraModel`].
Paper: https://arxiv.org/pdf/2502.00987.
Paper: https://huggingface.co/papers/2502.00987.
Args:
r (`int`, *optional*, defaults to `32`):
@ -44,9 +44,9 @@ class RandLoraConfig(PeftConfig):
Whether to use sparse random bases as described in the RandLora paper. The bases are ternary sparse bases
(only containing -1, 0 and 1) where the attribution probability is 1/6 for -1 and 1 and 2/3 for 0. These
sparse matrices aim to be used for matmul free computation in the future, see
https://arxiv.org/pdf/2406.02528v1 The current implementation is a proof of concept however where the
sparseness is not used to improve speed or memory usage. Using sparse matrices typically does not reduce
performance and can even help reduce overfitting. Defaults to `False`.
https://huggingface.co/papers/2406.02528v1 The current implementation is a proof of concept however where
the sparseness is not used to improve speed or memory usage. Using sparse matrices typically does not
reduce performance and can even help reduce overfitting. Defaults to `False`.
very_sparse (`bool`):
Whether to use highly sparse random bases as described in the RandLora paper. The very sparse bases are
ternary sparse bases (only containing -1, 0 and 1) given a matrix with smallest dimension d, the

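A small sketch of the ternary sparse basis described in the `sparse` help text above, with entries -1 and 1 drawn with probability 1/6 each and 0 with probability 2/3; the helper name is illustrative and not part of peft's RandLora code.

```python
# Ternary sparse random basis: P(-1) = P(1) = 1/6, P(0) = 2/3.
import torch

def sparse_ternary_basis(rows: int, cols: int) -> torch.Tensor:
    probs = torch.tensor([1 / 6, 2 / 3, 1 / 6])   # P(-1), P(0), P(1)
    values = torch.tensor([-1.0, 0.0, 1.0])
    idx = torch.multinomial(probs, rows * cols, replacement=True)
    return values[idx].reshape(rows, cols)

basis = sparse_ternary_basis(64, 32)
print((basis == 0).float().mean())                # roughly 2/3 of entries are zero
```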
View File

@ -26,7 +26,7 @@ class VBLoRAConfig(PeftConfig):
"""
This is the configuration class to store the configuration of a [`VBLoRAConfig`].
Paper: https://arxiv.org/abs/2405.15179
Paper: https://huggingface.co/papers/2405.15179
Args:
r (`int`):

View File

@ -34,7 +34,7 @@ class VBLoRAModel(BaseTuner):
"""
Creates VBLoRA model from a pretrained transformers model.
The method is described in detail in https://arxiv.org/abs/2405.15179.
The method is described in detail in https://huggingface.co/papers/2405.15179.
Args:
model ([`~transformers.PreTrainedModel`]): The model to be adapted.

View File

@ -26,7 +26,7 @@ class VeraConfig(PeftConfig):
"""
This is the configuration class to store the configuration of a [`VeraModel`].
Paper: https://arxiv.org/abs/2310.11454.
Paper: https://huggingface.co/papers/2310.11454.
Args:
r (`int`, *optional*, defaults to `256`):

View File

@ -158,7 +158,7 @@ class XLoraModel(BaseTuner):
Creates an X-LoRA (Mixture of LoRA experts), model from a pretrained transformers model. Currently, this X-LoRA
implementation only works with models with a transformer architecture.
The method is described in detail in https://arxiv.org/abs/2402.07148.
The method is described in detail in https://huggingface.co/papers/2402.07148.
Args:
model ([`torch.nn.Module`]): The model to be adapted.

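A minimal sketch of mixing several frozen LoRA expert paths, in the spirit of the "Mixture of LoRA experts" name in the `XLoraModel` docstring above; the softmax gate and scalar-per-expert scalings are assumptions here, not the actual `XLoraModel` API.

```python
# Mixture-of-LoRA-experts sketch (https://huggingface.co/papers/2402.07148):
# a trainable gate produces per-expert scalings that weight each frozen
# expert's LoRA contribution.
import torch

d_out, d_in, r, n_experts = 16, 8, 4, 3
x = torch.randn(2, d_in)                                  # small input batch

experts = [(torch.randn(d_out, r), torch.randn(r, d_in)) for _ in range(n_experts)]
gate = torch.nn.Linear(d_in, n_experts)                   # the only trainable part here

scalings = torch.softmax(gate(x), dim=-1)                 # per-sample expert weights
delta = sum(
    scalings[:, i : i + 1] * (x @ A.T @ B.T)              # scale each expert's LoRA path
    for i, (B, A) in enumerate(experts)
)
assert delta.shape == (2, d_out)
```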
View File

@ -13,7 +13,7 @@
# limitations under the License.
# Reference code: https://github.com/yxli2123/LoftQ/blob/main/utils.py
# Reference paper: https://arxiv.org/abs/2310.08659
# Reference paper: https://huggingface.co/papers/2310.08659
from __future__ import annotations