mirror of https://github.com/huggingface/peft.git
DOC Update of Bone/Bat/DiSHA docs (#2312)
@@ -122,12 +122,12 @@ HRA constructs a chain of `r` trainable Householder reflections (HRs). Because t
The higher `r`, the more trainable parameters, resulting in a larger model capacity and better performance. Besides, due to the chain structure, the orthogonality of HR planes impacts the capacity and regularity of HRA. To achieve a trade-off between the model capacity and regularity, an orthogonality regularizer of the HR planes is added to the loss function. The weight \\(\lambda\\) can control the strength of the regularizer.
## Bone
[Bone](https://huggingface.co/papers/2409.15371) is a novel PEFT technique distinct from LoRA, called Block-Affine Adaptation (Bone). By dividing the original weights into multiple subspaces that share a single matrix for weight updates, Bone simplifies the process: the trainable matrix is initialized to zero, eliminating the need for the complex initialization used in some LoRA variants. Compared to LoRA, Bone significantly reduces memory usage and achieves faster computation.
[DiSHA](https://huggingface.co/papers/2409.15371) is a novel PEFT technique distinct from LoRA, called Dimension-Sharding Adaptation (DiSHA). By dividing the original weights into multiple subspaces that share a single matrix for weight updates, DiSHA simplifies the process: the trainable matrix is initialized to zero, eliminating the need for the complex initialization used in some LoRA variants. Bone and Bat are derivative structures of DiSHA: Bone significantly improves computational efficiency while saving memory, whereas Bat addresses the limitation of Bone's linear update by employing a non-linear update to break through its performance upper bound.
<small><a href="https://huggingface.co/papers/2409.15371">Bone: Block-Affine Adaptation of Large Language Models</a></small>
<small><a href="https://huggingface.co/papers/2409.15371">DiSHA: Dimension-Sharding Adaptation with Fast Convergence and Fast Computation</a></small>
Intuitively, the shape of the single trainable matrix in Bone matches that of `lora_B`, so for the same `r` a Bone layer has `in_features * r` fewer trainable parameters than the corresponding LoRA layer.
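As a quick check on these parameter counts, here is a minimal sketch in plain Python; the layer sizes are hypothetical and chosen only for illustration:

```python
# Hypothetical layer sizes, chosen only for illustration.
in_features, out_features, r = 4096, 4096, 64

# LoRA trains lora_A (r x in_features) and lora_B (out_features x r).
lora_params = r * in_features + out_features * r

# Bone trains a single matrix shaped like lora_B (out_features x r),
# i.e. in_features * r fewer parameters than LoRA at the same r.
bone_params = out_features * r

print(lora_params, bone_params)                       # 524288 262144
print(lora_params - bone_params == in_features * r)   # True
print(bone_params * 2 == lora_params)                 # True when in_features == out_features
```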
Bat is a new structure composed of Bone and "Weight Guide"; you can use it by setting `init_weights="bat"`. Bat reduces the number of trainable parameters by using a block grouping method, and the block computation of the weight W effectively promotes information exchange in the original weights, enhancing data fitting capability during fine-tuning. The experiments control the size of the trainable parameters through b (block size), similar to r (rank) in LoRA; for consistency within PEFT, we also name b as r. Note: Bat's r (b) is special and requires that the weight W satisfies the conditions `in_features % r == 0` and `out_features % r == 0`. Additionally, when `in_features == out_features` and Bone-r equals LoRA-r, Bone's number of trainable parameters is only half that of LoRA.
Note: Bat's r (b) is special and requires that the weight W satisfies the conditions `in_features % r == 0` and `out_features % r == 0`. Additionally, when `in_features == out_features` and Bone-r equals LoRA-r, Bone's number of trainable parameters is only half that of LoRA.
Bat currently has some issues: it is slower than LoRA and requires checkpointing to address excessive memory usage due to intermediate values, which further reduces training speed. We plan to address this in the future. Contributions are welcome.
Although the nonlinear updates of Bat bring some performance improvements, they also increase computational overhead; its main purpose is to give researchers a direction for further improvement. Therefore, for most use cases we recommend fine-tuning with the more well-rounded Bone structure instead.
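To make the choice between Bone and Bat concrete, here is a minimal sketch using `BoneConfig`; the model name, `r` value, and target modules are illustrative assumptions, and the targeted layer shapes must satisfy the divisibility requirement noted above:

```python
from transformers import AutoModelForCausalLM
from peft import BoneConfig, get_peft_model

# Illustrative base model; any causal LM whose Linear layers you target will do.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Plain Bone: linear block-affine update, trainable matrix initialized to zero.
bone_config = BoneConfig(r=64, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

# Bat: the non-linear "Weight Guide" variant, selected via init_weights.
# Every targeted weight must satisfy in_features % r == 0 and out_features % r == 0.
bat_config = BoneConfig(r=64, init_weights="bat", target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

model = get_peft_model(base_model, bone_config)  # swap in bat_config to try Bat
model.print_trainable_parameters()
```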
@@ -16,11 +16,13 @@ rendered properly in your Markdown viewer.
# Bone
Block-Affine Adaptation ([Bone](https://huggingface.co/papers/2409.15371)): inspired by GQA and MQA, this work leverages the sparsity of LLM weights to design the Block-Affine Adaptation (Bone) structure. In Bone, the original weights are divided into multiple sub-spaces, all of which share a single low-rank matrix initialized to zero for updates. Extensive experiments demonstrate that Bone consistently outperforms LoRA and its variants across various tasks, while also offering superior computational efficiency.
DiSHA: Dimension-Sharding Adaptation ([DiSHA](https://huggingface.co/papers/2409.15371)): we introduce Dimension-Sharding Adaptation (DiSHA), which expands the PEFT design space to unlock lower intrinsic ranks and faster convergence by default. Building on DiSHA, we propose an efficient structure called Block-Affine Adaptation (Bone) and a non-linear update method called Block Affine Transformation Adaptation (BAT).
The abstract from the paper is:
Low-Rank Adaptation (LoRA) has achieved remarkable training results by freezing the original weights and training only low-rank matrices, establishing itself as the predominant fine-tuning method for LLMs. In pursuit of performance closer to full-parameter training, a series of LoRA variants have emerged, such as LoRA+, PISSA, Olora, and LoRA-GA. This paper introduces a novel PEFT technique distinct from LoRA, called Block-Affine Adaptation (Bone). By dividing the original weights into multiple subspaces that share a single matrix for weight updates, Bone simplifies the process by requiring the trainable matrix to be initialized to zero, eliminating the need for complex initialization as in some LoRA variants. Compared to LoRA, Bone significantly reduces memory usage and achieves faster computation. Evaluation of both NLU and NLG tasks demonstrates that Bone substantially outperforms LoRA and its variants. Inspired by Pissa, we further proposed the 'Weight Guide' theory to better utilize the information from the original weights. By integrating 'Weight Guide' with Bone, we developed a new structure called Block-Affine Transformation (Bat), and ablation experiments confirmed the effectiveness of 'Weight Guide'.
Low-Rank Adaptation (LoRA) leverages the low intrinsic rank of weight updates in Large Language Models (LLMs), establishing a Parameter-Efficient Fine-Tuning (PEFT) paradigm. However, LoRA suffers from slow convergence. We introduce Dimension-Sharding Adaptation (DiSHA), which expands the PEFT design space to unlock lower intrinsic ranks and faster convergence by default. Within DiSHA's design space, we propose Block Affine Adaptation (Bone), a computationally efficient structure that delivers both high performance and efficiency. While certain DiSHA configurations may result in colinear updates to weight shards, we address this with Block Affine Transformation Adaptation (BAT), a nonlinear variant of DiSHA. BAT introduces nonlinearity by combining trainable matrices with original weight shards in a nonlinear manner, inducing nonlinearity in matrix updates without introducing additional parameters. Empirical results show that Bone, under the DiSHA framework, consistently outperforms LoRA variants in both NLG and NLU tasks, with significantly improved computational efficiency. Further analysis demonstrates that BAT enhances model capabilities by leveraging its nonlinear design.
## BoneConfig
@@ -1,6 +1,7 @@
# BONE: BLOCK AFFINE TRANSFORMATION AS PARAMETER EFFICIENT FINE-TUNING METHODS FOR LARGE LANGUAGE MODELS
## Introduction ([Paper](https://arxiv.org/pdf/2409.15371), [code](https://github.com/JL-er/Bone))
Low-Rank Adaptation (LoRA) has achieved remarkable training results by freezing the original weights and training only low-rank matrices, establishing itself as the predominant fine-tuning method for LLMs. In pursuit of performance closer to full-parameter training, a series of LoRA variants have emerged, such as LoRA+, PISSA, Olora, and LoRA-GA. This paper introduces a novel PEFT technique distinct from LoRA, called Block-Affine Adaptation (Bone). By dividing the original weights into multiple subspaces that share a single matrix for weight updates, Bone simplifies the process by requiring the trainable matrix to be initialized to zero, eliminating the need for complex initialization as in some LoRA variants. Compared to LoRA, Bone significantly reduces memory usage and achieves faster computation. Evaluation of both NLU and NLG tasks demonstrates that Bone substantially outperforms LoRA and its variants. Inspired by Pissa, we further proposed the 'Weight Guide' theory to better utilize the information from the original weights. By integrating 'Weight Guide' with Bone, we developed a new structure called Block-Affine Transformation (Bat), and ablation experiments confirmed the effectiveness of 'Weight Guide'.
# DiSHA: Dimension-Sharding Adaptation with Fast Convergence and Fast Computation
## Introduction ([Paper](https://arxiv.org/pdf/2409.15371), [code](https://github.com/JL-er/DiSHA))
Low-Rank Adaptation (LoRA) leverages the low intrinsic rank of weight updates in Large Language Models (LLMs), establishing a Parameter-Efficient Fine-Tuning (PEFT) paradigm. However, LoRA suffers from slow convergence. We introduce Dimension-Sharding Adaptation (DiSHA), which expands the PEFT design space to unlock lower intrinsic ranks and faster convergence by default. Within DiSHA's design space, we propose Block Affine Adaptation (Bone), a computationally efficient structure that delivers both high performance and efficiency. While certain DiSHA configurations may result in colinear updates to weight shards, we address this with Block Affine Transformation Adaptation (BAT), a nonlinear variant of DiSHA. BAT introduces nonlinearity by combining trainable matrices with original weight shards in a nonlinear manner, inducing nonlinearity in matrix updates without introducing additional parameters. Empirical results show that Bone, under the DiSHA framework, consistently outperforms LoRA variants in both NLG and NLU tasks, with significantly improved computational efficiency. Further analysis demonstrates that BAT enhances model capabilities by leveraging its nonlinear design.
## Quick Start
```python
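# A minimal quick-start sketch, assuming a Hugging Face causal LM; the model id,
# block size r, and target modules are illustrative, not prescriptive.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import BoneConfig, get_peft_model

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# r is the DiSHA/Bone block size, named r for consistency with LoRA inside PEFT.
bone_config = BoneConfig(r=64, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
peft_model = get_peft_model(model, bone_config)
peft_model.print_trainable_parameters()

# ... run your usual training loop, then persist only the adapter weights:
peft_model.save_pretrained("bone-llama-2-7b")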
@@ -84,10 +85,10 @@ python bone_finetuning.py \
# Citation
```bib
@misc{kang2024boneblockaffineadaptationlarge,
title={Bone: Block-Affine Adaptation of Large Language Models},
@misc{kang2025dishadimensionshardingadaptationlarge,
title={DiSHA: Dimension-Sharding Adaptation of Large Language Models with Fast Convergence and Fast Computation},
author={Jiale Kang},
year={2024},
year={2025},
eprint={2409.15371},
archivePrefix={arXiv},
primaryClass={cs.CL},