LoRA: Low-Rank Adaptation of Large Language Models

Published: 17 June 2021
Authors: Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
Institution: Microsoft
arXiv Abstract: [1]
arXiv PDF: [2]

The paper LoRA: Low-Rank Adaptation of Large Language Models proposes a method, Low-Rank Adaptation (LoRA), for efficiently adapting large pre-trained language models to downstream tasks.

Introduction[edit | edit source]

Pre-training language models on large amounts of text data and then fine-tuning them on downstream tasks has become a popular paradigm in NLP. However, as the size of pre-trained models increases (e.g. GPT-3 with 175 billion parameters), full fine-tuning becomes prohibitively expensive.

LoRA allows efficient adaptation by keeping the weights of the pre-trained model frozen, and injecting small trainable low-rank matrices into each layer. This greatly reduces the number of trainable parameters compared to full fine-tuning.
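
To make the idea concrete, the following is a minimal sketch of such an injected low-rank adapter in PyTorch. It is an illustration of the technique, not the paper's reference implementation, and the class and parameter names (LoRALinear, lora_A, lora_B) are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base_linear: nn.Linear, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base_linear
        self.base.weight.requires_grad_(False)        # pre-trained weight stays frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)

        d, k = base_linear.out_features, base_linear.in_features
        # A starts as a small Gaussian, B as zeros, so the update is zero at initialization.
        self.lora_A = nn.Parameter(torch.randn(r, k) * 0.01)   # shape (r, k)
        self.lora_B = nn.Parameter(torch.zeros(d, r))          # shape (d, r)
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus low-rank path: y = x W0^T + scaling * x (BA)^T
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```

Only lora_A and lora_B are handed to the optimizer; the base weights never change, so one pre-trained model can be shared across many task-specific adapters.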

Method[edit | edit source]

For a pretrained weight matrix W₀ ∈ ℝ^(d×k), LoRA represents the update as a low-rank decomposition W₀ + ΔW = W₀ + BA, where B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), and the rank r ≪ min(d, k).

  • W₀ is frozen during training; only A and B receive gradient updates.
  • Full fine-tuning expressiveness can be recovered by setting r to the rank of W₀.
  • No additional inference latency, since W = W₀ + BA can be precomputed and merged into the weights before deployment (see the sketch after this list).
  • Reduces GPU memory usage during training by up to 3x, because optimizer states are not stored for the frozen weights.
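
Because ΔW = BA has the same shape as W₀, the update can be folded into the frozen weight once training is done, which is why no extra latency is paid at inference time. A minimal sketch of the merge (the function name and uniform scaling factor are assumptions, mirroring the LoRALinear sketch above):

```python
import torch

@torch.no_grad()
def merge_lora_weights(w0: torch.Tensor, lora_A: torch.Tensor, lora_B: torch.Tensor,
                       scaling: float = 1.0) -> torch.Tensor:
    """Fold the low-rank update into the frozen weight: W = W0 + scaling * (B @ A).

    w0:      (d, k) frozen pre-trained weight
    lora_A:  (r, k) trainable down-projection
    lora_B:  (d, r) trainable up-projection
    """
    return w0 + scaling * (lora_B @ lora_A)

# Rough parameter arithmetic at GPT-3 scale (d = k = 12288, r = 4):
# trainable parameters per adapted matrix = d*r + r*k = 2 * 12288 * 4 ≈ 98K,
# versus d*k ≈ 151M parameters in the frozen matrix itself.
```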

In the paper, LoRA is applied to the weight matrices of the Transformer self-attention module; most experiments adapt only the query and value projections.
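
Wiring this into a Transformer amounts to wrapping the chosen projection layers. A sketch using the hypothetical LoRALinear module from above; the attribute names (transformer_blocks, attn.q_proj, attn.v_proj) are placeholders, since real architectures name these differently:

```python
import torch.nn as nn

def add_lora_to_attention(model: nn.Module, r: int = 4, alpha: float = 8.0) -> nn.Module:
    """Wrap the query and value projections of every attention block with LoRA adapters."""
    for block in model.transformer_blocks:             # hypothetical attribute name
        block.attn.q_proj = LoRALinear(block.attn.q_proj, r=r, alpha=alpha)
        block.attn.v_proj = LoRALinear(block.attn.v_proj, r=r, alpha=alpha)
    return model
```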

Experiments[edit | edit source]

LoRA was evaluated on RoBERTa, DeBERTa, GPT-2, and GPT-3 on tasks including GLUE, summarization, and text generation.

  • Matches or exceeds the performance of full fine-tuning despite having orders of magnitude fewer parameters.
  • Outperforms adapter layers and prefix tuning baselines.
  • Reduces the per-task GPT-3 checkpoint size from 350 GB to 35 MB, a roughly 10,000x reduction, since only the low-rank matrices need to be stored for each task (see the sketch after this list).
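
The checkpoint-size reduction follows from storing only the low-rank matrices for each task while the frozen base model is kept once and shared. A rough sketch, assuming the LoRA parameters carry a "lora_" prefix as in the earlier sketch:

```python
import torch
import torch.nn as nn

def save_lora_checkpoint(model: nn.Module, path: str) -> None:
    """Persist only the trainable LoRA parameters; the frozen base weights are shared across tasks."""
    lora_state = {name: tensor for name, tensor in model.state_dict().items()
                  if "lora_" in name}
    torch.save(lora_state, path)
```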

Analysis[edit | edit source]

Further analysis provides insights into the low-rank adaptation:

  • Adapting query and value matrices is better than just query or just value.
  • Very low rank (e.g. 1-4) is sufficient for good performance.
  • Learned low-rank matrices amplify directions not emphasized during pre-training.
  • Magnitude of amplification can be very large (e.g. 20x).