LoRA

Low-Rank Adaptation (LoRA) is a method for efficiently adapting large pretrained neural network models to downstream tasks while keeping the original model weights frozen. The key idea behind LoRA is to store the updates to the pretrained weights separately as much smaller matrices, which are therefore much quicker to train and less prone to overfitting.

The LoRA technique was introduced in the paper LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021).

Technical Description

The key idea behind LoRA is to represent the update to the pretrained weights as low rank matrices that are injected into the original model. For a given layer with original weight matrix $W_0 \in \mathbb{R}^{d \times k}$ (i.e. a $d \times k$ array of real numbers, typically represented as 16-bit floating point values), the low rank update is represented as:

$$\Delta W = BA$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with rank $r$ much less than the smaller of $d$ and $k$ (i.e. $r \ll \min(d, k)$). In this way a large matrix is represented as the product of two much smaller matrices, greatly reducing the amount of data involved.
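
As a rough illustration of the savings (the dimensions below are chosen only as an example and are not taken from the paper):

```python
# Hypothetical example: parameter count for a single 4096 x 4096 weight matrix
# adapted with a LoRA rank of r = 8.
d, k, r = 4096, 4096, 8
full_update = d * k               # 16,777,216 values in a full-rank update
lora_update = d * r + r * k       # 65,536 values in B and A combined
print(lora_update / full_update)  # ~0.0039, i.e. about 0.4% of the full update
```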

During training, the original weight matrix $W_0$ is frozen and only $B$ and $A$ are trained, which dramatically reduces the number of trainable parameters. The forward pass is computed as:

$$h = W_0 x + \frac{\alpha}{r} B A x$$

where $x$ is the input to the layer. The scalar $\alpha$ controls the magnitude of the low rank update.
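
A minimal PyTorch sketch of this forward pass is given below for a single linear layer; the class name `LoRALinear`, the initialisation scheme, and the default hyperparameters are illustrative choices rather than the paper's reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Linear layer with a frozen base weight W0 and a trainable low rank update BA."""

    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        # Frozen pretrained weight (stands in for a weight loaded from a checkpoint).
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)
        # Trainable low rank factors: A gets a small random init and B starts at
        # zero, so the update BA is zero at the beginning of training.
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r  # the alpha / r scaling of the update

    def forward(self, x):
        # h = W0 x + (alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.lora_A.T) @ self.lora_B.T
```

Only `lora_A` and `lora_B` would be passed to the optimiser; the frozen base weight receives no gradient updates.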

This allows a huge pretrained model to be adapted with as little as roughly 0.01% additional trainable parameters, stored in B and A. Because the learned update BA can be merged back into $W_0$ after training, LoRA does not add any inference latency or memory overhead compared to the original model.
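
Continuing the `LoRALinear` sketch above (where `layer` is assumed to be a trained instance of that class), the update can be folded into the frozen weight once training is finished, which is why no extra computation is needed at inference time:

```python
# Merge the learned update into the base weight in place.
# Shapes: lora_B (d_out, r) @ lora_A (r, d_in) matches base.weight (d_out, d_in).
with torch.no_grad():
    layer.base.weight += layer.scaling * (layer.lora_B @ layer.lora_A)
# After merging, the layer behaves like an ordinary nn.Linear with no LoRA overhead.
```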

Benefits and Performance

The key benefits of LoRA are:

  • Massive reduction in trainable parameters during finetuning
  • No increase in inference time or model size
  • Ability to efficiently switch between tasks by swapping adapter weights (see the sketch after this list)
  • Less GPU memory needed for training compared to full finetuning
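
Task switching only requires replacing the small A and B tensors while the shared base weights stay loaded; the helper below is a hypothetical sketch reusing the `LoRALinear` class from the previous section, and the on-disk adapter format is assumed for illustration:

```python
import torch

def load_adapter(layer, adapter_state):
    """Copy a saved (A, B) pair into a LoRALinear layer.

    `adapter_state` is assumed to be a dict with "lora_A" and "lora_B"
    tensors, e.g. produced by torch.save on a trained adapter.
    """
    with torch.no_grad():
        layer.lora_A.copy_(adapter_state["lora_A"])
        layer.lora_B.copy_(adapter_state["lora_B"])

# Hypothetical usage: swap in a task-specific adapter saved earlier.
# adapter = torch.load("summarization_adapter.pt")
# load_adapter(layer, adapter)
```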

LoRA has been shown to match the performance of full finetuning on models such as GPT-2 and GPT-3, while training several orders of magnitude fewer parameters. The QLoRA paper combines LoRA with quantization of the frozen base model for additional memory savings. Overall, LoRA enables efficient adaptation and deployment of very large pretrained models.
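
In practice, LoRA finetuning is usually set up through libraries rather than hand-written layers. The snippet below is a rough sketch using the Hugging Face PEFT library; the target module name "c_attn" is specific to GPT-2, and exact API details may vary between library versions:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(
    r=8,                        # rank of the update matrices
    lora_alpha=16,              # scaling factor alpha
    target_modules=["c_attn"],  # which weights receive LoRA updates (model specific)
    lora_dropout=0.05,
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # reports the small fraction of trainable parameters
```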

See Also