LoRA: Low-Rank Adaptation of Large Language Models

Metadata

Reading status: read complete
Year: 2021
Compute regime: Efficient and edge inference (efficient_edge_inference)
PDF: 2021-lora_2021.pdf
Extracted text: 2021-lora_2021.txt
PDF URL: https://arxiv.org/pdf/2106.09685.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper runs its experiments on NVIDIA Tesla V100 GPUs and measures adapter inference latency on an NVIDIA Quadro RTX8000. A footnote on GPT-3 175B reports training throughput per V100 under the same number of model-parallel weight shards: full fine-tuning reaches 32.5 tokens/s/V100, while LoRA reaches 43.1 tokens/s/V100. This is a device-level throughput figure for a model where memory and model parallelism dominate the cost of adaptation.

The largest setup is GPT-3 175B. Full fine-tuning with Adam requires about 1.2 TB of training VRAM, while LoRA reduces that to 350 GB. Checkpoint storage changes from 350 GB to 35 MB when adapting only query and value projections with rank r = 4. For latency, Table 1 measures GPT-2 medium on RTX8000; at batch size 1 and sequence length 128, Fine-Tune/LoRA is 19.8 ms, while adapter baselines rise to 23.9 ms and 25.8 ms.
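The 35 MB checkpoint figure can be roughly reproduced by counting LoRA parameters. The sketch below assumes the GPT-3 175B shape published for that model (96 layers, model width 12288) and FP16 storage at 2 bytes per parameter; none of these values are stated in this card, and the function name is illustrative.

```python
def lora_param_count(num_layers: int, d_model: int, rank: int,
                     adapted_per_layer: int) -> int:
    """Count trainable LoRA parameters for square d_model x d_model projections."""
    # Each adapted matrix gets B (d_model x rank) and A (rank x d_model).
    per_matrix = rank * (d_model + d_model)
    return num_layers * adapted_per_layer * per_matrix


# Assumed GPT-3 175B shape and storage precision (not taken from this card).
NUM_LAYERS = 96
D_MODEL = 12288
BYTES_PER_PARAM = 2  # FP16

# r = 4 on two matrices per layer (Wq and Wv).
params = lora_param_count(NUM_LAYERS, D_MODEL, rank=4, adapted_per_layer=2)
size_mb = params * BYTES_PER_PARAM / 1e6

print(f"{params:,} trainable parameters, about {size_mb:.1f} MB in FP16")
```

Under these assumptions the count is about 18.9M parameters, or a little under 38 MB, which is the same order as the roughly 35 MB quoted above.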

Bottleneck

The bottleneck is the per-task cost of adapting a large model. Full fine-tuning updates all weights for every downstream task, repeating the optimizer-state and checkpoint footprint of the full model. For GPT-3 175B, this means treating 175B parameters as trainable for each task. Adam states make the hardware barrier similar to pretraining's memory footprint, and storing many task-specialized 350 GB checkpoints creates a severe I/O and storage bottleneck.

Existing parameter-efficient methods can introduce inference latency. Adapter layers add modules between existing layers, and those modules are processed sequentially with the base path. This is most visible in online inference where batch size is small; at larger batches or sequence lengths, hardware parallelism hides more of the overhead.

Method Adaptation

LoRA adapts the weight-update structure rather than inserting a new inference block. For a pretrained weight matrix W0, it represents adaptation as W0 + BA, where B is d x r, A is r x k, and r is much smaller than the original matrix dimensions. During training, W0 is frozen and only A and B receive gradient updates. The paper mainly adapts attention weights, especially Wq and Wv.
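A minimal NumPy sketch of this parameterization, under stated assumptions: the toy sizes, the variable names, and the helper lora_forward are illustrative. The initialization (small random A, zero B) follows the paper's description rather than anything stated in this card, and the paper's additional scaling factor on BA is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 6, 2  # toy sizes; in practice r is far smaller than d and k

W0 = rng.normal(size=(d, k))              # frozen pretrained weight, d x k
A = rng.normal(scale=0.01, size=(r, k))   # trainable, r x k
B = np.zeros((d, r))                      # trainable, d x r; zero init gives BA = 0


def lora_forward(x, W0, A, B):
    """Compute h = W0 x + B (A x) without forming the d x k product BA."""
    return W0 @ x + B @ (A @ x)


x = rng.normal(size=(k,))

# With B = 0 the adapted layer reproduces the pretrained layer exactly.
assert np.allclose(lora_forward(x, W0, A, B), W0 @ x)

# Stand-in for training: move B away from zero, then merge for deployment.
B = rng.normal(size=(d, r))
W_merged = W0 + B @ A
assert np.allclose(W_merged @ x, lora_forward(x, W0, A, B))

trainable = A.size + B.size  # r * (d + k)
print(f"trainable: {trainable}, frozen: {W0.size}")
```

The second assertion is the property the card relies on later: the merged matrix has the same shape as W0 and gives the same output as the two-branch form, so serving needs no extra layer.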

This is compute-aware in three ways. Freezing the pretrained weights removes optimizer state and gradient computation for the base model. Low-rank matrices make task switching an I/O operation over megabytes rather than hundreds of gigabytes: 100 adapted GPT-3 models need about 354 GB in total with LoRA (the 350 GB base plus 100 adapters of 35 MB each), instead of roughly 35 TB for full checkpoints. At deployment, BA can be merged into W0, preserving the same dense matmul path as a fully fine-tuned model.
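The storage comparison is simple arithmetic over the figures quoted in this card. The short sketch below uses decimal units (1 GB = 1000 MB), which is an assumption about how the paper rounds.

```python
BASE_GB = 350.0     # pretrained GPT-3 175B checkpoint
ADAPTER_MB = 35.0   # one r = 4 Wq/Wv LoRA checkpoint
NUM_TASKS = 100

# LoRA keeps one copy of the base model plus one small adapter per task.
lora_total_gb = BASE_GB + NUM_TASKS * ADAPTER_MB / 1000

# Full fine-tuning stores a complete checkpoint per task.
full_total_tb = NUM_TASKS * BASE_GB / 1000

print(f"LoRA: {lora_total_gb:.1f} GB, full fine-tuning: {full_total_tb:.0f} TB")
```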

The batching trade-off follows from this merge. If different requests in one batch need different LoRA modules absorbed into W, each task wants a different effective weight. The paper states that batching different tasks is not straightforward in the merged, no-latency form.
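The alternative to merging is to keep the adapters separate and apply a per-request low-rank correction on top of one shared dense matmul. The NumPy sketch below is only an illustration of that idea; the stacked adapter arrays, task_ids, and both function names are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, num_tasks, batch = 8, 6, 2, 3, 5

W0 = rng.normal(size=(d, k))             # shared frozen weight
A = rng.normal(size=(num_tasks, r, k))   # one A per task
B = rng.normal(size=(num_tasks, d, r))   # one B per task

X = rng.normal(size=(batch, k))          # one input row per request
task_ids = np.array([0, 2, 1, 2, 0])     # each request names its task


def unmerged_batch_forward(X, task_ids):
    """Shared dense matmul plus a per-request low-rank correction."""
    base = X @ W0.T                                    # (batch, d)
    down = np.einsum("brk,bk->br", A[task_ids], X)     # A_t x for each request
    up = np.einsum("bdr,br->bd", B[task_ids], down)    # B_t (A_t x) for each request
    return base + up


def merged_forward(x, task):
    """Reference path: a separate merged weight per task, one request at a time."""
    return (W0 + B[task] @ A[task]) @ x


out = unmerged_batch_forward(X, task_ids)
ref = np.stack([merged_forward(x, t) for x, t in zip(X, task_ids)])
assert np.allclose(out, ref)
```

The unmerged path handles a mixed-task batch in one pass but adds the two small extra matmuls per request, which is the latency cost that merging removes.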

Evidence

The abstract summarizes the largest-scale result as reducing trainable parameters by 10,000x and GPU memory by 3x compared with GPT-3 175B fine-tuned with Adam. The detailed numbers are training VRAM falling from 1.2 TB to 350 GB, and the r = 4 Wq/Wv checkpoint falling from 350 GB to 35 MB. The V100 throughput note reports 32.5 tokens/s/V100 for full fine-tuning versus 43.1 tokens/s/V100 for LoRA.
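The headline ratios can be checked directly from the numbers in this card. The sketch below assumes decimal units and uses illustrative variable names; it reports the throughput difference both as a gain in tokens per second and as the equivalent reduction in time per token.

```python
full_vram_gb, lora_vram_gb = 1200.0, 350.0     # 1.2 TB vs 350 GB training VRAM
full_ckpt_mb, lora_ckpt_mb = 350_000.0, 35.0   # 350 GB vs 35 MB checkpoint
full_tps, lora_tps = 32.5, 43.1                # tokens/s per V100

vram_ratio = full_vram_gb / lora_vram_gb
ckpt_ratio = full_ckpt_mb / lora_ckpt_mb
throughput_gain = lora_tps / full_tps - 1      # more tokens per second
time_saved = 1 - full_tps / lora_tps           # less time per token

print(f"VRAM: {vram_ratio:.1f}x, checkpoint: {ckpt_ratio:,.0f}x")
print(f"throughput: +{throughput_gain:.1%}, time per token: -{time_saved:.1%}")
```

The VRAM ratio comes out near 3.4x, which the abstract rounds to 3x, and the checkpoint ratio is 10,000x.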

The RTX8000 latency table shows why merged weights matter. At batch size 32, sequence length 512, Fine-Tune/LoRA is 1449.4 ms, while AdapterL and AdapterH are only a few percent slower. At batch size 1, sequence length 128, Fine-Tune/LoRA is 19.8 ms, AdapterL is 23.9 ms, and AdapterH is 25.8 ms, corresponding to +20.7% and +30.3%. Larger batches and sequence lengths shrink the relative adapter overhead because hardware parallelism is better used.
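The batch-size-1 percentages follow from the millisecond figures quoted above. The helper name overhead_pct below is illustrative.

```python
def overhead_pct(baseline_ms: float, method_ms: float) -> float:
    """Relative latency increase over the baseline, in percent."""
    return (method_ms - baseline_ms) / baseline_ms * 100


BASELINE_MS = 19.8  # Fine-Tune/LoRA, batch size 1, sequence length 128
adapters = {"AdapterL": 23.9, "AdapterH": 25.8}

for name, ms in adapters.items():
    print(f"{name}: +{overhead_pct(BASELINE_MS, ms):.1f}%")
```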

Quality evidence supports the trade-off. RoBERTa-base LoRA uses 0.3M trainable parameters and reports 87.2 average GLUE, above the listed 86.4 full fine-tuning baseline. RoBERTa-large LoRA uses 0.8M trainable parameters and reports 89.0, marginally above the 88.9 full fine-tuning baseline.

Historical Effect

LoRA made parameter-efficient fine-tuning a deployment primitive. Its important compute contribution is not only the low-rank prior but also the fact that the update can be merged into the existing dense weight matrix. That keeps the serving path compatible with optimized Transformer inference kernels and avoids the extra sequential modules used by adapters.

It also changed model-specialization storage economics: one large base model can remain resident while many task updates are small enough to swap or archive cheaply. LoRA's compute structure connects training memory, checkpoint I/O, and latency in one adaptation method.

Limits

The paper is clear about a deployment limitation: if LoRA weights are merged into the base matrix to eliminate latency, batching inputs for different tasks in one forward pass is not straightforward. Serving many tasks therefore requires module management, weight swapping, or accepting unmerged dynamic modules in latency-tolerant scenarios. The method also still requires the base model at deployment; the GPT-3 example keeps the 350 GB pretrained model and adds small task deltas.
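Weight swapping in the merged form can be sketched as subtracting the active task's BA from the resident weight and adding the next task's. The class below is a hypothetical illustration in NumPy, not an API from the paper or any library; the adapter names and toy sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 6, 2

W0 = rng.normal(size=(d, k))  # pretrained weight
adapters = {                   # task name -> (B, A)
    name: (rng.normal(size=(d, r)), rng.normal(size=(r, k)))
    for name in ("task_a", "task_b")
}


class MergedLayer:
    """Holds one dense weight and swaps low-rank task updates in place."""

    def __init__(self, W0):
        self.W = W0.copy()
        self.active = None

    def switch(self, name):
        if self.active is not None:
            B_old, A_old = adapters[self.active]
            self.W -= B_old @ A_old   # remove the active task's update
        B_new, A_new = adapters[name]
        self.W += B_new @ A_new       # merge the requested task's update
        self.active = name


layer = MergedLayer(W0)
layer.switch("task_a")
layer.switch("task_b")

B_b, A_b = adapters["task_b"]
assert np.allclose(layer.W, W0 + B_b @ A_b)
```

Only one task is merged at a time, so every request in a batch shares that task. Repeated add and subtract steps also accumulate floating-point error, so a real system might prefer to re-merge from a clean copy of W0.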

The empirical study focuses mainly on Transformer attention projections. The authors leave adaptation of MLP layers, LayerNorm layers, and biases to future work. Low rank is also an empirical fit rather than a theorem for every task: the paper studies rank deficiency and finds very low ranks sufficient in its settings, but the optimal rank varies by model, task, and which weights are adapted.