What is LoRA?
Large language models are powerful — but out of the box, they do what they were trained to do. To make one follow instructions, adopt a persona, or know your company's domain, you need to fine-tune it: continue training it on your data.
The problem? A state-of-the-art model like Llama 3 70B has 70 billion parameters. Full fine-tuning means updating every single one of them on every training step. That's prohibitively expensive — you need dozens of high-end GPUs running for days or weeks.
LoRA (Low-Rank Adaptation) is a technique that dramatically reduces the cost of fine-tuning by making a simple observation: you don't need to change all 70 billion weights to adapt a model to a new task.
Instead of retraining the entire weight matrix, LoRA injects small trainable rank-decomposition matrices alongside the original frozen weights. When training finishes, you can merge them back into a single model — or keep them as swappable adapters.
The result: a 70B model can be fine-tuned on a single 48GB GPU, with training taking hours instead of weeks.
Full Fine-Tuning vs LoRA
Let's make this concrete. Imagine a single linear layer in a neural network with weight matrix W₀ of shape (d × k) — say, 8192 × 8192. That's over 67 million parameters just for one layer.
Full fine-tuning touches every weight. LoRA only trains the bypass matrices — a tiny fraction of total parameters.
Full Fine-Tuning
- All parameters trainable
- Massive GPU memory required
- Weeks of training time
- Produces a full new model
- Risk of catastrophic forgetting
LoRA
- Only adapter matrices trainable
- Fits on consumer GPUs (24–48GB)
- Hours of training time
- Produces small swappable adapters
- Preserves pretrained knowledge
The Intuition
Here's the key insight behind LoRA, explained without any math:
When you fine-tune a model, you're really looking for a way to "nudge" it into new behavior. You don't need to rebuild it from scratch. You just need to add a small correction signal — a nudge in the right direction.
Instead of updating W₀ directly, LoRA learns a small correction: ΔW = B · A. The original weights stay frozen; only A and B are trained.
Think of it like editing a photo in a non-destructive editor. You don't paint over the original pixels — you add an adjustment layer. The original stays intact, and you can always revert. LoRA works the same way: it's an adjustment layer on top of the pretrained model.
Why does this work? Researchers discovered that the important updates to pretrained language models tend to lie in a low-dimensional subspace. You don't need a high-rank matrix to capture what changed — a low-rank approximation does the job just fine.
As the LoRA paper (Hu et al., 2021) hypothesizes, the weight updates ΔW have a low "intrinsic rank" during adaptation. In plain English: you can compress what the model learned into far fewer dimensions than the original weight matrix, and it still works.
LoRA adds a small trainable bypass. After merging, you get a model that behaves differently without changing its structure.
Rank Decomposition — Why Low-Rank Works
Let's look at what "low-rank" actually means and why it's so powerful.
Imagine you have a weight matrix W₀ of shape (d × k) — say, (8192 × 8192). That's 67 million parameters. A full fine-tune updates all 67M per step.
LoRA instead approximates the weight update ΔW as the product of two small matrices: A and B.
ΔW ≈ B · A, where B is (d × r), A is (r × k), and r ≪ min(d, k).
If r = 16, then B has 8192 × 16 = 131,072 params and A has 16 × 8192 = 131,072 params. Total: ~262,000 parameters — 256× smaller than the full 67M.
B (d×r) and A (r×k) multiply to produce an effective d×k update — but you only train ~262K params instead of 67M.
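You can verify the reduction with a few lines of plain Python (shapes as above):
d = k = 8192
r = 16
full_params = d * k               # 67,108,864 — what full fine-tuning updates
lora_params = d * r + r * k       # 131,072 + 131,072 = 262,144
print(full_params / lora_params)  # 256.0 — the 256× reduction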
Why does low-rank work? When you fine-tune, the model is adapting to a new distribution that's very similar to the original one. The direction of change (the update) doesn't need to explore the full high-dimensional space — it can be captured in a much smaller subspace. Like being able to describe the direction of a plane's movement with just a few coordinates rather than knowing the position of every atom.
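To make the low-rank intuition tangible, here's a toy NumPy experiment (entirely synthetic data, purely for illustration): if an update mostly lives in a low-dimensional subspace, a low-rank reconstruction recovers it almost perfectly.
import numpy as np

rng = np.random.default_rng(0)
d = 512
# Synthetic "update" with genuine rank-8 structure plus a little noise:
dW = rng.normal(size=(d, 8)) @ rng.normal(size=(8, d)) + 0.01 * rng.normal(size=(d, d))
# Keep only the top 8 singular directions (truncated SVD):
U, S, Vt = np.linalg.svd(dW)
approx = (U[:, :8] * S[:8]) @ Vt[:8]
rel_err = np.linalg.norm(dW - approx) / np.linalg.norm(dW)
print(f"relative error of rank-8 reconstruction: {rel_err:.4f}")  # well under 1% — nearly lossless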
The Math — W = W₀ + B · A
The entire LoRA formula fits on a napkin:
W = W₀ + B · A · (α / r)
where W₀ is frozen, A and B are trainable, r is the rank, and α is a scaling factor.
During the forward pass through a LoRA-adapted layer, the computation becomes:
# Regular forward pass:
h = W₀ · x
# LoRA forward pass:
h = W₀ · x + (B · A) · x · (α / r)
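As a concrete sketch, here's what a LoRA-wrapped linear layer could look like in PyTorch. This is illustrative only — not the PEFT implementation — and the class name is mine:
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank bypass (illustrative sketch)."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)                   # freeze W₀ (and bias)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # Gaussian init
        self.B = nn.Parameter(torch.zeros(d, r))          # zero init → ΔW = 0 at start
        self.scaling = alpha / r

    def forward(self, x):
        # h = W₀ · x + (B · A) · x · (α / r)
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling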
A Concrete Numeric Example
Let's make this real. Say we have a weight matrix W₀ of shape (4 × 4) for simplicity, and we use rank r = 2.
# W₀ is frozen — let's just show what it contributes:
W₀ = [[2.0, 0.5, 1.2, 0.8],
      [0.3, 1.8, 0.4, 2.1],
      [1.1, 0.6, 1.9, 0.7],
      [0.4, 1.3, 0.8, 1.6]]
# Input:
x = [1.0, 0.5, 0.8, 0.2]
# Forward: h = W₀ · x
h_base = [3.37, 1.94, 3.06, 2.01]
Now we add LoRA with α = 8 and r = 2. At initialization, A is drawn from a Gaussian and B is set to zero, so LoRA contributes nothing at first. Suppose that after some training, the matrices look like this:
# Trainable LoRA matrices (values after some training):
B = [[0.5, 0.2],   # shape (4 × 2)
     [0.1, 0.8],
     [0.3, 0.4],
     [0.6, 0.1]]
A = [[0.3, 0.0, 0.5, 0.2],   # shape (2 × 4)
     [0.1, 0.4, 0.2, 0.7]]
# Compute BA:
BA = B @ A # → shape (4 × 4)
# Scale factor:
scaling = α / r = 8 / 2 = 4.0
# Forward with LoRA:
h = W₀·x + (BA·x) * scaling
  = [3.37, 1.94, 3.06, 2.01] + [0.49, 0.55, 0.46, 0.50] * 4
  = [3.37, 1.94, 3.06, 2.01] + [1.96, 2.22, 1.85, 2.02]
  = [5.33, 4.16, 4.91, 4.03]
Those small A and B matrices, trained via gradient descent, produce a meaningful shift in the output — adapting the model's behavior without touching any of the frozen base weights.
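You can check the arithmetic yourself with a few lines of NumPy:
import numpy as np

W0 = np.array([[2.0, 0.5, 1.2, 0.8],
               [0.3, 1.8, 0.4, 2.1],
               [1.1, 0.6, 1.9, 0.7],
               [0.4, 1.3, 0.8, 1.6]])
x = np.array([1.0, 0.5, 0.8, 0.2])
B = np.array([[0.5, 0.2], [0.1, 0.8], [0.3, 0.4], [0.6, 0.1]])   # (4 × 2)
A = np.array([[0.3, 0.0, 0.5, 0.2], [0.1, 0.4, 0.2, 0.7]])       # (2 × 4)
alpha, r = 8, 2

h = W0 @ x + (B @ A) @ x * (alpha / r)
print(h.round(2))   # [5.33 4.16 4.91 4.03]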
What Gets Trained?
Not every layer in a transformer contributes equally to model behavior. Ablations in the LoRA paper found that adapting specific weight matrices gives the best bang for your buck.
LoRA is most commonly applied to Q, K, V projection matrices (Wq, Wk, Wv), the output projection (Wo), and FFN layers. The embeddings and layer norms remain frozen.
Common LoRA Target Modules
| Module | What it does | LoRA Applied? |
|---|---|---|
| W_q (Query projection) | Transforms input to query vectors for attention | Yes — most common |
| W_k (Key projection) | Transforms input to key vectors for attention | Yes — optional |
| W_v (Value projection) | Transforms input to value vectors for attention | Yes — most common |
| W_o (Output projection) | Projects attention output back to residual stream | Yes — common |
| W_up, W_gate (FFN) | Feed-forward network layers | Yes — common |
| Embeddings, LayerNorm | Token embeddings and normalization | Frozen |
Hyperparameters
Three hyperparameters control LoRA's behavior and efficiency:
1. Rank (r) — How Much Capacity?
The rank r controls the dimensionality of the low-rank subspace. Higher r = more trainable parameters = more expressive adapter, but also more compute and risk of overfitting.
Typical values: r = 4, 8, 16
- 4–8: Lightweight adapters. Good for simple style or persona changes. Very low VRAM usage.
- 16: Standard choice. Good balance of quality and cost. Works for most instruction-tuning tasks.
- 32–64: High-capacity. Used when the target domain is very different from pretraining.
- 128+: Rarely needed. Approaches full fine-tuning in behavior.
Parameter count formula
# Per target module (square d × d weight):
params = 2 × r × d   # A (r × d) + B (d × r)
# For the Q, K, V, O projections of one transformer layer:
# r=16, d_model=4096
# Per layer: 4 × 2 × 16 × 4096 ≈ 524K
# For 32 layers: 32 × 524K ≈ 16.8M trainable params
2. Alpha (α) — The Scaling Knob
α scales the LoRA contribution relative to the frozen weights: the B · A update is multiplied by α / r, which acts like a learning-rate multiplier for the adapter. So if you double r, you typically double α to keep the same effective scale.
# Common ratios:
α = r # Standard (1:1 mapping)
α = 2 × r # Popular for higher ranks
α = r / 2 # More conservative updates
# Scaling in forward pass:
output = W₀ · x + (B · A · x) × (α / r)
3. Target Modules — What to Modify?
You specify which weight matrices get LoRA adapters. In the Hugging Face PEFT library:
from peft import get_peft_model, LoraConfig
config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, config)  # base_model: a causal LM loaded with transformers
# Only q_proj, k_proj, v_proj, o_proj get LoRA adapters
Dropout Regularization
lora_dropout (typically 0.0–0.1) randomly zeros out LoRA activations during training as regularization. For most tasks, 0.05 is a good default.
QLoRA — Quantized LoRA
QLoRA (Dettmers et al., 2023) pushes LoRA even further by combining it with quantization. The key innovation: the base model is stored in a highly compressed 4-bit format, and only the LoRA adapters are in full 16-bit. This dramatically reduces VRAM requirements.
QLoRA stores the base model in 4-bit NormalFloat (NF4) while keeping LoRA adapters in 16-bit. Same quality as full 16-bit LoRA at a fraction of the memory.
How QLoRA Works
QLoRA uses three key techniques (a configuration sketch follows the list):
- 4-bit NormalFloat (NF4) — A quantization scheme optimized for normally-distributed weights, achieving near-lossless compression at 4 bits per parameter.
- Double Quantization — Quantizes the quantization constants themselves, squeezing out additional memory savings.
- Paged Optimizers — Offloads optimizer states to CPU RAM when GPU memory runs low, managing peak memory spikes.
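Putting it together, a minimal QLoRA setup with Hugging Face transformers, bitsandbytes, and peft might look like this (the model id is a placeholder):
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,       # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-base-model",           # placeholder — any causal LM checkpoint
    quantization_config=bnb_config,
)
model = get_peft_model(base_model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))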
Training Process — How Gradients Update Only A and B
During backpropagation, gradients flow through the network as usual. But here's the crucial difference: only the LoRA adapter parameters (A and B) receive gradient updates. The frozen base weights are marked non-trainable (requires_grad=False), so no gradients are computed or stored for them.
During backprop, W₀ receives no gradient updates (it's frozen). Only B and A get gradients and are updated, and optimizer state is stored only for A and B.
The training loop is straightforward:
# LoRA Training Loop (simplified pseudocode)
for batch in training_data:
    outputs = model(**batch)                  # forward pass runs through W₀ and the adapters
    loss = compute_loss(outputs, batch["labels"])
    loss.backward()                           # backprop — frozen W₀ has requires_grad=False,
                                              # so no gradients are stored for it
    optimizer.step()                          # the optimizer only holds the lora_ parameters
    optimizer.zero_grad()                     # W₀: untouched — still frozen
Why This Saves Memory
- No optimizer states for W₀ — Adam keeps two extra fp32 buffers per parameter (momentum and variance), roughly 8 bytes per parameter. For a 70B model, that alone is ~560GB.
- No gradient storage for W₀ — gradients alone would be 140GB. LoRA eliminates storing gradients for the frozen base model.
- Only A and B in the optimizer — at r=16 that's ~16.8M trainable parameters in total (see the formula above); with fp32 optimizer states, on the order of 200MB, versus hundreds of gigabytes for full fine-tuning (see the sketch below).
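Conceptually, freezing is just flipping requires_grad on each parameter. Here's a sketch of what PEFT does for you, plus its real helper for confirming what's trainable:
# What PEFT does under the hood (conceptually):
for name, param in model.named_parameters():
    param.requires_grad = "lora_" in name   # only adapter params train

# Real PEFT helper — prints trainable vs. total parameter counts:
model.print_trainable_parameters()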
Merging Adapters — Combining Back into a Single Model
Once training is done, you have two options. Keep the adapter as a separate file and dynamically load it at inference — or merge it back into the base model weights for use as a standalone model.
Option 1: Keep as Swappable Adapter
Adapters are small (~100MB for r=16). You can have multiple adapters for different tasks and load/swap them at runtime:
# Multiple adapters — load whichever you need
model.load_adapter("./finetuned-math-adapter", "math-expert")
model.load_adapter("./finetuned-code-adapter", "code-assistant")
# Switch between them:
model.set_adapter("math-expert")
answer = model.generate("solve for x: 2x + 5 = 13")
model.set_adapter("code-assistant")
answer = model.generate("write a quicksort in python")
Option 2: Merge and Export
Merge the trained A and B matrices back into W₀ to get a standalone model with no adapter overhead:
W_merged = W₀ + B · A · (α / r)
Once merged, you can discard A and B and use the model like any other fine-tuned model.
# Merge LoRA adapter into base weights
model = model.merge_and_unload()  # PEFT: folds B·A·(α/r) into W₀ and removes the LoRA layers
# Now save the merged model like a normal model
model.save_pretrained("./merged-llama3-70b-instruct")
Merging simply scales the trained BA product by α/r and adds it into the frozen base weights. The result is a single model file, identical in structure to the original.
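In tensor terms, the merge for one layer is a single addition — reusing W0, B, A, alpha, and r from the NumPy example earlier:
# Manual merge of the toy example above:
W_merged = W0 + (B @ A) * (alpha / r)
# W_merged has the same shape as W0 — the model's structure is unchanged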
Practical Uses
LoRA has opened fine-tuning to researchers, small companies, and hobbyists. Here are the most common applications:
1. Instruction Tuning
Take a pretrained base model and train it to follow instructions, answer questions helpfully, and refuse harmful requests. This is how models like Alpaca and Vicuna were created from LLaMA using human-annotated or GPT-generated instruction-response pairs — and LoRA reproductions such as Alpaca-LoRA showed the same recipe works at a fraction of the cost.
# Example: instruction-tuning dataset
# {"instruction": "Explain photosynthesis", "input": "", "output": "..."}
training_data = load_instruction_dataset("./alpaca_data.json")
# LoRA config for instruction tuning
config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj","v_proj"])
model = get_peft_model(base_model, config)
trainer = Trainer(model=model, train_dataset=training_data, ...)
trainer.train()
2. Domain Adaptation
Teach a general model your specialized domain — law, medicine, finance, code. You feed it documents from your domain and LoRA fine-tunes it to understand your domain's language and patterns.
3. Character / Roleplay Adapters
LoRA adapters are perfect for teaching a model to roleplay as a specific character, with a particular personality and speech style. Multiple characters can be stored as separate small adapter files.
4. Style Transfer
Fine-tune a model to write in a specific style — technical, literary, concise, conversational. A single LoRA adapter can shift tone without changing the underlying knowledge.
5. Multi-Task Learning with Adapters
Train separate LoRA adapters for different tasks and combine them. Since adapters are small, you can keep many of them in memory and combine outputs — a form of efficient multi-task learning.
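PEFT supports this pattern directly through weighted adapter merging. A sketch reusing the adapter names loaded earlier (the weights and new adapter name are illustrative):
# Combine two trained adapters into a new one:
model.add_weighted_adapter(
    adapters=["math-expert", "code-assistant"],
    weights=[0.5, 0.5],
    adapter_name="generalist",
    combination_type="linear",
)
model.set_adapter("generalist")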
Limitations of LoRA
LoRA is powerful, but it's not magic. Understanding its limitations helps you know when to use it — and when a different approach is needed.
When LoRA Struggles
- Learning entirely new knowledge — LoRA adapts the model's existing knowledge; it doesn't inject large amounts of new facts. For fact-heavy learning, RAG or continued pretraining may be better.
- Very high-rank tasks — If the target distribution is extremely different from pretraining, r=16 or r=32 may not have enough capacity.
- Long-context tasks — LoRA alone typically can't extend a model's context window; positional-embedding changes (e.g., position interpolation) require separate techniques.
- Some architectures — LoRA was designed for linear layers with residual connections. Graph neural networks, diffusion model UNets, and some newer architectures need different approaches.
- Catastrophic forgetting — Even with a frozen base, aggressive fine-tuning on one task can reduce performance on others (mitigated by lower rank, dropout, and data mixing).
LoRA Strengths (For Reference)
- Efficient instruction following
- Character/persona adaptation
- Domain-specific style transfer
- Multi-adapter ensemble (cheap)
- Fast iteration cycles
- Works on consumer GPUs
What to Use Instead
| Goal | Better Alternative |
|---|---|
| Inject massive new knowledge | Continued pretraining or RAG |
| Very high quality, no compromises | Full fine-tuning (if you have the budget) |
| Architecture without linear layers | Other PEFT methods (e.g., IA³, prefix tuning) |
| Maximum instruction-following quality | DPO / RLHF on top of LoRA |
Quick Reference
# The LoRA update:
W = W₀ + B · A · (α / r)
# Key hyperparameters:
r # rank (4–64, higher = more capacity)
α # scaling factor (often 2×r)
target_modules # ["q_proj","v_proj","k_proj","o_proj"]
lora_dropout # 0.0 – 0.1
# Memory saving:
# full fine-tune: ~16 bytes per param (weights + grads + optimizer states)
# LoRA: ~2 bytes per frozen param (16-bit) + ~16 bytes per trainable param
# QLoRA: ~0.5 bytes per frozen param (4-bit) + ~16 bytes per trainable param
# Common ranks by use case:
r=4-8 # style, persona, light adaptation
r=16 # instruction tuning, domain knowledge
r=32-64 # heavy domain shift, complex tasks