What is LoRA?

Large language models are powerful — but out of the box, they do what they were trained to do. To make one follow instructions, adopt a persona, or know your company's domain, you need to fine-tune it: continue training it on your data.

The problem? A state-of-the-art model like Llama 3 70B has 70 billion parameters. Full fine-tuning means updating every single one of them on every training step. That's computationally catastrophic — you need dozens of high-end GPUs for days or weeks.

The Problem
Fully fine-tuning a 70B-parameter model requires updating all 70B weights per step. With 80GB GPUs, you'd need ~14 A100s just to fit the model, gradients, and optimizer state in memory. Cost: tens of thousands of dollars per training run.

LoRA (Low-Rank Adaptation) is a technique that dramatically reduces the cost of fine-tuning by making a simple observation: you don't need to change all 70 billion weights to adapt a model to a new task.

Instead of retraining the entire weight matrix, LoRA injects small trainable rank-decomposition matrices alongside the original frozen weights. When training finishes, you can merge them back into a single model — or keep them as swappable adapters.

The result: a 70B model can be fine-tuned on a single 48GB GPU, with training taking hours instead of weeks.

Full Fine-Tuning vs LoRA

Let's make this concrete. Imagine a single linear layer in a neural network with weight matrix W₀ of shape (d × k) — say, 8192 × 8192. That's over 67 million parameters just for one layer.

Full Fine-Tuning vs LoRA — Architecture Comparison
[Diagram: full fine-tuning trains all d×k weights in every layer (70B params updated); LoRA keeps every W₀ frozen and trains only small bypass matrices, roughly 0.1% of the parameters.]

Full fine-tuning touches every weight. LoRA only trains the bypass matrices — a tiny fraction of total parameters.

Full Fine-Tuning

  • All parameters trainable
  • Massive GPU memory required
  • Weeks of training time
  • Produces a full new model
  • Risk of catastrophic forgetting

LoRA

  • Only adapter matrices trainable
  • Fits on consumer GPUs (24–48GB)
  • Hours of training time
  • Produces small swappable adapters
  • Preserves pretrained knowledge

The Intuition

Here's the key insight behind LoRA, explained without any math:

When you fine-tune a model, you're really looking for a way to "nudge" it into new behavior. You don't need to rebuild it from scratch. You just need to add a small correction signal — a nudge in the right direction.

Core Idea
Instead of overwriting the original weights W₀, LoRA learns a small correction delta: ΔW = B · A. The original weights stay frozen. Only A and B are trained.

Think of it like editing a photo in a non-destructive editor. You don't paint over the original pixels — you add an adjustment layer. The original stays intact, and you can always revert. LoRA works the same way: it's an adjustment layer on top of the pretrained model.

Why does this work? Researchers discovered that the important updates to pretrained language models tend to lie in a low-dimensional subspace. You don't need a high-rank matrix to capture what changed — a low-rank approximation does the job just fine.

As the LoRA paper (Hu et al., 2021) put it: "We hypothesize that the update matrices ΔW have a low intrinsic rank during adaptation." In plain English: you can compress what the model learned into far fewer dimensions than the original weight matrix, and it still works.

The Nudge Analogy
[Diagram: pretrained model (70B frozen params) + tiny LoRA adapter ΔW = B·A, merged to give a fine-tuned model that behaves differently but keeps the same architecture.]

LoRA adds a small trainable bypass. After merging, you get a model that behaves differently without changing its structure.

Rank Decomposition — Why Low-Rank Works

Let's look at what "low-rank" actually means and why it's so powerful.

Imagine you have a weight matrix W₀ of shape (d × k) — say, (8192 × 8192). That's 67 million parameters. A full fine-tune updates all 67M per step.

LoRA instead approximates the weight update ΔW as the product of two small matrices: A and B.

Rank Decomposition
ΔW ≈ B · A where B is (d × r) and A is (r × k), and r << min(d, k).

If r = 16, then B has 8192 × 16 = 131,072 params and A has 16 × 8192 = 131,072 params. Total: ~262,000 parameters — 256× smaller than the full 67M.

Rank Decomposition: W = W₀ + B · A
[Diagram: W₀, the original frozen weights (8192 × 8192 ≈ 67M params), plus ΔW = B · A with trainable B (8192 × 16) and A (16 × 8192): an effective 8192 × 8192 update from only ~262K trained parameters.]

B (d×r) and A (r×k) multiply to produce an effective d×k update — but you only train ~262K params instead of 67M.

Why does low-rank work? When you fine-tune, the model is adapting to a new distribution that's very similar to the original one. The direction of change (the update) doesn't need to explore the full high-dimensional space — it can be captured in a much smaller subspace. Like being able to describe the direction of a plane's movement with just a few coordinates rather than knowing the position of every atom.
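The storage argument can be checked directly in NumPy. A minimal sketch (sizes here are illustrative, far smaller than a real model): build an update as the product of two thin matrices and compare how much you must store.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 512, 8

# A rank-r update, constructed as the product of two thin factors:
B = rng.normal(size=(d, r))
A = rng.normal(size=(r, k))
delta_W = B @ A                        # full (d × k) matrix, but rank ≤ r

print(np.linalg.matrix_rank(delta_W))  # → 8
print(delta_W.size)                    # full storage: 262144 values
print(B.size + A.size)                 # factored storage: 8192 values
```

Storing the two factors costs 32× less here, yet their product still spans a full-size update matrix — the same trade LoRA makes inside each adapted layer.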

The Math — W = W₀ + B · A

The entire LoRA formula fits on a napkin:

The Formula
W = W₀ + B · A · (α / r)
Where W₀ is frozen, A and B are trainable, r is rank, and α is a scaling factor.

During the forward pass through a LoRA-adapted layer, the computation becomes:

# Regular forward pass:
h = W₀ · x

# LoRA forward pass:
h = W₀ · x + (B · A) · x · (α / r)
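That forward pass can be sketched as a tiny NumPy class (shapes and initialization values are illustrative, not a real framework layer):

```python
import numpy as np

class LoRALinear:
    """A frozen linear layer plus a trainable low-rank bypass (illustrative sketch)."""

    def __init__(self, W0, r=16, alpha=32):
        d, k = W0.shape
        self.W0 = W0                                   # frozen base weights
        self.B = np.zeros((d, r))                      # B starts at zero → no effect at init
        self.A = np.random.default_rng(0).normal(scale=0.01, size=(r, k))
        self.scaling = alpha / r

    def forward(self, x):
        # h = W₀·x + (B·A)·x · (α/r), computed without materializing B·A
        return self.W0 @ x + (self.B @ (self.A @ x)) * self.scaling

W0 = np.eye(8)
layer = LoRALinear(W0, r=2, alpha=4)
x = np.ones(8)
# With B initialized to zero, the layer behaves exactly like the frozen base:
print(np.allclose(layer.forward(x), W0 @ x))  # → True
```

Note the bypass is applied as B·(A·x) — two thin matrix-vector products — rather than forming the full d×k matrix B·A, which is how LoRA keeps the extra compute small.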

A Concrete Numeric Example

Let's make this real. Say we have a weight matrix W₀ of shape (4 × 4) for simplicity, and we use rank r = 2.

# W₀ is frozen — let's just show what it contributes:
W₀ = [[2.0, 0.5, 1.2, 0.8],
      [0.3, 1.8, 0.4, 2.1],
      [1.1, 0.6, 1.9, 0.7],
      [0.4, 1.3, 0.8, 1.6]]
# Input:
x  = [1.0, 0.5, 0.8, 0.2]

# Forward: h = W₀ · x
h_base = [3.37, 1.94, 3.06, 2.01]

Now we add LoRA. In practice, A is initialized randomly from a Gaussian and B is set to zero, so at the start LoRA contributes nothing; the values below are what the matrices might hold after some training. With α = 8, r = 2:

# Trainable LoRA matrices (illustrative post-training values):
B = [[0.5, 0.2],    # shape (4 × 2) = (d × r)
     [0.1, 0.8],
     [0.3, 0.4],
     [0.6, 0.1]]

A = [[0.3, 0.0, 0.5, 0.2],  # shape (2 × 4) = (r × k)
     [0.1, 0.4, 0.2, 0.7]]

# Compute BA:
BA = B @ A  # → shape (4 × 4)

# Scale factor:
scaling = α / r = 8 / 2 = 4.0

# Forward with LoRA:
h = W₀·x + (BA·x) * scaling
  = [3.37, 1.94, 3.06, 2.01] + [0.49, 0.55, 0.46, 0.50] * 4
  = [3.37, 1.94, 3.06, 2.01] + [1.96, 2.22, 1.85, 2.02]
  = [5.33, 4.16, 4.91, 4.03]

Those small random A and B matrices, trained via gradient descent, produce a meaningful shift in the output — adapting the model's behavior without touching the original 67M+ parameters.
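A quick NumPy check of this arithmetic, taking B as the (4 × 2) matrix and A as the (2 × 4) matrix so that B·A has shape (4 × 4):

```python
import numpy as np

W0 = np.array([[2.0, 0.5, 1.2, 0.8],
               [0.3, 1.8, 0.4, 2.1],
               [1.1, 0.6, 1.9, 0.7],
               [0.4, 1.3, 0.8, 1.6]])
x = np.array([1.0, 0.5, 0.8, 0.2])

B = np.array([[0.5, 0.2],
              [0.1, 0.8],
              [0.3, 0.4],
              [0.6, 0.1]])             # (4 × 2) = (d × r)
A = np.array([[0.3, 0.0, 0.5, 0.2],
              [0.1, 0.4, 0.2, 0.7]])   # (2 × 4) = (r × k)

alpha, r = 8, 2
scaling = alpha / r                    # 4.0

h_base = W0 @ x
h = h_base + (B @ A @ x) * scaling
print(np.round(h_base, 2))  # → [3.37 1.94 3.06 2.01]
print(np.round(h, 2))       # → [5.33 4.16 4.91 4.03]
```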

What Gets Trained?

Not every layer in a transformer contributes equally to model behavior. Research found that targeting specific weight matrices gives the best bang for your buck.

Transformer Layer — Which Parts Get LoRA?
[Diagram: one transformer layer. LoRA is applied to the Q, K, V projections (Wq, Wk, Wv), the output projection W_o, and the FFN matrices (W_up, W_gate); embeddings, layer norms, and the attention computation itself stay frozen.]

LoRA is most commonly applied to Q, K, V projection matrices (Wq, Wk, Wv), the output projection (Wo), and FFN layers. The embeddings and layer norms remain frozen.

Common LoRA Target Modules

  • W_q (Query projection) — transforms input to query vectors for attention. LoRA: yes, most common.
  • W_k (Key projection) — transforms input to key vectors for attention. LoRA: yes, optional.
  • W_v (Value projection) — transforms input to value vectors for attention. LoRA: yes, most common.
  • W_o (Output projection) — projects attention output back to the residual stream. LoRA: yes, common.
  • W_up, W_gate (FFN) — feed-forward network layers. LoRA: yes, common.
  • Embeddings, LayerNorm — token embeddings and normalization. Frozen (no LoRA).

Hyperparameters

Three hyperparameters control LoRA's behavior and efficiency:

1. Rank (r) — How Much Capacity?

The rank r controls the dimensionality of the low-rank subspace. Higher r = more trainable parameters = more expressive adapter, but also more compute and risk of overfitting.

Typical values: r = 4, 8, 16

  • 4–8: Lightweight adapters. Good for simple style or persona changes. Very low VRAM usage.
  • 16: Standard choice. Good balance of quality and cost. Works for most instruction-tuning tasks.
  • 32–64: High-capacity. Used when the target domain is very different from pretraining.
  • 128+: Rarely needed. Approaches full fine-tuning in behavior.

Parameter count formula

# Per target module:
params = 2 × r × d  (A + B)

# For Q, K, V, O projections in one transformer layer:
# r=16, d_model=4096, n_layers=32
# Per layer: 4 × 2 × 16 × 4096 ≈ 524K
# Total: 32 layers × 524K ≈ 16.8M params
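The same arithmetic as a small helper function (a sketch that assumes square d_model × d_model projections and adapters on Q, K, V, and O in every layer):

```python
def lora_param_count(d_model: int, r: int, n_layers: int, n_modules: int = 4) -> int:
    """Trainable LoRA params: each adapted (d × d) matrix adds B (d × r) + A (r × d)."""
    per_module = 2 * r * d_model
    return n_layers * n_modules * per_module

total = lora_param_count(d_model=4096, r=16, n_layers=32)
print(total)          # 16,777,216 ≈ 16.8M trainable params
print(total / 7e9)    # ≈ 0.0024, i.e. about 0.24% of a 7B model
```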

2. Alpha (α) — The Scaling Knob

α scales the LoRA contribution relative to the frozen weights: the adapter output B·A·x is multiplied by α / r before being added. So if you double r, you typically double α to keep the contribution at the same scale.

# Common ratios:
α = r       # Standard (1:1 mapping)
α = 2 × r   # Popular for higher ranks
α = r / 2   # More conservative updates

# Scaling in forward pass:
output = W₀ · x + (B · A · x) × (α / r)

3. Target Modules — What to Modify?

You specify which weight matrices get LoRA adapters. In the Hugging Face PEFT library:

from peft import get_peft_model, LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, config)
# Only q_proj, k_proj, v_proj, o_proj get LoRA adapters
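To confirm how few parameters are actually trainable, PEFT provides a helper on the wrapped model (assumes the `model` object from the snippet above; exact counts depend on the base model):

```python
# Prints a one-line summary, e.g.:
# trainable params: ... || all params: ... || trainable%: ...
model.print_trainable_parameters()
```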

Dropout Regularization

lora_dropout (typically 0.0–0.1) randomly zeros out LoRA activations during training as regularization. For most tasks, 0.05 is a good default.

QLoRA — Quantized LoRA

QLoRA (Dettmers et al., 2023) pushes LoRA even further by combining it with quantization. The key innovation: the base model is stored in a highly compressed 4-bit format, and only the LoRA adapters are in full 16-bit. This dramatically reduces VRAM requirements.

Standard LoRA vs QLoRA — Memory Comparison
[Diagram: standard LoRA keeps the base model in 16-bit (70B × 2 bytes ≈ 140 GB, needing ~4× A100 80GB GPUs, ~145 GB total with adapters); QLoRA stores it in 4-bit NF4 (70B × 0.5 bytes ≈ 35 GB, fitting on one A100 80GB, ~40 GB total), with the adapters in 16-bit either way.]

QLoRA stores the base model in 4-bit NormalFloat (NF4) while keeping LoRA adapters in 16-bit. Same quality as full 16-bit LoRA at a fraction of the memory.

How QLoRA Works

QLoRA uses three key techniques:

  1. 4-bit NormalFloat (NF4) — A quantization scheme optimized for normally-distributed weights, achieving near-lossless compression at 4 bits per parameter.
  2. Double Quantization — Quantizes the quantization constants themselves, squeezing out additional memory savings.
  3. Paged Optimizers — Offloads optimizer states to CPU RAM when GPU memory runs low, managing peak memory spikes.

Key Result
QLoRA fine-tuned a 65B parameter model to match the quality of a full 16-bit fine-tune while using only a single 48GB GPU. This is what made fine-tuning large models genuinely accessible.
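A sketch of how this is typically set up with the Hugging Face transformers and peft libraries (the model id is illustrative; running this needs bitsandbytes and a CUDA GPU):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 base model with double quantization — the QLoRA recipe:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,     # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",      # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)

# The LoRA adapters stay in 16-bit on top of the 4-bit base:
model = get_peft_model(base_model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]))
```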

Training Process — How Gradients Update Only A and B

During backpropagation, gradients flow through the network as usual. But here's the crucial difference: only the LoRA adapter parameters (A and B) receive gradient updates. Everything else — the frozen base model weights — gets zero gradients.

Backward Pass — Gradient Flow
[Diagram: gradients flow back through both paths of h = W₀x + BAx, but ∂ℒ/∂W₀ is discarded (frozen) while ∂ℒ/∂A and ∂ℒ/∂B are nonzero; only A ← A − η·∂ℒ/∂A and B ← B − η·∂ℒ/∂B update.]

During backprop, W₀ gets zero gradients (frozen). Only B and A receive non-zero gradients and update. Optimizer state is only stored for A and B.

The training loop is straightforward:

# LoRA training loop (simplified PyTorch-style sketch)
# Only LoRA parameters are handed to the optimizer:
lora_params = [p for n, p in model.named_parameters() if "lora_" in n]
optimizer = torch.optim.AdamW(lora_params, lr=1e-4)

for batch in training_data:
    outputs = model(batch)                # forward pass with LoRA
    loss = compute_loss(outputs, labels)
    loss.backward()                       # frozen W₀ has requires_grad=False → no gradients
    optimizer.step()                      # updates only A and B
    optimizer.zero_grad()

Why This Saves Memory

The savings come mostly from gradients and optimizer state. With Adam, full fine-tuning keeps momentum and variance (plus gradients) for all 70B weights — several extra bytes per parameter on top of the model itself. With LoRA, gradients and optimizer state exist only for the tiny A and B matrices; the frozen weights need only their forward-pass copy.

Merging Adapters — Combining Back into a Single Model

Once training is done, you have two options. Keep the adapter as a separate file and dynamically load it at inference — or merge it back into the base model weights for use as a standalone model.

Option 1: Keep as Swappable Adapter

Adapters are small (~100MB for r=16). You can have multiple adapters for different tasks and load/swap them at runtime:

# Multiple adapters — load whichever you need
model.load_adapter("./finetuned-math-adapter", "math-expert")
model.load_adapter("./finetuned-code-adapter", "code-assistant")

# Switch between them:
model.set_adapter("math-expert")
answer = model.generate("solve for x: 2x + 5 = 13")

model.set_adapter("code-assistant")
answer = model.generate("write a quicksort in python")

Option 2: Merge and Export

Merge the trained A and B matrices back into W₀ to get a standalone model with no adapter overhead:

Merge Formula
The merge is just a weight addition: W_merged = W₀ + (B · A · α / r)
Once merged, you can discard A and B and use the model like any fine-tuned model.

# Merge LoRA adapter into base weights
model = model.merge_and_unload()   # PEFT: merges adapters into W₀ and strips the LoRA layers

# Now save the merged model like a normal model
model.save_pretrained("./merged-llama3-70b-instruct")

Merge Process: W₀ + ΔW → W_merged
[Diagram: W₀ (70B frozen params) + trained ΔW = B·A (~262K params) = W_merged, a standalone 70B model ready to save and deploy.]

Merging simply adds the trained B·A matrix, scaled by α/r, to the frozen base weights. The result is a single model file, identical in structure to the original.
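The merge can be sanity-checked in a few lines of NumPy: the adapter-style forward pass and the merged matrix produce identical outputs (toy shapes, illustrative random values):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 16, 16, 4, 8

W0 = rng.normal(size=(d, k))     # frozen base weights
B = rng.normal(size=(d, r))      # trained LoRA factors
A = rng.normal(size=(r, k))
x = rng.normal(size=k)

# Adapter-style forward vs merged forward:
h_adapter = W0 @ x + (B @ A @ x) * (alpha / r)
W_merged = W0 + (B @ A) * (alpha / r)
h_merged = W_merged @ x

print(np.allclose(h_adapter, h_merged))  # → True
```

This equivalence is exactly why merging is "free": it changes where the computation happens, not what the model computes.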

Practical Uses

LoRA has opened fine-tuning to researchers, small companies, and hobbyists. Here are the most common applications:

1. Instruction Tuning

Take a pretrained base model and train it to follow instructions, answer questions helpfully, and refuse harmful requests. This is how models like Vicuna and Alpaca were created: fine-tuning LLaMA on human-annotated or GPT-generated instruction-response pairs. Community reproductions such as Alpaca-LoRA showed the same recipe works with LoRA on a single GPU.

# Example: instruction-tuning dataset
# {"instruction": "Explain photosynthesis", "input": "", "output": "..."}
training_data = load_instruction_dataset("./alpaca_data.json")
# LoRA config for instruction tuning
config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj","v_proj"])
model = get_peft_model(base_model, config)
trainer = Trainer(model=model, train_dataset=training_data, ...)
trainer.train()

2. Domain Adaptation

Teach a general model your specialized domain — law, medicine, finance, code. You feed it documents from your domain and LoRA fine-tunes it to understand your domain's language and patterns.

Real Example
A legal firm fine-tuned a 7B model on their contracts and court filings using LoRA (r=8, single A100). The resulting model understood their specific terminology, phrasing, and precedent references — and fit in a single GPU.

3. Character / Roleplay Adapters

LoRA adapters are perfect for teaching a model to roleplay as a specific character, with a particular personality and speech style. Multiple characters can be stored as separate small adapter files.

4. Style Transfer

Fine-tune a model to write in a specific style — technical, literary, concise, conversational. A single LoRA adapter can shift tone without changing the underlying knowledge.

5. Multi-Task Learning with Adapters

Train separate LoRA adapters for different tasks and combine them. Since adapters are small, you can keep many of them in memory and combine outputs — a form of efficient multi-task learning.

Limitations of LoRA

LoRA is powerful, but it's not magic. Understanding its limitations helps you know when to use it — and when a different approach is needed.

When LoRA Struggles

  • Learning entirely new knowledge — LoRA adapts the model's existing knowledge, it doesn't inject large amounts of new facts. For fact-learning, RAG or continued pretraining may be better.
  • Very high-rank tasks — If the target distribution is extremely different from pretraining, r=16 or r=32 may not have enough capacity.
  • Long-context tasks — LoRA alone doesn't extend context length; changes to positional handling (e.g., RoPE interpolation) need separate treatment.
  • Some architectures — LoRA was designed for linear layers with residual connections. Graph neural networks, diffusion model UNets, and some newer architectures need different approaches.
  • Catastrophic forgetting — Even with a frozen base, aggressive fine-tuning on one task can reduce performance on others (mitigated by lower rank, dropout, and data mixing).

LoRA Strengths (For Reference)

  • Efficient instruction following
  • Character/persona adaptation
  • Domain-specific style transfer
  • Multi-adapter ensemble (cheap)
  • Fast iteration cycles
  • Works on consumer GPUs

What to Use Instead

Goal Better Alternative
Inject massive new knowledge Continued pretraining (CPT) or RAG
Very high quality, no compromises Full fine-tuning (if you have the budget)
Architecture without linear layers Adapter variants (e.g., LoRA-Qformer, IA³)
Maximum instruction-following quality DPO / RLHF on top of LoRA
Important Note
LoRA adapters are typically merged for inference, but even after merging, the quality ceiling is often slightly below a full fine-tune. For production systems where every percentage point of quality matters, full fine-tuning on a large cluster may still be the right answer.

Quick Reference

# The LoRA update:
W = W₀ + B · A · (α / r)

# Key hyperparameters:
r         # rank (4–64, higher = more capacity)
α         # scaling factor (often 2×r)
target_modules  # ["q_proj","v_proj","k_proj","o_proj"]
lora_dropout   # 0.0 – 0.1

# Memory saving:
# full fine-tune: ~16 bytes per param (model + grads + optim)
# LoRA: ~2 bytes per trainable param + 2 bytes per frozen param
# QLoRA: ~0.5 bytes per frozen param (4-bit) + 2 bytes per LoRA param

# Common ranks by use case:
r=4-8    # style, persona, light adaptation
r=16     # instruction tuning, domain knowledge
r=32-64  # heavy domain shift, complex tasks