What is an LLM?
A Large Language Model (LLM) is a neural network trained to predict the next piece of text given everything that came before it. That's the whole game. It reads a sequence of tokens (words, word-parts, or characters), and for each position, it estimates the probability distribution over what the next token could be.
Think of it like autocomplete on steroids: not a database of facts, not a reasoning engine, but a very sophisticated pattern-recognition machine that has internalized statistical relationships from billions of documents.
Key insight: LLMs don't "understand" language the way humans do. They build internal representations that capture patterns so well that the outputs appear intelligent. The model literally learns to guess what a well-formed answer looks like.
Modern LLMs like GPT-4, Claude, and Gemini are built on the Transformer architecture, introduced in the landmark 2017 paper "Attention Is All You Need" by Google researchers. Before Transformers, models struggled with long-range dependencies in text. Transformers solved that, and changed everything.
Input
Raw text → converted to numbers via tokenization and embeddings
Process
Neural network layers compute probability distributions over vocabulary
Output
Next token sampled from the distribution, then fed back as input
Tokenization: Breaking Text Into Bite-Sized Pieces
Before a model can process text, it must be converted into something numeric. The first step is tokenization: splitting text into discrete chunks called tokens.
Tokens are not the same as words. A token might be a full word ("cat"), a word with its leading space attached (" cat"), a rare word broken into subword pieces (e.g., "tokenization" → "token" + "ization"), or even a single character. The tokenizer is trained to produce a vocabulary that balances efficiency (not too many tokens per sentence) against precision (meaningful units).
Modern LLMs use Byte Pair Encoding (BPE) or similar algorithms (WordPiece, SentencePiece). The core idea: start with individual characters, then iteratively merge the most frequent adjacent pair into a new token, until the vocabulary reaches a target size.
The "Δ " marks tokens that had a leading space. "quick", "brown", "fox" are common words so they each get one token. Rare words split into subword pieces.
Why it matters: The number of tokens directly affects cost and speed. Most LLM APIs charge per token. A rough English rule: ≈4 characters per token, or about ¾ of a word. This is why short, concise prompts cost less to process.
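The BPE merge loop described above can be sketched in pure Python. This is a toy illustration of the training procedure (the corpus and merge count are made up), not a production tokenizer:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE: start from characters, repeatedly merge the most
    frequent adjacent pair of symbols into a single new token."""
    # Represent each word as a tuple of symbols (characters to start).
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the new merged symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

merges = train_bpe("low low low lower lowest", 3)
```

On this toy corpus the first merges combine "l"+"o" and then "lo"+"w", since those pairs occur most often: exactly the frequent-pair behavior BPE is designed for.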
Embeddings: Turning Tokens Into Numbers
Tokens are just integers: IDs pointing into the tokenizer's vocabulary. To be useful, each token needs to be represented as a list of numbers that capture its meaning and context. That's what embeddings do.
An embedding is a learned vector (a list of floating-point numbers) of fixed dimensionality. For GPT-3, that's 12,288 numbers per token; for BERT-base, 768. These numbers aren't assigned arbitrarily: the model learns them during training so that tokens with similar meanings end up with similar vectors.
Token: "king" β embedding = [ 0.23, -1.45, 0.87, ..., 0.12] (12,288 dims)
Token: "queen" β embedding = [ 0.31, -1.38, 0.92, ..., 0.19]
Token: "apple" β embedding = [ 1.22, 0.45, -0.33, ..., -0.08]
Similarity("king", "queen") >> Similarity("king", "apple")
The magic is that arithmetic over embeddings works. The famous example (originally demonstrated with word2vec): embedding("king") - embedding("man") + embedding("woman") ≈ embedding("queen"). This is called linearity in embedding space, and it's a consequence of the model having learned rich, structured representations.
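Similarity between embedding vectors is usually measured with cosine similarity. Here is a minimal sketch, using tiny made-up 3-dimensional vectors in place of real 12,288-dimensional embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for learned embeddings (values are illustrative, not from a real model)
king  = [0.23, -1.45,  0.87]
queen = [0.31, -1.38,  0.92]
apple = [1.22,  0.45, -0.33]

assert cosine_similarity(king, queen) > cosine_similarity(king, apple)
```

With these toy values, "king" and "queen" point in nearly the same direction while "king" and "apple" do not, mirroring the similarity relationship in the example above.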
Positional Embeddings
Raw embeddings don't know where a token is in the sequence. A "cat" in position 1 and a "cat" in position 50 would look identical. To fix this, positional embeddings are added to the token embeddings: learned or fixed vectors that encode the token's position in the sequence.
Many modern models (Llama and most recent open models) use Rotary Position Embeddings (RoPE) or ALiBi instead of learned positional embeddings. These allow models to generalize better to sequences longer than those seen in training, which matters for inference on inputs longer than the training context window.
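For contrast, the original Transformer's fixed sinusoidal encoding (the scheme that RoPE and ALiBi later replaced) fits in a few lines; `d_model` here is the embedding dimensionality:

```python
import math

def sinusoidal_position(pos, d_model):
    """Fixed positional encoding from the original Transformer paper:
    even dimensions use sin, odd dimensions use cos, at geometrically
    spaced frequencies, so each position gets a unique vector."""
    return [
        math.sin(pos / 10000 ** (i / d_model)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / d_model))
        for i in range(d_model)
    ]

# Position 0 encodes as [sin(0), cos(0), ...] = [0, 1, 0, 1, ...]
pe0 = sinusoidal_position(0, 4)
```

This vector is simply added element-wise to the token embedding before the first layer, so "cat" at position 1 and "cat" at position 50 enter the network as different vectors.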
The Transformer Architecture: Attention Is All You Need
The Transformer is the engine of every modern LLM. It consists of layers of multi-head self-attention and feed-forward networks, wrapped in residual connections and layer normalization. Let's unpack each piece.
Self-Attention: What It Actually Does
Self-attention lets every token in a sequence "look at" every other token and decide how much each one matters for understanding the current position. It computes a weighted average of all token representations, where the weights are determined by learned similarity scores.
For each token, the model computes three learned vectors: a Query (Q) ("what am I looking for?"), a Key (K) ("what do I contain?"), and a Value (V) ("what information do I provide?"). The attention score between token i and token j is the dot product of Q_i with K_j, divided by √d_k, then run through a softmax.
Attention(Q, K, V) = softmax( Q · K^T / √d_k ) · V
where:
  Q · K^T  = pairwise dot products (similarity scores)
  √d_k     = scaling factor (d_k = dimensionality of each head)
  softmax  = converts scores to probabilities (each row sums to 1)
  · V      = weighted sum of value vectors
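The formula above can be implemented directly. This sketch handles a single head with plain Python lists (no batching, no causal masking), purely to show the mechanics:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head: each query takes a
    softmax-weighted average of the value vectors."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

# The query aligns with the first key, so the output is ~ the first value vector
out = attention(Q=[[10.0, 0.0]], K=[[10.0, 0.0], [0.0, 10.0]], V=[[1.0, 0.0], [0.0, 1.0]])
```

In the demo call, the query matches the first key almost exclusively, so the softmax puts nearly all its weight on the first value vector.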
Multi-head attention runs this process in parallel across many "heads" β each head learns different Q/K/V projections, capturing different types of relationships (syntactic, semantic, positional). The outputs of all heads are concatenated and projected back. A model like GPT-3 has 96 attention heads per layer.
Feed-Forward Networks (FFN)
Between attention layers, each token passes through a feed-forward network: essentially a two-layer neural network with a non-linearity (usually GELU or ReLU) in between. This is where much of the model's "knowledge" is stored. The FFN can be thought of as a key-value memory: the first layer selects relevant "keys" based on the current token's representation, and the second layer retrieves corresponding "values."
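A minimal version of that two-layer FFN for a single token vector, with toy dimensions and hand-written matrix multiplies (the GELU here is the common tanh approximation):

```python
import math

def gelu(x):
    # Tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def feed_forward(x, W1, b1, W2, b2):
    """Expand to the hidden size (usually 4x d_model), apply the
    non-linearity, then project back down to d_model."""
    hidden = [gelu(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

# Toy shapes: d_model = 2, hidden size = 4 (weights are illustrative, not trained)
W1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]   # 4 x 2
b1 = [0.0] * 4
W2 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]       # 2 x 4
b2 = [0.0] * 2
y = feed_forward([1.0, 2.0], W1, b1, W2, b2)            # back to 2 dims
```

The 4x expansion and projection back is the standard Transformer FFN shape; in GPT-3 that means 12,288 → 49,152 → 12,288 per token, per layer.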
Layer Structure
A complete Transformer layer looks like this:
Input
  │
  ├─▶ Multi-Head Self-Attention
  │     └─▶ Add & Norm (residual connection + layer norm)
  │
  ├─▶ Feed-Forward Network
  │     └─▶ Add & Norm
  │
  ▼
Output
Residual connections (Add) pass the input directly to the output of each sub-layer, allowing gradients to flow unchanged during backpropagation. Layer normalization (Norm) stabilizes training by normalizing activations per-token across features.
GPT-3 has 96 of these layers stacked. The output of layer N becomes the input to layer N+1. Each layer progressively refines the representations, moving from surface-level patterns (syntax) to deeper meaning (semantics, reasoning).
How Inference Works: Next-Token Prediction
When you send a prompt to an LLM, here's precisely what happens during inference:
1. Tokenize: your input text is converted to a sequence of token IDs using the model's tokenizer.
2. Embed: each token ID is looked up in the embedding matrix, producing a vector. Positional embeddings are added.
3. Process through layers: the sequence passes through every Transformer layer (96 in GPT-3). Each layer refines the representations using self-attention (tokens "talk" to each other) and FFNs.
4. Logits: the final layer outputs a vector of size vocab_size (typically 50,000-100,000). Each number corresponds to the unnormalized probability of that token being next.
5. Sampling: the logits are converted to probabilities via softmax. The model then samples one token from this probability distribution (optionally applying top-p/top-k filtering, temperature scaling, etc.).
6. Append & repeat: the sampled token is appended to the input, and the whole process repeats until the model produces a stop token or hits the maximum context length.
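The six steps above collapse into a short loop. In this sketch the `next_logits` callable stands in for the entire tokenize-embed-layers-logits pipeline (steps 1-4), so the loop itself stays runnable:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(next_logits, tokens, stop_token, max_new_tokens=10):
    """Greedy decoding: pick the most probable next token, append it, repeat."""
    tokens = list(tokens)
    for _ in range(max_new_tokens):
        probs = softmax(next_logits(tokens))
        nxt = max(range(len(probs)), key=probs.__getitem__)  # temperature-0 choice
        tokens.append(nxt)
        if nxt == stop_token:
            break
    return tokens

# Toy "model" over a 5-token vocabulary that always prefers (last token + 1)
toy_model = lambda toks: [3.0 if i == (toks[-1] + 1) % 5 else 0.0 for i in range(5)]
```

Calling `generate(toy_model, [0], stop_token=3)` walks 0 → 1 → 2 → 3 and stops, which is the append-and-repeat loop in miniature; a real model just replaces `toy_model` with a full Transformer forward pass.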
Temperature & Sampling
The temperature parameter controls how "creative" the sampling is. At temperature 0, the model always picks the highest-probability token (greedy decoding, deterministic). At higher temperatures, the distribution is "flatter": the model is more likely to pick less probable tokens, increasing diversity and surprise.
| Parameter | What it does | Good for |
|---|---|---|
| temperature = 0 | Always pick top token (deterministic) | Factual Q&A, code completion |
| temperature = 0.7 | Moderate randomness, still focused | Conversational responses |
| temperature = 1.0+ | High randomness, creative/surprising | Creative writing, brainstorming |
| top_p | Nucleus sampling: sample from the smallest set of tokens covering P probability mass | Often used alongside or instead of temperature |
Training: How the Model Learns From Data
Training an LLM is essentially a massive curve-fitting exercise. The goal: adjust the billions of numbers (weights) in the neural network so that, for every input sequence, the model predicts the next token as accurately as possible.
Step 1: Pretraining on raw text
The model is trained with a causal language modeling objective: given a sequence of tokens, predict the next token. The model sees the correct answer, computes how wrong it was (via cross-entropy loss), and updates its weights slightly to reduce that error. This is repeated trillions of times across billions of documents from the internet.
Loss = -Σ log P(next_token | context)
For each position in the training text:
  • Model sees tokens [1, 2, 3, ...]
  • Must predict token [2] → compute log probability of "is"
  • Must predict token [3] → compute log probability of "an"
  • Must predict token [4] → compute log probability of "AI"
  • Sum all log probs, take negative → gradient → update weights
Prediction error goes down → model learns patterns
GPT-3 was trained on ~300 billion tokens from a scraped web corpus. The training process used thousands of GPU/TPU chips running for weeks. The compute cost was estimated at $4-5 million.
Step 2: Supervised Fine-Tuning (SFT)
After pretraining, the model is a powerful "next token predictor" but not yet a helpful assistant. SFT corrects this: human annotators write high-quality example conversations (prompt → ideal response), and the model is fine-tuned on these demonstrations. This shapes the model's behavior to produce helpful, well-formatted responses.
Step 3: RLHF (Learning From Human Feedback)
Reinforcement Learning from Human Feedback (RLHF) is what makes modern assistants feel aligned with user intent. The process:
- Reward model training: humans rank multiple model responses to the same prompt. A separate neural network (the reward model) learns to predict which response a human would prefer.
- RL optimization: the policy model (the LLM) is fine-tuned using the reward model as a signal, typically via the PPO (Proximal Policy Optimization) algorithm. The LLM generates responses, the reward model scores them, and the LLM updates its weights to generate higher-scoring responses.
RLHF is notoriously tricky to get right. It can lead to reward hacking (the model exploits spurious patterns the reward model associates with quality, rather than being genuinely helpful), mode collapse, or sycophancy. It's an active area of research.
Step 4: Constitutional AI / DPO
Newer approaches simplify the RLHF pipeline: Direct Preference Optimization (DPO) trains the model directly on preference data without a separate reward model, and Constitutional AI replaces much of the human feedback with AI feedback guided by a written set of principles, making alignment training more stable and less computationally expensive.
Parameters, Weights & Model Sizes Explained
Every neural network layer has a set of parameters: the weights and biases that define how inputs are transformed into outputs. In a Transformer, these live in the attention projections (Q, K, V, output), the FFN layers, the embedding matrices, and the layer norms.
| Model | Parameters | Layers | Heads | Context | Training Compute |
|---|---|---|---|---|---|
| GPT-2 | 1.5B | 48 | 25 | 1K | ~few $K |
| GPT-3 | 175B | 96 | 96 | 2K | ~$4-5M |
| Llama 2 70B | 70B | 80 | 64 | 4K | ~$3M |
| GPT-4 | ~1.8T (est.) | 120 (est.) | 96 (est.) | 128K | >$100M (est.) |
| Claude 3.5 | ~2T (est.) | ?? | ?? | 200K | Unknown |
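The table's GPT-3 entry can be roughly reproduced from just its layer count and hidden size, using a standard back-of-envelope formula: each layer holds about 4·d² attention weights (the Q, K, V, and output projections) plus about 8·d² FFN weights (the 4x expansion and the projection back), with embeddings ignored:

```python
def approx_params(n_layers, d_model):
    """Rough decoder-only Transformer size: ~12 * layers * d_model^2
    (4*d^2 for attention projections + 8*d^2 for the FFN, per layer),
    ignoring embedding matrices and layer norms."""
    return 12 * n_layers * d_model ** 2

# GPT-3: 96 layers, hidden size 12,288 -> about 174 billion, close to the quoted 175B
gpt3_estimate = approx_params(96, 12288)
```

The small remaining gap is mostly the embedding matrix (vocab_size x d_model), which the approximation leaves out.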
What Do Parameters Actually Represent?
A parameter is literally a floating-point number stored in a matrix. The model "knows" something not as a stored fact, but as a pattern in these numbers: a distributed, compressed representation of the statistical regularities in its training data.
A useful analogy: The human brain has ~86 billion neurons with ~100 trillion synapses. GPT-3's 175 billion parameters are comparable in scale to the synaptic connections in a human brain, though the architectures and learning mechanisms are fundamentally different.
Why Bigger Models Are Smarter (Usually)
Larger models benefit from two things: more capacity (they can store more patterns) and, critically, in-context learning (they can pick up new patterns from the prompt itself without weight changes). The "emergent abilities" phenomenon (where capabilities seem to appear suddenly at certain model scales) is partly a statistical artifact and partly a real consequence of scale.
Prompt Engineering Basics
Prompt engineering is the practice of crafting inputs to get better outputs from an LLM. It exploits the fact that the model's behavior is highly sensitive to how you frame your request. Good prompts give the model clear signals about what to do, how to do it, and what format to respond in.
Be Specific
Vague prompts produce vague answers. "Explain quantum computing" gets you a surface overview. "Explain quantum entanglement to a 10-year-old using a sock metaphor" gets you something genuinely useful.
Few-Shot Examples
Give 2-3 examples of the input→output pattern you want. The model infers the pattern and applies it to your actual query. Often more powerful than detailed instructions.
Chain of Thought
Ask the model to "think step by step" before answering. This gives the model compute budget to work through reasoning before committing to an answer. Dramatically improves accuracy on complex tasks.
System Prompts
System-level instructions set persistent behavior: "You are a helpful coding assistant." "Always format responses as JSON." These act as a behavioral primer before any user message.
Key Techniques
Translate to French: "Hello, how are you?"
Translate English to French:
"Hello" β "Bonjour"
"Good morning" β "Bonjour"
"Thank you" β "Merci"
"Hello, how are you?" β
Problem: If a train travels 120km in 2 hours, what is its average speed?
Let me work through this step by step:
Step 1: Identify what we're solving for β speed = distance / time
Step 2: Distance = 120km, Time = 2 hours
Step 3: 120 / 2 = 60 km/h
Answer: 60 km/h
Why chain-of-thought works: Asking for steps forces the model to allocate more forward-pass computation to the reasoning process. It effectively gives the model more "thinking room" before producing the final answer: a cheap trick that mimics System 2 thinking.
Limitations & Gotchas
LLMs are genuinely impressive, but they're also deeply flawed in ways that matter. Understanding these limitations is essential for using them responsibly.
Hallucinations
Models confidently generate false information. They don't know what they don't know. A model can produce a completely fabricated citation with the same fluency as a real one. This is the #1 practical risk in production deployments.
Training Data Cutoff
Models don't know events after their training cutoff. Ask them about today's news and they'll either say they don't know or, worse, hallucinate plausible-sounding recent events.
No True Memory
LLMs have a fixed context window (e.g., 128K tokens). Beyond that, information is lost. Each conversation starts fresh. Persistence requires RAG (Retrieval-Augmented Generation) or external memory systems.
Brittle Reasoning
LLMs can do impressive multi-step reasoning, but they're also sensitive to how problems are framed. Change a single word in a math problem and get a completely wrong answer. They lack robust, generalizable reasoning.
Toxicity & Bias
Models absorb harmful patterns from training data. Alignment techniques (RLHF, SFT) reduce but don't eliminate bias, stereotypes, and toxic outputs. Sensitive applications need guardrails and evaluation.
No Tool Use by Default
Without explicit tooling (web search, code execution, calculators), LLMs can't verify facts, run code, or interact with external systems. They are, by default, isolated text predictors.
Context Length: A Hard Limit
Every LLM has a maximum context window. Once you exceed it, the model simply can't see the older tokens. Some models handle long contexts poorly even within the window: information buried in the middle of a very long context can be effectively "forgotten" (the lost-in-the-middle problem). Newer models (Claude 3.5, Gemini 1.5) use extended context windows and architectural improvements to mitigate this.
The irony: LLMs are trained to sound authoritative β confident, fluent, well-structured. This makes their errors harder to detect than the errors of a system that clearly signals uncertainty. Always verify claims that matter. LLMs are excellent at pattern-matching your intent; they are not reliable oracles.
Despite these limitations, LLMs represent a genuine leap in machine intelligence. They're remarkably versatile (writing, reasoning, explaining, translating, debugging), all from a single architecture trained on next-token prediction. The next breakthrough may come from better training methods, better architectures, or hybrid systems that combine LLMs with structured knowledge and real-world tools. The field is moving fast.