The transformer architecture didn't just improve language models; it rewrote the rules of how machines process text. Before transformers, models like RNNs and LSTMs read words one at a time, like reading a book page by page. That made them slow, hard to train, and terrible at remembering what happened early in a long sentence. Then, in 2017, a paper called "Attention Is All You Need" dropped. It introduced a model that could look at all the words in a sentence at once, figure out which ones mattered most, and do it all in parallel. That's the transformer. And today, every major large language model, from GPT to Llama to Claude, runs on this design. If you want to know how AI understands "The cat sat on the mat" rather than just guessing the next word, you need to understand the transformer.
How Transformers Turn Words Into Numbers
Before a transformer can do anything, it needs to turn text into numbers. That’s the job of the tokenizer. It splits sentences into chunks called tokens. For example, "unhappiness" might become "un" + "happ" + "iness" instead of one token, because the model learns that "un" and "happ" appear often in other words. This lets the model handle words it’s never seen before by combining parts it knows. The vocabulary size for models like GPT-2 is around 50,257 tokens, meaning it has a dictionary of that many possible chunks of text.
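The splitting logic can be sketched as a greedy longest-match against a toy vocabulary. Real tokenizers like GPT-2's byte-pair encoding learn their subword inventory from data; the vocabulary below is a made-up stand-in, just to show the mechanics:

```python
# Toy subword tokenizer: greedy longest-match against a tiny vocabulary.
# Real BPE tokenizers learn merges from data; this VOCAB is hypothetical.
VOCAB = {"un", "happ", "iness", "happy", "the", "cat", "sat"}

def tokenize(word):
    """Split a word into the longest known chunks, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible chunk first.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("unhappiness"))  # ['un', 'happ', 'iness']
```

The fallback to single characters is why such a tokenizer never fails on an unseen word: it can always decompose it into pieces it knows.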
Each token gets converted into a vector: a list of numbers. In GPT-2, each token becomes a 768-dimensional vector. Think of it like assigning each word a unique address in a 768-room building. Words with similar meanings, like "king" and "queen," end up near each other in this space. Words like "apple" and "quantum" are far apart. This is called an embedding layer. But there's a problem: if the model only uses embeddings, it can't tell the difference between "The dog chased the cat" and "The cat chased the dog." The same words, same embeddings, just reordered. That's where positional encoding comes in.
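A quick way to see what "near each other in this space" means is cosine similarity over a toy embedding table. The vectors below are invented for illustration; real embeddings are learned during training and have hundreds of dimensions:

```python
import math

# Hypothetical 4-dimensional embeddings (real models use 768+ dims).
embeddings = {
    "king":  [0.9, 0.8, 0.1, 0.0],
    "queen": [0.8, 0.9, 0.1, 0.1],
    "apple": [0.1, 0.0, 0.9, 0.8],
}

def cosine(a, b):
    """Cosine similarity: close to 1.0 means pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

print(cosine(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine(embeddings["king"], embeddings["apple"]))  # much smaller
```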
Positional encoding adds information about where each token sits in the sequence. In the original transformer, it's not just a number like "first word": it's a unique pattern of sine and cosine waves added to each vector. (GPT-2 instead learns its position vectors during training, but the idea is the same.) This lets the model know not just what words are there, but their order. Now, when the model sees "cat" in position 1 versus position 4, it knows the context changes.
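Here is a minimal sketch of the sinusoidal scheme from the original paper. GPT-2 learns its position vectors instead, but the sinusoidal version is easy to compute by hand:

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding from "Attention Is All You Need":
    even dimensions use sin, odd dimensions use cos, each pair at a
    different wavelength, so every position gets a unique pattern."""
    pe = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))
        pe.append(math.sin(position * freq))
        pe.append(math.cos(position * freq))
    return pe[:d_model]

# The same token embedding, shifted differently by its position:
token_vec = [0.5] * 8
pos1 = [t + p for t, p in zip(token_vec, positional_encoding(1, 8))]
pos4 = [t + p for t, p in zip(token_vec, positional_encoding(4, 8))]
print(pos1 != pos4)  # True: same word, different position, different vector
```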
The Heart of the Transformer: Self-Attention
The real magic happens in the self-attention mechanism. For every word in a sentence, the transformer asks: "Which other words should I pay attention to when understanding this one?" It doesn’t just look at neighbors. It looks at every word.
Here’s how it works. Each token’s embedding gets passed through three separate linear layers to produce three vectors: query, key, and value. The query asks: "What am I looking for?" The key says: "Here’s what I represent." The value is the actual content. The model calculates attention scores by taking the dot product of the query and all the keys. That gives a score for how much each word relates to the current one. Higher score = more attention.
But there's a catch. If you just used these raw scores, the numbers could get huge and destabilize training. So they're scaled down by the square root of the key dimension. In GPT-2, each attention head works with 64-dimensional keys, so the scale factor is √64 = 8. Then, the scores are passed through a softmax function, turning them into weights that add up to 1. This means the model forms its understanding of the current word as a weighted mix of all the words' values.
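Putting the last two paragraphs together, scaled dot-product attention fits in a few lines of plain Python. This is a toy single-head version with tiny vectors, not a production implementation:

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: one output vector per query."""
    d_k = len(keys[0])
    output = []
    for q in queries:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted mix of all the value vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        output.append(mixed)
    return output

# Three toy tokens with 4-dimensional vectors (Q = K = V for simplicity):
x = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0]]
out = attention(x, x, x)
print(len(out), len(out[0]))  # 3 4
```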
Now, here's the twist: multi-head attention. Instead of one set of query, key, and value vectors, the model splits the 768-dimensional embedding into 12 smaller chunks, each with 64 dimensions. Each chunk runs its own attention calculation independently. One head might learn to notice subject-verb relationships. Another might track pronoun references. A third might catch emotional tone. These 12 heads run in parallel, and their outputs are concatenated back into one 768-dimensional vector. This lets the model capture many types of relationships at once.
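The splitting and concatenation is just slicing. A scaled-down sketch with 2 heads of 4 dimensions instead of GPT-2's 12 heads of 64:

```python
# Multi-head bookkeeping: slice one embedding into equal chunks,
# process each head independently, then concatenate the results.

def split_heads(vec, n_heads):
    """Slice one embedding into n_heads equal chunks."""
    d_head = len(vec) // n_heads
    return [vec[h * d_head:(h + 1) * d_head] for h in range(n_heads)]

def concat_heads(heads):
    """Concatenate per-head outputs back into one vector."""
    return [x for head in heads for x in head]

vec = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
heads = split_heads(vec, n_heads=2)
print(heads)  # [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
print(concat_heads(heads) == vec)  # True: split and concat round-trip
```

In a real model each head also has its own query/key/value projections, which is what lets the heads specialize; the slicing above is only the shape arithmetic.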
How the Transformer Layer Repeats
One attention mechanism isn’t enough. So the transformer stacks multiple transformer blocks on top of each other. Each block has two main parts: the multi-head attention layer and a feed-forward neural network (also called an MLP).
After attention, the output goes into the MLP. This isn't fancy: it's just two linear layers with a nonlinear activation (GELU in GPT-2) in between. The first layer expands the 768-dimensional vector to 3,072 dimensions. This gives the model more "thinking space" to find complex patterns. Then, it collapses it back down to 768. Think of it like zooming in on a painting to see fine details, then zooming out to fit it back into the frame.
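The MLP is small enough to write out directly. A toy version with made-up weights and GPT-2's tanh-approximated GELU, using a 2-dimensional model expanded to 4 hidden units instead of 768 to 3,072:

```python
import math

def gelu(x):
    """tanh approximation of GELU, as used in GPT-2."""
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi)
                                    * (x + 0.044715 * x ** 3)))

def mlp(vec, w_up, w_down):
    """Feed-forward block: expand, apply GELU, project back down.
    w_up is d_model x d_hidden; w_down is d_hidden x d_model."""
    hidden = [gelu(sum(v * w for v, w in zip(vec, col)))
              for col in zip(*w_up)]
    return [sum(h * w for h, w in zip(hidden, col)) for col in zip(*w_down)]

# Arbitrary illustrative weights: d_model=2 expands to d_hidden=4.
w_up = [[0.1, 0.2, 0.3, 0.4],
        [0.5, 0.6, 0.7, 0.8]]
w_down = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
out = mlp([1.0, -1.0], w_up, w_down)
print(len(out))  # 2: expanded to 4, collapsed back to d_model
```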
But here's the critical part: residual connections. Around each of these sublayers (attention and MLP), the original input is added to the output. So if the attention layer outputs a change of +0.5, the final result is the input + 0.5. This keeps information from earlier layers alive. Without it, gradients would vanish in deep networks, and training would fail. In GPT-2, this happens twice per block: once around attention and once around the MLP.
Then comes layer normalization. This isn't just a fancy term; it's what makes training stable. The original transformer used post-LN: normalize after each sublayer. But that caused training instability in deep stacks. The modern approach, pre-LN, normalizes before each sublayer. This lets the model train stably, often without the long learning-rate warm-up that post-LN required. Today, virtually every major LLM uses pre-LN.
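Residuals and pre-LN combine into a block skeleton like the following sketch, where the attention and MLP sublayers are stand-in functions (any vector-to-vector function works here):

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def pre_ln_block(x, attention, mlp):
    """One pre-LN transformer block: normalize *before* each sublayer,
    then add the result back to the residual stream."""
    x = [a + b for a, b in zip(x, attention(layer_norm(x)))]  # residual 1
    x = [a + b for a, b in zip(x, mlp(layer_norm(x)))]        # residual 2
    return x

# Identity stand-ins for the sublayers show the shape is preserved:
out = pre_ln_block([1.0, 2.0, 3.0, 4.0], lambda v: v, lambda v: v)
print(len(out))  # 4
```

Note the order: normalize, run the sublayer, then add the untouched input back. That untouched addition is what keeps gradients flowing through deep stacks.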
Encoder vs Decoder: What’s the Difference?
Not all transformers are built the same. There are three main types: encoder-only, decoder-only, and encoder-decoder.
Encoder-only models like BERT are great for understanding text. They take a sentence, run it through layers of attention, and output a rich representation. They’re used for tasks like sentiment analysis or question answering, where you need to understand context.
Decoder-only models like GPT-2 are built for generation. They don't need an encoder because they predict the next token step by step. Each time they generate a word, they look at all previous words. To prevent cheating, they use causal masking. That means when predicting the third word, the model can't see the fourth, fifth, or any future words. The attention scores above the diagonal are set to negative infinity before the softmax, so their weights come out as zero. This forces the model to learn causality: words depend only on what came before.
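The masking itself is one line of bookkeeping. A small sketch on a 3×3 score matrix:

```python
import math

def causal_scores(scores):
    """Mask attention scores so position i cannot see positions > i.
    Setting masked scores to -inf makes their softmax weight exactly 0."""
    n = len(scores)
    return [[scores[i][j] if j <= i else float("-inf") for j in range(n)]
            for i in range(n)]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

# Unmasked, every position would attend to every other equally:
scores = [[1.0, 1.0, 1.0]] * 3
masked = causal_scores(scores)
weights = [softmax(row) for row in masked]
print(weights[0])  # [1.0, 0.0, 0.0]: the first token sees only itself
```

Row 2 comes out as [0.5, 0.5, 0.0]: the second token splits its attention over itself and the first token, and the future stays invisible.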
Encoder-decoder models like T5 or BART are for tasks like translation or summarization. The encoder reads the input text. The decoder then uses that encoded info, plus the words it’s generated so far, to produce output. It has an extra attention layer that lets it "look back" at the encoder’s output while generating. This is why it can translate "Je suis fatigué" into "I am tired"-it knows what the input meant.
The Full Pipeline: From Input to Output
Here’s what happens step by step when a model like GPT-2 generates text:
- Tokenization: "Hello, how are you?" becomes ["Hello", ",", " how", " are", " you", "?"] (GPT-2's tokenizer folds the leading space into each token)
- Embedding: Each token turns into a 768D vector.
- Positional encoding: Each vector gets its position added.
- Transformer layers: The 12 blocks process the sequence. Each block updates representations using attention and MLP.
- Unembedding: The final 768D vector goes through a linear layer to map back to the 50,257-token vocabulary.
- Softmax: The output scores become probabilities. The model might say: "cat" (12%), "dog" (8%), "mat" (15%), "the" (30%), etc.
- Sampling: The model picks the next token, but not always the highest-probability one. It often samples according to the probabilities, which helps avoid repetitive output.
- Repeat: The new token is added to the input. The whole process repeats until it hits an end token.
This autoregressive loop is why LLMs can write paragraphs. Each word is generated one at a time, based on everything that came before.
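The loop itself can be sketched with a hypothetical lookup table standing in for the model's forward pass. A real model recomputes the distribution from all previous tokens at every step; the table below is invented purely for illustration:

```python
import random

random.seed(0)

# Hypothetical next-token distributions, standing in for a full forward
# pass through the transformer layers.
TABLE = {
    ("The",): {"cat": 0.6, "dog": 0.4},
    ("The", "cat"): {"sat": 0.9, "ran": 0.1},
    ("The", "dog"): {"sat": 0.5, "ran": 0.5},
}

def generate(prompt, max_tokens=5):
    tokens = list(prompt)
    for _ in range(max_tokens):
        # Contexts not in the table end the sequence.
        probs = TABLE.get(tuple(tokens), {"<end>": 1.0})
        # Sample from the distribution instead of always taking the top
        # token; this is what keeps the output from getting repetitive.
        choice = random.choices(list(probs), weights=list(probs.values()))[0]
        if choice == "<end>":
            break
        tokens.append(choice)
    return tokens

print(generate(["The"]))
```

Each pass through the loop appends one token and feeds the longer sequence back in, which is exactly the autoregressive behavior described above.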
Why Transformers Outperform Old Models
Before transformers, RNNs processed text sequentially, which meant they couldn't be parallelized across time steps. Training on a sentence with 100 words took 100 sequential steps. Transformers process all 100 at once. That's why they train dramatically faster on parallel hardware like GPUs.
Also, RNNs suffered from the "vanishing gradient" problem. If a word was 50 steps back, the model forgot its influence. Transformers don’t have this. Self-attention connects every word directly. "The cat" and "the mat" are linked no matter how far apart they are.
And then there's scale. Transformers don't just work better; they scale better. Add more layers? More parameters? More data? They handle it. GPT-3 has 175 billion parameters. GPT-4 likely has more. No RNN could handle that. Transformers made massive models feasible.
Training Costs and What Happens After
Training a transformer isn't cheap. GPT-3 reportedly cost over $100 million in compute. It took weeks on thousands of GPUs. Why? Because every weight (every attention head, every embedding, every linear layer) is adjusted during training. The model sees trillions of tokens. It learns that after "The cat sat on the," the next word is usually "mat." Not because it remembers the phrase, but because the weights adjusted to reflect that pattern across millions of examples.
Once training is done, those weights freeze. That's the model. Inference (generating text) uses those fixed weights. No more learning. Just computation. A single inference request might use 100ms of GPU time. But the cost of training? That's what made LLMs so expensive to build.
And yet, despite the cost, the architecture is simple enough to replicate. That's why we now have open models like Llama, Mistral, and Phi. The transformer didn't just enable big companies to build AI; it let anyone with enough compute try.
What Comes Next?
Transformers aren't perfect. They're slow on long documents. They don't reason like humans. They hallucinate. But they're the best we have. Researchers are working on alternatives, like state-space models and hybrid architectures, but none have matched the transformer's combination of performance, scalability, and flexibility.
Today, transformers power chatbots, code assistants, translation tools, and search engines. They're in your phone, your laptop, your smart speaker. And they all trace back to that 2017 paper. The transformer didn't just change AI. It became the foundation for how machines understand language, and for how we'll build the next generation of AI.