The architecture that changed NLP forever — self-attention, positional encoding, and the pre-trained model revolution.
LSTMs had three key bottlenecks that made them hard to scale to the level needed for modern NLP.
| Problem | Detail | Transformer Fix |
|---|---|---|
| Sequential (Slow) | Items are fed one at a time — needs h_{t-1} before computing h_t, so can't parallelize | Self-attention sees all tokens at once — fully parallelizable |
| Large Memory | Must store every hidden state through time for backpropagation through time (BPTT) | No recurrent state to unroll — gradients flow through attention layers directly |
| No Transfer Learning | Must retrain from scratch on every new labeled dataset — costly & slow | Pre-train once on massive corpus; fine-tune cheaply on any task |
- **Transformer (2017):** Original Transformer architecture introduced; designed for machine translation.
- **GPT (2018):** First major pre-trained Transformer; Generative Pre-Training using the decoder stack.
- **BERT (2018):** Bidirectional Encoder Representations from Transformers; excels at understanding tasks.
- **DistilBERT (2019):** 60% faster, 40% lighter than BERT while maintaining comparable accuracy.
- **GPT-3 (2020):** 175 billion parameters; autoregressive model for generating human-like text.
- **Translation:** Translate text from one natural language to another (e.g., English → French).
- **Sentiment analysis:** Identify the emotion or opinion in a sentence (positive/negative/neutral).
- **Summarization:** Condense a long document into a shorter, meaningful version.
- **Fill-mask:** "This course will teach you all about ___ models." → Transformers
- **Named entity recognition (NER):** Detect entities in text: people, places, dates, organizations.
- **Part-of-speech (POS) tagging:** Assign grammatical labels (noun, verb, adjective, etc.) to each word.
**Encoder-decoder (sequence-to-sequence):** Uses both components for generative tasks that require input (e.g., translation, summarization). Models: BART, T5.
**Self-attention:** Computes relationships between all words in the sequence simultaneously using Q, K, V matrices.
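The Q, K, V computation above can be sketched in a few lines of NumPy. This is a minimal illustration with toy dimensions and random weights (the names `d_k`, `d_model` and the shapes are illustrative assumptions, not from the original):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability, then normalize.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (n, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, n) pairwise relevance scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted blend of value vectors

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8)
```

Every token attends to every other token in one matrix multiplication, which is why the whole sequence can be processed in parallel.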
**Residual (skip) connection:** Adds the input directly to the attention output: output = LayerNorm(x + Attention(x)). Helps prevent vanishing gradients in deep stacks.
**Layer normalization:** Normalizes the distribution of intermediate layers → smoother gradients, faster training, better generalization.
**Feed-forward network:** Two linear layers with ReLU in between. Transforms attention vectors for the next block. Also followed by a skip connection + LayerNorm.
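The skip connection, LayerNorm, and feed-forward sub-block can be sketched together. A minimal NumPy version, assuming toy dimensions and a random stand-in for the attention output:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Two linear layers with ReLU in between.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(1)
n, d_model, d_ff = 4, 8, 32
x = rng.normal(size=(n, d_model))
attn_out = rng.normal(size=(n, d_model))   # stand-in for Attention(x)
h = layer_norm(x + attn_out)               # skip connection + LayerNorm

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = layer_norm(h + feed_forward(h, W1, b1, W2, b2))  # second skip + LayerNorm
print(out.shape)  # (4, 8)
```

The addition `x + ...` gives gradients a direct path back to earlier layers, which is what keeps them from vanishing in deep stacks.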
Transformers have no built-in sense of word order (unlike RNNs/CNNs). Positional encoding injects order information into the embeddings so the model knows where each word appears.
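The standard sinusoidal scheme from the original Transformer can be written directly from its formula (even dimensions get sine, odd get cosine, with geometrically spaced frequencies); a minimal sketch:

```python
import numpy as np

def positional_encoding(n_pos, d_model):
    """Sinusoidal positional encodings: PE[pos, 2i] = sin(pos / 10000^(2i/d)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(n_pos)[:, None]                  # (n_pos, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)  # one frequency per dim pair
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = positional_encoding(50, 16)
# In practice the encoding is simply added to the token embeddings:
# embeddings = token_emb + pe[:seq_len]
```

Because each position gets a unique pattern of sines and cosines, the model can distinguish "cat sat" from "sat cat" even though attention itself is order-agnostic.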
Instead of a single self-attention computation, multi-head attention runs several attention operations in parallel, each with its own Q, K, V matrices. Outputs are concatenated and projected.
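A minimal NumPy sketch of the concatenate-and-project step, assuming toy dimensions (2 heads of size 4 projected back to `d_model = 8`):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one per attention head."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        outputs.append(softmax(scores) @ V)
    # Concatenate all head outputs, then project back to d_model.
    return np.concatenate(outputs, axis=-1) @ W_o

rng = np.random.default_rng(2)
n, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
X = rng.normal(size=(n, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))
out = multi_head_attention(X, heads, W_o)
print(out.shape)  # (4, 8)
```

Each head learns its own notion of relevance (e.g., one tracking syntax, another coreference), and the output projection W_o mixes them back together.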
Consider: "The cat drank the milk because it was hungry." vs "The cat drank the milk because it was sweet."
In the first sentence "it" = cat; in the second "it" = milk. Self-attention lets each word "look at" every other word to resolve this.
- **Key:** What each word offers. ("Here's what I contain.")
- **Query:** What the current word is looking for. ("What am I interested in?")
- **Value:** The content to be retrieved. ("Here's the actual information.")
**Masked language modeling (MLM):** Mask ~15% of tokens → train the model to predict them. Example:
"The man went to the [MASK]." → "store"
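The masking step can be sketched on whitespace-split tokens. This is a simplified illustration: the real BERT recipe also sometimes substitutes a random token or keeps the original instead of always writing `[MASK]`, which is omitted here.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=42):
    """Replace ~mask_prob of tokens with [MASK]; return masked sequence
    and a dict mapping masked positions to their original tokens."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = tok  # the model is trained to predict this token
        else:
            masked.append(tok)
    return masked, targets

tokens = "the man went to the store to buy milk".split()
masked, targets = mask_tokens(tokens)
```

The `targets` dict is the only supervision needed, and it comes from the text itself: no human labels.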
**Next sentence prediction (NSP):** Given two sentences A & B, predict: IsNext or NotNext. Trains the model to understand sentence relationships (important for Q&A, NLI).
**Causal language modeling:** Given "The cat sat on the", predict "mat". Repeat for all positions. No labels needed — fully self-supervised.
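"Repeat for all positions" means every prefix of the sentence becomes a training example. A tiny sketch of how the (context → next token) pairs fall out of unlabeled text:

```python
# Build (context -> next token) training pairs from a raw sentence.
tokens = "the cat sat on the mat".split()
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs:
    print(" ".join(context), "->", target)
# last pair printed: "the cat sat on the" -> "mat"
```

In practice a causal attention mask lets the model compute the prediction for every position in one parallel pass during training, rather than looping prefix by prefix.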
| Feature | RNN/LSTM | BERT (Encoder) | GPT (Decoder) | T5/BART (Seq2Seq) |
|---|---|---|---|---|
| Attention | Bahdanau (optional) | Bidirectional self-attn | Masked self-attn | Both |
| Direction | Left→Right | Bidirectional ✓ | Left→Right only | Enc: bi, Dec: L→R |
| Parallelization | Sequential ✗ | Full parallel ✓ | Parallel (train) ✓ | Parallel (train) ✓ |
| Transfer Learning | Poor ✗ | Excellent ✓ | Excellent ✓ | Excellent ✓ |
| Best For | Short sequences | NLU (classification) | NLG (generation) | Seq-to-seq tasks |
| Examples | LSTM, GRU | BERT, RoBERTa, ALBERT | GPT-1/2/3, Llama | T5, BART, mT5 |
Q1. What is the main reason transformers can be trained faster than LSTMs?
Q2. In the self-attention formula Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V, what does dividing by √d_k accomplish?
Q3. BERT is described as "bidirectional." What does this mean?
Q4. Which pre-training objective does BERT use to learn sentence-level relationships?
Q5. What is positional encoding in transformers and why is it needed?
Q6. GPT is an "autoregressive" model. What does that mean during generation?