🚀

Why Transformers?

History & motivation for moving beyond LSTMs

Problems with LSTMs that Transformers Solve

What

LSTMs had three key bottlenecks that made them hard to scale to the level needed for modern NLP.

Problem | Detail | Transformer Fix
Sequential (Slow) | Tokens are fed one at a time — h_{t-1} is needed before h_t can be computed, so training can't be parallelized | Self-attention sees all tokens at once — fully parallelizable
Large Memory | Must store all hidden states and gradients through time | Attention over fixed-length key/value stores
No Transfer Learning | Must retrain from scratch on every new labeled dataset — costly & slow | Pre-train once on a massive corpus; fine-tune cheaply on any task

Timeline: From Transformers to Modern LLMs

2017
"Attention Is All You Need" — Vaswani et al.

Original Transformer architecture introduced; designed for machine translation.

June 2018
GPT-1 (OpenAI)

First major pre-trained Transformer; Generative Pre-Training using decoder stack.

October 2018
BERT (Google)

Bidirectional Encoder Representations from Transformers; excels at understanding tasks.

2019
DistilBERT

60% faster, 40% lighter than BERT while maintaining comparable accuracy.

2020
GPT-3 (OpenAI)

175 billion parameters; autoregressive model for generating human-like text.

Why Pre-Train Instead of Training from Scratch?

🎯 The Core Idea: Train once on massive unlabeled text using self-supervised learning, then fine-tune cheaply on specific tasks.
  • Time: Training from scratch on large corpora takes extremely long — weeks to months on hundreds of GPUs.
  • Cost & Carbon: Pre-training is very expensive and has a large environmental footprint — sharing pre-trained weights means that cost is paid only once, not by every team.
  • Data efficiency: Fine-tuning a pre-trained model requires far less labeled data than training from scratch.
  • Domain knowledge: A pre-trained model already has statistical knowledge of language that transfers to the fine-tuning task.
💡 Self-Supervised Learning: Labels are automatically created from the raw text (e.g., masking words, predicting the next word) — no human labeling needed!
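As a toy sketch of how MLM-style labels fall out of raw text with no human annotation (whitespace tokenization and a literal "[MASK]" string are simplifications of what real subword tokenizers do):

```python
import random

def make_mlm_example(text, mask_rate=0.15, seed=1):
    """Create (inputs, labels) for masked-language-model training.

    Toy sketch: whitespace tokenization and a plain "[MASK]" string are
    simplifications -- real models use subword tokenizers -- but the
    labels-from-raw-text idea is the same.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in text.split():
        if rng.random() < mask_rate:
            inputs.append("[MASK]")
            labels.append(tok)      # the model must recover the original token
        else:
            inputs.append(tok)
            labels.append(None)     # position not scored in the loss
    return inputs, labels

inputs, labels = make_mlm_example("the cat drank the milk because it was hungry")
print(inputs)   # some tokens replaced by [MASK]
print(labels)   # the masked-out originals become the training labels
```

The labels come for free from the text itself — this is what makes pre-training on unlabeled corpora possible.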

NLP Tasks Transformers Excel At

🌐 Machine Translation (MT)

Translate text from one natural language to another (e.g., English → French).

💬 Sentiment Analysis

Identify the emotion or opinion in a sentence (positive/negative/neutral).

📝 Text Summarization

Condense a long document into a shorter, meaningful version.

🔲 Fill-in-the-Blank (MLM)

"This course will teach you all about ___ models." → Transformers

🏷️ Named Entity Recognition (NER)

Detect entities in text: people, places, dates, organizations.

🔤 Part-of-Speech (POS) Tagging

Assign grammatical labels: noun, verb, adjective, etc. to each word.

🏗️

Transformer Architecture

Encoder, decoder, and the three model families

Encoder vs. Decoder Overview

Input Tokens → Token Embedding + Positional Encoding → Encoder Block(s) → Context Vectors

Context Vectors: the final encoder layer produces one context vector per input token. The decoder attends to all of these via cross-attention while generating output — unlike RNN seq2seq, there is no single fixed-length bottleneck vector.

🔵 Encoder

  • Reads the entire input sequence at once (bidirectional)
  • Produces a rich representation for each token
  • Best for understanding tasks (NLU)
  • Models: BERT, RoBERTa, ALBERT

🟠 Decoder

  • Generates tokens one at a time (left-to-right only)
  • Attends only to previously generated tokens (masked)
  • Best for generation tasks (NLG)
  • Models: GPT, GPT-2, GPT-3

🔵+🟠 Encoder–Decoder (Seq2Seq)

Uses both components for generative tasks that require input (e.g., translation, summarization). Models: BART, T5.

Inside One Encoder Block (4 Modules)

Input (Embedding + Pos) → Self-Attention Layer → Add & Norm (skip connection) → Feed-Forward Network → Add & Norm (skip connection) → Output (to next block)

① Self-Attention

Computes relationships between all words in the sequence simultaneously using Q, K, V matrices.

② Skip (Residual) Connection

Adds the input directly to the attention output: output = LayerNorm(x + Attention(x)). Prevents vanishing gradients.

③ Layer Normalization

Normalizes the distribution of intermediate layers → smoother gradients, faster training, better generalization.

④ Feed-Forward Network (FFN)

Two linear layers with ReLU in between. Transforms attention vectors for the next block. Also has a skip connection + LayerNorm after it.
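The four modules above can be sketched in a few lines of NumPy. This is a toy single-head block (no multi-head split, no output projection, random untrained weights), not a production implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token's vector to zero mean / unit variance
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def encoder_block(x, Wq, Wk, Wv, W1, b1, W2, b2):
    d_k = Wq.shape[1]
    # 1) self-attention over all tokens at once
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k)) @ V
    # 2)+3) skip (residual) connection, then layer normalization
    x = layer_norm(x + attn)
    # 4) feed-forward network: two linear layers with ReLU in between
    ffn = np.maximum(0, x @ W1 + b1) @ W2 + b2
    # second skip connection + LayerNorm
    return layer_norm(x + ffn)

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((5, d))                       # 5 tokens, d_model = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
W1, b1 = rng.standard_normal((d, 4 * d)) * 0.1, np.zeros(4 * d)
W2, b2 = rng.standard_normal((4 * d, d)) * 0.1, np.zeros(d)
out = encoder_block(x, Wq, Wk, Wv, W1, b1, W2, b2)
print(out.shape)   # (5, 8) -- same shape as the input, ready for the next block
```

Because input and output shapes match, blocks can be stacked — which is exactly how the full encoder is built.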

Positional Encoding

Why

Transformers have no built-in sense of word order (unlike RNNs/CNNs). Positional encoding injects order information into the embeddings so the model knows where each word appears.

// Positional Encoding formula
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))   ← even indices
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))   ← odd indices
// pos = position in sequence, i = dimension index
// d_model = embedding dimension
// Final input = Token Embedding + Positional Encoding
💡 Key Point: Sine for even indices, cosine for odd indices. The resulting vectors are added to (not concatenated with) the word embeddings.
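The formula translates directly into NumPy (this sketch assumes an even d_model so sin/cos dimensions pair up):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # assumes even d_model; pairs dimension 2i (sin) with 2i+1 (cos)
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model // 2)
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even indices
    pe[:, 1::2] = np.cos(angles)                 # odd indices
    return pe

pe = positional_encoding(max_len=50, d_model=16)
# the encoding is ADDED to the token embeddings, not concatenated:
# inputs = token_embeddings + pe[:seq_len]
print(pe.shape)   # (50, 16)
```

Each position gets a unique pattern of sines and cosines, so two identical words at different positions receive different final input vectors.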

Multi-Head Attention

What

Instead of a single self-attention computation, multi-head attention runs several attention operations in parallel, each with its own Q, K, V matrices. Outputs are concatenated and projected.

Why Use Multiple Heads? Each head can attend to different kinds of relationships simultaneously — e.g., one head captures syntax, another captures coreference, another captures semantic similarity.
// Multi-Head Attention
head_i = Attention(Q·Wᵢᵠ, K·Wᵢᵏ, V·Wᵢᵛ)
MultiHead = Concat(head_1, ..., head_h) · Wᴼ
// Each head has its own learned projection matrices Wᵠ, Wᵏ, Wᵛ
// Processed in parallel → efficient on GPUs
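A minimal NumPy sketch of the formula above, with random untrained weights. The loop over heads is for clarity; real implementations fuse the heads into batched matrix multiplies:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, h):
    """Multi-head self-attention for one sequence x of shape (n, d_model).

    Wq/Wk/Wv/Wo are (d_model, d_model); each head attends over its own
    d_model // h slice of the projected Q, K, V.
    """
    n, d = x.shape
    d_k = d // h
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    heads = []
    for i in range(h):
        q = Q[:, i * d_k:(i + 1) * d_k]          # this head's query slice
        k = K[:, i * d_k:(i + 1) * d_k]
        v = V[:, i * d_k:(i + 1) * d_k]
        heads.append(softmax(q @ k.T / np.sqrt(d_k)) @ v)
    return np.concatenate(heads, axis=-1) @ Wo   # Concat(head_1..head_h) . Wo

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))                  # 5 tokens, d_model = 8
Wq, Wk, Wv, Wo = (rng.standard_normal((8, 8)) * 0.1 for _ in range(4))
out = multi_head_attention(x, Wq, Wk, Wv, Wo, h=2)
print(out.shape)   # (5, 8)
```

Each head gets its own slice of the projections, so the heads are free to learn different relationships without interfering with each other.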
🔍

Self-Attention: The Core Mechanism

Query, Key, Value — understanding how attention scores are computed

Intuition: "It was hungry" — which word does "it" refer to?

The Problem

Consider: "The cat drank the milk because it was hungry." vs "The cat drank the milk because it was sweet."
In the first sentence "it" = cat; in the second "it" = milk. Self-attention lets each word "look at" every other word to resolve this.

💡 Self-Attention Purpose: For each word in the input, compute a weighted combination of all other words' representations, where the weights represent how relevant each word is to understanding the current word.

Query, Key, Value (Q, K, V) — Step by Step

🗝️ KEY (K)

What each word offers.
"Here's what I contain."

❓ QUERY (Q)

What the current word is looking for.
"What am I interested in?"

💰 VALUE (V)

The content to be retrieved.
"Here's the actual information."

// Step-by-step self-attention calculation
// 1. Project each word embedding into Q, K, V vectors
q_i = x_i · W_Q   ← query for word i
k_j = x_j · W_K   ← key for word j
v_j = x_j · W_V   ← value for word j
// 2. Compute raw attention score (dot product)
e_ij = q_i · k_jᵀ / √d_k   ← scale to avoid large values
// 3. Softmax to get attention weights (sum to 1)
α_ij = softmax(e_ij)
// 4. Weighted sum of values = output for word i
z_i = Σⱼ α_ij · v_j
// Matrix form (all words at once):
Attention(Q, K, V) = softmax( Q·Kᵀ / √d_k ) · V
⚠️ Why divide by √d_k? With large embedding dimensions, dot products grow very large → softmax becomes extremely peaked → vanishing gradients. Dividing by √d_k stabilizes training.
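The four steps run directly in NumPy; the shapes and projection weights here are arbitrary toy values, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6))                  # 4 words, embedding dim 6
d_k = 6
W_Q, W_K, W_V = (rng.standard_normal((6, d_k)) for _ in range(3))

Q, K, V = x @ W_Q, x @ W_K, x @ W_V              # 1. project into Q, K, V
scores = Q @ K.T / np.sqrt(d_k)                  # 2. scaled dot-product scores
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)        # 3. row-wise softmax
Z = weights @ V                                  # 4. weighted sum of values

print(weights.shape)          # (4, 4): one attention row per word
print(weights.sum(axis=1))    # each row sums to 1.0
```

Row i of `weights` is exactly the α_ij distribution from step 3: how much word i attends to every word j.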

🤖

Model Families: BERT, GPT, T5

Encoder-only, decoder-only, and encoder–decoder architectures

BERT — Encoder-Only (Bidirectional)

NLU Tasks
  • Trained on BooksCorpus (~800M words) + English Wikipedia (~2.5B words)
  • Two pre-training objectives: MLM + NSP
  • Bidirectional: attends to words both before AND after
  • Great for: classification, Q&A, NER, sentiment
  • Related: RoBERTa, ALBERT, DistilBERT (60% faster, 40% lighter)

Masked Language Modeling (MLM)

Mask ~15% of tokens → train model to predict them. Example:
"The man went to the [MASK]." → "store"

Next Sentence Prediction (NSP)

Given two sentences A & B, predict: IsNext or NotNext. Trains the model to understand sentence relationships (important for Q&A, NLI).

# HuggingFace BERT example — fill-mask pipeline
from transformers import pipeline

pipe = pipeline("fill-mask", model="bert-base-uncased")
result = pipe("The man went to the [MASK]. He bought a [MASK] of milk.")
# MASK_1 → "store", MASK_2 → "gallon"

GPT — Decoder-Only (Unidirectional, Autoregressive)

NLG Tasks
  • Developed by OpenAI — Generative Pre-Training
  • Autoregressive: predicts the next word given all previous words
  • Unidirectional: only looks left (past tokens)
  • GPT-3: 175 billion parameters

Pre-training Objective

Language modeling: given "The cat sat on the", predict "mat". Repeat for all positions. No labels needed — fully self-supervised.

# HuggingFace GPT-2 example — text generation
from transformers import pipeline

pipe = pipeline("text-generation", model="gpt2")
results = pipe(
    "Help! I'm a language model",
    max_length=50,
    num_return_sequences=5,
)
# Generates 5 different continuations of the input text

T5 / BART — Encoder–Decoder (Seq2Seq)

Generative + Understanding
  • Use both encoder stack (to understand input) and decoder stack (to generate output)
  • Best for tasks needing input → output: translation, summarization, question answering with generation
  • Follow the original "Attention Is All You Need" architecture (Vaswani et al., 2017)
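Like the BERT and GPT examples above, seq2seq models plug into the same HuggingFace pipeline API — a sketch assuming the t5-small checkpoint is available locally or downloadable:

```python
# HuggingFace T5 example — summarization via the encoder–decoder stack
from transformers import pipeline

pipe = pipeline("summarization", model="t5-small")
result = pipe(
    "Transformers replace recurrence with self-attention, processing all "
    "tokens in parallel. This makes training far faster than with LSTMs "
    "and enables pre-training on massive unlabeled corpora, after which "
    "the model can be fine-tuned cheaply on specific downstream tasks.",
    max_length=25,
    min_length=5,
)
print(result[0]["summary_text"])
```

The encoder reads the whole input bidirectionally; the decoder then generates the summary token by token, attending to the encoder's output.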

Complete Architecture Comparison

Feature | RNN/LSTM | BERT (Encoder) | GPT (Decoder) | T5/BART (Seq2Seq)
Attention | Bahdanau (optional) | Bidirectional self-attn | Masked self-attn | Both
Direction | Left→Right | Bidirectional ✓ | Left→Right only | Enc: bi, Dec: L→R
Parallelization | Sequential ✗ | Full parallel ✓ | Parallel (train) ✓ | Parallel (train) ✓
Transfer Learning | Poor ✗ | Excellent ✓ | Excellent ✓ | Excellent ✓
Best For | Short sequences | NLU (classification) | NLG (generation) | Seq-to-seq tasks
Examples | LSTM, GRU | BERT, RoBERTa, ALBERT | GPT-1/2/3, Llama | T5, BART, mT5

🎯 Quiz 8 Practice — Transformers, BERT & GPT

Q1. What is the main reason transformers can be trained faster than LSTMs?

Q2. In the self-attention formula Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V, what does dividing by √d_k accomplish?

Q3. BERT is described as "bidirectional." What does this mean?

Q4. Which pre-training objective does BERT use to learn sentence-level relationships?

Q5. What is positional encoding in transformers and why is it needed?

Q6. GPT is an "autoregressive" model. What does that mean during generation?