🚀

Why Transformers?

History & motivation for moving beyond LSTMs

Problems with LSTMs that Transformers Solve

What

LSTMs had three key bottlenecks that made them hard to scale to the level needed for modern NLP.

Problem | Detail | Transformer Fix
Sequential (Slow) | Tokens are fed one at a time — h_{t-1} is needed before h_t can be computed, so training can't be parallelized | Self-attention sees all tokens at once — fully parallelizable
Large Memory | Must store all hidden states and gradients through time | Attention over fixed-length key/value stores
No Transfer Learning | Must retrain from scratch on every new labeled dataset — costly & slow | Pre-train once on a massive corpus; fine-tune cheaply on any task

Timeline: From Transformers to Modern LLMs

2017
"Attention Is All You Need" — Vaswani et al.

Original Transformer architecture introduced; designed for machine translation.

June 2018
GPT-1 (OpenAI)

First major pre-trained Transformer; Generative Pre-Training using decoder stack.

October 2018
BERT (Google)

Bidirectional Encoder Representations from Transformers; excels at understanding tasks.

2019
DistilBERT

60% faster, 40% lighter than BERT while maintaining comparable accuracy.

2020
GPT-3 (OpenAI)

175 billion parameters; autoregressive model for generating human-like text.

Why Pre-Train Instead of Training from Scratch?

🎯 The Core Idea: Train once on massive unlabeled text using self-supervised learning, then fine-tune cheaply on specific tasks.
  • Time: Training from scratch on large corpora takes extremely long — weeks to months on hundreds of GPUs.
  • Cost & Carbon: Pre-training is very expensive and has a large environmental footprint — sharing pre-trained weights means that cost is paid only once, not by every team.
  • Data efficiency: Fine-tuning a pre-trained model requires far less labeled data than training from scratch.
  • Domain knowledge: A pre-trained model already has statistical knowledge of language that transfers to the fine-tuning task.
💡 Self-Supervised Learning: Labels are automatically created from the raw text (e.g., masking words, predicting the next word) — no human labeling needed!
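As a toy sketch of how MLM-style labels fall out of raw text with no human annotation (whitespace tokenization and a literal "[MASK]" string are simplifications of what real subword tokenizers do):

```python
import random

def make_mlm_example(text, mask_rate=0.15, seed=1):
    """Create (inputs, labels) for masked-language-model training.

    Toy sketch: whitespace tokenization and a plain "[MASK]" string are
    simplifications -- real models use subword tokenizers -- but the
    labels-from-raw-text idea is the same.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in text.split():
        if rng.random() < mask_rate:
            inputs.append("[MASK]")
            labels.append(tok)      # the model must recover the original token
        else:
            inputs.append(tok)
            labels.append(None)     # position not scored in the loss
    return inputs, labels

inputs, labels = make_mlm_example("the cat drank the milk because it was hungry")
print(inputs)   # some tokens replaced by [MASK]
print(labels)   # the masked-out originals become the training labels
```

The labels come for free from the text itself — this is what makes pre-training on unlabeled corpora possible.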

NLP Tasks Transformers Excel At

🌐 Machine Translation (MT)

Translate text from one natural language to another (e.g., English → French).

💬 Sentiment Analysis

Identify the emotion or opinion in a sentence (positive/negative/neutral).

📝 Text Summarization

Condense a long document into a shorter, meaningful version.

🔲 Fill-in-the-Blank (MLM)

"This course will teach you all about ___ models." → Transformers

🏷️ Named Entity Recognition (NER)

Detect entities in text: people, places, dates, organizations.

🔤 Part-of-Speech (POS) Tagging

Assign grammatical labels: noun, verb, adjective, etc. to each word.

🏗️

Transformer Architecture

Encoder, decoder, and the three model families

Encoder vs. Decoder Overview

Input Tokens → Token Embedding + Positional Encoding → Encoder Block(s) → Context Vectors

Context Vectors: the final encoder layer produces one context vector per input token. The decoder attends to all of these via cross-attention while generating output — unlike RNN seq2seq, there is no single fixed-length bottleneck vector.

🔵 Encoder

  • Reads the entire input sequence at once (bidirectional)
  • Produces a rich representation for each token
  • Best for understanding tasks (NLU)
  • Models: BERT, RoBERTa, ALBERT

🟠 Decoder

  • Generates tokens one at a time (left-to-right only)
  • Attends only to previously generated tokens (masked)
  • Best for generation tasks (NLG)
  • Models: GPT, GPT-2, GPT-3

🔵+🟠 Encoder–Decoder (Seq2Seq)

Uses both components for generative tasks that require input (e.g., translation, summarization). Models: BART, T5.

Inside One Encoder Block (4 Modules)

Input (Embedding + Pos) → Self-Attention Layer → Add & Norm (skip connection) → Feed-Forward Network → Add & Norm (skip connection) → Output (to next block)

① Self-Attention

Computes relationships between all words in the sequence simultaneously using Q, K, V matrices.

② Skip (Residual) Connection

Adds the input directly to the attention output: output = LayerNorm(x + Attention(x)). Prevents vanishing gradients.

③ Layer Normalization

Normalizes the distribution of intermediate layers → smoother gradients, faster training, better generalization.

④ Feed-Forward Network (FFN)

Two linear layers with ReLU in between. Transforms attention vectors for the next block. Also has a skip connection + LayerNorm after it.
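The four modules above can be sketched in a few lines of NumPy. This is a toy single-head block (no multi-head split, no output projection, random untrained weights), not a production implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token's vector to zero mean / unit variance
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def encoder_block(x, Wq, Wk, Wv, W1, b1, W2, b2):
    d_k = Wq.shape[1]
    # 1) self-attention over all tokens at once
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k)) @ V
    # 2)+3) skip (residual) connection, then layer normalization
    x = layer_norm(x + attn)
    # 4) feed-forward network: two linear layers with ReLU in between
    ffn = np.maximum(0, x @ W1 + b1) @ W2 + b2
    # second skip connection + LayerNorm
    return layer_norm(x + ffn)

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((5, d))                       # 5 tokens, d_model = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
W1, b1 = rng.standard_normal((d, 4 * d)) * 0.1, np.zeros(4 * d)
W2, b2 = rng.standard_normal((4 * d, d)) * 0.1, np.zeros(d)
out = encoder_block(x, Wq, Wk, Wv, W1, b1, W2, b2)
print(out.shape)   # (5, 8) -- same shape as the input, ready for the next block
```

Because input and output shapes match, blocks can be stacked — which is exactly how the full encoder is built.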

Positional Encoding

Why

Transformers have no built-in sense of word order (unlike RNNs/CNNs). Positional encoding injects order information into the embeddings so the model knows where each word appears.

// Positional Encoding formula
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))   ← even indices
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))   ← odd indices
// pos = position in sequence, i = dimension index
// d_model = embedding dimension
// Final input = Token Embedding + Positional Encoding
💡 Key Point: Sine for even indices, cosine for odd indices. The resulting vectors are added to (not concatenated with) the word embeddings.
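The formula translates directly into NumPy (this sketch assumes an even d_model so sin/cos dimensions pair up):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # assumes even d_model; pairs dimension 2i (sin) with 2i+1 (cos)
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model // 2)
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even indices
    pe[:, 1::2] = np.cos(angles)                 # odd indices
    return pe

pe = positional_encoding(max_len=50, d_model=16)
# the encoding is ADDED to the token embeddings, not concatenated:
# inputs = token_embeddings + pe[:seq_len]
print(pe.shape)   # (50, 16)
```

Each position gets a unique pattern of sines and cosines, so two identical words at different positions receive different final input vectors.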

Multi-Head Attention

What

Instead of a single self-attention computation, multi-head attention runs several attention operations in parallel, each with its own Q, K, V matrices. Outputs are concatenated and projected.

Why Use Multiple Heads? Each head can attend to different kinds of relationships simultaneously — e.g., one head captures syntax, another captures coreference, another captures semantic similarity.
// Multi-Head Attention
head_i = Attention(Q·Wᵢᵠ, K·Wᵢᵏ, V·Wᵢᵛ)
MultiHead = Concat(head_1, ..., head_h) · Wᴼ
// Each head has its own learned projection matrices Wᵠ, Wᵏ, Wᵛ
// Processed in parallel → efficient on GPUs
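A minimal NumPy sketch of the formula above, with random untrained weights. The loop over heads is for clarity; real implementations fuse the heads into batched matrix multiplies:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, h):
    """Multi-head self-attention for one sequence x of shape (n, d_model).

    Wq/Wk/Wv/Wo are (d_model, d_model); each head attends over its own
    d_model // h slice of the projected Q, K, V.
    """
    n, d = x.shape
    d_k = d // h
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    heads = []
    for i in range(h):
        q = Q[:, i * d_k:(i + 1) * d_k]          # this head's query slice
        k = K[:, i * d_k:(i + 1) * d_k]
        v = V[:, i * d_k:(i + 1) * d_k]
        heads.append(softmax(q @ k.T / np.sqrt(d_k)) @ v)
    return np.concatenate(heads, axis=-1) @ Wo   # Concat(head_1..head_h) . Wo

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))                  # 5 tokens, d_model = 8
Wq, Wk, Wv, Wo = (rng.standard_normal((8, 8)) * 0.1 for _ in range(4))
out = multi_head_attention(x, Wq, Wk, Wv, Wo, h=2)
print(out.shape)   # (5, 8)
```

Each head gets its own slice of the projections, so the heads are free to learn different relationships without interfering with each other.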
🔍

Self-Attention: The Core Mechanism

Query, Key, Value — understanding how attention scores are computed

Intuition: "It was hungry" — which word does "it" refer to?

The Problem

Consider: "The cat drank the milk because it was hungry." vs "The cat drank the milk because it was sweet."
In the first sentence "it" = cat; in the second "it" = milk. Self-attention lets each word "look at" every other word to resolve this.

💡 Self-Attention Purpose: For each word in the input, compute a weighted combination of all other words' representations, where the weights represent how relevant each word is to understanding the current word.

Query, Key, Value (Q, K, V) — Step by Step

🗝️ KEY (K)

What each word offers.
"Here's what I contain."

❓ QUERY (Q)

What the current word is looking for.
"What am I interested in?"

💰 VALUE (V)

The content to be retrieved.
"Here's the actual information."

// Step-by-step self-attention calculation
// 1. Project each word embedding into Q, K, V vectors
q_i = x_i · W_Q   ← query for word i
k_j = x_j · W_K   ← key for word j
v_j = x_j · W_V   ← value for word j
// 2. Compute raw attention score (dot product)
e_ij = q_i · k_jᵀ / √d_k   ← scale to avoid large values
// 3. Softmax to get attention weights (sum to 1)
α_ij = softmax(e_ij)
// 4. Weighted sum of values = output for word i
z_i = Σⱼ α_ij · v_j
// Matrix form (all words at once):
Attention(Q, K, V) = softmax( Q·Kᵀ / √d_k ) · V
⚠️ Why divide by √d_k? With large embedding dimensions, dot products grow very large → softmax becomes extremely peaked → vanishing gradients. Dividing by √d_k stabilizes training.
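The four steps run directly in NumPy; the shapes and projection weights here are arbitrary toy values, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6))                  # 4 words, embedding dim 6
d_k = 6
W_Q, W_K, W_V = (rng.standard_normal((6, d_k)) for _ in range(3))

Q, K, V = x @ W_Q, x @ W_K, x @ W_V              # 1. project into Q, K, V
scores = Q @ K.T / np.sqrt(d_k)                  # 2. scaled dot-product scores
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)        # 3. row-wise softmax
Z = weights @ V                                  # 4. weighted sum of values

print(weights.shape)          # (4, 4): one attention row per word
print(weights.sum(axis=1))    # each row sums to 1.0
```

Row i of `weights` is exactly the α_ij distribution from step 3: how much word i attends to every word j.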

🤖

Model Families: BERT, GPT, T5

Encoder-only, decoder-only, and encoder–decoder architectures

BERT — Encoder-Only (Bidirectional)

NLU Tasks
  • Trained on BooksCorpus (~800M words) + English Wikipedia (~2.5B words)
  • Two pre-training objectives: MLM + NSP
  • Bidirectional: attends to words both before AND after
  • Great for: classification, Q&A, NER, sentiment
  • Related: RoBERTa, ALBERT, DistilBERT (60% faster, 40% lighter)

Masked Language Modeling (MLM)

Mask ~15% of tokens → train model to predict them. Example:
"The man went to the [MASK]." → "store"

Next Sentence Prediction (NSP)

Given two sentences A & B, predict: IsNext or NotNext. Trains the model to understand sentence relationships (important for Q&A, NLI).

# HuggingFace BERT example — fill-mask pipeline
from transformers import pipeline

pipe = pipeline("fill-mask", model="bert-base-uncased")
result = pipe("The man went to the [MASK]. He bought a [MASK] of milk.")
# MASK_1 → "store", MASK_2 → "gallon"

GPT — Decoder-Only (Unidirectional, Autoregressive)

NLG Tasks
  • Developed by OpenAI — Generative Pre-Training
  • Autoregressive: predicts the next word given all previous words
  • Unidirectional: only looks left (past tokens)
  • GPT-3: 175 billion parameters

Pre-training Objective

Language modeling: given "The cat sat on the", predict "mat". Repeat for all positions. No labels needed — fully self-supervised.

# HuggingFace GPT-2 example — text generation
from transformers import pipeline

pipe = pipeline("text-generation", model="gpt2")
results = pipe(
    "Help! I'm a language model",
    max_length=50,
    num_return_sequences=5,
)
# Generates 5 different continuations of the input text

T5 / BART — Encoder–Decoder (Seq2Seq)

Generative + Understanding
  • Use both encoder stack (to understand input) and decoder stack (to generate output)
  • Best for tasks needing input → output: translation, summarization, question answering with generation
  • Follow the original "Attention Is All You Need" architecture (Vaswani et al., 2017)
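Like the BERT and GPT examples above, seq2seq models plug into the same HuggingFace pipeline API — a sketch assuming the t5-small checkpoint is available locally or downloadable:

```python
# HuggingFace T5 example — summarization via the encoder–decoder stack
from transformers import pipeline

pipe = pipeline("summarization", model="t5-small")
result = pipe(
    "Transformers replace recurrence with self-attention, processing all "
    "tokens in parallel. This makes training far faster than with LSTMs "
    "and enables pre-training on massive unlabeled corpora, after which "
    "the model can be fine-tuned cheaply on specific downstream tasks.",
    max_length=25,
    min_length=5,
)
print(result[0]["summary_text"])
```

The encoder reads the whole input bidirectionally; the decoder then generates the summary token by token, attending to the encoder's output.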

Complete Architecture Comparison

Feature | RNN/LSTM | BERT (Encoder) | GPT (Decoder) | T5/BART (Seq2Seq)
Attention | Bahdanau (optional) | Bidirectional self-attn | Masked self-attn | Both
Direction | Left→Right | Bidirectional ✓ | Left→Right only | Enc: bi, Dec: L→R
Parallelization | Sequential ✗ | Full parallel ✓ | Parallel (train) ✓ | Parallel (train) ✓
Transfer Learning | Poor ✗ | Excellent ✓ | Excellent ✓ | Excellent ✓
Best For | Short sequences | NLU (classification) | NLG (generation) | Seq-to-seq tasks
Examples | LSTM, GRU | BERT, RoBERTa, ALBERT | GPT-1/2/3, Llama | T5, BART, mT5

🎯 Quiz 8 Practice — Transformers, BERT & GPT

Q1. What is the main reason transformers can be trained faster than LSTMs?

Q2. In the self-attention formula Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V, what does dividing by √d_k accomplish?

Q3. BERT is described as "bidirectional." What does this mean?

Q4. Which pre-training objective does BERT use to learn sentence-level relationships?

Q5. What is positional encoding in transformers and why is it needed?

Q6. GPT is an "autoregressive" model. What does that mean during generation?