Week 7 – LSTM, GRU & Attention

🧠

Why LSTM? The RNN Problem

From short-term memory to long-term memory

😤 RNN's Fatal Flaw: Short-Term Memory

What's the problem

RNNs pass information forward through a hidden state h_t. Over long sequences, the gradient shrinks exponentially during backpropagation — so the model effectively forgets words from the beginning of a sentence.

🔴 Classic Example (from lecture): "I was born in Japan and lived there for 10 years and then moved to the US. My native language is ___."

The word "Japan" is far from the blank. An RNN's gradient for that early word shrinks to near-zero by the time backprop reaches it — so the model ignores it and fails to predict "Japanese."

🔴 RNN Behavior

Hidden state h_t carries memory
h_t = tanh(x_t·Θ¹ + h_{t-1}·Θ³)
Gradient multiplied at EVERY step
tanh derivative ≤ 1 (times the recurrent weight) → product → 0
Result: only remembers last ~5–10 tokens
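The shrinking product can be checked numerically. This is a minimal sketch, assuming a scalar RNN with hypothetical weights (w_x = 1.0, w_h = 0.9) and a constant input; none of these values come from the lecture. By the chain rule, ∂h_T/∂h_0 is the product of the per-step factors w_h · (1 − h_t²).

```python
import math

def rnn_gradient_to_first_step(num_steps, w_x=1.0, w_h=0.9, x=0.5):
    """Return d h_T / d h_0 for T = 1..num_steps in a scalar tanh RNN.

    The weights and the constant input are hypothetical values chosen
    only to illustrate the effect.
    """
    h = 0.0
    grad = 1.0
    grads = []
    for _ in range(num_steps):
        h = math.tanh(w_x * x + w_h * h)
        # Chain rule: d h_t / d h_{t-1} = w_h * tanh'(.) = w_h * (1 - h_t^2)
        grad *= w_h * (1.0 - h * h)
        grads.append(grad)
    return grads

grads = rnn_gradient_to_first_step(50)
for t in (1, 10, 50):
    print(f"after {t:2d} steps: {grads[t - 1]:.3e}")
```

Every factor is below 0.9 here, so the gradient reaching the first token is bounded by 0.9^T and falls off exponentially with sequence length.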

🟢 LSTM Solution

Adds a Cell State C_t (long-term memory)
Gates selectively keep or discard info
Gradient path through C_t avoids repeated multiplication
Result: can remember 100s of tokens

💡 Key Analogy: Think of the RNN hidden state like RAM (fast but small). The LSTM cell state is like a hard drive — slower to update, but can hold far more information over time.

🔒

Part 1: The Four LSTM Gates

How LSTM controls what to remember and forget

Forget Gate

f_t

What old info to erase from cell state?

Input Gate

i_t

What new info to write to cell state?

Cell Update

C_t

The long-term memory updated by forget + input

Output Gate

o_t → h_t

What to expose from cell state as hidden state?

🟠 Forget Gate — "What do I erase?"

What

The forget gate looks at the previous hidden state h_{t-1} and current input x_t, and outputs a number between 0 and 1 for each element of the cell state. 0 = erase completely. 1 = keep completely.

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
# [h_{t-1}, x_t] = concatenation of previous hidden state and current input
# W_f, b_f = learned weight matrix and bias for forget gate
# σ = sigmoid → output in (0, 1) for every cell state element
# f_t ≈ 0 → forget this info from cell state
# f_t ≈ 1 → keep this info in cell state
# Example: When we see "." (end of sentence), the forget gate learns
# to clear the subject so the next sentence starts fresh.
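A small numeric illustration of the forget gate as a soft mask. The pre-activation values below are hypothetical stand-ins for W_f · [h_{t-1}, x_t] + b_f, and the previous cell state is made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical pre-activations W_f · [h_{t-1}, x_t] + b_f for three cell-state slots.
pre_activations = [-4.0, 0.0, 4.0]
c_prev = [2.0, 2.0, 2.0]  # hypothetical previous cell state C_{t-1}

f_t = [sigmoid(z) for z in pre_activations]
kept = [f * c for f, c in zip(f_t, c_prev)]  # f_t ⊙ C_{t-1}

for z, f, k in zip(pre_activations, f_t, kept):
    print(f"pre-activation {z:+.1f} -> f_t = {f:.3f}, kept = {k:.3f}")
```

A strongly negative pre-activation nearly erases the slot, a strongly positive one keeps it almost intact, and zero keeps exactly half.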

⚠️ Lecture Quote: "Forget gate will help us determine what information we are going to remove or throw away from the cell state."

🟢 Input Gate + Candidate — "What do I write?"

What

Two computations work together: the input gate decides which positions to update (0–1 importance scores), while the candidate vector creates the actual new values to potentially write (scaled -1 to 1).

# Step 1 (input gate): which positions are important to update?
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
# 0 = not important, 1 = important to update

# Step 2 (candidate cell state): what new values could we write?
C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
# tanh → values in (-1, +1) to regulate the network

# Step 3 (combine): only write where the input gate says it's important
# (used in the cell state update below)
i_t ⊙ C̃_t ← point-wise multiplication

💡 Analogy: i_t is the highlighter — it marks which information is worth remembering. C̃_t is the raw text — the content itself. Together they decide what actually gets stored.

🩷 Cell State Update — "Updating long-term memory"

How

The new cell state combines: (1) the old cell state filtered by the forget gate, and (2) the new candidate values filtered by the input gate. This is the core of LSTM's long-term memory.

C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
# f_t ⊙ C_{t-1} → how much of the OLD memory to keep
# i_t ⊙ C̃_t → how much NEW information to add
# ⊙ → element-wise (Hadamard) multiplication
# Why does this help with vanishing gradients?
# The update is additive, so along the direct path ∂C_t/∂C_{t-1} = f_t:
# the gradient is scaled only by the forget gate, with no weight matrix
# and no tanh derivative in the way.
# When f_t stays close to 1, the gradient passes through almost unchanged,
# so long-range learning is possible.

🔑 The Key Insight (Why LSTM Works): In a plain RNN, gradients flow through repeated multiplications by the recurrent weights and tanh derivatives (each ≤ 1), so they tend to vanish. In an LSTM, gradients can also flow along the cell state, where the update is additive and the only per-step factor is the forget gate f_t. As long as f_t stays near 1, gradient magnitude is largely preserved, which enables learning from hundreds of time steps back.
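To make the contrast concrete, here is a small sketch that holds the per-step factors fixed (in a real network they change with the input at every step). Along the direct cell-state path, ∂C_t/∂C_{t-1} = f_t, so the gradient after T steps is the product of the forget-gate values. The forget-gate value 0.99 and the plain-RNN factor 0.5 are illustrative assumptions, not measurements.

```python
def path_gradient(per_step_factor, num_steps):
    """Gradient magnitude after multiplying the same factor num_steps times."""
    return per_step_factor ** num_steps

steps = 100
lstm_grad = path_gradient(0.99, steps)  # forget gate held near 1 (assumed)
rnn_grad = path_gradient(0.5, steps)    # assumed w_h * tanh'(.) factor

print(f"LSTM cell-state path after {steps} steps: {lstm_grad:.3f}")
print(f"Plain RNN path after {steps} steps:       {rnn_grad:.3e}")
```

The LSTM path still carries a usable fraction of the gradient after 100 steps, while the plain RNN path is numerically negligible.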

🔵 Output Gate — "What do I expose as hidden state?"

What

The output gate decides which parts of the cell state to actually output as the hidden state h_t. Not everything in long-term memory is relevant at every step.

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
# Decides which parts of cell state to expose

h_t = o_t ⊙ tanh(C_t)
# tanh(C_t) squashes cell state to (-1, +1)
# o_t masks which dimensions to include in h_t
# h_t is used for predictions and passed to the next time step

✅ Summary: LSTM's two outputs per step
C_t = updated long-term memory (carried forward mostly intact)
h_t = short-term working memory / current output (filtered view of C_t)
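Putting the four equations together, one full LSTM step can be sketched as follows. This is a scalar toy version (hidden size 1) so the numbers are easy to follow; the `GateWeights` container, the weight values and the input values are all hypothetical.

```python
import math
from typing import NamedTuple

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

class GateWeights(NamedTuple):
    w_h: float  # weight on h_{t-1}
    w_x: float  # weight on x_t
    b: float    # bias

def lstm_step(x_t, h_prev, c_prev, params):
    """One scalar LSTM step; params maps 'f', 'i', 'c', 'o' to GateWeights."""
    def pre(name):
        w = params[name]
        return w.w_h * h_prev + w.w_x * x_t + w.b

    f_t = sigmoid(pre("f"))             # forget gate
    i_t = sigmoid(pre("i"))             # input gate
    c_tilde = math.tanh(pre("c"))       # candidate values
    c_t = f_t * c_prev + i_t * c_tilde  # cell state update
    o_t = sigmoid(pre("o"))             # output gate
    h_t = o_t * math.tanh(c_t)          # exposed hidden state
    return h_t, c_t, {"f": f_t, "i": i_t, "c_tilde": c_tilde, "o": o_t}

# Hypothetical weights, identical for every gate, just to get numbers out.
params = {name: GateWeights(w_h=0.5, w_x=1.0, b=0.0) for name in "fico"}

h_t, c_t, gates = lstm_step(x_t=0.5, h_prev=0.3, c_prev=0.6, params=params)
print(gates)
print(f"C_t = {c_t:.4f}, h_t = {h_t:.4f}")
```

In a real layer each weight is a matrix and each gate has its own learned values, but the order of operations is the same.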

🎮 LSTM Gate Explorer

[Interactive widget omitted: sliders for the input signal x_t (default 0.5), previous hidden state h_{t-1} (default 0.3) and previous cell state C_{t-1} (default 0.6) display the resulting f_t, i_t, C̃_t, o_t, C_t and h_t.]

⚡

Part 2: GRU — Gated Recurrent Unit

A simpler, faster alternative to LSTM

⚡ GRU: LSTM's Streamlined Sibling

What

GRU is a streamlined variant of LSTM that uses only two gates (update and reset) instead of LSTM's four gated operations, and merges the cell state and hidden state into one. Fewer parameters → faster training, often with comparable performance.

# GRU has 2 gates (vs LSTM's 4 operations):

Update Gate (z_t): plays the role of LSTM's forget + input gates combined
z_t = σ(W_z · [h_{t-1}, x_t])
# Decides how much of the past to keep vs. how much new input to use
# With the update rule for h_t below:
# z_t ≈ 0 → keep past (like the forget gate saying "remember")
# z_t ≈ 1 → use new info (like the input gate saying "update")
# (Some references and libraries swap the roles of z_t and 1 - z_t.)

Reset Gate (r_t): decides how much past to forget when computing the new candidate
r_t = σ(W_r · [h_{t-1}, x_t])
# Controls how much the past hidden state influences the new candidate

New hidden state:
h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t])
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
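A scalar sketch of one GRU step may help in reading the equations. It follows the final update rule h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t literally, so z_t near 1 moves h_t toward the candidate. The weight pairs are hypothetical, and biases are omitted as in the equations above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x_t, h_prev, w_z, w_r, w_h):
    """One scalar GRU step. Each w_* is a (weight_on_h, weight_on_x) pair."""
    z_t = sigmoid(w_z[0] * h_prev + w_z[1] * x_t)                # update gate
    r_t = sigmoid(w_r[0] * h_prev + w_r[1] * x_t)                # reset gate
    h_tilde = math.tanh(w_h[0] * (r_t * h_prev) + w_h[1] * x_t)  # candidate
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde                   # interpolation
    return h_t, z_t, r_t, h_tilde

# Hypothetical weights chosen only for illustration.
h_t, z_t, r_t, h_tilde = gru_step(
    x_t=0.5, h_prev=0.3, w_z=(0.5, 1.0), w_r=(0.5, 1.0), w_h=(0.5, 1.0)
)
print(f"z_t={z_t:.3f} r_t={r_t:.3f} h_tilde={h_tilde:.3f} h_t={h_t:.3f}")
```

Because h_t is a convex combination, it always lies between the old state h_{t-1} and the candidate h̃_t.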

✅ GRU Advantages

Fewer tensor operations → faster to train
Fewer parameters → less risk of overfitting on small data
Single hidden state (simpler)
Often matches LSTM performance in practice

🔵 When to Choose LSTM vs GRU

LSTM: very long sequences, large datasets, complex tasks
GRU: faster prototyping, smaller datasets, real-time systems
In practice: try both and validate

Feature	LSTM	GRU
Gates	3 + candidate (forget, input, output; 4 weight blocks)	2 + candidate (update, reset; 3 weight blocks)
Memory vectors	2 (C_t and h_t)	1 (h_t only)
Parameters	More	Fewer (~25% fewer)
Training speed	Slower	Faster
Long sequences	Excellent	Good
Interpretability	Cell state separable	Simpler
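The "~25% fewer" figure follows from counting weight blocks: LSTM learns four W · [h_{t-1}, x_t] + b blocks, GRU three. A quick check, assuming one weight matrix and one bias per block and hypothetical layer sizes:

```python
def gate_block_params(input_size, hidden_size):
    """Parameters in one W · [h_{t-1}, x_t] + b block."""
    return hidden_size * (hidden_size + input_size) + hidden_size

def lstm_params(input_size, hidden_size):
    return 4 * gate_block_params(input_size, hidden_size)  # f, i, candidate, o

def gru_params(input_size, hidden_size):
    return 3 * gate_block_params(input_size, hidden_size)  # z, r, candidate

input_size, hidden_size = 128, 256  # hypothetical layer sizes
n_lstm = lstm_params(input_size, hidden_size)
n_gru = gru_params(input_size, hidden_size)
print(f"LSTM: {n_lstm:,}  GRU: {n_gru:,}  saving: {1 - n_gru / n_lstm:.0%}")
```

Some library implementations store two bias vectors per block, which changes the totals but not the 4:3 ratio.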

🎯

Part 3: Encoder-Decoder + Attention

Teaching the model to focus on what matters

🔁 Encoder-Decoder Architecture

What

Used for sequence-to-sequence tasks like machine translation. The encoder reads the input and produces a context vector C. The decoder reads C and generates the output one token at a time.

Why

Many NLP tasks require variable-length input AND variable-length output (translation, summarization). A single RNN can't handle this — we need a two-stage architecture.

Example: "How are you?" → "Jak się masz?" (Polish)

ENCODER: [How: h₁] → [are: h₂] → [you: h₃] → C
DECODER: C → [s₁: Jak] → [s₂: się] → [s₃: masz]

🔴 The Big Problem: The entire input "How are you?" must be compressed into a SINGLE context vector C (the final encoder hidden state). For long sentences, this bottleneck loses information — the decoder can't recover what was compressed away.

🎯 Attention Mechanism — "Look at everything, focus on what matters"

What

Instead of using only the final encoder state, attention gives the decoder access to ALL encoder hidden states at every decoding step. It learns a weighted combination — paying more attention to relevant words.

Why it works

When generating output word 3, the decoder may not need input word 1 at all; it needs whichever encoder states are most relevant to that word. Attention learns these relevance scores automatically.

# Step 1: Compute raw alignment score for each encoder state h_j
# using a small feedforward neural network:
e_{t,j} = FFN(h_j, s_{t-1}) = tanh( [h_j ; s_{t-1}] · W_a + b_a ) · v_a
# h_j = encoder hidden state at position j
# s_{t-1} = previous decoder hidden state
# FFN = single hidden layer + linear output (tanh activation)

# Step 2: Normalize raw scores into attention weights via softmax
α_{t,j} = softmax( e_{t,j} ) over all j
# α values sum to 1 across all encoder positions
# High α_{t,j} → decoder should pay a lot of attention to h_j now

# Step 3: Compute context vector as weighted sum of encoder states
C_t = Σ_j α_{t,j} · h_j
# Each decoder step gets its OWN context vector C_t
# (vs. simple encoder-decoder: single C for all decoder steps)

# Step 4: Use C_t to compute decoder output normally
s_t = LSTM(s_{t-1}, [y_{t-1}; C_t])
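Steps 1 to 3 can be sketched in plain Python for a single decoder step. The function name `additive_attention` and all toy values (encoder states, decoder state, W_a, b_a, v_a) are hypothetical; the shapes follow the formulas above, with [h_j ; s_{t-1}] as a row vector multiplied by W_a.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(encoder_states, s_prev, W_a, b_a, v_a):
    """Return (attention weights, context vector) for one decoder step.

    encoder_states: list of vectors h_j
    s_prev:         previous decoder state s_{t-1}
    W_a:            len(h_j) + len(s_prev) rows, one column per hidden unit
    b_a, v_a:       one entry per hidden unit
    """
    scores = []
    for h_j in encoder_states:
        concat = h_j + s_prev  # list concatenation = [h_j ; s_{t-1}]
        hidden = [
            math.tanh(sum(c * W_a[r][k] for r, c in enumerate(concat)) + b_a[k])
            for k in range(len(b_a))
        ]
        scores.append(sum(u * v for u, v in zip(hidden, v_a)))  # e_{t,j}
    alphas = softmax(scores)  # α_{t,j}
    dim = len(encoder_states[0])
    context = [
        sum(a * h_j[d] for a, h_j in zip(alphas, encoder_states))
        for d in range(dim)
    ]  # C_t = Σ_j α_{t,j} · h_j
    return alphas, context

# Toy values (hypothetical): 3 encoder states of size 2, decoder state of size 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_prev = [0.5, -0.5]
W_a = [[0.2, -0.1], [0.4, 0.3], [-0.3, 0.1], [0.1, 0.2]]  # 4 rows x 2 hidden units
b_a = [0.0, 0.0]
v_a = [1.0, -1.0]

alphas, context = additive_attention(encoder_states, s_prev, W_a, b_a, v_a)
print("attention weights:", [round(a, 3) for a in alphas])
print("context vector:   ", [round(c, 3) for c in context])
```

The function would be called once per decoder step with the current s_{t-1}, which is why every step gets its own context vector.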

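💡 This type is called Additive (Bahdanau) Attention. It was introduced for neural machine translation by Bahdanau et al. (ICLR 2015) and is generally regarded as the first attention mechanism in NLP. The "additive" refers to the score function: projections of h_j and s_{t-1} are summed (equivalent to concatenating them before one linear layer) and passed through tanh. Later, transformers use scaled dot-product attention (faster) and self-attention, covered in Week 8!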

🎮 Attention Weight Visualizer

[Interactive widget omitted: for each generated word of "Le chat s'assit sur le tapis", it shades the encoder words of "The cat sat on mat" by their attention weight α (darker = more attention).]

🗺️

Complete Picture: RNN → LSTM → GRU → Attention

How each step solves the previous one's limitations

📊 Full Architecture Comparison

Property	RNN	LSTM	GRU	LSTM + Attention
Long-range memory	❌ Vanishes	✅ Cell state	✅ Update gate	✅ All encoder states
Mechanism	h_t only	C_t + h_t + 4 gates	h_t + 2 gates	LSTM + weighted context C_t
Parameters	Few	Many	Moderate	Many + FFN for α
Handles very long seqs?	No	Yes	Yes	Best
Typical NLP tasks	Short text class.	NER, POS, LM	Same as LSTM	Translation, summarization
Context vector	h_t (single)	h_t (single)	h_t (single)	C_t per step (rich)
Superseded by	LSTM/GRU	Transformers (partial)	Transformers (partial)	Self-Attention / Transformers

🔑 The Progression of Ideas: RNN → add gates to control memory (LSTM) → simplify the gates (GRU) → add a mechanism to look back at all inputs dynamically (Attention) → replace recurrence entirely with attention (Transformers, Week 8).

🧪 Quiz 6 Practice — LSTM & Attention

Q1. What is the primary reason LSTM outperforms RNN on long sequences?

Q2. In an LSTM, the forget gate outputs values close to 0. What does this mean?

Q3. Which of the following correctly describes the LSTM cell state update equation?

Q4. How does GRU differ from LSTM?

Q5. In an attention-based encoder-decoder, what does the context vector C_t represent?

Q6. True or False: In a simple encoder-decoder (no attention), the decoder has access to all encoder hidden states at each decoding step.

Efficient Transformer Architectures: What Powers Every Frontier LLM in 2026

🔄 RoPE: Rotary Position Embeddings Used in Every Modern LLM

Absolute position embeddings (learned or sinusoidal) struggle with sequences longer than training length. RoPE (Su et al., 2021) encodes position via rotation in the complex plane — allowing relative position to naturally emerge in the attention dot product.

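Query rotation at position m: q_m = q · e^{i·m·θ}
Key rotation at position n: k_n = k · e^{i·n·θ}
Attention score: Re[q_m · conj(k_n)] = Re[q · conj(k) · e^{i·(m-n)·θ}]
↑ depends only on (m-n)!
(Each pair of query/key dimensions is treated as one complex number with its own frequency θ.)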

The inner product therefore encodes the relative distance m−n, not absolute positions. With many frequencies θ combined, scores also tend to decay as |m−n| grows (more rotation mismatch), although the decay is not strictly monotonic.
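The relative-position property is easy to verify with Python's built-in complex numbers. This sketch uses a single 2-D query/key pair stored as one complex number each, with the real-valued score taken as Re[q_m · conj(k_n)]; the vectors and the frequency θ are hypothetical.

```python
import cmath

def rotate(vec, position, theta):
    """Rotate a complex number by position * theta (one RoPE frequency)."""
    return vec * cmath.exp(1j * position * theta)

def score(q, k):
    """Real inner product of two 2-D vectors stored as complex numbers."""
    return (q * k.conjugate()).real

q = complex(1.0, 2.0)   # hypothetical 2-D query
k = complex(0.5, -1.0)  # hypothetical 2-D key
theta = 0.1             # hypothetical rotation frequency

# Same offset m - n = 3 at two different absolute positions.
near = score(rotate(q, 5, theta), rotate(k, 2, theta))
far = score(rotate(q, 105, theta), rotate(k, 102, theta))
# A different offset (m - n = 4) gives a different score.
other = score(rotate(q, 6, theta), rotate(k, 2, theta))
print(near, far, other)
```

The first two scores agree up to floating-point error because only m − n enters the result; a real implementation applies the same rotation to every dimension pair, each with its own θ.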

Why RoPE Won

No extra parameters (unlike learned absolute PE)
Extrapolates to longer sequences better than sinusoidal
Can be extended with NTK-aware scaling or YaRN
LLaMA-3.1 supports 128K context via RoPE scaling + continued training
Used by: LLaMA, Mistral, Gemma, Qwen, Falcon, GPT-NeoX

👁️ Grouped Query Attention (GQA) — KV-Cache Economics

📖 The Problem

During autoregressive generation, the model must store Key and Value matrices for every token generated so far — this is the KV-cache. For multi-head attention with H=32 heads and sequence length N, the KV-cache grows as O(N·H·d_head). At 100K tokens, this can be gigabytes per layer, per request. GQA reduces this dramatically.

MHA

H queries
H keys
H values
Full KV cache

→

GQA

H queries
G key groups
G value groups
G << H

→

MQA

H queries
1 key
1 value
Minimal cache

Mistral 7B uses GQA with H=32 query heads and G=8 KV heads, reducing the KV cache by 4× vs. MHA with minimal quality loss. LLaMA-2 70B and LLaMA-3 70B share 8 KV heads among 64 query heads (8× smaller). A smaller cache is one of the things that makes much longer contexts affordable than in LLaMA-1.
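The cache arithmetic is simple enough to compute directly. The helper below is a sketch; the model shape (32 layers, d_head = 128, FP16 values, 100K tokens) is a hypothetical configuration, not the spec of any particular model.

```python
def kv_cache_bytes(seq_len, num_kv_heads, head_dim, num_layers, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Leading 2 = one tensor for K and one for V.
    return 2 * seq_len * num_kv_heads * head_dim * num_layers * bytes_per_value

# Hypothetical model shape: 32 layers, head_dim 128, FP16 values, 100K tokens.
shape = dict(seq_len=100_000, head_dim=128, num_layers=32)
for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(num_kv_heads=kv_heads, **shape) / 2**30
    print(f"{name}: {kv_heads:2d} KV heads -> {gib:6.2f} GiB")
```

The cache scales linearly with the number of KV heads, so going from 32 to 8 KV heads cuts it by exactly 4×.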

⚡ Flash Attention — Making O(N²) Practical Systems

Standard attention computes the full N×N attention matrix in GPU HBM (high-bandwidth memory), then applies softmax, then multiplies by V. For N=32K, a single head's attention matrix has about 10⁹ entries: ~4 GB in FP32 (~2 GB in FP16). Flash Attention (Dao et al., 2022) avoids materializing it.

The Tiling Insight

# Standard: materialize full N×N matrix
S = Q·K^T / √d    # [N,N] in HBM 😰
P = softmax(S)    # [N,N] in HBM
O = P·V           # [N,d] in HBM

# Flash: tile Q,K,V into SRAM blocks
for Q_block in tiles(Q):
    for K_block in tiles(K):
        compute partial S in SRAM     # fast!
        accumulate softmax online     # no HBM!
        update O incrementally
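The "accumulate softmax online" step is the part that makes tiling exact. Below is a pure-Python sketch for a single query vector, comparing a reference that materializes every score with a tiled version that keeps only a running max, a running sum and an unnormalized output. It illustrates the arithmetic only; it is not the GPU kernel, and the function names and toy data are hypothetical.

```python
import math

def attention_reference(q, keys, values):
    """Standard softmax attention for one query: materializes all scores."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(len(values[0]))
    ]

def attention_tiled(q, keys, values, block_size=2):
    """Same result, visiting K/V in blocks and keeping only running statistics."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])  # unnormalized output
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

ref = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, block_size=2)
print(ref)
print(tiled)
```

The two outputs agree up to floating-point rounding for any block size, which is why the tiled algorithm is exact rather than an approximation.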

Results

Memory: O(N) instead of O(N²)
Speed: 2–4× wall-clock faster on A100
Exact output — not an approximation
Flash Attention 2 (2023): 2× over FA1 via better work partitioning
Flash Attention 3 (2024): asynchronous TMA on H100 GPUs
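Available in PyTorch 2.0+ via torch.nn.functional.scaled_dot_product_attention (SDPA), which selects a FlashAttention kernel when the inputs qualify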

🔀 Mixture of Experts (MoE) — Scaling Parameters Without Scaling Cost Architecture

Standard transformers activate all parameters for every token. MoE replaces each FFN layer with E expert FFN sub-networks but activates only the top-k experts per token, so the parameter count grows with E while per-token FLOPs depend only on k.

# MoE Layer forward pass
gates = softmax(Router(x))      # [E] scores
top_k_idx = topk(gates, k=2)    # pick 2 experts
output = Σᵢ gates[i] · Expert_i(x) for i in top_k_idx

# Load balancing loss prevents expert collapse
L_aux = α · Σᵢ (fᵢ · pᵢ)
# fᵢ = fraction of tokens routed to expert i
# pᵢ = mean router probability assigned to expert i
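The forward pass above can be sketched for a single token as follows. The experts here are hypothetical stand-ins (each just scales its input) and the router logits are made up; the mixing follows the pseudocode literally, using the raw softmax gates without renormalizing over the selected experts, which some implementations do.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, experts, k=2):
    """Route one token through its top-k experts and mix their outputs."""
    gates = softmax(router_logits)
    top_k_idx = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    output = [0.0] * len(x)
    for i in top_k_idx:
        expert_out = experts[i](x)  # only the selected experts are evaluated
        output = [o + gates[i] * e for o, e in zip(output, expert_out)]
    return output, top_k_idx, gates

def make_scaling_expert(factor):
    """Hypothetical stand-in for an expert FFN: multiplies its input by a constant."""
    return lambda x: [factor * v for v in x]

experts = [make_scaling_expert(f) for f in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 1.5]  # hypothetical Router(x) output for one token
x = [1.0, -2.0]

output, top_k_idx, gates = moe_forward(x, router_logits, experts)
print("selected experts:", top_k_idx)
print("output:", [round(v, 4) for v in output])
```

Only two of the four experts run for this token, which is the source of the compute saving: adding more experts increases parameters but not the work done per token.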

Mixtral 8×7B (2024) Example

8 expert FFNs per layer, top-2 selected per token
47B total parameters, 13B active per token
Matches or exceeds Llama-2 70B on most benchmarks with ~5× fewer active parameters per token
Challenge: all experts must fit in GPU memory
Future: expert offloading to CPU/NVMe for larger models

🗜️ Quantization & Speculative Decoding

Quantization (INT8, INT4, GPTQ, AWQ)

Reduce weight precision to decrease memory footprint and improve throughput. Key methods:

Method	Bits	Quality Loss	Used By
INT8	8-bit	Minimal	bitsandbytes
GPTQ	4-bit	~1% on benchmarks	AutoGPTQ
AWQ	4-bit	Often better than GPTQ	AutoAWQ, vLLM
NF4	4-bit	Minimal (QLoRA)	bitsandbytes
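As a baseline for what "reducing weight precision" means, here is the simplest scheme: symmetric round-to-nearest INT8 with one scale per tensor. This is a sketch of plain absmax quantization, not of GPTQ, AWQ or NF4, which add error compensation, activation-aware scaling or non-uniform levels on top. The weights are hypothetical.

```python
def quantize_absmax_int8(weights):
    """Symmetric round-to-nearest INT8 quantization with one scale per tensor.

    Assumes at least one weight is non-zero.
    """
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.03, 0.9, -0.77]  # hypothetical FP32 weights
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"scale={scale:.5f}", f"max error={max_err:.5f}")
```

Each weight is stored in one byte plus a shared scale, and the reconstruction error is at most half a quantization step.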

Speculative Decoding

Autoregressive generation is sequential — one token at a time. Speculative decoding uses a small draft model to generate K tokens quickly, then a verifier (large model) accepts or rejects them in one parallel forward pass:

Draft: 1.3B model, generates K=4 tokens fast
Verify: 70B model checks all 4 in one pass
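Accept each drafted token with probability min(1, p_large / p_draft); on the first rejection, resample from the normalized residual max(0, p_large − p_draft)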
2–3× throughput improvement in practice
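The accept/resample rule can be sketched for a single position. This assumes the standard rejection-sampling formulation (accept with probability min(1, p_large / p_draft), otherwise sample from the normalized positive part of p_large − p_draft); the function name and the two toy distributions are hypothetical.

```python
import random

def verify_draft_token(token, p_draft, p_large, rng):
    """Accept or replace one drafted token.

    p_draft and p_large map each vocabulary token to its probability
    under the draft and the large model at this position.
    """
    accept_prob = min(1.0, p_large[token] / p_draft[token])
    if rng.random() < accept_prob:
        return token, True
    # Rejected: resample from the normalized residual max(0, p_large - p_draft).
    residual = {t: max(0.0, p_large[t] - p_draft.get(t, 0.0)) for t in p_large}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0], False

p_draft = {"a": 0.7, "b": 0.2, "c": 0.1}  # hypothetical draft distribution
p_large = {"a": 0.4, "b": 0.5, "c": 0.1}  # hypothetical large-model distribution

rng = random.Random(0)
counts = {t: 0 for t in p_large}
trials = 20_000
for _ in range(trials):
    drafted = rng.choices(list(p_draft), weights=list(p_draft.values()), k=1)[0]
    final, _accepted = verify_draft_token(drafted, p_draft, p_large, rng)
    counts[final] += 1
print({t: round(n / trials, 3) for t, n in counts.items()})
```

The empirical frequencies come out close to p_large, which is the point of the rule: the output distribution matches the large model's even though most tokens were proposed by the draft model.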
