🧠

Why LSTM? The RNN Problem

From short-term memory to long-term memory

😤 RNN's Fatal Flaw: Short-Term Memory

What's the problem?

RNNs pass information forward through a hidden state h_t. Over long sequences, the gradient shrinks exponentially during backpropagation, so the model effectively forgets words from the beginning of a sentence.

🔴 Classic Example (from lecture): "I was born in Japan and lived there for 10 years and then moved to the US. My native language is ___."

The word "Japan" is far from the blank. An RNN's gradient for that early word shrinks to near zero by the time backprop reaches it, so the model ignores it and fails to predict "Japanese."

🔴 RNN Behavior

  • Hidden state h_t carries memory
  • h_t = tanh(x_t·Θ¹ + h_{t-1}·Θ³)
  • Gradient multiplied at EVERY step
  • tanh derivative ≤ 1 → product → 0
  • Result: only remembers last ~5–10 tokens
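The shrinking product can be checked numerically. A minimal NumPy sketch, where the 50 pre-activation values are made-up stand-ins for whatever a trained RNN would actually produce: backprop multiplies one tanh-derivative factor (each ≤ 1) per time step, so the gradient collapses.

```python
import numpy as np

# Illustrative only: 50 hypothetical pre-activation values, one per time step.
rng = np.random.default_rng(0)
pre_activations = rng.normal(0.0, 1.0, size=50)

# d/dx tanh(x) = 1 - tanh(x)^2, which is always <= 1.
step_factors = 1.0 - np.tanh(pre_activations) ** 2

# Backprop through T steps multiplies one such factor per step.
grad_after_10 = float(np.prod(step_factors[:10]))
grad_after_50 = float(np.prod(step_factors))
```

After 50 steps the surviving gradient is orders of magnitude smaller than after 10, which is exactly why early words like "Japan" stop influencing learning.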

🟢 LSTM Solution

  • Adds a Cell State C_t (long-term memory)
  • Gates selectively keep or discard info
  • Gradient path through C_t avoids repeated multiplication
  • Result: can remember 100s of tokens

💡 Key Analogy: Think of the RNN hidden state like RAM (fast but small). The LSTM cell state is like a hard drive: slower to update, but able to hold far more information over time.
🔒

Part 1: The Four LSTM Gates

How LSTM controls what to remember and forget
  • Forget Gate (f_t): what old info to erase from the cell state?
  • Input Gate (i_t): what new info to write to the cell state?
  • Cell Update (C_t): the long-term memory, updated by forget + input
  • Output Gate (o_t → h_t): what to expose from the cell state as the hidden state?

🟠 Forget Gate – "What do I erase?"

What

The forget gate looks at the previous hidden state h_{t-1} and the current input x_t, and outputs a number between 0 and 1 for each element of the cell state: 0 = erase completely, 1 = keep completely.

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
# [h_{t-1}, x_t] = concatenation of previous hidden state and current input
# W_f, b_f = learned weight matrix and bias for the forget gate
# σ = sigmoid → output in (0, 1) for every cell state element
# f_t ≈ 0 → forget this info from the cell state
# f_t ≈ 1 → keep this info in the cell state
# Example: when we see "." (end of sentence), the forget gate learns
# to clear the subject so the next sentence starts fresh.
โš ๏ธ Lecture Quote: "Forget gate will help us determine what information we are going to remove or throw away from the cell state."

🟢 Input Gate + Candidate – "What do I write?"

What

Two computations work together: the input gate decides which positions to update (importance scores between 0 and 1), while the candidate vector creates the actual new values to potentially write (squashed into (-1, +1)).

# Step 1: Input gate – which positions are important to update?
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
# 0 = not important, 1 = important to update

# Step 2: Candidate cell state – what new values could we write?
C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
# tanh → values in (-1, +1) to regulate the network

# Step 3: Combine – only write where the input gate says it's important
# (used in the cell state update below)
i_t ⊙ C̃_t   ← point-wise multiplication
💡 Analogy: i_t is the highlighter: it marks which information is worth remembering. C̃_t is the raw text: the content itself. Together they decide what actually gets stored.

🩷 Cell State Update – "Updating long-term memory"

How

The new cell state combines (1) the old cell state filtered by the forget gate, and (2) the new candidate values filtered by the input gate. This is the core of LSTM's long-term memory.

C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
# f_t ⊙ C_{t-1} → how much of the OLD memory to keep
# i_t ⊙ C̃_t → how much NEW information to add
# ⊙ → element-wise (Hadamard) multiplication

# Why does this solve vanishing gradients?
# The gradient through C_t flows via ADDITION (+), not repeated multiplication!
# Addition doesn't shrink gradients → long-range learning is possible.
🔑 The Key Insight – Why LSTM Works: In a plain RNN, gradients flow through repeated tanh multiplications (each factor ≤ 1), so they vanish. In an LSTM, gradients can flow through the cell state update path, where the dominant operation is addition. Addition preserves gradient magnitude, enabling learning from hundreds of time steps back.
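A back-of-the-envelope comparison of the two gradient paths makes this concrete. Along the cell-state path, the per-step factor is the forget gate value itself, which the network can learn to keep near 1. The factor values 0.6 and 0.99 below are illustrative assumptions, not learned quantities:

```python
import numpy as np

# Along the cell-state path, dC_t/dC_{t-1} = f_t element-wise, so the gradient
# after T steps scales by a product of forget-gate values per dimension.
T = 100
tanh_factors = np.full(T, 0.6)     # assumed typical tanh-derivative factor (< 1)
forget_factors = np.full(T, 0.99)  # assumed forget gate that learned to stay near 1

rnn_grad_scale = float(np.prod(tanh_factors))     # shrinks toward zero
lstm_grad_scale = float(np.prod(forget_factors))  # stays at a usable magnitude
```

Over 100 steps the tanh path is numerically dead while the cell-state path retains roughly a third of the gradient magnitude.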

🔵 Output Gate – "What do I expose as hidden state?"

What

The output gate decides which parts of the cell state to actually output as the hidden state h_t. Not everything in long-term memory is relevant at every step.

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
# Decides which parts of the cell state to expose
h_t = o_t ⊙ tanh(C_t)
# tanh(C_t) squashes the cell state to (-1, +1)
# o_t masks which dimensions to include in h_t
# h_t is used for predictions and passed to the next time step
✅ Summary – LSTM's two outputs per step:
C_t = updated long-term memory (carried forward mostly intact)
h_t = short-term working memory / current output (filtered view of C_t)
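Putting the four gate equations together, one LSTM time step can be sketched in NumPy. The `lstm_step` helper, the toy dimensions, and the random weights are all illustrative assumptions; a real model would use a framework's LSTM cell:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step; W and b hold one weight matrix / bias per gate."""
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate
    C_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate values
    C_t = f_t * C_prev + i_t * C_tilde      # cell state update (the + path)
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate
    h_t = o_t * np.tanh(C_t)                # new hidden state
    return h_t, C_t

# Toy, made-up sizes: input dim 3, hidden dim 2.
rng = np.random.default_rng(0)
d_in, d_h = 3, 2
W = {k: rng.normal(0.0, 0.1, size=(d_h, d_h + d_in)) for k in "fico"}
b = {k: np.zeros(d_h) for k in "fico"}

h, C = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):      # run 5 time steps
    h, C = lstm_step(x_t, h, C, W, b)
```

Note that each step returns both vectors: C_t carries long-term memory forward, while h_t is the filtered view used for predictions.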

🎮 LSTM Gate Explorer

[Interactive widget: sliders simulate how the gate activations f_t, i_t, C̃_t, o_t and the resulting C_t and h_t respond to different input values.]
⚡

Part 2: GRU โ€” Gated Recurrent Unit

A simpler, faster alternative to LSTM

⚡ GRU: LSTM's Streamlined Sibling

What

GRU is a simplified variant of LSTM that uses only two gates (Update + Reset) instead of four operations, and merges the cell state and hidden state into a single vector. Fewer parameters → faster training with comparable performance.

# GRU has 2 gates (vs LSTM's 4 operations):

# Update gate (z_t): similar to LSTM's forget + input gates combined
z_t = σ(W_z · [h_{t-1}, x_t])
# Decides how much of the past to keep vs. how much new input to use
# z_t ≈ 0 → keep the past hidden state (like the forget gate saying "remember")
# z_t ≈ 1 → overwrite with new info (like the input gate saying "update")

# Reset gate (r_t): how much of the past to forget when computing the candidate
r_t = σ(W_r · [h_{t-1}, x_t])
# Controls how much the past hidden state influences the new candidate

# New hidden state:
h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t])
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
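The GRU equations can be sketched in NumPy the same way. The `gru_step` helper, the toy sizes, and the random weights are illustrative assumptions (biases are omitted, matching the equations above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU time step following the equations above (biases omitted)."""
    zx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ zx)                           # update gate
    r_t = sigmoid(W_r @ zx)                           # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))
    return (1 - z_t) * h_prev + z_t * h_tilde         # blend old and new

# Toy, made-up sizes: input dim 3, hidden dim 2.
rng = np.random.default_rng(0)
d_in, d_h = 3, 2
W_z, W_r, W_h = (rng.normal(0.0, 0.1, size=(d_h, d_h + d_in)) for _ in range(3))

h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):
    h = gru_step(x_t, h, W_z, W_r, W_h)
```

Compared to the LSTM step, there is one state vector instead of two and three weight blocks instead of four.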

✅ GRU Advantages

  • Fewer tensor operations → faster to train
  • Fewer parameters → less risk of overfitting on small data
  • Single hidden state (simpler)
  • Often matches LSTM performance in practice

🔵 When to Choose LSTM vs GRU

  • LSTM: very long sequences, large datasets, complex tasks
  • GRU: faster prototyping, smaller datasets, real-time systems
  • In practice: try both and validate
| Feature | LSTM | GRU |
|---|---|---|
| Gates | 4 (forget, input, candidate, output) | 2 (update, reset) |
| Memory vectors | 2 (C_t and h_t) | 1 (h_t only) |
| Parameters | More | Fewer (~25% less) |
| Training speed | Slower | Faster |
| Long sequences | Excellent | Good |
| Interpretability | Cell state separable | Simpler |
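The "~25% fewer parameters" figure follows directly from the gate counts: both cells use weight blocks of the same shape over [h_{t-1}, x_t], and LSTM has four of them where GRU has three. A quick sketch, where the 300/512 dimensions are arbitrary and the count covers one weight matrix plus one bias per block (framework implementations differ slightly):

```python
def gated_cell_params(d_in, d_h, n_blocks):
    # Each block: one weight matrix over the concatenated [h, x] plus one bias.
    return n_blocks * (d_h * (d_h + d_in) + d_h)

lstm = gated_cell_params(d_in=300, d_h=512, n_blocks=4)  # f, i, candidate, o
gru = gated_cell_params(d_in=300, d_h=512, n_blocks=3)   # z, r, candidate
savings = 1 - gru / lstm   # exactly 0.25 under this parameterization
```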
🎯

Part 3: Encoder-Decoder + Attention

Teaching the model to focus on what matters

๐Ÿ” Encoder-Decoder Architecture

โ–ผ
What

Used for sequence-to-sequence tasks like machine translation. The encoder reads the input and produces a context vector C. The decoder reads C and generates the output one token at a time.

Why

Many NLP tasks require variable-length input AND variable-length output (translation, summarization). A single RNN can't handle this, so we need a two-stage architecture.

Example: "How are you?" → "Jak się masz?" (Polish)

ENCODER:  How (h₁) → are (h₂) → you (h₃) → C
DECODER:  C → Jak (s₁) → się (s₂) → masz (s₃)
🔴 The Big Problem: The entire input "How are you?" must be compressed into a SINGLE context vector C (the final encoder hidden state). For long sentences, this bottleneck loses information; the decoder can't recover what was compressed away.
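The bottleneck is easy to see in code: a plain RNN encoder hands the decoder a fixed-size vector no matter how long the input is. A minimal sketch with made-up dimensions and random weights standing in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                      # hidden size (arbitrary)
W_xh = rng.normal(0.0, 0.5, size=(d, d))   # input-to-hidden weights
W_hh = rng.normal(0.0, 0.5, size=(d, d))   # hidden-to-hidden weights

def encode(embeddings):
    """Plain RNN encoder: the only thing passed to the decoder is the final state."""
    h = np.zeros(d)
    for x in embeddings:
        h = np.tanh(W_xh @ x + W_hh @ h)
    return h  # context vector C: fixed size regardless of input length

ctx_short = encode(rng.normal(size=(3, d)))   # a 3-token input
ctx_long = encode(rng.normal(size=(50, d)))   # a 50-token input
# Both context vectors have identical capacity -- that is the bottleneck.
```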

🎯 Attention Mechanism – "Look at everything, focus on what matters"

What

Instead of using only the final encoder state, attention gives the decoder access to ALL encoder hidden states at every decoding step. It learns a weighted combination, paying more attention to relevant words.

Why it works

When translating word 3 of the output, the decoder doesn't need word 1 of the input; it needs the most relevant encoder states. Attention learns these relevance scores automatically.

# Step 1: Compute a raw alignment score for each encoder state h_j
# using a small feedforward neural network:
e_{t,j} = FFN(h_j, s_{t-1}) = tanh([h_j ; s_{t-1}] · W_a + b_a) · v_a
# h_j = encoder hidden state at position j
# s_{t-1} = previous decoder hidden state
# FFN = single hidden layer (tanh activation) + linear output

# Step 2: Normalize raw scores into attention weights via softmax
α_{t,j} = softmax(e_{t,j}) over all j
# α values sum to 1 across all encoder positions
# High α_{t,j} → the decoder should pay a lot of attention to h_j now

# Step 3: Compute the context vector as a weighted sum of encoder states
C_t = Σ_j α_{t,j} · h_j
# Each decoder step gets its OWN context vector C_t
# (vs. the simple encoder-decoder: a single C for all decoder steps)

# Step 4: Use C_t to compute the decoder output normally
s_t = LSTM(s_{t-1}, [y_{t-1}; C_t])
💡 This type is called Additive (Bahdanau) Attention. It was the first attention mechanism for NLP (2015). The "additive" refers to how h_j and s_{t-1} are added/concatenated before being fed into the FFN. Later, transformers use scaled dot-product attention (faster) and self-attention, covered in Week 8!
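Steps 1–3 above can be sketched end-to-end in NumPy. The `additive_attention` helper, all dimensions, and the random weights are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(H, s_prev, W_a, b_a, v_a):
    """Bahdanau-style attention over encoder states H (one row per position)."""
    scores = np.array([
        np.tanh(np.concatenate([h_j, s_prev]) @ W_a + b_a) @ v_a
        for h_j in H
    ])                       # e_{t,j}: one raw alignment score per position
    alpha = softmax(scores)  # attention weights, sum to 1
    C_t = alpha @ H          # weighted sum of encoder states
    return alpha, C_t

# Made-up sizes: hidden dim 4, input length 5.
rng = np.random.default_rng(0)
d, n = 4, 5
H = rng.normal(size=(n, d))         # encoder hidden states h_1..h_n
s_prev = rng.normal(size=d)         # previous decoder state s_{t-1}
W_a = rng.normal(0.0, 0.5, size=(2 * d, d))
b_a = np.zeros(d)
v_a = rng.normal(size=d)

alpha, C_t = additive_attention(H, s_prev, W_a, b_a, v_a)
```

Because `alpha` is recomputed from s_{t-1} at every decoding step, each step gets its own context vector, which is exactly what the simple encoder-decoder lacks.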

🎮 Attention Weight Visualizer

[Interactive widget: for the translation "The cat sat on mat" → "Le chat s'assit sur le tapis", selecting a decoder output word shows its attention weights α over the encoder words (darker = more attention).]
🗺️

Complete Picture: RNN → LSTM → GRU → Attention

How each step solves the previous one's limitations

📊 Full Architecture Comparison

| Property | RNN | LSTM | GRU | LSTM + Attention |
|---|---|---|---|---|
| Long-range memory | ❌ Vanishes | ✅ Cell state | ✅ Update gate | ✅ All encoder states |
| Mechanism | h_t only | C_t + h_t + 4 gates | h_t + 2 gates | LSTM + weighted context C_t |
| Parameters | Few | Many | Moderate | Many + FFN for α |
| Handles very long seqs? | No | Yes | Yes | Best |
| Typical NLP tasks | Short text classification | NER, POS, LM | Same as LSTM | Translation, summarization |
| Context vector | h_t (single) | h_t (single) | h_t (single) | C_t per step (rich) |
| Superseded by | LSTM/GRU | Transformers (partial) | Transformers (partial) | Self-Attention / Transformers |
🔑 The Progression of Ideas: RNN → add gates to control memory (LSTM) → simplify the gates (GRU) → add a mechanism to look back at all inputs dynamically (Attention) → replace recurrence entirely with attention (Transformers, Week 8).

🧪 Quiz 6 Practice – LSTM & Attention

Q1. What is the primary reason LSTM outperforms RNN on long sequences?

Q2. In an LSTM, the forget gate outputs values close to 0. What does this mean?

Q3. Which of the following correctly describes the LSTM cell state update equation?

Q4. How does GRU differ from LSTM?

Q5. In an attention-based encoder-decoder, what does the context vector C_t represent?

Q6. True or False: In a simple encoder-decoder (no attention), the decoder has access to all encoder hidden states at each decoding step.