Solving the vanishing gradient problem with gated memory, then teaching the model to focus on what matters most.
RNNs pass information forward through a hidden state h_t. Over long sequences, the gradient shrinks exponentially during backpropagation โ so the model effectively forgets words from the beginning of a sentence.
The forget gate looks at the previous hidden state h_{t-1} and current input x_t, and outputs a number between 0 and 1 for each element of the cell state. 0 = erase completely. 1 = keep completely.
Two computations work together: the input gate decides which positions to update (0โ1 importance scores), while the candidate vector creates the actual new values to potentially write (scaled -1 to 1).
The new cell state combines: (1) the old cell state filtered by the forget gate, and (2) the new candidate values filtered by the input gate. This is the core of LSTM's long-term memory.
tanh multiplications (each โค 1) โ they vanish. In LSTM, gradients can flow through the cell state update path where the dominant operation is addition. Addition preserves gradient magnitude, enabling learning from hundreds of time steps back.
The output gate decides which parts of the cell state to actually output as the hidden state h_t. Not everything in long-term memory is relevant at every step.
C_t = updated long-term memory (carried forward mostly intact)h_t = short-term working memory / current output (filtered view of C_t)
GRU is an updated version of LSTM that uses only two gates (Update + Reset) instead of four, and merges the cell state and hidden state into one. Fewer parameters โ faster training with comparable performance.
| Feature | LSTM | GRU |
|---|---|---|
| Gates | 4 (forget, input, candidate, output) | 2 (update, reset) |
| Memory vectors | 2 (C_t and h_t) | 1 (h_t only) |
| Parameters | More | Fewer (~25% less) |
| Training speed | Slower | Faster |
| Long sequences | Excellent | Good |
| Interpretability | Cell state separable | Simpler |
Used for sequence-to-sequence tasks like machine translation. The encoder reads the input and produces a context vector C. The decoder reads C and generates the output one token at a time.
Many NLP tasks require variable-length input AND variable-length output (translation, summarization). A single RNN can't handle this โ we need a two-stage architecture.
Example: "How are you?" โ "Jak siฤ masz?" (Polish)
Instead of using only the final encoder state, attention gives the decoder access to ALL encoder hidden states at every decoding step. It learns a weighted combination โ paying more attention to relevant words.
When translating word 3 of the output, the decoder doesn't need word 1 of the input โ it needs the most relevant encoder states. Attention learns these relevance scores automatically.
Translating: "The cat sat on mat" โ "Le chat s'assit sur le tapis"
Attention ฮฑ over encoder words (darker = more attention):
| Property | RNN | LSTM | GRU | LSTM + Attention |
|---|---|---|---|---|
| Long-range memory | โ Vanishes | โ Cell state | โ Update gate | โ All encoder states |
| Mechanism | h_t only | C_t + h_t + 4 gates | h_t + 2 gates | LSTM + weighted context C_t |
| Parameters | Few | Many | Moderate | Many + FFN for ฮฑ |
| Handles very long seqs? | No | Yes | Yes | Best |
| Typical NLP tasks | Short text class. | NER, POS, LM | Same as LSTM | Translation, summarization |
| Context vector | h_t (single) | h_t (single) | h_t (single) | C_t per step (rich) |
| Superseded by | LSTM/GRU | Transformers (partial) | Transformers (partial) | Self-Attention / Transformers |
Q1. What is the primary reason LSTM outperforms RNN on long sequences?
Q2. In an LSTM, the forget gate outputs values close to 0. What does this mean?
Q3. Which of the following correctly describes the LSTM cell state update equation?
Q4. How does GRU differ from LSTM?
Q5. In an attention-based encoder-decoder, what does the context vector C_t represent?
Q6. True or False: In a simple encoder-decoder (no attention), the decoder has access to all encoder hidden states at each decoding step.
The Transformer you learned in lecture is not the Transformer running GPT-4, LLaMA-3, or Mixtral. Modern LLMs layer a set of architectural innovations โ RoPE, GQA, Flash Attention, MoE, quantization โ on top of the vanilla design. This section covers the math and intuition behind each.
Absolute position embeddings (learned or sinusoidal) struggle with sequences longer than training length. RoPE (Su et al., 2021) encodes position via rotation in the complex plane โ allowing relative position to naturally emerge in the attention dot product.
The inner product naturally encodes the relative distance mโn, not absolute positions. Long-range decay is built in: large |mโn| โ smaller dot products (more rotation mismatch).
During autoregressive generation, the model must store Key and Value matrices for every token generated so far โ this is the KV-cache. For multi-head attention with H=32 heads and sequence length N, the KV-cache grows as O(NยทHยทd_head). At 100K tokens, this can be gigabytes per layer, per request. GQA reduces this dramatically.
H queries
H keys
H values
Full KV cache
H queries
G key groups
G value groups
G << H
H queries
1 key
1 value
Minimal cache
LLaMA-2 and Mistral use GQA with G=8 groups, H=32 heads โ reducing KV cache by 4ร vs. MHA with minimal quality loss. LLaMA-3 70B uses GQA with 8 KV heads for 8 query head groups. This is why LLaMA-3 can handle much longer contexts than LLaMA-1.
Standard attention computes the full NรN attention matrix in GPU HBM (high-bandwidth memory), then applies softmax, then multiplies by V. For N=32K, this requires ~4GB just for the attention matrix. Flash Attention (Dao et al., 2022) avoids materializing it.
Standard transformers activate all parameters for every token. MoE replaces each FFN layer with E expert FFN sub-networks, but only activates top-k experts per token โ keeping FLOPs constant while increasing parameter count.
Reduce weight precision to decrease memory footprint and improve throughput. Key methods:
| Method | Bits | Quality Loss | Used By |
|---|---|---|---|
| INT8 | 8-bit | Minimal | bitsandbytes |
| GPTQ | 4-bit | ~1% on benchmarks | AutoGPTQ |
| AWQ | 4-bit | Better than GPTQ | LLaMA.cpp |
| NF4 | 4-bit | Minimal (QLoRA) | bitsandbytes |
Autoregressive generation is sequential โ one token at a time. Speculative decoding uses a small draft model to generate K tokens quickly, then a verifier (large model) accepts or rejects them in one parallel forward pass: