Solving the vanishing gradient problem with gated memory, then teaching the model to focus on what matters most.
RNNs pass information forward through a hidden state h_t. Over long sequences, the gradient shrinks exponentially during backpropagation, so the model effectively forgets words from the beginning of a sentence.
The forget gate looks at the previous hidden state h_{t-1} and current input x_t, and outputs a number between 0 and 1 for each element of the cell state. 0 = erase completely. 1 = keep completely.
Two computations work together: the input gate decides which positions to update (0–1 importance scores), while the candidate vector creates the actual new values to potentially write (scaled between -1 and 1).
The new cell state combines: (1) the old cell state filtered by the forget gate, and (2) the new candidate values filtered by the input gate. This is the core of LSTM's long-term memory.
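Written out in the standard notation (σ is the sigmoid, ⊙ is element-wise multiplication, and [h_{t-1}, x_t] denotes concatenation of the previous hidden state and current input):

```latex
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) && \text{input gate} \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) && \text{candidate values} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{new cell state}
```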
In a plain RNN, the backpropagated gradient passes through repeated tanh-derivative multiplications (each ≤ 1), so it vanishes. In an LSTM, gradients can flow through the cell state update path, where the dominant operation is addition. Addition preserves gradient magnitude, enabling learning from hundreds of time steps back.
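This can be made concrete by differentiating the cell state update C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t with respect to the previous cell state (treating the gates as constants for the moment):

```latex
\frac{\partial C_t}{\partial C_{t-1}} \approx \operatorname{diag}(f_t)
```

When the forget gate saturates near 1, the gradient passes along this path essentially unchanged, rather than being squashed by a chain of tanh derivatives.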
The output gate decides which parts of the cell state to actually output as the hidden state h_t. Not everything in long-term memory is relevant at every step.
C_t = updated long-term memory (carried forward mostly intact)
h_t = short-term working memory / current output (filtered view of C_t)
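The full LSTM step can be sketched in a few lines. The scalar state and the parameter layout below are illustrative assumptions chosen to keep the example tiny; a real implementation uses weight matrices and vector states.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step with scalar state (toy sizes, illustrative names).

    W and b hold the four parameter sets: forget (f), input (i),
    candidate (c), and output (o).
    """
    def affine(name):
        wh, wx = W[name]
        return wh * h_prev + wx * x_t + b[name]

    f = sigmoid(affine("f"))          # forget gate: 0 = erase, 1 = keep
    i = sigmoid(affine("i"))          # input gate: which positions to update
    c_tilde = math.tanh(affine("c"))  # candidate values in (-1, 1)
    c = f * c_prev + i * c_tilde      # additive cell-state update
    o = sigmoid(affine("o"))          # output gate
    h = o * math.tanh(c)              # filtered view of long-term memory
    return h, c

# toy parameters: zero weights, with biases that saturate the gates
W = {k: (0.0, 0.0) for k in "fico"}
b = {"f": 100.0, "i": -100.0, "c": 0.0, "o": -100.0}
h, c = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.7, W=W, b=b)
# with f ~ 1 and i ~ 0, the cell state is carried forward intact: c ~ 0.7
```

Note how the forget-gate bias alone decides whether c_prev survives: this is the "mostly intact" long-term memory path in action.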
GRU is a streamlined variant of LSTM that uses only two gates (update and reset) instead of LSTM's three gates plus candidate computation, and merges the cell state and hidden state into one vector. Fewer parameters means faster training with comparable performance.
| Feature | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) + candidate vector | 2 (update, reset) |
| Memory vectors | 2 (C_t and h_t) | 1 (h_t only) |
| Parameters | More | Fewer (by ~25%) |
| Training speed | Slower | Faster |
| Long sequences | Excellent | Good |
| Interpretability | Cell state separable | Simpler |
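A single GRU step, sketched in the same toy scalar style as the LSTM example. All names and the parameter layout are illustrative; the gate convention assumed here is h_t = (1 - z_t)·h_{t-1} + z_t·h̃_t.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x_t, h_prev, W, b):
    """One GRU time step with scalar state (toy sizes, illustrative names).

    z = update gate, r = reset gate; a single vector h plays both
    the long-term and short-term memory roles.
    """
    def affine(name, h):
        wh, wx = W[name]
        return wh * h + wx * x_t + b[name]

    z = sigmoid(affine("z", h_prev))              # update gate
    r = sigmoid(affine("r", h_prev))              # reset gate
    h_tilde = math.tanh(affine("h", r * h_prev))  # candidate, on reset-scaled state
    return (1.0 - z) * h_prev + z * h_tilde       # interpolate old and new

# toy parameters: update gate saturated shut, so the old state is kept
W = {k: (0.0, 0.0) for k in "zrh"}
b = {"z": -100.0, "r": 0.0, "h": 0.0}
h = gru_step(x_t=1.0, h_prev=0.5, W=W, b=b)
# z ~ 0, so h ~ 0.5: the hidden state is carried forward unchanged
```

The single interpolation line replaces the LSTM's separate forget/input gates and cell state, which is where the parameter savings come from.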
Used for sequence-to-sequence tasks like machine translation. The encoder reads the input and produces a context vector C. The decoder reads C and generates the output one token at a time.
Many NLP tasks require variable-length input AND variable-length output (translation, summarization). A single RNN can't handle this, so we need a two-stage architecture.
Example: "How are you?" โ "Jak siฤ masz?" (Polish)
Instead of using only the final encoder state, attention gives the decoder access to ALL encoder hidden states at every decoding step. It learns a weighted combination, paying more attention to relevant words.
When translating word 3 of the output, the decoder doesn't need word 1 of the input; it needs the most relevant encoder states. Attention learns these relevance scores automatically.
Translating: "The cat sat on mat" โ "Le chat s'assit sur le tapis"
Attention weights α over the encoder words (darker = more attention).
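A minimal sketch of how those weights are produced, using dot-product scoring for simplicity (the classic Bahdanau formulation uses a small feed-forward network to score instead; the toy vectors and all names are illustrative assumptions):

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """One attention step: score every encoder state, then blend them."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_states]            # relevance score per input word
    alphas = softmax(scores)                        # attention weights α
    context = [sum(a * enc[k] for a, enc in zip(alphas, encoder_states))
               for k in range(len(decoder_state))]  # weighted combination C_t
    return alphas, context

# toy 2-d encoder states for three input words
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas, context = attend([0.0, 5.0], enc)  # decoder state "asks about" dimension 1
# the second and third words, which carry dimension-1 content, get most of the weight
```

Unlike the plain encoder-decoder, this context vector is recomputed at every decoding step, so each output word can draw on a different mix of input words.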
| Property | RNN | LSTM | GRU | LSTM + Attention |
|---|---|---|---|---|
| Long-range memory | ✗ Vanishes | ✓ Cell state | ✓ Update gate | ✓ All encoder states |
| Mechanism | h_t only | C_t + h_t + gates | h_t + 2 gates | LSTM + weighted context C_t |
| Parameters | Few | Many | Moderate | Many + FFN for α |
| Handles very long seqs? | No | Yes | Yes | Best |
| Typical NLP tasks | Short text class. | NER, POS, LM | Same as LSTM | Translation, summarization |
| Context vector | h_t (single) | h_t (single) | h_t (single) | C_t per step (rich) |
| Superseded by | LSTM/GRU | Transformers (partial) | Transformers (partial) | Self-Attention / Transformers |
Q1. What is the primary reason LSTM outperforms RNN on long sequences?
Q2. In an LSTM, the forget gate outputs values close to 0. What does this mean?
Q3. Which of the following correctly describes the LSTM cell state update equation?
Q4. How does GRU differ from LSTM?
Q5. In an attention-based encoder-decoder, what does the context vector C_t represent?
Q6. True or False: In a simple encoder-decoder (no attention), the decoder has access to all encoder hidden states at each decoding step.