CNNs extract local patterns from text. RNNs process sequences with memory. Together, they unlock understanding of structure and order in language.
| Network Type | What it captures | Position-aware? | Memory? |
|---|---|---|---|
| FC Neural Network | Global patterns (all features at once) | No | No |
| CNN | Local patterns (n-gram-like features) | Yes (local regions) | No |
| RNN | Sequential patterns (order-aware) | Yes (sequence order) | Yes (hidden state) |
A CNN uses small learnable filters (also called kernels) that slide across the input, performing a dot product at each position. Each filter detects a specific local pattern (e.g., "I love" or a specific visual edge).
Instead of connecting every input feature to every hidden neuron (too many parameters!), each hidden neuron connects only to a small local region. This drastically reduces parameters and forces the network to look for local structure.
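The sliding-filter idea can be sketched in a few lines of numpy. This is a minimal illustration, not the lecture's implementation: the document matrix and filter values are random placeholders, using the (6 × 12) document and (3 × 12) filter shapes from the review question below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative document: 6 words, each a 12-dimensional embedding.
doc = rng.standard_normal((6, 12))

# One learnable filter covering a window of 3 consecutive words.
filt = rng.standard_normal((3, 12))

def convolve_text(doc, filt):
    """Slide the filter over the document with stride 1, taking a
    dot product (sum of elementwise products) at each position."""
    window = filt.shape[0]
    n_positions = doc.shape[0] - window + 1
    return np.array([np.sum(doc[i:i + window] * filt)
                     for i in range(n_positions)])

feature_map = convolve_text(doc, filt)
print(feature_map.shape)  # (4,) — one score per window position: 6 - 3 + 1 = 4
```

Note that each output score only depends on 3 × 12 filter weights, and those same weights are reused at every position.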
CNN for Text — Pipeline
After convolution, each filter produces a feature map (a vector of scores). Max pooling selects the maximum value from this map — capturing "was this pattern detected anywhere in the document?"
We don't care WHERE in the document a positive pattern appears — only WHETHER it appears. Max pooling achieves this position-invariance and further reduces dimensionality, preventing overfitting.
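Max pooling then collapses each feature map to a single number. A small sketch with made-up feature-map values:

```python
import numpy as np

# Illustrative feature maps from three filters over a 6-word document
# (one row per filter, one column per window position).
feature_maps = np.array([
    [0.1, 2.3, -0.5, 0.4],   # filter 1: fires strongly at position 1
    [-1.0, -0.2, 0.9, 0.3],  # filter 2: fires weakly at position 2
    [0.0, 0.0, 0.0, 1.7],    # filter 3: fires only at the last position
])

# Max pooling keeps one number per filter: the strongest detection,
# regardless of WHERE in the document it occurred.
pooled = feature_maps.max(axis=1)
print(pooled)  # [2.3 0.9 1.7]
```

Whether filter 3 fired at position 0 or position 3, the pooled vector is identical, which is exactly the position-invariance described above.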
The same filter weights are used at every position. This means the network learns to detect a pattern (e.g., "not good") anywhere in the text, using far fewer parameters than a fully-connected layer.
An RNN processes the input one word at a time, maintaining a hidden state hₜ that serves as memory — it carries information from all previous words into the current step.
For NER (Named Entity Recognition): "Mahdi" alone is ambiguous, but in "Mahdi teaches" the following word helps confirm it's a person's name. CNNs see only a fixed window; RNNs carry the entire history forward. Word order matters for understanding meaning.
RNN Unrolled — "Mahdi and Wafa teach NLP"
Each hₜ carries memory of all words before it. Θ¹, Θ², Θ³ are shared at every step — same weights for all words.
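A minimal forward pass makes the weight sharing concrete. This is a sketch, not the lecture's code: the dimensions are arbitrary, the word embeddings are random stand-ins for the five words above, and Theta1/Theta2/Theta3 play the roles of Θ¹ (input→hidden), Θ² (hidden→hidden), and Θ³ (hidden→output).

```python
import numpy as np

rng = np.random.default_rng(1)

embed_dim, hidden_dim, out_dim = 8, 16, 4
Theta1 = 0.1 * rng.standard_normal((hidden_dim, embed_dim))   # input  -> hidden
Theta2 = 0.1 * rng.standard_normal((hidden_dim, hidden_dim))  # hidden -> hidden
Theta3 = 0.1 * rng.standard_normal((out_dim, hidden_dim))     # hidden -> output

def rnn_forward(xs):
    """Process a sequence one word at a time; the SAME three weight
    matrices are reused at every timestep, and h carries memory forward."""
    h = np.zeros(hidden_dim)
    outputs = []
    for x in xs:                              # one embedding per word
        h = np.tanh(Theta1 @ x + Theta2 @ h)  # new memory = f(input, old memory)
        outputs.append(Theta3 @ h)            # per-step output (e.g. tag scores)
    return outputs, h

words = rng.standard_normal((5, embed_dim))   # e.g. "Mahdi and Wafa teach NLP"
outputs, h_final = rnn_forward(words)
print(len(outputs), h_final.shape)  # 5 (16,)
```

By the last step, h_final has been updated five times, so it (indirectly) reflects every word in the sentence.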
| Architecture | Input → Output | NLP Example |
|---|---|---|
| Many-to-Many | Sequence → Sequence (same length) | NER: label each word in a sentence |
| Many-to-One | Sequence → Single output | Sentiment analysis: entire doc → positive/negative |
| One-to-Many | Single input → Sequence | Image captioning: one image → generated description |
| Many-to-Many (encoder-decoder) | Sequence → Sequence (different length) | Machine translation: English → French |
During backpropagation through time (BPTT), gradients are multiplied through many timesteps. This causes gradients to either shrink to near-zero (vanishing) or grow uncontrollably (exploding).
Exploding gradients. Problem: the gradient grows exponentially → weights update wildly, training diverges.
Solution: gradient clipping: if ‖gradient‖ > threshold, rescale the gradient so its norm equals the threshold.
Vanishing gradients. Problem: the gradient shrinks to near zero → early timesteps get no learning signal → the network can't learn long-range dependencies.
Solutions: ReLU activation, LSTM/GRU (Week 7), better weight initialization.
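Gradient clipping is short enough to write out in full. A minimal sketch (the threshold of 5.0 and the gradient values are illustrative):

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """If the gradient's norm exceeds the threshold, rescale it so its
    norm equals the threshold: direction preserved, magnitude capped."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

big = np.array([30.0, 40.0])                 # norm 50: would cause a wild update
print(np.linalg.norm(clip_gradient(big)))    # 5.0
small = np.array([0.3, 0.4])                 # norm 0.5: left untouched
print(clip_gradient(small))                  # [0.3 0.4]
```

Note clipping only tames exploding gradients; it does nothing for vanishing ones, which is why LSTM/GRU are needed.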
| Property | CNN for Text | RNN for Text |
|---|---|---|
| Captures | Local n-gram patterns (fixed window) | Sequential dependencies (full history) |
| Position-aware | Partially (local window only) | Yes (full sequence order) |
| Training speed | Fast (parallelizable) | Slower (sequential processing) |
| Long-range dependencies | Poor (limited window size) | Better (but vanishing gradient) |
| Best for | Text classification, sentiment analysis | NER, machine translation, language modeling |
| Zero-padding | Required (all docs same length) | Used (or bucketing for efficiency) |
You've now built all the foundations. Here's how Week 6 connects to the rest of the course:
- LSTM (Week 7): fixes the RNN's vanishing gradient with gates. Builds directly on the Week 6 RNN.
- Attention: LSTM + attention lets the model focus on the most relevant parts of the input.
- Transformers: replace the RNN entirely with self-attention. More parallelizable, better at long-range dependencies.
- Pretrained language models: Transformers trained on massive data. Context-dependent embeddings (finally!).
Q1. What is the primary advantage of using CNN over a fully-connected neural network for text classification?
Q2. In a CNN for text, if the document matrix has shape (6 × 12) and we apply a filter of shape (3 × 12) with stride 1, what is the feature map size?
Q3. What does the hidden state hₜ in an RNN represent?
Q4. What is the vanishing gradient problem in RNNs and what is a common solution?