🗺️ The Big Picture

Why do we need CNN and RNN after regular Neural Networks?
The Limitation of Fully-Connected Networks Regular (FC) Neural Networks treat every feature independently and in fixed positions. For text and images, position and order matter. "Dog bites man" ≠ "Man bites dog". CNNs capture local patterns; RNNs capture sequential dependencies.
Network Type      | What it captures                       | Position-aware?     | Memory?
FC Neural Network | Global patterns (all features at once) | No                  | No
CNN               | Local patterns (n-gram-like features)  | Yes (local regions) | No
RNN               | Sequential patterns (order-aware)      | Yes (sequence order)| Yes (hidden state)
🔍 Part 1: Convolutional Neural Networks (CNN)

Local pattern detectors for text

🔬 Filters and Convolution — The Core Idea

What

A CNN uses small learnable filters (also called kernels) that slide across the input, performing a dot product at each position. Each filter detects a specific local pattern (e.g., "I love" or a specific visual edge).

Why

Instead of connecting every input feature to every hidden neuron (too many parameters!), each hidden neuron connects only to a small local region. This drastically reduces parameters and forces the network to look for local structure.

How — Text CNN Step by Step

# Step 1: Convert each word to a vector (one-hot or Word2Vec)
#   Document → matrix of shape (n_words × embedding_dim)
#   "great movie love it" → [4 × 12] matrix (e.g., 12-dim one-hot)

# Step 2: Zero-pad all documents to the same length
#   If max_len = 6, pad shorter docs: [4 × 12] → [6 × 12] (add zero rows)

# Step 3: Apply filters (convolution)
#   Filter size: (k × embedding_dim), where k = n-gram window size
#   The filter slides down the rows with stride 1
#   Dot product at each position → one scalar
#   n_words − k + 1 scalars → feature map of size (n − k + 1)

# Step 4: Max pooling
#   Take the MAX value of each feature map → one scalar per filter
#   (captures the strongest activation)

# Step 5: Flatten → FC layer → prediction
#   n_filters scalars → concatenate → classify
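Steps 1–5 can be sketched in a few lines of NumPy. This is a toy illustration with random embeddings and random filters (nothing here is trained), using the same shapes as the example above:

```python
import numpy as np

rng = np.random.default_rng(0)

n_words, emb_dim, k, n_filters = 6, 12, 3, 4   # padded doc length, embedding dim, n-gram size
doc = rng.normal(size=(n_words, emb_dim))      # Steps 1-2: padded document matrix
filters = rng.normal(size=(n_filters, k, emb_dim))

# Step 3: slide each (k × emb_dim) filter down the rows with stride 1,
# taking a dot product (elementwise multiply + sum) at each position
feature_maps = np.array([
    [np.sum(doc[i:i + k] * f) for i in range(n_words - k + 1)]
    for f in filters
])                                             # shape: (n_filters, n_words - k + 1)

# Step 4: max pooling → one scalar per filter
pooled = feature_maps.max(axis=1)              # shape: (n_filters,)

# Step 5: the pooled vector would feed a fully-connected softmax layer
print(feature_maps.shape)  # (4, 4)
print(pooled.shape)        # (4,)
```

With 6 words and a window of k = 3, each filter produces 6 − 3 + 1 = 4 scores, and pooling keeps one per filter.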

🎮 CNN Convolution Visualizer

See how a filter slides over a document matrix and computes the feature map.

CNN for Text — Pipeline

Input Matrix (n × d) → Conv Layer (filters: k × d) → Feature Maps (n − k + 1 per filter) → Max Pooling (1 value per filter) → FC + Softmax (classification)

🏊 Max Pooling — Selecting the Strongest Signal

What

After convolution, each filter produces a feature map (a vector of scores). Max pooling selects the maximum value from this map — capturing "was this pattern detected anywhere in the document?"

Why

We don't care WHERE in the document a positive pattern appears — only WHETHER it appears. Max pooling achieves this position-invariance and further reduces dimensionality, preventing overfitting.

# Example: a filter produces the feature map [0.3, -0.1, 0.8, 0.2, 0.5]
# Max pool → 0.8  ← "yes, this n-gram pattern appeared!"

# For images (2×2 max pool over a 4×4 input → 2×2 output):
# [[3, 1, 2, 4],
#  [2, 4, 1, 0],      [[4, 4],
#  [1, 0, 3, 5],  →    [4, 5]]
#  [2, 4, 1, 2]]
# Each 2×2 block → its maximum value
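Both pooling examples can be checked with a few lines of NumPy (the reshape trick assumes the input divides evenly into 2×2 blocks):

```python
import numpy as np

# 1-D case: max pool over a feature map → the strongest activation
fm = np.array([0.3, -0.1, 0.8, 0.2, 0.5])
print(fm.max())  # 0.8

# 2-D case: 2×2 max pooling over a 4×4 image
img = np.array([[3, 1, 2, 4],
                [2, 4, 1, 0],
                [1, 0, 3, 5],
                [2, 4, 1, 2]])
# Split into 2×2 blocks (row blocks × rows-in-block × col blocks × cols-in-block),
# then take the max within each block
pooled = img.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[4 4]
               #  [4 5]]
```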

CNN Parameter Sharing

The same filter weights are used at every position. This means the network learns to detect a pattern (e.g., "not good") anywhere in the text, using far fewer parameters than a fully-connected layer.

3 Ways CNN Reduces Parameters

  • Local connectivity: each neuron sees only k words
  • Weight sharing: same filter reused at every position
  • Max pooling: subsamples the feature maps
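As a rough sanity check on the savings, compare first-layer parameter counts with some assumed sizes (100-word docs, 300-dim embeddings, 128 FC hidden units vs. 128 trigram filters; these numbers are illustrative, not from the notes):

```python
# Assumed illustrative sizes
n_words, emb_dim = 100, 300   # padded doc length, embedding dim
hidden = 128                  # FC hidden-layer width
k, n_filters = 3, 128         # trigram filters

# FC layer: every input feature connects to every hidden neuron
fc_params = (n_words * emb_dim) * hidden   # 3840000

# Conv layer: one small (k × emb_dim) weight set per filter, shared at every position
cnn_params = n_filters * (k * emb_dim)     # 115200

print(fc_params, cnn_params)
```

Under these assumptions the convolutional layer uses over 30× fewer parameters, before max pooling shrinks things further.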
🔁 Part 2: Recurrent Neural Networks (RNN)

Sequential memory for language

💾 The Hidden State — RNN's Memory

What

An RNN processes the input one word at a time, maintaining a hidden state hₜ that serves as memory — it carries information from all previous words into the current step.

Why

For NER (Named Entity Recognition): "Mahdi" vs "Mahdi teaches" → the word after Mahdi helps confirm it's a person. CNNs see only a fixed window; RNNs see the entire history. Word order matters for understanding meaning.

# At each timestep t, processing word xₜ:
hₜ = tanh( xₜ × Θ¹ + hₜ₋₁ × Θ³ + b )   ← combines the current word AND previous memory
ŷₜ = softmax( hₜ × Θ² )                 ← predict an output at this step (if needed)

# Parameter sharing (KEY property of RNNs):
#   Θ¹, Θ², Θ³ are SHARED across ALL timesteps
#   Total params = d×m + m×m + m×y  (regardless of sequence length!)

# Sizes:
#   xₜ: d-dimensional word vector
#   hₜ: m-dimensional hidden state (m is a hyperparameter)
#   Θ¹: d × m (input → hidden)
#   Θ³: m × m (prev hidden → hidden)
#   Θ²: m × y (hidden → output)
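The forward pass above can be written as a minimal NumPy sketch (random, untrained weights; the dimensions d, m, y are placeholders chosen for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, y = 12, 8, 3                      # word-vector dim, hidden dim, number of classes

Theta1 = rng.normal(size=(d, m)) * 0.1  # input → hidden
Theta3 = rng.normal(size=(m, m)) * 0.1  # prev hidden → hidden (same weights every step)
Theta2 = rng.normal(size=(m, y)) * 0.1  # hidden → output
b = np.zeros(m)

def softmax(z):
    e = np.exp(z - z.max())             # shift for numerical stability
    return e / e.sum()

words = rng.normal(size=(5, d))         # a 5-word "sentence" of random word vectors
h = np.zeros(m)                         # h₀: empty memory
for x_t in words:
    h = np.tanh(x_t @ Theta1 + h @ Theta3 + b)   # combine current word and memory
    y_t = softmax(h @ Theta2)                    # per-step prediction (e.g., NER tag)

print(h.shape, y_t.shape)  # (8,) (3,)
```

Note the loop reuses Theta1, Theta2, Theta3 at every step: the parameter count is fixed no matter how long the sentence is.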

RNN Unrolled — "Mahdi and Wafa teach NLP"

x₁: "Mahdi"  → h₁ → ŷ₁ = PERSON
x₂: "and"    → h₂ → ŷ₂ = OTHER
x₃: "Wafa"   → h₃ → ŷ₃ = PERSON
x₄: "teach"  → h₄ → ŷ₄ = OTHER
x₅: "NLP"    → h₅ → ŷ₅ = OTHER

Each hₜ carries memory of all words before it. Θ¹, Θ², Θ³ are shared at every step — same weights for all words.

🏗️ RNN Architectures — Many-to-Many, Many-to-One, One-to-Many

Architecture                   | Input → Output                          | NLP Example
Many-to-Many                   | Sequence → Sequence (same length)       | NER: label each word in a sentence
Many-to-One                    | Sequence → Single output                | Sentiment analysis: entire doc → positive/negative
One-to-Many                    | Single input → Sequence                 | Image captioning: one image → generated description
Many-to-Many (encoder-decoder) | Sequence → Sequence (different length)  | Machine translation: English → French

⚠️ Vanishing & Exploding Gradients

What

During backpropagation through time (BPTT), gradients are multiplied through many timesteps. This causes gradients to either shrink to near-zero (vanishing) or grow uncontrollably (exploding).

💥 Exploding Gradients

Problem: Gradient grows exponentially → weights update wildly, training diverges.

Solution: Gradient clipping — if ‖gradient‖ > threshold, scale it down.
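Clipping by norm is simple enough to write by hand (a hand-rolled helper for illustration, not a library call):

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """If ||grad|| exceeds threshold, rescale grad so its norm equals threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

g = np.array([30.0, 40.0])               # norm = 50
clipped = clip_by_norm(g, 5.0)
print(clipped, np.linalg.norm(clipped))  # [3. 4.] 5.0
```

The direction of the gradient is preserved; only its magnitude is capped, which keeps updates from blowing up while still pointing the right way.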

🌫️ Vanishing Gradients

Problem: Gradient shrinks to near zero → early timesteps get no learning signal → can't learn long-range dependencies.

Solutions: ReLU activation, LSTM/GRU (Week 7), better weight initialization.

⚠️ Why Simple RNNs Struggle with Long Sequences

If the relevant context is 50 words back, the gradient gets multiplied through 50 timesteps and typically vanishes along the way. This is why the LSTM (Week 7) was invented: it uses gating mechanisms to selectively remember and forget.
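The 50-step multiplication can be illustrated with a 1-D toy model: if each backward step scales the gradient by roughly a factor w, then after T steps it scales like w**T:

```python
def grad_scale(w, T):
    """Toy model: the BPTT gradient after T steps scales roughly like w**T."""
    return w ** T

print(grad_scale(0.9, 50))  # ≈ 0.005 → vanishes
print(grad_scale(1.1, 50))  # ≈ 117   → explodes
```

Even a factor close to 1 compounds dramatically over 50 steps, in either direction.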

🎮 Vanishing Gradient Visualizer

See how gradient magnitude changes as it propagates back through timesteps.
⚖️ CNN vs RNN for NLP — When to Use Which

Property                | CNN for Text                             | RNN for Text
Captures                | Local n-gram patterns (fixed window)     | Sequential dependencies (full history)
Position-aware          | Partially (local window only)            | Yes (full sequence order)
Training speed          | Fast (parallelizable)                    | Slower (sequential processing)
Long-range dependencies | Poor (limited window size)               | Better (but vanishing gradients)
Best for                | Text classification, sentiment analysis  | NER, machine translation, language modeling
Zero-padding            | Required (all docs same length)          | Used (or bucketing for efficiency)

🔭 What's Coming Next (Week 7+)

You've now built all the foundations. Here's how Week 6 connects to the rest of the course:

Week 7

LSTM & GRU

Fixes RNN's vanishing gradient with gates. Builds directly on Week 6 RNN.

Week 7

Attention Mechanism

LSTM + Attention = model can focus on the most relevant parts of the input.

Week 9

Transformers

Replaces RNN entirely with self-attention. More parallelizable, better at long-range.

Week 9

BERT & GPT

Transformers trained on massive data. Context-dependent embeddings (finally!).

🧪 Quiz Prep — Week 6 / Quiz 6 Preview

Q1. What is the primary advantage of using CNN over a fully-connected neural network for text classification?

Q2. In a CNN for text, if the document matrix has shape (6 × 12) and we apply a filter of shape (3 × 12) with stride 1, what is the feature map size?

Q3. What does the hidden state hₜ in an RNN represent?

Q4. What is the vanishing gradient problem in RNNs and what is a common solution?