🗺️ The Big Picture

Why do we need CNN and RNN after regular Neural Networks?
The Limitation of Fully-Connected Networks Regular (FC) Neural Networks treat every feature independently and in fixed positions. For text and images, position and order matter. "Dog bites man" ≠ "Man bites dog". CNNs capture local patterns; RNNs capture sequential dependencies.
Network Type      | What it captures                       | Position-aware?     | Memory?
FC Neural Network | Global patterns (all features at once) | No                  | No
CNN               | Local patterns (n-gram-like features)  | Yes (local regions) | No
RNN               | Sequential patterns (order-aware)      | Yes (sequence order)| Yes (hidden state)
🔍 Part 1: Convolutional Neural Networks (CNN)

Local pattern detectors for text

🔬 Filters and Convolution — The Core Idea

What

A CNN uses small learnable filters (also called kernels) that slide across the input, performing a dot product at each position. Each filter detects a specific local pattern (e.g., "I love" or a specific visual edge).

Why

Instead of connecting every input feature to every hidden neuron (too many parameters!), each hidden neuron connects only to a small local region. This drastically reduces parameters and forces the network to look for local structure.

How — Text CNN Step by Step

# Step 1: Convert each word to a vector (one-hot or Word2Vec)
#   Document → matrix of shape (n_words × embedding_dim)
#   "great movie love it" → [4 × 12] matrix (e.g., 12-dim one-hot)

# Step 2: Zero-pad all documents to the same length
#   If max_len = 6, pad shorter docs: [4 × 12] → [6 × 12] (add zero rows)

# Step 3: Apply filters (convolution)
#   Filter size: (k × embedding_dim), where k = n-gram window size
#   The filter slides down the rows with stride 1
#   Dot product at each position → one scalar
#   n_words − k + 1 scalars → feature map of size (n − k + 1)

# Step 4: Max pooling
#   Take the MAX value of each feature map → one scalar per filter
#   (captures the strongest activation)

# Step 5: Flatten → FC layer → prediction
#   n_filters scalars → concatenate → classify
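Steps 1–5 can be sketched in a few lines of NumPy. This is a toy illustration with random embeddings and random filters (nothing here is trained), using the same shapes as the example above:

```python
import numpy as np

rng = np.random.default_rng(0)

n_words, emb_dim, k, n_filters = 6, 12, 3, 4   # padded doc length, embedding dim, n-gram size
doc = rng.normal(size=(n_words, emb_dim))      # Steps 1-2: padded document matrix
filters = rng.normal(size=(n_filters, k, emb_dim))

# Step 3: slide each (k × emb_dim) filter down the rows with stride 1,
# taking a dot product (elementwise multiply + sum) at each position
feature_maps = np.array([
    [np.sum(doc[i:i + k] * f) for i in range(n_words - k + 1)]
    for f in filters
])                                             # shape: (n_filters, n_words - k + 1)

# Step 4: max pooling → one scalar per filter
pooled = feature_maps.max(axis=1)              # shape: (n_filters,)

# Step 5: the pooled vector would feed a fully-connected softmax layer
print(feature_maps.shape)  # (4, 4)
print(pooled.shape)        # (4,)
```

With 6 words and a window of k = 3, each filter produces 6 − 3 + 1 = 4 scores, and pooling keeps one per filter.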

🎮 CNN Convolution Visualizer

See how a filter slides over a document matrix and computes the feature map.

CNN for Text — Pipeline

Input Matrix (n × d) → Conv Layer (filters: k × d) → Feature Maps (n − k + 1 per filter) → Max Pooling (1 value per filter) → FC + Softmax (classification)

🏊 Max Pooling — Selecting the Strongest Signal

What

After convolution, each filter produces a feature map (a vector of scores). Max pooling selects the maximum value from this map — capturing "was this pattern detected anywhere in the document?"

Why

We don't care WHERE in the document a positive pattern appears — only WHETHER it appears. Max pooling achieves this position-invariance and further reduces dimensionality, preventing overfitting.

# Example: a filter produces the feature map [0.3, -0.1, 0.8, 0.2, 0.5]
# Max pool → 0.8  ← "yes, this n-gram pattern appeared!"

# For images (2×2 max pool over a 4×4 input → 2×2 output):
# [[3, 1, 2, 4],
#  [2, 4, 1, 0],      [[4, 4],
#  [1, 0, 3, 5],  →    [4, 5]]
#  [2, 4, 1, 2]]
# Each 2×2 block → its maximum value
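Both pooling examples can be checked with a few lines of NumPy (the reshape trick assumes the input divides evenly into 2×2 blocks):

```python
import numpy as np

# 1-D case: max pool over a feature map → the strongest activation
fm = np.array([0.3, -0.1, 0.8, 0.2, 0.5])
print(fm.max())  # 0.8

# 2-D case: 2×2 max pooling over a 4×4 image
img = np.array([[3, 1, 2, 4],
                [2, 4, 1, 0],
                [1, 0, 3, 5],
                [2, 4, 1, 2]])
# Split into 2×2 blocks (row blocks × rows-in-block × col blocks × cols-in-block),
# then take the max within each block
pooled = img.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[4 4]
               #  [4 5]]
```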

CNN Parameter Sharing

The same filter weights are used at every position. This means the network learns to detect a pattern (e.g., "not good") anywhere in the text, using far fewer parameters than a fully-connected layer.

3 Ways CNN Reduces Parameters

  • Local connectivity: each neuron sees only k words
  • Weight sharing: same filter reused at every position
  • Max pooling: subsamples the feature maps
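As a rough sanity check on the savings, compare first-layer parameter counts with some assumed sizes (100-word docs, 300-dim embeddings, 128 FC hidden units vs. 128 trigram filters; these numbers are illustrative, not from the notes):

```python
# Assumed illustrative sizes
n_words, emb_dim = 100, 300   # padded doc length, embedding dim
hidden = 128                  # FC hidden-layer width
k, n_filters = 3, 128         # trigram filters

# FC layer: every input feature connects to every hidden neuron
fc_params = (n_words * emb_dim) * hidden   # 3840000

# Conv layer: one small (k × emb_dim) weight set per filter, shared at every position
cnn_params = n_filters * (k * emb_dim)     # 115200

print(fc_params, cnn_params)
```

Under these assumptions the convolutional layer uses over 30× fewer parameters, before max pooling shrinks things further.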
🔁 Part 2: Recurrent Neural Networks (RNN)

Sequential memory for language

💾 The Hidden State — RNN's Memory

What

An RNN processes the input one word at a time, maintaining a hidden state hₜ that serves as memory — it carries information from all previous words into the current step.

Why

For NER (Named Entity Recognition): "Mahdi" vs "Mahdi teaches" → the word after Mahdi helps confirm it's a person. CNNs see only a fixed window; RNNs see the entire history. Word order matters for understanding meaning.

# At each timestep t, processing word xₜ:
hₜ = tanh( xₜ × Θ¹ + hₜ₋₁ × Θ³ + b )   ← combines the current word AND previous memory
ŷₜ = softmax( hₜ × Θ² )                 ← predict an output at this step (if needed)

# Parameter sharing (KEY property of RNNs):
#   Θ¹, Θ², Θ³ are SHARED across ALL timesteps
#   Total params = d×m + m×m + m×y  (regardless of sequence length!)

# Sizes:
#   xₜ: d-dimensional word vector
#   hₜ: m-dimensional hidden state (m is a hyperparameter)
#   Θ¹: d × m (input → hidden)
#   Θ³: m × m (prev hidden → hidden)
#   Θ²: m × y (hidden → output)
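The forward pass above can be written as a minimal NumPy sketch (random, untrained weights; the dimensions d, m, y are placeholders chosen for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, y = 12, 8, 3                      # word-vector dim, hidden dim, number of classes

Theta1 = rng.normal(size=(d, m)) * 0.1  # input → hidden
Theta3 = rng.normal(size=(m, m)) * 0.1  # prev hidden → hidden (same weights every step)
Theta2 = rng.normal(size=(m, y)) * 0.1  # hidden → output
b = np.zeros(m)

def softmax(z):
    e = np.exp(z - z.max())             # shift for numerical stability
    return e / e.sum()

words = rng.normal(size=(5, d))         # a 5-word "sentence" of random word vectors
h = np.zeros(m)                         # h₀: empty memory
for x_t in words:
    h = np.tanh(x_t @ Theta1 + h @ Theta3 + b)   # combine current word and memory
    y_t = softmax(h @ Theta2)                    # per-step prediction (e.g., NER tag)

print(h.shape, y_t.shape)  # (8,) (3,)
```

Note the loop reuses Theta1, Theta2, Theta3 at every step: the parameter count is fixed no matter how long the sentence is.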

RNN Unrolled — "Mahdi and Wafa teach NLP"

x₁: "Mahdi"  → h₁ → ŷ₁ = PERSON
x₂: "and"    → h₂ → ŷ₂ = OTHER
x₃: "Wafa"   → h₃ → ŷ₃ = PERSON
x₄: "teach"  → h₄ → ŷ₄ = OTHER
x₅: "NLP"    → h₅ → ŷ₅ = OTHER

Each hₜ carries memory of all words before it. Θ¹, Θ², Θ³ are shared at every step — same weights for all words.

🏗️ RNN Architectures — Many-to-Many, Many-to-One, One-to-Many

Architecture                   | Input → Output                          | NLP Example
Many-to-Many                   | Sequence → Sequence (same length)       | NER: label each word in a sentence
Many-to-One                    | Sequence → Single output                | Sentiment analysis: entire doc → positive/negative
One-to-Many                    | Single input → Sequence                 | Image captioning: one image → generated description
Many-to-Many (encoder-decoder) | Sequence → Sequence (different length)  | Machine translation: English → French

⚠️ Vanishing & Exploding Gradients

What

During backpropagation through time (BPTT), gradients are multiplied through many timesteps. This causes gradients to either shrink to near-zero (vanishing) or grow uncontrollably (exploding).

💥 Exploding Gradients

Problem: Gradient grows exponentially → weights update wildly, training diverges.

Solution: Gradient clipping — if ‖gradient‖ > threshold, scale it down.
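Clipping by norm is simple enough to write by hand (a hand-rolled helper for illustration, not a library call):

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """If ||grad|| exceeds threshold, rescale grad so its norm equals threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

g = np.array([30.0, 40.0])               # norm = 50
clipped = clip_by_norm(g, 5.0)
print(clipped, np.linalg.norm(clipped))  # [3. 4.] 5.0
```

The direction of the gradient is preserved; only its magnitude is capped, which keeps updates from blowing up while still pointing the right way.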

🌫️ Vanishing Gradients

Problem: Gradient shrinks to near zero → early timesteps get no learning signal → can't learn long-range dependencies.

Solutions: ReLU activation, LSTM/GRU (Week 7), better weight initialization.

⚠️ Why Simple RNNs Struggle with Long Sequences

If the relevant context is 50 words back, the gradient gets multiplied through 50 timesteps and typically vanishes along the way. This is why the LSTM (Week 7) was invented: it uses gating mechanisms to selectively remember and forget.
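The 50-step multiplication can be illustrated with a 1-D toy model: if each backward step scales the gradient by roughly a factor w, then after T steps it scales like w**T:

```python
def grad_scale(w, T):
    """Toy model: the BPTT gradient after T steps scales roughly like w**T."""
    return w ** T

print(grad_scale(0.9, 50))  # ≈ 0.005 → vanishes
print(grad_scale(1.1, 50))  # ≈ 117   → explodes
```

Even a factor close to 1 compounds dramatically over 50 steps, in either direction.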

🎮 Vanishing Gradient Visualizer

See how gradient magnitude changes as it propagates back through timesteps.
⚖️ CNN vs RNN for NLP — When to Use Which

Property                | CNN for Text                             | RNN for Text
Captures                | Local n-gram patterns (fixed window)     | Sequential dependencies (full history)
Position-aware          | Partially (local window only)            | Yes (full sequence order)
Training speed          | Fast (parallelizable)                    | Slower (sequential processing)
Long-range dependencies | Poor (limited window size)               | Better (but vanishing gradients)
Best for                | Text classification, sentiment analysis  | NER, machine translation, language modeling
Zero-padding            | Required (all docs same length)          | Used (or bucketing for efficiency)

🔭 What's Coming Next (Week 7+)

You've now built all the foundations. Here's how Week 6 connects to the rest of the course:

Week 7

LSTM & GRU

Fixes RNN's vanishing gradient with gates. Builds directly on Week 6 RNN.

Week 7

Attention Mechanism

LSTM + Attention = model can focus on the most relevant parts of the input.

Week 9

Transformers

Replaces RNN entirely with self-attention. More parallelizable, better at long-range.

Week 9

BERT & GPT

Transformers trained on massive data. Context-dependent embeddings (finally!).

🧪 Quiz Prep — Week 6 / Quiz 6 Preview

Q1. What is the primary advantage of using CNN over a fully-connected neural network for text classification?

Q2. In a CNN for text, if the document matrix has shape (6 × 12) and we apply a filter of shape (3 × 12) with stride 1, what is the feature map size?

Q3. What does the hidden state hₜ in an RNN represent?

Q4. What is the vanishing gradient problem in RNNs and what is a common solution?