🗺️

The Big Picture

How it all connects
The Learning Thread: From Perceptron to Neural Networks

Week 3 gave us the Perceptron (one linear unit + a sign activation). Week 5 stacks many such units in layers with non-linear activations → a Neural Network. This enables learning the non-linear, complex patterns that a single linear classifier cannot capture.
Model               | Architecture                                        | Captures?
--------------------|-----------------------------------------------------|------------------------------
Perceptron          | 1 unit, sign activation                             | Linear patterns only
Logistic Regression | 1 unit, sigmoid activation                          | Linear, soft boundary
Neural Network      | Many units, multiple layers, non-linear activations | Non-linear, complex patterns
🧠

Part 1: Neural Networks (Fully Connected)

Stacking learning blocks to capture complexity

⚡ Activation Functions — The Non-Linearity Secret

What

An activation function transforms the linear combination of inputs (s = θᵀX) into the output of a neuron. Without non-linear activations, a stack of layers would just be equivalent to one linear layer — useless for complex patterns.

Why non-linearity is essential

Linear(Linear(X)) = Linear(X) — composing linear transformations gives another linear transformation. Non-linear activations break this, allowing the network to learn arbitrary complex functions.
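The collapse of stacked linear layers is easy to verify numerically. A minimal NumPy sketch (matrix shapes are illustrative, not from the course): two weight matrices applied in sequence equal a single weight matrix, while inserting a ReLU between them breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # "layer 1" weights (illustrative shapes)
W2 = rng.normal(size=(4, 2))   # "layer 2" weights
x = rng.normal(size=(5, 3))    # batch of 5 inputs

# Two linear layers collapse into one linear layer with weights W1 @ W2:
two_layers = (x @ W1) @ W2
one_layer = x @ (W1 @ W2)
assert np.allclose(two_layers, one_layer)

# With a ReLU in between, the composition is no longer a linear map:
relu = lambda z: np.maximum(z, 0.0)
nonlinear = relu(x @ W1) @ W2
print(np.allclose(nonlinear, one_layer))  # False
```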

Linear

(-∞, +∞)
→ Regression

Sign / Step

{-1, 0, +1}
→ Hard classification (Perceptron)

Sigmoid σ

(0, 1)
→ Soft classification (LR)

ReLU

[0, +∞)
→ Deep learning ⭐

Tanh

(-1, 1)
→ RNN hidden states
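The activation functions above are one-liners in NumPy; a quick sketch showing each function and its output range on sample inputs:

```python
import numpy as np

def sign(s):    return np.sign(s)                 # outputs in {-1, 0, +1}
def sigmoid(s): return 1.0 / (1.0 + np.exp(-s))   # outputs in (0, 1)
def relu(s):    return np.maximum(s, 0.0)         # outputs in [0, +inf)
def tanh(s):    return np.tanh(s)                 # outputs in (-1, 1)

s = np.array([-2.0, 0.0, 2.0])
print(sign(s))     # [-1.  0.  1.]
print(sigmoid(s))  # values strictly between 0 and 1, sigmoid(0) = 0.5
print(relu(s))     # [0. 0. 2.]
print(tanh(s))     # values strictly between -1 and 1, tanh(0) = 0
```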

⏩ Forward Pass — Computing the Network's Output

How a Forward Pass Works (Layer by Layer)

Input flows forward through each layer: linear combination → activation → output to next layer.

# For each layer l, neuron j:
u_j^l = Σᵢ θᵢⱼ^l × O_i^(l-1)       ← linear combination
O_j^l = activation(u_j^l)           ← apply activation function

# In matrix form for one layer:
U^l = O^(l-1) × Θ^l                 ← all neurons at once
O^l = activation(U^l)

# For NLP (e.g. sentiment analysis):
Input:   document matrix X (N × d, from BoW/TF-IDF)
Layer 1: h = ReLU(X × W₁ + b₁)      shape: N × hidden_size
Layer 2: y = sigmoid(h × W₂ + b₂)   shape: N × 1 (binary classification)
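The two-layer NLP forward pass above runs directly in NumPy. A minimal sketch with made-up sizes (N=4 documents, d=10 BoW features, hidden_size=8 — all illustrative, and the weights are random rather than trained):

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(z):    return np.maximum(z, 0.0)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

N, d, hidden_size = 4, 10, 8
X  = rng.random((N, d))                                 # document matrix (BoW/TF-IDF)
W1 = rng.normal(scale=0.1, size=(d, hidden_size))
b1 = np.zeros(hidden_size)
W2 = rng.normal(scale=0.1, size=(hidden_size, 1))
b2 = np.zeros(1)

h = relu(X @ W1 + b1)       # layer 1: N × hidden_size
y = sigmoid(h @ W2 + b2)    # layer 2: N × 1, one probability per document
print(y.shape)              # (4, 1)
```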

🎮 Neural Network Forward Pass Calculator

Trace a 1-hidden-layer network with 2 inputs and 2 hidden neurons, step by step.

⏪ Backpropagation — How the Network Learns

What

Backpropagation computes the gradient of the loss with respect to every weight in the network, using the chain rule. Weights are then updated with gradient descent.

Why

We need to know "how much did each weight contribute to the error?" so we know which direction to adjust it. For deep networks with millions of weights, backprop is the only practical way to compute all of these gradients.

# Chain rule — the key insight:
∂L/∂θᵢ = ∂L/∂output × ∂output/∂hidden × ∂hidden/∂θᵢ

# Algorithm:
1. Forward pass: compute all outputs (save intermediate values)
2. Compute loss L (e.g., cross-entropy for classification)
3. Backward pass: propagate gradients from output → input
4. Update: θ ← θ − η × ∂L/∂θ

# Training modes:
Stochastic GD (SGD): 1 sample per iteration → noisy but memory-efficient
Mini-Batch GD:       batch_size samples per iteration → balance of speed + stability ⭐
Batch GD:            all samples per iteration → stable but slow/memory-heavy

Epoch     = one complete pass through all training data
Iteration = one weight update step
💡 SGD vs Batch GD: Quick Rule

For 1000 documents with batch_size = 50 → 20 iterations per epoch. With SGD (batch = 1) → 1000 iterations per epoch. Mini-batch is almost always used in practice.
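The whole training loop (forward pass, backward pass via the chain rule, mini-batch update) fits in a few lines of NumPy for the 1-hidden-layer sentiment-style network. A sketch under toy assumptions — the data, labels, and sizes are synthetic, and it uses the same 1000-documents / batch_size = 50 setup as the quick rule, giving 20 iterations per epoch:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):    return np.maximum(z, 0.0)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

# Toy data: 1000 "documents", 20 features; labels from a simple rule.
N, d, H, batch_size, eta = 1000, 20, 16, 50, 0.1
X = rng.random((N, d))
t = (X.sum(axis=1, keepdims=True) > d / 2).astype(float)

W1 = rng.normal(scale=0.1, size=(d, H)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.1, size=(H, 1)); b2 = np.zeros(1)

for epoch in range(5):
    for i in range(0, N, batch_size):            # 20 iterations per epoch
        Xb, tb = X[i:i+batch_size], t[i:i+batch_size]
        # 1. forward pass (save intermediates u1, h for the backward pass)
        u1 = Xb @ W1 + b1
        h = relu(u1)
        y = sigmoid(h @ W2 + b2)
        # 2-3. backward pass: for sigmoid + cross-entropy, dL/d(logit) = y - t
        d_u2 = (y - tb) / batch_size
        dW2 = h.T @ d_u2;  db2 = d_u2.sum(axis=0)
        d_h = d_u2 @ W2.T
        d_u1 = d_h * (u1 > 0)                    # ReLU gradient: 1 where u1 > 0
        dW1 = Xb.T @ d_u1; db1 = d_u1.sum(axis=0)
        # 4. gradient descent update
        W1 -= eta * dW1; b1 -= eta * db1
        W2 -= eta * dW2; b2 -= eta * db2
```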
🔤

Part 2: Word2Vec (CBoW & Skip-Gram)

Self-supervised word embeddings via prediction tasks

💡 The Word2Vec Big Idea — Self-Supervised Learning

What

Word2Vec trains a shallow neural network to perform a prediction task on raw text. The labels are generated from the text itself — no human annotation needed. The embedding matrix (weights) is the real output we care about.

Why

GloVe/SVD need the full co-occurrence matrix (d×d, huge). Word2Vec trains on local windows using gradient descent — more scalable to very large corpora (billions of words). It also produces dense, meaningful vectors with linear semantic structure.

🔑 Why Word2Vec over One-Hot? One-hot vectors have size = vocabulary size (100k+) and encode no similarity. Word2Vec vectors have size = embedding_dim (100–300), place similar words close together (cat ≈ kitten), and keep a fixed size regardless of corpus.

📦 CBoW (Continuous Bag of Words)

What

CBoW predicts the center word from its surrounding context words. Given the context window, what word is in the middle?

Task: "apple [___] are fruit" → predict "and" (center word)
Context (inputs): ["apple", "are"]   (window size = 1)

# Architecture:
Input:  context word one-hot vectors × W (d × E)
Hidden: h = mean(context_word_vectors × W)   [size E]
Output: softmax(h × W') → probability over all vocab words
Loss:   cross-entropy(predicted, actual_center_word)

# After training:
W (input weight matrix) = word embeddings ← we extract this!
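The CBoW forward pass for the example above can be sketched in NumPy. A toy setup (the 4-word vocabulary, embedding size E = 3, and random untrained weights are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["apple", "and", "are", "fruit"]
word2idx = {w: i for i, w in enumerate(vocab)}
V, E = len(vocab), 3               # vocab size, embedding dim (toy values)

W  = rng.normal(scale=0.1, size=(V, E))   # input embeddings — what we keep
Wp = rng.normal(scale=0.1, size=(E, V))   # output weight matrix W'

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Context ["apple", "are"] → predict center word "and"
context = [word2idx["apple"], word2idx["are"]]
h = W[context].mean(axis=0)        # average the context embeddings, size E
p = softmax(h @ Wp)                # probability distribution over the vocab
loss = -np.log(p[word2idx["and"]]) # cross-entropy for the true center word
```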
CBoW strengths

Faster to train (averages context). Better for frequent words. Better at capturing syntactic relationships (grammar patterns).

⏭️ Skip-Gram

What

Skip-Gram predicts the surrounding context words from a center word. Given the middle word, what are its neighbors?

Task: given "orange", predict ["apple", "are"] in "apple orange are fruit"

# Architecture:
Input:  center word one-hot × W → hidden vector h (size E)
Output: for EACH context position, softmax(h × W') → one probability distribution per context word
Loss:   sum of cross-entropy losses over the context words

# Skip-Gram trains on MORE examples than CBoW:
window = 2, 100 words → ≈400 training pairs (each word paired with up to 4 neighbors)
Skip-Gram strengths

Works better with small datasets and rare words. Better at capturing semantic relationships (meaning). The preferred method for learning high-quality embeddings.

⚖️ CBoW vs. Skip-Gram — Side by Side

Property           | CBoW                                         | Skip-Gram
-------------------|----------------------------------------------|--------------------------------------------------
Task               | Context → Center word                        | Center word → Context
Training speed     | Faster (averages context)                    | Slower (multiple outputs)
Rare words         | Poor                                         | Better
Captures           | Syntactic (grammar) patterns                 | Semantic (meaning) relationships
Dataset size       | Better for large datasets                    | Works well even with small datasets
Context dependency | Position-independent (averages all context)  | Position-independent (predicts each context word separately)
⚠️ Both Word2Vec and GloVe are Context-Independent "bank" always gets the same vector whether you mean river bank or financial bank. This is a key limitation fixed by later models (BERT, Week 9).

🎮 Word2Vec Window Explorer

See what training examples Word2Vec generates from a sentence.

🧪 Quiz Prep — Week 5 / Quiz 5 Questions

Q1. Which activation function is most commonly used in deep learning due to its optimization-friendly properties?

Q2. In CBoW, the model predicts:

Q3. 1000 training documents, batch size = 50. How many iterations per epoch?

Q4. Both Word2Vec and GloVe produce context-dependent embeddings (different vectors for the same word in different contexts) — True or False?