🗺️ The Big Picture

How it all connects
The Learning Thread: From Perceptron to Neural Networks

Week 3 gave us a Perceptron (one linear unit + sign activation). Week 5 stacks many such units in layers with non-linear activations → a Neural Network. This enables learning non-linear, complex patterns that single linear classifiers cannot.
Model | Architecture | Captures?
Perceptron | 1 unit, sign activation | Linear patterns only
Logistic Regression | 1 unit, sigmoid activation | Linear, soft boundary
Neural Network | Many units, multiple layers, non-linear activations | Non-linear, complex patterns
🧠 Part 1: Neural Networks (Fully Connected)

Stacking learning blocks to capture complexity

⚡ Activation Functions — The Non-Linearity Secret

What

An activation function transforms the linear combination of inputs (s = θᵀX) into the output of a neuron. Without non-linear activations, a stack of layers would just be equivalent to one linear layer — useless for complex patterns.

Why non-linearity is essential

Linear(Linear(X)) = Linear(X) — composing linear transformations gives another linear transformation. Non-linear activations break this, allowing the network to learn arbitrary complex functions.
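
The collapse of stacked linear layers can be checked numerically. The sketch below is only an illustration: the matrices `X`, `W1`, `W2` and the helper names are made-up example values, not taken from the course material.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def relu(m):
    """Apply ReLU element-wise to a matrix."""
    return [[max(0.0, v) for v in row] for row in m]


# Hypothetical example values: one input row and two 2x2 weight matrices.
X = [[1.0, -2.0]]
W1 = [[1.0, -1.0], [0.5, 2.0]]
W2 = [[2.0, 0.0], [1.0, -1.0]]

# Two linear layers applied in sequence...
two_linear = matmul(matmul(X, W1), W2)
# ...give the same result as one linear layer with weights W1 x W2.
one_linear = matmul(X, matmul(W1, W2))

# With ReLU between the layers, the output no longer matches the collapsed layer.
with_relu = matmul(relu(matmul(X, W1)), W2)

print(two_linear, one_linear, with_relu)
```

Here `two_linear` and `one_linear` agree, while `with_relu` differs, which is the sense in which the non-linearity "breaks" the collapse.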

Activation | Output range | Typical use
Linear | (-∞, +∞) | Regression
Sign / Step | {-1, 0, +1} | Hard classification (Perceptron)
Sigmoid σ | (0, 1) | Soft classification (LR)
ReLU | [0, +∞) | Deep learning ⭐
Tanh | (-1, 1) | RNN hidden states
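
As a minimal sketch, the five activations listed above can be written as plain Python functions of the pre-activation `s`. The function names are chosen here for illustration, and `sigmoid` is the naive formula (it would overflow for extremely negative inputs).

```python
import math


def linear(s):
    """Identity: output range (-inf, +inf)."""
    return s


def sign(s):
    """Sign / step: returns -1, 0, or +1."""
    return (s > 0) - (s < 0)


def sigmoid(s):
    """Logistic sigmoid: output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))


def relu(s):
    """Rectified linear unit: output in [0, +inf)."""
    return max(0.0, s)


def tanh(s):
    """Hyperbolic tangent: output in (-1, 1)."""
    return math.tanh(s)


for name, fn in [("linear", linear), ("sign", sign), ("sigmoid", sigmoid),
                 ("relu", relu), ("tanh", tanh)]:
    print(name, [fn(s) for s in (-2.0, 0.0, 2.0)])
```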

⏩ Forward Pass — Computing the Network's Output

How a Forward Pass Works (Layer by Layer)

Input flows forward through each layer: linear combination → activation → output to next layer.

# For each layer l, neuron j:
u_j^l = Σᵢ θᵢⱼ^l × O_i^(l-1)    ← linear combination
O_j^l = activation(u_j^l)        ← apply activation function

# In matrix form for one layer:
U^l = (O^(l-1)) × Θ^l    ← all neurons at once
O^l = activation(U^l)

# For NLP (e.g. sentiment analysis):
Input:   document matrix X (N × d, from BoW/TF-IDF)
Layer 1: h = ReLU(X × W₁ + b₁)       shape: N × hidden_size
Layer 2: y = sigmoid(h × W₂ + b₂)    shape: N × 1 (binary classification)
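
To make the layer-by-layer recipe concrete, here is a small sketch of a forward pass through a network with 2 inputs, 2 ReLU hidden neurons, and 1 sigmoid output. All weights, biases, and inputs are hypothetical values picked so the arithmetic is easy to follow by hand.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def relu(s):
    return max(0.0, s)


def forward(x, W1, b1, W2, b2):
    """Forward pass: W1[i][j] connects input i to hidden j, W2[j] connects hidden j to the output."""
    # Linear combination for each hidden neuron.
    u_hidden = [sum(x[i] * W1[i][j] for i in range(len(x))) + b1[j]
                for j in range(len(b1))]
    # Non-linear activation of the hidden layer.
    h = [relu(u) for u in u_hidden]
    # Linear combination and sigmoid activation at the output.
    u_out = sum(h[j] * W2[j] for j in range(len(h))) + b2
    y = sigmoid(u_out)
    return u_hidden, h, u_out, y


# Hypothetical example values (not from the course material).
x = [1.0, 2.0]
W1 = [[0.5, -1.0], [0.25, 0.5]]
b1 = [0.0, -0.5]
W2 = [1.0, -2.0]
b2 = 0.0

u_hidden, h, u_out, y = forward(x, W1, b1, W2, b2)
print(u_hidden, h, u_out, y)
```

With these values the second hidden neuron has a negative pre-activation, so ReLU switches it off and only the first neuron contributes to the output.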


⏪ Backpropagation — How the Network Learns

What

Backpropagation computes the gradient of the loss with respect to every weight in the network, using the chain rule. Weights are then updated with gradient descent.

Why

We need to know "how much did each weight contribute to the error?" so we know which direction to adjust it. For deep networks with millions of weights, backprop is the standard way to compute all these gradients efficiently, because it reuses the values saved during the forward pass instead of perturbing each weight separately.

# Chain rule (the key insight):
∂L/∂θᵢ = ∂L/∂output × ∂output/∂hidden × ∂hidden/∂θᵢ

# Algorithm:
1. Forward pass: compute all outputs (save intermediate values)
2. Compute loss L (e.g., cross-entropy for classification)
3. Backward pass: propagate gradient from output → input
4. Update: θ ← θ - η × ∂L/∂θ

# Training modes:
Stochastic GD (SGD): 1 sample per iteration → noisy but memory-efficient
Mini-Batch GD: batch_size samples → balance of speed + stability ⭐
Batch GD: all samples per iteration → stable but slow/memory-heavy

Epoch = one complete pass through all training data
Iteration = one weight update step
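
The four algorithm steps can be sketched for a tiny network. This is an illustrative example only: it assumes a bias-free network with a tanh hidden layer, a sigmoid output, and binary cross-entropy loss, and every weight and input value below is hypothetical.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def forward(x, W1, W2):
    """Forward pass: tanh hidden layer, sigmoid output, no biases."""
    h = [math.tanh(sum(x[i] * W1[i][j] for i in range(len(x))))
         for j in range(len(W2))]
    y = sigmoid(sum(h[j] * W2[j] for j in range(len(h))))
    return h, y


def bce_loss(y, t):
    """Binary cross-entropy for one example with target t in {0, 1}."""
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))


def backward(x, t, W1, W2):
    """Gradients of the loss w.r.t. W1 and W2 via the chain rule."""
    h, y = forward(x, W1, W2)              # steps 1-2: forward pass, keep h and y
    d_out = y - t                          # dL/du_out for sigmoid + cross-entropy
    grad_W2 = [d_out * h[j] for j in range(len(h))]
    # Propagate back through W2 and the tanh derivative (1 - h^2).
    d_hidden = [d_out * W2[j] * (1.0 - h[j] ** 2) for j in range(len(h))]
    grad_W1 = [[x[i] * d_hidden[j] for j in range(len(h))]
               for i in range(len(x))]
    return grad_W1, grad_W2


def sgd_step(x, t, W1, W2, lr=0.1):
    """Step 4: theta <- theta - lr * dL/dtheta."""
    grad_W1, grad_W2 = backward(x, t, W1, W2)
    new_W1 = [[W1[i][j] - lr * grad_W1[i][j] for j in range(len(W2))]
              for i in range(len(x))]
    new_W2 = [W2[j] - lr * grad_W2[j] for j in range(len(W2))]
    return new_W1, new_W2


# Hypothetical single training example and initial weights.
x, t = [1.0, -0.5], 1
W1 = [[0.3, -0.2], [0.1, 0.4]]
W2 = [0.5, -0.3]

loss_before = bce_loss(forward(x, W1, W2)[1], t)
W1_new, W2_new = sgd_step(x, t, W1, W2)
loss_after = bce_loss(forward(x, W1_new, W2_new)[1], t)
print(loss_before, loss_after)
```

A quick sanity check for hand-written gradients like these is to compare them with finite differences of the loss.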
💡 SGD vs Batch GD: Quick Rule

For 1000 documents with batch_size=50 → 1000 / 50 = 20 iterations per epoch. With SGD (batch=1) → 1000 iterations per epoch. Mini-batch is almost always used in practice.
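
The iteration count is just a division. The helper below is a sketch; it assumes that a final partial batch still counts as one iteration, which is a common convention but not something the notes specify.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of weight updates in one pass over the data (last partial batch kept)."""
    return math.ceil(n_samples / batch_size)


print(iterations_per_epoch(1000, 50))    # mini-batch
print(iterations_per_epoch(1000, 1))     # stochastic GD
print(iterations_per_epoch(1000, 1000))  # batch GD
```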
🔤 Part 2: Word2Vec (CBoW & Skip-Gram)

Self-supervised word embeddings via prediction tasks

💡 The Word2Vec Big Idea — Self-Supervised Learning

What

Word2Vec trains a shallow neural network to perform a prediction task on raw text. The labels are generated from the text itself — no human annotation needed. The embedding matrix (weights) is the real output we care about.

Why

GloVe/SVD need the full co-occurrence matrix (d×d, huge). Word2Vec trains on local windows using gradient descent — more scalable to very large corpora (billions of words). It also produces dense, meaningful vectors with linear semantic structure.

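🔑 Why Word2Vec over One-Hot?

One-hot: vector size = vocabulary size (100k+), and every pair of words is equally dissimilar. Word2Vec: vector size = embedding_dim (100-300), and similar words get similar vectors (cat ≈ kitten). The embedding size stays fixed regardless of corpus or vocabulary size.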

📦 CBoW (Continuous Bag of Words)

What

CBoW predicts the center word from its surrounding context words. Given the context window, what word is in the middle?

Task: "apple [___] are fruit" → predict "orange" (center word)
Context (inputs): ["apple", "are"] (window size = 1)

# Architecture:
Input:  one-hot vectors of the context words (each of size d = vocab size)
Hidden: h = mean of (context one-hot × W), with W of shape d × E   [size E]
Output: softmax(h × W') → probability over all vocab words
Loss:   cross-entropy(predicted, actual_center_word)

# After training:
W (input weight matrix) = word embeddings ← we extract this!
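
The CBoW pipeline can be sketched end to end on the toy sentence "apple orange are fruit". Everything below is illustrative: the helper names, the embedding size `E = 3`, and the random initial weights are assumptions, and no training is performed.

```python
import math
import random


def cbow_pairs(tokens, window=1):
    """Build (context words, center word) training examples."""
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, center))
    return pairs


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_forward(context, vocab, W, W_out):
    """W is d x E (one embedding row per word); W_out is E x d."""
    E = len(W[0])
    # Multiplying a one-hot vector by W is just a row lookup.
    rows = [W[vocab[w]] for w in context]
    # Hidden layer: average of the context embeddings.
    h = [sum(r[k] for r in rows) / len(rows) for k in range(E)]
    scores = [sum(h[k] * W_out[k][j] for k in range(E))
              for j in range(len(vocab))]
    return softmax(scores)


tokens = "apple orange are fruit".split()
vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
E = 3                   # hypothetical embedding size
W = [[rng.uniform(-0.5, 0.5) for _ in range(E)] for _ in vocab]
W_out = [[rng.uniform(-0.5, 0.5) for _ in vocab] for _ in range(E)]

pairs = cbow_pairs(tokens, window=1)
probs = cbow_forward(["apple", "are"], vocab, W, W_out)
loss = -math.log(probs[vocab["orange"]])  # cross-entropy for the true center word
print(pairs)
print(probs, loss)
```

After training, the rows of `W` would be the embeddings that get extracted.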
CBoW strengths

Faster to train (averages context). Better for frequent words. Better at capturing syntactic relationships (grammar patterns).

⏭️ Skip-Gram

What

Skip-Gram predicts the surrounding context words from a center word. Given the middle word, what are its neighbors?

Task: given "orange", predict ["apple", "are"] in "apple orange are fruit"

# Architecture:
Input:  center word one-hot × W → hidden vector h (size E)
Output: for EACH context position, softmax(h × W') → one probability distribution per context word
Loss:   sum of cross-entropy losses for each context word

# Skip-gram trains on MORE examples than CBoW:
window=2, 100 words → ≈ 400 training pairs (each word with up to 4 neighbors; slightly fewer at the text edges)
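
Pair generation for Skip-Gram is easy to sketch. The function name and the 100-token placeholder text below are made up for illustration. Note that an exact count for 100 tokens with window 2 comes out slightly under 400, because words near the start and end of the text have fewer than 4 neighbors.

```python
def skipgram_pairs(tokens, window=2):
    """Build (center word, context word) training pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentence = "apple orange are fruit".split()
pairs_w1 = skipgram_pairs(sentence, window=1)
print(pairs_w1)

# Hypothetical 100-token text made of placeholder words w0 ... w99.
long_text = [f"w{i}" for i in range(100)]
n_pairs = len(skipgram_pairs(long_text, window=2))
print(n_pairs)
```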
Skip-Gram strengths

Works better with small datasets and rare words. Better at capturing semantic relationships (meaning). The preferred method for learning high-quality embeddings.

⚖️ CBoW vs. Skip-Gram — Side by Side

Property | CBoW | Skip-Gram
Task | Context → Center word | Center word → Context
Training speed | Faster (averages context) | Slower (multiple outputs)
Rare words | Poor | Better
Captures | Syntactic (grammar) patterns | Semantic (meaning) relationships
Dataset size | Better for large datasets | Works well even with small datasets
Context dependency | Position-independent (averages all context) | Position-independent (predicts each)
⚠️ Both Word2Vec and GloVe are Context-Independent

"bank" always gets the same vector whether you mean river bank or financial bank. This is a key limitation fixed by later models (BERT, Week 9).


🧪 Quiz Prep — Week 5 / Quiz 5 Questions

Q1. Which activation function is most commonly used in deep learning due to its optimization-friendly properties?

Q2. In CBoW, the model predicts:

Q3. 1000 training documents, batch size = 50. How many iterations per epoch?

Q4. Both Word2Vec and GloVe produce context-dependent embeddings (different vectors for the same word in different contexts) — True or False?

🔬 Beyond the Slides · Graduate Depth

Scaling Laws, Negative Sampling & The Science of Scale

Why do bigger models perform better? How do you know how much data to train on? What happens when you scale compute vs. parameters? These questions have more than intuitive answers: there are empirical laws with fitted mathematical forms. Every NLP researcher in 2026 must know Kaplan and Chinchilla.

📈 Kaplan Scaling Laws (2020) · PhD Foundational

OpenAI (Kaplan et al., 2020) showed that LM validation loss follows smooth power laws in model size N, dataset size D, and compute C, with diminishing returns for each dimension:

L(N) ≈ (N_c / N)^α_N    where α_N ≈ 0.076
L(D) ≈ (D_c / D)^α_D    where α_D ≈ 0.095
L(C) ≈ (C_c / C)^α_C    where α_C ≈ 0.057

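N_c, D_c, C_c are fitted scale constants. Loss decreases as a power law, so each doubling cuts the loss by the same fraction while the absolute gain shrinks as the loss falls: doubling model size from 1B→2B gives a smaller absolute gain than going from 100M→200M.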
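
The diminishing-returns claim follows directly from the power-law form and can be checked with a few lines. In this sketch `ALPHA_N` is the exponent quoted above, while `N_C` is a placeholder scale constant chosen only for illustration; the relative improvement per doubling does not depend on it.

```python
ALPHA_N = 0.076   # exponent quoted in the text
N_C = 8.8e13      # placeholder scale constant (an assumption for illustration)


def loss_from_params(n, n_c=N_C, alpha=ALPHA_N):
    """L(N) = (N_c / N) ** alpha."""
    return (n_c / n) ** alpha


def doubling_gain(n):
    """Absolute loss reduction from doubling the parameter count."""
    return loss_from_params(n) - loss_from_params(2 * n)


# Every doubling multiplies the loss by the same factor 2 ** (-alpha)...
ratio = 2 ** (-ALPHA_N)
# ...so the absolute gain is smaller when starting from a lower loss.
gain_small = doubling_gain(100e6)  # 100M -> 200M
gain_large = doubling_gain(1e9)    # 1B -> 2B

print(ratio, gain_small, gain_large)
```

The ratio is roughly 0.95, so each doubling of N removes about 5% of the loss under this fit.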

Key Findings from Kaplan 2020

  • Model size matters more than dataset size (Kaplan's conclusion)
  • Larger models are more sample-efficient
  • Architecture (width vs. depth) matters less than N for fixed C
  • Motivated GPT-3 (175B) with relatively small training data (~300B tokens)

🦦 Chinchilla Scaling Laws (2022) — The Correction

📖 The Paradigm Shift

DeepMind (Hoffmann et al., 2022) re-ran Kaplan's experiments more carefully — using proper learning rate schedules and a wider range of model/data combinations. Conclusion: Kaplan was wrong about the ratio. Both model size and dataset size should scale equally. The optimal model for a given compute budget is much smaller and trained on much more data than Kaplan suggested.

Optimal: N* ∝ C^0.50    D* ∝ C^0.50
Chinchilla Rule of Thumb: D* ≈ 20 × N*
Example: 70B params → 1.4T tokens
         7B params → 140B tokens
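
A rough calculator for the rule of thumb might look like the sketch below. The 20 tokens-per-parameter ratio comes from the text; the compute estimate C ≈ 6 × N × D is a widely used approximation for training FLOPs that is assumed here, not stated in the notes, and the function names are made up.

```python
import math

TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb from the text


def optimal_tokens(n_params):
    """Training tokens suggested by D* = 20 x N*."""
    return TOKENS_PER_PARAM * n_params


def compute_optimal_split(flops, flops_per_param_token=6.0):
    """Split a FLOP budget into (params, tokens), assuming C = 6 * N * D and D = 20 * N."""
    n = math.sqrt(flops / (flops_per_param_token * TOKENS_PER_PARAM))
    return n, TOKENS_PER_PARAM * n


print(optimal_tokens(70e9))  # 70B parameters
print(optimal_tokens(7e9))   # 7B parameters

budget = 6.0 * 70e9 * 1.4e12  # FLOPs implied by the 70B / 1.4T example
n_star, d_star = compute_optimal_split(budget)
n_star_4x, d_star_4x = compute_optimal_split(4 * budget)
print(n_star, d_star, n_star_4x / n_star)
```

Quadrupling the budget doubles both N* and D*, which is what the C^0.50 exponents express.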

Real-World Impact

  • GPT-3 (175B, 300B tokens) was under-trained
  • Chinchilla (70B, 1.4T tokens) outperformed both GPT-3 (175B) and the 4× larger Gopher (280B)
  • LLaMA-2 (7B, 2T tokens) beats GPT-3 on many benchmarks
  • Modern small models (Phi-3, Gemma) exploit this heavily
  • Lesson: for a fixed compute budget, data quantity matters at least as much as model size
Model | Params | Training Tokens | Tokens/Param | Chinchilla-Optimal?
GPT-3 (2020) | 175B | 300B | 1.7 | ❌ Under-trained
Chinchilla (2022) | 70B | 1.4T | 20 | ✅ Optimal
LLaMA-2 (2023) | 7B | 2T | 286 | ✅+ Over-trained (inference-efficient)
Phi-3-Mini (2024) | 3.8B | 3.3T | 868 | ✅++ Heavily over-trained for deployment

Note: "over-training" (beyond Chinchilla optimal) is desirable for small models deployed for inference — you spend compute once at training time, but the smaller model is cheaper to run for millions of queries.
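
The Tokens/Param column can be reproduced from the parameter and token counts in the table. In this sketch the `regime` helper and its 25% tolerance band around 20 tokens per parameter are arbitrary choices made for illustration, not a definition from the papers.

```python
# (parameters, training tokens) as listed in the table above.
models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA-2": (7e9, 2e12),
    "Phi-3-Mini": (3.8e9, 3.3e12),
}


def tokens_per_param(params, tokens):
    return tokens / params


def regime(ratio, optimal=20.0, tolerance=0.25):
    """Label a ratio relative to the 20 tokens/param rule (tolerance is an assumption)."""
    if ratio < optimal * (1 - tolerance):
        return "under-trained"
    if ratio > optimal * (1 + tolerance):
        return "over-trained"
    return "near-optimal"


ratios = {name: tokens_per_param(p, d) for name, (p, d) in models.items()}
for name, r in ratios.items():
    print(f"{name}: {r:.1f} tokens/param, {regime(r)}")
```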

⚡ Test-Time Compute Scaling (2024) · Frontier Research

A new scaling axis emerged in 2024: spending more compute at inference time can substitute for a larger model. OpenAI o1/o3 and DeepSeek-R1 exploit this.

Mechanisms

  • Chain-of-Thought (CoT): more reasoning tokens → better answers
  • Self-consistency: sample N completions, majority vote
  • Process Reward Models (PRM): score intermediate steps
  • Tree search (MCTS): explore reasoning branches
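
Of the mechanisms listed, self-consistency is the simplest to sketch: sample several completions, extract each final answer, and keep the most frequent one. The list of sampled answers below is hypothetical; in practice it would come from repeated model calls, which are outside the scope of this sketch.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and the fraction of samples that agree with it."""
    counts = Counter(answers)
    # On ties, most_common keeps the answer that was seen first.
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Hypothetical final answers extracted from 5 sampled completions.
samples = ["42", "42", "41", "42", "24"]
best, agreement = majority_vote(samples)
print(best, agreement)
```

More samples cost more inference compute, which is exactly the trade-off this scaling axis is about.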

Emerging Law

Snell et al. (2024) showed: for math problems, a smaller model with 16× more inference compute matches a 14× larger model at standard inference. This opens a new efficiency frontier — train smaller, reason longer.

🎯 Word2Vec Negative Sampling: The Math · PhD

The slides show Word2Vec. Here's why negative sampling works — and the subtle trick in the noise distribution.

Why Negative Sampling?

Naive softmax over |V|~100K words per token is expensive. Negative sampling replaces the full softmax with a binary classification objective: does this (word, context) pair appear in real data?

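J(θ) = log σ(v_c · v_w) + Σ_{i=1..k} E_{n_i ~ P_noise} [log σ(-v_{n_i} · v_w)]
where the k negative samples n_i are drawn from P_noise(w) ∝ f(w)^(3/4)

J(θ) is maximized: the first term pushes the true (word, context) pair together, and the second pushes the sampled noise words away.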
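
The objective for a single (word, context) pair can be evaluated directly. This sketch replaces the expectation with a sum over k already-drawn negative vectors, and the 2-dimensional vectors are hypothetical values chosen so the result is easy to verify.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def neg_sampling_objective(v_w, v_c, negatives):
    """J = log s(v_c . v_w) + sum over sampled negatives of log s(-v_n . v_w)."""
    positive = log_sigmoid(dot(v_c, v_w))
    negative = sum(log_sigmoid(-dot(v_n, v_w)) for v_n in negatives)
    return positive + negative


# Hypothetical vectors: the true context is aligned with the word, the negatives point away.
v_w = [1.0, 0.0]
v_c = [2.0, 0.0]
negatives = [[-2.0, 0.0], [-2.0, 0.0]]

J_good = neg_sampling_objective(v_w, v_c, negatives)
# Swapping the roles (context points away, negatives aligned) gives a much lower objective.
J_bad = neg_sampling_objective(v_w, [-2.0, 0.0], [[2.0, 0.0], [2.0, 0.0]])
print(J_good, J_bad)
```

Only k + 1 dot products are needed per training pair, instead of one per vocabulary word as with the full softmax.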

The 3/4 Power Trick

Sampling negatives in proportion to raw frequency over-represents common words ("the", "of"), while uniform sampling swings the other way and over-represents rare words. The 3/4 power is a middle ground, found empirically to work best:

  • Rare words sampled more than their frequency suggests
  • Common words sampled less
  • Empirically beats both uniform and frequency-proportional
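
The effect of the 3/4 power is easy to see on a toy frequency table. The word counts below are invented for illustration, and the helper names are assumptions rather than part of any Word2Vec library.

```python
import random


def noise_distribution(counts, power=0.75):
    """P_noise(w) proportional to count(w) ** power."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


def sample_negatives(dist, k, rng):
    """Draw k negative words according to the noise distribution."""
    words = list(dist)
    return rng.choices(words, weights=[dist[w] for w in words], k=k)


# Hypothetical word counts from a corpus.
counts = {"the": 10000, "bank": 100, "kitten": 1}

unigram = noise_distribution(counts, power=1.0)    # frequency-proportional
smoothed = noise_distribution(counts, power=0.75)  # the 3/4 power
uniform = noise_distribution(counts, power=0.0)    # every word equally likely

for w in counts:
    print(w, round(unigram[w], 5), round(smoothed[w], 5), round(uniform[w], 5))

negatives = sample_negatives(smoothed, k=5, rng=random.Random(0))
print(negatives)
```

In the printed table, the smoothed probability of "kitten" sits above its raw-frequency share and the smoothed probability of "the" sits below it, while both remain far from the uniform 1/3.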