Week 5 – Neural Networks & Word2Vec

🗺️ The Big Picture

How it all connects

The Learning Thread: From Perceptron to Neural Networks

Week 3 gave us a Perceptron (one linear unit + sign activation). Week 5 stacks many such units in layers with non-linear activations → a Neural Network. This enables learning non-linear, complex patterns that a single linear classifier cannot.

Model	Architecture	Captures?
Perceptron	1 unit, sign activation	Linear patterns only
Logistic Regression	1 unit, sigmoid activation	Linear, soft boundary
Neural Network	Many units, multiple layers, non-linear activations	Non-linear, complex patterns

🧠 Part 1: Neural Networks (Fully Connected)

Stacking learning blocks to capture complexity

⚡ Activation Functions — The Non-Linearity Secret

What

An activation function transforms the linear combination of inputs (s = θᵀX) into the output of a neuron. Without non-linear activations, a stack of layers would just be equivalent to one linear layer — useless for complex patterns.

Why non-linearity is essential

Linear(Linear(X)) = Linear(X) — composing linear transformations gives another linear transformation. Non-linear activations break this, allowing the network to learn arbitrary complex functions.
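The collapse of stacked linear layers can be checked numerically. The sketch below uses small hand-picked matrices (hypothetical values, chosen only for illustration) and compares two stacked linear layers with a single layer whose weights are the product W1 @ W2, with and without a ReLU in between.

```python
import numpy as np

# Hypothetical toy values, chosen only to make the arithmetic easy to follow.
X = np.array([[1.0, -1.0], [2.0, 3.0]])    # 2 samples, 2 features
W1 = np.array([[1.0, 2.0], [3.0, -1.0]])   # first layer weights
W2 = np.array([[1.0], [1.0]])              # second layer weights

# Two stacked linear layers equal one linear layer with weights W1 @ W2.
two_layers = (X @ W1) @ W2
one_layer = X @ (W1 @ W2)
linear_collapses = np.allclose(two_layers, one_layer)

# With ReLU between the layers, the single matrix W1 @ W2 no longer matches.
with_relu = np.maximum(X @ W1, 0.0) @ W2
relu_collapses = np.allclose(with_relu, one_layer)

print(linear_collapses, relu_collapses)
```

The first sample has a negative hidden value that ReLU clips to zero, which is exactly what prevents the two layers from being merged into one matrix.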

Activation	Output range	Typical use
Linear	(-∞, +∞)	Regression
Sign / Step	{-1, 0, +1}	Hard classification (Perceptron)
Sigmoid σ	(0, 1)	Soft classification (LR)
ReLU	[0, +∞)	Deep learning ⭐
Tanh	(-1, 1)	RNN hidden states
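As a quick reference, the five activations listed above can be written in a few lines of NumPy. This is a minimal sketch; the function names and the sample input `s` are illustrative choices, not part of the course material.

```python
import numpy as np

def linear(s):
    return s                          # range (-inf, +inf)

def sign(s):
    return np.sign(s)                 # values in {-1, 0, +1}

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))   # range (0, 1)

def relu(s):
    return np.maximum(s, 0.0)         # range [0, +inf)

def tanh(s):
    return np.tanh(s)                 # range (-1, 1)

s = np.array([-2.0, 0.0, 2.0])        # example pre-activations
for f in (linear, sign, sigmoid, relu, tanh):
    print(f.__name__, f(s))
```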

⏩ Forward Pass — Computing the Network's Output

How a Forward Pass Works (Layer by Layer)

Input flows forward through each layer: linear combination → activation → output to next layer.

# For each layer l, neuron j:
u_j^l = Σᵢ θᵢⱼ^l × O_i^(l-1)    ← linear combination
O_j^l = activation(u_j^l)        ← apply activation function

# In matrix form for one layer:
U^l = O^(l-1) × Θ^l              ← all neurons at once
O^l = activation(U^l)

# For NLP (e.g. sentiment analysis):
Input:   document matrix X (N × d, from BoW/TF-IDF)
Layer 1: h = ReLU(X × W₁ + b₁)       shape: N × hidden_size
Layer 2: y = sigmoid(h × W₂ + b₂)    shape: N × 1 (binary classification)
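The two-layer NLP forward pass above can be sketched in NumPy to make the shapes concrete. The sizes (N=4, d=6, hidden_size=3) and the random stand-in for a BoW/TF-IDF matrix are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Layer 1: ReLU(X W1 + b1); Layer 2: sigmoid(h W2 + b2)."""
    h = np.maximum(X @ W1 + b1, 0.0)   # shape (N, hidden_size)
    y = sigmoid(h @ W2 + b2)           # shape (N, 1)
    return h, y

# Hypothetical sizes and random data standing in for a document matrix.
rng = np.random.default_rng(0)
N, d, hidden_size = 4, 6, 3
X = rng.random((N, d))
W1 = rng.normal(scale=0.1, size=(d, hidden_size))
b1 = np.zeros(hidden_size)
W2 = rng.normal(scale=0.1, size=(hidden_size, 1))
b2 = np.zeros(1)

h, y = forward(X, W1, b1, W2, b2)
print(h.shape, y.shape)
```

Each row of `y` is a probability for one document, so thresholding at 0.5 would give a binary label.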

🎮 Neural Network Forward Pass Calculator

Trace a 1-hidden-layer network with 2 inputs and 2 hidden neurons, step by step.

Input x₁: 1.0

Input x₂: 0.5
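A step-by-step trace for the inputs x₁ = 1.0 and x₂ = 0.5 might look like the sketch below. The weights, the ReLU hidden activation, and the sigmoid output are all assumptions made for this example; the original calculator's actual settings are not given in the text.

```python
import math

def relu(u):
    return max(u, 0.0)

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

x1, x2 = 1.0, 0.5

# Hypothetical weights (not taken from the course material).
w11, w21 = 0.5, 1.0     # inputs -> hidden neuron 1
w12, w22 = -1.0, 2.0    # inputs -> hidden neuron 2
v1, v2 = 1.0, -1.0      # hidden -> output

# Hidden layer: linear combination, then activation.
u1 = w11 * x1 + w21 * x2    # 0.5 + 0.5
u2 = w12 * x1 + w22 * x2    # -1.0 + 1.0
h1, h2 = relu(u1), relu(u2)

# Output layer: linear combination, then sigmoid.
u_out = v1 * h1 + v2 * h2
y = sigmoid(u_out)

print(u1, u2, h1, h2, u_out, round(y, 4))
```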

⏪ Backpropagation — How the Network Learns

What

Backpropagation computes the gradient of the loss with respect to every weight in the network, using the chain rule. Weights are then updated with gradient descent.

Why

We need to know "how much did each weight contribute to the error?" so we know which direction to adjust it. For deep networks with millions of weights, backprop is the only feasible way to compute all these gradients efficiently.

# Chain rule (the key insight):
∂L/∂θᵢ = ∂L/∂output × ∂output/∂hidden × ∂hidden/∂θᵢ

# Algorithm:
1. Forward pass: compute all outputs (save intermediate values)
2. Compute loss L (e.g., cross-entropy for classification)
3. Backward pass: propagate gradient from output → input
4. Update: θ ← θ - η × ∂L/∂θ

# Training modes:
Stochastic GD (SGD): 1 sample per iteration → noisy but memory-efficient
Mini-Batch GD: batch_size samples per iteration → balance of speed + stability ⭐
Batch GD: all samples per iteration → stable but slow/memory-heavy

Epoch = one complete pass through all training data
Iteration = one weight update step
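The four algorithm steps can be sketched for a tiny one-hidden-layer network. This is a minimal illustration under stated assumptions: ReLU hidden units, a sigmoid output, binary cross-entropy loss, no bias terms, full-batch updates, and toy data and starting weights that are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(X, y, W1, W2):
    """Forward pass, cross-entropy loss, and gradients via the chain rule."""
    # 1. Forward pass (keep intermediates for the backward pass)
    u1 = X @ W1                 # hidden pre-activation, shape (N, H)
    h = np.maximum(u1, 0.0)     # ReLU
    p = sigmoid(h @ W2)         # predicted probability, shape (N, 1)

    # 2. Binary cross-entropy loss, averaged over samples
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    # 3. Backward pass: output -> hidden -> weights
    n = X.shape[0]
    d_u2 = (p - y) / n          # dL/du2 for sigmoid + cross-entropy
    grad_W2 = h.T @ d_u2        # dL/dW2
    d_h = d_u2 @ W2.T           # dL/dh
    d_u1 = d_h * (u1 > 0)       # ReLU passes gradient only where u1 > 0
    grad_W1 = X.T @ d_u1        # dL/dW1
    return loss, grad_W1, grad_W2

# Toy data (logical OR) and hand-picked starting weights, both hypothetical.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([[1.0], [1.0], [1.0], [0.0]])
W1 = np.array([[0.5, -0.3], [0.2, 0.4]])
W2 = np.array([[0.3], [-0.2]])

eta = 0.5                       # learning rate
losses = []
for _ in range(100):
    loss, grad_W1, grad_W2 = loss_and_grads(X, y, W1, W2)
    losses.append(loss)
    # 4. Update: theta <- theta - eta * dL/dtheta
    W1 = W1 - eta * grad_W1
    W2 = W2 - eta * grad_W2

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because there is no bias term, the all-zero input always gets probability 0.5, so the loss cannot reach zero here; the point is only that repeated chain-rule updates reduce it.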

💡 SGD vs Batch GD: Quick Rule

For 1000 documents with batch_size=50 → 1000 / 50 = 20 iterations per epoch. With SGD (batch=1) → 1000 iterations per epoch. Mini-batch is almost always used in practice.
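The iteration count is just the number of batches needed to cover the data once. A small sketch follows; the helper names `iterations_per_epoch` and `minibatches` are hypothetical, and rounding up for a final partial batch is an assumption about how leftovers are handled.

```python
import math

def iterations_per_epoch(n_samples, batch_size):
    """Number of weight updates in one pass over the data."""
    return math.ceil(n_samples / batch_size)

def minibatches(n_samples, batch_size):
    """Yield index ranges, one per iteration; the last batch may be smaller."""
    for start in range(0, n_samples, batch_size):
        yield range(start, min(start + batch_size, n_samples))

print(iterations_per_epoch(1000, 50))     # mini-batch
print(iterations_per_epoch(1000, 1))      # SGD
print(iterations_per_epoch(1000, 1000))   # batch GD
```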

🔤 Part 2: Word2Vec (CBoW & Skip-Gram)

Self-supervised word embeddings via prediction tasks

💡 The Word2Vec Big Idea — Self-Supervised Learning

What

Word2Vec trains a shallow neural network to perform a prediction task on raw text. The labels are generated from the text itself — no human annotation needed. The embedding matrix (weights) is the real output we care about.

Why

GloVe/SVD need the full co-occurrence matrix (d×d, huge). Word2Vec trains on local windows using gradient descent — more scalable to very large corpora (billions of words). It also produces dense, meaningful vectors with linear semantic structure.

🔑 Why Word2Vec over One-Hot?

One-hot: size = vocab (100k+), no similarity. Word2Vec: size = embedding_dim (100-300), cat≈kitten. Fixed size regardless of corpus.
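The "no similarity" point can be seen with cosine similarity. In the sketch below the dense vectors are hand-made toy values, not trained embeddings; they only illustrate what similar words having similar vectors means.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = ["cat", "kitten", "car"]
one_hot = np.eye(len(vocab))            # one row per word

# Hypothetical 3-dimensional embeddings (real ones have 100-300 dims).
dense = {
    "cat":    np.array([0.90, 0.80, 0.10]),
    "kitten": np.array([0.85, 0.75, 0.20]),
    "car":    np.array([0.10, 0.20, 0.90]),
}

# Any two distinct one-hot vectors are orthogonal: similarity is always 0.
print(cosine(one_hot[0], one_hot[1]))

# Dense vectors can place related words close together.
print(cosine(dense["cat"], dense["kitten"]), cosine(dense["cat"], dense["car"]))
```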

📦 CBoW (Continuous Bag of Words)

What

CBoW predicts the center word from its surrounding context words. Given the context window, what word is in the middle?

Task: "apple [___] are fruit" → predict "orange" (center word)
Context (inputs): ["apple", "are"] (window size = 1)

# Architecture:
Input:  one-hot vectors of the context words (each of size d = vocab size)
Hidden: h = mean of (context one-hot × W), with W of shape d × E   [size E]
Output: softmax(h × W') → probability over all vocab words
Loss:   cross-entropy(predicted, actual_center_word)

# After training:
W (input weight matrix) = word embeddings ← we extract this!
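A single CBoW forward pass can be sketched in NumPy using the sentence "apple orange are fruit" with "orange" as the center word. The embedding size E=3 and the random initial weights are assumptions; `W_out` stands for W' in the notation above.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

vocab = ["apple", "orange", "are", "fruit"]
word_to_id = {w: i for i, w in enumerate(vocab)}
d, E = len(vocab), 3                # vocab size, embedding size (assumed)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(d, E))        # input embeddings (d x E)
W_out = rng.normal(scale=0.1, size=(E, d))    # output weights W' (E x d)

context_ids = [word_to_id["apple"], word_to_id["are"]]
center_id = word_to_id["orange"]

# Multiplying a one-hot vector by W selects a row, so index W directly.
h = W[context_ids].mean(axis=0)     # hidden vector, size E
probs = softmax(h @ W_out)          # distribution over the vocabulary
loss = -np.log(probs[center_id])    # cross-entropy for the true center word

print(probs.round(3), round(float(loss), 3))
```

With untrained weights the distribution is close to uniform; training would adjust `W` and `W_out` to raise the probability of "orange".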

CBoW strengths

Faster to train (averages context). Better for frequent words. Better at capturing syntactic relationships (grammar patterns).

⏭️ Skip-Gram

What

Skip-Gram predicts the surrounding context words from a center word. Given the middle word, what are its neighbors?

Task: given "orange", predict ["apple", "are"] in "apple orange are fruit"

# Architecture:
Input:  center word one-hot × W → hidden vector h (size E)
Output: for EACH context position, softmax(h × W') → one probability distribution per context word
Loss:   sum of cross-entropy losses for each context word

# Skip-gram trains on MORE examples than CBoW:
window=2, 100 words → about 400 (center, context) training pairs: up to 4 neighbors per word, 394 exactly because words near the edges have fewer neighbors. CBoW gets one example per center word (100).
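The (center, context) pairs that Skip-Gram trains on can be generated with a short loop. The function name `skipgram_pairs` is a hypothetical helper; the sketch also counts the pairs for a 100-token text with window=2, which comes out slightly under 4 per word because edge words have fewer neighbors.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every token and each neighbor within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "apple orange are fruit".split()
print(skipgram_pairs(tokens, window=1))

# Pair count for a 100-token text with window=2 (token identity does not matter).
long_text = [f"w{i}" for i in range(100)]
print(len(skipgram_pairs(long_text, window=2)))
```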

Skip-Gram strengths

Works better with small datasets and rare words. Better at capturing semantic relationships (meaning). The preferred method for learning high-quality embeddings.

⚖️ CBoW vs. Skip-Gram — Side by Side

Property	CBoW	Skip-Gram
Task	Context → Center word	Center word → Context
Training speed	Faster (averages context)	Slower (multiple outputs)
Rare words	Poor	Better
Captures	Syntactic (grammar) patterns	Semantic (meaning) relationships
Dataset size	Better for large datasets	Works well even with small datasets
Context dependency	Position-independent (averages all context)	Position-independent (predicts each)

⚠️ Both Word2Vec and GloVe are Context-Independent

"bank" always gets the same vector whether you mean river bank or financial bank. This is a key limitation addressed by later models (BERT, Week 9).

🎮 Word2Vec Window Explorer

See what training examples Word2Vec generates from a sentence.

Window size: 2

🧪 Quiz Prep — Week 5 / Quiz 5 Questions

Q1. Which activation function is most commonly used in deep learning due to its optimization-friendly properties?

Q2. In CBoW, the model predicts:

Q3. 1000 training documents, batch size = 50. How many iterations per epoch?

Q4. Both Word2Vec and GloVe produce context-dependent embeddings (different vectors for the same word in different contexts) — True or False?

📈 Kaplan Scaling Laws (2020)

OpenAI (Kaplan et al., 2020) showed that LM validation loss follows smooth power laws in model size N, dataset size D, and compute C, with diminishing returns for each dimension:

L(N) ≈ (N_c / N)^α_N   where α_N ≈ 0.076
L(D) ≈ (D_c / D)^α_D   where α_D ≈ 0.095
L(C) ≈ (C_c / C)^α_C   where α_C ≈ 0.057

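N_c, D_c, C_c are fitted constants. Because loss follows a power law, every doubling of model size multiplies the loss by the same factor (2^(-0.076) ≈ 0.95, about a 5% reduction), so the absolute gain shrinks as the loss falls: doubling from 1B→2B gives a smaller absolute gain than going from 100M→200M.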
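The diminishing-returns behavior of L(N) can be checked with a few lines of arithmetic. The exponent 0.076 is the value quoted above; the constant `N_C` below is an assumed placeholder used only to turn ratios into absolute numbers and does not affect the relative comparison.

```python
ALPHA_N = 0.076          # exponent quoted above
N_C = 8.8e13             # assumed placeholder constant, for illustration only

def loss(n_params, n_c=N_C, alpha=ALPHA_N):
    """Power-law loss L(N) = (N_c / N) ** alpha."""
    return (n_c / n_params) ** alpha

# Relative effect of doubling is the same at every scale.
ratio_small = loss(200e6) / loss(100e6)
ratio_large = loss(2e9) / loss(1e9)

# Absolute gain is smaller at the larger scale, because the loss is already lower.
drop_small = loss(100e6) - loss(200e6)
drop_large = loss(1e9) - loss(2e9)

print(round(ratio_small, 4), round(ratio_large, 4))
print(drop_small > drop_large)
```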

Key Findings from Kaplan 2020

Model size matters more than dataset size (Kaplan's conclusion)
Larger models are more sample-efficient
Architecture (width vs. depth) matters less than N for fixed C
Motivated GPT-3 (175B) with relatively small training data (~300B tokens)

🦦 Chinchilla Scaling Laws (2022) — The Correction

📖 The Paradigm Shift

DeepMind (Hoffmann et al., 2022) re-ran Kaplan's experiments more carefully — using proper learning rate schedules and a wider range of model/data combinations. Conclusion: Kaplan was wrong about the ratio. Both model size and dataset size should scale equally. The optimal model for a given compute budget is much smaller and trained on much more data than Kaplan suggested.

Optimal: N* ∝ C^0.50, D* ∝ C^0.50
Chinchilla Rule of Thumb: D* ≈ 20 × N*
Example: 70B params → 1.4T tokens
         7B params → 140B tokens
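The rule of thumb can be turned into a small calculator. The sketch below additionally assumes the common approximation C ≈ 6 × N × D training FLOPs, which is not stated in the text above; combined with D = 20 × N it gives N* = sqrt(C / 120), consistent with N* ∝ C^0.50. The function names are hypothetical.

```python
import math

def chinchilla_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb token budget: D* ~= 20 * N*."""
    return tokens_per_param * n_params

def compute_optimal(c_flops, tokens_per_param=20):
    """Split a FLOP budget into (N*, D*), assuming C ~= 6 * N * D (an approximation)."""
    n = math.sqrt(c_flops / (6 * tokens_per_param))
    return n, tokens_per_param * n

print(chinchilla_tokens(70e9))    # tokens for a 70B-parameter model
print(chinchilla_tokens(7e9))     # tokens for a 7B-parameter model

# Quadrupling compute doubles both the optimal model size and the token count.
n1, d1 = compute_optimal(5.88e23)
n2, d2 = compute_optimal(4 * 5.88e23)
print(n2 / n1, d2 / d1)
```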

Real-World Impact

GPT-3 (175B, 300B tokens) was under-trained
Chinchilla (70B, 1.4T tokens) outperformed GPT-3 (175B) and Gopher (280B) with far fewer parameters
LLaMA-2 (7B, 2T tokens) beats GPT-3 on many benchmarks
Modern small models (Phi-3, Gemma) exploit this heavily
Lesson: for a fixed compute budget, data quantity matters at least as much as model size

Model	Params	Training Tokens	Tokens/Param	Chinchilla-Optimal?
GPT-3 (2020)	175B	300B	1.7	❌ Under-trained
Chinchilla (2022)	70B	1.4T	20	✅ Optimal
LLaMA-2 (2023)	7B	2T	286	✅+ Over-trained (inference-efficient)
Phi-3-Mini (2024)	3.8B	3.3T	868	✅++ Heavily over-trained for deployment

Note: "over-training" (beyond Chinchilla optimal) is desirable for small models deployed for inference — you spend compute once at training time, but the smaller model is cheaper to run for millions of queries.
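The Tokens/Param column in the table above is plain division, which the following sketch reproduces from the parameter and token counts listed there.

```python
# (parameters, training tokens) as listed in the table above
models = {
    "GPT-3":      (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA-2":    (7e9, 2e12),
    "Phi-3-Mini": (3.8e9, 3.3e12),
}

CHINCHILLA_RATIO = 20   # rule-of-thumb tokens per parameter

ratios = {name: tokens / params for name, (params, tokens) in models.items()}
for name, r in ratios.items():
    status = "under" if r < CHINCHILLA_RATIO else "at or over"
    print(f"{name}: {r:.1f} tokens/param ({status} the 20x rule of thumb)")
```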

⚡ Test-Time Compute Scaling (2024)

A new scaling axis emerged in 2024: spending more compute at inference time can substitute for a larger model. OpenAI o1/o3 and DeepSeek-R1 exploit this.

Mechanisms

Chain-of-Thought (CoT): more reasoning tokens → better answers
Self-consistency: sample N completions, majority vote
Process Reward Models (PRM): score intermediate steps
Tree search (MCTS): explore reasoning branches
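Of the mechanisms listed, self-consistency is the simplest to sketch: sample several completions and keep the most common final answer. The list of sampled answers below is made up for illustration; in practice each entry would come from a separate sampled model completion, which is not shown here.

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over final answers extracted from N sampled completions."""
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Hypothetical final answers from 5 sampled reasoning chains.
sampled_answers = ["42", "42", "41", "42", "40"]
print(self_consistency(sampled_answers))
```

More samples cost more inference compute, which is the trade-off this section describes.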

Emerging Law

Snell et al. (2024) showed that, in a FLOPs-matched comparison on math problems, a smaller model given additional inference compute can outperform a 14× larger model, on problems where the smaller model already has a non-trivial success rate. This opens a new efficiency frontier: train smaller, reason longer.

🎯 Word2Vec Negative Sampling — The Math

The slides show Word2Vec. Here's why negative sampling works — and the subtle trick in the noise distribution.

Why Negative Sampling?

Naive softmax over |V|~100K words per token is expensive. Negative sampling replaces the full softmax with a binary classification objective: does this (word, context) pair appear in real data?

J(θ) = log σ(v_c · v_w) + Σ_{i=1..k} E_{n_i ~ P_noise}[log σ(-v_{n_i} · v_w)]
where the k negative samples n_i are drawn from P_noise(w) ∝ f(w)^(3/4), and f(w) is the unigram frequency of w.
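For one (word, context) pair with k already-sampled negatives, the objective above can be evaluated directly. This is a sketch with hand-picked 2-dimensional vectors (hypothetical values); the expectation is replaced by the sum over the sampled negatives, as is done in practice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_objective(v_w, v_c, v_negs):
    """log sigma(v_c . v_w) + sum over sampled negatives of log sigma(-v_n . v_w)."""
    positive = np.log(sigmoid(v_c @ v_w))
    negatives = np.sum(np.log(sigmoid(-(v_negs @ v_w))))
    return float(positive + negatives)

v_w = np.array([1.0, 0.0])                      # word vector

# Good embeddings: true context aligned with v_w, negative pointing away.
good = neg_sampling_objective(v_w, np.array([2.0, 0.0]), np.array([[-2.0, 0.0]]))

# Bad embeddings: the roles are swapped.
bad = neg_sampling_objective(v_w, np.array([-2.0, 0.0]), np.array([[2.0, 0.0]]))

print(round(good, 3), round(bad, 3))
```

The objective is maximized, so the higher (less negative) value for the well-separated case is the behavior training pushes toward.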

The 3/4 Power Trick

Uniform sampling ignores frequency, so rare words are drawn as negatives far more often than they occur. Raw unigram frequency errs the other way: common words ("the", "of") dominate the negatives. The 3/4 power is a middle ground, empirically found to work best:

Rare words sampled more than their frequency suggests
Common words sampled less
Empirically beats both uniform and frequency-proportional
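The effect of the 3/4 exponent can be checked on a toy vocabulary. The counts below are made up for illustration; the sketch compares uniform, raw-frequency, and f(w)^(3/4) sampling probabilities.

```python
# Hypothetical word counts for a tiny corpus.
counts = {"the": 1000, "cat": 50, "aardvark": 1}

def normalize(weights):
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

uniform = {w: 1 / len(counts) for w in counts}
unigram = normalize(counts)                                      # P(w) proportional to f(w)
noise = normalize({w: c ** 0.75 for w, c in counts.items()})     # P(w) proportional to f(w)^(3/4)

for w in counts:
    print(f"{w:>9}: uniform={uniform[w]:.4f} unigram={unigram[w]:.4f} noise={noise[w]:.4f}")
```

For each word the 3/4-power probability falls between its uniform and raw-frequency values, which is the middle ground described above.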
