The foundation of deep learning. Connect many Logistic Regression "blocks" together, train with backpropagation, and get remarkably powerful models.
| Model | Architecture | Captures? |
|---|---|---|
| Perceptron | 1 unit, sign activation | Linear patterns only |
| Logistic Regression | 1 unit, sigmoid activation | Linear, soft boundary |
| Neural Network | Many units, multiple layers, non-linear activations | Non-linear, complex patterns |
An activation function transforms the linear combination of inputs (s = θᵀX) into the output of a neuron. Without non-linear activations, a stack of layers would just be equivalent to one linear layer — useless for complex patterns.
Linear(Linear(X)) = Linear(X): composing linear transformations gives another linear transformation, since W₂(W₁X) = (W₂W₁)X is a single linear layer with weights W₂W₁. Non-linear activations break this collapse, allowing the network to approximate far more complex functions.
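The collapse of stacked linear layers can be checked numerically. The sketch below is only an illustration: the 2×2 weight matrices `W1` and `W2`, the input `x`, and the choice of ReLU as the non-linearity are all made-up assumptions, not values from the notes.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(w, x):
    """Apply matrix w to vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]

def relu(v):
    """Element-wise ReLU, used here as an example non-linear activation."""
    return [max(0.0, t) for t in v]

W1 = [[1.0, -2.0], [0.5, 1.0]]   # hypothetical first-layer weights
W2 = [[2.0, 1.0], [-1.0, 3.0]]   # hypothetical second-layer weights
x = [1.0, 2.0]                   # hypothetical input

stacked = matvec(W2, matvec(W1, x))          # Linear(Linear(x))
collapsed = matvec(matmul(W2, W1), x)        # one layer with weights W2 W1
with_relu = matvec(W2, relu(matvec(W1, x)))  # activation between the layers

print(stacked, collapsed, with_relu)
```

Here `stacked` and `collapsed` agree, while `with_relu` differs because the activation zeroes the negative hidden value before the second layer sees it.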
Input flows forward through each layer: linear combination → activation → output to next layer.
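As a rough sketch of that forward flow, the code below pushes an input through a list of layers, computing a linear combination and then an activation at each one. The network shape, the weights, the bias terms, and the use of sigmoid everywhere are assumptions chosen for illustration.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def layer_forward(weights, biases, inputs):
    """One layer: linear combination per unit, then sigmoid activation."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        s = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(sigmoid(s))
    return outputs

def forward(network, inputs):
    """Feed the output of each layer into the next one."""
    activations = inputs
    for weights, biases in network:
        activations = layer_forward(weights, biases, activations)
    return activations

# Hypothetical network: 2 inputs -> 2 hidden units -> 1 output unit.
network = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                    # output layer
]
prediction = forward(network, [1.0, 1.0])
print(prediction)
```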
Backpropagation computes the gradient of the loss with respect to every weight in the network, using the chain rule. Weights are then updated with gradient descent.
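We need to know "how much did each weight contribute to the error?" so we know which direction to adjust it. For deep networks with millions of weights, backprop is what makes this feasible: it reuses intermediate results to get every gradient for about the cost of one forward and one backward pass, whereas perturbing each weight separately would need a forward pass per weight.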
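A minimal sketch of the chain rule at work, assuming a tiny network with one sigmoid hidden unit, a linear output, and squared-error loss (none of which are specified in the notes). The hand-derived gradients are compared against finite differences, which is a common sanity check.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def loss_fn(w1, w2, x, y):
    """Forward pass: h = sigmoid(w1*x), y_hat = w2*h, loss = 0.5*(y_hat - y)^2."""
    h = sigmoid(w1 * x)
    y_hat = w2 * h
    return 0.5 * (y_hat - y) ** 2

def backprop(w1, w2, x, y):
    """Return (dL/dw1, dL/dw2) by applying the chain rule layer by layer."""
    # Forward pass, keeping intermediate values for reuse.
    h = sigmoid(w1 * x)
    y_hat = w2 * h
    # Backward pass, from the loss back toward the input.
    d_yhat = y_hat - y            # dL/dy_hat
    d_w2 = d_yhat * h             # dL/dw2
    d_h = d_yhat * w2             # dL/dh
    d_s = d_h * h * (1.0 - h)     # dL/ds, using sigmoid'(s) = h(1 - h)
    d_w1 = d_s * x                # dL/dw1
    return d_w1, d_w2

def numerical_grad(f, w, eps=1e-6):
    """Central finite difference, used only to check the analytic gradient."""
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)

w1, w2, x, y = 0.3, -0.8, 1.5, 1.0   # hypothetical weights and data point
g1, g2 = backprop(w1, w2, x, y)
n1 = numerical_grad(lambda w: loss_fn(w, w2, x, y), w1)
n2 = numerical_grad(lambda w: loss_fn(w1, w, x, y), w2)

# One gradient descent step with an assumed learning rate.
lr = 0.1
w1_new, w2_new = w1 - lr * g1, w2 - lr * g2
print(loss_fn(w1, w2, x, y), loss_fn(w1_new, w2_new, x, y))
```

The finite-difference check needs two loss evaluations per weight, which is why it is only practical for verification and not for training.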
Word2Vec trains a shallow neural network to perform a prediction task on raw text. The labels are generated from the text itself — no human annotation needed. The embedding matrix (weights) is the real output we care about.
GloVe/SVD start from the full co-occurrence matrix (|V|×|V|, huge). Word2Vec trains on local windows using gradient descent, so it never has to build that matrix and is more scalable to very large corpora (billions of words). It also produces dense, meaningful vectors with linear semantic structure.
CBoW predicts the center word from its surrounding context words. Given the context window, what word is in the middle?
Faster to train (averages context). Better for frequent words. Better at capturing syntactic relationships (grammar patterns).
Skip-Gram predicts the surrounding context words from a center word. Given the middle word, what are its neighbors?
Works better with small datasets and rare words. Better at capturing semantic relationships (meaning). The preferred method for learning high-quality embeddings.
| Property | CBoW | Skip-Gram |
|---|---|---|
| Task | Context → Center word | Center word → Context |
| Training speed | Faster (averages context) | Slower (multiple outputs) |
| Rare words | Poor | Better |
| Captures | Syntactic (grammar) patterns | Semantic (meaning) relationships |
| Dataset size | Better for large datasets | Works well even with small datasets |
| Context dependency | Position-independent (averages all context) | Position-independent (predicts each) |
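
The two tasks differ mainly in how training examples are cut from each window. The sketch below generates both kinds of examples from a token list; the function names, the window size, and the example sentence are illustrative assumptions, and real implementations add details such as subsampling and dynamic windows.

```python
def cbow_pairs(tokens, window=2):
    """CBoW examples: (list of context words, center word)."""
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, center))
    return pairs

def skipgram_pairs(tokens, window=2):
    """Skip-Gram examples: one (center word, context word) pair per neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps".split()
for context, center in cbow_pairs(tokens, window=1):
    print("CBoW:", context, "->", center)
for center, neighbor in skipgram_pairs(tokens, window=1):
    print("Skip-Gram:", center, "->", neighbor)
```

Note that Skip-Gram produces several examples per position where CBoW produces one, which is consistent with the training-speed row in the table.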
Q1. Which activation function is most commonly used in deep learning due to its optimization-friendly properties?
Q2. In CBoW, the model predicts:
Q3. 1000 training documents, batch size = 50. How many iterations per epoch?
Q4. Both Word2Vec and GloVe produce context-dependent embeddings (different vectors for the same word in different contexts) — True or False?
Why do bigger models perform better? How do you know how much data to train on? What happens when you scale compute vs. parameters? The answers are not just intuitions: they are empirical laws, fitted to many training runs and written as simple formulas. Every NLP researcher in 2026 must know Kaplan and Chinchilla.
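OpenAI (Kaplan et al., 2020) showed that LM validation loss follows smooth power laws in model size N, dataset size D, and compute C, with diminishing returns for each dimension: L(N) = (N_c/N)^α_N, L(D) = (D_c/D)^α_D, L(C) = (C_c/C)^α_C, each holding when the other two factors are not the bottleneck.
N_c, D_c, C_c are critical scale constants fitted to the data, and α_N, α_D, α_C are small positive exponents. Loss decreases as a power law, so every doubling cuts the loss by the same fraction while the absolute gain shrinks: doubling model size from 1B→2B gives a smaller gain than going from 100M→200M.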
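To make the diminishing returns concrete, the sketch below evaluates the model-size law in the form L(N) = (N_c/N)^α_N. The values α_N ≈ 0.076 and N_c ≈ 8.8e13 are the approximate fits reported by Kaplan et al. for non-embedding parameters; treat them as illustrative inputs here, not as constants that transfer to other setups.

```python
ALPHA_N = 0.076   # approximate exponent from Kaplan et al. (assumed here)
N_C = 8.8e13      # approximate scale constant from Kaplan et al. (assumed here)

def loss_from_params(n_params, n_c=N_C, alpha=ALPHA_N):
    """Power-law loss L(N) = (N_c / N) ** alpha, with data and compute not limiting."""
    return (n_c / n_params) ** alpha

# Absolute gain from doubling a small model vs. doubling a larger one.
gain_small = loss_from_params(100e6) - loss_from_params(200e6)
gain_large = loss_from_params(1e9) - loss_from_params(2e9)

# The relative reduction per doubling is the same at every scale.
ratio_small = loss_from_params(200e6) / loss_from_params(100e6)
ratio_large = loss_from_params(2e9) / loss_from_params(1e9)

print(f"100M -> 200M: loss drops by {gain_small:.3f}")
print(f"1B   -> 2B:   loss drops by {gain_large:.3f}")
print(f"loss ratio per doubling: {ratio_small:.4f} vs {ratio_large:.4f}")
```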
DeepMind (Hoffmann et al., 2022) re-ran Kaplan's experiments more carefully, using learning rate schedules matched to each run's length and a wider range of model/data combinations. Conclusion: Kaplan was wrong about the ratio. Kaplan recommended growing model size much faster than data as compute increases; Chinchilla found that model size and dataset size should grow in roughly equal proportion. The optimal model for a given compute budget is therefore much smaller and trained on much more data than Kaplan suggested.
| Model | Params | Training Tokens | Tokens/Param | Chinchilla-Optimal? |
|---|---|---|---|---|
| GPT-3 (2020) | 175B | 300B | 1.7 | ❌ Under-trained |
| Chinchilla (2022) | 70B | 1.4T | 20 | ✅ Optimal |
| LLaMA-2 (2023) | 7B | 2T | 286 | ✅+ Over-trained (inference-efficient) |
| Phi-3-Mini (2024) | 3.8B | 3.3T | 868 | ✅++ Heavily over-trained for deployment |
Note: "over-training" (beyond Chinchilla optimal) is desirable for small models deployed for inference — you spend compute once at training time, but the smaller model is cheaper to run for millions of queries.
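The Tokens/Param column can be recomputed directly, and the same ratio gives a back-of-the-envelope compute-optimal split. The sketch assumes the common approximation that training compute is C ≈ 6·N·D FLOPs and takes the roughly 20 tokens per parameter from the Chinchilla row as the target ratio; both are rules of thumb, not exact results.

```python
import math

# (parameters, training tokens) as listed in the table above.
models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA-2 7B": (7e9, 2e12),
    "Phi-3-Mini": (3.8e9, 3.3e12),
}

def tokens_per_param(params, tokens):
    return tokens / params

def compute_optimal(flops, ratio=20.0):
    """Split a FLOP budget assuming C = 6*N*D and D = ratio*N (both assumptions)."""
    n_params = math.sqrt(flops / (6.0 * ratio))
    return n_params, ratio * n_params

for name, (params, tokens) in models.items():
    print(f"{name}: {tokens_per_param(params, tokens):.1f} tokens/param")

# Budget equal to Chinchilla's own training run under the 6*N*D approximation.
budget = 6.0 * 70e9 * 1.4e12
n_opt, d_opt = compute_optimal(budget)
print(f"optimal params ~ {n_opt:.3g}, optimal tokens ~ {d_opt:.3g}")
```

Feeding Chinchilla's own budget back in recovers about 70B parameters and 1.4T tokens, which is just a consistency check on the two assumptions.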
A new scaling axis emerged in 2024: spending more compute at inference time can substitute for a larger model. OpenAI o1/o3 and DeepSeek-R1 exploit this.
Snell et al. (2024) showed: for math problems, a smaller model with 16× more inference compute matches a 14× larger model at standard inference. This opens a new efficiency frontier — train smaller, reason longer.
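One simple way to see how extra inference compute can buy accuracy is majority voting over several sampled answers. This is a toy illustration only, not the method studied by Snell et al.: it assumes each sample is independently correct with probability `p`, that `k` is odd, and that the final answer is right only when a strict majority of samples is right.

```python
from math import comb

def majority_vote_accuracy(p, k):
    """Probability that more than half of k independent samples are correct."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))

p = 0.6  # hypothetical single-sample accuracy
for k in (1, 3, 5, 15, 45):
    print(f"k={k:2d} samples -> accuracy {majority_vote_accuracy(p, k):.3f}")
```

Under these assumptions accuracy rises with `k` whenever `p` is above 0.5, at the price of `k` times the inference cost. Real samples are correlated, so gains in practice are smaller.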
The slides show Word2Vec. Here's why negative sampling works — and the subtle trick in the noise distribution.
Naive softmax over |V|~100K words per token is expensive. Negative sampling replaces the full softmax with a binary classification objective: does this (word, context) pair appear in real data?
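A minimal sketch of that binary objective for a single (center, context) pair, written in the usual form −log σ(u_o·v_c) − Σ_k log σ(−u_k·v_c), where v_c is the center vector, u_o the true context vector, and u_k the sampled negatives. The 2-dimensional vectors are made up for illustration, and gradients and updates are omitted.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """Loss for one real pair plus k sampled negative (noise) words."""
    # The real pair should score high: label 1.
    loss = -math.log(sigmoid(dot(context_vec, center_vec)))
    # Each sampled noise word should score low: label 0.
    for neg_vec in negative_vecs:
        loss -= math.log(sigmoid(-dot(neg_vec, center_vec)))
    return loss

center = [1.0, 0.0]  # hypothetical center-word vector
good = negative_sampling_loss(center, [2.0, 0.0], [[-2.0, 0.0]])
bad = negative_sampling_loss(center, [-2.0, 0.0], [[2.0, 0.0]])
print(good, bad)
```

The cost per training pair depends on the number of negatives k, not on the vocabulary size, which is where the saving over the full softmax comes from.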
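Uniform sampling over-represents rare words. Raw unigram frequency goes the other way and over-represents common words ("the", "of"). Raising the frequency to the 3/4 power is a middle ground, empirically found to work best: P(w) ∝ count(w)^(3/4), normalized over the vocabulary.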
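The effect of the exponent is easy to see on a toy vocabulary. In the sketch below the word counts are invented, and `power` is the exponent applied to each count before normalizing: 1.0 gives raw unigram frequency, 0.0 gives uniform sampling, and 0.75 gives the Word2Vec noise distribution.

```python
def noise_distribution(counts, power=0.75):
    """Return P(w) proportional to count(w) ** power."""
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}

counts = {"the": 1000, "cat": 10, "zygote": 1}  # hypothetical corpus counts

unigram = noise_distribution(counts, power=1.0)
smoothed = noise_distribution(counts, power=0.75)
uniform = noise_distribution(counts, power=0.0)

for word in counts:
    print(f"{word:7s} unigram={unigram[word]:.4f} "
          f"smoothed={smoothed[word]:.4f} uniform={uniform[word]:.4f}")
```

The 3/4 power shifts some probability mass from "the" toward the rarer words without going all the way to uniform.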