The foundation of deep learning. Connect many Logistic Regression "blocks" together, train with backpropagation, and get remarkably powerful models.
| Model | Architecture | Captures? |
|---|---|---|
| Perceptron | 1 unit, sign activation | Linear patterns only |
| Logistic Regression | 1 unit, sigmoid activation | Linear, soft boundary |
| Neural Network | Many units, multiple layers, non-linear activations | Non-linear, complex patterns |
An activation function transforms the linear combination of inputs (s = θᵀX) into the output of a neuron. Without non-linear activations, a stack of layers would just be equivalent to one linear layer — useless for complex patterns.
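As a minimal sketch, here are three common activations applied to the same pre-activation values (the specific numbers are illustrative, not from the notes):

```python
import numpy as np

s = np.array([-2.0, 0.0, 2.0])        # pre-activations s = θᵀX for three neurons

sigmoid = 1.0 / (1.0 + np.exp(-s))    # squashes to (0, 1)
tanh = np.tanh(s)                     # squashes to (-1, 1)
relu = np.maximum(0.0, s)             # zeroes negatives, passes positives through
```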
Linear(Linear(X)) = Linear(X) — composing linear transformations gives another linear transformation. Non-linear activations break this, allowing the network to learn arbitrary complex functions.
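The collapse of stacked linear layers can be verified numerically; this is a small demonstration with arbitrary random weights, not any particular trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation in between: y = W2 @ (W1 @ x)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x)

# Exactly the same map as a single linear layer with W = W2 @ W1
one_layer = (W2 @ W1) @ x

assert np.allclose(two_layers, one_layer)
```

No matter how many linear layers are stacked, the products of the weight matrices fold into one matrix, so the model stays linear.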
Input flows forward through each layer: linear combination → activation → output to next layer.
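The forward pass above can be sketched in a few lines. This is a toy illustration (sigmoid everywhere, zero biases), not a production implementation:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(x, layers):
    """For each layer: linear combination, then activation, then on to the next layer."""
    a = x
    for W, b in layers:
        s = W @ a + b      # linear combination s = θᵀX
        a = sigmoid(s)     # non-linear activation becomes the next layer's input
    return a

# Toy 2-layer network: 3 inputs -> 4 hidden units -> 1 output
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 3)), np.zeros(4)),
          (rng.standard_normal((1, 4)), np.zeros(1))]
y = forward(rng.standard_normal(3), layers)
```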
Backpropagation computes the gradient of the loss with respect to every weight in the network, using the chain rule. Weights are then updated with gradient descent.
We need to know "how much did each weight contribute to the error?" so we know which direction to adjust it. For deep networks with millions of weights, backpropagation is the only computationally feasible way to obtain all these gradients: one backward pass reuses intermediate results via the chain rule instead of recomputing each gradient from scratch.
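A minimal worked example of backprop by hand, assuming a tiny one-hidden-layer network with sigmoid activations and squared loss (the shapes and data here are arbitrary, chosen only for illustration):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Tiny network: 2 inputs -> 3 hidden (sigmoid) -> 1 output (sigmoid)
rng = np.random.default_rng(1)
W1 = rng.standard_normal((3, 2))
W2 = rng.standard_normal((1, 3))
x, y = np.array([0.5, -0.2]), np.array([1.0])

# Forward pass (cache intermediates for the backward pass)
h = sigmoid(W1 @ x)
y_hat = sigmoid(W2 @ h)
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: chain rule applied layer by layer
d_out = (y_hat - y) * y_hat * (1 - y_hat)     # dL/ds2 (output pre-activation)
dW2 = np.outer(d_out, h)                      # dL/dW2
d_hid = (W2.T @ d_out) * h * (1 - h)          # dL/ds1 (hidden pre-activation)
dW1 = np.outer(d_hid, x)                      # dL/dW1

# Gradient descent update
lr = 0.1
W2 -= lr * dW2
W1 -= lr * dW1
```

Note that `dW1` reuses `d_out`, which was already computed for `dW2`; this reuse is exactly what makes backprop efficient.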
Word2Vec trains a shallow neural network to perform a prediction task on raw text. The labels are generated from the text itself — no human annotation needed. The embedding matrix (weights) is the real output we care about.
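The "labels generated from the text itself" can be made concrete with a sketch of Skip-Gram-style pair extraction (function name and window size are illustrative assumptions):

```python
def skipgram_pairs(tokens, window=2):
    """Self-generated labels: every (center, context) training pair comes from the raw text."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the quick brown fox".split(), window=1)
# e.g. ('quick', 'the'), ('quick', 'brown'), ...
```

No human ever annotates these pairs; sliding the window over a corpus produces as many training examples as there are word positions.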
GloVe/SVD need the full co-occurrence matrix (V×V, where V is the vocabulary size, which is huge for real corpora). Word2Vec trains on local windows using gradient descent, making it more scalable to very large corpora (billions of words). It also produces dense, meaningful vectors with linear semantic structure (e.g. king − man + woman ≈ queen).
CBoW predicts the center word from its surrounding context words. Given the context window, what word is in the middle?
Faster to train (averages context). Better for frequent words. Better at capturing syntactic relationships (grammar patterns).
Skip-Gram predicts the surrounding context words from a center word. Given the middle word, what are its neighbors?
Works better with small datasets and rare words. Better at capturing semantic relationships (meaning). Often the preferred method in practice for learning high-quality embeddings.
| Property | CBoW | Skip-Gram |
|---|---|---|
| Task | Context → Center word | Center word → Context |
| Training speed | Faster (averages context) | Slower (multiple outputs) |
| Rare words | Poor | Better |
| Captures | Syntactic (grammar) patterns | Semantic (meaning) relationships |
| Dataset size | Better for large datasets | Works well even with small datasets |
| Context handling | Averages all context words (order ignored) | Predicts each context word independently (order ignored) |
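The Context → Center vs Center → Context distinction in the table can be sketched side by side; the function name and sentence are illustrative assumptions:

```python
def training_examples(tokens, i, window=2):
    """Build both the CBoW and the Skip-Gram training examples for position i."""
    context = [tokens[j]
               for j in range(max(0, i - window), min(len(tokens), i + window + 1))
               if j != i]
    cbow = (context, tokens[i])                    # CBoW: context -> center word
    skipgram = [(tokens[i], c) for c in context]   # Skip-Gram: center -> each context word
    return cbow, skipgram

cbow, sg = training_examples("the quick brown fox jumps".split(), i=2, window=2)
# cbow: (['the', 'quick', 'fox', 'jumps'], 'brown')
# sg:   [('brown', 'the'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]
```

One window position yields a single CBoW example but several Skip-Gram examples, which is why Skip-Gram trains more slowly yet sees rare words more often.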
Q1. Which activation function is most commonly used in deep learning due to its optimization-friendly properties?
Q2. In CBoW, the model predicts:
Q3. 1000 training documents, batch size = 50. How many iterations per epoch?
Q4. Both Word2Vec and GloVe produce context-dependent embeddings (different vectors for the same word in different contexts) — True or False?