🗺️ The Big Picture — The Problem with Sparse Vectors

The Core Problem (from Weeks 1–2)

One-hot, BoW, and TF-IDF all produce vectors of size equal to the vocabulary size (often 100,000+). They are sparse (mostly zeros) and carry no notion of meaning: "cat" and "kitten" are as different as "cat" and "airplane". This week we fix that.
| Representation | Vector Size | Captures Meaning? | Dense? |
|---|---|---|---|
| One-Hot / BoW / TF-IDF | d (vocab size, huge) | No — all words equidistant | No — mostly zeros |
| SVD on Co-occurrence | k (50–500, chosen) | Partially — context-based | Yes — all values filled |
| GloVe | k (50–300, chosen) | Yes — semantic + linear structure | Yes |
🔢 Part 1: Co-occurrence Matrix

Capturing meaning through context

🧠 The Distributional Hypothesis — Why Context = Meaning

What

"You shall know a word by the company it keeps." — Words that appear in similar contexts have similar meanings.

Why

You've never seen the word "glarg" — but if you saw "I ate the glarg", you'd guess it's edible. Context gives meaning. Co-occurrence matrices capture this: words that often appear near each other get high co-occurrence counts.

How — Building the Matrix

For each word in the corpus, count how many times every other word appears within a sliding window around it (e.g., 2 words on either side). This creates a d×d matrix, where d = vocabulary size.

Corpus: "it was the best of times, it was the worst of times" Window size: 2 # How many times does "was" appear near "it"? "it was the best..." → "was" is within 2 of "it" ✓ "it was the worst..." → "was" is within 2 of "it" ✓ Co-occurrence(it, was) = 2 # Co-occurrence matrix shape: d × d (d = vocab size) # Symmetric: co-occurrence(i,j) = co-occurrence(j,i)

Example Co-occurrence Matrix (window=2, corpus above)

(Counts treat the corpus as one token stream; the comma is ignored.)

|  | it | was | the | best | worst | of | times |
|---|---|---|---|---|---|---|---|
| it | 0 | 2 | 2 | 0 | 0 | 1 | 1 |
| was | 2 | 0 | 2 | 1 | 1 | 0 | 1 |
| the | 2 | 2 | 0 | 1 | 1 | 2 | 0 |
| best | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| worst | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| of | 1 | 0 | 2 | 1 | 1 | 0 | 2 |
| times | 1 | 1 | 0 | 1 | 1 | 2 | 0 |

Notice that "best" and "worst" have matching rows → they appear in the same contexts, so they get the same co-occurrence profile. This is the distributional hypothesis in action.

⚠️ Quiz 4 Tested This! "The size of the co-occurrence matrix is proportional to the number of words in the corpus" → FALSE. It's d×d where d = vocabulary size (unique words), not total corpus size.
⚠️ Quiz 4: Another Trap! "For a corpus with n documents and d distinct words, the co-occurrence matrix has size n×n" → FALSE. It's d×d.
📐 Part 2: Singular Value Decomposition (SVD)

Compressing the co-occurrence matrix into dense embeddings

🗜️ SVD: From d×d Sparse → k-dimensional Dense

What

SVD decomposes any matrix X into three matrices: U, Σ, Vᵀ, such that X = U × Σ × Vᵀ. The key insight: we only keep the TOP K singular values (the most important dimensions), giving us compact, dense word vectors.

Why

The co-occurrence matrix is d×d and very sparse. It's too large and noisy. SVD finds the directions of greatest variance (the directions that best explain the data) — like PCA. By keeping only k dimensions, we get a compressed representation that captures the essential structure.

```
Full SVD:

    X      =      U      ×      Σ      ×      Vᵀ
  (d×d)         (d×d)         (d×d)          (d×d)
  co-occurrence  left singular  singular values  right singular
  matrix         vectors        (diagonal)       vectors

↓ Keep only the top k singular values (truncated SVD)

    X      ≈     U_k     ×     Σ_k     ×     Vᵀ_k
  (d×d)         (d×k)         (k×k)          (k×d)
            WORD EMBEDDINGS ✓
```
```
# U_k = the first k columns of U (shape d × k)
# Row w of U_k = the k-dimensional embedding of word w
#   → dense (all values filled); similar words get similar vectors

# k = number of dimensions to keep (hyperparameter, typically 50–500)
# Valid range: 1 ≤ k ≤ d (can't keep more dimensions than vocab size)

# We lose some information (truncation), but gain:
#   ✓ Denoising (noise averaged out)
#   ✓ Generalization (similar words cluster together)
#   ✓ Efficient computation (k << d)
```
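Here is what truncated SVD looks like in NumPy on the toy matrix (a sketch; the 7×7 counts are computed directly from the corpus above with window 2, punctuation ignored):

```python
import numpy as np

# Co-occurrence matrix for "it was the best of times it was the worst of times"
# (window = 2; row/column order: it, was, the, best, worst, of, times)
X = np.array([
    [0, 2, 2, 0, 0, 1, 1],
    [2, 0, 2, 1, 1, 0, 1],
    [2, 2, 0, 1, 1, 2, 0],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [1, 0, 2, 1, 1, 0, 2],
    [1, 1, 0, 1, 1, 2, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(X)
k = 2
embeddings = U[:, :k]  # row w = k-dimensional embedding of word w

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# "best" (row 3) and "worst" (row 4) have matching contexts,
# so their embeddings end up essentially identical
print(round(cos(embeddings[3], embeddings[4]), 3))
```

The cosine similarity comes out at 1.0: identical rows in X yield identical rows in U_k.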

⚠️ Quiz 4: Range of k "What is the valid range of k for truncated SVD on a d×d co-occurrence matrix?" → [1, d]. You can keep anywhere from 1 to all d dimensions. You cannot have k > d.
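The compression/retention trade-off behind the range [1, d] can be seen directly: as k grows, the rank-k reconstruction error shrinks, reaching zero at k = d. A quick NumPy sketch (random symmetric matrix as a stand-in for a real co-occurrence matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(50, 50)).astype(float)
X = (X + X.T) / 2  # symmetric, like a co-occurrence matrix

U, s, Vt = np.linalg.svd(X)
errors = []
for k in (1, 5, 20, 50):
    X_k = (U[:, :k] * s[:k]) @ Vt[:k]  # rank-k reconstruction
    errors.append(np.linalg.norm(X - X_k) / np.linalg.norm(X))

# Relative error never increases with k, and is ~0 at k = d = 50
print([round(e, 3) for e in errors])
```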
🧲 Part 3: GloVe (Global Vectors for Word Representation)

Better word vectors through co-occurrence probabilities

🌐 GloVe: Beyond Raw Counts

What

GloVe (Global Vectors) trains word vectors so that the dot product of two word vectors (plus bias terms) approximates the log of their co-occurrence count. It combines the strengths of global matrix factorization and local context-window methods.

Why GloVe over SVD?

SVD just factorizes counts (noisy). GloVe uses co-occurrence probabilities and trains with a weighted objective that down-weights very rare and very frequent co-occurrences. This gives cleaner, more meaningful vectors.

How — The Objective

```
# GloVe's key insight:
#   wᵢᵀwⱼ + bᵢ + bⱼ ≈ log(Xᵢⱼ)
#
# where:
#   wᵢ, wⱼ = word vectors for words i and j
#   bᵢ, bⱼ = bias terms
#   Xᵢⱼ    = co-occurrence count of words i and j

# Weighted least-squares cost function:
#   J = Σᵢⱼ f(Xᵢⱼ) · (wᵢᵀwⱼ + bᵢ + bⱼ − log Xᵢⱼ)²
#
#   f(x) = (x / x_max)^α   if x < x_max
#   f(x) = 1               if x ≥ x_max
#
# α = 3/4 worked best in the original paper.
# f(X) down-weights rare co-occurrences (noisy) and caps very frequent ones.
```
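A bare-bones version of this objective can be fit with plain gradient descent (an illustrative sketch, not the paper's setup: the original uses AdaGrad and separate word/context vectors, while this uses one vector set; the toy counts and hyperparameters are my own):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # f(X): down-weights rare pairs, caps frequent ones at 1
    return (x / x_max) ** alpha if x < x_max else 1.0

# Toy co-occurrence counts (made up for illustration)
X = np.array([[0, 4, 2, 0],
              [4, 0, 3, 1],
              [2, 3, 0, 5],
              [0, 1, 5, 0]], dtype=float)
d, k = X.shape[0], 2

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(d, k))  # word vectors
b = np.zeros(d)                          # bias terms
lr = 0.05

def cost():
    # J = Σ f(Xij) · (wiᵀwj + bi + bj − log Xij)² over nonzero pairs
    return sum(glove_weight(X[i, j]) *
               (W[i] @ W[j] + b[i] + b[j] - np.log(X[i, j])) ** 2
               for i in range(d) for j in range(d) if X[i, j] > 0)

before = cost()
for _ in range(500):  # plain SGD over nonzero pairs
    for i in range(d):
        for j in range(d):
            if X[i, j] > 0:
                err = W[i] @ W[j] + b[i] + b[j] - np.log(X[i, j])
                g = glove_weight(X[i, j]) * err
                W[i], W[j] = W[i] - lr * g * W[j], W[j] - lr * g * W[i]
                b[i] -= lr * g
                b[j] -= lr * g
after = cost()
print(after < before)  # cost drops as dot products fit log co-occurrence counts
```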
⚠️ Quiz 4: Key GloVe Facts "GloVe probabilities have nothing to do with the co-occurrence matrix" → FALSE. GloVe is entirely based on co-occurrence statistics.

"GloVe word vectors capture linear structure and semantic meaning" → TRUE.

🔮 Word Analogy: The Magic of Dense Embeddings

What GloVe Captures

GloVe vectors encode linear relationships between words. Vector arithmetic captures meaning:

king − man + woman ≈ queen ✓
England − English + Italian ≈ Italy ✓
nearest neighbors of "summer": winter, spring, autumn ✓
💡 Why this works The semantic concept "gender" lives in the subspace captured by king−man = queen−woman. The model has learned not just word associations but underlying semantic dimensions. This is what sparse vectors (OHE, BoW) can never do.
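With pretrained vectors loaded into a dict, the analogy trick is just vector arithmetic plus a nearest-neighbor search by cosine similarity. A sketch (the 2-D toy vectors below are invented for illustration; real GloVe vectors are 50–300-dimensional):

```python
import numpy as np

def analogy(a, b, c, vecs, topn=1):
    # Answer "a is to b as c is to ?" → word(s) nearest to vecs[b] - vecs[a] + vecs[c]
    target = vecs[b] - vecs[a] + vecs[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    ranked = sorted((w for w in vecs if w not in (a, b, c)),
                    key=lambda w: -cos(target, vecs[w]))
    return ranked[:topn]

# Toy 2-D vectors, made up so that the "gender" direction is the second axis
vecs = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
    "apple": np.array([-0.5, 0.0]),
}
print(analogy("man", "king", "woman", vecs))  # ['queen'] with these toy vectors
```

Excluding the three query words from the candidates matters in practice: the nearest vector to king − man + woman is often "king" itself.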

🧪 Quiz Prep — Week 4 Questions

Q1. The size of the co-occurrence matrix is directly proportional to the number of words in the corpus — True or False?

Q2. For a corpus with n documents and d distinct words, what is the shape of the co-occurrence matrix?

Q3. After performing truncated SVD on a d×d co-occurrence matrix keeping top k values, what is the valid range of k?

Q4. GloVe uses co-occurrence probabilities — not raw counts — as its primary source of information. True or False?