🗺️ The Big Picture — The Problem with Sparse Vectors

The Core Problem (from Weeks 1–2)

One-hot, BoW, and TF-IDF all produce vectors of size equal to the vocabulary size (often 100,000+). They are sparse (mostly zeros) and carry no notion of meaning: "cat" and "kitten" are as different as "cat" and "airplane". This week we fix that.
| Representation | Vector Size | Captures Meaning? | Dense? |
|---|---|---|---|
| One-Hot / BoW / TF-IDF | d (vocab size, huge) | No — all words equidistant | No — mostly zeros |
| SVD on Co-occurrence | k (50–500, chosen) | Partially — context-based | Yes — all values filled |
| GloVe | k (50–300, chosen) | Yes — semantic + linear structure | Yes |
🔢 Part 1: Co-occurrence Matrix

Capturing meaning through context

🧠 The Distributional Hypothesis — Why Context = Meaning

What

"You shall know a word by the company it keeps." — Words that appear in similar contexts have similar meanings.

Why

You've never seen the word "glarg" — but if you saw "I ate the glarg", you'd guess it's edible. Context gives meaning. Co-occurrence matrices capture this: words that often appear near each other get high co-occurrence counts.

How — Building the Matrix

For each word in the corpus, count how many times every other word appears within a sliding window around it (e.g., 2 words on either side). This creates a d×d matrix, where d = vocabulary size.

Corpus: "it was the best of times, it was the worst of times" Window size: 2 # How many times does "was" appear near "it"? "it was the best..." → "was" is within 2 of "it" ✓ "it was the worst..." → "was" is within 2 of "it" ✓ Co-occurrence(it, was) = 2 # Co-occurrence matrix shape: d × d (d = vocab size) # Symmetric: co-occurrence(i,j) = co-occurrence(j,i)

Example Co-occurrence Matrix (window=2, corpus above)

(Counts treat the corpus as one token stream; the comma is ignored.)

|  | it | was | the | best | worst | of | times |
|---|---|---|---|---|---|---|---|
| it | 0 | 2 | 2 | 0 | 0 | 1 | 1 |
| was | 2 | 0 | 2 | 1 | 1 | 0 | 1 |
| the | 2 | 2 | 0 | 1 | 1 | 2 | 0 |
| best | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| worst | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| of | 1 | 0 | 2 | 1 | 1 | 0 | 2 |
| times | 1 | 1 | 0 | 1 | 1 | 2 | 0 |

Notice that "best" and "worst" have matching rows → they appear in the same contexts, so they get the same co-occurrence profile. This is the distributional hypothesis in action.

⚠️ Quiz 4 Tested This! "The size of the co-occurrence matrix is proportional to the number of words in the corpus" → FALSE. It's d×d where d = vocabulary size (unique words), not total corpus size.
⚠️ Quiz 4: Another Trap! "For a corpus with n documents and d distinct words, the co-occurrence matrix has size n×n" → FALSE. It's d×d.
📐 Part 2: Singular Value Decomposition (SVD)

Compressing the co-occurrence matrix into dense embeddings

🗜️ SVD: From d×d Sparse → k-dimensional Dense

What

SVD decomposes any matrix X into three matrices: U, Σ, Vᵀ, such that X = U × Σ × Vᵀ. The key insight: we only keep the TOP K singular values (the most important dimensions), giving us compact, dense word vectors.

Why

The co-occurrence matrix is d×d and very sparse. It's too large and noisy. SVD finds the directions of greatest variance (the directions that best explain the data) — like PCA. By keeping only k dimensions, we get a compressed representation that captures the essential structure.

```
Full SVD:

    X      =      U      ×      Σ      ×      Vᵀ
  (d×d)         (d×d)         (d×d)          (d×d)
  co-occurrence  left singular  singular values  right singular
  matrix         vectors        (diagonal)       vectors

↓ Keep only the top k singular values (truncated SVD)

    X      ≈     U_k     ×     Σ_k     ×     Vᵀ_k
  (d×d)         (d×k)         (k×k)          (k×d)
            WORD EMBEDDINGS ✓
```
```
# U_k = the first k columns of U (shape d × k)
# Row w of U_k = the k-dimensional embedding of word w
#   → dense (all values filled); similar words get similar vectors

# k = number of dimensions to keep (hyperparameter, typically 50–500)
# Valid range: 1 ≤ k ≤ d (can't keep more dimensions than vocab size)

# We lose some information (truncation), but gain:
#   ✓ Denoising (noise averaged out)
#   ✓ Generalization (similar words cluster together)
#   ✓ Efficient computation (k << d)
```
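Here is what truncated SVD looks like in NumPy on the toy matrix (a sketch; the 7×7 counts are computed directly from the corpus above with window 2, punctuation ignored):

```python
import numpy as np

# Co-occurrence matrix for "it was the best of times it was the worst of times"
# (window = 2; row/column order: it, was, the, best, worst, of, times)
X = np.array([
    [0, 2, 2, 0, 0, 1, 1],
    [2, 0, 2, 1, 1, 0, 1],
    [2, 2, 0, 1, 1, 2, 0],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [1, 0, 2, 1, 1, 0, 2],
    [1, 1, 0, 1, 1, 2, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(X)
k = 2
embeddings = U[:, :k]  # row w = k-dimensional embedding of word w

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# "best" (row 3) and "worst" (row 4) have matching contexts,
# so their embeddings end up essentially identical
print(round(cos(embeddings[3], embeddings[4]), 3))
```

The cosine similarity comes out at 1.0: identical rows in X yield identical rows in U_k.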

⚠️ Quiz 4: Range of k "What is the valid range of k for truncated SVD on a d×d co-occurrence matrix?" → [1, d]. You can keep anywhere from 1 to all d dimensions. You cannot have k > d.
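The compression/retention trade-off behind the range [1, d] can be seen directly: as k grows, the rank-k reconstruction error shrinks, reaching zero at k = d. A quick NumPy sketch (random symmetric matrix as a stand-in for a real co-occurrence matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(50, 50)).astype(float)
X = (X + X.T) / 2  # symmetric, like a co-occurrence matrix

U, s, Vt = np.linalg.svd(X)
errors = []
for k in (1, 5, 20, 50):
    X_k = (U[:, :k] * s[:k]) @ Vt[:k]  # rank-k reconstruction
    errors.append(np.linalg.norm(X - X_k) / np.linalg.norm(X))

# Relative error never increases with k, and is ~0 at k = d = 50
print([round(e, 3) for e in errors])
```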
🧲 Part 3: GloVe (Global Vectors for Word Representation)

Better word vectors through co-occurrence probabilities

🌐 GloVe: Beyond Raw Counts

What

GloVe (Global Vectors) trains word vectors so that the dot product of two word vectors (plus bias terms) approximates the log of their co-occurrence count. It combines the strengths of global matrix factorization and local context-window methods.

Why GloVe over SVD?

SVD just factorizes counts (noisy). GloVe uses co-occurrence probabilities and trains with a weighted objective that down-weights very rare and very frequent co-occurrences. This gives cleaner, more meaningful vectors.

How — The Objective

```
# GloVe's key insight:
#   wᵢᵀwⱼ + bᵢ + bⱼ ≈ log(Xᵢⱼ)
#
# where:
#   wᵢ, wⱼ = word vectors for words i and j
#   bᵢ, bⱼ = bias terms
#   Xᵢⱼ    = co-occurrence count of words i and j

# Weighted least-squares cost function:
#   J = Σᵢⱼ f(Xᵢⱼ) · (wᵢᵀwⱼ + bᵢ + bⱼ − log Xᵢⱼ)²
#
#   f(x) = (x / x_max)^α   if x < x_max
#   f(x) = 1               if x ≥ x_max
#
# α = 3/4 worked best in the original paper.
# f(X) down-weights rare co-occurrences (noisy) and caps very frequent ones.
```
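A bare-bones version of this objective can be fit with plain gradient descent (an illustrative sketch, not the paper's setup: the original uses AdaGrad and separate word/context vectors, while this uses one vector set; the toy counts and hyperparameters are my own):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # f(X): down-weights rare pairs, caps frequent ones at 1
    return (x / x_max) ** alpha if x < x_max else 1.0

# Toy co-occurrence counts (made up for illustration)
X = np.array([[0, 4, 2, 0],
              [4, 0, 3, 1],
              [2, 3, 0, 5],
              [0, 1, 5, 0]], dtype=float)
d, k = X.shape[0], 2

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(d, k))  # word vectors
b = np.zeros(d)                          # bias terms
lr = 0.05

def cost():
    # J = Σ f(Xij) · (wiᵀwj + bi + bj − log Xij)² over nonzero pairs
    return sum(glove_weight(X[i, j]) *
               (W[i] @ W[j] + b[i] + b[j] - np.log(X[i, j])) ** 2
               for i in range(d) for j in range(d) if X[i, j] > 0)

before = cost()
for _ in range(500):  # plain SGD over nonzero pairs
    for i in range(d):
        for j in range(d):
            if X[i, j] > 0:
                err = W[i] @ W[j] + b[i] + b[j] - np.log(X[i, j])
                g = glove_weight(X[i, j]) * err
                W[i], W[j] = W[i] - lr * g * W[j], W[j] - lr * g * W[i]
                b[i] -= lr * g
                b[j] -= lr * g
after = cost()
print(after < before)  # cost drops as dot products fit log co-occurrence counts
```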
⚠️ Quiz 4: Key GloVe Facts "GloVe probabilities have nothing to do with the co-occurrence matrix" → FALSE. GloVe is entirely based on co-occurrence statistics.

"GloVe word vectors capture linear structure and semantic meaning" → TRUE.

🔮 Word Analogy: The Magic of Dense Embeddings

What GloVe Captures

GloVe vectors encode linear relationships between words. Vector arithmetic captures meaning:

king − man + woman ≈ queen ✓
England − English + Italian ≈ Italy ✓
nearest neighbors of "summer": winter, spring, autumn ✓
💡 Why this works The semantic concept "gender" lives in the subspace captured by king−man = queen−woman. The model has learned not just word associations but underlying semantic dimensions. This is what sparse vectors (OHE, BoW) can never do.
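With pretrained vectors loaded into a dict, the analogy trick is just vector arithmetic plus a nearest-neighbor search by cosine similarity. A sketch (the 2-D toy vectors below are invented for illustration; real GloVe vectors are 50–300-dimensional):

```python
import numpy as np

def analogy(a, b, c, vecs, topn=1):
    # Answer "a is to b as c is to ?" → word(s) nearest to vecs[b] - vecs[a] + vecs[c]
    target = vecs[b] - vecs[a] + vecs[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    ranked = sorted((w for w in vecs if w not in (a, b, c)),
                    key=lambda w: -cos(target, vecs[w]))
    return ranked[:topn]

# Toy 2-D vectors, made up so that the "gender" direction is the second axis
vecs = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
    "apple": np.array([-0.5, 0.0]),
}
print(analogy("man", "king", "woman", vecs))  # ['queen'] with these toy vectors
```

Excluding the three query words from the candidates matters in practice: the nearest vector to king − man + woman is often "king" itself.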

🧪 Quiz Prep — Week 4 Questions

Q1. The size of the co-occurrence matrix is directly proportional to the number of words in the corpus — True or False?

Q2. For a corpus with n documents and d distinct words, what is the shape of the co-occurrence matrix?

Q3. After performing truncated SVD on a d×d co-occurrence matrix keeping top k values, what is the valid range of k?

Q4. GloVe uses co-occurrence probabilities — not raw counts — as its primary source of information. True or False?