The leap from sparse, meaningless word vectors to dense embeddings that actually capture word meaning: this is where NLP gets powerful.
| Representation | Vector Size | Captures Meaning? | Dense? |
|---|---|---|---|
| One-Hot / BoW / TF-IDF | d (vocab size, huge) | No — all words equidistant | No — mostly zeros |
| SVD on Co-occurrence | k (50-500, chosen) | Partially — context-based | Yes — all values filled |
| GloVe | k (50-300, chosen) | Yes — semantic + linear structure | Yes |
"You shall know a word by the company it keeps." — J.R. Firth (1957). Words that appear in similar contexts have similar meanings.
You've never seen the word "glarg" — but if you saw "I ate the glarg", you'd guess it's edible. Context gives meaning. Co-occurrence matrices capture this: words that often appear near each other get high co-occurrence counts.
For each word in the corpus, count how many times every other word appears within a sliding window of size w around it. This creates a d×d matrix, where d = vocabulary size.
Example Co-occurrence Matrix (window=2, corpus above)
| | it | was | the | best | worst | of | times |
|---|---|---|---|---|---|---|---|
| it | 0 | 2 | 2 | 1 | 0 | 0 | 0 |
| was | 2 | 0 | 2 | 1 | 1 | 0 | 0 |
| the | 2 | 2 | 0 | 1 | 1 | 0 | 0 |
| best | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
| worst | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| of | 0 | 0 | 0 | 1 | 1 | 0 | 2 |
| times | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
Notice that "best" and "worst" have nearly identical rows → they appear in similar contexts!
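A matrix like the one above can be built in a few lines. This is a minimal sketch; the toy corpus (the classic "best of times / worst of times" line) and the window size of 2 are assumptions chosen to mirror the example table:

```python
# Build a co-occurrence matrix with a symmetric sliding window.
import numpy as np

def cooccurrence_matrix(tokens, window=2):
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)), dtype=int)
    for i, word in enumerate(tokens):
        # Count every neighbor within `window` positions on either side.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                X[index[word], index[tokens[j]]] += 1
    return vocab, X

tokens = "it was the best of times it was the worst of times".split()
vocab, X = cooccurrence_matrix(tokens, window=2)
print(vocab)
print(X)  # 7×7 symmetric matrix of neighbor counts
```

Because the window is symmetric, the resulting matrix is symmetric: if "best" appears near "the", then "the" appears near "best" the same number of times.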
SVD decomposes any matrix X into three matrices, U, Σ, and Vᵀ, such that X = U × Σ × Vᵀ. The key insight: we keep only the top k singular values (the most important dimensions), giving us compact, dense word vectors.
The co-occurrence matrix is d×d and very sparse. It's too large and noisy. SVD finds the directions of greatest variance (the directions that best explain the data) — like PCA. By keeping only k dimensions, we get a compressed representation that captures the essential structure.
Truncated SVD: X ≈ Uₖ Σₖ Vₖᵀ, keeping only the top k singular values.
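The truncation step is a one-liner in NumPy. In this sketch, a small random symmetric matrix stands in for a real co-occurrence matrix, and d = 7, k = 2 are arbitrary illustrative choices:

```python
# Truncated SVD: compress a d×d co-occurrence matrix to d×k word vectors.
import numpy as np

rng = np.random.default_rng(0)
d, k = 7, 2                       # vocab size d, reduced dimension k
X = rng.poisson(0.5, size=(d, d)).astype(float)
X = (X + X.T) / 2                 # co-occurrence matrices are symmetric

U, S, Vt = np.linalg.svd(X)       # full SVD: X = U @ diag(S) @ Vt
word_vectors = U[:, :k] * S[:k]   # each row: one dense k-dim word vector

print(word_vectors.shape)         # (7, 2): d words, k dimensions each
```

Each row of `word_vectors` is a dense embedding; words with similar rows in X end up with nearby vectors, since the top singular directions capture the dominant co-occurrence patterns.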
GloVe (Global Vectors) trains word vectors so that the dot product of two word vectors approximates the log of their co-occurrence probability. It combines the strengths of both global matrix factorization and local window methods.
SVD just factorizes raw counts, which are noisy. GloVe uses co-occurrence probabilities and trains with a weighted objective that down-weights very rare co-occurrences and caps the influence of very frequent ones. This gives cleaner, more meaningful vectors.
GloVe vectors encode linear relationships between words. Vector arithmetic captures meaning: the classic example is vec(king) − vec(man) + vec(woman) ≈ vec(queen).
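The analogy trick is just arithmetic plus a nearest-neighbor search by cosine similarity. The 2-D vectors below are invented for the sketch (real GloVe embeddings have 50-300 dimensions), chosen so the gender and royalty directions are visible by eye:

```python
# Toy illustration of king - man + woman ≈ queen with made-up vectors.
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8]),   # dim 0 ~ "royalty", dim 1 ~ "male"
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
}

def nearest(vec, exclude):
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    # Return the most cosine-similar word, skipping the query words.
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(emb[w], vec))

target = emb["king"] - emb["man"] + emb["woman"]          # -> [0.9, 0.2]
print(nearest(target, exclude={"king", "man", "woman"}))  # prints "queen"
```

Excluding the three query words is standard practice: without it, the nearest neighbor of king − man + woman is often "king" itself.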
Q1. The size of the co-occurrence matrix is directly proportional to the number of words in the corpus — True or False?
Q2. For a corpus with n documents and d distinct words, what is the shape of the co-occurrence matrix?
Q3. After performing truncated SVD on a d×d co-occurrence matrix keeping top k values, what is the valid range of k?
Q4. GloVe uses co-occurrence probabilities — not raw counts — as its primary source of information. True or False?