🗂️

What is Topic Modeling?

Unsupervised learning · No labels needed

Core Idea & Motivation

What

Topic modeling is an unsupervised learning technique (no labels needed) to extract topics from documents and find documents that potentially share a common context.

Why

It enables semantic querying — we can retrieve documents that are related to a topic even if they don't contain the exact search keyword. For example, searching for "system" can retrieve a document that only contains "data" and "retrieval" because they share the same underlying database concept.

💡 Key Example We may retrieve documents that don't have the term "system", but they contain "data" and "retrieval" — because they share the same database concept. Topic modeling bridges this semantic gap.

📊 Outputs

  • Document-Concept/Topic matrix — which documents belong to which topic
  • Term/Word-Concept/Topic matrix — which words define each topic

🔑 Key Property

Topics are latent (hidden). We don't know their labels in advance — we label them manually after seeing which words cluster together.

LSI — Latent Semantic Indexing

M11T1L1

Uses SVD (Singular Value Decomposition) to decompose the document-term matrix into concept matrices. Linear algebra approach — fast and efficient.

LDA — Latent Dirichlet Allocation

M11T1L2

Uses Dirichlet distributions and probabilistic sampling. Requires specifying number of topics K in advance. Gold standard for topic modeling.

🔢

Latent Semantic Indexing (LSI)

SVD-based topic modeling · M11T1L1

Step 1: Build the Document-Term Matrix

How

Same as a unigram Bag of Words: each row is a document, each column is a term. Cell value = 1 if the term appears in the document, 0 otherwise (term-frequency counts can be used instead).

        data  system  retrieval  lung  ear
doc1     1      1        1        0     0
doc2     1      1        1        0     0
doc3     0      0        0        1     1
doc4     0      0        0        1     1

doc1 & doc2 = CS/database concept; doc3 & doc4 = medical concept
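The matrix above can be built directly; a minimal NumPy sketch (tokenization and vocabulary order are assumptions):

```python
import numpy as np

# Toy corpus from the table above; the exact tokenization is assumed.
vocab = ["data", "system", "retrieval", "lung", "ear"]
docs = [
    {"data", "system", "retrieval"},  # doc1
    {"data", "system", "retrieval"},  # doc2
    {"lung", "ear"},                  # doc3
    {"lung", "ear"},                  # doc4
]

# Binary bag-of-words: A[i, j] = 1 iff vocab[j] appears in docs[i]
A = np.array([[1 if term in doc else 0 for term in vocab] for doc in docs])
print(A)
```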

Step 2: Apply SVD to Get Concept Matrices

We use SVD to decompose the document-term matrix A into three matrices representing concepts:

A (Doc-Term Matrix, n × d) = U (Doc-Concept Matrix, n × k) × Σ (Concept Strengths, k × k diagonal) × Vᵀ (Term-Concept Matrix, k × d)

📐 Dimensions n = number of documents · d = number of terms · k = number of concepts (at most d, the number of unique terms)
Matrix | Name | Meaning
U (n × k) | Document-Concept Matrix | How strongly each document relates to each concept
Σ (k × k) | Singular Values (diagonal) | Strength/confidence of each concept, sorted descending; higher value = concept appears more in the corpus
Vᵀ (k × d) | Term-Concept Matrix | How strongly each term relates to each concept
✅ Σ Interpretation The diagonal values represent concept confidence. If the CS concept has a higher value than the medical concept, it means the corpus has more CS-related documents — which is reflected in the higher singular value.
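Step 2 can be sketched with NumPy on the toy matrix above, truncating to k = 2 concepts (the choice of k is an assumption):

```python
import numpy as np

A = np.array([
    [1, 1, 1, 0, 0],  # doc1
    [1, 1, 1, 0, 0],  # doc2
    [0, 0, 0, 1, 1],  # doc3
    [0, 0, 0, 1, 1],  # doc4
])

# Full SVD, then truncate to k = 2 concepts
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Singular values come out sorted descending: the CS concept (3 terms
# across 2 docs, sigma = sqrt(6) ≈ 2.45) outranks the medical one (sigma = 2)
print(s_k)
```

Because the toy matrix has rank 2, the k = 2 truncation reconstructs it exactly; on real corpora, k < rank(A) trades information for compactness.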

Step 3: Query Using Concepts

How to Query by Term

Convert the term into a one-hot vector of size d (1 at the term's position, 0 elsewhere). Then multiply by the term-concept matrix Vᵀ to get the concept representation of that term. Compare with document concept vectors from U.

  1. query_vector: one-hot vector of size d (value 1 at the term's position)
  2. concept_repr = Vᵀ × query_vector: maps the query into concept space
  3. Find documents in U with similar concept coordinates
  4. Result: semantically related documents are returned even without an exact keyword match
🔍 Power of LSI A document containing "system" and "retrieval" is retrieved by the query "data" even if it does NOT contain "data" — because both map to the same CS concept via SVD!
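Step 3 as a sketch, reusing the toy matrix from above; comparing query and documents by cosine similarity in concept space is an assumed design choice:

```python
import numpy as np

vocab = ["data", "system", "retrieval", "lung", "ear"]
A = np.array([
    [1, 1, 1, 0, 0], [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1], [0, 0, 0, 1, 1],
])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# One-hot query for "data", mapped into concept space: Vᵀ q
q = np.zeros(len(vocab))
q[vocab.index("data")] = 1.0
q_concept = Vt_k @ q

# Document concept coordinates: rows of U scaled by the concept strengths
doc_concept = U_k * s_k

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# doc1/doc2 score ~1 (same CS concept as "data"); doc3/doc4 score ~0
sims = [cosine(q_concept, d) for d in doc_concept]
```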

LSI Drawbacks

  • Word order neglected — built on Bag of Words; word order (important for semantic meaning) is completely lost
  • Information loss — converting documents to a document-term matrix then performing SVD with k < d loses some information
  • No probability — LSI gives geometric/algebraic similarity, not probabilistic topic distributions
  • Fixed concepts — k must be chosen in advance; hard to interpret latent concepts

🔬 Example: LSI Concept Query

Querying the term "data" maps to the CS/database concept (the stronger one; for the binary toy matrix above its singular value is √6 ≈ 2.45) and retrieves doc1 and doc2. Retrieval works even for documents that do not contain the exact query term.
🎲

Latent Dirichlet Allocation (LDA)

Probabilistic topic modeling · M11T1L2

Core Idea & Setup

What

LDA is an unsupervised probabilistic model. You must specify the number of topics K in advance. LDA finds groups of words that occur together across documents, discovering the hidden topic structure.

⚠️ Key Requirement Unlike LSI, LDA requires you to specify K (number of topics) before training. The right K depends on your dataset and problem domain.
Core Assumption

Documents with similar topics share common words. Topics can be discovered by finding groups of words occurring together across all documents.

📄 Document-Topic Matrix (θ)

  • For each document, the probability of each topic
  • e.g., Doc1: {Sports: 0.1, Food: 0.8, Economy: 0.1}
  • Rows = documents, Columns = topics

📝 Topic-Word Matrix (φ)

  • For each topic, the probability of each word
  • e.g., Sports topic: {football: 0.4, stadium: 0.4, price: 0.01}
  • Rows = topics, Columns = words
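The two matrices can be sketched as NumPy arrays. Doc1's topic row and the Sports word row reuse the example numbers from the text; every other row and name is a hypothetical filler:

```python
import numpy as np

topics = ["Sports", "Food", "Economy"]
words = ["football", "stadium", "price", "recipe"]

# Document-topic matrix theta: rows = documents, columns = topics
theta = np.array([
    [0.1, 0.8, 0.1],   # Doc1: mostly Food (example from the text)
    [0.7, 0.1, 0.2],   # Doc2: hypothetical
])

# Topic-word matrix phi: rows = topics, columns = words
phi = np.array([
    [0.40, 0.40, 0.01, 0.19],  # Sports (example from the text)
    [0.05, 0.05, 0.10, 0.80],  # Food: hypothetical
    [0.10, 0.05, 0.80, 0.05],  # Economy: hypothetical
])

# Every row of each matrix is a probability distribution and sums to 1
print(theta.sum(axis=1), phi.sum(axis=1))
```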

The LDA Probability Equation

LDA models the probability of generating a document as a product of four terms:

p(w, z, θ, φ | α, β) = ∏j=1..M p(θ_j | α) × ∏i=1..K p(φ_i | β) × ∏j=1..M ∏t=1..N [ p(z_{j,t} | θ_j) × p(w_{j,t} | φ, z_{j,t}) ]

  • ① p(θ_j | α): Dirichlet prior on each document's topic distribution (α controls its shape)
  • ② p(φ_i | β): Dirichlet prior on each topic's word distribution (β controls its shape)
  • ③ p(z_{j,t} | θ_j): Multinomial that randomly picks a topic for each word
  • ④ p(w_{j,t} | φ, z_{j,t}): Multinomial that randomly picks the word given the chosen topic
Variables M = number of documents · K = number of topics · N = number of words in document j · α = Dirichlet param for doc-topics · β = Dirichlet param for topic-words · θ = doc-topic dist · φ = topic-word dist · z = topic assignment · w = actual word

The 4 Probability Terms — Each One's Role

① Topics Probability
p(θ_j | α)

Dirichlet distribution for document-topics. Imagine a triangle with K topics at the corners. Tells us the probability distribution over topics for each document. M = # documents.

② Words Probability
p(φ_i | β)

Dirichlet distribution for topic-words. Imagine a triangle with words at corners. Tells us the probability distribution over words for each topic. K = # topics.

③ Randomly Generate Topics
p(z_{j,t} | θ_j)

Multinomial: use the doc-topic Dirichlet to randomly assign a topic to each word in the document. e.g., if Doc1 is 80% Topic 2, each word has an 80% chance of being assigned Topic 2.

④ Randomly Assign Words
p(w_{j,t} | φ, z_{j,t})

Multinomial: given the assigned topic from step ③, use the topic-word Dirichlet to randomly assign a word. e.g., if word is assigned Topic 1 (sports), "football" and "stadium" get highest probability.

✅ Mnemonic ① "What topics does this doc like?" → ② "What words does each topic like?" → ③ "Pick a topic for this word" → ④ "Given that topic, pick a word"
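The four steps of the generative story can be sketched with NumPy; the vocabulary, corpus sizes, and hyperparameters below are all assumed toy values:

```python
import numpy as np

rng = np.random.default_rng(0)

K, M, N = 2, 3, 10          # topics, documents, words per document (toy sizes)
vocab = ["football", "stadium", "price", "market"]
alpha, beta = 0.5, 0.5      # Dirichlet hyperparameters (assumed)

# ② topic-word distributions: phi_i ~ Dirichlet(beta) for each of K topics
phi = rng.dirichlet([beta] * len(vocab), size=K)

docs = []
for j in range(M):
    # ① doc-topic distribution: theta_j ~ Dirichlet(alpha)
    theta = rng.dirichlet([alpha] * K)
    words = []
    for t in range(N):
        z = rng.choice(K, p=theta)              # ③ pick a topic for this word
        w = rng.choice(len(vocab), p=phi[z])    # ④ pick a word from that topic
        words.append(vocab[w])
    docs.append(words)

print(docs[0])
```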

Dirichlet Distribution & α/β Parameters

What

The Dirichlet distribution generates probability distributions. For K=3 topics, it produces a point inside a triangle (simplex) where each corner represents 100% probability for that topic. For higher K, it becomes a higher-dimensional simplex.

α — Doc-Topic Dirichlet

  • High α: documents tend to cover many topics (uniform distribution)
  • Low α: documents focus on 1–2 specific topics (sparse distribution)

β — Topic-Word Dirichlet

  • High β: topics have many words with similar probability
  • Low β: topics are defined by a few key words
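The α effect is easy to check empirically; a sketch comparing the average peak topic probability under a low vs a high α (K = 3 and the α values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3

# Low alpha → sparse/peaked doc-topic distributions; high alpha → near-uniform
low = rng.dirichlet([0.1] * K, size=1000)
high = rng.dirichlet([10.0] * K, size=1000)

# The average maximum topic probability is a simple sparsity proxy:
# close to 1.0 means documents focus on one topic, close to 1/K means uniform
print(low.max(axis=1).mean(), high.max(axis=1).mean())
```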

LDA Training Algorithm (Gibbs Sampling)

  1. Initialize: Randomly assign each word in the corpus to one of K topics
  2. Iterate: For each word in each document, reassign it to a topic based on (a) how often the topic appears in the document and (b) how often the word appears in the topic
  3. Optimize α, β: adjust the Dirichlet parameters and generate M documents; keep the settings whose generated documents are closest to the originals
  4. Gibbs Sampling: used to maximize the probability that the generated documents match the originals; this clusters similar words and similar documents under each topic
  5. Convergence: Stop when topic assignments stabilize
🎯 Goal Maximize the probability of generated documents being similar to the original ones → this effectively clusters similar words and similar documents for each topic.
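The training loop above can be sketched as a minimal collapsed Gibbs sampler. This toy version fixes α and β rather than optimizing them, and the corpus, hyperparameters, and iteration count are all assumed values:

```python
import numpy as np

def lda_gibbs(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA. docs: lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))              # doc-topic counts
    nkw = np.zeros((K, V))                      # topic-word counts
    nk = np.zeros(K)                            # total words per topic
    # Step 1: randomly assign every word in the corpus to a topic
    z = [rng.integers(K, size=len(d)) for d in docs]
    for j, d in enumerate(docs):
        for t, w in enumerate(d):
            ndk[j, z[j][t]] += 1
            nkw[z[j][t], w] += 1
            nk[z[j][t]] += 1
    # Step 2: iterate, reassigning each word until assignments stabilize
    for _ in range(iters):
        for j, d in enumerate(docs):
            for t, w in enumerate(d):
                old = z[j][t]                   # remove the current assignment
                ndk[j, old] -= 1
                nkw[old, w] -= 1
                nk[old] -= 1
                # (a) topic's share of this doc × (b) word's share of each topic
                p = (ndk[j] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                new = rng.choice(K, p=p / p.sum())
                z[j][t] = new
                ndk[j, new] += 1
                nkw[new, w] += 1
                nk[new] += 1
    # Smoothed estimates of the doc-topic and topic-word matrices
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi

# Two well-separated word groups should split cleanly into K = 2 topics
corpus = [[0, 1, 0, 1, 0, 1]] * 3 + [[2, 3, 2, 3, 2, 3]] * 3
theta, phi = lda_gibbs(corpus, V=4, K=2)
```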

⚖️

LSI vs LDA: Head-to-Head

When to use which
Aspect | LSI (Latent Semantic Indexing) | LDA (Latent Dirichlet Allocation)
Approach | Linear algebra (SVD) | Probabilistic (Bayesian)
Output type | Geometric similarity scores | Probability distributions
Concept/topic representation | Latent dimensions via SVD | Dirichlet distributions over words
Requires K upfront? | Not strictly: the full SVD yields up to d concepts, truncated to k afterward | Yes, K must be fixed before training
Interpretability | Harder: dimensions carry no probability meaning | Better: probabilistic topics
Speed | Fast (SVD is efficient) | Slower (iterative sampling)
Query mechanism | Map query to concept space via Vᵀ | Infer topic distribution of new doc
Handles polysemy? | Partially: maps to concept space | Better: words distributed across topics
Word order | Ignored (BoW-based) | Also ignored (BoW-based)
Common use | Information retrieval, search | Document classification, topic discovery
✅ When to use each Use LSI when you need fast semantic search/retrieval and want to find similar documents quickly. Use LDA when you want interpretable probabilistic topics and want to understand what each topic "means" in terms of word distributions.

🎯 Quiz 10 Practice — Topic Modeling

Q1. True or False: Topic modeling with LSI and LDA both require labeled training data (supervised learning).

Both LSI and LDA are unsupervised — no labels are needed. They discover hidden topic structure from raw documents.

Q2. In LSI, which matrix from the SVD decomposition represents the document-concept relationship?

U = document-concept matrix. Σ = concept strengths (diagonal). Vᵀ = term-concept matrix. To query a term, you multiply its one-hot vector by Vᵀ.

Q3. Which of the following statements is FALSE about LDA?

LDA does NOT use word order — it is based on Bag of Words (word frequencies). Topic discovery is based on word co-occurrence, not sequence.

Q4. In LSI, the singular values in the Σ matrix are sorted in descending order. What does a higher singular value for a concept indicate?

Higher singular values indicate stronger, more prominent concepts. For example, if there are more CS documents than medical documents in the corpus, the CS concept will have a higher singular value in Σ.

Q5. In LDA, what is the role of the term p(z_{j,t} | θ_j) in the probability equation?

p(z_{j,t} | θ_j) is the multinomial distribution that randomly assigns a topic to word t in document j, based on the document's topic distribution θ_j. For example, if θ_j says 80% Topic 2, each word has an 80% chance of being assigned Topic 2.