🗂️

What is Topic Modeling?

Unsupervised learning · No labels needed

Core Idea & Motivation

What

Topic modeling is an unsupervised learning technique (no labels needed) to extract topics from documents and find documents that potentially share a common context.

Why

It enables semantic querying — we can retrieve documents that are related to a topic even if they don't contain the exact search keyword. For example, searching for "system" can retrieve a document that only contains "data" and "retrieval" because they share the same underlying database concept.

💡 Key Example We may retrieve documents that don't have the term "system", but they contain "data" and "retrieval" — because they share the same database concept. Topic modeling bridges this semantic gap.

📊 Outputs

  • Document-Concept/Topic matrix — which documents belong to which topic
  • Term/Word-Concept/Topic matrix — which words define each topic

🔑 Key Property

Topics are latent (hidden). We don't know their labels in advance — we label them manually after seeing which words cluster together.

LSI — Latent Semantic Indexing

M11T1L1

Uses SVD (Singular Value Decomposition) to decompose the document-term matrix into concept matrices. Linear algebra approach — fast and efficient.

LDA — Latent Dirichlet Allocation

M11T1L2

Uses Dirichlet distributions and probabilistic sampling. Requires specifying number of topics K in advance. Gold standard for topic modeling.

🔢

Latent Semantic Indexing (LSI)

SVD-based topic modeling · M11T1L1

Step 1: Build the Document-Term Matrix

How

Same as a unigram Bag of Words: each row is a document, each column is a term. Cell value = 1 if the term appears in the document, 0 otherwise (term-frequency counts can be used instead).

        data  system  retrieval  lung  ear
doc1     1      1        1        0     0
doc2     1      1        1        0     0
doc3     0      0        0        1     1
doc4     0      0        0        1     1

doc1 & doc2 = CS/database concept; doc3 & doc4 = medical concept
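The matrix above can be built directly; a minimal NumPy sketch (tokenization and vocabulary order are assumptions):

```python
import numpy as np

# Toy corpus from the table above; the exact tokenization is assumed.
vocab = ["data", "system", "retrieval", "lung", "ear"]
docs = [
    {"data", "system", "retrieval"},  # doc1
    {"data", "system", "retrieval"},  # doc2
    {"lung", "ear"},                  # doc3
    {"lung", "ear"},                  # doc4
]

# Binary bag-of-words: A[i, j] = 1 iff vocab[j] appears in docs[i]
A = np.array([[1 if term in doc else 0 for term in vocab] for doc in docs])
print(A)
```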

Step 2: Apply SVD to Get Concept Matrices

We use SVD to decompose the document-term matrix A into three matrices representing concepts:

A (Doc-Term Matrix, n × d) = U (Doc-Concept Matrix, n × k) × Σ (Concept Strengths, k × k diagonal) × Vᵀ (Term-Concept Matrix, k × d)

📐 Dimensions n = number of documents · d = number of terms · k = number of concepts (at most d, the number of unique terms)
Matrix | Name | Meaning
U (n × k) | Document-Concept Matrix | How strongly each document relates to each concept
Σ (k × k) | Singular Values (diagonal) | Strength/confidence of each concept, sorted descending; higher value = concept appears more in the corpus
Vᵀ (k × d) | Term-Concept Matrix | How strongly each term relates to each concept
✅ Σ Interpretation The diagonal values represent concept confidence. If the CS concept has a higher value than the medical concept, it means the corpus has more CS-related documents — which is reflected in the higher singular value.
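Step 2 can be sketched with NumPy on the toy matrix above, truncating to k = 2 concepts (the choice of k is an assumption):

```python
import numpy as np

A = np.array([
    [1, 1, 1, 0, 0],  # doc1
    [1, 1, 1, 0, 0],  # doc2
    [0, 0, 0, 1, 1],  # doc3
    [0, 0, 0, 1, 1],  # doc4
])

# Full SVD, then truncate to k = 2 concepts
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Singular values come out sorted descending: the CS concept (3 terms
# across 2 docs, sigma = sqrt(6) ≈ 2.45) outranks the medical one (sigma = 2)
print(s_k)
```

Because the toy matrix has rank 2, the k = 2 truncation reconstructs it exactly; on real corpora, k < rank(A) trades information for compactness.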

Step 3: Query Using Concepts

How to Query by Term

Convert the term into a one-hot vector of size d (1 at the term's position, 0 elsewhere). Then multiply by the term-concept matrix Vᵀ to get the concept representation of that term. Compare with document concept vectors from U.

  1. query_vector: one-hot vector of size d (value 1 at the term's position)
  2. concept_repr = Vᵀ × query_vector: maps the query into concept space
  3. Find documents in U with similar concept coordinates
  4. Result: semantically related documents are returned even without an exact keyword match
🔍 Power of LSI A document containing "system" and "retrieval" is retrieved by the query "data" even if it does NOT contain "data" — because both map to the same CS concept via SVD!
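Step 3 as a sketch, reusing the toy matrix from above; comparing query and documents by cosine similarity in concept space is an assumed design choice:

```python
import numpy as np

vocab = ["data", "system", "retrieval", "lung", "ear"]
A = np.array([
    [1, 1, 1, 0, 0], [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1], [0, 0, 0, 1, 1],
])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# One-hot query for "data", mapped into concept space: Vᵀ q
q = np.zeros(len(vocab))
q[vocab.index("data")] = 1.0
q_concept = Vt_k @ q

# Document concept coordinates: rows of U scaled by the concept strengths
doc_concept = U_k * s_k

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# doc1/doc2 score ~1 (same CS concept as "data"); doc3/doc4 score ~0
sims = [cosine(q_concept, d) for d in doc_concept]
```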

LSI Drawbacks

  • Word order neglected — built on Bag of Words; word order (important for semantic meaning) is completely lost
  • Information loss — converting documents to a document-term matrix then performing SVD with k < d loses some information
  • No probability — LSI gives geometric/algebraic similarity, not probabilistic topic distributions
  • Fixed concepts — k must be chosen in advance; hard to interpret latent concepts

🔬 Example: LSI Concept Query

Querying the term "data" maps to the CS/database concept (the stronger one; for the binary toy matrix above its singular value is √6 ≈ 2.45) and retrieves doc1 and doc2. Retrieval works even for documents that do not contain the exact query term.
🎲

Latent Dirichlet Allocation (LDA)

Probabilistic topic modeling · M11T1L2

Core Idea & Setup

What

LDA is an unsupervised probabilistic model. You must specify the number of topics K in advance. LDA finds groups of words that occur together across documents, discovering the hidden topic structure.

⚠️ Key Requirement Unlike LSI, LDA requires you to specify K (number of topics) before training. The right K depends on your dataset and problem domain.
Core Assumption

Documents with similar topics share common words. Topics can be discovered by finding groups of words occurring together across all documents.

📄 Document-Topic Matrix (θ)

  • For each document, the probability of each topic
  • e.g., Doc1: {Sports: 0.1, Food: 0.8, Economy: 0.1}
  • Rows = documents, Columns = topics

📝 Topic-Word Matrix (φ)

  • For each topic, the probability of each word
  • e.g., Sports topic: {football: 0.4, stadium: 0.4, price: 0.01}
  • Rows = topics, Columns = words
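The two matrices can be sketched as NumPy arrays. Doc1's topic row and the Sports word row reuse the example numbers from the text; every other row and name is a hypothetical filler:

```python
import numpy as np

topics = ["Sports", "Food", "Economy"]
words = ["football", "stadium", "price", "recipe"]

# Document-topic matrix theta: rows = documents, columns = topics
theta = np.array([
    [0.1, 0.8, 0.1],   # Doc1: mostly Food (example from the text)
    [0.7, 0.1, 0.2],   # Doc2: hypothetical
])

# Topic-word matrix phi: rows = topics, columns = words
phi = np.array([
    [0.40, 0.40, 0.01, 0.19],  # Sports (example from the text)
    [0.05, 0.05, 0.10, 0.80],  # Food: hypothetical
    [0.10, 0.05, 0.80, 0.05],  # Economy: hypothetical
])

# Every row of each matrix is a probability distribution and sums to 1
print(theta.sum(axis=1), phi.sum(axis=1))
```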

The LDA Probability Equation

LDA models the probability of generating a document as a product of four terms:

p(w, z, θ, φ | α, β) = ∏j=1..M p(θ_j | α) × ∏i=1..K p(φ_i | β) × ∏j=1..M ∏t=1..N [ p(z_{j,t} | θ_j) × p(w_{j,t} | φ, z_{j,t}) ]

  • ① p(θ_j | α): Dirichlet prior on each document's topic distribution (α controls its shape)
  • ② p(φ_i | β): Dirichlet prior on each topic's word distribution (β controls its shape)
  • ③ p(z_{j,t} | θ_j): Multinomial that randomly picks a topic for each word
  • ④ p(w_{j,t} | φ, z_{j,t}): Multinomial that randomly picks the word given the chosen topic
Variables M = number of documents · K = number of topics · N = number of words in document j · α = Dirichlet param for doc-topics · β = Dirichlet param for topic-words · θ = doc-topic dist · φ = topic-word dist · z = topic assignment · w = actual word

The 4 Probability Terms — Each One's Role

① Topics Probability
p(θ_j | α)

Dirichlet distribution for document-topics. Imagine a triangle with K topics at the corners. Tells us the probability distribution over topics for each document. M = # documents.

② Words Probability
p(φ_i | β)

Dirichlet distribution for topic-words. Imagine a triangle with words at corners. Tells us the probability distribution over words for each topic. K = # topics.

③ Randomly Generate Topics
p(z_{j,t} | θ_j)

Multinomial: use the doc-topic Dirichlet to randomly assign a topic to each word in the document. e.g., if Doc1 is 80% Topic 2, each word has an 80% chance of being assigned Topic 2.

④ Randomly Assign Words
p(w_{j,t} | φ, z_{j,t})

Multinomial: given the assigned topic from step ③, use the topic-word Dirichlet to randomly assign a word. e.g., if word is assigned Topic 1 (sports), "football" and "stadium" get highest probability.

✅ Mnemonic ① "What topics does this doc like?" → ② "What words does each topic like?" → ③ "Pick a topic for this word" → ④ "Given that topic, pick a word"
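The four steps of the generative story can be sketched with NumPy; the vocabulary, corpus sizes, and hyperparameters below are all assumed toy values:

```python
import numpy as np

rng = np.random.default_rng(0)

K, M, N = 2, 3, 10          # topics, documents, words per document (toy sizes)
vocab = ["football", "stadium", "price", "market"]
alpha, beta = 0.5, 0.5      # Dirichlet hyperparameters (assumed)

# ② topic-word distributions: phi_i ~ Dirichlet(beta) for each of K topics
phi = rng.dirichlet([beta] * len(vocab), size=K)

docs = []
for j in range(M):
    # ① doc-topic distribution: theta_j ~ Dirichlet(alpha)
    theta = rng.dirichlet([alpha] * K)
    words = []
    for t in range(N):
        z = rng.choice(K, p=theta)              # ③ pick a topic for this word
        w = rng.choice(len(vocab), p=phi[z])    # ④ pick a word from that topic
        words.append(vocab[w])
    docs.append(words)

print(docs[0])
```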

Dirichlet Distribution & α/β Parameters

What

The Dirichlet distribution generates probability distributions. For K=3 topics, it produces a point inside a triangle (simplex) where each corner represents 100% probability for that topic. For higher K, it becomes a higher-dimensional simplex.

α — Doc-Topic Dirichlet

  • High α: documents tend to cover many topics (uniform distribution)
  • Low α: documents focus on 1–2 specific topics (sparse distribution)

β — Topic-Word Dirichlet

  • High β: topics have many words with similar probability
  • Low β: topics are defined by a few key words
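The α effect is easy to check empirically; a sketch comparing the average peak topic probability under a low vs a high α (K = 3 and the α values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3

# Low alpha → sparse/peaked doc-topic distributions; high alpha → near-uniform
low = rng.dirichlet([0.1] * K, size=1000)
high = rng.dirichlet([10.0] * K, size=1000)

# The average maximum topic probability is a simple sparsity proxy:
# close to 1.0 means documents focus on one topic, close to 1/K means uniform
print(low.max(axis=1).mean(), high.max(axis=1).mean())
```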

LDA Training Algorithm (Gibbs Sampling)

  1. Initialize: Randomly assign each word in the corpus to one of K topics
  2. Iterate: For each word in each document, reassign it to a topic based on (a) how often the topic appears in the document and (b) how often the word appears in the topic
  3. Optimize α, β: adjust the Dirichlet parameters and generate M documents; keep the settings whose generated documents are closest to the originals
  4. Gibbs Sampling: used to maximize the probability that the generated documents match the originals; this clusters similar words and similar documents under each topic
  5. Convergence: Stop when topic assignments stabilize
🎯 Goal Maximize the probability of generated documents being similar to the original ones → this effectively clusters similar words and similar documents for each topic.
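The training loop above can be sketched as a minimal collapsed Gibbs sampler. This toy version fixes α and β rather than optimizing them, and the corpus, hyperparameters, and iteration count are all assumed values:

```python
import numpy as np

def lda_gibbs(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA. docs: lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))              # doc-topic counts
    nkw = np.zeros((K, V))                      # topic-word counts
    nk = np.zeros(K)                            # total words per topic
    # Step 1: randomly assign every word in the corpus to a topic
    z = [rng.integers(K, size=len(d)) for d in docs]
    for j, d in enumerate(docs):
        for t, w in enumerate(d):
            ndk[j, z[j][t]] += 1
            nkw[z[j][t], w] += 1
            nk[z[j][t]] += 1
    # Step 2: iterate, reassigning each word until assignments stabilize
    for _ in range(iters):
        for j, d in enumerate(docs):
            for t, w in enumerate(d):
                old = z[j][t]                   # remove the current assignment
                ndk[j, old] -= 1
                nkw[old, w] -= 1
                nk[old] -= 1
                # (a) topic's share of this doc × (b) word's share of each topic
                p = (ndk[j] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                new = rng.choice(K, p=p / p.sum())
                z[j][t] = new
                ndk[j, new] += 1
                nkw[new, w] += 1
                nk[new] += 1
    # Smoothed estimates of the doc-topic and topic-word matrices
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi

# Two well-separated word groups should split cleanly into K = 2 topics
corpus = [[0, 1, 0, 1, 0, 1]] * 3 + [[2, 3, 2, 3, 2, 3]] * 3
theta, phi = lda_gibbs(corpus, V=4, K=2)
```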

⚖️

LSI vs LDA: Head-to-Head

When to use which
Aspect | LSI (Latent Semantic Indexing) | LDA (Latent Dirichlet Allocation)
Approach | Linear algebra (SVD) | Probabilistic (Bayesian)
Output type | Geometric similarity scores | Probability distributions
Concept/topic representation | Latent dimensions via SVD | Dirichlet distributions over words
Requires K upfront? | Not strictly: the full SVD yields up to d concepts, truncated to k afterward | Yes, K must be fixed before training
Interpretability | Harder: dimensions carry no probability meaning | Better: probabilistic topics
Speed | Fast (SVD is efficient) | Slower (iterative sampling)
Query mechanism | Map query to concept space via Vᵀ | Infer topic distribution of new doc
Handles polysemy? | Partially: maps to concept space | Better: words distributed across topics
Word order | Ignored (BoW-based) | Also ignored (BoW-based)
Common use | Information retrieval, search | Document classification, topic discovery
✅ When to use each Use LSI when you need fast semantic search/retrieval and want to find similar documents quickly. Use LDA when you want interpretable probabilistic topics and want to understand what each topic "means" in terms of word distributions.

🎯 Quiz 10 Practice — Topic Modeling

Q1. True or False: Topic modeling with LSI and LDA both require labeled training data (supervised learning).

Both LSI and LDA are unsupervised — no labels are needed. They discover hidden topic structure from raw documents.

Q2. In LSI, which matrix from the SVD decomposition represents the document-concept relationship?

U = document-concept matrix. Σ = concept strengths (diagonal). Vᵀ = term-concept matrix. To query a term, you multiply its one-hot vector by Vᵀ.

Q3. Which of the following statements is FALSE about LDA?

LDA does NOT use word order — it is based on Bag of Words (word frequencies). Topic discovery is based on word co-occurrence, not sequence.

Q4. In LSI, the singular values in the Σ matrix are sorted in descending order. What does a higher singular value for a concept indicate?

Higher singular values indicate stronger, more prominent concepts. For example, if there are more CS documents than medical documents in the corpus, the CS concept will have a higher singular value in Σ.

Q5. In LDA, what is the role of the term p(z_{j,t} | θ_j) in the probability equation?

p(z_{j,t} | θ_j) is the multinomial distribution that randomly assigns a topic to word t in document j, based on the document's topic distribution θ_j. For example, if θ_j says 80% Topic 2, each word has an 80% chance of being assigned Topic 2.