Unsupervised techniques to discover hidden topics in documents — from SVD-based Latent Semantic Indexing to probabilistic Latent Dirichlet Allocation.
Topic modeling is an unsupervised learning technique (no labels needed) to extract topics from documents and find documents that potentially share a common context.
It enables semantic querying — we can retrieve documents that are related to a topic even if they don't contain the exact search keyword. For example, searching for "system" can retrieve a document that only contains "data" and "retrieval" because they share the same underlying database concept.
Topics are latent (hidden). We don't know their labels in advance — we label them manually after seeing which words cluster together.
Uses SVD (Singular Value Decomposition) to decompose the document-term matrix into concept matrices. Linear algebra approach — fast and efficient.
Uses Dirichlet distributions and probabilistic sampling. Requires specifying number of topics K in advance. Gold standard for topic modeling.
Same as unigram Bag of Words — each row is a document, each column is a term. Cell value = 1 if term appears in document, 0 otherwise (or can use TF counts).
|  | data | system | retrieval | lung | ear |
|---|---|---|---|---|---|
| doc1 | 1 | 1 | 1 | 0 | 0 |
| doc2 | 1 | 1 | 1 | 0 | 0 |
| doc3 | 0 | 0 | 0 | 1 | 1 |
| doc4 | 0 | 0 | 0 | 1 | 1 |
doc1&doc2 = CS/database concept; doc3&doc4 = medical concept
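The table above can be reproduced in a few lines of plain Python (the toy documents are assumed to be exactly the keyword lists shown):

```python
# Toy corpus matching the document-term table above; each "document"
# is just its keyword list, a simplification for illustration.
docs = [
    "data system retrieval",   # doc1
    "data system retrieval",   # doc2
    "lung ear",                # doc3
    "lung ear",                # doc4
]
vocab = ["data", "system", "retrieval", "lung", "ear"]

# Binary document-term matrix: 1 if the term appears in the document, else 0.
dtm = [[1 if term in doc.split() else 0 for term in vocab] for doc in docs]

for row in dtm:
    print(row)
```

Using term-frequency counts instead of binary values only requires swapping the `1 if ... else 0` for `doc.split().count(term)`.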
We use SVD to decompose the document-term matrix A into three matrices representing concepts:
| Matrix | Name | Meaning |
|---|---|---|
| U (n × k) | Document-Concept Matrix | How strongly each document relates to each concept |
| Σ (k × k) | Singular Values (diagonal) | Strength/confidence of each concept — sorted descending. Higher value = concept appears more in corpus. |
| Vᵀ (k × d) | Term-Concept Matrix | How strongly each term relates to each concept |
Convert the query term into a one-hot vector q of length d (1 at the term's position, 0 elsewhere). Multiplying by the term-concept matrix, Vᵀq, picks out that term's concept coordinates. Compare the resulting concept vector with the document concept vectors (rows of U), e.g. using cosine similarity.
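A minimal sketch of this query flow with NumPy, using the toy matrix from earlier (keeping k = 2 concepts and using cosine similarity are illustrative choices):

```python
import numpy as np

# Binary document-term matrix from the table above (rows: doc1..doc4).
A = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=float)
terms = ["data", "system", "retrieval", "lung", "ear"]

# Truncated SVD: A ≈ U Σ Vᵀ, keeping the k=2 strongest concepts.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

# Query "system": one-hot term vector mapped into concept space via Vᵀ.
q = np.zeros(len(terms))
q[terms.index("system")] = 1.0
q_concept = Vt_k @ q          # shape (k,): the query's concept coordinates

# Cosine similarity between the query and each document's concept vector.
doc_concepts = U_k * S_k      # scale document coordinates by concept strength
sims = doc_concepts @ q_concept / (
    np.linalg.norm(doc_concepts, axis=1) * np.linalg.norm(q_concept)
)
print(sims.round(2))          # doc1/doc2 score high, doc3/doc4 near zero
```

Note that doc1 and doc2 match the query "system" even though the similarity is computed purely in concept space; the same mechanism would let a document containing only "data" and "retrieval" match as well.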
LDA is an unsupervised probabilistic model. You must specify the number of topics K in advance. LDA finds groups of words that occur together across documents, discovering the hidden topic structure.
Documents with similar topics share common words. Topics can be discovered by finding groups of words occurring together across all documents.
LDA models the joint probability of generating a corpus (the documents, the per-word topic assignments, and the topic/word distributions) as a product of four terms:
Dirichlet distribution for document-topics. Imagine a triangle with K topics at the corners. Tells us the probability distribution over topics for each document. M = # documents.
Dirichlet distribution for topic-words. Imagine a triangle with words at corners. Tells us the probability distribution over words for each topic. K = # topics.
Multinomial: use the doc-topic Dirichlet to randomly assign a topic to each word in the document. e.g., if Doc1 is 80% Topic 2, each word has an 80% chance of being assigned Topic 2.
Multinomial: given the assigned topic from step ③, use the topic-word Dirichlet to randomly assign a word. e.g., if word is assigned Topic 1 (sports), "football" and "stadium" get highest probability.
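Putting the four terms together gives the standard LDA joint probability (notation assumed here: θ_j = doc-topic distribution for document j, φ_i = topic-word distribution for topic i, z_{j,t} = topic assigned to word t of document j, w_{j,t} = that word, N_j = length of document j):

$$
p(W, Z, \theta, \varphi \mid \alpha, \beta) \;=\; \underbrace{\prod_{i=1}^{K} p(\varphi_i \mid \beta)}_{\text{②}} \;\underbrace{\prod_{j=1}^{M} p(\theta_j \mid \alpha)}_{\text{①}} \;\prod_{j=1}^{M}\prod_{t=1}^{N_j} \underbrace{p(z_{j,t} \mid \theta_j)}_{\text{③}}\;\underbrace{p(w_{j,t} \mid \varphi_{z_{j,t}})}_{\text{④}}
$$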
The Dirichlet distribution generates probability distributions. For K=3 topics, it produces a point inside a triangle (simplex) where each corner represents 100% probability for that topic. For higher K, it becomes a higher-dimensional simplex.
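A quick way to see this with NumPy (K = 3 and the concentration parameters below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw 5 document-topic distributions from a symmetric Dirichlet with K=3.
# Each sample is a point on the 2-simplex (the triangle): three non-negative
# topic weights that sum to 1.
alpha = [0.5, 0.5, 0.5]   # concentration < 1 pushes mass toward the corners
theta = rng.dirichlet(alpha, size=5)

print(theta.round(3))
print(theta.sum(axis=1))   # each row sums to 1
```

With a larger concentration (e.g. `alpha = [10, 10, 10]`) the samples cluster near the center of the triangle, i.e. documents mix all topics evenly.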
| Aspect | LSI (Latent Semantic Indexing) | LDA (Latent Dirichlet Allocation) |
|---|---|---|
| Approach | Linear algebra (SVD) | Probabilistic (Bayesian) |
| Output type | Geometric similarity scores | Probability distributions |
| Concept/topic representation | Latent dimensions via SVD | Dirichlet distributions over words |
| Requires K upfront? | Not strictly (truncation rank k ≤ d can be chosen after the full SVD) | Yes (K must be fixed before training) |
| Interpretability | Harder — no probability meaning | Better — probabilistic topics |
| Speed | Fast (SVD is efficient) | Slower (iterative sampling) |
| Query mechanism | Map query to concept space via Vᵀ | Infer topic distribution of new doc |
| Handles polysemy? | Partially — maps to concept space | Better — words distributed across topics |
| Word order | Ignored (BoW-based) | Also ignored (BoW-based) |
| Common use | Information retrieval, search | Document classification, topic discovery |
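The "iterative sampling" that makes LDA slower is typically collapsed Gibbs sampling. Below is a toy sketch on the four-document corpus from earlier; the corpus, K = 2, the hyperparameters, and the sweep count are all illustrative assumptions, not a production implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

vocab = ["data", "system", "retrieval", "lung", "ear"]
docs = [[0, 1, 2], [0, 1, 2], [3, 4], [3, 4]]    # word ids per document
K, V, M = 2, len(vocab), len(docs)
alpha, beta = 0.1, 0.01                          # Dirichlet hyperparameters

# Random initial topic assignment for every word token.
z = [[rng.integers(K) for _ in doc] for doc in docs]

# Count matrices: doc-topic (ndk), topic-word (nkw), topic totals (nk).
ndk = np.zeros((M, K)); nkw = np.zeros((K, V)); nk = np.zeros(K)
for j, doc in enumerate(docs):
    for t, w in enumerate(doc):
        k = z[j][t]; ndk[j, k] += 1; nkw[k, w] += 1; nk[k] += 1

for _ in range(200):                             # Gibbs sweeps
    for j, doc in enumerate(docs):
        for t, w in enumerate(doc):
            k = z[j][t]                          # remove current assignment
            ndk[j, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # p(z = k | rest) ∝ (n_dk + α) · (n_kw + β) / (n_k + V·β)
            p = (ndk[j] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[j][t] = k                          # record the new assignment
            ndk[j, k] += 1; nkw[k, w] += 1; nk[k] += 1

# Top words per topic: the two word groups should separate cleanly.
for k in range(K):
    print(f"topic {k}:", [vocab[w] for w in np.argsort(-nkw[k])[:3]])
```

After enough sweeps the CS words and the medical words usually land in different topics, though Gibbs sampling is stochastic, so the topic labels (0 vs 1) can swap between runs; as noted above, the topics still need to be labeled manually from their top words.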
Q1. True or False: Topic modeling with LSI and LDA both require labeled training data (supervised learning).
Q2. In LSI, which matrix from the SVD decomposition represents the document-concept relationship?
Q3. Which of the following statements is FALSE about LDA?
Q4. In LSI, the singular values in the Σ matrix are sorted in descending order. What does a higher singular value for a concept indicate?
Q5. In LDA, what is the role of the term p(z_{j,t} | θ_j) in the probability equation?