🗺️ The Big Picture

Week 2 connects Week 1 representations to predictions
The Journey So Far — Week 1: raw text → numeric matrix (X). Now in Week 2: given X and labels (y), can we predict the correct label for a new document? And how do we measure whether we're doing a good job?
The Two Questions This Week Answers:
  1. Naive Bayes: "Given this document (X), what is the probability it belongs to class Y?" → This is prediction.
  2. Evaluation Metrics: "How accurate is our classifier? Is it biased? When does it fail?" → This is measurement.
🧮 Part 1: Naive Bayes

A probabilistic text classifier built on Bayes' Rule

📐 Bayes' Rule — The Foundation

What

Bayes' Rule tells us how to compute the probability of a hypothesis (like "this email is spam") given some observed evidence (the words in the email).

P(Y | X) = P(X | Y) × P(Y) / P(X)
Posterior = Likelihood × Prior / Evidence (normalization)

Where:
  P(Y | X) = Posterior   → P(spam | these words) — what we WANT
  P(X | Y) = Likelihood  → P(these words | spam) — how likely these words appear in spam
  P(Y)     = Prior       → P(spam) — how common is spam in our dataset?
  P(X)     = Evidence    → the same for all classes, so we IGNORE it
Prior P(Y) × Likelihood P(X|Y) ÷ Evidence P(X) = Posterior P(Y|X) ✓
💡 Why we ignore P(X) during prediction
We compare P(spam|X) vs. P(not-spam|X). Since P(X) is the same for both, it cancels out! We only need to compare Likelihood × Prior for each class.
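The cancellation above can be sketched numerically. This is a minimal example with made-up prior and likelihood values for one document; comparing the unnormalized scores is enough to pick the winning class without ever computing P(X).

```python
# Toy numbers (assumed for illustration): a two-class spam example.
priors = {"spam": 0.3, "not-spam": 0.7}           # P(Y)
likelihoods = {"spam": 0.008, "not-spam": 0.001}  # P(X | Y) for one document

# Unnormalized posterior score per class: P(X|Y) × P(Y).
# Dividing both scores by P(X) would not change which one is larger.
scores = {c: likelihoods[c] * priors[c] for c in priors}
prediction = max(scores, key=scores.get)  # 'spam' (0.0024 vs. 0.0007)
```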

🤔 Why "Naive"? The Independence Assumption

What

The "naive" part means we assume all features (words) are conditionally independent given the class. In reality, word co-occurrences aren't independent ("machine" often appears with "learning"), but this assumption makes the math tractable.

Why it matters

Without this assumption, computing P(X|Y) = P(word1, word2, ..., wordN | Y) requires estimating a joint probability over every combination of thousands of words — exponentially many parameters, which is intractable. With independence, it becomes a simple product of per-word probabilities.

# Without independence (intractable):
P(X | Y) = P(w₁, w₂, ..., wₙ | Y)              ← joint probability, huge

# With the naive independence assumption (tractable!):
P(X | Y) = P(w₁|Y) × P(w₂|Y) × ... × P(wₙ|Y)
         = ∏ P(wᵢ | Y)                          ← just a product of individual word probs
⚠️ Quiz 2 Tested This! "The position of words in the document is important for Naive Bayes" → FALSE. NB uses BoW — order doesn't matter.

🔢 How to Calculate — Multinomial NB for NLP

How — Step by Step

# Step 1: Calculate the prior P(Y=c) for each class c
P(Y=spam)     = count(spam docs)     / total_docs
P(Y=not-spam) = count(not-spam docs) / total_docs

# Step 2: Calculate the likelihood P(word | Y=c) for each word,
# using the Multinomial distribution:
P(word_w | Y=c) = count(word_w in class-c docs) / count(ALL words in class-c docs)

# Step 3: Classify a new document — pick the highest posterior
ŷ = argmax_c [ P(Y=c) × ∏ P(wᵢ | Y=c) ]

# In log space (avoids underflow from multiplying many tiny probabilities):
ŷ = argmax_c [ log P(Y=c) + Σ log P(wᵢ | Y=c) ]

🎮 Naive Bayes Spam Classifier Demo

Enter a sentence and see how the NB classifier scores it as spam vs. not-spam using a small trained vocabulary.
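The three steps above can be sketched end to end. This is a minimal Multinomial NB in log space with Laplace (add-one) smoothing; the four training documents are invented for illustration, and unseen words are simply skipped.

```python
import math
from collections import Counter

# Tiny hand-made training set (assumed for illustration only).
train = [
    ("win money now", "spam"),
    ("free money win", "spam"),
    ("meeting schedule today", "not-spam"),
    ("project meeting notes", "not-spam"),
]

# Step 1: priors P(Y=c) from document counts.
docs_per_class = Counter(label for _, label in train)
total_docs = len(train)

# Step 2: word counts per class for the multinomial likelihood.
word_counts = {c: Counter() for c in docs_per_class}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for c in word_counts for w in word_counts[c]}

def classify(text, alpha=1.0):
    """Step 3: argmax_c [log P(Y=c) + sum_i log P(w_i | Y=c)],
    with Laplace (add-alpha) smoothing to avoid zero probabilities."""
    scores = {}
    for c in docs_per_class:
        log_score = math.log(docs_per_class[c] / total_docs)
        total_words = sum(word_counts[c].values())
        for w in text.split():
            if w not in vocab:  # skip words never seen in training
                continue
            p = (word_counts[c][w] + alpha) / (total_words + alpha * len(vocab))
            log_score += math.log(p)
        scores[c] = log_score
    return max(scores, key=scores.get)

print(classify("free money"))     # spam
print(classify("meeting today"))  # not-spam
```

Note that "training" is nothing but counting, which is why NB is so fast: there is no optimization loop at all.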

✅ Advantages of Naive Bayes

  • Very simple, fast to train (just counting!)
  • No optimization loop needed — just compute counts
  • Works well in practice despite "naive" assumption
  • Good baseline for text classification

❌ Disadvantages

  • Word position and order ignored
  • Independence assumption is unrealistic
  • Zero-probability problem (use Laplace smoothing)
  • Can't capture complex patterns
🔑 Generative vs. Discriminative
Naive Bayes is generative — it models both P(X|Y) and P(Y) to reconstruct how the data was generated. Logistic Regression, SVMs, and neural nets are discriminative — they compute P(Y|X) directly without modeling P(X|Y).
📏 Part 2: Classification Evaluation

How do we know if our model is actually good?

🔲 The Confusion Matrix

What

A table that compares what the model predicted against what the labels actually are. Each cell shows how many times the model got it right or wrong — and how.

Why

Accuracy alone can be misleading. If 99% of emails are not-spam, a model that always predicts "not-spam" gets 99% accuracy but is completely useless! The confusion matrix exposes this.

Binary Confusion Matrix

                    Predicted: Positive        Predicted: Negative
Actual: Positive    TP ✓ (True Positive)       FN ✗ (False Negative)
Actual: Negative    FP ✗ (False Positive)      TN ✓ (True Negative)
Accuracy  = (TP + TN) / (TP + TN + FP + FN)
          → What fraction of ALL predictions were correct?
Precision = TP / (TP + FP)
          → Of the things I called "positive", how many were actually positive?
          → Spam detector: "When I say spam, am I right?"
Recall    = TP / (TP + FN)   (= Sensitivity = True Positive Rate)
          → Of all actual positives, how many did I catch?
          → Cancer detector: "Did I catch all the cancer cases?"
F1 Score  = 2 × (Precision × Recall) / (Precision + Recall)
          → Harmonic mean — balances both precision and recall

🎮 Interactive Confusion Matrix Calculator

Adjust TP, FP, FN, TN to see how metrics change. Watch what happens when data is imbalanced!
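The calculator boils down to the four formulas above. Here is a minimal sketch that also reproduces the imbalance trap from earlier: an always-"not-spam" classifier on invented counts (990 negatives, 10 positives) scores 99% accuracy with zero precision, recall, and F1.

```python
def metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# A classifier that always predicts "not-spam" on imbalanced data:
# it never produces a positive, so TP = FP = 0.
m = metrics(tp=0, fp=0, fn=10, tn=990)
print(m)  # accuracy 0.99, but precision/recall/F1 all 0.0
```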
⚠️ Quiz 2 Tested This! "Confusion matrix can only be used for binary classification" → FALSE. It works for multi-class too (e.g., sports/news/politics).

📈 ROC Curve & AUC

What

The ROC (Receiver Operating Characteristic) curve plots True Positive Rate (Recall) vs. False Positive Rate as you vary the classification threshold. AUC = Area Under the Curve.

Why

Classifiers like Logistic Regression output a probability (e.g., 0.73 = 73% spam). We need a threshold (e.g., 0.5) to convert to a label. The ROC curve shows which threshold gives the best trade-off.

How to read it

[ROC plot: x-axis = False Positive Rate, y-axis = True Positive Rate; example curve with AUC ≈ 0.9; random-guessing diagonal = 0.5; perfect classifier at (0, 1)]
  • AUC = 1.0: Perfect classifier
  • AUC = 0.5: Random guessing (the diagonal line)
  • AUC = 0.9: Excellent model
  • AUC < 0.5: Worse than random (reversed predictions!)
False Positive Rate = FP / (FP + TN)
True Positive Rate = TP / (TP + FN) = Recall
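The threshold sweep can be sketched directly from the two rate definitions above. This is a minimal pure-Python version on invented scores and labels: for each candidate threshold, count TP and FP among examples scoring at or above it, then integrate the resulting (FPR, TPR) points with the trapezoid rule.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs as the decision threshold sweeps over the scores.
    scores: predicted P(positive); labels: 1 = positive, 0 = negative."""
    thresholds = sorted(set(scores), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: predict nothing positive
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    points.append((1.0, 1.0))  # threshold of 0: predict everything positive
    return points

def auc(points):
    """Area under the ROC curve via the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Toy scores (assumed): higher = more confident the example is positive.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   0]
pts = roc_points(scores, labels)
print(auc(pts))  # ≈ 0.889 — two of the three positives outrank every negative
```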
⚠️ Warning: AUC alone isn't everything! Two models can have the same AUC but very different behavior. One might have TPR=0 for low thresholds — terrible for medical diagnosis. Always inspect the full curve, not just the summary number.

🧪 Quiz Prep — Week 2 Questions

Q1. In text classification with Naive Bayes, what is P(class | document) called?

Q2. Which statement does NOT hold true for Naive Bayes?

Q3. Accuracy may not be informative when evaluating highly imbalanced data — True or False?

Q4. Which of these statements about the confusion matrix is CORRECT?