🗺️ The Big Picture

Week 2 connects Week 1 representations to predictions
The Journey So Far — Week 1: raw text → numeric matrix (X). Now in Week 2: given X and labels (y), can we predict the correct label for a new document? And how do we measure whether we're doing a good job?
The Two Questions This Week Answers:
  1. Naive Bayes: "Given this document (X), what is the probability it belongs to class Y?" → This is prediction.
  2. Evaluation Metrics: "How accurate is our classifier? Is it biased? When does it fail?" → This is measurement.
🧮 Part 1: Naive Bayes

A probabilistic text classifier built on Bayes' Rule

📐 Bayes' Rule — The Foundation

What

Bayes' Rule tells us how to compute the probability of a hypothesis (like "this email is spam") given some observed evidence (the words in the email).

P(Y | X) = P(X | Y) × P(Y) / P(X)
Posterior = Likelihood × Prior / Evidence (normalization)

Where:
  P(Y | X) = Posterior   → P(spam | these words) — what we WANT
  P(X | Y) = Likelihood  → P(these words | spam) — how likely these words appear in spam
  P(Y)     = Prior       → P(spam) — how common is spam in our dataset?
  P(X)     = Evidence    → the same for all classes, so we IGNORE it
Prior P(Y) × Likelihood P(X|Y) ÷ Evidence P(X) = Posterior P(Y|X) ✓
💡 Why we ignore P(X) during prediction
We compare P(spam|X) vs. P(not-spam|X). Since P(X) is the same for both, it cancels out! We only need to compare Likelihood × Prior for each class.
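The cancellation above can be sketched numerically. This is a minimal example with made-up prior and likelihood values for one document; comparing the unnormalized scores is enough to pick the winning class without ever computing P(X).

```python
# Toy numbers (assumed for illustration): a two-class spam example.
priors = {"spam": 0.3, "not-spam": 0.7}           # P(Y)
likelihoods = {"spam": 0.008, "not-spam": 0.001}  # P(X | Y) for one document

# Unnormalized posterior score per class: P(X|Y) × P(Y).
# Dividing both scores by P(X) would not change which one is larger.
scores = {c: likelihoods[c] * priors[c] for c in priors}
prediction = max(scores, key=scores.get)  # 'spam' (0.0024 vs. 0.0007)
```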

🤔 Why "Naive"? The Independence Assumption

What

The "naive" part means we assume all features (words) are conditionally independent given the class. In reality, word co-occurrences aren't independent ("machine" often appears with "learning"), but this assumption makes the math tractable.

Why it matters

Without this assumption, computing P(X|Y) = P(word1, word2, ..., wordN | Y) requires estimating a joint probability over every combination of thousands of words — exponentially many parameters, which is intractable. With independence, it becomes a simple product of per-word probabilities.

# Without independence (intractable):
P(X | Y) = P(w₁, w₂, ..., wₙ | Y)              ← joint probability, huge

# With the naive independence assumption (tractable!):
P(X | Y) = P(w₁|Y) × P(w₂|Y) × ... × P(wₙ|Y)
         = ∏ P(wᵢ | Y)                          ← just a product of individual word probs
⚠️ Quiz 2 Tested This! "The position of words in the document is important for Naive Bayes" → FALSE. NB uses BoW — order doesn't matter.

🔢 How to Calculate — Multinomial NB for NLP

How — Step by Step

# Step 1: Calculate the prior P(Y=c) for each class c
P(Y=spam)     = count(spam docs)     / total_docs
P(Y=not-spam) = count(not-spam docs) / total_docs

# Step 2: Calculate the likelihood P(word | Y=c) for each word,
# using the Multinomial distribution:
P(word_w | Y=c) = count(word_w in class-c docs) / count(ALL words in class-c docs)

# Step 3: Classify a new document — pick the highest posterior
ŷ = argmax_c [ P(Y=c) × ∏ P(wᵢ | Y=c) ]

# In log space (avoids underflow from multiplying many tiny probabilities):
ŷ = argmax_c [ log P(Y=c) + Σ log P(wᵢ | Y=c) ]

🎮 Naive Bayes Spam Classifier Demo

Enter a sentence and see how the NB classifier scores it as spam vs. not-spam using a small trained vocabulary.
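The three steps above can be sketched end to end. This is a minimal Multinomial NB in log space with Laplace (add-one) smoothing; the four training documents are invented for illustration, and unseen words are simply skipped.

```python
import math
from collections import Counter

# Tiny hand-made training set (assumed for illustration only).
train = [
    ("win money now", "spam"),
    ("free money win", "spam"),
    ("meeting schedule today", "not-spam"),
    ("project meeting notes", "not-spam"),
]

# Step 1: priors P(Y=c) from document counts.
docs_per_class = Counter(label for _, label in train)
total_docs = len(train)

# Step 2: word counts per class for the multinomial likelihood.
word_counts = {c: Counter() for c in docs_per_class}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for c in word_counts for w in word_counts[c]}

def classify(text, alpha=1.0):
    """Step 3: argmax_c [log P(Y=c) + sum_i log P(w_i | Y=c)],
    with Laplace (add-alpha) smoothing to avoid zero probabilities."""
    scores = {}
    for c in docs_per_class:
        log_score = math.log(docs_per_class[c] / total_docs)
        total_words = sum(word_counts[c].values())
        for w in text.split():
            if w not in vocab:  # skip words never seen in training
                continue
            p = (word_counts[c][w] + alpha) / (total_words + alpha * len(vocab))
            log_score += math.log(p)
        scores[c] = log_score
    return max(scores, key=scores.get)

print(classify("free money"))     # spam
print(classify("meeting today"))  # not-spam
```

Note that "training" is nothing but counting, which is why NB is so fast: there is no optimization loop at all.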

✅ Advantages of Naive Bayes

  • Very simple, fast to train (just counting!)
  • No optimization loop needed — just compute counts
  • Works well in practice despite "naive" assumption
  • Good baseline for text classification

❌ Disadvantages

  • Word position and order ignored
  • Independence assumption is unrealistic
  • Zero-probability problem (use Laplace smoothing)
  • Can't capture complex patterns
🔑 Generative vs. Discriminative
Naive Bayes is generative — it models both P(X|Y) and P(Y) to reconstruct how the data was generated. Logistic Regression, SVMs, and neural nets are discriminative — they compute P(Y|X) directly without modeling P(X|Y).
📏 Part 2: Classification Evaluation

How do we know if our model is actually good?

🔲 The Confusion Matrix

What

A table that compares what the model predicted against what the labels actually are. Each cell shows how many times the model got it right or wrong — and how.

Why

Accuracy alone can be misleading. If 99% of emails are not-spam, a model that always predicts "not-spam" gets 99% accuracy but is completely useless! The confusion matrix exposes this.

Binary Confusion Matrix

                    Predicted: Positive        Predicted: Negative
Actual: Positive    TP ✓ (True Positive)       FN ✗ (False Negative)
Actual: Negative    FP ✗ (False Positive)      TN ✓ (True Negative)
Accuracy  = (TP + TN) / (TP + TN + FP + FN)
          → What fraction of ALL predictions were correct?
Precision = TP / (TP + FP)
          → Of the things I called "positive", how many were actually positive?
          → Spam detector: "When I say spam, am I right?"
Recall    = TP / (TP + FN)   (= Sensitivity = True Positive Rate)
          → Of all actual positives, how many did I catch?
          → Cancer detector: "Did I catch all the cancer cases?"
F1 Score  = 2 × (Precision × Recall) / (Precision + Recall)
          → Harmonic mean — balances both precision and recall

🎮 Interactive Confusion Matrix Calculator

Adjust TP, FP, FN, TN to see how metrics change. Watch what happens when data is imbalanced!
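The calculator boils down to the four formulas above. Here is a minimal sketch that also reproduces the imbalance trap from earlier: an always-"not-spam" classifier on invented counts (990 negatives, 10 positives) scores 99% accuracy with zero precision, recall, and F1.

```python
def metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# A classifier that always predicts "not-spam" on imbalanced data:
# it never produces a positive, so TP = FP = 0.
m = metrics(tp=0, fp=0, fn=10, tn=990)
print(m)  # accuracy 0.99, but precision/recall/F1 all 0.0
```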
⚠️ Quiz 2 Tested This! "Confusion matrix can only be used for binary classification" → FALSE. It works for multi-class too (e.g., sports/news/politics).

📈 ROC Curve & AUC

What

The ROC (Receiver Operating Characteristic) curve plots True Positive Rate (Recall) vs. False Positive Rate as you vary the classification threshold. AUC = Area Under the Curve.

Why

Classifiers like Logistic Regression output a probability (e.g., 0.73 = 73% spam). We need a threshold (e.g., 0.5) to convert to a label. The ROC curve shows which threshold gives the best trade-off.

How to read it

[ROC plot: x-axis = False Positive Rate, y-axis = True Positive Rate; example curve with AUC ≈ 0.9; random-guessing diagonal = 0.5; perfect classifier at (0, 1)]
  • AUC = 1.0: Perfect classifier
  • AUC = 0.5: Random guessing (the diagonal line)
  • AUC = 0.9: Excellent model
  • AUC < 0.5: Worse than random (reversed predictions!)
False Positive Rate = FP / (FP + TN)
True Positive Rate = TP / (TP + FN) = Recall
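The threshold sweep can be sketched directly from the two rate definitions above. This is a minimal pure-Python version on invented scores and labels: for each candidate threshold, count TP and FP among examples scoring at or above it, then integrate the resulting (FPR, TPR) points with the trapezoid rule.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs as the decision threshold sweeps over the scores.
    scores: predicted P(positive); labels: 1 = positive, 0 = negative."""
    thresholds = sorted(set(scores), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: predict nothing positive
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    points.append((1.0, 1.0))  # threshold of 0: predict everything positive
    return points

def auc(points):
    """Area under the ROC curve via the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Toy scores (assumed): higher = more confident the example is positive.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   0]
pts = roc_points(scores, labels)
print(auc(pts))  # ≈ 0.889 — two of the three positives outrank every negative
```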
⚠️ Warning: AUC alone isn't everything! Two models can have the same AUC but very different behavior. One might have TPR=0 for low thresholds — terrible for medical diagnosis. Always inspect the full curve, not just the summary number.

🧪 Quiz Prep — Week 2 Questions

Q1. In text classification with Naive Bayes, what is P(class | document) called?

Q2. Which statement does NOT hold true for Naive Bayes?

Q3. Accuracy may not be informative when evaluating highly imbalanced data — True or False?

Q4. Which of these statements about the confusion matrix is CORRECT?