🗺️

The Big Picture

Three linear classifiers, same input — different approaches
Key Shift This Week: Generative → Discriminative. Naive Bayes (Week 2) models HOW the data is generated (generative). This week's models skip that step — they learn a decision boundary that separates the classes directly (discriminative). Fewer assumptions, often better performance.
Algorithm           | Core Idea                             | Output            | Type
Perceptron          | Find ANY line that separates classes  | Hard label (±1)   | Hard classification
Logistic Regression | Find a line + convert to probability  | Probability [0,1] | Soft classification
SVM                 | Find the line with MAXIMUM margin     | Hard label (±1)   | Hard classification (max-margin)
📉

Part 1: Logistic Regression

The backbone of neural networks

🔗 The Sigmoid Function — Converting Scores to Probabilities

What

The sigmoid (logistic) function squishes any real number into the range [0, 1], making it interpretable as a probability.

Why

The linear combination of features (θᵀX) can be any real number, from -∞ to +∞, but for classification we need a probability between 0 and 1. The sigmoid provides exactly this transformation, and it arises naturally as the inverse of the log-odds (logit) function.

σ(s) = eˢ / (1 + eˢ) = 1 / (1 + e⁻ˢ)

where s = θᵀX (linear combination of features)

# s → -∞  ⇒  σ(s) → 0    (definitely class 0)
# s = 0   ⇒  σ(s) = 0.5  (totally uncertain)
# s → +∞  ⇒  σ(s) → 1    (definitely class 1)

# Classification rule:
ŷ = 1 if σ(θᵀX) ≥ threshold (e.g. 0.5)
ŷ = 0 otherwise
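The formula and classification rule translate directly to code. A minimal sketch in Python (the function names and example values are my own, not from the course):

```python
import math

def sigmoid(s: float) -> float:
    """Logistic function: maps any real score s to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

# The three regimes described above:
print(sigmoid(-10))  # ≈ 0    -> confident class 0
print(sigmoid(0))    # = 0.5  -> totally uncertain
print(sigmoid(10))   # ≈ 1    -> confident class 1

def predict(theta, x, threshold=0.5):
    """Classification rule: ŷ = 1 if σ(θᵀx) ≥ threshold, else 0."""
    s = sum(t * xi for t, xi in zip(theta, x))  # s = θᵀx
    return 1 if sigmoid(s) >= threshold else 0

print(predict([1.0, -1.0], [2.0, 1.0]))  # s = 1, σ(1) ≈ 0.73 ≥ 0.5 -> 1
```

Note that the threshold 0.5 corresponds exactly to s = 0: the sign of the linear score alone decides the class.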


⬇️ Training: Gradient Descent + MLE

How LR is Trained

Unlike Naive Bayes (just counting), Logistic Regression must be optimized iteratively. We find θ that maximizes the log-likelihood of our training data.

# Objective: maximize the log-likelihood (or minimize the negative log-likelihood)
L(θ) = Σᵢ [ yᵢ·log σ(θᵀXᵢ) + (1-yᵢ)·log(1 - σ(θᵀXᵢ)) ]

# Gradient update rule:
θ ← θ + η · ∇L(θ)       (gradient ASCENT to maximize L)
# or equivalently
θ ← θ - η · ∇(-L(θ))    (gradient DESCENT to minimize -L)

# η = learning rate (how big a step we take each iteration)
# Repeat until convergence (parameters stop changing much)
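The update rule above is short enough to implement end to end. A minimal NumPy sketch, assuming the bias is folded into the feature vector as a constant 1 (the toy dataset and function names are illustrative); it uses the standard gradient of the log-likelihood, ∇L(θ) = Σᵢ (yᵢ - σ(θᵀxᵢ))·xᵢ:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Gradient ASCENT on the log-likelihood L(θ).

    X: (n_samples, n_features) array, y: array of 0/1 labels.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        preds = sigmoid(X @ theta)        # σ(θᵀxᵢ) for every point at once
        theta += lr * X.T @ (y - preds)   # θ ← θ + η·∇L(θ)
    return theta

# Toy 1-D data; first column is the constant bias feature:
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
theta = train_logistic_regression(X, y)
print(sigmoid(X @ theta))  # probabilities near 0 for class 0, near 1 for class 1
```

Because the objective is concave in θ (equivalently, the negative log-likelihood is convex), this loop heads toward the unique global optimum regardless of the starting point.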
💡 Why no closed-form solution? Linear Regression has an analytical solution (the normal equations, via matrix inversion). In Logistic Regression, the sigmoid makes the gradient equations non-linear in θ, so no closed form exists and we must optimize iteratively with gradient descent. The good news: the negative log-likelihood is convex, so gradient descent is guaranteed to find the global optimum!
📏

Part 2: Support Vector Machine (SVM)

The maximum-margin classifier

↔️ The Margin — Why SVM is Different

What

The margin is the distance between the decision boundary and the nearest data points of each class (the "support vectors"). SVM finds the hyperplane that maximizes this margin.

Why maximize margin?

The Perceptron finds any separating line — but there are infinitely many. Which is most reliable for new data? The one with the largest gap between classes, because it's most robust to noise and new examples near the boundary.

[Figure: SVM maximum-margin hyperplane. The margin = 2/‖θ‖ is the gap between the classes; circled points (⭕) are the support vectors.]
# Decision rule (same form as the Perceptron):
f(x) = xᵀθ + b

# Classification:
ŷ = +1 if f(x) ≥ 0
ŷ = -1 if f(x) < 0

# SVM constraint (support vectors sit exactly at ±1):
yᵢ(xᵢᵀθ + b) ≥ 1   for all training points

# Objective: maximize the margin = 2/‖θ‖
# Equivalently: minimize ½‖θ‖²   (a convex quadratic program)

# Larger ‖θ‖  → smaller margin (bad)
# Smaller ‖θ‖ → larger margin (good!)
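The constraint and margin formula can be checked numerically. A sketch with a hand-picked toy dataset and a hand-derived θ (no solver involved — the data and weights are mine, chosen so the max-margin boundary is obvious):

```python
import numpy as np

# Separable toy data: class +1 on the right, class -1 on the left.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])

# The max-margin hyperplane here is x₁ = 0; scaling θ so the nearest
# points satisfy yᵢ(xᵢᵀθ + b) = 1 gives:
theta = np.array([0.5, 0.0])
b = 0.0

f = X @ theta + b                  # decision values f(xᵢ)
y_hat = np.where(f >= 0, +1, -1)   # hard ±1 classification
print(y_hat)                       # [ 1  1 -1 -1]

# Constraint check: yᵢ·f(xᵢ) ≥ 1 everywhere, = 1 only at support vectors.
print(y * f)                       # [1.  1.5 1.  1.5]

margin = 2.0 / np.linalg.norm(theta)
print(margin)                      # 4.0 — the full gap between the ±1 lines
```

Notice the two points with yᵢ·f(xᵢ) exactly 1 (the first and third): those are the support vectors, and they alone determine the boundary.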
⚠️ Quiz 3 Tested This! "The separating hyperplane produced by the SVM and perceptron are both unique" → FALSE. SVM gives a unique maximum-margin solution; Perceptron gives a non-unique arbitrary solution.

🌀 The Kernel Trick — Non-Linear SVM

What

Sometimes data is not linearly separable. The kernel trick implicitly maps data to a higher-dimensional space where it IS separable — without ever computing the high-dimensional vectors.

Why

Computing dot products in a huge feature space is expensive. The kernel function computes the inner product in the transformed space directly from the original space — efficient and powerful.

# Common kernels:
Linear kernel:      K(xᵢ, xⱼ) = xᵢᵀxⱼ
Polynomial kernel:  K(xᵢ, xⱼ) = (xᵢᵀxⱼ + c)ᵈ
RBF (Gaussian):     K(xᵢ, xⱼ) = exp(-γ‖xᵢ - xⱼ‖²)

# The dual form replaces dot products with kernel evaluations:
f(x) = Σᵢ αᵢ yᵢ K(xᵢ, x) + b

# αᵢ are dual variables (non-zero only for support vectors!)
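The "without ever computing the high-dimensional vectors" claim can be verified directly. For the polynomial kernel with d = 2, c = 1 in 2-D, the implicit feature map is the standard 6-dimensional one; a sketch comparing the two routes (the example vectors are arbitrary):

```python
import numpy as np

def poly_kernel(x, z, c=1.0, d=2):
    """K(x, z) = (xᵀz + c)ᵈ — computed entirely in the ORIGINAL 2-D space."""
    return (x @ z + c) ** d

def phi(x):
    """Explicit feature map for the d=2, c=1 polynomial kernel in 2-D:
    φ(x) = (1, √2·x₁, √2·x₂, x₁², √2·x₁x₂, x₂²), i.e. 6 dimensions.
    """
    s2 = np.sqrt(2.0)
    return np.array([1.0, s2 * x[0], s2 * x[1],
                     x[0] ** 2, s2 * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Kernel trick: one dot product and one power in 2-D...
print(poly_kernel(x, z))   # (1·3 + 2·(-1) + 1)² = 4.0
# ...gives the same number as a dot product in the 6-D feature space:
print(phi(x) @ phi(z))     # 4.0 — identical
```

For an RBF kernel the implicit feature space is infinite-dimensional, so the kernel evaluation is not just cheaper — it is the only option.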

Part 3: Perceptron

The simplest learnable classifier — the OG neural unit

🔌 How the Perceptron Works

What

The Perceptron is a binary classifier that makes hard predictions (±1) and updates its weights whenever it makes a mistake, until all points are correctly classified.

Why it matters

The Perceptron is literally the building block of neural networks. Each "neuron" in a neural network IS a perceptron with a different activation function. Understanding it deeply means understanding deep learning.

# Perceptron model:
f(x) = θᵀx + θ₀     (linear combination of features)
ŷ = sign(f(x))       (+1 or -1)

# Training algorithm:
Initialize θ = 0
FOR each data point (xᵢ, yᵢ):
    ŷᵢ = sign(θᵀxᵢ)
    IF yᵢ × ŷᵢ ≤ 0:              # misclassified!
        θ ← θ + η × yᵢ × xᵢ      # update toward correct class
Repeat until no mistakes (convergence)

🎮 Perceptron Update — Worked Example

Take the test point x = [2, 1] with true label y = +1. If the current weights misclassify it (θᵀx ≤ 0), the update θ ← θ + η·y·x adds a scaled copy of x to θ, tilting the decision boundary so that x lands on the correct (+1) side.
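The training algorithm runs as-is once translated to NumPy. A minimal sketch, with the bias folded into the features and a toy separable dataset (of my own choosing) that includes the worked-example point [2, 1]:

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, max_epochs=100):
    """Perceptron training loop: update θ only on mistakes."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (theta @ xi) <= 0:   # misclassified (or on the boundary)
                theta += lr * yi * xi    # θ ← θ + η·yᵢ·xᵢ
                mistakes += 1
        if mistakes == 0:                # a full clean pass = convergence
            break
    return theta

# Toy separable data including the worked-example point [2, 1], label +1:
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
theta = train_perceptron(X, y)
print(np.sign(X @ theta))  # [ 1.  1. -1. -1.] — every point classified correctly
```

On this data a single mistake-driven update (θ = 0 misclassifies the first point, giving θ = [2, 1]) already separates everything, illustrating why the algorithm terminates quickly when the classes are far apart.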
⚠️ Perceptron Convergence Theorem: the perceptron is guaranteed to converge in a finite number of updates IF the data is linearly separable. If it is not, the algorithm will cycle forever and never converge!

✅ Advantages

  • Extremely simple and fast
  • Online learning (updates one point at a time)
  • Foundation of neural networks

❌ Disadvantages

  • Only works for linearly separable data
  • No unique solution — depends on initialization
  • No probability output (hard classification only)
⚖️

Side-by-Side Comparison

All three linear classifiers at a glance
Property               | Perceptron                     | Logistic Regression           | SVM
Output type            | Hard (sign)                    | Soft (probability)            | Hard (sign)
Unique solution?       | No — any separating line       | Yes (global optimum)          | Yes — max-margin line
Handles non-separable? | No (won't converge)            | Yes                           | Yes (soft-margin SVM)
Objective              | Classify all points correctly  | Maximize likelihood           | Maximize margin
Optimization           | Simple update rule             | Gradient descent (iterative)  | Quadratic programming (convex)
Type                   | Discriminative                 | Discriminative                | Discriminative
Probabilistic?         | No                             | Yes                           | No
🔑 Key Insight: Generative vs. Discriminative. Naive Bayes is generative: it models P(X|Y) and P(Y). All three models this week are discriminative: they compute P(Y|X) or the decision boundary directly. Discriminative models often outperform generative ones when enough data is available.

🧪 Quiz Prep — Week 3 Questions

Q1. Which of the following is NOT a discriminative model?

Q2. What is the best way to select a threshold for Logistic Regression's sigmoid output?

Q3. The separating hyperplane produced by SVM and Perceptron are both unique — True or False?

Q4. For the Perceptron, it is typical to iterate through data one point at a time — True or False?