Three powerful linear classifiers, each with a different philosophy. Master these and you understand the DNA of modern ML.
| Algorithm | Core Idea | Output | Type |
|---|---|---|---|
| Perceptron | Find ANY line that separates classes | Hard label (±1) | Hard classification |
| Logistic Regression | Find a line + convert to probability | Probability [0,1] | Soft classification |
| SVM | Find the line with MAXIMUM margin | Hard label (±1) | Hard classification (max-margin) |
The sigmoid (logistic) function squishes any real number into the range [0, 1], making it interpretable as a probability.
A linear combination of features (θᵀx) can be any real number from -∞ to +∞. For classification, we need a probability between 0 and 1. The sigmoid provides exactly this transformation, and it is not arbitrary: it arises naturally as the inverse of the log-odds (logit) function.
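A minimal sketch of the sigmoid, showing how scores far from 0 saturate toward the ends of the probability range:

```python
import math

def sigmoid(z):
    """Squash any real number z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # 0.5 -- exactly on the decision boundary
print(sigmoid(5))    # ~0.993 -- confidently positive
print(sigmoid(-5))   # ~0.007 -- confidently negative
```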
Unlike Naive Bayes (which can be fit by simple counting), Logistic Regression must be optimized iteratively: we find the θ that maximizes the log-likelihood of the training data.
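The iterative optimization can be sketched as batch gradient ascent on the log-likelihood. This is a toy implementation on hypothetical 1-D data (the learning rate, epoch count, and dataset are illustrative, not from the notes):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.05, epochs=500):
    """Batch gradient ascent on the log-likelihood; labels y in {0, 1}."""
    n = len(X[0])
    theta = [0.0] * n
    for _ in range(epochs):
        grad = [0.0] * n
        for x, yi in zip(X, y):
            p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
            for j in range(n):
                grad[j] += (yi - p) * x[j]      # d(log-likelihood)/d(theta_j)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

# Toy 1-D data with a bias column prepended; the class flips between x=1 and x=2.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [0, 0, 1, 1]
theta = fit_logistic(X, y)
```

After training, `sigmoid(theta[0] + theta[1] * x)` gives the predicted probability of the positive class, which is below 0.5 for x = 0 and above 0.5 for x = 3.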
The margin is the distance between the decision boundary and the nearest data points of each class (the "support vectors"). SVM finds the hyperplane that maximizes this margin.
The Perceptron finds any separating line — but there are infinitely many. Which is most reliable for new data? The one with the largest gap between classes, because it's most robust to noise and new examples near the boundary.
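The geometric margin of a point is just its distance to the hyperplane. A short helper makes the quantity SVM maximizes concrete (the boundary and points here are hypothetical):

```python
import math

def margin(w, b, point):
    """Distance from a point to the hyperplane w . x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, point)) + b
    return abs(score) / math.sqrt(sum(wi ** 2 for wi in w))

# Hypothetical boundary x1 + x2 - 3 = 0 and two points on opposite sides:
print(margin([1, 1], -3, [1, 1]))  # ~0.707
print(margin([1, 1], -3, [3, 2]))  # ~1.414
```

SVM picks the weights w, b so that the smallest such distance over the training set is as large as possible.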
Sometimes data is not linearly separable. The kernel trick implicitly maps data to a higher-dimensional space where it IS separable — without ever computing the high-dimensional vectors.
Computing dot products in a huge feature space is expensive. The kernel function computes the inner product in the transformed space directly from the original space — efficient and powerful.
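A concrete check of the trick with a degree-2 polynomial kernel: the kernel value (x·z)² equals the dot product of the explicit degree-2 feature maps, computed without ever building those features:

```python
import math

def poly2_features(x):
    """Explicit degree-2 feature map for a 2-D vector: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return [x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = [1.0, 2.0], [3.0, 0.5]
explicit = dot(poly2_features(x), poly2_features(z))  # map first, then dot
kernel = dot(x, z) ** 2                               # kernel trick: one dot, one square
print(explicit, kernel)  # both 16.0
```

In 2-D the savings are trivial, but for high-degree kernels (or the RBF kernel, whose feature space is infinite-dimensional) the explicit map is intractable while the kernel stays cheap.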
The Perceptron is a binary classifier that makes hard predictions (±1) and updates its weights whenever it makes a mistake, repeating until all points are correctly classified — which is guaranteed to happen only if the data is linearly separable.
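The whole algorithm fits in a few lines. A minimal sketch on a tiny hypothetical separable dataset, with a bias term folded in as a constant first feature:

```python
def train_perceptron(X, y, epochs=100):
    """Perceptron learning rule: labels y in {-1, +1}; X includes a bias column."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        mistakes = 0
        for x, yi in zip(X, y):
            score = sum(wi * xi for wi, xi in zip(w, x))
            if yi * score <= 0:                           # mistake (or on the boundary)
                w = [wi + yi * xi for wi, xi in zip(w, x)]  # update: w <- w + y*x
                mistakes += 1
        if mistakes == 0:                                 # a full clean pass: converged
            break
    return w

# Hypothetical separable data: bias term 1 prepended to each point.
X = [[1, 2, 1], [1, 3, 3], [1, -1, -1], [1, -2, 0]]
y = [+1, +1, -1, -1]
w = train_perceptron(X, y)
```

After convergence, `y * (w · x) > 0` holds for every training point.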
The Perceptron is the building block of neural networks. Each "neuron" in a neural network is essentially a perceptron with a different (smooth) activation function in place of the hard sign. Understanding it deeply means understanding deep learning.
Test point: x = [2, 1], true label y = +1
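The notes don't include the trained weights that go with this test point, so the following walks through one perceptron prediction-and-update step using hypothetical weights:

```python
# Hypothetical weights (not from the notes), chosen so the test point is misclassified.
w = [1.0, -3.0]
x, y = [2, 1], +1

score = sum(wi * xi for wi, xi in zip(w, x))    # 1*2 + (-3)*1 = -1
pred = 1 if score > 0 else -1                   # predicts -1: a mistake
if pred != y:
    w = [wi + y * xi for wi, xi in zip(w, x)]   # update: w <- w + y*x
print(w)  # [3.0, -2.0]; new score on x is 3*2 + (-2)*1 = 4 > 0, now correct
```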
| Property | Perceptron | Logistic Regression | SVM |
|---|---|---|---|
| Output type | Hard (sign) | Soft (probability) | Hard (sign) |
| Unique solution? | No — any separating line | Yes (global optimum) | Yes — max-margin line |
| Handles non-separable? | No (won't converge) | Yes | Yes (soft-margin SVM) |
| Objective | Classify all points correctly | Maximize likelihood | Maximize margin |
| Optimization | Simple update rule | Gradient descent (iterative) | Quadratic programming (convex) |
| Type | Discriminative | Discriminative | Discriminative |
| Probabilistic? | No | Yes | No |
Q1. Which of the following is NOT a discriminative model?
Q2. What is the best way to select a threshold for Logistic Regression's sigmoid output?
Q3. The separating hyperplanes produced by SVM and the Perceptron are both unique — True or False?
Q4. For the Perceptron, it is typical to iterate through data one point at a time — True or False?