🗺️

The Big Picture

Three linear classifiers, same input — different approaches
Key Shift This Week: Generative → Discriminative Naive Bayes (Week 2) models HOW data is generated (generative). This week's models skip that — they directly learn a decision boundary to separate classes (discriminative). Simpler assumption, often better performance.
AlgorithmCore IdeaOutputType
PerceptronFind ANY line that separates classesHard label (±1)Hard classification
Logistic RegressionFind a line + convert to probabilityProbability [0,1]Soft classification
SVMFind the line with MAXIMUM marginHard label (±1)Hard classification (max-margin)
📉

Part 1: Logistic Regression

The backbone of neural networks

🔗 The Sigmoid Function — Converting Scores to Probabilities

What

The sigmoid (logistic) function squishes any real number into the range [0, 1], making it interpretable as a probability.

Why

Linear combination of features (θᵀX) can be any number from -∞ to +∞. For classification, we need a probability between 0 and 1. The sigmoid provides exactly this transformation while being mathematically derived from the log-odds ratio.

σ(s) = eˢ / (1 + eˢ) = 1 / (1 + e⁻ˢ) Where s = θᵀX (linear combination of features) # s → -∞ → σ(s) → 0 (definitely class 0) # s = 0 → σ(s) = 0.5 (totally uncertain) # s → +∞ → σ(s) → 1 (definitely class 1) # Classification rule: ŷ = 1 if σ(θᵀX) ≥ threshold (e.g. 0.5) ŷ = 0 otherwise

🎮 Sigmoid Function Explorer

Move the slider to change the linear score s = θᵀX and see how the sigmoid converts it to a probability.
Move the slider to explore...

⬇️ Training: Gradient Descent + MLE

How LR is Trained

Unlike Naive Bayes (just counting), Logistic Regression must be optimized iteratively. We find θ that maximizes the log-likelihood of our training data.

# Objective: maximize log-likelihood (or minimize negative log-likelihood) L(θ) = Σᵢ [ yᵢ·log σ(θᵀXᵢ) + (1-yᵢ)·log(1 - σ(θᵀXᵢ)) ] # Gradient Descent update rule: θ ← θ + η · ∇L(θ) (gradient ASCENT to maximize L) OR θ ← θ - η · ∇(-L(θ)) (gradient DESCENT to minimize -L) # η = learning rate (how big a step we take each iteration) # Repeat until convergence (parameters stop changing much)
💡 Why no closed-form solution? Linear Regression has an analytical solution (matrix inversion). Logistic Regression's sigmoid makes the objective non-linear, so we must use iterative gradient descent. But the good news: the function is convex → guaranteed to find the global optimum!
📏

Part 2: Support Vector Machine (SVM)

The maximum-margin classifier

↔️ The Margin — Why SVM is Different

What

The margin is the distance between the decision boundary and the nearest data points of each class (the "support vectors"). SVM finds the hyperplane that maximizes this margin.

Why maximize margin?

The Perceptron finds any separating line — but there are infinitely many. Which is most reliable for new data? The one with the largest gap between classes, because it's most robust to noise and new examples near the boundary.

SVM: Maximum Margin margin = 2/‖θ‖ ⭕ = support vectors
# Decision rule (same as Perceptron): f(x) = xᵀθ + b # Classification: ŷ = +1 if f(x) ≥ 0 ŷ = -1 if f(x) < 0 # SVM constraint (support vectors sit at ±1): yᵢ(xᵢᵀθ + b) ≥ 1 for all training points # Objective: maximize margin = 2/‖θ‖ # Equivalently: minimize ½‖θ‖² (convex quadratic problem) Margin = 2 / ‖θ‖ # Larger ‖θ‖ = smaller margin (bad) # Smaller ‖θ‖ = larger margin (good!)
⚠️ Quiz 3 Tested This! "The separating hyperplane produced by the SVM and perceptron are both unique" → FALSE. SVM gives a unique maximum-margin solution; Perceptron gives a non-unique arbitrary solution.

🌀 The Kernel Trick — Non-Linear SVM

What

Sometimes data is not linearly separable. The kernel trick implicitly maps data to a higher-dimensional space where it IS separable — without ever computing the high-dimensional vectors.

Why

Computing dot products in a huge feature space is expensive. The kernel function computes the inner product in the transformed space directly from the original space — efficient and powerful.

# Common kernels: Linear kernel: K(xᵢ, xⱼ) = xᵢᵀxⱼ Polynomial kernel: K(xᵢ, xⱼ) = (xᵢᵀxⱼ + c)ᵈ RBF (Gaussian): K(xᵢ, xⱼ) = exp(-γ‖xᵢ - xⱼ‖²) # The dual form replaces dot products with kernel evaluations: f(x) = Σᵢ αᵢ yᵢ K(xᵢ, x) + b # αᵢ are dual variables (non-zero only for support vectors!)

Part 3: Perceptron

The simplest learnable classifier — the OG neural unit

🔌 How the Perceptron Works

What

The Perceptron is a binary classifier that makes hard predictions (±1) and updates its weights whenever it makes a mistake, until all points are correctly classified.

Why it matters

The Perceptron is literally the building block of neural networks. Each "neuron" in a neural network IS a perceptron with a different activation function. Understanding it deeply means understanding deep learning.

# Perceptron model: f(x) = θᵀx + θ₀ (linear combination of features) ŷ = sign(f(x)) (+1 or -1) # Training algorithm: Initialize θ = 0 FOR each data point (xᵢ, yᵢ): ŷᵢ = sign(θᵀxᵢ) IF yᵢ × ŷᵢ ≤ 0: # misclassified! θ ← θ + η × yᵢ × xᵢ # update toward correct class Repeat until no mistakes (convergence)

🎮 Perceptron Update Visualizer

See how the perceptron updates its weights when it makes a mistake.

Test point: x = [2, 1], true label y = +1

Adjust sliders to see update...
⚠️ Perceptron Convergence Theorem The perceptron is guaranteed to converge IF the data is linearly separable. If not linearly separable → it will never converge!

✅ Advantages

  • Extremely simple and fast
  • Online learning (updates one point at a time)
  • Foundation of neural networks

❌ Disadvantages

  • Only works for linearly separable data
  • No unique solution — depends on initialization
  • No probability output (hard classification only)
⚖️

Side-by-Side Comparison

All three linear classifiers at a glance
PropertyPerceptronLogistic RegressionSVM
Output typeHard (sign)Soft (probability)Hard (sign)
Unique solution?No — any separating lineYes (global optimum)Yes — max-margin line
Handles non-separable?No (won't converge)YesYes (soft-margin SVM)
ObjectiveClassify all points correctlyMaximize likelihoodMaximize margin
OptimizationSimple update ruleGradient descent (iterative)Quadratic programming (convex)
TypeDiscriminativeDiscriminativeDiscriminative
Probabilistic?NoYesNo
🔑 Key Insight: Generative vs. Discriminative Naive Bayes = Generative (models P(X|Y) and P(Y)). All three this week = Discriminative (compute P(Y|X) or decision boundary directly). Discriminative models often outperform generative when enough data is available.

🧪 Quiz Prep — Week 3 Questions

Q1. Which of the following is NOT a discriminative model?

Q2. What is the best way to select a threshold for Logistic Regression's sigmoid output?

Q3. The separating hyperplane produced by SVM and Perceptron are both unique — True or False?

Q4. For the Perceptron, it is typical to iterate through data one point at a time — True or False?

🔬 Beyond the Slides · Graduate Depth

Core NLP Tasks: The Benchmarks Every NLP Researcher Must Know

Classification is the entry point. But NLP research spans a rich ecosystem of structured tasks — NLI, SRL, coreference, MT, QA, relation extraction — that define the benchmarks used to evaluate every frontier model. Understanding them is essential for reading papers and doing research in 2026.

🔗 Natural Language Inference (NLI) Core Benchmark

Task: Given a premise and a hypothesis, classify the relationship as ENTAILMENT NEUTRAL or CONTRADICTION.

📖 Example (SNLI dataset)

Premise: "A woman is walking her dog in the park."
H1: "Someone is outside with an animal." → ENTAILMENT
H2: "The woman is a professional dog trainer." → NEUTRAL
H3: "There are no animals in the park." → CONTRADICTION

Key Datasets

  • SNLI (2015): 570K pairs, crowd-sourced
  • MultiNLI (2018): 433K pairs, 10 genres
  • ANLI (2020): adversarially collected
  • XNLI: multilingual extension, 15 languages

Why NLI Matters in 2026

  • Used as reward signal in RLHF (does response contradict context?)
  • Hallucination detection: does output entail source?
  • Fact-checking pipelines
  • GLUE/SuperGLUE include NLI tasks (RTE, MNLI)

🎭 Semantic Role Labeling (SRL) Structure

SRL answers the question: "Who did what to whom, where, when, and how?" It identifies the predicate and labels its semantic arguments using PropBank frame schema.

📖 PropBank Example

Sentence: "The scientist published a groundbreaking paper in Nature."
ARG0 (Agent/Publisher): The scientist
V (Predicate): published
ARG1 (Theme/Published): a groundbreaking paper
ARGM-LOC (Locative): in Nature

Dataset: OntoNotes 5.0, CoNLL-2009. Models: BERT + span-based SRL (Shi & Lin 2019) achieves >85 F1. Use in 2026: knowledge extraction pipelines, question answering, event detection, LLM evaluation for factual consistency.

🔄 Coreference Resolution

Determines which mentions in a text refer to the same real-world entity. Critical for document understanding, summarization, and dialogue.

📖 Example

"Dr. Chen won the award. She had worked on the project for 10 years. It was finally recognized."
Clusters: {Dr. Chen, She} · {the project, It}

Pipeline Steps

  1. Mention detection (spans)
  2. Mention representation (BERT embeddings)
  3. Pairwise scoring (coref likelihood)
  4. Clustering (greedy or global)

Benchmarks & Models

  • OntoNotes 5.0: gold standard, 1.6M tokens
  • CoNLL-2012: shared task benchmark
  • SpanBERT: span-based, state-of-the-art ~82 F1
  • LLMs: GPT-4 can resolve coref zero-shot, ~79 F1

🗂️ Key NLP Tasks at a Glance

TaskWhat It DoesGold DatasetSOTA Model (2024)Metric
NLIPremise → hypothesis relationshipMultiNLI, SNLIDeBERTa-v3, GPT-4Accuracy
SRLWho did what to whomOntoNotes, CoNLL-09BERT + span predictorF1
CoreferenceWhich mentions = same entityOntoNotes 5.0SpanBERT, LingMessAvg F1 (MUC/B³/CEAF)
REExtract (entity, relation, entity)TACRED, DocREDREBEL (seq2seq)F1
MTTranslate between languagesWMT, FLORESNLLB-200, GPT-4BLEU, chrF, COMET
QA (Extractive)Find answer span in passageSQuAD 2.0BERT, DeBERTaEM / F1
QA (Abstractive)Generate answer from knowledgeNaturalQuestions, TriviaQARAG + LLMEM / ROUGE
SummarizationCompress document to summaryCNN/DM, XSumPEGASUS, GPT-4ROUGE-1/2/L

🏆 Relation Extraction: From Rules to LLMs Research Frontier

Task Definition

Given: "Elon Musk founded SpaceX in 2002."
Extract: (Elon Musk, founder_of, SpaceX) and (SpaceX, founded_in, 2002)

Datasets: TACRED (42 relation types), DocRED (96 types, document-level), Re-TACRED (revised, cleaner labels)

Approaches

  • Supervised: BERT → span classifier
  • Distant supervision: align KB (Freebase) with text, noisy labels
  • Seq2seq: REBEL generates (h, r, t) triples directly
  • LLM prompting: GPT-4 with structured output