🗺️

The Big Picture

Three linear classifiers, same input — different approaches
Key Shift This Week: Generative → Discriminative. Naive Bayes (Week 2) models HOW the data is generated (generative). This week's models skip that step — they learn a decision boundary that separates the classes directly (discriminative). Fewer assumptions, often better performance.
Algorithm           | Core Idea                             | Output            | Type
Perceptron          | Find ANY line that separates classes  | Hard label (±1)   | Hard classification
Logistic Regression | Find a line + convert to probability  | Probability [0,1] | Soft classification
SVM                 | Find the line with MAXIMUM margin     | Hard label (±1)   | Hard classification (max-margin)
📉

Part 1: Logistic Regression

The backbone of neural networks

🔗 The Sigmoid Function — Converting Scores to Probabilities

What

The sigmoid (logistic) function squishes any real number into the range [0, 1], making it interpretable as a probability.

Why

The linear combination of features (θᵀX) can be any real number, from -∞ to +∞, but for classification we need a probability between 0 and 1. The sigmoid provides exactly this transformation, and it arises naturally as the inverse of the log-odds (logit) function.

σ(s) = eˢ / (1 + eˢ) = 1 / (1 + e⁻ˢ)

where s = θᵀX (linear combination of features)

# s → -∞  ⇒  σ(s) → 0    (definitely class 0)
# s = 0   ⇒  σ(s) = 0.5  (totally uncertain)
# s → +∞  ⇒  σ(s) → 1    (definitely class 1)

# Classification rule:
ŷ = 1 if σ(θᵀX) ≥ threshold (e.g. 0.5)
ŷ = 0 otherwise
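The formula and classification rule translate directly to code. A minimal sketch in Python (the function names and example values are my own, not from the course):

```python
import math

def sigmoid(s: float) -> float:
    """Logistic function: maps any real score s to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

# The three regimes described above:
print(sigmoid(-10))  # ≈ 0    -> confident class 0
print(sigmoid(0))    # = 0.5  -> totally uncertain
print(sigmoid(10))   # ≈ 1    -> confident class 1

def predict(theta, x, threshold=0.5):
    """Classification rule: ŷ = 1 if σ(θᵀx) ≥ threshold, else 0."""
    s = sum(t * xi for t, xi in zip(theta, x))  # s = θᵀx
    return 1 if sigmoid(s) >= threshold else 0

print(predict([1.0, -1.0], [2.0, 1.0]))  # s = 1, σ(1) ≈ 0.73 ≥ 0.5 -> 1
```

Note that the threshold 0.5 corresponds exactly to s = 0: the sign of the linear score alone decides the class.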


⬇️ Training: Gradient Descent + MLE

How LR is Trained

Unlike Naive Bayes (just counting), Logistic Regression must be optimized iteratively. We find θ that maximizes the log-likelihood of our training data.

# Objective: maximize the log-likelihood (or minimize the negative log-likelihood)
L(θ) = Σᵢ [ yᵢ·log σ(θᵀXᵢ) + (1-yᵢ)·log(1 - σ(θᵀXᵢ)) ]

# Gradient update rule:
θ ← θ + η · ∇L(θ)       (gradient ASCENT to maximize L)
# or equivalently
θ ← θ - η · ∇(-L(θ))    (gradient DESCENT to minimize -L)

# η = learning rate (how big a step we take each iteration)
# Repeat until convergence (parameters stop changing much)
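The update rule above is short enough to implement end to end. A minimal NumPy sketch, assuming the bias is folded into the feature vector as a constant 1 (the toy dataset and function names are illustrative); it uses the standard gradient of the log-likelihood, ∇L(θ) = Σᵢ (yᵢ - σ(θᵀxᵢ))·xᵢ:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Gradient ASCENT on the log-likelihood L(θ).

    X: (n_samples, n_features) array, y: array of 0/1 labels.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        preds = sigmoid(X @ theta)        # σ(θᵀxᵢ) for every point at once
        theta += lr * X.T @ (y - preds)   # θ ← θ + η·∇L(θ)
    return theta

# Toy 1-D data; first column is the constant bias feature:
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
theta = train_logistic_regression(X, y)
print(sigmoid(X @ theta))  # probabilities near 0 for class 0, near 1 for class 1
```

Because the objective is concave in θ (equivalently, the negative log-likelihood is convex), this loop heads toward the unique global optimum regardless of the starting point.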
💡 Why no closed-form solution? Linear Regression has an analytical solution (the normal equations, via matrix inversion). In Logistic Regression, the sigmoid makes the gradient equations non-linear in θ, so no closed form exists and we must optimize iteratively with gradient descent. The good news: the negative log-likelihood is convex, so gradient descent is guaranteed to find the global optimum!
📏

Part 2: Support Vector Machine (SVM)

The maximum-margin classifier

↔️ The Margin — Why SVM is Different

What

The margin is the distance between the decision boundary and the nearest data points of each class (the "support vectors"). SVM finds the hyperplane that maximizes this margin.

Why maximize margin?

The Perceptron finds any separating line — but there are infinitely many. Which is most reliable for new data? The one with the largest gap between classes, because it's most robust to noise and new examples near the boundary.

[Figure: SVM maximum-margin hyperplane. The margin = 2/‖θ‖ is the gap between the classes; circled points (⭕) are the support vectors.]
# Decision rule (same form as the Perceptron):
f(x) = xᵀθ + b

# Classification:
ŷ = +1 if f(x) ≥ 0
ŷ = -1 if f(x) < 0

# SVM constraint (support vectors sit exactly at ±1):
yᵢ(xᵢᵀθ + b) ≥ 1   for all training points

# Objective: maximize the margin = 2/‖θ‖
# Equivalently: minimize ½‖θ‖²   (a convex quadratic program)

# Larger ‖θ‖  → smaller margin (bad)
# Smaller ‖θ‖ → larger margin (good!)
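The constraint and margin formula can be checked numerically. A sketch with a hand-picked toy dataset and a hand-derived θ (no solver involved — the data and weights are mine, chosen so the max-margin boundary is obvious):

```python
import numpy as np

# Separable toy data: class +1 on the right, class -1 on the left.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])

# The max-margin hyperplane here is x₁ = 0; scaling θ so the nearest
# points satisfy yᵢ(xᵢᵀθ + b) = 1 gives:
theta = np.array([0.5, 0.0])
b = 0.0

f = X @ theta + b                  # decision values f(xᵢ)
y_hat = np.where(f >= 0, +1, -1)   # hard ±1 classification
print(y_hat)                       # [ 1  1 -1 -1]

# Constraint check: yᵢ·f(xᵢ) ≥ 1 everywhere, = 1 only at support vectors.
print(y * f)                       # [1.  1.5 1.  1.5]

margin = 2.0 / np.linalg.norm(theta)
print(margin)                      # 4.0 — the full gap between the ±1 lines
```

Notice the two points with yᵢ·f(xᵢ) exactly 1 (the first and third): those are the support vectors, and they alone determine the boundary.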
⚠️ Quiz 3 Tested This! "The separating hyperplane produced by the SVM and perceptron are both unique" → FALSE. SVM gives a unique maximum-margin solution; Perceptron gives a non-unique arbitrary solution.

🌀 The Kernel Trick — Non-Linear SVM

What

Sometimes data is not linearly separable. The kernel trick implicitly maps data to a higher-dimensional space where it IS separable — without ever computing the high-dimensional vectors.

Why

Computing dot products in a huge feature space is expensive. The kernel function computes the inner product in the transformed space directly from the original space — efficient and powerful.

# Common kernels:
Linear kernel:      K(xᵢ, xⱼ) = xᵢᵀxⱼ
Polynomial kernel:  K(xᵢ, xⱼ) = (xᵢᵀxⱼ + c)ᵈ
RBF (Gaussian):     K(xᵢ, xⱼ) = exp(-γ‖xᵢ - xⱼ‖²)

# The dual form replaces dot products with kernel evaluations:
f(x) = Σᵢ αᵢ yᵢ K(xᵢ, x) + b

# αᵢ are dual variables (non-zero only for support vectors!)
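The "without ever computing the high-dimensional vectors" claim can be verified directly. For the polynomial kernel with d = 2, c = 1 in 2-D, the implicit feature map is the standard 6-dimensional one; a sketch comparing the two routes (the example vectors are arbitrary):

```python
import numpy as np

def poly_kernel(x, z, c=1.0, d=2):
    """K(x, z) = (xᵀz + c)ᵈ — computed entirely in the ORIGINAL 2-D space."""
    return (x @ z + c) ** d

def phi(x):
    """Explicit feature map for the d=2, c=1 polynomial kernel in 2-D:
    φ(x) = (1, √2·x₁, √2·x₂, x₁², √2·x₁x₂, x₂²), i.e. 6 dimensions.
    """
    s2 = np.sqrt(2.0)
    return np.array([1.0, s2 * x[0], s2 * x[1],
                     x[0] ** 2, s2 * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Kernel trick: one dot product and one power in 2-D...
print(poly_kernel(x, z))   # (1·3 + 2·(-1) + 1)² = 4.0
# ...gives the same number as a dot product in the 6-D feature space:
print(phi(x) @ phi(z))     # 4.0 — identical
```

For an RBF kernel the implicit feature space is infinite-dimensional, so the kernel evaluation is not just cheaper — it is the only option.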

Part 3: Perceptron

The simplest learnable classifier — the OG neural unit

🔌 How the Perceptron Works

What

The Perceptron is a binary classifier that makes hard predictions (±1) and updates its weights whenever it makes a mistake, until all points are correctly classified.

Why it matters

The Perceptron is literally the building block of neural networks. Each "neuron" in a neural network IS a perceptron with a different activation function. Understanding it deeply means understanding deep learning.

# Perceptron model:
f(x) = θᵀx + θ₀     (linear combination of features)
ŷ = sign(f(x))       (+1 or -1)

# Training algorithm:
Initialize θ = 0
FOR each data point (xᵢ, yᵢ):
    ŷᵢ = sign(θᵀxᵢ)
    IF yᵢ × ŷᵢ ≤ 0:              # misclassified!
        θ ← θ + η × yᵢ × xᵢ      # update toward correct class
Repeat until no mistakes (convergence)

🎮 Perceptron Update — Worked Example

Take the test point x = [2, 1] with true label y = +1. If the current weights misclassify it (θᵀx ≤ 0), the update θ ← θ + η·y·x adds a scaled copy of x to θ, tilting the decision boundary so that x lands on the correct (+1) side.
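The training algorithm runs as-is once translated to NumPy. A minimal sketch, with the bias folded into the features and a toy separable dataset (of my own choosing) that includes the worked-example point [2, 1]:

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, max_epochs=100):
    """Perceptron training loop: update θ only on mistakes."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (theta @ xi) <= 0:   # misclassified (or on the boundary)
                theta += lr * yi * xi    # θ ← θ + η·yᵢ·xᵢ
                mistakes += 1
        if mistakes == 0:                # a full clean pass = convergence
            break
    return theta

# Toy separable data including the worked-example point [2, 1], label +1:
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
theta = train_perceptron(X, y)
print(np.sign(X @ theta))  # [ 1.  1. -1. -1.] — every point classified correctly
```

On this data a single mistake-driven update (θ = 0 misclassifies the first point, giving θ = [2, 1]) already separates everything, illustrating why the algorithm terminates quickly when the classes are far apart.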
⚠️ Perceptron Convergence Theorem: the perceptron is guaranteed to converge in a finite number of updates IF the data is linearly separable. If it is not, the algorithm will cycle forever and never converge!

✅ Advantages

  • Extremely simple and fast
  • Online learning (updates one point at a time)
  • Foundation of neural networks

❌ Disadvantages

  • Only works for linearly separable data
  • No unique solution — depends on initialization
  • No probability output (hard classification only)
⚖️

Side-by-Side Comparison

All three linear classifiers at a glance
Property               | Perceptron                     | Logistic Regression           | SVM
Output type            | Hard (sign)                    | Soft (probability)            | Hard (sign)
Unique solution?       | No — any separating line       | Yes (global optimum)          | Yes — max-margin line
Handles non-separable? | No (won't converge)            | Yes                           | Yes (soft-margin SVM)
Objective              | Classify all points correctly  | Maximize likelihood           | Maximize margin
Optimization           | Simple update rule             | Gradient descent (iterative)  | Quadratic programming (convex)
Type                   | Discriminative                 | Discriminative                | Discriminative
Probabilistic?         | No                             | Yes                           | No
🔑 Key Insight: Generative vs. Discriminative. Naive Bayes is generative: it models P(X|Y) and P(Y). All three models this week are discriminative: they compute P(Y|X) or the decision boundary directly. Discriminative models often outperform generative ones when enough data is available.

🧪 Quiz Prep — Week 3 Questions

Q1. Which of the following is NOT a discriminative model?

Q2. What is the best way to select a threshold for Logistic Regression's sigmoid output?

Q3. The separating hyperplane produced by SVM and Perceptron are both unique — True or False?

Q4. For the Perceptron, it is typical to iterate through data one point at a time — True or False?