Week 10 – Sequence Labeling: POS Tagging & NER

🏷️

What Is Sequence Labeling?

Assigning a tag to every token in a sequence

The Core Idea: One Label Per Token

What

Sequence labeling is the task of mapping a sequence of input tokens (words) to a corresponding sequence of output labels — one label per token. Unlike text classification (which gives one label to an entire document), sequence labeling is token-level classification.

Why

Many real-world NLP tasks require understanding at the word level: knowing that "Apple" is a company vs. a fruit, that "run" is a verb vs. a noun, or that "Bank of America" is a single entity spanning three words.

How

The input/output sequences have the same length (one tag per token). This is why encoder-only models (like BERT) are preferred — they produce one representation per token, which can be directly fed into a classification layer.

Task Type	Input	Output	Example
Text Classification	Full document	One label	"This email is spam" → SPAM
Sequence Labeling	N tokens	N labels	"Apple Inc. was founded" → [B-ORG, I-ORG, O, O]
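
To make the contrast in the table concrete, here is a minimal sketch in plain Python. The variable names are illustrative only and the tags are copied from the table row above; nothing here comes from a library.

```python
# Toy illustration: sequence labeling yields one label per token,
# while text classification yields one label for the whole input.
tokens = ["Apple", "Inc.", "was", "founded"]
labels = ["B-ORG", "I-ORG", "O", "O"]  # tags taken from the table above

# The defining property: input and output sequences have the same length.
assert len(tokens) == len(labels)

tagged = list(zip(tokens, labels))
for token, label in tagged:
    print(f"{token:10s} {label}")

# Text classification, by contrast, maps the whole string to a single label.
document_label = "SPAM"  # e.g. for "This email is spam"
```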

Two Major Sequence Labeling Tasks in Module 10

🔤 Part-of-Speech (POS) Tagging

Assign each word its grammatical role: noun, verb, adjective, adverb, etc. One tag per word, same number of tags as words.

🔍 Named Entity Recognition (NER)

Identify and categorize named entities — people, places, organizations, dates — which can span multiple words. Uses special IOB tagging to handle multi-word spans.

🔗 POS tagging feeds NER! POS tags are commonly used as features for NER models. Knowing "Franklin" is a proper noun (NNP) helps NER decide whether it's a person or a city.

🔤

Part-of-Speech (POS) Tagging

Labeling every word with its grammatical role

What POS Tagging Does & Why It Matters

What

POS tagging labels each word with its grammatical category. The key challenge is ambiguity: the same word can be different parts of speech depending on context.

CLASSIC AMBIGUITY EXAMPLE — "sitting"

"This is a sitting objective area"

Here sitting is used as an Adjective (JJ) — it modifies the noun "area".

"I am sitting at my desk"

Here sitting is used as a Verb (VBG) — present participle describing the action.

Why

POS tagging is useful for:

Machine Translation: Knowing adjective vs. noun helps with word ordering (e.g., English "spacious car" → French "voiture spacieuse" — adjective comes after noun).
Text-to-Speech (TTS): POS determines pronunciation — "I read the book every day" (present tense) vs. "I read it yesterday" (past tense, different vowel sound).
Transcription Evaluation: Compare POS tags of reference vs. ASR output to measure how well the model captures grammatical structure.
NER (downstream): POS tags are features that improve NER model performance.
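
One possible way to sketch the transcription-evaluation idea: compare the POS tag sequence of a reference transcript with that of the ASR output. The helper name `pos_agreement` and the tag sequences are hypothetical, and using `difflib.SequenceMatcher` as the similarity measure is an assumption made for illustration, not a standard ASR metric.

```python
from difflib import SequenceMatcher


def pos_agreement(ref_tags, hyp_tags):
    """Return a similarity in [0, 1] between two POS tag sequences.

    SequenceMatcher aligns the two sequences, so the score is still defined
    when the ASR output has inserted or dropped words (lengths may differ).
    """
    return SequenceMatcher(None, ref_tags, hyp_tags).ratio()


# Hypothetical tags for a reference transcript and an ASR hypothesis.
ref = ["PRP", "VBP", "DT", "NN"]  # "I read the book" (present tense)
hyp = ["PRP", "VBD", "DT", "NN"]  # same words, "read" tagged as past tense
score = pos_agreement(ref, hyp)
print(round(score, 2))
```

A score of 1.0 means the two tag sequences are identical; lower values indicate that the grammatical structure of the hypothesis drifts from the reference.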

Common POS Tags (Penn Treebank Tagset)

Tag	Full Name	Example
NN	Noun, singular	"dog", "book", "car"
NNS	Noun, plural	"dogs", "books"
NNP	Proper noun, singular	"Atlanta", "Google", "Paris"
NNPS	Proper noun, plural	"Americans", "Vikings"
VB	Verb, base form	"run", "visit", "travel"
VBD	Verb, past tense	"ran", "visited", "traveled"
VBG	Verb, gerund/present participle	"running", "visiting"
VBP	Verb, non-3rd person singular present	"run", "travel" (I/we/they run)
TO	The word "to"	"to visit", "to run", "to Paris" (Penn Treebank uses TO for every "to", infinitival or prepositional)
JJ	Adjective	"spacious", "big", "sitting"
RB	Adverb	"quickly", "very", "not"
DT	Determiner	"the", "a", "an", "this"
IN	Preposition or subordinating conjunction	"from", "via", "in", "of", "at"
CC	Coordinating conjunction	"and", "but", "or"
PRP	Personal pronoun	"I", "he", "she", "we"

💡 Memory trick: NN = Noun, VB = Verb, JJ = adJective (J for jumbled!), RB = adveRB, DT = DeTerminer, IN = preposition (as in "in"), CC = Coordinating Conjunction.

POS Tagging Algorithms

⚙️ Averaged Perceptron Tagger (NLTK default)

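Step 1 — Vectorize: Represent each word as a feature vector. NLTK's tagger uses sparse hand-crafted features (the word itself, its suffix, neighboring words, previously predicted tags); one-hot or Word2Vec vectors are alternatives.
Step 2 — Initialize weights: Keep one weight vector per grammatical category (NN, VB, JJ, etc.). NLTK's implementation starts all weights at zero; random initialization also works.
Step 3 — Score: For each word, compute a dot product between its feature vector and each category's weight vector. The category with the highest score is predicted.
Step 4 — Update: If the prediction differs from the true label, raise the weights of the true category and lower those of the predicted one (perceptron update rule). The final weights are the average of the weights over all training steps (hence "averaged" perceptron), which reduces overfitting.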
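
The four steps above can be sketched in plain Python. This is a teaching sketch, not NLTK's implementation: the feature set, the function names (`features`, `predict`, `train`) and the two training sentences are all assumptions, and the averaging is done the naive way (summing the weights after every token).

```python
def features(tokens, i):
    """Step 1: hypothetical sparse features for token i (not NLTK's exact set)."""
    word = tokens[i].lower()
    prev_word = tokens[i - 1].lower() if i > 0 else "<START>"
    return [f"word={word}", f"suffix={word[-3:]}", f"prev={prev_word}"]


def predict(weights, feats, tags):
    """Step 3: score each tag with a sparse dot product and return the best."""
    def score(tag):
        return sum(weights.get((f, tag), 0.0) for f in feats)
    return max(tags, key=score)  # ties go to the first tag in `tags`


def train(data, tags, epochs=10):
    weights = {}  # Step 2: missing keys count as zero
    totals = {}   # running sum of every weight after each training token
    steps = 0
    for _ in range(epochs):
        for tokens, gold in data:
            for i, truth in enumerate(gold):
                feats = features(tokens, i)
                guess = predict(weights, feats, tags)
                if guess != truth:  # Step 4: update only on mistakes
                    for f in feats:
                        weights[(f, truth)] = weights.get((f, truth), 0.0) + 1.0
                        weights[(f, guess)] = weights.get((f, guess), 0.0) - 1.0
                steps += 1
                for key, value in weights.items():
                    totals[key] = totals.get(key, 0.0) + value
    # Averaging: each weight becomes its mean value over all steps.
    return {key: total / steps for key, total in totals.items()}


# Tiny hand-labeled training set (illustrative only).
data = [
    ("I am sitting at my desk".split(), ["PRP", "VBP", "VBG", "IN", "PRP$", "NN"]),
    ("This is a sitting area".split(), ["DT", "VBZ", "DT", "JJ", "NN"]),
]
tags = sorted({t for _, gold in data for t in gold})
averaged = train(data, tags)

tokens = "I am sitting at my desk".split()
predicted = [predict(averaged, features(tokens, i), tags) for i in range(len(tokens))]
print(list(zip(tokens, predicted)))
```

Production implementations avoid re-summing every weight at each step by tracking when each weight last changed; the naive loop is kept here because it makes the averaging explicit.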

🤖 BERT (Encoder-based Transformer)

Use a fine-tuned BERT model via HuggingFace pipeline for token classification.
BERT produces a contextualized embedding per token → feed into a linear classification layer → predict POS tag.
Handles ambiguity naturally: the same word gets different embeddings depending on context.
Limitation: Pre-trained BERT may not recognize domain-specific proper nouns (it may sub-word-tokenize unfamiliar words). Fine-tuning on domain data helps.
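
The "embedding per token → linear layer → tag" step can be illustrated without loading a model. In the sketch below the embeddings, weights and the three-tag set are made-up numbers chosen by hand (real BERT-base vectors have 768 dimensions and the head is learned during fine-tuning); only the shape of the computation is meant to carry over.

```python
TAGS = ["NN", "VBG", "JJ"]  # toy tag set for illustration


def linear_head(embedding, weight, bias):
    """Compute one logit per tag: logits = W @ embedding + b."""
    return [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weight, bias)
    ]


def predict_tags(embeddings, weight, bias):
    """Apply the same linear head to every token embedding independently."""
    tags = []
    for vec in embeddings:
        logits = linear_head(vec, weight, bias)
        tags.append(TAGS[logits.index(max(logits))])  # argmax over tags
    return tags


# Hypothetical classification head: one weight row per tag.
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
bias = [0.0, 0.0, 0.0]

# The same word "sitting" in two contexts gets two different (made-up)
# contextual embeddings, so the head can assign two different tags.
sitting_as_verb = [0.1, 0.9, 0.2]  # "I am sitting at my desk"
sitting_as_adj = [0.2, 0.3, 0.8]   # "This is a sitting ... area"
result = predict_tags([sitting_as_verb, sitting_as_adj], weight, bias)
print(result)
```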

⚠️ Why not Encoder-Decoder for POS? Encoder-decoder (seq2seq) models are designed for tasks where input and output have different lengths (e.g., machine translation). POS tagging has equal-length input/output (one tag per token), so an encoder-decoder would be overkill — adding unnecessary complexity. An encoder-only model like BERT is preferred.

# NLTK Averaged Perceptron Tagger (Python)
# Requires the NLTK tokenizer and tagger resources (fetch them once with nltk.download).
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

sentence = "Professor traveled from Atlanta to Paris"
tokens = word_tokenize(sentence)
tags = pos_tag(tokens)
# Output: [('Professor', 'NNP'), ('traveled', 'VBD'), ('from', 'IN'), ...]

# BERT via HuggingFace (Python)
from transformers import pipeline

pos_pipeline = pipeline("token-classification", model="QCRI/bert-base-multilingual-cased-pos-english")
result = pos_pipeline(sentence)
# Each token gets a label: {"word": "Professor", "entity": "NNP", "score": 0.98}

🔍

Named Entity Recognition (NER)

Finding and categorizing named entities in text

What NER Does & Why It's Harder than POS

What

NER identifies and categorizes named entities in text into predefined categories. The goal is to extract structured information from unstructured text — turning free text into a knowledge base.

Why

Used in: information retrieval (finding documents about specific people/orgs), question answering, summarization, sentiment analysis (who is the sentiment about?), and knowledge graph construction.

📌 Common NER Categories

PER — Person names ("Barack Obama")
LOC — Locations ("Paris", "the Amazon")
ORG — Organizations ("United Airlines")
DATE — Time expressions ("last month")
GPE — Geopolitical entities ("France", "Atlanta")

⚠️ Two Key Difficulties

Segmentation: Unlike POS (one tag per word), NER entities can span multiple words ("Bank of America" = one entity).
Ambiguity: "Franklin" could be a person (PER) or a city in North Carolina (LOC) — context determines the correct label.

IOB / BIO Tagging System — The Key to Multi-word Entities

What

IOB (Inside–Outside–Beginning) tagging — also called BIO — gives a structured way to annotate entities that span multiple words. Each token gets ONE of three prefixes:

B-TYPE = Beginning of entity
I-TYPE = Inside (continuation of) entity
O = Outside (not an entity)

# Example: "Professor traveled from Atlanta to Paris via United Airlines"
B-PER      O         O     B-LOC    O   B-LOC  O    B-ORG   I-ORG
Professor  traveled  from  Atlanta  to  Paris  via  United  Airlines

# Key rules:
# - First token of an entity → B-TYPE (Beginning)
# - Subsequent tokens of SAME entity → I-TYPE (Inside)
# - Non-entity tokens → O (Outside)
# - "United Airlines" = TWO tokens, ONE entity → B-ORG + I-ORG
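
As a rough sketch of how the rules turn entity spans into tags, the hypothetical helper `spans_to_iob` below takes token-index spans and writes B-/I-/O labels. The span format `(start, end_exclusive, type)` is an assumption chosen for this example.

```python
def spans_to_iob(tokens, spans):
    """Convert (start, end_exclusive, type) token spans into IOB tags."""
    tags = ["O"] * len(tokens)             # default: outside any entity
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"         # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"         # remaining tokens of the same entity
    return tags


tokens = "Professor traveled from Atlanta to Paris via United Airlines".split()
spans = [(0, 1, "PER"), (3, 4, "LOC"), (5, 6, "LOC"), (7, 9, "ORG")]
iob = spans_to_iob(tokens, spans)
for token, tag in zip(tokens, iob):
    print(f"{token:10s} {tag}")
```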

Why IOB?

Without IOB, we couldn't distinguish: (a) two separate single-word entities back-to-back, vs. (b) one two-word entity. The B vs. I prefix solves this. IOB is also the format expected by standard evaluation tools such as the CoNLL evaluation script (conlleval).
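
The point about B vs. I can be checked with a small decoder. The function `iob_to_spans` is a hypothetical helper written for this note; it returns `(start, end_exclusive, type)` spans and silently ignores a stray I- tag that has no open entity, which is a simplification.

```python
def iob_to_spans(tags):
    """Decode IOB tags into (start, end_exclusive, type) entity spans."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        # Close the open span unless this tag continues it.
        if start is not None and tag != f"I-{etype}":
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    if start is not None:  # entity that runs to the end of the sentence
        spans.append((start, len(tags), etype))
    return spans


two_entities = iob_to_spans(["B-ORG", "B-ORG"])  # two adjacent one-word entities
one_entity = iob_to_spans(["B-ORG", "I-ORG"])    # one two-word entity
print(two_entities)
print(one_entity)
```

With a plain "ORG ORG" labeling the two cases would be indistinguishable; the decoder recovers two spans in the first case and one span in the second.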

NER Algorithms

⚙️ NLTK ne_chunk (Maximum Entropy Classifier)

NLTK's ne_chunk() uses a classifier trained on annotated data.
Uses a Maximum Entropy (MaxEnt) model — predicts the probability distribution over possible NER labels given features like: POS tags, word shape, surrounding words, capitalization.
Returns a tree structure. Use tree2conlltags() to convert to IOB format for comparison.
binary=True: labels only NE vs. not-NE. binary=False: gives entity type (PERSON, GPE, etc.).
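
To give a feel for the kind of features listed above (POS tags, word shape, surrounding words, capitalization), here is a hedged sketch of a feature extractor. It is not NLTK's internal feature set: the function names `word_shape` and `ner_features` and the exact feature keys are assumptions for illustration.

```python
import re


def word_shape(word):
    """Collapse characters to a coarse shape, e.g. 'Atlanta' -> 'Xx'."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return re.sub(r"(.)\1+", r"\1", shape)  # squeeze repeated symbols


def ner_features(tagged, i):
    """Build a feature dict for token i of a POS-tagged sentence."""
    word, pos = tagged[i]
    prev_word = tagged[i - 1][0] if i > 0 else "<START>"
    next_word = tagged[i + 1][0] if i < len(tagged) - 1 else "<END>"
    return {
        "word": word.lower(),
        "pos": pos,
        "shape": word_shape(word),
        "is_capitalized": word[0].isupper(),
        "prev_word": prev_word.lower(),
        "next_word": next_word.lower(),
    }


# POS-tagged input, in the (word, tag) format that pos_tag produces.
tagged = [("traveled", "VBD"), ("from", "IN"), ("Atlanta", "NNP"), ("to", "TO")]
feats = ner_features(tagged, 2)
print(feats)
```

A MaxEnt classifier would consume dictionaries like this one and learn a weight for each feature/label pair.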

# NLTK NER (Python)
import nltk
from nltk.chunk import ne_chunk, tree2conlltags
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

sentence = "Professor traveled from Atlanta to Paris via United Airlines"
tokens = word_tokenize(sentence)
pos_tagged = pos_tag(tokens)        # Step 1: POS tag first!
tree = ne_chunk(pos_tagged)         # Step 2: NER chunking
iob_tags = tree2conlltags(tree)     # Step 3: Convert to IOB format
# iob_tags: [('Atlanta', 'NNP', 'B-GPE'), ('Paris', 'NNP', 'B-GPE'), ...]

🤖 BERT for NER (Preferred Approach)

Use a BERT model fine-tuned for NER (e.g., dslim/bert-base-NER on HuggingFace).
BERT produces one contextualized embedding per (sub-word) token → a linear classification layer predicts the IOB tag.
BERT is encoder-only, making it ideal: one output per input token, directly mapped to one IOB label.
Known limitation: BERT sub-word tokenizes unusual proper nouns — an entity like an unfamiliar name may get split across sub-word tokens, requiring post-processing to realign.
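
The realignment step might look like the sketch below. It assumes BERT's WordPiece convention, where a leading "##" marks a continuation piece, and it keeps the label of the first piece for the whole word, which is one common convention among several. The helper name `merge_wordpieces` and the example split of "Franklin" are hypothetical; the actual split depends on the model's vocabulary.

```python
def merge_wordpieces(pieces, labels):
    """Merge WordPiece tokens back into words, keeping the first piece's label."""
    words, word_labels = [], []
    for piece, label in zip(pieces, labels):
        if piece.startswith("##") and words:
            words[-1] += piece[2:]       # glue continuation onto previous word
        else:
            words.append(piece)          # start of a new word
            word_labels.append(label)    # the word inherits this label
    return list(zip(words, word_labels))


# Hypothetical tokenizer output and per-piece predictions.
pieces = ["Fran", "##klin", "visited", "Paris"]
labels = ["B-PER", "I-PER", "O", "B-LOC"]
merged = merge_wordpieces(pieces, labels)
print(merged)
```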

WHY ENCODER-ONLY (BERT) IS PREFERRED OVER ENCODER-DECODER FOR NER:

✅ Encoder-only (BERT) — Preferred

One output per input token (matches NER task)
Each input token has its own output vector → easy to add a classification head
Masked-language-model pre-training uses context from both directions → richer token representations
Computationally efficient for sequence labeling

✗ Encoder-Decoder — Less Preferred

Designed for different input/output lengths (NER has equal lengths)
Outputs are generated one token at a time, so tags are not tied directly to input positions
Training is more computationally expensive
Used for translation, summarization — not token classification

⚖️

POS Tagging vs. NER — Side-by-Side

Key similarities and differences to know for Quiz 8

Complete Comparison Table

Dimension	POS Tagging	NER
Goal	Grammatical role per word	Named entity type per word/span
Tag granularity	Always 1 tag per word	IOB tags — can span multiple words
Tag examples	NN, VBD, JJ, RB, DT, IN, NNP	B-PER, I-ORG, B-LOC, O
Key challenge	Lexical ambiguity ("sitting")	Segmentation + semantic ambiguity ("Franklin")
Classical algorithm	Averaged Perceptron Tagger (NLTK)	MaxEnt NE Chunker (NLTK)
DL approach	BERT (fine-tuned for POS)	BERT (fine-tuned for NER)
Preferred architecture	Encoder-only	Encoder-only
Input = Output length?	Yes	Yes (with IOB, one tag per token)
Relationship	Often done first	Uses POS tags as features
NLTK function	`pos_tag(tokens)`	`ne_chunk(pos_tagged)`

When to Use Which Transformer Architecture

Architecture	Best For	Why	Examples
Encoder-only (BERT, RoBERTa)	POS Tagging, NER, Classification, Q&A (extractive)	Bidirectional context; one output per input token; easy classification head	BERT, RoBERTa, DistilBERT
Decoder-only (GPT family)	Text generation, completion, chatbots	Autoregressive: predicts next token; unidirectional	GPT-2, GPT-3, GPT-4
Encoder-Decoder (T5, BART)	Translation, summarization, Q&A (generative)	Different input/output lengths; encoder reads, decoder generates	T5, BART, mBART

💡 Exam insight: For tasks where input length = output length (POS, NER, token classification), always prefer encoder-only. Encoder-decoder is for tasks where lengths differ. Using encoder-decoder for POS/NER is possible but computationally wasteful.

📝

Quiz 8 Practice Questions

Modeled after actual Quizzes 1–7 — this material is covered by Quiz 9

🎯 Week 10 Practice Quiz — POS Tagging & NER (covers Quiz 9 material)

Q1. In the sentence "I read the book every day", the word "read" is tagged as VBP (present tense). In "I read the book yesterday", the same word would be tagged differently. What does this demonstrate about POS tagging?

Q2. Which of the following correctly describes the IOB/BIO tagging scheme used in NER? (Select all that apply)

Q3. Which of the following statements is FALSE?

Q4. In NLTK, what is the correct order of operations to perform Named Entity Recognition?

Q5. What is the primary reason why encoder-decoder (seq2seq) models are generally not preferred for Part-of-Speech tagging?

Problem	Question	Algorithm	Complexity
Evaluation	P(observations | model)?	Forward Algorithm	O(N²T)
Decoding	Most likely state sequence given observations?	Viterbi	O(N²T)
Learning	Estimate HMM parameters from unlabeled data?	Baum-Welch (EM)	O(N²T) per iteration
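
The Evaluation and Decoding rows of the table can be sketched for a toy two-state HMM. All probabilities below are made-up numbers, and the state names NOUN/VERB are simplified stand-ins for real POS tags. Both functions loop over T observations and, at each step, over all N × N state pairs, which is where the O(N²T) cost comes from.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Evaluation: P(observations | model), summing over all state paths."""
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][o]
            for s in states
        }
    return sum(alpha.values())


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Decoding: most likely state sequence and its joint probability."""
    delta = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        new_delta, pointers = {}, {}
        for s in states:
            # Best previous state for reaching s (max replaces forward's sum).
            best_prev = max(states, key=lambda r: delta[r] * trans_p[r][s])
            new_delta[s] = delta[best_prev] * trans_p[best_prev][s] * emit_p[s][o]
            pointers[s] = best_prev
        delta = new_delta
        backpointers.append(pointers)
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for pointers in reversed(backpointers):  # follow backpointers to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]


# Toy model with hand-picked probabilities (illustrative only).
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.9, "run": 0.1}, "VERB": {"dogs": 0.1, "run": 0.9}}

obs = ["dogs", "run"]
likelihood = forward(obs, states, start_p, trans_p, emit_p)
best_path, best_prob = viterbi(obs, states, start_p, trans_p, emit_p)
print(best_path, best_prob, likelihood)
```

The best single path can never be more probable than the sum over all paths, so `best_prob <= likelihood` is a quick sanity check on any implementation.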

HMM & CRF Deep Dive: Derivations That Actually Explain Sequence Labeling

🎲 HMM Fundamentals — The Three Problems Mathematical

⏩ Forward-Backward Algorithm Derivation PhD

Forward Variable α

Backward Variable β

🏹 Viterbi Algorithm — Finding the Best Path

Intuition: The Trellis (POS Tagging Example)

🔗 CRF: The Partition Function & Gradient PhD

CRF Training

BERT + CRF for NER