๐Ÿท๏ธ

What Is Sequence Labeling?

Assigning a tag to every token in a sequence

The Core Idea: One Label Per Token

โ–ผ
What

Sequence labeling is the task of mapping a sequence of input tokens (words) to a corresponding sequence of output labels โ€” one label per token. Unlike text classification (which gives one label to an entire document), sequence labeling is token-level classification.

Why

Many real-world NLP tasks require understanding at the word level: knowing that "Apple" is a company vs. a fruit, that "run" is a verb vs. a noun, or that "Bank of America" is a single entity spanning three words.

How

The input/output sequences have the same length (one tag per token). This is why encoder-only models (like BERT) are preferred โ€” they produce one representation per token, which can be directly fed into a classification layer.

Task TypeInputOutputExample
Text ClassificationFull documentOne label"This email is spam" โ†’ SPAM
Sequence LabelingN tokensN labels"Apple Inc. was founded" โ†’ [B-ORG, I-ORG, O, O]

Two Major Sequence Labeling Tasks in Module 10

โ–ผ

๐Ÿ”ค Part-of-Speech (POS) Tagging

Assign each word its grammatical role: noun, verb, adjective, adverb, etc. One tag per word, same number of tags as words.

๐Ÿ” Named Entity Recognition (NER)

Identify and categorize named entities โ€” people, places, organizations, dates โ€” which can span multiple words. Uses special IOB tagging to handle multi-word spans.

๐Ÿ”— POS tagging feeds NER! POS tags are commonly used as features for NER models. Knowing "Franklin" is a proper noun (NNP) helps NER decide whether it's a person or a city.
๐Ÿ”ค

Part-of-Speech (POS) Tagging

Labeling every word with its grammatical role

What POS Tagging Does & Why It Matters

โ–ผ
What

POS tagging labels each word with its grammatical category. The key challenge is ambiguity: the same word can be different parts of speech depending on context.

CLASSIC AMBIGUITY EXAMPLE โ€” "sitting"

"This is a sitting objective area"

Here sitting is used as an Adjective (JJ) โ€” it modifies the noun "area".

"I am sitting at my desk"

Here sitting is used as a Verb (VBG) โ€” present participle describing the action.

Why

POS tagging is useful for:

  • Machine Translation: Knowing adjective vs. noun helps with word ordering (e.g., English "spacious car" โ†’ French "voiture spacieuse" โ€” adjective comes after noun).
  • Text-to-Speech (TTS): POS determines pronunciation โ€” "I read the book every day" (present tense) vs. "I read it yesterday" (past tense, different vowel sound).
  • Transcription Evaluation: Compare POS tags of reference vs. ASR output to measure how well the model captures grammatical structure.
  • NER (downstream): POS tags are features that improve NER model performance.

Common POS Tags (Penn Treebank Tagset)

โ–ผ
TagFull NameExample
NNNoun, singular"dog", "book", "car"
NNSNoun, plural"dogs", "books"
NNPProper noun, singular"Atlanta", "Google", "Paris"
NNPSProper noun, plural"Americans", "Vikings"
VBVerb, base form"run", "visit", "travel"
VBDVerb, past tense"ran", "visited", "traveled"
VBGVerb, gerund/present participle"running", "visiting"
VBPVerb, non-3rd person singular present"run", "travel" (I/we/they run)
TOInfinitive marker "to" (before a verb)"to visit", "to run" โ€” NOT a preposition
JJAdjective"spacious", "big", "sitting"
RBAdverb"quickly", "very", "not"
DTDeterminer"the", "a", "an", "this"
INPreposition or subordinating conjunction"from", "via", "in", "of", "at"
CCCoordinating conjunction"and", "but", "or"
PRPPersonal pronoun"I", "he", "she", "we"
๐Ÿ’ก Memory trick NN = Noun, VB = Verb, JJ = adJective (J for jumbled!), RB = adveRB, DT = DeTerminer, IN = INpreposition, CC = Coordinating Conjunction.

POS Tagging Algorithms

โ–ผ

โš™๏ธ Average Perceptron Tagger (NLTK default)

  • Step 1 โ€” Vectorize: Represent each word using any encoding method (one-hot, Word2Vec, custom features).
  • Step 2 โ€” Initialize weights: Random weights for each grammatical category (NN, VB, JJ, etc.).
  • Step 3 โ€” Score: For each word, compute a dot product against each category's weight vector. The category with the highest score is predicted.
  • Step 4 โ€” Update: If the prediction is wrong vs. the true label, update the weights (perceptron update rule). Weights are averaged over all iterations ("average" perceptron) to reduce overfitting.

๐Ÿค– BERT (Encoder-based Transformer)

  • Use a fine-tuned BERT model via HuggingFace pipeline for token classification.
  • BERT produces a contextualized embedding per token โ†’ feed into a linear classification layer โ†’ predict POS tag.
  • Handles ambiguity naturally: the same word gets different embeddings depending on context.
  • Limitation: Pre-trained BERT may not recognize domain-specific proper nouns (it may sub-word-tokenize unfamiliar words). Fine-tuning on domain data helps.
โš ๏ธ Why not Encoder-Decoder for POS? Encoder-decoder (seq2seq) models are designed for tasks where input and output have different lengths (e.g., machine translation). POS tagging has equal-length input/output (one tag per token), so an encoder-decoder would be overkill โ€” adding unnecessary complexity. An encoder-only model like BERT is preferred.
# NLTK Average Perceptron Tagger (Python) import nltk from nltk.tokenize import word_tokenize from nltk.tag import pos_tag sentence = "Professor traveled from Atlanta to Paris" tokens = word_tokenize(sentence) tags = pos_tag(tokens) # Output: [('Professor', 'NNP'), ('traveled', 'VBD'), ('from', 'IN'), ...] # BERT via HuggingFace (Python) from transformers import pipeline pos_pipeline = pipeline("token-classification", model="QCRI/bert-base-multilingual-cased-pos-english") result = pos_pipeline(sentence) # Each token gets a label: {"word": "Professor", "entity": "NNP", "score": 0.98}

๐ŸŽฎ Interactive POS Tagger

Click a sentence to see token-level POS tags visualized. (Simplified โ€” shows representative tags for illustration.)
๐Ÿ”

Named Entity Recognition (NER)

Finding and categorizing named entities in text

What NER Does & Why It's Harder than POS

โ–ผ
What

NER identifies and categorizes named entities in text into predefined categories. The goal is to extract structured information from unstructured text โ€” turning free text into a knowledge base.

Why

Used in: information retrieval (finding documents about specific people/orgs), question answering, summarization, sentiment analysis (who is the sentiment about?), and knowledge graph construction.

๐Ÿ“Œ Common NER Categories

  • PER โ€” Person names ("Barack Obama")
  • LOC โ€” Locations ("Paris", "the Amazon")
  • ORG โ€” Organizations ("United Airlines")
  • DATE โ€” Time expressions ("last month")
  • GPE โ€” Geopolitical entities ("France", "Atlanta")

โš ๏ธ Two Key Difficulties

  • Segmentation: Unlike POS (one tag per word), NER entities can span multiple words ("Bank of America" = one entity).
  • Ambiguity: "Franklin" could be a person (PER) or a city in North Carolina (LOC) โ€” context determines the correct label.

IOB / BIO Tagging System โ€” The Key to Multi-word Entities

โ–ผ
What

IOB (Insideโ€“Outsideโ€“Beginning) tagging โ€” also called BIO โ€” gives a structured way to annotate entities that span multiple words. Each token gets ONE of three prefixes:

B-TYPE = Beginning of entity I-TYPE = Inside (continuation of) entity O = Outside (not an entity)
# Example: "Professor traveled from Atlanta to Paris via United Airlines" B-PER O O B-LOC O B-LOC O B-ORG I-ORG Professor traveled from Atlanta to Paris via United Airlines # Key rules: # - First token of an entity โ†’ B-TYPE (Beginning) # - Subsequent tokens of SAME entity โ†’ I-TYPE (Inside) # - Non-entity tokens โ†’ O (Outside) # - "United Airlines" = TWO tokens, ONE entity โ†’ B-ORG + I-ORG
Why IOB?

Without IOB, we couldn't distinguish: (a) two separate single-word entities back-to-back, vs. (b) one two-word entity. The B vs. I prefix solves this. IOB is also required for evaluation tools like CoNLL score.

LIVE BIO EXAMPLE โ€” click a word to highlight its entity:

NER Algorithms

โ–ผ

โš™๏ธ NLTK ne_chunk (Maximum Entropy Classifier)

  • NLTK's ne_chunk() uses a classifier trained on annotated data.
  • Uses a Maximum Entropy (MaxEnt) model โ€” predicts the probability distribution over possible NER labels given features like: POS tags, word shape, surrounding words, capitalization.
  • Returns a tree structure. Use tree2conlltags() to convert to IOB format for comparison.
  • binary=True: labels only NE vs. not-NE. binary=False: gives entity type (PERSON, GPE, etc.).
# NLTK NER (Python) import nltk from nltk.chunk import ne_chunk, tree2conlltags from nltk.tag import pos_tag from nltk.tokenize import word_tokenize sentence = "Professor traveled from Atlanta to Paris via United Airlines" tokens = word_tokenize(sentence) pos_tagged = pos_tag(tokens) # Step 1: POS tag first! tree = ne_chunk(pos_tagged) # Step 2: NER chunking iob_tags = tree2conlltags(tree) # Step 3: Convert to IOB format # iob_tags: [('Atlanta', 'NNP', 'B-GPE'), ('Paris', 'NNP', 'B-GPE'), ...]

๐Ÿค– BERT for NER (Preferred Approach)

  • Use a BERT model fine-tuned for NER (e.g., dslim/bert-base-NER on HuggingFace).
  • BERT produces one contextualized token embedding per word โ†’ a linear classification layer predicts the IOB tag.
  • BERT is encoder-only, making it ideal: one output per input token, directly mapped to one IOB label.
  • Known limitation: BERT sub-word tokenizes unusual proper nouns โ€” an entity like an unfamiliar name may get split across sub-word tokens, requiring post-processing to realign.

WHY ENCODER-ONLY (BERT) IS PREFERRED OVER ENCODER-DECODER FOR NER:

โœ… Encoder-only (BERT) โ€” Preferred
  • One output per input token (matches NER task)
  • Token outputs are independent โ†’ easy to add classification head
  • Uses masking in pre-training โ†’ richer token representations
  • Computationally efficient for sequence labeling
โœ— Encoder-Decoder โ€” Less Preferred
  • Designed for different input/output lengths (NER has equal lengths)
  • Can't easily tweak individual output positions
  • Training is more computationally expensive
  • Used for translation, summarization โ€” not token classification

๐ŸŽฎ Interactive NER Tagger with IOB Labels

Click a sentence to see IOB-tagged tokens. Hover over colored tokens to see entity details.
โš–๏ธ

POS Tagging vs. NER โ€” Side-by-Side

Key similarities and differences to know for Quiz 8

Complete Comparison Table

โ–ผ
DimensionPOS TaggingNER
GoalGrammatical role per wordNamed entity type per word/span
Tag granularityAlways 1 tag per wordIOB tags โ€” can span multiple words
Tag examplesNN, VBD, JJ, RB, DT, IN, NNPB-PER, I-ORG, B-LOC, O
Key challengeLexical ambiguity ("sitting")Segmentation + semantic ambiguity ("Franklin")
Classical algorithmAverage Perceptron Tagger (NLTK)MaxEnt NE Chunker (NLTK)
DL approachBERT (fine-tuned for POS)BERT (fine-tuned for NER)
Preferred architectureEncoder-onlyEncoder-only
Input = Output length?YesYes (with IOB, one tag per token)
RelationshipOften done firstUses POS tags as features
NLTK functionpos_tag(tokens)ne_chunk(pos_tagged)

When to Use Which Transformer Architecture

โ–ผ
ArchitectureBest ForWhyExamples
Encoder-only
(BERT, RoBERTa)
POS Tagging, NER, Classification, Q&A (extractive) Bidirectional context; one output per input token; easy classification head BERT, RoBERTa, DistilBERT
Decoder-only
(GPT family)
Text generation, completion, chatbots Autoregressive: predicts next token; unidirectional GPT-2, GPT-3, GPT-4
Encoder-Decoder
(T5, BART)
Translation, summarization, Q&A (generative) Different input/output lengths; encoder reads, decoder generates T5, BART, mBART
๐Ÿ’ก Exam insight: For tasks where input length = output length (POS, NER, token classification), always prefer encoder-only. Encoder-decoder is for tasks where lengths differ. Using encoder-decoder for POS/NER is possible but computationally wasteful.
๐Ÿ“

Quiz 8 Practice Questions

Modeled after actual Quizzes 1โ€“7 โ€” this material is covered by Quiz 9

๐ŸŽฏ Week 10 Practice Quiz โ€” POS Tagging & NER (covers Quiz 9 material)

Q1. In the sentence "I read the book every day", the word "read" is tagged as VBP (present tense). In "I read the book yesterday", the same word would be tagged differently. What does this demonstrate about POS tagging?

Q2. Which of the following correctly describes the IOB/BIO tagging scheme used in NER? (Select all that apply)

Q3. Which of the following statements is FALSE?

Q4. In NLTK, what is the correct order of operations to perform Named Entity Recognition?

Q5. What is the primary reason why encoder-decoder (seq2seq) models are generally not preferred for Part-of-Speech tagging?

๐Ÿ”ฌ Beyond the Slides ยท Graduate Depth

HMM & CRF Deep Dive: Derivations That Actually Explain Sequence Labeling

The slides show you what HMMs and CRFs do. This section shows you why they work โ€” deriving the Forward-Backward algorithm, Viterbi dynamic programming, and the CRF partition function from first principles. These derivations appear in every sequence modeling interview and in classic NLP papers.

๐ŸŽฒ HMM Fundamentals โ€” The Three Problems Mathematical

Every HMM application reduces to one of three classic problems, each with a specialized algorithm:

ProblemQuestionAlgorithmComplexity
EvaluationP(observations | model)?Forward AlgorithmO(NยฒT)
DecodingMost likely state sequence given observations?ViterbiO(NยฒT)
LearningEstimate HMM parameters from unlabeled data?Baum-Welch (EM)O(NยฒT) per iteration

N = number of hidden states, T = sequence length

โฉ Forward-Backward Algorithm Derivation PhD

Forward Variable ฮฑ

# ฮฑ_t(j) = P(oโ‚,...,oโ‚œ, qโ‚œ=j | ฮป) # Probability of observing oโ‚..oโ‚œ and # being in state j at time t Initialization (t=1): ฮฑโ‚(j) = ฯ€_j ยท b_j(oโ‚) # ฯ€_j: initial probability of state j # b_j(oโ‚): emission prob of obs oโ‚ in j Recursion (t=2..T): ฮฑ_t(j) = b_j(oโ‚œ) ยท ฮฃแตข [ฮฑ_{t-1}(i) ยท aแตขโฑผ] # Sum over all previous states i # aแตขโฑผ: transition prob iโ†’j Termination: P(O | ฮป) = ฮฃโฑผ ฮฑ_T(j)

Backward Variable ฮฒ

# ฮฒ_t(i) = P(o_{t+1},...,o_T | qโ‚œ=i, ฮป) # Probability of future observations # given we're in state i at time t Initialization (t=T): ฮฒ_T(i) = 1 for all states i Recursion (t=T-1..1): ฮฒ_t(i) = ฮฃโฑผ [aแตขโฑผ ยท b_j(o_{t+1}) ยท ฮฒ_{t+1}(j)] # Sum over all next states j State posterior (used in Baum-Welch): ฮณ_t(i) = ฮฑ_t(i)ยทฮฒ_t(i) / P(O|ฮป)

๐Ÿน Viterbi Algorithm โ€” Finding the Best Path

Viterbi is Forward algorithm with sum replaced by max. It finds the most probable hidden state sequence (e.g., POS tag sequence), not just the probability of observations.

# Viterbi variable ฮด_t(j) = max probability of reaching state j at t Initialization: ฮดโ‚(j) = ฯ€_j ยท b_j(oโ‚) ฯˆโ‚(j) = 0 # backpointer Recursion: ฮด_t(j) = max_i [ฮด_{t-1}(i) ยท aแตขโฑผ] ยท b_j(oโ‚œ) ฯˆ_t(j) = argmax_i [ฮด_{t-1}(i) ยท aแตขโฑผ] # store best predecessor Termination + Backtrack: q*_T = argmax_j ฮด_T(j) q*_t = ฯˆ_{t+1}(q*_{t+1}) # follow backpointers

Intuition: The Trellis (POS Tagging Example)

State \ Time
t=1: "Fed"
t=2: "raises"
t=3: "rates"
Best
NOUN
0.32 โ†ฯ€
0.05
0.18
โœ“
VERB
0.08
0.41
0.03
โ€”
DET
0.10
0.02
0.01
โ€”

Best path: NOUN โ†’ VERB โ†’ NOUN (Fed/NNP raises/VBZ rates/NNS) โ€” backtracking from max ฮด_T across all states.

๐Ÿ”— CRF: The Partition Function & Gradient PhD

A linear-chain CRF models P(y|x) directly (discriminative), avoiding the independence assumptions of HMMs (generative). The challenge: computing the normalization constant Z(x).

# CRF probability of label sequence y given x P(y|x) = (1/Z(x)) ยท exp(ฮฃโ‚œ score(yโ‚œโ‚‹โ‚, yโ‚œ, x, t)) # Score = feature function dot weights score(yโ‚œโ‚‹โ‚, yโ‚œ, x, t) = w ยท f(yโ‚œโ‚‹โ‚, yโ‚œ, x, t) # Partition function (normalization) Z(x) = ฮฃ_{all y sequences} exp(ฮฃโ‚œ score(yโ‚œโ‚‹โ‚,yโ‚œ,x,t)) # = exponentially many terms! # Computed efficiently with Forward algo: # Same as HMM forward, but in log space

CRF Training

Log-likelihood gradient requires computing Z(x) for each example โ€” done with the Forward algorithm (O(Tยท|Y|ยฒ)). The Viterbi algorithm decodes the best label sequence at inference time.

BERT + CRF for NER

A linear-chain CRF on top of BERT's token embeddings captures inter-label dependencies (e.g., I-ORG must follow B-ORG) that a simple softmax cannot. Still used in production NER pipelines in 2026, especially where invalid tag sequences are costly.