Giving every token a label, from grammatical roles to named entities. The bridge between language understanding and structured information extraction.
Sequence labeling is the task of mapping a sequence of input tokens (words) to a corresponding sequence of output labels, one label per token. Unlike text classification (which gives one label to an entire document), sequence labeling is token-level classification.
Many real-world NLP tasks require understanding at the word level: knowing that "Apple" is a company vs. a fruit, that "run" is a verb vs. a noun, or that "Bank of America" is a single entity spanning three words.
The input and output sequences have the same length (one tag per token). This is why encoder-only models (like BERT) are preferred: they produce one representation per token, which can be fed directly into a classification layer.
| Task Type | Input | Output | Example |
|---|---|---|---|
| Text Classification | Full document | One label | "This email is spam" → SPAM |
| Sequence Labeling | N tokens | N labels | "Apple Inc. was founded" → [B-ORG, I-ORG, O, O] |
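To make "one representation per token, one label per token" concrete, here is a minimal sketch of a per-token classification head. The 2-dimensional token vectors and the weight matrix `W` are invented for illustration; a real encoder such as BERT would supply learned vectors with hundreds of dimensions.

```python
# Toy per-token classification head (all numbers are made up for illustration).
LABELS = ["O", "B-ORG", "I-ORG"]

tokens = ["Apple", "Inc.", "was", "founded"]

# One hypothetical 2-d "encoder output" per token.
hidden = [[2.0, 0.1], [0.2, 2.0], [-1.0, -1.0], [-1.5, -0.5]]

# Hypothetical classification-layer weights: one row per label.
W = [
    [-1.0, -1.0],  # O
    [1.0, 0.0],    # B-ORG
    [0.0, 1.0],    # I-ORG
]


def classify_token(vec):
    """Score every label for one token vector and return the best label."""
    scores = [sum(w * x for w, x in zip(row, vec)) for row in W]
    return LABELS[scores.index(max(scores))]


# The same head is applied independently to every token: N tokens in, N labels out.
token_labels = [classify_token(h) for h in hidden]
print(list(zip(tokens, token_labels)))
```

The output list always has the same length as the input, which is the defining property of sequence labeling in the table above.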
Assign each word its grammatical role: noun, verb, adjective, adverb, etc. One tag per word, same number of tags as words.
Identify and categorize named entities (people, places, organizations, dates), which can span multiple words. Uses special IOB tagging to handle multi-word spans.
POS tagging labels each word with its grammatical category. The key challenge is ambiguity: the same word can be different parts of speech depending on context.
CLASSIC AMBIGUITY EXAMPLE: "sitting"
Here "sitting" is used as an adjective (JJ): it modifies the noun "area".
Here "sitting" is used as a verb (VBG): a present participle describing the action.
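A small sketch of why this ambiguity needs context: a context-free baseline that always picks a word's most frequent tag must give "sitting" the same tag everywhere. The three tagged phrases below are hypothetical training data invented for this illustration, not taken from any corpus.

```python
from collections import Counter, defaultdict

# Hypothetical tagged examples, invented for illustration only.
training = [
    [("the", "DT"), ("sitting", "JJ"), ("area", "NN")],
    [("she", "PRP"), ("is", "VBZ"), ("sitting", "VBG")],
    [("they", "PRP"), ("are", "VBP"), ("sitting", "VBG")],
]

# Count how often each word carries each tag.
counts = defaultdict(Counter)
for sentence in training:
    for word, tag in sentence:
        counts[word][tag] += 1


def most_frequent_tag(word):
    """Context-free baseline: always return the word's most common tag."""
    if word not in counts:
        return "NN"  # simple fallback for unknown words
    return counts[word].most_common(1)[0][0]


adjective_use = [most_frequent_tag(w) for w in ["the", "sitting", "area"]]
verb_use = [most_frequent_tag(w) for w in ["she", "is", "sitting"]]
print(adjective_use)  # "sitting" gets VBG here, although JJ is correct
print(verb_use)
```

Because the baseline ignores neighboring words, it cannot get both uses right. Context-aware taggers exist to fix exactly this.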
POS tagging is useful for:
| Tag | Full Name | Example |
|---|---|---|
| NN | Noun, singular | "dog", "book", "car" |
| NNS | Noun, plural | "dogs", "books" |
| NNP | Proper noun, singular | "Atlanta", "Google", "Paris" |
| NNPS | Proper noun, plural | "Americans", "Vikings" |
| VB | Verb, base form | "run", "visit", "travel" |
| VBD | Verb, past tense | "ran", "visited", "traveled" |
| VBG | Verb, gerund/present participle | "running", "visiting" |
| VBP | Verb, non-3rd person singular present | "run", "travel" (I/we/they run) |
| TO | Infinitive marker "to" (before a verb) | "to visit", "to run" (not a preposition here) |
| JJ | Adjective | "spacious", "big", "sitting" |
| RB | Adverb | "quickly", "very", "not" |
| DT | Determiner | "the", "a", "an", "this" |
| IN | Preposition or subordinating conjunction | "from", "via", "in", "of", "at" |
| CC | Coordinating conjunction | "and", "but", "or" |
| PRP | Personal pronoun | "I", "he", "she", "we" |
NER identifies and categorizes named entities in text into predefined categories. The goal is to extract structured information from unstructured text, turning free text into a knowledge base.
Used in: information retrieval (finding documents about specific people/orgs), question answering, summarization, sentiment analysis (who is the sentiment about?), and knowledge graph construction.
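IOB (Inside-Outside-Beginning) tagging, also called BIO, gives a structured way to annotate entities that span multiple words. Each token gets one of three tag types: B- (the first token of an entity), I- (a later token inside the same entity), or O (outside any entity).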
Without IOB, we couldn't distinguish (a) two separate single-word entities back-to-back from (b) one two-word entity. The B vs. I prefix solves this. Evaluation tools such as the CoNLL scoring script also expect IOB-formatted tags.
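The B vs. I distinction can be seen by decoding tags back into entity spans. The helper `iob_to_spans` below is a hypothetical function written for this illustration (it is not a library API), and the example tokens are made up.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        # A B- tag always opens a span; an I- tag of a different type is
        # treated leniently as opening one too.
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if tag == "O" or starts_new:
            if current_tokens:  # close the span that was open
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if tag != "O":
            if starts_new:
                current_type = tag[2:]
            current_tokens.append(token)
    if current_tokens:  # close a span that runs to the end of the sentence
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Same two tokens, different tags: two entities vs. one two-word entity.
two_entities = iob_to_spans(["Atlanta", "Georgia"], ["B-LOC", "B-LOC"])
one_entity = iob_to_spans(["Atlanta", "Georgia"], ["B-LOC", "I-LOC"])
bank = iob_to_spans(
    ["Bank", "of", "America", "opened"], ["B-ORG", "I-ORG", "I-ORG", "O"]
)
print(two_entities, one_entity, bank)
```

Only the second tag differs between the first two calls, yet the decoded entities differ, which is the information a plain "entity / not entity" label would lose.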
ne_chunk() uses a classifier trained on annotated data.
tree2conlltags() to convert to IOB format for comparison.
binary=True: labels only NE vs. not-NE. binary=False: gives entity type (PERSON, GPE, etc.).
dslim/bert-base-NER on HuggingFace).
WHY ENCODER-ONLY (BERT) IS PREFERRED OVER ENCODER-DECODER FOR NER:
| Dimension | POS Tagging | NER |
|---|---|---|
| Goal | Grammatical role per word | Named entity type per word/span |
| Tag granularity | Always 1 tag per word | IOB tags; entities can span multiple words |
| Tag examples | NN, VBD, JJ, RB, DT, IN, NNP | B-PER, I-ORG, B-LOC, O |
| Key challenge | Lexical ambiguity ("sitting") | Segmentation + semantic ambiguity ("Franklin") |
| Classical algorithm | Averaged Perceptron Tagger (NLTK) | MaxEnt NE Chunker (NLTK) |
| DL approach | BERT (fine-tuned for POS) | BERT (fine-tuned for NER) |
| Preferred architecture | Encoder-only | Encoder-only |
| Input = Output length? | Yes | Yes (with IOB, one tag per token) |
| Relationship | Often done first | Uses POS tags as features |
| NLTK function | pos_tag(tokens) | ne_chunk(pos_tagged) |
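The two NLTK functions in the last row chain together with tokenization and `tree2conlltags()`. The sketch below assumes NLTK is installed and that its tokenizer, tagger, and NE chunker data packages have been downloaded (the exact package names depend on the NLTK version). The wrapper name `nltk_ner_iob` is hypothetical.

```python
def nltk_ner_iob(text, binary=False):
    """Tokenize -> POS-tag -> NE-chunk -> convert the tree to IOB triples."""
    # Imported inside the function so the sketch can be read without NLTK installed.
    import nltk
    from nltk.chunk import tree2conlltags

    tokens = nltk.word_tokenize(text)            # 1. split text into tokens
    tagged = nltk.pos_tag(tokens)                # 2. (token, POS tag) pairs
    tree = nltk.ne_chunk(tagged, binary=binary)  # 3. chunk named entities
    return tree2conlltags(tree)                  # 4. (token, POS tag, IOB tag) triples


# Example call (needs NLTK and its data packages):
# nltk_ner_iob("Apple Inc. was founded in California.")
```

The order matters because `ne_chunk()` takes POS-tagged tokens as input, so tagging has to happen before chunking.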
| Architecture | Best For | Why | Examples |
|---|---|---|---|
| Encoder-only (BERT, RoBERTa) | POS Tagging, NER, Classification, Q&A (extractive) | Bidirectional context; one output per input token; easy classification head | BERT, RoBERTa, DistilBERT |
| Decoder-only (GPT family) | Text generation, completion, chatbots | Autoregressive: predicts next token; unidirectional | GPT-2, GPT-3, GPT-4 |
| Encoder-Decoder (T5, BART) | Translation, summarization, Q&A (generative) | Different input/output lengths; encoder reads, decoder generates | T5, BART, mBART |
Q1. In the sentence "I read the book every day", the word "read" is tagged as VBP (present tense). In "I read the book yesterday", the same word would be tagged differently. What does this demonstrate about POS tagging?
Q2. Which of the following correctly describes the IOB/BIO tagging scheme used in NER? (Select all that apply)
Q3. Which of the following statements is FALSE?
Q4. In NLTK, what is the correct order of operations to perform Named Entity Recognition?
Q5. What is the primary reason why encoder-decoder (seq2seq) models are generally not preferred for Part-of-Speech tagging?
The slides show you what HMMs and CRFs do. This section shows why they work, deriving the Forward-Backward algorithm, Viterbi dynamic programming, and the CRF partition function from first principles. These derivations come up often in sequence-modeling interviews and underpin many classic NLP papers.
Every HMM application reduces to one of three classic problems, each with a specialized algorithm:
| Problem | Question | Algorithm | Complexity |
|---|---|---|---|
| Evaluation | P(observations \| model)? | Forward Algorithm | O(N²T) |
| Decoding | Most likely state sequence given observations? | Viterbi | O(N²T) |
| Learning | Estimate HMM parameters from unlabeled data? | Baum-Welch (EM) | O(N²T) per iteration |
N = number of hidden states, T = sequence length
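The Evaluation row can be sketched in a few lines. The two-state HMM below uses hypothetical probabilities chosen only for illustration. The brute-force function enumerates all N^T state sequences, to check that the O(N²T) Forward recursion gives the same answer.

```python
from itertools import product

states = ["NOUN", "VERB"]

# Hypothetical HMM parameters, chosen only for illustration.
start = {"NOUN": 0.7, "VERB": 0.3}
trans = {
    "NOUN": {"NOUN": 0.4, "VERB": 0.6},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
emit = {
    "NOUN": {"Fed": 0.5, "raises": 0.1, "rates": 0.4},
    "VERB": {"Fed": 0.1, "raises": 0.6, "rates": 0.3},
}


def forward(obs):
    """P(obs | model): alpha[s] sums the probability of all paths ending in s."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for word in obs[1:]:
        alpha = {
            s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][word]
            for s in states
        }
    return sum(alpha.values())


def brute_force(obs):
    """Same quantity by summing over every possible state sequence."""
    total = 0.0
    for path in product(states, repeat=len(obs)):
        p = start[path[0]] * emit[path[0]][obs[0]]
        for prev, cur, word in zip(path, path[1:], obs[1:]):
            p *= trans[prev][cur] * emit[cur][word]
        total += p
    return total


obs = ["Fed", "raises", "rates"]
likelihood = forward(obs)
print(likelihood, brute_force(obs))
```

Each time step does N × N multiplications (every previous state to every current state), which is where the N²T in the table comes from.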
Viterbi is the Forward algorithm with the sum replaced by a max, plus backpointers to recover the path. It finds the most probable hidden state sequence (e.g., a POS tag sequence), not just the probability of the observations.
Best path: NOUN → VERB → NOUN (Fed/NNP raises/VBZ rates/NNS), found by backtracking from the maximum δ_T over all states.
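A minimal Viterbi sketch for the "Fed raises rates" example. The transition and emission probabilities are hypothetical values picked for illustration, so the δ values are not those of any trained model. Only the mechanics (max instead of sum, then backtracking) are the point.

```python
states = ["NOUN", "VERB"]

# Hypothetical HMM parameters, chosen only for illustration.
start = {"NOUN": 0.7, "VERB": 0.3}
trans = {
    "NOUN": {"NOUN": 0.4, "VERB": 0.6},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
emit = {
    "NOUN": {"Fed": 0.5, "raises": 0.1, "rates": 0.4},
    "VERB": {"Fed": 0.1, "raises": 0.6, "rates": 0.3},
}


def viterbi(obs):
    """Return (best state path, its joint probability with obs)."""
    # delta[s] = probability of the best path that ends in state s.
    delta = {s: start[s] * emit[s][obs[0]] for s in states}
    backpointers = []
    for word in obs[1:]:
        # For each state, remember which previous state gave the max.
        best_prev = {
            s: max(states, key=lambda p: delta[p] * trans[p][s]) for s in states
        }
        delta = {
            s: delta[best_prev[s]] * trans[best_prev[s]][s] * emit[s][word]
            for s in states
        }
        backpointers.append(best_prev)
    # Backtrack from the state with the largest final delta.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for best_prev in reversed(backpointers):
        path.append(best_prev[path[-1]])
    path.reverse()
    return path, delta[last]


best_path, best_prob = viterbi(["Fed", "raises", "rates"])
print(best_path, best_prob)
```

With these toy numbers the decoded path is NOUN, VERB, NOUN, matching the example above. The only structural difference from the Forward recursion is `max` in place of `sum`, plus the stored backpointers.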
A linear-chain CRF models P(y|x) directly (discriminative), avoiding the independence assumptions of HMMs (generative). The challenge: computing the normalization constant Z(x), which sums over all |Y|^T possible label sequences.
Computing the log-likelihood and its gradient requires Z(x) for each example, which the Forward algorithm provides in O(T·|Y|²) time. At inference time, the Viterbi algorithm decodes the best label sequence.
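As a sketch of how Z(x) is computed without enumerating every label sequence, the code below runs the Forward recursion in log space and checks it against brute-force enumeration. The emission and transition scores are hypothetical, and the model is simplified by assuming no special start or stop transitions.

```python
import math
from itertools import product

labels = ["O", "B-ORG", "I-ORG"]

# Hypothetical per-token label scores for a 3-token sentence.
emissions = [
    {"O": 0.1, "B-ORG": 2.0, "I-ORG": 0.3},
    {"O": 0.2, "B-ORG": 0.1, "I-ORG": 1.5},
    {"O": 1.8, "B-ORG": 0.0, "I-ORG": 0.2},
]
# Hypothetical transition scores; O -> I-ORG is strongly penalized.
transitions = {
    "O": {"O": 0.5, "B-ORG": 0.2, "I-ORG": -4.0},
    "B-ORG": {"O": 0.0, "B-ORG": -1.0, "I-ORG": 1.0},
    "I-ORG": {"O": 0.3, "B-ORG": -0.5, "I-ORG": 0.6},
}


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(tags):
    """Unnormalized log-score of one label sequence."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score


def log_partition():
    """log Z(x) via the Forward recursion: O(T * |Y|^2)."""
    alpha = {y: emissions[0][y] for y in labels}
    for t in range(1, len(emissions)):
        alpha = {
            y: logsumexp([alpha[p] + transitions[p][y] for p in labels])
            + emissions[t][y]
            for y in labels
        }
    return logsumexp(list(alpha.values()))


def log_partition_brute_force():
    """log Z(x) by enumerating all |Y|^T sequences (for checking only)."""
    all_sequences = product(labels, repeat=len(emissions))
    return logsumexp([sequence_score(tags) for tags in all_sequences])


def log_prob(tags):
    """log P(y | x) = score(y, x) - log Z(x)."""
    return sequence_score(tags) - log_partition()


print(log_partition(), log_partition_brute_force())
print(math.exp(log_prob(("B-ORG", "I-ORG", "O"))))
```

Subtracting log Z(x) is what turns raw scores into a proper distribution: the probabilities of all 27 sequences here sum to 1.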
A linear-chain CRF on top of BERT's token embeddings captures inter-label dependencies (e.g., I-ORG may only follow B-ORG or I-ORG) that an independent per-token softmax cannot. This combination is still used in production NER pipelines in 2026, especially where invalid tag sequences are costly.
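The kind of constraint a CRF layer can learn through its transition scores can be written as an explicit check. The two helpers below are hypothetical functions written for this illustration. They only validate a tag sequence and do not implement a CRF.

```python
def is_valid_transition(prev_tag, tag):
    """I-X may only follow B-X or I-X; every other transition is allowed."""
    if not tag.startswith("I-"):
        return True
    entity_type = tag[2:]
    return prev_tag in ("B-" + entity_type, "I-" + entity_type)


def is_valid_sequence(tags):
    """Check a whole IOB sequence, treating the sentence start like an O tag."""
    previous = "O"
    for tag in tags:
        if not is_valid_transition(previous, tag):
            return False
        previous = tag
    return True


# An independent per-token argmax can produce sequences like the second one.
print(is_valid_sequence(["B-ORG", "I-ORG", "O"]))  # well-formed
print(is_valid_sequence(["O", "I-ORG", "O"]))      # I-ORG without a B-ORG
print(is_valid_sequence(["B-PER", "I-ORG"]))       # entity type changes mid-span
```

A per-token softmax scores each position on its own, so nothing stops it from emitting the second or third sequence. A CRF scores the whole sequence, so a strongly negative transition score makes such outputs unlikely.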