Week 1 – Text Preprocessing & Representations

Term	Definition	Example
Corpus	A collection of text documents	All Yelp reviews, Wikipedia, emails
Token	A single unit of text (usually a word)	"hello" in "hello world"
Vocabulary (V)	Set of all unique words in the corpus	{"cat", "dog", "fish"} → d=3
Tokenization	Splitting text into tokens	"I am here" → ["I","am","here"]
Stop Words	Common words with little meaning	"a", "the", "is", "and"
N-gram	Sequence of N consecutive tokens	Bigram of "NLP is fun": ("NLP","is"),("is","fun")
Syntax	Grammatical structure of language	Subject → Verb → Object
Semantics	Meaning carried by words/text	"bank" → financial institution or river bank?
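
The tokenization, stop-word, and n-gram rows above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the regular expression and the four-word stop list are assumptions made for the example (real stop lists are much longer, and `\w+` splits contractions such as "don't").

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "the", "is", "and"}

def tokenize(text):
    """Split text into word tokens on runs of letters, digits, or underscores."""
    return re.findall(r"\w+", text)

def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("NLP is fun")
print(tokens)                     # ['NLP', 'is', 'fun']
print(ngrams(tokens, 2))          # [('NLP', 'is'), ('is', 'fun')]
print(remove_stop_words(tokens))  # ['NLP', 'fun']
print(sorted(set(tokens)))        # vocabulary of this one-sentence corpus
```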

Technique	Method	Input	Output	Notes
Stemming	Strips affixes, mostly suffixes (rule-based)	studies	studi	Fast but may not be a real word
Lemmatization	Uses a dictionary and morphological analysis to find the base form (lemma)	studies	study	Slower but linguistically correct
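
A toy sketch of the difference between the two techniques. The suffix rules and the lemma dictionary below are hypothetical and far smaller than what real tools use (this is not the Porter algorithm, and the dictionary only stands in for a resource such as WordNet); the point is only that a stemmer applies blind string rules while a lemmatizer looks words up.

```python
# Suffix rules tried in order: (suffix, replacement). Toy rules only.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("er", ""), ("s", "")]

# Toy lemma dictionary (hypothetical entries for illustration).
LEMMAS = {"studies": "study", "mice": "mouse", "better": "good", "ran": "run"}

def stem(word, min_stem_len=3):
    """Apply the first matching suffix rule; the result may not be a real word."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:len(word) - len(suffix)] + replacement
    return word

def lemmatize(word):
    """Look the word up; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

for w in ["studies", "mice", "programmer"]:
    print(w, "-> stem:", stem(w), "| lemma:", lemmatize(w))
```

Irregular forms such as "mice" show the trade-off: no suffix rule applies, so the stemmer leaves the word alone, while the dictionary lookup returns "mouse".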

Feature	One-Hot Encoding	Bag of Words	TF-IDF
Captures frequency?	No (just 0/1)	Yes	Yes (weighted)
Penalizes common words?	No	No	Yes ← key advantage
Captures word order?	No	No	No
Word similarity?	No (orthogonal)	No	No
Best for?	Simple baselines, categorical inputs	Text classification	Search, document similarity, clustering
Matrix shape	N×d (N tokens, one one-hot row per token) or d×d (one row per vocabulary word)	N×d (N documents)	N×d (N documents)
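
The three representations in the table can be compared on a tiny corpus. This is a sketch under stated assumptions: the three documents are made up, and TF-IDF is computed as (count / document length) × log(N / df), which is only one of several common variants (libraries often add smoothing or normalization).

```python
import math

docs = ["the cat sat", "the dog sat", "the cat ran"]   # made-up corpus, N = 3
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})  # d = 5 unique words
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """d-dimensional vector with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def bag_of_words(tokens):
    """d-dimensional vector of raw term counts for one document."""
    vec = [0] * len(vocab)
    for t in tokens:
        vec[index[t]] += 1
    return vec

def tf_idf(tokens):
    """Term frequency weighted by log(N / document frequency)."""
    counts = bag_of_words(tokens)
    n_docs = len(tokenized)
    vec = []
    for word, count in zip(vocab, counts):
        df = sum(word in doc for doc in tokenized)  # documents containing word
        vec.append((count / len(tokens)) * math.log(n_docs / df))
    return vec

print(vocab)
print(one_hot("cat"))
print(bag_of_words(tokenized[0]))
print([round(x, 3) for x in tf_idf(tokenized[1])])
```

Because "the" occurs in every document, its IDF is log(3/3) = 0, so TF-IDF zeroes it out while Bag of Words still counts it. That is the "penalizes common words" row in action.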

🤔 Why Not Words? Why Not Characters?

📖 The Core Problem

Imagine training on English news. The word tokenization appears 10 times, but detokenization appears only once. A word-level model maps the rare word to [UNK], so the model is blind to it. A character-level model sees 14 individual characters with no semantic structure. Subword tokenization splits detokenization → de + ##token + ##ization: meaningful pieces, each seen thousands of times.

Word-Level Problems

Vocabulary explosion (millions of forms)
OOV for rare, new, or morphological variants
No cross-lingual sharing
"tokenization" ≠ "tokenize" + "ation" to the model

Character-Level Problems

Very long sequences → quadratic attention cost
No semantic units — model learns structure from scratch
English "play" = 4 tokens; the Arabic equivalent may be 2+
Works better for languages with small alphabets

Subword sweet spot: fixed vocab (~32K–100K), common words stay whole, rare words split into known subword pieces that are often morpheme-like. With a character or byte fallback, effectively zero OOV.
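
The two failure modes above can be seen directly in a few lines. This is a minimal sketch: the three-word closed vocabulary is hypothetical and only serves to make one word out-of-vocabulary.

```python
# Hypothetical closed word-level vocabulary for illustration.
WORD_VOCAB = {"tokenization", "is", "useful"}

def word_level(text):
    """Map each whitespace-separated word to itself or to [UNK]."""
    return [w if w in WORD_VOCAB else "[UNK]" for w in text.split()]

def char_level(text):
    """Every character becomes its own token."""
    return list(text)

print(word_level("detokenization is useful"))  # ['[UNK]', 'is', 'useful']
print(len(char_level("detokenization")))       # 14
```

The word-level model loses the rare word entirely, while the character-level model keeps it at the price of a 14-token sequence for a single word.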

⚙️ Byte Pair Encoding (BPE)

Originally a data compression algorithm (1994), adapted for NLP by Sennrich et al. (2016). Used by: GPT-2, GPT-3, GPT-4, LLaMA, Mistral, Qwen, Falcon.

```python
# BPE training: core loop.
# character_vocab, count_adjacent_pairs, and apply_merge are helper
# functions that are not shown here.
def train_bpe(corpus, target_vocab_size):
    # Step 1: Initialize with the character vocabulary
    vocab = character_vocab(corpus)            # {a, b, ..., z, </w>}
    merge_rules = []

    while len(vocab) < target_vocab_size:
        # Step 2: Count all adjacent symbol pairs
        pairs = count_adjacent_pairs(corpus)   # e.g. {('e', 's'): 9, ...}
        if not pairs:
            break                              # nothing left to merge

        # Step 3: Merge the MOST FREQUENT pair
        best = max(pairs, key=pairs.get)       # e.g. ('e', 's') with count 9
        new_token = best[0] + best[1]          # 'es'
        vocab.add(new_token)
        merge_rules.append(best)
        corpus = apply_merge(corpus, best)     # replace every 'e s' with 'es'

    return vocab, merge_rules
```

Worked Example — Vocabulary from Scratch

Corpus (word counts): low ×5, lower ×2, newest ×6, widest ×3

Step	Most Frequent Pair	Count	New Token	Effect
Init	-	-	-	l o w </w>, n e w e s t </w>…
1	(e, s)	9	`es`	n e w es t </w>
2	(es, t)	9	`est`	n e w est </w>
3	(est, </w>)	9	`est</w>`	n e w est</w>
4	(l, o)	7	`lo`	lo w </w>
5	(lo, w)	7	`low`	low </w>
…	…	…	…	…

Ties are broken by first occurrence: at step 1, (e, s), (s, t), and (t, </w>) all have count 9.

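At inference, the learned merges are replayed in order on each new word. With only the five merges above: lower → low e r </w> | newest → n e w est</w>. With a fully trained vocabulary, an unseen word such as unplayed still splits into known pieces (for example un + play + ed, depending on the merges learned) instead of becoming [UNK].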
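
A self-contained sketch of BPE training and inference that can be run to check a worked example by hand. It assumes the word counts low ×5, lower ×2, newest ×6, widest ×3 (the frequencies of the well-known Sennrich et al. toy example) and breaks ties by first occurrence; the function names are illustrative, not taken from any library.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(word_freqs, num_merges):
    """Return the learned merges as a list of (pair, count) in merge order."""
    words = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # ties: first pair encountered wins
        merges.append((best, pairs[best]))
        new_words = {}
        for symbols, freq in words.items():
            merged = merge_pair(symbols, best)
            new_words[merged] = new_words.get(merged, 0) + freq
        words = new_words
    return merges

def encode(word, merges):
    """Tokenize a new word by replaying the merges in the order they were learned."""
    symbols = tuple(word) + ("</w>",)
    for pair, _count in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

WORD_FREQS = {"low": 5, "lower": 2, "newest": 6, "widest": 3}  # assumed counts
merges = train_bpe(WORD_FREQS, num_merges=5)
for pair, count in merges:
    print(pair, count)
print(encode("lower", merges))    # ['low', 'e', 'r', '</w>']
print(encode("newest", merges))   # ['n', 'e', 'w', 'est</w>']
```

Storing each word once with its frequency, instead of rescanning raw text, is what keeps the pair counting cheap; production tokenizers add further optimizations on top of the same idea.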

📐 WordPiece vs. BPE vs. Unigram LM

WordPiece (BERT)

Merges based on likelihood gain rather than frequency:

score(A,B) = freq(AB) / (freq(A) × freq(B))

Favors informative combinations. Continuation pieces marked with ##: playing → play ##ing. Used by: BERT, DistilBERT, MobileBERT.
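
A small sketch of both halves of WordPiece: the merge score used during training and the greedy longest-match-first split used at inference. The frequency counts and the five-piece vocabulary are hypothetical values chosen for the example.

```python
def wordpiece_score(freq_ab, freq_a, freq_b):
    """score(A,B) = freq(AB) / (freq(A) * freq(B))."""
    return freq_ab / (freq_a * freq_b)

def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split; non-initial pieces carry the ## prefix."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                      # try a shorter candidate
        if piece is None:
            return [unk]                  # no piece matches: whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Same pair frequency, but the second pair is built from very common parts.
rare_parts = wordpiece_score(freq_ab=20, freq_a=100, freq_b=50)
common_parts = wordpiece_score(freq_ab=20, freq_a=1000, freq_b=500)
print(rare_parts > common_parts)          # True

VOCAB = {"play", "##ing", "de", "##token", "##ization"}  # hypothetical vocabulary
print(wordpiece_tokenize("playing", VOCAB))
print(wordpiece_tokenize("detokenization", VOCAB))
```

The score comparison shows why WordPiece can prefer a pair that plain frequency-based BPE would rank equally: the pair whose parts rarely occur apart gets the higher score.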

Unigram Language Model (T5, XLNet)

Starts with a large vocabulary and prunes by minimizing likelihood loss. Uses EM algorithm — naturally probabilistic, enables multiple segmentations for data augmentation:

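L = Σᵢ log P(xᵢ), where P(x) = Σ_{seg} P(seg) sums over all segmentations of x, and P(seg) is the product of the probabilities of the pieces in seg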
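
The sum over segmentations does not need to be enumerated explicitly; a short dynamic program computes it, and a Viterbi-style pass finds the single most probable segmentation. This is a sketch under stated assumptions: the piece probabilities are made up, and each segmentation's probability is taken as the product of its piece probabilities (the unigram independence assumption).

```python
import math

# Hypothetical piece probabilities for illustration.
PROBS = {"un": 0.1, "play": 0.2, "ed": 0.1, "played": 0.03, "u": 0.05, "n": 0.05}

def marginal_probability(word, probs):
    """P(x): sum over all segmentations of the product of piece probabilities."""
    alpha = [1.0] + [0.0] * len(word)     # alpha[i] = total mass for word[:i]
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs:
                alpha[end] += alpha[start] * probs[piece]
    return alpha[-1]

def best_segmentation(word, probs):
    """Most probable single segmentation, found with Viterbi in log space."""
    n = len(word)
    score = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and score[start] > -math.inf:
                cand = score[start] + math.log(probs[piece])
                if cand > score[end]:
                    score[end], back[end] = cand, start
    if score[n] == -math.inf:
        return None                       # word cannot be built from the pieces
    pieces, end = [], n
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1]

print(best_segmentation("unplayed", PROBS))            # ['un', 'played']
print(round(marginal_probability("unplayed", PROBS), 6))  # 0.005125
```

Sampling a segmentation in proportion to its probability, instead of always taking the best one, is what makes the multiple-segmentation data augmentation possible.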

SentencePiece — Language-Agnostic Wrapper

SentencePiece is a library (not an algorithm) that runs BPE or Unigram LM without any language-specific preprocessing — no whitespace assumptions, works for Japanese/Chinese/Arabic/Thai. Treats the space character as a normal character. Word starts marked with ▁. Used by: LLaMA, T5, Gemma, mT5, NLLB.
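
The whitespace handling can be illustrated without the library itself. This is a plain-Python sketch of the idea, not the SentencePiece API: the function names are hypothetical, and the leading marker is meant to mirror the library's usual behavior of prefixing the text with a word-start marker.

```python
WORD_START = "\u2581"   # the "▁" marker

def escape_whitespace(text):
    """Prefix the text with the marker and turn every space into the marker."""
    return WORD_START + text.replace(" ", WORD_START)

def detokenize(pieces):
    """Concatenate pieces and map markers back to spaces (lossless round trip)."""
    text = "".join(pieces).replace(WORD_START, " ")
    return text[1:] if text.startswith(" ") else text

print(escape_whitespace("Hello world"))              # ▁Hello▁world
print(detokenize(["\u2581Hello", "\u2581wor", "ld"]))  # Hello world
```

Because spaces survive as ordinary symbols inside the pieces, detokenization is plain string concatenation, with no language-specific rules about where spaces belong.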

Algorithm	Criterion	Direction	Used By	Marker
BPE	Frequency	Bottom-up merge	GPT-4, LLaMA, Mistral	Ġ (leading space, in byte-level BPE such as GPT-2)
WordPiece	Likelihood gain	Bottom-up merge	BERT, DistilBERT	## (continuation prefix)
Unigram LM	Min likelihood loss	Top-down prune	T5, XLNet, mBART	▁ (word start)
SentencePiece	BPE or Unigram	Follows the chosen algorithm	LLaMA, Gemma, mT5	▁ (word start)

📏 Fertility — Why Tokenizer Choice Is an Equity Issue in 2026

Fertility = average number of tokens per word. English BPE fertility ≈ 1.3–1.5. Arabic fertility ≈ 2.5–3.5 due to morphological richness, and Thai/CJK can be even higher. An Arabic sentence can therefore use roughly 2× as many tokens as an English sentence with the same number of words, which means fewer reasoning steps per context window, higher cost per task, and lower effective quality.
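
Fertility is simple to compute once each word's subword split is known. In this sketch the two samples are hand-written, hypothetical splits rather than the output of a real tokenizer; with a real tokenizer, each inner list would be the pieces it produces for one whitespace-delimited word.

```python
def fertility(tokenized_words):
    """Average number of subword tokens per word.

    tokenized_words: one list of subword tokens per word.
    """
    total_tokens = sum(len(pieces) for pieces in tokenized_words)
    return total_tokens / len(tokenized_words)

# Hypothetical splits for illustration only.
sample_a = [["the"], ["token", "izer"], ["works"], ["well"]]
sample_b = [["un", "believ", "able"], ["re", "organ", "ization"]]

print(fertility(sample_a))   # 1.25
print(fertility(sample_b))   # 3.0
```

At a fixed context length, a text with fertility 3.0 fits well under half as many words as one with fertility 1.25, which is the cost gap described above.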

🌍 Real Impact

GPT-3 and early LLaMA used English-dominant BPE vocabularies. A Turkish verb like çevrimiçileştiremeyebileceklerinden (one word meaning "because they may not be able to make [it] online") tokenizes into 20+ BPE tokens, forcing the model to "spend" its context budget on morphological decomposition. Models trained with larger multilingual vocabularies (for example XLM-R's 250K-token vocabulary) show much better parity. This is an active fairness research area in 2026.
