🗺️

The Big Picture

Why this week matters
Core Challenge of NLP: Computers only understand numbers, but language is text. Before we can use any ML model, we must answer: "How do I convert a document into meaningful numbers?" That's exactly what this week solves.

Raw Text (what we have)

"He is a Programmer! Check out https://abc.com 😊"

Processed Vector (what we need)

[0, 1, 0, 2, 0, 1, 0, 0, 0, 1 …] ← a row in our matrix

The pipeline: Raw Text → Preprocessing → Encoding → Numeric Matrix → ML Model

🧹

Part 1: Text Preprocessing

Cleaning and normalizing text before encoding

📌 NLP Terminology You Must Know

Term | Definition | Example
Corpus | A collection of text documents | All Yelp reviews, Wikipedia, emails
Token | A single unit of text (usually a word) | "hello" in "hello world"
Vocabulary (V) | Set of all unique words in the corpus | {"cat", "dog", "fish"} → d=3
Tokenization | Splitting text into tokens | "I am here" → ["I","am","here"]
Stop Words | Common words with little meaning | "a", "the", "is", "and"
N-gram | Sequence of N consecutive tokens | Bigrams of "NLP is fun": ("NLP","is"), ("is","fun")
Syntax | Grammatical structure of language | Subject → Verb → Object
Semantics | Meaning carried by words/text | "bank" → financial institution or river bank?
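The terms above can be tried out in a few lines. This is a minimal sketch using only the standard library; the helper names `tokenize` and `ngrams` and the two-document corpus are illustrative assumptions, not part of any course code.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = ["NLP is fun", "NLP is hard"]        # a tiny two-document corpus
tokenized = [tokenize(doc) for doc in corpus]

# Vocabulary = set of unique tokens across the whole corpus (sorted for a stable order)
vocabulary = sorted({tok for doc in tokenized for tok in doc})

print(tokenized[0])              # ['nlp', 'is', 'fun']
print(vocabulary)                # ['fun', 'hard', 'is', 'nlp']  → d = 4
print(ngrams(tokenized[0], 2))   # [('nlp', 'is'), ('is', 'fun')]
```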

✂️ Stemming vs. Lemmatization

What

Both techniques reduce a word to its base/root form to group together inflected versions of the same word (e.g., "running", "runs", "ran" → all mean "run").

Why

Without normalization, "program", "programming", and "programmer" are treated as 3 different features. They all carry the same root meaning. Reducing them helps the model generalize better and reduces the vocabulary size.

How — The Difference

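Technique | Method | Input | Output | Notes
Stemming | Chops off affixes, mostly suffixes (rule-based) | programmer | programm | Fast but may not be a real word
Lemmatization | Uses a dictionary to find the base form (lemma) | programmer | programmer | Slower but always a real word; inflected forms collapse, e.g. "running" → "run" when tagged as a verb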
# Python: Stemming with Porter Stemmer (NLTK)
from nltk.stem import PorterStemmer
ps = PorterStemmer()
ps.stem("programmer")              # → "programm"
ps.stem("running")                 # → "run"

# Python: Lemmatization with WordNetLemmatizer (NLTK)
from nltk.stem import WordNetLemmatizer
wl = WordNetLemmatizer()
wl.lemmatize("programmer")         # → "programmer"
wl.lemmatize("running", pos='v')   # → "run"
⚠️ Quiz 1 Tested This! "What is the output if stemming is applied to programmer?" → Answer: "programm" (not a real word — that's the nature of stemming).

🔇 Noise Removal

What

Removing irrelevant characters: punctuation, special characters, numbers, extra whitespace, HTML tags, URLs, emojis.

Why

These characters add noise to your feature space — they inflate vocabulary size and don't carry semantic meaning in most tasks.

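import re
text = "Check out https://abc.com! Score: 100% 😊"
no_urls = re.sub(r'https?://\S+', '', text)      # remove URLs first, otherwise their letters survive as "httpsabccom"
letters = re.sub(r'[^a-zA-Z\s]', '', no_urls)    # drop punctuation, digits, and emojis
clean = re.sub(r'\s+', ' ', letters).strip()     # collapse extra whitespace → "Check out Score"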
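Putting the Part 1 steps together, a preprocessing function might look like the sketch below. The function name `preprocess` is hypothetical, and the stop-word set is just the four examples from the terminology table, not a standard list.

```python
import re

# Illustrative stop-word set (the four examples from the terminology table)
STOP_WORDS = {"a", "the", "is", "and"}

def preprocess(text):
    """Lowercase, strip URLs and non-letters, tokenize, and drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs before anything else
    text = re.sub(r"[^a-z\s]", " ", text)       # remove punctuation, digits, emojis
    tokens = text.split()                       # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("He is a Programmer! Check out https://abc.com 😊"))
# ['he', 'programmer', 'check', 'out']
```

Stemming or lemmatization would normally follow as one more step on the returned tokens.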

📐

Part 2: Text Representations

Turning cleaned text into numbers

1️⃣ One-Hot Encoding (OHE)

What

Each word in the vocabulary gets its own dimension (column). A word's vector is all 0s except a single 1 at the position corresponding to that word in the vocabulary.

Why

It's the simplest way to represent a word as a number that a computer can store and calculate with. It's the starting point for understanding all other encodings.

How — The Math

For a corpus with vocabulary size d: each word → vector of size d, with a 1 at index i (its position in the vocab). A document with n words → matrix of size n × d.

Corpus: "This is a simple sentence" → 5 unique words → d = 5

⚠️ Quiz 1 Tested This! "In an n×d OHE matrix, how many 1s are there in total?" → Answer: n (exactly one 1 per row/word).
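For the five-word example sentence, the n × d matrix can be built by hand. This is a small sketch in plain Python; `one_hot_matrix` is a hypothetical helper, and the alphabetical vocabulary order is an arbitrary choice.

```python
def one_hot_matrix(tokens, vocab):
    """Return an n x d list of lists with a single 1 per row."""
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for tok in tokens:
        row = [0] * len(vocab)
        row[index[tok]] = 1      # the only non-zero entry in this row
        matrix.append(row)
    return matrix

tokens = "this is a simple sentence".split()
vocab = sorted(set(tokens))      # ['a', 'is', 'sentence', 'simple', 'this']
M = one_hot_matrix(tokens, vocab)

for tok, row in zip(tokens, M):
    print(f"{tok:<9}{row}")

print(sum(map(sum, M)))          # 5 → one 1 per word, so n ones in total

# Any two different words are orthogonal: their dot product is 0
print(sum(a * b for a, b in zip(M[0], M[1])))   # 0
```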

✅ Advantages

  • Simple and easy to implement
  • Works with any vocabulary

❌ Disadvantages

  • Vectors are huge (size = vocab size)
  • Very sparse (mostly zeros)
  • No notion of similarity — dot(good, great) = dot(good, bad) = 0

🛍️ Bag of Words (BoW)

What

Each document is represented as a vector of word counts. Instead of one 1 per word, we count how many times each vocabulary word appears in the document.

Why

We want to capture that a document about "machine learning" probably uses the word "model" many times. BoW gives frequency information that OHE misses.

How — The Math

For N documents and vocabulary of size d → document-term matrix of shape N × d. Each cell (i, j) = count of word j in document i.

# Example
Vocab: [this, was, the, best, worst, of, times]
Doc 1: "this was the best of times"  → [1, 1, 1, 1, 0, 1, 1]
Doc 2: "this was the worst of times" → [1, 1, 1, 0, 1, 1, 1]
# Note: 'this', 'was', 'the', 'of', 'times' get identical counts, so the two vectors differ only in the 'best' and 'worst' columns and the documents look almost the same.
⚠️ Quiz 1: Key Limitation. BoW (and OHE) both ignore word order and context. "Dog bites man" and "Man bites dog" produce the same BoW vector!
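The count vectors and the word-order limitation can both be checked with `collections.Counter`. This is a minimal sketch; `bow_vector` is a hypothetical helper, and in practice scikit-learn's CountVectorizer does the same job with its own tokenization rules.

```python
from collections import Counter

def bow_vector(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]   # Counter returns 0 for unseen words

vocab = ["this", "was", "the", "best", "worst", "of", "times"]
print(bow_vector("this was the best of times", vocab))    # [1, 1, 1, 1, 0, 1, 1]
print(bow_vector("this was the worst of times", vocab))   # [1, 1, 1, 0, 1, 1, 1]

# Word order is lost: both sentences map to the same vector
order_vocab = ["bites", "dog", "man"]
print(bow_vector("Dog bites man", order_vocab) == bow_vector("Man bites dog", order_vocab))  # True
```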

✅ Advantages

  • Captures word frequency (better than OHE)
  • Simple to implement (CountVectorizer)

❌ Disadvantages

  • Still huge and sparse
  • All words equally "important"
  • Ignores word order and context

📊 TF-IDF (Term Frequency – Inverse Document Frequency)

What

TF-IDF weights each word by how important it is to a specific document relative to the whole corpus. Common words like "the" get low scores; rare but meaningful words get high scores.

Why

BoW gives the word "the" the same weight as the word "neural" — but "the" appears everywhere and is useless for differentiating documents. TF-IDF fixes this by penalizing words that appear in many documents.

How — The Formula (Step by Step)

TF(t, d) = count(t in d) / total_words_in_d
    → How often does this word appear in THIS document?

IDF(t) = log( N / df(t) )
    → N = total documents, df(t) = # docs containing word t
    → If t appears in ALL docs: IDF = log(1) = 0 ← common = useless
    → If t appears in FEW docs: IDF is HIGH ← rare = distinctive

TF-IDF(t, d) = TF(t, d) × IDF(t)
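The formula translates almost line for line into Python. The sketch below assumes the natural logarithm (the notes do not fix a base) and reuses the two "best of times" / "worst of times" documents; the function names are illustrative.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: share of the document's tokens that are `term`."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, docs):
    """Inverse document frequency with natural log; assumes the term occurs in at least one doc."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc_tokens, docs):
    return tf(term, doc_tokens) * idf(term, docs)

docs = [d.split() for d in ("this was the best of times",
                            "this was the worst of times")]

print(tf_idf("the", docs[0], docs))              # 0.0 → appears in every document
print(round(tf_idf("best", docs[0], docs), 4))   # 0.1155 → (1/6) × ln(2)
```

Libraries such as scikit-learn apply a smoothed IDF by default, so their scores will not match these hand-computed values exactly.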


Example: 2 documents, 6-word vocabulary

✅ When to use TF-IDF: Use it for document search, finding similar documents, document clustering, and classification tasks where rare distinctive words matter most.
⚠️ Quiz 1: Key Fact. "TF-IDF can be used to find similar documents" → TRUE. "TF-IDF captures context" → FALSE (it still ignores word order).
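Finding similar documents usually means comparing TF-IDF vectors with cosine similarity. The sketch below is one way to do it from scratch; the three toy documents and the helper names are assumptions made for illustration.

```python
import math

def tfidf_vectors(docs):
    """Return one TF-IDF vector per document (natural-log IDF, no smoothing)."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(tokenized)
    idf = {t: math.log(n / sum(t in doc for doc in tokenized)) for t in vocab}
    return [[doc.count(t) / len(doc) * idf[t] for t in vocab] for doc in tokenized]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = ["the cat sat on the mat",
        "the cat lay on the rug",
        "stocks fell on the news"]
vecs = tfidf_vectors(docs)

# The two cat sentences share a distinctive word; the finance sentence shares only
# "the" and "on", which occur in every document and therefore get weight 0.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))   # True
```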
⚖️

Comparison: OHE vs BoW vs TF-IDF

When to use which?
Feature | One-Hot Encoding | Bag of Words | TF-IDF
Captures frequency? | No (just 0/1) | Yes | Yes (weighted)
Penalizes common words? | No | No | Yes ← key advantage
Captures word order? | No | No | No
Word similarity? | No (orthogonal) | No | No
Best for? | Simple baselines, categorical inputs | Text classification | Search, document similarity, clustering
Matrix shape | N×d (one-hot per word) or d×d | N×d | N×d
🔑 Key Insight for All Three: all three representations create sparse, high-dimensional vectors (size = vocab size d). This is why Week 4 introduces word embeddings: to replace these with small, dense vectors that actually capture meaning.

🧪 Quiz Prep — Week 1 Questions

Q1. What is a potential limitation of the Bag of Words approach?

Q2. Which statement about stemming vs lemmatization is TRUE?

Q3. In One-Hot Encoding, a document has n words and vocabulary size d. How many 1s are in the n×d matrix?

Q4. Which representation can be used to find similar documents?

🔬 Beyond the Slides · Graduate Depth

Subword Tokenization: BPE, WordPiece & Unigram LM

Every modern LLM — GPT-4, LLaMA, BERT, Gemini — starts with subword tokenization. This section covers the algorithms that make these models work: why character and word tokenization both fail, how BPE and WordPiece solve the OOV problem, and why tokenizer choice shapes multilingual fairness in 2026.

🤔 Why Not Words? Why Not Characters?

Imagine training on English news. The word tokenization appears 10×, but detokenization appears once. A word-level model maps it to [UNK], so the model is blind to it. A character-level model sees 14 individual characters with no semantic structure. Subword tokenization splits detokenization → de ##token ##ization: meaningful pieces, each seen thousands of times.

Word-Level Problems

  • Vocabulary explosion (millions of forms)
  • OOV for rare, new, or morphological variants
  • No cross-lingual sharing
  • "tokenization" ≠ "tokenize" + "ation" to the model

Character-Level Problems

  • Very long sequences → quadratic attention cost
  • No semantic units — model learns structure from scratch
  • English "play"= 4 tokens; Arabic equiv may be 2+
  • Works better for languages with small alphabets

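Subword sweet spot: fixed vocab (~32K–100K), common words stay whole, rare words split into known pieces that are often morpheme-like. OOV all but disappears, and vanishes entirely when the vocabulary falls back to single characters or bytes.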

⚙️ Byte Pair Encoding (BPE)

Originally a data compression algorithm (1994), adapted for NLP by Sennrich et al. (2016). Used by: GPT-2, GPT-3, GPT-4, LLaMA, Mistral, Qwen, Falcon.

# BPE training: core loop (character_vocab, count_adjacent_pairs and apply_merge are placeholder helpers)
def train_bpe(corpus, target_vocab_size):
    # Step 1: initialize with the character vocabulary
    vocab = character_vocab(corpus)              # {a, b, ..., z, </w>}
    merge_rules = []
    while len(vocab) < target_vocab_size:
        # Step 2: count all adjacent symbol pairs, keyed by (left, right) tuples
        pairs = count_adjacent_pairs(corpus)     # e.g. {('e', 's'): 9, ...}
        # Step 3: merge the MOST FREQUENT pair
        best = max(pairs, key=pairs.get)         # e.g. ('e', 's') → count = 9
        new_token = best[0] + best[1]            # 'es'
        vocab.add(new_token)
        merge_rules.append(best)
        corpus = apply_merge(corpus, best)       # replace every 'e s' with 'es'
    return vocab, merge_rules

Worked Example — Vocabulary from Scratch

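Corpus (word frequencies, as in the Sennrich et al. example): low × 5, lower × 2, newest × 6, widest × 3

Step | Most Frequent Pair | Count | New Token | Effect
Init | - | - | - | l o w </w>, n e w e s t </w>…
1 | (e, s) | 9 | es | n e w es t </w>
2 | (es, t) | 9 | est | n e w est </w>
3 | (est, </w>) | 9 | est</w> | n e w est</w>
4 | (l, o) | 7 | lo | lo w </w>
5 | (lo, w) | 7 | low | low </w>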

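At inference (applying only the five merges above): lower → low e r </w> | newest → n e w est</w> | unplayed → u n p l a y e d </w> (no learned merge applies, so it falls back to characters)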
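The training loop and the inference step can be run end to end on a toy corpus. The sketch below is a simplified illustration, not a production tokenizer: it assumes the word frequencies of the Sennrich et al. example (low × 5, lower × 2, newest × 6, widest × 3), breaks ties by first occurrence, and uses hypothetical helper names.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in a symbol tuple with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(word_freqs, num_merges):
    """Learn `num_merges` merge rules from a {word: frequency} dictionary."""
    # Each word starts as a tuple of characters plus an end-of-word marker
    words = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # on ties, the pair counted first wins
        merges.append(best)
        words = {merge_pair(s, best): f for s, f in words.items()}
    return merges

def encode(word, merges):
    """Tokenize a new word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=5)
print(merges)
# [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]
print(encode("lower", merges))    # ['low', 'e', 'r', '</w>']
print(encode("newest", merges))   # ['n', 'e', 'w', 'est</w>']
```

With only five merges most words are still split into characters; real tokenizers learn tens of thousands of merges, which is what lets common words end up as single tokens.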

📐 WordPiece vs. BPE vs. Unigram LM

WordPiece (BERT)

Merges based on likelihood gain rather than frequency:

score(A,B) = freq(AB) / (freq(A) × freq(B))

Favors pairs that occur together more often than their individual frequencies would predict. Continuation pieces are marked with ##: playing → play ##ing. Used by: BERT, DistilBERT, MobileBERT.
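Two things are worth seeing in code: how the score differs from a raw frequency count, and how ## pieces are produced at inference. The sketch below uses made-up counts and a made-up four-entry vocabulary; the greedy longest-match-first loop mirrors the strategy BERT's tokenizer uses, but the function names here are illustrative.

```python
def wordpiece_score(pair_freq, freq_a, freq_b):
    """score(A, B) = freq(AB) / (freq(A) * freq(B))"""
    return pair_freq / (freq_a * freq_b)

# Hypothetical counts: a frequent pair of very common symbols vs. a rarer pair
# whose parts almost always occur together.
common = wordpiece_score(pair_freq=50, freq_a=1000, freq_b=800)
tight = wordpiece_score(pair_freq=20, freq_a=25, freq_b=30)
print(common < tight)   # True → WordPiece merges the second pair, BPE would pick the first

def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation; non-initial pieces carry a ## prefix."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:      # no piece matches: the whole word becomes [UNK]
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

vocab = {"play", "##ing", "##ed", "un"}       # toy vocabulary
print(wordpiece_tokenize("playing", vocab))   # ['play', '##ing']
print(wordpiece_tokenize("xyz", vocab))       # ['[UNK]']
```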

Unigram Language Model (T5, XLNet)

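Starts with a large candidate vocabulary and repeatedly prunes the tokens whose removal costs the least corpus likelihood. Token probabilities are fit with the EM algorithm. The model is naturally probabilistic, so one word can be sampled into several segmentations for data augmentation:

L = Σᵢ log P(xᵢ), where P(x) = Σ_{seg} P(seg) sums over every segmentation of x and P(seg) is the product of its token probabilities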
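Given token probabilities, the most likely single segmentation is found with a Viterbi-style dynamic program. The sketch below covers only that decoding step (not EM training or pruning), and the probabilities in `token_probs` are invented for illustration.

```python
import math

def best_segmentation(word, token_probs):
    """Return the segmentation maximizing the product of token probabilities, or None."""
    n = len(word)
    # best[i] = (best log-probability of word[:i], start index of its last token)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in token_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(token_probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None                      # word cannot be built from the vocabulary
    tokens, end = [], n
    while end > 0:                       # walk the back-pointers from the end
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1]

# Hypothetical unigram probabilities
token_probs = {"un": 0.1, "play": 0.2, "ed": 0.15, "played": 0.001, "u": 0.01, "n": 0.01}
print(best_segmentation("unplayed", token_probs))   # ['un', 'play', 'ed']
```

Here un + play + ed scores 0.1 × 0.2 × 0.15 = 0.003, which beats un + played at 0.1 × 0.001 = 0.0001.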

SentencePiece — Language-Agnostic Wrapper

SentencePiece is a library (not an algorithm) that runs BPE or Unigram LM without any language-specific preprocessing — no whitespace assumptions, works for Japanese/Chinese/Arabic/Thai. Treats the space character as a normal character. Word starts marked with ▁. Used by: LLaMA, T5, Gemma, mT5, NLLB.

Algorithm | Criterion | Direction | Used By | Boundary Marker
BPE | Frequency | Bottom-up merge | GPT-4, LLaMA, Mistral | Ġ (space prefix)
WordPiece | Likelihood gain | Bottom-up merge | BERT, DistilBERT | ## prefix (continuation)
Unigram LM | Min likelihood loss | Top-down prune | T5, XLNet, mBART | ▁ word start
SentencePiece | BPE or Unigram | Language-agnostic | LLaMA, Gemma, mT5 | ▁ word start

📏 Fertility — Why Tokenizer Choice Is an Equity Issue in 2026

Fertility = average tokens per word. English BPE fertility ≈ 1.3–1.5. Arabic fertility ≈ 2.5–3.5 due to morphological richness. Thai/CJK can be even higher. This means an Arabic sentence uses roughly 2× as many tokens as its English counterpart: fewer reasoning steps per context window, higher cost per task, lower effective quality.
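Fertility is easy to measure for any tokenizer that maps a sentence to a list of tokens. In the sketch below, `chunk_tokenizer` is a deliberately crude stand-in (fixed 4-character chunks) so the example runs without external libraries; it is not a real BPE tokenizer, and words are counted by whitespace, which is itself an assumption that fits poorly for languages written without spaces.

```python
def fertility(sentences, tokenize):
    """Average number of tokens per whitespace-separated word."""
    total_tokens = sum(len(tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

def chunk_tokenizer(sentence, size=4):
    """Stand-in tokenizer: cut every word into chunks of at most `size` characters."""
    return [w[i:i + size] for w in sentence.split() for i in range(0, len(w), size)]

# "tokenization" → 3 chunks, "matters" → 2 chunks: 5 tokens over 2 words
print(fertility(["tokenization matters"], chunk_tokenizer))   # 2.5
```

Passing a real tokenizer's encode function as `tokenize` gives the figure for that tokenizer on a given language sample.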

GPT-3 and early LLaMA used English-dominant BPE vocabularies. A Turkish verb like çevrimiçileştiremeyebileceklerinden (one word meaning "because they may not be able to make [it] online") tokenizes into 20+ BPE tokens — forcing the model to "spend" its context budget on morphological decomposition. Models trained on larger multilingual vocabularies (XLM-R's 250K tokens) show much better parity. This is an active fairness research area in 2026.
