How do we turn messy, raw text into numbers a machine can learn from? This week lays the entire foundation.
"He is a Programmer! Check out https://abc.com 😊"
[0, 1, 0, 2, 0, 1, 0, 0, 0, 1 …] ← a row in our matrix
The pipeline: Raw Text → Preprocessing → Encoding → Numeric Matrix → ML Model
| Term | Definition | Example |
|---|---|---|
| Corpus | A collection of text documents | All Yelp reviews, Wikipedia, emails |
| Token | A single unit of text (usually a word) | "hello" in "hello world" |
| Vocabulary (V) | Set of all unique words in the corpus | {"cat", "dog", "fish"} → d=3 |
| Tokenization | Splitting text into tokens | "I am here" → ["I","am","here"] |
| Stop Words | Common words with little meaning | "a", "the", "is", "and" |
| N-gram | Sequence of N consecutive tokens | Bigram of "NLP is fun": ("NLP","is"),("is","fun") |
| Syntax | Grammatical structure of language | Subject → Verb → Object |
| Semantics | Meaning carried by words/text | "bank" → financial institution or river bank? |
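A few of these terms can be made concrete in code. The sketch below is only an illustration: the two-document corpus, the regex tokenizer, and the four-word stop list are assumptions, not a standard.

```python
import re

# Hypothetical two-document corpus, chosen only for illustration.
corpus = ["NLP is fun", "The cat is here and the dog is here"]

def tokenize(text):
    """Lowercase, then keep runs of letters/digits (a simplistic word tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    """Return every sequence of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

STOP_WORDS = {"a", "the", "is", "and"}  # tiny illustrative list

tokens_per_doc = [tokenize(doc) for doc in corpus]

# Vocabulary: the set of unique tokens across the whole corpus.
vocabulary = sorted({tok for doc in tokens_per_doc for tok in doc})

# Stop-word removal on the second document, bigrams of the first.
content_tokens = [t for t in tokens_per_doc[1] if t not in STOP_WORDS]
bigrams = ngrams(tokens_per_doc[0], 2)

print(vocabulary)
print(content_tokens)
print(bigrams)
```

Note that lowercasing happens before tokenization here, so "The" and "the" share one vocabulary entry.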
Stemming and lemmatization both reduce a word to a base/root form so that inflected versions of the same word are grouped together (e.g., "running", "runs", and "ran" are all forms of "run").
Without normalization, "program", "programming", and "programmer" are treated as 3 different features. They all carry the same root meaning. Reducing them helps the model generalize better and reduces the vocabulary size.
| Technique | Method | Input | Output | Notes |
|---|---|---|---|---|
| Stemming | Strips suffixes with heuristic rules (no dictionary) | programmer | programm | Fast, but the result may not be a real word |
| Lemmatization | Uses a dictionary and part of speech to find the lemma | programming (verb) | program | Slower but linguistically correct; "programmer" is its own lemma, so it stays unchanged |
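The contrast between rule-based and dictionary-based reduction can be sketched with two toy functions. Both are deliberately simplified assumptions: `toy_stem` uses a made-up four-suffix rule list (not the Porter algorithm), and `toy_lemmatize` uses a hand-written lookup table in place of a real dictionary. In practice you would likely reach for a library such as NLTK (`PorterStemmer`, `WordNetLemmatizer`).

```python
# Toy stemmer: strips a few common suffixes by rule, with no dictionary.
SUFFIXES = ("ing", "er", "ed", "s")  # illustrative, far from a real rule set

def toy_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: a lookup table standing in for a real dictionary.
LEMMAS = {"running": "run", "runs": "run", "ran": "run"}

def toy_lemmatize(word):
    return LEMMAS.get(word, word)

for w in ["programmer", "programming", "running"]:
    print("stem     ", w, "->", toy_stem(w))

for w in ["running", "runs", "ran"]:
    print("lemmatize", w, "->", toy_lemmatize(w))
```

The stemmer returns non-words such as "programm" and "runn" and cannot connect "ran" to "run"; the lookup handles the irregular form because it knows the word, not because of any rule.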
Text cleaning removes characters that are irrelevant to the task: punctuation, special characters, numbers, extra whitespace, HTML tags, URLs, emojis.
These characters add noise to your feature space: they inflate vocabulary size and carry little semantic meaning in most tasks. Keep them when the task depends on them (emojis in sentiment analysis, for example).
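A minimal cleaning sketch with regular expressions, applied to the example sentence from the top of these notes. The function name `clean_text`, the order of the steps, and the extra lowercasing step are assumptions; real pipelines tune these per task.

```python
import re

def clean_text(text):
    text = text.lower()                         # assumed extra step: case folding
    text = re.sub(r"https?://\S+", " ", text)   # URLs (before punctuation is stripped)
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)       # punctuation, digits, emojis, other symbols
    return re.sub(r"\s+", " ", text).strip()    # collapse extra whitespace

raw = "He is a Programmer! Check out https://abc.com 😊"
print(clean_text(raw))
```

URLs are removed first on purpose: once punctuation is stripped, `https://abc.com` would fall apart into stray words like "https" and "abc".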
In one-hot encoding (OHE), each word in the vocabulary gets its own dimension (column). A word's vector is all 0s except a single 1 at the position corresponding to that word in the vocabulary.
It's the simplest way to represent a word as a vector of numbers that a computer can store and calculate with, and it's the starting point for understanding all other encodings.
For a corpus with vocabulary size d: each word → vector of size d, with a 1 at index i (its position in the vocab). A document with n words → matrix of size n × d.
Corpus: "This is a simple sentence" → 5 unique words → d = 5
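The same five-word sentence, encoded in plain Python. This is a sketch: lowercasing and ordering the vocabulary by first appearance are assumptions made here for readability.

```python
sentence = "This is a simple sentence"
tokens = sentence.lower().split()

# Vocabulary in order of first appearance; index i is the word's column.
vocab = list(dict.fromkeys(tokens))
index = {word: i for i, word in enumerate(vocab)}
d = len(vocab)

def one_hot(word):
    vec = [0] * d
    vec[index[word]] = 1
    return vec

# One row per word in the document: n rows, d columns.
matrix = [one_hot(t) for t in tokens]
for word, row in zip(tokens, matrix):
    print(f"{word:>8} {row}")
```

Because every word in this sentence is unique, the matrix happens to be the 5 × 5 identity; a sentence with repeated words would have repeated rows.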
In Bag of Words (BoW), each document is represented as a vector of word counts. Instead of one 1 per word, we count how many times each vocabulary word appears in the document.
We want to capture that a document about "machine learning" probably uses the word "model" many times. BoW gives frequency information that OHE misses.
For N documents and vocabulary of size d → document-term matrix of shape N × d. Each cell (i, j) = count of word j in document i.
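A document-term matrix can be built with nothing more than `collections.Counter`. The two documents below are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# Hypothetical mini-corpus for illustration.
docs = ["the model learns", "the model fits the data"]
tokenized = [doc.split() for doc in docs]

# Sorted vocabulary: one column per unique word (d columns).
vocab = sorted({tok for doc in tokenized for tok in doc})

def bow_vector(tokens):
    counts = Counter(tokens)
    return [counts[word] for word in vocab]  # Counter returns 0 for absent words

# Shape N x d: one row per document, cell (i, j) = count of word j in document i.
doc_term = [bow_vector(tokens) for tokens in tokenized]

print(vocab)
for row in doc_term:
    print(row)
```

Word order is gone: "the model fits the data" and "the data fits the model" would produce the same row.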
TF-IDF weights each word by how important it is to a specific document relative to the whole corpus. Common words like "the" get low scores; rare but meaningful words get high scores.
BoW gives the word "the" the same weight as the word "neural" — but "the" appears everywhere and is useless for differentiating documents. TF-IDF fixes this by penalizing words that appear in many documents.
Example: 2 documents, 6-word vocabulary
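One way to work through an example of this size in code. The two documents are made up (they happen to give a 6-word vocabulary), and the formula used is one common textbook variant, tf = count / document length and idf = ln(N / df). Library implementations such as scikit-learn's `TfidfVectorizer` use smoothed and normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical documents, not taken from the course material.
docs = ["the cat chased the mouse", "the dog chased the ball"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
N = len(tokenized)

# Document frequency: number of documents that contain each word.
df = {word: sum(word in doc for doc in tokenized) for word in vocab}

# Inverse document frequency: 0 for words that appear in every document.
idf = {word: math.log(N / df[word]) for word in vocab}

def tfidf_vector(tokens):
    counts = Counter(tokens)
    return [(counts[word] / len(tokens)) * idf[word] for word in vocab]

matrix = [tfidf_vector(tokens) for tokens in tokenized]
for word, score in zip(vocab, matrix[0]):
    print(f"{word:>7}: {score:.3f}")
```

"the" is the most frequent word in both documents, yet its weight is zero because it appears in every document; "cat" and "mouse" carry all the weight for the first document.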
| Feature | One-Hot Encoding | Bag of Words | TF-IDF |
|---|---|---|---|
| Captures frequency? | No (just 0/1) | Yes | Yes (weighted) |
| Penalizes common words? | No | No | Yes ← key advantage |
| Captures word order? | No | No | No |
| Word similarity? | No (orthogonal) | No | No |
| Best for? | Simple baselines, categorical inputs | Text classification | Search, document similarity, clustering |
| Matrix shape | n×d per document (one row per word), or d×d for the whole vocabulary | N×d | N×d |
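The "orthogonal" entry in the table can be checked with cosine similarity. This is a sketch in plain Python; the vectors and the four-word vocabulary in the comment are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# One-hot vectors for two different words never share a non-zero position.
cat, dog = [1, 0, 0], [0, 1, 0]
print(cosine(cat, dog))

# Count vectors over a hypothetical vocabulary ["data", "model", "pizza", "the"].
doc_a = [1, 2, 0, 2]
doc_b = [2, 1, 0, 1]
doc_c = [0, 0, 3, 1]
print(round(cosine(doc_a, doc_b), 3), round(cosine(doc_a, doc_c), 3))
```

Any two distinct one-hot word vectors score exactly 0, so the encoding cannot say that "cat" is closer to "dog" than to "pizza". Document-level count vectors do overlap, which is what makes comparing documents possible.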
Q1. What is a potential limitation of the Bag of Words approach?
Q2. Which statement about stemming vs lemmatization is TRUE?
Q3. In One-Hot Encoding, a document has n words and vocabulary size d. How many 1s are in the n×d matrix?
Q4. Which representation can be used to find similar documents?
Every modern LLM — GPT-4, LLaMA, BERT, Gemini — starts with subword tokenization. This section covers the algorithms that make these models work: why character and word tokenization both fail, how BPE and WordPiece solve the OOV problem, and why tokenizer choice shapes multilingual fairness in 2026.
Imagine training on English news. The word tokenization appears 10× but detokenization appears once. A word-level model with a frequency cutoff maps it to [UNK], so the model is blind to it. A character-level model sees 14 individual characters with no semantic structure. Subword tokenization splits detokenization → de ##token ##ization: meaningful pieces, each seen thousands of times.
Subword sweet spot: fixed vocab (~32K–100K), common words stay whole, rare words split into known subword pieces. OOV essentially disappears as long as the base alphabet (characters or bytes) covers the input.
Byte Pair Encoding (BPE) was originally a data compression algorithm (Gage, 1994), adapted for NLP by Sennrich et al. (2016). Used by: GPT-2, GPT-3, GPT-4, LLaMA, Mistral, Qwen, Falcon.
Corpus (word × frequency, as in the Sennrich et al. illustration): low × 5, lower × 2, newest × 6, widest × 3. Ties are broken by first occurrence.
| Step | Most Frequent Pair | Count | New Token | Effect |
|---|---|---|---|---|
| Init | - | - | - | l o w </w>, n e w e s t </w>… |
| 1 | (e, s) | 9 | es | n e w es t </w> |
| 2 | (es, t) | 9 | est | n e w est </w> |
| 3 | (est, </w>) | 9 | est</w> | n e w est</w> |
| 4 | (l, o) | 7 | lo | lo w </w> |
| 5 | (lo, w) | 7 | low | low </w> |
| … | … | … | … | … |
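At inference, the learned merges are replayed in order on each new word, so the result depends on how many merges were learned. With a sufficiently trained merge table the splits would look like: lower → low + er | newest → new + est | unplayed → un + play + ed. Even an unseen word breaks into known pieces instead of [UNK].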
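The merge loop itself is short. The sketch below assumes a small word-frequency table (low × 5, lower × 2, newest × 6, widest × 3, the frequencies used in the well-known Sennrich et al. illustration) and breaks ties by first occurrence. Both choices are assumptions, and production tokenizers add byte-level handling and faster data structures.

```python
from collections import Counter

# Assumed word frequencies for illustration.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}

# Each word starts as a tuple of characters plus an end-of-word marker.
corpus = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}

def pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

merges = []
for _ in range(5):
    counts = pair_counts(corpus)
    best = max(counts, key=counts.get)  # on ties, the first pair encountered wins
    merges.append((best, counts[best]))
    corpus = merge_pair(corpus, best)

for (a, b), count in merges:
    print(f"({a}, {b}) count={count} -> {a + b}")
```

The ordered list of merges is the trained tokenizer: encoding a new word means splitting it into characters and replaying those merges in the same order.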
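WordPiece also builds its vocabulary by merging, but it picks the pair with the highest likelihood gain rather than the highest raw frequency. The score is commonly written as score(a, b) = count(ab) / (count(a) × count(b)).
This favors informative combinations: a pair scores high when its parts rarely occur apart. Continuation pieces are marked with ##: playing → play ##ing. Used by: BERT, DistilBERT, MobileBERT.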
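Two pieces of WordPiece can be sketched separately: the merge criterion and the inference-time split. The score used here, count(ab) / (count(a) × count(b)), is the form commonly given in tutorials; Google's original training code is not public, so treat it as an assumption. All counts and the toy vocabulary are hypothetical.

```python
from collections import Counter

# Hypothetical counts for individual symbols and adjacent pairs.
unit_counts = Counter({"t": 1000, "h": 900, "q": 20, "u": 300})
pair_counts = Counter({("t", "h"): 400, ("q", "u"): 19})

def wordpiece_score(pair):
    a, b = pair
    return pair_counts[pair] / (unit_counts[a] * unit_counts[b])

by_frequency = max(pair_counts, key=pair_counts.get)  # what BPE would merge
by_score = max(pair_counts, key=wordpiece_score)      # what this score prefers
print("frequency pick:", by_frequency)
print("score pick:    ", by_score)

def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first split; pieces after the first carry '##'."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:          # no known piece starts here
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"play", "##ing", "##ed"}  # hypothetical vocabulary
print(wordpiece_tokenize("playing", toy_vocab))
```

("t", "h") is far more frequent, but "q" almost never appears without "u", so the score ranks ("q", "u") first.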
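Unigram LM starts with a large candidate vocabulary and prunes it, at each step removing the tokens whose removal costs the least likelihood. Token probabilities are fitted with the EM algorithm, so the model is naturally probabilistic: one word can have several valid segmentations, which can be sampled during training as data augmentation (subword regularization).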
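The segmentation step of a unigram model can be sketched as a Viterbi search: among all ways to split a word into known tokens, pick the one with the highest product of token probabilities. The probabilities below are invented for illustration, and the sketch covers only segmentation, not EM training or pruning.

```python
import math

# Hypothetical token probabilities; a trained model would estimate these with EM.
probs = {
    "un": 0.05, "play": 0.04, "ed": 0.06, "played": 0.001,
    "u": 0.01, "n": 0.01, "p": 0.01, "l": 0.01,
    "a": 0.01, "y": 0.01, "e": 0.01, "d": 0.01,
}

def best_segmentation(word):
    """Return the split of `word` with the highest total log-probability."""
    # best[i] holds (log-probability, tokens) for the best split of word[:i].
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best[start][1] is not None:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]  # None if the word cannot be covered by known tokens

print(best_segmentation("unplayed"))
```

Lower-scoring splits such as un + played remain valid; sampling from them instead of always taking the best one is what gives the multiple segmentations mentioned above.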
SentencePiece is a library (not an algorithm) that runs BPE or Unigram LM without any language-specific preprocessing — no whitespace assumptions, works for Japanese/Chinese/Arabic/Thai. Treats the space character as a normal character. Word starts marked with ▁. Used by: LLaMA, T5, Gemma, mT5, NLLB.
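The ▁ convention is easy to mimic in plain Python. This sketch is not the `sentencepiece` library API; the function names and the hard-coded split are hypothetical stand-ins for a trained model.

```python
SPACE_MARK = "\u2581"  # the "▁" character

def mark_spaces(text):
    """Turn spaces into visible markers so whitespace survives tokenization."""
    return SPACE_MARK + text.replace(" ", SPACE_MARK)

def detokenize(pieces):
    """Concatenate pieces and turn markers back into spaces."""
    joined = "".join(pieces).replace(SPACE_MARK, " ")
    return joined[1:] if joined.startswith(" ") else joined

marked = mark_spaces("Hello world")

# Arbitrary split standing in for what a trained model would produce.
pieces = [marked[:4], marked[4:6], marked[6:]]
print(pieces)
print(detokenize(pieces))
```

Because the space is stored inside the pieces, detokenization is plain string concatenation, with no language-specific rules about where spaces belong.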
| Algorithm | Criterion | Direction | Used By | Boundary Marker |
|---|---|---|---|---|
| BPE | Frequency | Bottom-up merge | GPT-4, LLaMA, Mistral | Ġ (space prefix) |
| WordPiece | Likelihood gain | Bottom-up merge | BERT, DistilBERT | ## prefix |
| Unigram LM | Min likelihood loss | Top-down prune | T5, XLNet, mBART | ▁ word start |
| SentencePiece | BPE or Unigram | Language-agnostic | LLaMA, Gemma, mT5 | ▁ word start |
Fertility = average number of tokens per word. With English-centric BPE vocabularies, English fertility is roughly 1.3–1.5, while Arabic is roughly 2.5–3.5 because of its rich morphology; Thai and CJK can be higher still. An Arabic sentence therefore uses about 2× as many tokens as its English counterpart, which means less content fits in the context window, cost per task is higher, and effective quality tends to be lower.
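Fertility is simple to measure once you have a tokenizer. In the sketch below, `chunk_tokenizer` is a hypothetical stand-in that cuts words into 4-character chunks; swap in a real tokenizer's encode function to get meaningful numbers.

```python
def fertility(sentences, tokenize):
    """Average number of tokens produced per whitespace-separated word."""
    n_words = sum(len(s.split()) for s in sentences)
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    return n_tokens / n_words

# Stand-in tokenizer (hypothetical): splits each word into chunks of at most 4 characters.
def chunk_tokenizer(text, size=4):
    return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]

print(fertility(["the cat sat"], chunk_tokenizer))
print(fertility(["tokenization matters"], chunk_tokenizer))
```

Whitespace word counts only make sense for languages that separate words with spaces; for Thai or Chinese the denominator needs a word segmenter or a different unit such as characters.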
GPT-3 and early LLaMA used English-dominant BPE vocabularies. A Turkish verb like çevrimiçileştiremeyebileceklerinden (one word meaning "because they may not be able to make [it] online") tokenizes into 20+ BPE tokens — forcing the model to "spend" its context budget on morphological decomposition. Models trained on larger multilingual vocabularies (XLM-R's 250K tokens) show much better parity. This is an active fairness research area in 2026.