How do we turn messy, raw text into numbers a machine can learn from? This week lays the entire foundation.
"He is a Programmer! Check out https://abc.com 😊"
[0, 1, 0, 2, 0, 1, 0, 0, 0, 1 …] ← a row in our matrix
The pipeline: Raw Text → Preprocessing → Encoding → Numeric Matrix → ML Model
| Term | Definition | Example |
|---|---|---|
| Corpus | A collection of text documents | All Yelp reviews, Wikipedia, emails |
| Token | A single unit of text (usually a word) | "hello" in "hello world" |
| Vocabulary (V) | Set of all unique words in the corpus | {"cat", "dog", "fish"} → d=3 |
| Tokenization | Splitting text into tokens | "I am here" → ["I","am","here"] |
| Stop Words | Common words with little meaning | "a", "the", "is", "and" |
| N-gram | Sequence of N consecutive tokens | Bigram of "NLP is fun": ("NLP","is"),("is","fun") |
| Syntax | Grammatical structure of language | Subject → Verb → Object |
| Semantics | Meaning carried by words/text | "bank" → financial institution or river bank? |
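The tokenization and n-gram rows from the table can be sketched in plain Python (whitespace splitting is an assumption here; real tokenizers also handle punctuation and casing):

```python
text = "NLP is fun"

# Tokenization: simplest approach is whitespace splitting
tokens = text.split()                    # ["NLP", "is", "fun"]

# Bigrams: pair each token with its successor
bigrams = list(zip(tokens, tokens[1:]))  # [("NLP", "is"), ("is", "fun")]

print(tokens)
print(bigrams)
```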
Both techniques reduce a word to its base/root form so that inflected variants of the same word map to a single feature (e.g., "running", "runs", "ran" → all reduce to "run").
Without normalization, "program", "programming", and "programmer" are treated as 3 different features even though they carry the same root meaning. Reducing them to one form helps the model generalize better and shrinks the vocabulary.
| Technique | Method | Input | Output | Notes |
|---|---|---|---|---|
| Stemming | Chops off suffixes with heuristic rules | programmer | programm | Fast, but the output may not be a real word |
| Lemmatization | Looks up the dictionary form (lemma), using part of speech | ran | run | Slower, but always a real word; handles irregular forms |
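To make the contrast concrete, here is a deliberately tiny hand-rolled sketch; the suffix list and lemma dictionary are illustrative assumptions, and in practice you would use a library such as NLTK's `PorterStemmer` and `WordNetLemmatizer`:

```python
def toy_stem(word):
    # Stemming: blindly strip common suffixes (rule-based, no dictionary)
    for suffix in ("ing", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Lemmatization: dictionary lookup of the true root (tiny toy dictionary)
TOY_LEMMAS = {"ran": "run", "running": "run", "runs": "run"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

print(toy_stem("programmer"))   # programm  (not a real word)
print(toy_lemmatize("ran"))     # run       (irregular form handled by lookup)
print(toy_stem("ran"))          # ran       (stemming fails: no suffix to strip)
```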
Removing irrelevant characters: punctuation, special characters, numbers, extra whitespace, HTML tags, URLs, emojis.
These characters add noise to your feature space — they inflate vocabulary size and don't carry semantic meaning in most tasks.
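A minimal cleaning pass over the example sentence from the top of these notes might look like this; the exact regexes are assumptions, and real pipelines tune them per task:

```python
import re

def clean_text(text):
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"<[^>]+>", " ", text)        # drop HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # drop punctuation, digits, emojis
    text = re.sub(r"\s+", " ", text).strip()    # collapse extra whitespace
    return text.lower()

print(clean_text("He is a Programmer! Check out https://abc.com 😊"))
# he is a programmer check out
```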
Each word in the vocabulary gets its own dimension (column). A word's vector is all 0s except a single 1 at the position corresponding to that word in the vocabulary.
It is the simplest way to represent a word as a number a computer can store and compute with, and the starting point for understanding all the other encodings.
For a corpus with vocabulary size d: each word → vector of size d, with a 1 at index i (its position in the vocab). A document with n words → matrix of size n × d.
Corpus: "This is a simple sentence" → 5 unique words → d = 5
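A sketch of this construction for the corpus above (building the vocabulary by sorting the unique lowercased tokens is an assumption; any fixed ordering works):

```python
corpus = "This is a simple sentence"
tokens = corpus.lower().split()
vocab = sorted(set(tokens))      # ["a", "is", "sentence", "simple", "this"], d = 5

def one_hot(word, vocab):
    # All zeros except a single 1 at the word's index in the vocabulary
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

matrix = [one_hot(t, vocab) for t in tokens]  # n × d = 5 × 5
print(one_hot("simple", vocab))               # [0, 0, 0, 1, 0]
```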
Each document is represented as a vector of word counts. Instead of one 1 per word, we count how many times each vocabulary word appears in the document.
We want to capture that a document about "machine learning" probably uses the word "model" many times. BoW gives frequency information that OHE misses.
For N documents and vocabulary of size d → document-term matrix of shape N × d. Each cell (i, j) = count of word j in document i.
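A sketch with a toy 2-document corpus (the documents are made up for illustration; scikit-learn's `CountVectorizer` does the same job in practice):

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]
vocab = sorted({w for d in docs for w in d.split()})
# ["cat", "dog", "mat", "on", "sat", "the"] → d = 6

def bow_vector(doc, vocab):
    counts = Counter(doc.split())              # word → frequency in this document
    return [counts[w] for w in vocab]

matrix = [bow_vector(d, vocab) for d in docs]  # N × d = 2 × 6
print(matrix[0])  # [1, 0, 1, 1, 1, 2] ← "the" counted twice
```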
TF-IDF weights each word by how important it is to a specific document relative to the whole corpus. Common words like "the" get low scores; rare but meaningful words get high scores.
BoW gives the word "the" the same weight as the word "neural" — but "the" appears everywhere and is useless for differentiating documents. TF-IDF fixes this by penalizing words that appear in many documents.
Example: 2 documents, 6-word vocabulary
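A worked sketch of such an example, using one common TF-IDF variant (term frequency normalized by document length, unsmoothed idf); note that scikit-learn's `TfidfVectorizer` uses a smoothed idf, so its numbers would differ:

```python
import math
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]       # 2 documents
tokenized = [d.split() for d in docs]
vocab = sorted({w for toks in tokenized for w in toks})  # 6-word vocabulary
N = len(docs)

# df(t) = number of documents containing term t
df = {w: sum(w in toks for toks in tokenized) for w in vocab}

def tfidf_vector(toks):
    counts = Counter(toks)
    # tf(t, d) = count / doc length;  idf(t) = log(N / df(t))
    return [(counts[w] / len(toks)) * math.log(N / df[w]) for w in vocab]

matrix = [tfidf_vector(toks) for toks in tokenized]    # N × d = 2 × 6

# "the" and "sat" occur in every document → idf = log(2/2) = 0 → weight 0
# "cat" occurs only in doc 0 → positive weight there
```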
| Feature | One-Hot Encoding | Bag of Words | TF-IDF |
|---|---|---|---|
| Captures frequency? | No (just 0/1) | Yes | Yes (weighted) |
| Penalizes common words? | No | No | Yes ← key advantage |
| Captures word order? | No | No | No |
| Word similarity? | No (orthogonal) | No | No |
| Best for? | Simple baselines, categorical inputs | Text classification | Search, document similarity, clustering |
| Matrix shape | n×d per document (one row per word) | N×d | N×d |
Q1. What is a potential limitation of the Bag of Words approach?
Q2. Which statement about stemming vs lemmatization is TRUE?
Q3. In One-Hot Encoding, a document has n words and vocabulary size d. How many 1s are in the n×d matrix?
Q4. Which representation can be used to find similar documents?