🗺️ The Big Picture

Why this week matters
Core Challenge of NLP: Computers only understand numbers, but language is text. Before we can use any ML model, we must answer one question: "How do I convert a document into a meaningful number?" That is exactly what this week solves.

Raw Text (what we have)

"He is a Programmer! Check out https://abc.com 😊"

Processed Vector (what we need)

[0, 1, 0, 2, 0, 1, 0, 0, 0, 1 …] ← a row in our matrix

The pipeline: Raw Text → Preprocessing → Encoding → Numeric Matrix → ML Model

🧹 Part 1: Text Preprocessing

Cleaning and normalizing text before encoding

📌 NLP Terminology You Must Know

| Term | Definition | Example |
|---|---|---|
| Corpus | A collection of text documents | All Yelp reviews, Wikipedia, emails |
| Token | A single unit of text (usually a word) | "hello" in "hello world" |
| Vocabulary (V) | Set of all unique words in the corpus | {"cat", "dog", "fish"} → d = 3 |
| Tokenization | Splitting text into tokens | "I am here" → ["I", "am", "here"] |
| Stop Words | Common words with little meaning | "a", "the", "is", "and" |
| N-gram | Sequence of N consecutive tokens | Bigrams of "NLP is fun": ("NLP", "is"), ("is", "fun") |
| Syntax | Grammatical structure of language | Subject → Verb → Object |
| Semantics | Meaning carried by words/text | "bank" → financial institution or river bank? |

✂️ Stemming vs. Lemmatization

What

Both techniques reduce a word to its base/root form to group together inflected versions of the same word (e.g., "running", "runs", "ran" → all mean "run").

Why

Without normalization, "program", "programming", and "programmer" are treated as 3 different features. They all carry the same root meaning. Reducing them helps the model generalize better and reduces the vocabulary size.

How — The Difference

| Technique | Method | Input | Output | Notes |
|---|---|---|---|---|
| Stemming | Chops off prefixes/suffixes (rule-based) | programmer | programm | Fast, but output may not be a real word |
| Lemmatization | Uses a dictionary to find the true root | programmer | program | Slower, but linguistically correct |
```python
# Stemming with Porter Stemmer (NLTK)
from nltk.stem import PorterStemmer
ps = PorterStemmer()
ps.stem("programmer")  # → 'programm'
ps.stem("running")     # → 'run'

# Lemmatization with WordNetLemmatizer (NLTK)
from nltk.stem import WordNetLemmatizer
wl = WordNetLemmatizer()
wl.lemmatize("programmer")        # → 'programmer'
wl.lemmatize("running", pos='v')  # → 'run'
```
⚠️ Quiz 1 Tested This! "What is the output if stemming is applied to programmer?" → Answer: "programm" (not a real word — that's the nature of stemming).

🔇 Noise Removal

What

Removing irrelevant characters: punctuation, special characters, numbers, extra whitespace, HTML tags, URLs, emojis.

Why

These characters add noise to your feature space — they inflate vocabulary size and don't carry semantic meaning in most tasks.

```python
import re

text = "Check out https://abc.com! Score: 100% 😊"
text = re.sub(r'http\S+', '', text)       # remove URLs first, or 'httpsabccom' survives
clean = re.sub(r'[^a-zA-Z\s]', '', text)  # keep only letters and whitespace
clean = ' '.join(clean.split())           # collapse leftover whitespace
# clean == 'Check out Score'
```
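Putting the Part 1 steps together, a minimal cleaning pipeline might look like the sketch below. This is pure Python for illustration: the tiny STOP_WORDS set and naive_stem are stand-ins for NLTK's stopwords corpus and PorterStemmer, not real library calls.

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "and", "of", "to"}  # toy subset

def naive_stem(word):
    # Crude rule-based stemming (stand-in for a real stemmer):
    # chop one common suffix, longest first.
    for suffix in ("ing", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                    # normalize case
    text = re.sub(r"http\S+", "", text)    # strip URLs
    text = re.sub(r"[^a-z\s]", "", text)   # strip punctuation/digits/emoji
    tokens = text.split()                  # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]

print(preprocess("He is a Programmer! Check out https://abc.com 😊"))
# → ['he', 'programm', 'check', 'out']
```

The output is now a clean token list, ready for any of the encodings in Part 2.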

📐 Part 2: Text Representations

Turning cleaned text into numbers

1️⃣ One-Hot Encoding (OHE)

What

Each word in the vocabulary gets its own dimension (column). A word's vector is all 0s except a single 1 at the position corresponding to that word in the vocabulary.

Why

It's the simplest way to represent a word as a number that a computer can store and calculate with. It's the starting point for understanding all other encodings.

How — The Math

For a corpus with vocabulary size d: each word → vector of size d, with a 1 at index i (its position in the vocab). A document with n words → matrix of size n × d.

Corpus: "This is a simple sentence" → 5 unique words → d = 5
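A minimal plain-Python sketch of one-hot encoding, using the toy corpus above (vocabulary order here is simply sorted; real implementations may order differently):

```python
corpus = "this is a simple sentence"
vocab = sorted(set(corpus.split()))  # ['a', 'is', 'sentence', 'simple', 'this'] → d = 5

def one_hot(word):
    # Vector of size d with a single 1 at the word's vocabulary index.
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

# Document matrix: one row per word → n × d, with exactly n ones in total
matrix = [one_hot(w) for w in corpus.split()]
print(one_hot("simple"))  # → [0, 0, 0, 1, 0]
```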

⚠️ Quiz 1 Tested This! "In a n×d OHE matrix, how many 1s are there total?" → Answer: n (exactly one 1 per row/word).

✅ Advantages

  • Simple and easy to implement
  • Works with any vocabulary

❌ Disadvantages

  • Vectors are huge (size = vocab size)
  • Very sparse (mostly zeros)
  • No notion of similarity — dot(good, great) = dot(good, bad) = 0

🛍️ Bag of Words (BoW)

What

Each document is represented as a vector of word counts. Instead of one 1 per word, we count how many times each vocabulary word appears in the document.

Why

We want to capture that a document about "machine learning" probably uses the word "model" many times. BoW gives frequency information that OHE misses.

How — The Math

For N documents and vocabulary of size d → document-term matrix of shape N × d. Each cell (i, j) = count of word j in document i.

```
# Vocabulary: [this, was, the, best, worst, of, times]
Doc 1: "this was the best of times"  → [1, 1, 1, 1, 0, 1, 1]
Doc 2: "this was the worst of times" → [1, 1, 1, 0, 1, 1, 1]
# Five of the seven counts are identical: shared words like 'this', 'was',
# 'the', 'of', 'times' carry no information for telling the docs apart.
```
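The same document-term matrix can be built in a few lines. This pure-Python sketch mimics what sklearn's CountVectorizer does, but with a fixed vocabulary for readability:

```python
from collections import Counter

vocab = ["this", "was", "the", "best", "worst", "of", "times"]
docs = ["this was the best of times",
        "this was the worst of times"]

def bow(doc):
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]  # one N × d row: count of each vocab word

matrix = [bow(d) for d in docs]
print(matrix[0])  # → [1, 1, 1, 1, 0, 1, 1]
print(matrix[1])  # → [1, 1, 1, 0, 1, 1, 1]
```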
⚠️ Quiz 1: Key Limitation BoW (and OHE) both ignore word order and context. "Dog bites man" and "Man bites dog" produce the same BoW vector!

✅ Advantages

  • Captures word frequency (better than OHE)
  • Simple to implement (CountVectorizer)

❌ Disadvantages

  • Still huge and sparse
  • All words equally "important"
  • Ignores word order and context

📊 TF-IDF (Term Frequency – Inverse Document Frequency)

What

TF-IDF weights each word by how important it is to a specific document relative to the whole corpus. Common words like "the" get low scores; rare but meaningful words get high scores.

Why

BoW gives the word "the" the same weight as the word "neural" — but "the" appears everywhere and is useless for differentiating documents. TF-IDF fixes this by penalizing words that appear in many documents.

How — The Formula (Step by Step)

```
TF(t, d)      = count(t in d) / total_words_in_d
                → how often word t appears in THIS document

IDF(t)        = log( N / df(t) )
                → N = total documents, df(t) = number of docs containing t
                → t appears in ALL docs: IDF = log(1) = 0  (common = useless)
                → t appears in FEW docs: IDF is HIGH       (rare = distinctive)

TF-IDF(t, d)  = TF(t, d) × IDF(t)
```
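Applying the formula by hand on a made-up three-document corpus (pure-Python sketch; note that library implementations such as sklearn's TfidfVectorizer use a smoothed IDF variant, so their numbers differ slightly):

```python
import math

docs = [["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["the", "dog", "barked"]]
N = len(docs)

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term):
    df = sum(1 for d in docs if term in d)  # number of docs containing term
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("the", docs[0]))  # 0.0 — 'the' is in all 3 docs, so IDF = log(1) = 0
print(tf_idf("cat", docs[0]))  # (1/3) * log(3) ≈ 0.3662
```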



✅ When to use TF-IDF Use TF-IDF for document search, finding similar documents, document clustering, and classification tasks where rare distinctive words matter most.
⚠️ Quiz 1: Key Fact "TF-IDF can be used to find similar documents" → TRUE. "TF-IDF captures context" → FALSE (it still has no word order).
⚖️ Comparison: OHE vs BoW vs TF-IDF

When to use which?
| Feature | One-Hot Encoding | Bag of Words | TF-IDF |
|---|---|---|---|
| Captures frequency? | No (just 0/1) | Yes | Yes (weighted) |
| Penalizes common words? | No | No | Yes ← key advantage |
| Captures word order? | No | No | No |
| Word similarity? | No (orthogonal) | No | No |
| Best for? | Simple baselines, categorical inputs | Text classification | Search, document similarity, clustering |
| Matrix shape | n × d (one row per word) or d × d | N × d | N × d |
🔑 Key Insight for All Three All three representations create sparse, high-dimensional vectors (size = vocab size d). This is why Week 4 introduces word embeddings — to replace these with small, dense vectors that actually capture meaning.

🧪 Quiz Prep — Week 1 Questions

Q1. What is a potential limitation of the Bag of Words approach?

Q2. Which statement about stemming vs lemmatization is TRUE?

Q3. In One-Hot Encoding, a document has n words and vocabulary size d. How many 1s are in the n×d matrix?

Q4. Which representation can be used to find similar documents?