🖼️ Multimodal NLP: Vision + Language

CLIP · LLaVA · Flamingo · Contrastive Pretraining

🔗 CLIP: Contrastive Language-Image Pretraining


CLIP (Radford et al., OpenAI 2021) was trained on 400M image-text pairs collected from the web. The key insight: instead of predicting image labels, train a model to match images with their natural language descriptions using a contrastive loss.

Before CLIP, you needed labeled training data for each vision task. CLIP enables zero-shot classification: given a new class "raccoon eating pizza," just encode the text "a photo of a raccoon eating pizza" and find the image whose embedding is closest. No raccoon-pizza training examples needed. This opened the door to general-purpose vision-language models.

CLIP Architecture

Input batch: N image-text pairs (I₁,T₁), ..., (Iₙ,Tₙ)

    Image Encoder (ViT or ResNet)        Text Encoder (Transformer)
                ↓                                    ↓
        image_emb ∈ ℝ^d                       text_emb ∈ ℝ^d
                ↓                                    ↓
          L2 normalize                         L2 normalize
                └───────────────┬────────────────────┘
                                ↓
              Cosine similarity matrix: [N × N]
              diagonal = correct pairs, off-diagonal = negative pairs

Loss: symmetric InfoNCE (cross-entropy on rows AND columns)

    L_img = −log( exp(sim(Iᵢ,Tᵢ)/τ) / Σⱼ exp(sim(Iᵢ,Tⱼ)/τ) )
    L_txt = −log( exp(sim(Iᵢ,Tᵢ)/τ) / Σⱼ exp(sim(Iⱼ,Tᵢ)/τ) )
    L     = (L_img + L_txt) / 2,   τ = learned temperature
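A minimal PyTorch sketch of this symmetric loss, assuming the two encoders have already produced a batch of image and text embeddings. In the real model τ is a learned parameter (stored on a log scale); here it is passed as a constant for brevity.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, tau=0.07):
    """image_emb, text_emb: [N, d] raw outputs of the two encoders."""
    # L2-normalize so dot products are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [N, N] similarity matrix, scaled by the temperature
    logits = image_emb @ text_emb.t() / tau

    # Correct pairs sit on the diagonal
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    loss_img = F.cross_entropy(logits, targets)      # rows: each image vs. all texts
    loss_txt = F.cross_entropy(logits.t(), targets)  # columns: each text vs. all images
    return (loss_img + loss_txt) / 2
```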

Zero-Shot Classification

Create a text prompt for each class ("a photo of a {class}"), encode all prompts, and classify an image by its nearest text embedding. Zero-shot CLIP matches the accuracy of a supervised ResNet-50 on ImageNet without using a single labeled ImageNet example.
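For concreteness, here is a hedged sketch of that recipe using the open-source CLIP checkpoint available through Hugging Face transformers; the model name, class list, and image path are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["cat", "dog", "raccoon eating pizza"]
prompts = [f"a photo of a {c}" for c in classes]

image = Image.open("example.jpg")  # placeholder: any local image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: [1, num_classes]; softmax over the class prompts
probs = outputs.logits_per_image.softmax(dim=-1)
print(classes[probs.argmax().item()])
```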

Training Scale

400M image-text pairs (OpenAI's WebImageText/WIT dataset). Largest vision encoder: ViT-L/14, trained for 12 days on 256 V100 GPUs. The temperature τ is a learned parameter, moving from ~0.07 toward ~0.01 over training.

Why InfoNCE?

InfoNCE is a lower bound on mutual information between image and text embeddings. Maximizing it pulls correct pairs together while pushing N-1 incorrect pairs apart. Larger batches = stronger negatives = better training signal.

🦙 LLaVA: Visual Instruction Tuning


LLaVA (Liu et al., 2023) connects a CLIP vision encoder to an LLM (LLaMA/Vicuna) via a simple linear projection layer. It is then fine-tuned on synthetic visual instruction-following data generated by GPT-4.

Architecture:

    Image → [CLIP ViT encoder] → visual tokens (e.g. 256 tokens)
                                        ↓
                          Linear Projection Layer (W)
                                        ↓
                      embeddings in the LLM's token space ──┐
                                                           concat → [LLM]
    Text: "Describe this image" → [Tokenizer] ──────────────┘

Training Stage 1: freeze CLIP + LLM, train the projection W only
                  (align visual features with the text embedding space)
Training Stage 2: freeze CLIP, fine-tune W + the LLM jointly
                  (visual instruction tuning on GPT-4-generated QA)
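A small sketch of the connector idea, with illustrative dimensions (1024-wide CLIP ViT features, a 4096-wide LLM) and hypothetical module names; this is not LLaVA's actual code.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps frozen CLIP ViT features into the LLM's embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # LLaVA-1.0 used a single linear layer; LLaVA-1.5 replaced it with an MLP
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_tokens):      # [B, 256, vision_dim]
        return self.proj(visual_tokens)    # [B, 256, llm_dim]

projector = VisualProjector()
visual_tokens = torch.randn(1, 256, 1024)  # stand-in for frozen CLIP ViT output
text_embeds = torch.randn(1, 32, 4096)     # stand-in for embedded text prompt

# Stage 1 trains only `projector`; stage 2 also unfreezes the LLM.
llm_input = torch.cat([projector(visual_tokens), text_embeds], dim=1)
```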

LLaVA-1.5 and LLaVA-NeXT (2024) improved the projection layer (MLP instead of linear), used higher resolution inputs, and trained on more diverse visual instruction data — achieving SOTA on multiple VQA benchmarks with a remarkably simple architecture.

🦩 Flamingo: Cross-Attention Vision-Language Model


Flamingo (Alayrac et al., DeepMind 2022) takes a different approach: keep both a frozen vision encoder (NFNet) and a frozen LLM (Chinchilla), and connect them via new cross-attention layers inserted between existing LLM layers.

Flamingo Architecture:

    Image(s) → Frozen Vision Encoder → Perceiver Resampler → K visual tokens
                                                                   │
    Frozen LLM: [...] → [Gated Cross-Attn Layer] → [Standard LLM Layer] → [...]
                               ↑ attends to the visual tokens
                                 (only cross-attn + gate params are trained)

Key innovations:
1. Perceiver Resampler: a fixed number of output tokens regardless of how many visual features come in
2. Gated cross-attention: y = x + tanh(α) · CrossAttn(x, visual tokens), with the gate α initialized to 0 so the frozen LLM starts out unchanged (see the sketch after this list)
3. Supports interleaved image-text sequences: [img1][text1][img2][text2]...
4. Few-shot in-context learning: e.g. 4-shot visual QA without any gradient updates
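A compact sketch of the gated cross-attention idea built from a stock nn.MultiheadAttention. The real Flamingo block also includes a gated feed-forward sublayer and its own attention implementation, so treat this module as illustrative only.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate starts at 0, so the layer is initially an identity map and
        # the frozen LLM's behavior is preserved at the start of training.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x, visual_tokens):
        # x: [B, T, dim] language hidden states; visual_tokens: [B, K, dim]
        attended, _ = self.attn(query=x, key=visual_tokens, value=visual_tokens)
        return x + torch.tanh(self.alpha) * attended  # gated residual connection
```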

Flamingo's design — freezing the backbone and training only cross-attention adapters — directly inspired modern efficient multimodal architectures. The internals of GPT-4V, Gemini Pro Vision, and Claude's vision support are not public, but cross-modal bridging of this kind underlies many subsequent vision-language systems.

📊 Multimodal Model Comparison

Model    | Vision Encoder             | Language Model            | Connector                | Training Strategy
CLIP     | ViT (trained from scratch) | Transformer text encoder  | Shared embedding space   | Contrastive on 400M pairs
LLaVA    | CLIP ViT (frozen)          | LLaMA/Vicuna              | Linear projection / MLP  | 2-stage: alignment, then instruction tuning
Flamingo | NFNet (frozen)             | Chinchilla (frozen)       | Gated cross-attention    | Cross-attn layers trained on interleaved data
GPT-4V   | Unknown ViT                | GPT-4                     | Unknown                  | RLHF + safety training
Gemini   | Native multimodal          | Gemini LLM                | Native (trained jointly) | End-to-end multimodal pretraining
🌍 Multilingual NLP: Models for 100+ Languages

mBERT · XLM-R · Cross-lingual Transfer · Low-Resource NLP

🌐 mBERT: Multilingual BERT


mBERT (Google 2019) trains BERT on Wikipedia dumps from 104 languages simultaneously, using a shared WordPiece vocabulary of 110K tokens. No cross-lingual objective — just masked language modeling on multilingual text.

The remarkable finding: fine-tune mBERT on English NER, then test it on German NER with no German labeled data. It works — achieving ~75% of the performance of a German-only model. The shared vocabulary and joint pretraining create a universal representation where analogous entities occupy similar positions across languages. This "cross-lingual transfer" emerged without explicit cross-lingual supervision.
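A hedged sketch of what zero-shot cross-lingual transfer looks like in practice with Hugging Face transformers. The checkpoint name below is hypothetical and stands for "a multilingual encoder fine-tuned on English NER only".

```python
from transformers import pipeline

# Hypothetical checkpoint: mBERT (or XLM-R) fine-tuned on English NER data only
ner = pipeline(
    "token-classification",
    model="my-org/mbert-finetuned-english-ner",
    aggregation_strategy="simple",
)

# German input at test time, even though no German labels were seen in fine-tuning
print(ner("Angela Merkel besuchte Berlin im Jahr 2019."))
```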

Architecture

Same architecture as BERT-base (12 layers, 768 hidden size, 12 attention heads), but the shared 110K WordPiece vocabulary pushes the total to roughly 170M parameters, most of them in the embedding matrix.

Limitations

Curse of multilinguality: as languages increase, per-language capacity decreases. Low-resource languages get very few tokens. Vocabulary allocation biased toward high-resource languages.

Why It Works

Shared vocabulary creates lexical bridges (cognates, code-switching). Structural similarities across languages create implicit alignment. Numbers, punctuation, and code tokens are universal.

🚀 XLM-R: Cross-lingual Language Model — RoBERTa Scale


XLM-R (Conneau et al., Facebook AI 2020) applies RoBERTa-style training (no NSP, more data, larger batches, longer sequences) at multilingual scale — training on CC-100, a 2.5TB cleaned multilingual web corpus.

Aspect                     | mBERT           | XLM-R Base         | XLM-R Large
Languages                  | 104             | 100                | 100
Parameters                 | ~172M           | 270M               | 560M
Vocabulary                 | 110K WordPiece  | 250K SentencePiece | 250K SentencePiece
Training data              | Wikipedia only  | CC-100 (2.5TB)     | CC-100 (2.5TB)
XNLI avg accuracy          | 65.4%           | 79.2%              | 83.6%
Swahili NER F1 (zero-shot) | ~50%            | ~65%               | ~70%

XLM-R's 250K SentencePiece vocabulary (vs mBERT's 110K WordPiece) dramatically improves tokenization quality for non-Latin scripts. Arabic, Thai, and Chinese receive much more vocabulary budget — lower fertility, better representation.
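A quick, rough way to see the fertility difference yourself: compare both public tokenizers on an arbitrary Arabic sentence (fewer subwords per word is better).

```python
from transformers import AutoTokenizer

mbert = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
xlmr = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Arbitrary sample: "The student went to the library this morning"
sentence = "ذهب الطالب إلى المكتبة صباح اليوم"
n_words = len(sentence.split())

for name, tok in [("mBERT", mbert), ("XLM-R", xlmr)]:
    n_subwords = len(tok.tokenize(sentence))
    print(f"{name}: {n_subwords} subwords / {n_words} words "
          f"= fertility {n_subwords / n_words:.2f}")
```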

🌙 Arabic NLP: A Case Study in Morphological Richness


Arabic presents unique challenges for NLP that make it a fascinating case study — and an important one given 350M+ native speakers and significant underrepresentation in most LLMs.

Morphological Richness

One Arabic root (e.g., k-t-b, relating to writing) can generate 50+ derived words: كَتَبَ (kataba, "he wrote"), كِتَاب (kitāb, "book"), مَكْتُوب (maktūb, "written"). Subword tokenizers (BPE/WordPiece) trained mostly on high-resource languages therefore show very high fertility on Arabic: each word splits into many pieces.
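To make that concrete, the tiny script below tokenizes the three derived forms listed above with XLM-R's SentencePiece model; the exact splits depend on the tokenizer and on whether diacritics are present, so the output is only indicative.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Three derived forms of the root k-t-b, with diacritics
for word in ["كَتَبَ", "كِتَاب", "مَكْتُوب"]:
    pieces = tok.tokenize(word)
    print(word, "->", pieces, f"({len(pieces)} subwords)")
```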

Diacritization Problem

Arabic text is typically written without short vowels (diacritics). The bare string كتب can be read as kataba ("he wrote"), kutiba ("it was written"), or kutub ("books"). Context determines pronunciation and meaning.

Diglossia

Modern Standard Arabic (formal text) vs. 25+ dialectal varieties (Egyptian, Gulf, Levantine…). Models trained on MSA perform poorly on dialects. Most social media is dialectal.

Arabic-Specific Models

AraBERT: BERT pretrained on Arabic Wikipedia + news. CAMeLBERT: variants pretrained on MSA, dialectal, and classical Arabic. AraGPT2: an Arabic GPT-2 variant. Jais: an Arabic-English bilingual LLM (2023).

📊 Cross-Lingual Transfer Strategies

Strategy        | Data Required                 | Quality       | When to Use
Zero-shot XLT   | Source lang labeled only      | Good (XLM-R)  | No target labeled data available
Few-shot XLT    | Source + few target examples  | Better        | Can collect 10–100 target examples
Translate-Train | MT system available           | Variable      | High-resource translation pairs exist
Translate-Test  | MT at inference time          | Good baseline | Quick baseline, MT errors propagate
Native Training | Large target labeled set      | Best          | High-resource target language
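As a concrete example of the Translate-Test row, here is a hedged sketch using publicly available Hugging Face pipelines (a German-to-English MT model plus the default English sentiment classifier); both checkpoints are just examples.

```python
from transformers import pipeline

# Translate target-language input to English at inference time
translate_de_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

# Default pipeline loads an English-only sentiment model
english_classifier = pipeline("sentiment-analysis")

german_review = "Das Produkt ist großartig, ich bin sehr zufrieden."
english_text = translate_de_en(german_review)[0]["translation_text"]

# Any MT errors propagate into the downstream prediction
print(english_classifier(english_text))
```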

🧠 Knowledge Check: Multimodal & Multilingual NLP

1. CLIP's training objective maximizes similarity between correct image-text pairs. What loss function does it use?

2. A researcher fine-tunes XLM-R on English NER, then tests it on Spanish NER with zero Spanish labeled data. This is called:

3. In LLaVA's architecture, what is the primary role of the projection layer between the CLIP encoder and the LLM?

4. Which statement about the "curse of multilinguality" in mBERT is most accurate?