🖼️ Multimodal NLP: Vision + Language

CLIP · LLaVA · Flamingo · Contrastive Pretraining

🔗 CLIP: Contrastive Language-Image Pretraining


CLIP (Radford et al., OpenAI 2021) was trained on 400M image-text pairs collected from the web. The key insight: instead of predicting image labels, train a model to match images with their natural language descriptions using a contrastive loss.

Before CLIP, you needed labeled training data for each vision task. CLIP enables zero-shot classification: given a new class "raccoon eating pizza," just encode the text "a photo of a raccoon eating pizza" and find the image whose embedding is closest. No raccoon-pizza training examples needed. This opened the door to general-purpose vision-language models.

CLIP Architecture

Input batch: N image-text pairs (I₁,T₁), ..., (Iₙ,Tₙ)

    Image Encoder (ViT or ResNet)        Text Encoder (Transformer)
                ↓                                    ↓
        image_emb ∈ ℝ^d                       text_emb ∈ ℝ^d
                ↓                                    ↓
          L2 normalize                         L2 normalize
                └───────────────┬────────────────────┘
                                ↓
              Cosine similarity matrix: [N × N]
              diagonal = correct pairs, off-diagonal = negative pairs

Loss: symmetric InfoNCE (cross-entropy on rows AND columns)

    L_img = −log( exp(sim(Iᵢ,Tᵢ)/τ) / Σⱼ exp(sim(Iᵢ,Tⱼ)/τ) )
    L_txt = −log( exp(sim(Iᵢ,Tᵢ)/τ) / Σⱼ exp(sim(Iⱼ,Tᵢ)/τ) )
    L     = (L_img + L_txt) / 2,   τ = learned temperature
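A minimal PyTorch sketch of this symmetric loss, assuming the two encoders have already produced a batch of image and text embeddings. In the real model τ is a learned parameter (stored on a log scale); here it is passed as a constant for brevity.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, tau=0.07):
    """image_emb, text_emb: [N, d] raw outputs of the two encoders."""
    # L2-normalize so dot products are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [N, N] similarity matrix, scaled by the temperature
    logits = image_emb @ text_emb.t() / tau

    # Correct pairs sit on the diagonal
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    loss_img = F.cross_entropy(logits, targets)      # rows: each image vs. all texts
    loss_txt = F.cross_entropy(logits.t(), targets)  # columns: each text vs. all images
    return (loss_img + loss_txt) / 2
```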

Zero-Shot Classification

Create a text prompt for each class ("a photo of a {class}"), encode all prompts, and classify an image by its nearest text embedding. Zero-shot CLIP matches the accuracy of a supervised ResNet-50 on ImageNet without using a single labeled ImageNet example.
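For concreteness, here is a hedged sketch of that recipe using the open-source CLIP checkpoint available through Hugging Face transformers; the model name, class list, and image path are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["cat", "dog", "raccoon eating pizza"]
prompts = [f"a photo of a {c}" for c in classes]

image = Image.open("example.jpg")  # placeholder: any local image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: [1, num_classes]; softmax over the class prompts
probs = outputs.logits_per_image.softmax(dim=-1)
print(classes[probs.argmax().item()])
```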

Training Scale

400M image-text pairs (OpenAI's WebImageText/WIT dataset). Largest vision encoder: ViT-L/14, trained for 12 days on 256 V100 GPUs. The temperature τ is a learned parameter, moving from ~0.07 toward ~0.01 over training.

Why InfoNCE?

InfoNCE is a lower bound on mutual information between image and text embeddings. Maximizing it pulls correct pairs together while pushing N-1 incorrect pairs apart. Larger batches = stronger negatives = better training signal.

🦙 LLaVA: Visual Instruction Tuning


LLaVA (Liu et al., 2023) connects a CLIP vision encoder to an LLM (LLaMA/Vicuna) via a simple linear projection layer. It is then fine-tuned on synthetic visual instruction-following data generated by GPT-4.

Architecture:

    Image → [CLIP ViT encoder] → visual tokens (e.g. 256 tokens)
                                        ↓
                          Linear Projection Layer (W)
                                        ↓
                      embeddings in the LLM's token space ──┐
                                                           concat → [LLM]
    Text: "Describe this image" → [Tokenizer] ──────────────┘

Training Stage 1: freeze CLIP + LLM, train the projection W only
                  (align visual features with the text embedding space)
Training Stage 2: freeze CLIP, fine-tune W + the LLM jointly
                  (visual instruction tuning on GPT-4-generated QA)
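A small sketch of the connector idea, with illustrative dimensions (1024-wide CLIP ViT features, a 4096-wide LLM) and hypothetical module names; this is not LLaVA's actual code.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps frozen CLIP ViT features into the LLM's embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # LLaVA-1.0 used a single linear layer; LLaVA-1.5 replaced it with an MLP
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_tokens):      # [B, 256, vision_dim]
        return self.proj(visual_tokens)    # [B, 256, llm_dim]

projector = VisualProjector()
visual_tokens = torch.randn(1, 256, 1024)  # stand-in for frozen CLIP ViT output
text_embeds = torch.randn(1, 32, 4096)     # stand-in for embedded text prompt

# Stage 1 trains only `projector`; stage 2 also unfreezes the LLM.
llm_input = torch.cat([projector(visual_tokens), text_embeds], dim=1)
```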

LLaVA-1.5 and LLaVA-NeXT (2024) improved the projection layer (MLP instead of linear), used higher resolution inputs, and trained on more diverse visual instruction data — achieving SOTA on multiple VQA benchmarks with a remarkably simple architecture.

🦩 Flamingo: Cross-Attention Vision-Language Model


Flamingo (Alayrac et al., DeepMind 2022) takes a different approach: keep both a frozen vision encoder (NFNet) and a frozen LLM (Chinchilla), and connect them via new cross-attention layers inserted between existing LLM layers.

Flamingo Architecture:

    Image(s) → Frozen Vision Encoder → Perceiver Resampler → K visual tokens
                                                                   │
    Frozen LLM: [...] → [Gated Cross-Attn Layer] → [Standard LLM Layer] → [...]
                               ↑ attends to the visual tokens
                                 (only cross-attn + gate params are trained)

Key innovations:
1. Perceiver Resampler: a fixed number of output tokens regardless of how many visual features come in
2. Gated cross-attention: y = x + tanh(α) · CrossAttn(x, visual tokens), with the gate α initialized to 0 so the frozen LLM starts out unchanged (see the sketch after this list)
3. Supports interleaved image-text sequences: [img1][text1][img2][text2]...
4. Few-shot in-context learning: e.g. 4-shot visual QA without any gradient updates
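A compact sketch of the gated cross-attention idea built from a stock nn.MultiheadAttention. The real Flamingo block also includes a gated feed-forward sublayer and its own attention implementation, so treat this module as illustrative only.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate starts at 0, so the layer is initially an identity map and
        # the frozen LLM's behavior is preserved at the start of training.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x, visual_tokens):
        # x: [B, T, dim] language hidden states; visual_tokens: [B, K, dim]
        attended, _ = self.attn(query=x, key=visual_tokens, value=visual_tokens)
        return x + torch.tanh(self.alpha) * attended  # gated residual connection
```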

Flamingo's design — freezing the backbone and training only cross-attention adapters — directly inspired modern efficient multimodal architectures. The internals of GPT-4V, Gemini Pro Vision, and Claude's vision support are not public, but cross-modal bridging of this kind underlies many subsequent vision-language systems.

📊 Multimodal Model Comparison

Model    | Vision Encoder             | Language Model            | Connector                | Training Strategy
CLIP     | ViT (trained from scratch) | Transformer text encoder  | Shared embedding space   | Contrastive on 400M pairs
LLaVA    | CLIP ViT (frozen)          | LLaMA/Vicuna              | Linear projection / MLP  | 2-stage: alignment, then instruction tuning
Flamingo | NFNet (frozen)             | Chinchilla (frozen)       | Gated cross-attention    | Cross-attn layers trained on interleaved data
GPT-4V   | Unknown ViT                | GPT-4                     | Unknown                  | RLHF + safety training
Gemini   | Native multimodal          | Gemini LLM                | Native (trained jointly) | End-to-end multimodal pretraining
🌍 Multilingual NLP: Models for 100+ Languages

mBERT · XLM-R · Cross-lingual Transfer · Low-Resource NLP

🌐 mBERT: Multilingual BERT


mBERT (Google 2019) trains BERT on Wikipedia dumps from 104 languages simultaneously, using a shared WordPiece vocabulary of 110K tokens. No cross-lingual objective — just masked language modeling on multilingual text.

The remarkable finding: fine-tune mBERT on English NER, then test it on German NER with no German labeled data. It works — achieving ~75% of the performance of a German-only model. The shared vocabulary and joint pretraining create a universal representation where analogous entities occupy similar positions across languages. This "cross-lingual transfer" emerged without explicit cross-lingual supervision.
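A hedged sketch of what zero-shot cross-lingual transfer looks like in practice with Hugging Face transformers. The checkpoint name below is hypothetical and stands for "a multilingual encoder fine-tuned on English NER only".

```python
from transformers import pipeline

# Hypothetical checkpoint: mBERT (or XLM-R) fine-tuned on English NER data only
ner = pipeline(
    "token-classification",
    model="my-org/mbert-finetuned-english-ner",
    aggregation_strategy="simple",
)

# German input at test time, even though no German labels were seen in fine-tuning
print(ner("Angela Merkel besuchte Berlin im Jahr 2019."))
```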

Architecture

Same architecture as BERT-base (12 layers, 768 hidden size, 12 attention heads), but the shared 110K WordPiece vocabulary pushes the total to roughly 170M parameters, most of them in the embedding matrix.

Limitations

Curse of multilinguality: as languages increase, per-language capacity decreases. Low-resource languages get very few tokens. Vocabulary allocation biased toward high-resource languages.

Why It Works

Shared vocabulary creates lexical bridges (cognates, code-switching). Structural similarities across languages create implicit alignment. Numbers, punctuation, and code tokens are universal.

🚀 XLM-R: Cross-lingual Language Model — RoBERTa Scale


XLM-R (Conneau et al., Facebook AI 2020) applies RoBERTa-style training (no NSP, more data, larger batches, longer sequences) at multilingual scale — training on CC-100, a 2.5TB cleaned multilingual web corpus.

Aspect                     | mBERT           | XLM-R Base         | XLM-R Large
Languages                  | 104             | 100                | 100
Parameters                 | ~172M           | 270M               | 560M
Vocabulary                 | 110K WordPiece  | 250K SentencePiece | 250K SentencePiece
Training data              | Wikipedia only  | CC-100 (2.5TB)     | CC-100 (2.5TB)
XNLI avg accuracy          | 65.4%           | 79.2%              | 83.6%
Swahili NER F1 (zero-shot) | ~50%            | ~65%               | ~70%

XLM-R's 250K SentencePiece vocabulary (vs mBERT's 110K WordPiece) dramatically improves tokenization quality for non-Latin scripts. Arabic, Thai, and Chinese receive much more vocabulary budget — lower fertility, better representation.
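A quick, rough way to see the fertility difference yourself: compare both public tokenizers on an arbitrary Arabic sentence (fewer subwords per word is better).

```python
from transformers import AutoTokenizer

mbert = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
xlmr = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Arbitrary sample: "The student went to the library this morning"
sentence = "ذهب الطالب إلى المكتبة صباح اليوم"
n_words = len(sentence.split())

for name, tok in [("mBERT", mbert), ("XLM-R", xlmr)]:
    n_subwords = len(tok.tokenize(sentence))
    print(f"{name}: {n_subwords} subwords / {n_words} words "
          f"= fertility {n_subwords / n_words:.2f}")
```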

🌙 Arabic NLP: A Case Study in Morphological Richness


Arabic presents unique challenges for NLP that make it a fascinating case study — and an important one given 350M+ native speakers and significant underrepresentation in most LLMs.

Morphological Richness

One Arabic root (e.g., k-t-b, relating to writing) can generate 50+ derived words: كَتَبَ (kataba, "he wrote"), كِتَاب (kitāb, "book"), مَكْتُوب (maktūb, "written"). Subword tokenizers (BPE/WordPiece) trained mostly on high-resource languages therefore show very high fertility on Arabic: each word splits into many pieces.
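To make that concrete, the tiny script below tokenizes the three derived forms listed above with XLM-R's SentencePiece model; the exact splits depend on the tokenizer and on whether diacritics are present, so the output is only indicative.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Three derived forms of the root k-t-b, with diacritics
for word in ["كَتَبَ", "كِتَاب", "مَكْتُوب"]:
    pieces = tok.tokenize(word)
    print(word, "->", pieces, f"({len(pieces)} subwords)")
```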

Diacritization Problem

Arabic text is typically written without short vowels (diacritics). The bare string كتب can be read as kataba ("he wrote"), kutiba ("it was written"), or kutub ("books"). Context determines pronunciation and meaning.

Diglossia

Modern Standard Arabic (formal text) vs. 25+ dialectal varieties (Egyptian, Gulf, Levantine…). Models trained on MSA perform poorly on dialects. Most social media is dialectal.

Arabic-Specific Models

AraBERT: BERT pretrained on Arabic Wikipedia + news. CAMeLBERT: variants pretrained on MSA, dialectal, and classical Arabic. AraGPT2: an Arabic GPT-2 variant. Jais: an Arabic-English bilingual LLM (2023).

📊 Cross-Lingual Transfer Strategies

Strategy        | Data Required                 | Quality       | When to Use
Zero-shot XLT   | Source lang labeled only      | Good (XLM-R)  | No target labeled data available
Few-shot XLT    | Source + few target examples  | Better        | Can collect 10–100 target examples
Translate-Train | MT system available           | Variable      | High-resource translation pairs exist
Translate-Test  | MT at inference time          | Good baseline | Quick baseline, MT errors propagate
Native Training | Large target labeled set      | Best          | High-resource target language
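As a concrete example of the Translate-Test row, here is a hedged sketch using publicly available Hugging Face pipelines (a German-to-English MT model plus the default English sentiment classifier); both checkpoints are just examples.

```python
from transformers import pipeline

# Translate target-language input to English at inference time
translate_de_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

# Default pipeline loads an English-only sentiment model
english_classifier = pipeline("sentiment-analysis")

german_review = "Das Produkt ist großartig, ich bin sehr zufrieden."
english_text = translate_de_en(german_review)[0]["translation_text"]

# Any MT errors propagate into the downstream prediction
print(english_classifier(english_text))
```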

🧠 Knowledge Check: Multimodal & Multilingual NLP

1. CLIP's training objective maximizes similarity between correct image-text pairs. What loss function does it use?

2. A researcher fine-tunes XLM-R on English NER, then tests it on Spanish NER with zero Spanish labeled data. This is called:

3. In LLaVA's architecture, what is the primary role of the projection layer between the CLIP encoder and the LLM?

4. Which statement about the "curse of multilinguality" in mBERT is most accurate?