CLIP, LLaVA, Flamingo — bridging vision and language. mBERT, XLM-R — enabling NLP across 100+ languages. These are the architectures that extend NLP beyond English-only, text-only models.
CLIP (Radford et al., OpenAI 2021) trained on 400M image-text pairs from the web. The key insight: instead of predicting image labels, train a model to match images with their natural language descriptions using contrastive loss.
Before CLIP, you needed labeled training data for each vision task. CLIP enables zero-shot classification: given a new class "raccoon eating pizza," just encode the text "a photo of a raccoon eating pizza" and find the image whose embedding is closest. No raccoon-pizza training examples needed. This opened the door to general-purpose vision-language models.
Create text prompts for each class: "a photo of a {class}". Encode all prompts. Classify image by nearest text embedding. Matches ResNet-50 on ImageNet with zero labeled examples.
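A minimal sketch of this zero-shot recipe using the Hugging Face `transformers` CLIP classes (an assumption; the paper's own code differs). The checkpoint name, class list, and image path are illustrative placeholders:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP variant on the Hub works the same way.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

classes = ["raccoon eating pizza", "dog on a skateboard", "cat in a box"]  # hypothetical classes
prompts = [f"a photo of a {c}" for c in classes]
image = Image.open("example.jpg")  # placeholder image path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarities scaled by the learned temperature.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(classes, probs[0].tolist())))
```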
400M image-text pairs (the WIT dataset). ViT-L/14 image encoder. ~12 days on 256 V100 GPUs. Temperature τ is a learned parameter, initialized at 0.07 and clipped so it cannot fall below ~0.01 during training.
The InfoNCE objective is a lower bound on the mutual information between image and text embeddings. Optimizing it pulls correct pairs together while pushing the N-1 incorrect pairings in each batch apart. Larger batches = more (and harder) negatives = stronger training signal.
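A minimal PyTorch sketch of this symmetric contrastive loss, written to match the description above rather than CLIP's actual implementation; shapes, dimensions, and the temperature initialization are illustrative:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, logit_scale):
    """Symmetric InfoNCE over a batch of N matched image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)              # unit length: dot product = cosine sim
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale.exp() * image_emb @ text_emb.t()   # [N, N]; diagonal = correct pairs
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_img = F.cross_entropy(logits, targets)             # each image must pick its caption
    loss_txt = F.cross_entropy(logits.t(), targets)         # each caption must pick its image
    return (loss_img + loss_txt) / 2

# Learnable temperature: logit_scale = log(1/tau), with tau initialized to 0.07.
logit_scale = torch.nn.Parameter(torch.log(torch.tensor(1.0 / 0.07)))
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512), logit_scale)
```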
LLaVA (Liu et al., 2023) connects a CLIP vision encoder to an LLM (LLaMA/Vicuna) via a simple linear projection layer. It is fine-tuned on synthetic visual instruction-following data generated by prompting text-only GPT-4 with image captions and bounding-box descriptions.
LLaVA-1.5 and LLaVA-NeXT (2024) improved the projection layer (MLP instead of linear), used higher resolution inputs, and trained on more diverse visual instruction data — achieving SOTA on multiple VQA benchmarks with a remarkably simple architecture.
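A rough sketch of what such a projector looks like in PyTorch. The dimensions (1024 for CLIP ViT-L/14 features, 4096 for a 7B LLaMA) and the module structure are assumptions for illustration, not LLaVA's exact code:

```python
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps frozen CLIP patch features into the LLM's embedding space.
    LLaVA-1.0 used a single Linear layer; LLaVA-1.5 a two-layer MLP with GELU."""
    def __init__(self, vision_dim=1024, llm_dim=4096, use_mlp=True):
        super().__init__()
        if use_mlp:
            self.proj = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )
        else:
            self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features):
        # patch_features: [batch, num_patches, vision_dim] from the CLIP ViT
        # returns: [batch, num_patches, llm_dim] "visual tokens" fed to the LLM alongside text
        return self.proj(patch_features)
```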
Flamingo (Alayrac et al., DeepMind 2022) takes a different approach: keep both a frozen vision encoder (NFNet) and a frozen LLM (Chinchilla), and connect them via new cross-attention layers inserted between existing LLM layers.
Flamingo's design of freezing both backbones and training only the inserted cross-attention layers directly inspired later parameter-efficient multimodal architectures. The internals of GPT-4V, Gemini Pro Vision, and Claude's vision support have not been disclosed, but cross-modal bridging of this kind underlies most subsequent vision-language systems.
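A simplified sketch of a Flamingo-style gated cross-attention block. The key idea is the tanh gate initialized at zero, so the frozen LM's behavior is unchanged at the start of training; head counts, feed-forward shape, and the source of the visual tokens (Flamingo uses a Perceiver Resampler) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Cross-attention adapter inserted between frozen LM layers (Flamingo-style sketch)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.attn_gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0: no visual signal at init
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, visual_tokens):
        # text_hidden:   [batch, seq, dim] hidden states from the frozen LM
        # visual_tokens: [batch, num_visual, dim] pooled visual features
        attn_out, _ = self.attn(text_hidden, visual_tokens, visual_tokens)
        x = text_hidden + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ff_gate) * self.ff(x)
        return x

block = GatedCrossAttentionBlock(dim=512)
out = block(torch.randn(2, 16, 512), torch.randn(2, 64, 512))
```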
| Model | Vision Encoder | Language Model | Connector | Training Strategy |
|---|---|---|---|---|
| CLIP | ViT (trained from scratch) | Transformer text encoder | Shared embedding space | Contrastive on 400M pairs |
| LLaVA | CLIP ViT (frozen) | LLaMA/Vicuna | Linear projection / MLP | 2-stage: alignment then instruction tuning |
| Flamingo | NFNet (frozen) | Chinchilla (frozen) | Gated cross-attention | Cross-attn layers trained on interleaved data |
| GPT-4V | Undisclosed | GPT-4 | Undisclosed | RLHF + safety training |
| Gemini | Native multimodal | Gemini LLM | Native (trained jointly) | End-to-end multimodal pretraining |
mBERT (Google 2019) trains BERT on Wikipedia dumps from 104 languages simultaneously, using a shared WordPiece vocabulary of 110K tokens. No cross-lingual objective — just masked language modeling on multilingual text.
The remarkable finding: fine-tune mBERT on English NER, then test it on German NER with no German labeled data. It works — achieving ~75% of the performance of a German-only model. The shared vocabulary and joint pretraining create a universal representation where analogous entities occupy similar positions across languages. This "cross-lingual transfer" emerged without explicit cross-lingual supervision.
Same architecture as BERT-base: 12 layers, 768 hidden, 12 heads. The shared 110K WordPiece vocabulary inflates the embedding matrix, bringing total parameters to ~178M (vs. 110M for English BERT-base).
Curse of multilinguality: as languages increase, per-language capacity decreases. Low-resource languages get very few tokens. Vocabulary allocation biased toward high-resource languages.
Shared vocabulary creates lexical bridges (cognates, code-switching). Structural similarities across languages create implicit alignment. Numbers, punctuation, and code tokens are universal.
XLM-R (Conneau et al., Facebook AI 2020) applies RoBERTa-style training (no NSP, more data, larger batches, longer sequences) at multilingual scale — training on CC-100, a 2.5TB cleaned multilingual web corpus.
| Aspect | mBERT | XLM-R Base | XLM-R Large |
|---|---|---|---|
| Languages | 104 | 100 | 100 |
| Parameters | ~178M | 270M | 560M |
| Vocabulary | 110K WordPiece | 250K SentencePiece | 250K SentencePiece |
| Training data | Wikipedia only | CC-100 (2.5TB) | CC-100 (2.5TB) |
| XNLI avg accuracy | 65.4% | 79.2% | 83.6% |
| Swahili NER F1 (zero-shot) | ~50% | ~65% | ~70% |
XLM-R's 250K SentencePiece vocabulary (vs mBERT's 110K WordPiece) dramatically improves tokenization quality for non-Latin scripts. Arabic, Thai, and Chinese receive much more vocabulary budget — lower fertility, better representation.
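A quick way to see this is to compare tokenizer fertility (subword pieces per whitespace word) on the same Arabic sentence. This sketch assumes the Hugging Face checkpoints `bert-base-multilingual-cased` and `xlm-roberta-base`; the example sentence is illustrative and exact numbers will vary with the text:

```python
from transformers import AutoTokenizer

mbert = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
xlmr = AutoTokenizer.from_pretrained("xlm-roberta-base")

# "The student wrote a long letter to his friend"
sentence = "كتب الطالب رسالة طويلة إلى صديقه"
words = sentence.split()

for name, tok in [("mBERT", mbert), ("XLM-R", xlmr)]:
    pieces = tok.tokenize(sentence)
    print(f"{name}: {len(pieces)} subword tokens for {len(words)} words "
          f"(fertility ≈ {len(pieces) / len(words):.2f})")
```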
Arabic presents unique challenges for NLP that make it a fascinating case study — and an important one given 350M+ native speakers and significant underrepresentation in most LLMs.
One Arabic root (e.g., k-t-b = writing) can generate 50+ derived words: كَتَبَ (kataba, "he wrote"), كِتَاب (kitāb, "book"), مَكْتُوب (maktūb, "written"). Subword vocabularies dominated by high-resource, Latin-script languages therefore show very high fertility on Arabic.
Arabic text is typically written without short vowels (diacritics). The undiacritized string كتب can be read as kataba ("he wrote"), kutub ("books"), or kutiba ("it was written"). Context determines pronunciation and meaning.
Modern Standard Arabic (formal text) vs. 25+ dialectal varieties (Egyptian, Gulf, Levantine…). Models trained on MSA perform poorly on dialects. Most social media is dialectal.
AraBERT: BERT pretrained on Arabic Wikipedia + news. CAMeLBERT: BERT variants covering MSA, dialectal, and classical Arabic. AraGPT2: Arabic GPT-2 variant. Jais: Arabic-English bilingual LLM (2023).
| Strategy | Data Required | Quality | When to Use |
|---|---|---|---|
| Zero-shot cross-lingual transfer | Source-language labels only | Good (with XLM-R) | No target labeled data available |
| Few-shot cross-lingual transfer | Source labels + a few target examples | Better | Can collect 10–100 target examples |
| Translate-Train | Source labels + MT system | Variable | High-resource translation pairs exist |
| Translate-Test | MT at inference time | Good baseline | Quick baseline; MT errors propagate |
| Native Training | Large target labeled set | Best | High-resource target language |
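For the first row of the table, the recipe is operationally simple: fine-tune on source-language labels, then evaluate directly on the target language. A hedged sketch using the Hugging Face `Trainer`; `english_ner_train` and `swahili_ner_test` are placeholder tokenized datasets (e.g., derived from WikiANN) and the label count is illustrative:

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholders: tokenized datasets with NER labels aligned to subword tokens.
english_ner_train = ...   # English examples only, used for fine-tuning
swahili_ner_test = ...    # target-language examples, never seen during training

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=9)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xlmr-ner-en",
                           num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=english_ner_train,
    eval_dataset=swahili_ner_test,
)
trainer.train()            # fine-tune on English labels only
print(trainer.evaluate())  # zero-shot evaluation on the target language
```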
1. CLIP's training objective maximizes similarity between correct image-text pairs. What loss function does it use?
2. A researcher fine-tunes XLM-R on English NER, then tests it on Spanish NER with zero Spanish labeled data. This is called:
3. In LLaVA's architecture, what is the primary role of the projection layer between the CLIP encoder and the LLM?
4. Which statement about the "curse of multilinguality" in mBERT is most accurate?