Topic hub · 35 claims
Multimodal AI — vision, image generation, and cross-modal models
Models that combine vision, text, audio, or video. Hand-verified release dates, foundational papers, and the organizations behind them.
The vision-language unification
Until 2021, vision and language were largely separate research stacks. CLIP (Radford et al., OpenAI 2021) bridged them by training an image encoder and a text encoder into a shared embedding space with contrastive image-text pretraining. Flamingo (DeepMind 2022) demonstrated few-shot multimodal learning. By 2024 the frontier model families (GPT-4o, Claude 3, Gemini 1.5/2.0) all accepted images alongside text, and GPT-4o and Gemini also handled audio natively within a single model.
Image generation — diffusion takes over
GANs (Goodfellow et al. 2014) dominated image synthesis for roughly seven years. Then diffusion took over: DDPM (Ho et al. 2020), Imagen (Google 2022), Stable Diffusion (CompVis 2022), DALL·E 3 (OpenAI 2023), Stable Diffusion 3 (Stability AI 2024). Each generation improved photorealism and prompt-following. The community split between closed models (DALL·E, Imagen) and open-weight models (Stable Diffusion, Flux).
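As a rough illustration of what the diffusion family computes, the sketch below implements only the forward (noising) half of a DDPM-style process in NumPy. The linear schedule from 1e-4 to 0.02 over 1000 steps follows the DDPM paper; the function names (`make_alpha_bar`, `add_noise`) and the random 8x8 array standing in for an image are assumptions made for illustration. A real model is a neural network trained to predict the added noise so the process can be run in reverse, and that part is omitted here.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t from x_0 in one shot:
    x_t = sqrt(alpha_bar[t]) * x_0 + sqrt(1 - alpha_bar[t]) * noise."""
    noise = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Hypothetical stand-in for an image with pixel values scaled to [-1, 1].
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))

# Early step: mostly signal. Final step: almost pure Gaussian noise.
x_early, noise_early = add_noise(x0, 10, alpha_bar, rng)
x_late, noise_late = add_noise(x0, 999, alpha_bar, rng)

print("signal kept at t=10: ", alpha_bar[10])
print("signal kept at t=999:", alpha_bar[999])
```

Training a denoiser amounts to showing the network `x_t` and `t` and asking it to recover `noise`; generation then starts from pure noise and applies the learned denoising step repeatedly.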
Speech + video — the remaining modalities
Whisper (OpenAI 2022, large-v3 2023) made high-quality speech-to-text openly available. Sora (OpenAI 2024) and Veo (Google 2024) opened text-to-video. The trend: each modality tends to become accessible through a single API call within roughly 12 months of the breakthrough paper.
Defined terms (3)
- Multimodal model: A model that accepts and/or generates more than one modality (text, image, audio, video) in a unified architecture.
- Diffusion model: A generative model that learns to reverse a noising process. Produces high-quality images, audio, and video samples.
- Contrastive pretraining: Training paradigm that learns by pulling matched pairs together and pushing unmatched pairs apart in embedding space. Used by CLIP.
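A minimal sketch of the contrastive objective, assuming a batch of already-computed image and text embeddings in which row i of each array forms a matched pair. It uses the symmetric cross-entropy form described in the CLIP paper; the function names, the fixed temperature of 0.07 (CLIP learns this value during training), and the random vectors standing in for encoder outputs are illustrative assumptions, not CLIP's actual code.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of N pairs."""
    # L2-normalize so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; matched pairs sit on the diagonal.
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(logits.shape[0])

    # Image-to-text: each row should pick out its own column, and vice versa.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2.0


rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))                   # hypothetical text embeddings
images = text + 0.01 * rng.normal(size=(4, 8))   # nearly aligned image embeddings

matched_loss = contrastive_loss(images, text)
shuffled_loss = contrastive_loss(images[::-1], text)  # every pair misaligned

print("matched pairs: ", matched_loss)
print("shuffled pairs:", shuffled_loss)
```

Aligned pairs should give a much lower loss than misaligned ones; minimizing this loss is what pulls matched pairs together and pushes unmatched pairs apart.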
All claims in this topic (35)
- AlexNet·introduced in paper ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky, Sutskever, Hinton, 2012)(1.00 · 2 sources)
- Allen AI Molmo·publicly released on 2024-09-25 by Allen Institute for AI — fully-open multimodal VLM family (1B/7B/72B), Apache 2.0(1.00 · 2 sources)
- Black Forest Labs Flux·publicly released on 2024-08-01 — Flux.1 [pro/dev/schnell] image generation(1.00 · 2 sources)
- CLIP·introduced in paper Learning Transferable Visual Models From Natural Language Supervision (Radford et al., 2021)(1.00 · 2 sources)
- CLIP (Contrastive Language-Image Pretraining)·introduced in paper Learning Transferable Visual Models From Natural Language Supervision (Radford et al., 2021)(1.00 · 2 sources)
- Cohere Aya Vision·publicly released on 2025-03-04 by Cohere For AI — multilingual open-weight vision-language models (8B + 32B), 23 languages(1.00 · 2 sources)
- Cohere Embed v4·publicly released on 2025-04-09 by Cohere — multimodal embedding model, 256k context support, 100+ languages(1.00 · 2 sources)
- DALL·E 3·announced on 2023-09-20(1.00 · 1 source)
- DALL·E 2·released on 2022-04-06(1.00 · 2 sources)
- Denoising Diffusion Probabilistic Models (DDPM)·introduced in paper Denoising Diffusion Probabilistic Models (Ho, Jain, Abbeel, 2020)(1.00 · 2 sources)
- Flamingo·introduced in Alayrac et al. 2022 — DeepMind few-shot vision-language model(1.00 · 2 sources)
- Google Gemma 3·publicly released on 2025-03-12 by Google DeepMind — Gemma 3 family (1B/4B/12B/27B), 128k context, multimodal vision(1.00 · 2 sources)
- GPT-4 Vision·publicly released on 2023-09-25 by OpenAI(1.00 · 2 sources)
- GPT-4o·released on 2024-05-13(1.00 · 1 source)
- ImageNet dataset·introduced in paper ImageNet: A Large-Scale Hierarchical Image Database (Deng et al., 2009)(1.00 · 2 sources)
- Jina AI·founded in 2020 by Han Xiao — open-source multimodal AI infrastructure (Jina Framework, Jina Embeddings, Reader API)(1.00 · 2 sources)
- Latent Diffusion Models (LDM)·introduced in paper High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al., 2021)(1.00 · 2 sources)
- Llama 3.2 (multimodal release including 11B and 90B vision models)·released on 2024-09-25(1.00 · 2 sources)
- Llama 4·released on 2025-04-05 by Meta — Scout + Maverick + Behemoth lineup(1.00 · 2 sources)
- Meta Llama 3.2 Vision·publicly released on 2024-09-25 by Meta — 11B + 90B vision-language variants of Llama 3.2(1.00 · 2 sources)
- Meta SAM 2·publicly released on 2024-07-29 by Meta AI — Segment Anything Model 2, real-time video segmentation(1.00 · 2 sources)
- Microsoft Phi-4 Multimodal·publicly released on 2025-02-26 by Microsoft — Phi-4-multimodal 5.6B parameters with audio + image + text(1.00 · 2 sources)
- Midjourney·publicly released on 2022-07-12 — public beta launch(1.00 · 2 sources)
- Mistral Pixtral 12B·publicly released on 2024-09-11 by Mistral AI — 12B multimodal vision-language model, Apache 2.0(1.00 · 2 sources)
- OpenAI o4-mini·publicly released on 2025-04-16 by OpenAI — successor to o3-mini reasoning model with multimodal input + tool use(1.00 · 2 sources)
- ResNet (Residual Networks)·introduced in paper Deep Residual Learning for Image Recognition (He et al., 2015)(1.00 · 2 sources)
- Stable Diffusion 1.0·released on 2022-08-22(1.00 · 2 sources)
- Stable Diffusion 1.x·released on 2022-08-22(1.00 · 2 sources)
- Whisper large-v3·publicly released on 2023-11-06 by OpenAI(1.00 · 2 sources)
- Black Forest Labs Flux.1 Kontext·publicly released on 2025-05-29 by Black Forest Labs — image generation with in-context editing (text + image input)(0.95 · 2 sources)
- Krea AI·founded in 2023 by Diego Rodriguez + Victor Perez — real-time AI image generation + creative tooling(0.95 · 2 sources)
- Reka Core·publicly released on 2024-04-15 by Reka AI — frontier multimodal LLM (text + image + audio + video input)(0.95 · 2 sources)
- Anthropic Files API·publicly released on 2025-03-25 by Anthropic — file upload + reference API for Claude, supports PDF + image + spreadsheet(0.85 · 2 sources)
- PixelRNN·introduced in paper Pixel Recurrent Neural Networks (van den Oord et al., 2016)(0.82 · 2 sources)
- Show and Tell (Neural Image Caption Generator)·introduced in paper Show and Tell: A Neural Image Caption Generator (Vinyals et al., 2014)(0.82 · 2 sources)