Multimodal AI — vision, image generation, and cross-modal models

The vision-language unification

Until 2021, vision and language were largely separate research stacks. CLIP (Radford et al., OpenAI 2021) bridged them with contrastive image-text pretraining, which maps images and captions into a shared embedding space. Flamingo (DeepMind 2022) demonstrated few-shot multimodal learning. By 2024 the frontier models (GPT-4o, the Claude 3 family, Gemini 1.5/2.0) all accepted images alongside text, and GPT-4o and Gemini also handled audio natively within a single model.

Image generation — diffusion takes over

GANs (Goodfellow et al. 2014) dominated image synthesis for roughly seven years. Then diffusion arrived: DDPM (Ho et al. 2020), Imagen (Google 2022), Stable Diffusion (CompVis 2022), DALL·E 3 (OpenAI 2023), and Stable Diffusion 3 (Stability AI 2024). Each generation improved photorealism and prompt-following. The community split between closed models (DALL·E, Imagen) and open-weight models (Stable Diffusion, Flux).

Speech + video — the remaining modalities

Whisper (OpenAI 2022; large-v3 in 2023) made high-quality speech-to-text openly available. Sora (OpenAI 2024) and Veo (Google 2024) opened up text-to-video. The trend: each modality tends to become accessible through a single API call within about 12 months of the breakthrough paper.

Defined terms (3)

Multimodal model

A model that accepts and/or generates more than one modality (text, image, audio, video) in a unified architecture.

Diffusion model

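A generative model that learns to reverse a gradual noising process: training corrupts data with noise step by step, and sampling starts from noise and denoises back toward a data sample. Produces high-quality image, audio, and video samples.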
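
To make the noising process concrete, here is a minimal sketch of the DDPM-style forward step in plain Python. It assumes the linear beta schedule from the DDPM paper (1e-4 to 0.02 over 1000 steps); the function names, the toy three-element "image", and the choice of step 500 are illustrative assumptions, not part of any library. A real diffusion model trains a network to predict the added noise; here the true noise stands in for a perfect prediction.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    # Per-step noise variances, growing linearly (assumed DDPM-style schedule).
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    # alpha_bar_t = product over s <= t of (1 - beta_s): the fraction of signal left at step t.
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def add_noise(x0, t, alpha_bars, rng):
    # Closed-form forward process:
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise

def recover_x0(xt, t, alpha_bars, predicted_noise):
    # Inverts the forward formula given a noise estimate.
    a = alpha_bars[t]
    return [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a)
            for x, e in zip(xt, predicted_noise)]

rng = random.Random(0)                    # fixed seed for reproducibility
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
x0 = [0.5, -0.25, 1.0]                    # hypothetical toy data sample
xt, noise = add_noise(x0, 500, alpha_bars, rng)
x0_hat = recover_x0(xt, 500, alpha_bars, noise)  # true noise used as a stand-in prediction
```

With an exact noise estimate the original sample is recovered, which is why training a network to predict the noise is enough to learn the reverse process. In practice sampling runs many small denoising steps rather than one inversion.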

Contrastive pretraining

A training paradigm that pulls the embeddings of matched pairs (for example, an image and its caption) together and pushes unmatched pairs apart in a shared embedding space. Used by CLIP.
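
The pull-together, push-apart objective can be sketched as a symmetric cross-entropy over a similarity matrix, in the spirit of CLIP. This is a minimal plain-Python illustration, not CLIP's actual implementation: the function names and the two-dimensional toy embeddings are assumptions, and the temperature of 0.07 is fixed here for simplicity, whereas CLIP learns it during training.

```python
import math

def l2_normalize(vec):
    # Scale to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct "class" for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = similarity of image i and text j, sharpened by the temperature.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
              for img in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Hypothetical toy batch: row i of each list is meant to be a matched pair.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[0.9, 0.1], [0.1, 0.9]]
shuffled_texts = [[0.1, 0.9], [0.9, 0.1]]

loss_matched = clip_style_loss(images, matched_texts)
loss_shuffled = clip_style_loss(images, shuffled_texts)
```

The loss is low when each image is most similar to its own caption and high when the pairing is scrambled, so minimizing it pushes matched pairs toward the diagonal of the similarity matrix.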
