
But how do AI images and videos actually work? | Guest video by Welch Labs

Below is a short summary and detailed review of this video written by FutureFactual:

Diffusion, CLIP, and Guidance: How Text Prompts Shape Video Generation with Diffusion Models

This video explains how modern AI turns text prompts into video with diffusion models, linking the denoising process to physical diffusion and Brownian motion. Stephen Welch walks through an open-source example using Wan 2.1, contrasting naive one-step noise removal with training a model to predict the total noise added over many steps. The talk covers CLIP's paired language and vision encoders, the DDPM and DDIM approaches to efficient generation, and how conditioning on text embeddings plus classifier-free guidance produces outputs that closely follow the prompt. It also shows practical tricks like negative prompts and class conditioning to steer results, a look at DALL-E 2's capabilities, and a hands-on 2D toy visualization that illustrates the idea of a vector field guiding generation.

Overview: From Noise to Video Reality

Stephen Welch introduces diffusion models as a physics-inspired path from pure noise to coherent video, showing how a transformer progressively shapes random noise into realistic frames. The output can be steered with prompts and even simple inputs, and an open-source model like Wan 2.1 demonstrates how prompts influence motion and composition. A hands-on 2D toy example builds intuition for the high-dimensional image spaces where diffusion operates like a vector field guiding samples toward plausible data.
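
For readers who want to experiment, here is a minimal NumPy sketch in the spirit of that 2D toy example (the dataset, noise schedule, and variable names are illustrative, not the video's code): clean 2D points are blended toward Gaussian noise, and the ideal denoising direction at each noisy point is simply the arrow back toward its clean counterpart.

```python
import numpy as np

# Toy 2D "dataset": points on a circle stand in for natural images,
# which occupy a thin manifold inside a huge pixel space.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=500)
data = np.stack([np.cos(angles), np.sin(angles)], axis=1)

def noisy_samples(x0, t, T=100):
    """Blend clean points toward pure Gaussian noise as t goes 0 -> T."""
    alpha_bar = np.cos(0.5 * np.pi * t / T) ** 2   # simple cosine schedule (illustrative)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps, eps

# The "vector field" a diffusion model learns: at each time t, an arrow
# from a noisy point back toward where its clean counterpart lies.
xt, eps = noisy_samples(data, t=60)
field = data - xt            # ideal direction, known here because we kept x0
print(xt[:3], field[:3])     # a few noisy points and the arrows pointing home
```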

"The key will be thinking of diffusion models as learning a time varying vector field" - Stephen Welch

CLIP: A Shared Space for Words and Images

The talk summarizes CLIP, OpenAI's 2021 model pairing a text transformer with an image encoder to create a shared embedding space. By training on image-caption pairs and using contrastive learning, CLIP aligns image and caption vectors so that related pairs cluster together. This shared space enables operations like arithmetic on image concepts and serves as a foundation for steering diffusion models with text inputs.
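
As a rough illustration of that contrastive objective (with random placeholder arrays standing in for real encoder outputs), the sketch below computes the symmetric cross-entropy loss over a cosine-similarity matrix, which is the core of CLIP-style training.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d = 4, 8                               # batch of 4 image-caption pairs, 8-dim embeddings

# Placeholder encoder outputs; in CLIP these come from an image encoder
# and a text transformer projected into a shared space.
img = rng.standard_normal((B, d))
txt = rng.standard_normal((B, d))
img /= np.linalg.norm(img, axis=1, keepdims=True)   # L2-normalize each embedding
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

logits = img @ txt.T / 0.07               # cosine similarities divided by a temperature

def cross_entropy(logits, labels):
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

labels = np.arange(B)                     # matching pair i <-> i is the "correct class"
loss = 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
print(loss)                               # symmetric contrastive loss pulls true pairs together
```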

"Diffusion models learn to point back towards the original data distribution by conditioning on time" - Stephen Welch

From Diffusion to Diffusion Probabilistic Models (DDPM)

DDPM established that you can train a model to remove noise from images by reversing a progressive noise process. Two key insights stand out: (1) during generation, random noise is added at each step to preserve diversity, and (2) training targets the total noise added along the entire diffusion path rather than one-step denoising. This reframes learning as predicting a diffusion score or vector field that points toward the original data.
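
A minimal sketch of both insights, assuming a placeholder `model` function in place of a trained network: the forward process mixes a clean sample with noise in closed form, training regresses the model's output onto that noise, and each reverse step re-injects fresh noise.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)               # closed-form total noise level at each step

def model(x_t, t):
    """Stand-in for the trained network that predicts the noise eps mixed into x_t."""
    return np.zeros_like(x_t)                # placeholder prediction

# --- Training step (insight 2): predict the *total* noise mixed into x_t ---
x0 = rng.standard_normal((8, 2))             # toy clean samples
t = rng.integers(0, T)
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
loss = np.mean((model(x_t, t) - eps) ** 2)   # MSE between predicted and true noise

# --- One generation step (insight 1): denoise, then add fresh noise back ---
eps_hat = model(x_t, t)
mean = (x_t - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
x_prev = mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)   # keeps samples diverse
print(loss, x_prev.shape)
```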

"Conditioning alone is not enough to achieve the level of prompt adherence" - Stephen Welch

Speed and Steering with DDIM and Flow Matching

To reduce compute, researchers developed DDIM, an ordinary-differential-equation-style variant that reaches the same final distribution as DDPM deterministically and in far fewer steps. Flow matching and related approaches treat the learned vector field as something that can be integrated directly, yielding efficient, high-quality generation without stepping through every noise level. The discussion links these ideas back to the physics of diffusion and the mathematics of time-evolving systems.
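
Here is a hedged sketch of the deterministic DDIM update (again with a placeholder `model` in place of a trained network): each step estimates the clean sample from the predicted noise and jumps directly to a coarser timestep, so 50 steps can stand in for the 1000 used in training.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def model(x_t, t):
    """Stand-in for the trained noise-prediction network."""
    return np.zeros_like(x_t)

def ddim_step(x_t, t, t_prev):
    """Deterministic DDIM update: estimate x0, then jump straight to t_prev."""
    eps_hat = model(x_t, t)
    x0_hat = (x_t - np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
    return np.sqrt(alpha_bar[t_prev]) * x0_hat + np.sqrt(1 - alpha_bar[t_prev]) * eps_hat

# Sample with only 50 of the 1000 training steps: no noise is re-injected,
# so the same starting noise always maps to the same output.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 2))
steps = np.linspace(T - 1, 0, 50).astype(int)
for t, t_prev in zip(steps[:-1], steps[1:]):
    x = ddim_step(x, t, t_prev)
print(x)
```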

"Classifier-free guidance works remarkably well" - Stephen Welch

Conditioning, Guidance, and Negative Prompts

The talk then covers practical ways to steer diffusion toward desired outputs. Conditioning on text embeddings (via CLIP-like models) helps, but is not sufficient on its own. Classifier-free guidance subtracts an unconditioned vector field from a conditioned one and amplifies the difference, strengthening the prompt-driven direction and improving adherence. Negative prompts, used in some models to explicitly suppress unwanted features such as extraneous elements or implausible motion, work even when written in a language other than English.
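
A compact sketch of the guidance arithmetic, with scalar stand-ins for real text embeddings and a placeholder noise predictor: classifier-free guidance extrapolates from the unconditional prediction toward the conditional one, and a negative prompt simply replaces the empty prompt in the unconditional branch.

```python
import numpy as np

def model(x_t, t, text_embedding):
    """Stand-in for a text-conditioned noise predictor (e.g. conditioned on CLIP-style embeddings)."""
    return 0.01 * text_embedding * np.ones_like(x_t)   # placeholder output

rng = np.random.default_rng(0)
x_t = rng.standard_normal((1, 4))
prompt_emb, empty_emb, negative_emb = 1.0, 0.0, -1.0   # stand-ins for real text embeddings
w = 7.5                                                # guidance scale

# Classifier-free guidance: push the prediction away from the unconditional
# field and further along the prompt-conditioned direction.
eps_cond = model(x_t, t=500, text_embedding=prompt_emb)
eps_uncond = model(x_t, t=500, text_embedding=empty_emb)
eps_guided = eps_uncond + w * (eps_cond - eps_uncond)

# Negative prompting: swap the empty prompt for an embedding of what to avoid,
# so guidance actively steers away from those features.
eps_negative = model(x_t, t=500, text_embedding=negative_emb)
eps_guided_neg = eps_negative + w * (eps_cond - eps_negative)
print(eps_guided[:, :2], eps_guided_neg[:, :2])
```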

"All you need is language" - Stephen Welch

Hands-on and Real-World Implications

Welch highlights open-source demonstrations with Wan 2.1 and describes how DALL-E 2 effectively inverted CLIP embeddings to achieve remarkable prompt fidelity. The session ties theory to practice with visual examples, including a desert tree prompt and a negative-prompt comparison, showing how the diffusion process can produce lifelike video without conventional animation tools.

Note: This content reflects the ideas and demonstrations presented in the video, including hands-on prompts, model comparisons, and theoretical explanations of diffusion models and guidance techniques.

“The future of factual AI content is about credible, cross-referenced exploration” - Stephen Welch