Computer Vision Interview · 20 Essential Q&A · Updated 2026

Diffusion Models: 20 Essential Q&A

Gradually destroy data with noise, then learn to reverse the process—state-of-the-art image and video generation.

~12 min read · 20 questions · Advanced
Tags: forward process · U-Net · guidance · latent
1 What is a diffusion model? ⚡ easy
Answer: Generative model that learns to reverse a gradual noising process—start from Gaussian noise and denoise into a sample.
2 Forward process? 📊 medium
Answer: Fixed Markov chain adding Gaussian noise over T steps until data ≈ pure noise—q(x_t|x_{t−1}) with known variances.
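The forward process has a convenient closed form: instead of noising step by step, x_t can be sampled directly from x_0. A minimal numpy sketch under an assumed linear β schedule (function and variable names are illustrative):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng=np.random.default_rng(0)):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

betas = np.linspace(1e-4, 0.02, 1000)  # linear schedule over T = 1000 steps
x0 = np.ones((4, 4))
xT = forward_diffuse(x0, t=999, betas=betas)
# At t = T-1, alpha_bar is tiny, so x_T is approximately pure Gaussian noise.
```

Because ᾱ_t (the cumulative product of 1−β_t) shrinks toward zero, the signal term vanishes and only noise remains at large t.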
3 Reverse process? 📊 medium
Answer: Learn p_θ(x_{t−1}|x_t) approximating true posterior—typically predict noise ε or x_0 with a neural net.
4 Training objective (DDPM)? 🔥 hard
Answer: Simplified ε-prediction MSE—the network predicts the noise added at step t; equivalent to a reweighted variational lower bound.
5 Noise schedule β_t? 📊 medium
Answer: How fast variance grows with t—linear, cosine, etc.; affects training stability and sample quality.
# ε-prediction: target is the true noise ε; loss = MSE(unet(x_t, t), ε)
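Putting the schedule and the ε-prediction objective together, one training-loss evaluation can be sketched in numpy; the `denoise_fn` here is a stand-in for the real U-Net, and the names are illustrative:

```python
import numpy as np

def ddpm_loss(denoise_fn, x0, t, betas, rng=np.random.default_rng(0)):
    """Simplified DDPM objective at one timestep: MSE between the true
    noise eps and the network's prediction from the noised input x_t."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    eps_pred = denoise_fn(x_t, t)  # stand-in for unet(x_t, t)
    return np.mean((eps_pred - eps) ** 2)

betas = np.linspace(1e-4, 0.02, 1000)
# A predictor that always outputs zeros has loss ≈ E[eps^2] ≈ 1.
loss = ddpm_loss(lambda x_t, t: np.zeros_like(x_t),
                 np.ones((8, 8)), t=500, betas=betas)
```

In real training, t is sampled uniformly per example and the loss is averaged over the batch.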
6 Why a U-Net? 📊 medium
Answer: Multi-scale spatial denoising with skip connections—preserves detail while aggregating context; time t injected via embeddings.
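The timestep injection mentioned above is commonly done with a transformer-style sinusoidal embedding, then mapped through an MLP and added (or used to scale/shift) inside residual blocks. A hedged sketch with illustrative names:

```python
import numpy as np

def timestep_embedding(t, dim):
    """Transformer-style sinusoidal embedding of a scalar timestep t.
    Each channel pair uses a different frequency from a geometric series."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb = timestep_embedding(250, dim=128)  # fed to the U-Net blocks via an MLP
```

Distinct timesteps get distinct embeddings, which lets one network denoise at every noise level.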
7 Sampling cost? 📊 medium
Answer: Sequential in time—hundreds to thousands of denoising steps make sampling slow; fast samplers (DDIM) and distillation cut the step count.
8 DDIM? 🔥 hard
Answer: Non-Markovian, deterministic sampler that reuses the DDPM training objective—far fewer steps with a modest quality tradeoff.
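One deterministic DDIM update (η = 0) can be sketched as: use the ε-prediction to estimate x_0, then re-noise that estimate to the earlier timestep. Names are illustrative; ᾱ values come from the schedule:

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM step (eta = 0):
    recover an x0 estimate, then deterministically map it to t_prev."""
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * x0_hat + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

# Toy check: if eps_pred is exactly the noise in x_t, x0 is recovered exactly.
x0 = np.ones(3)
eps = 0.5 * np.ones(3)
x_t = np.sqrt(0.5) * x0 + np.sqrt(0.5) * eps
x_prev = ddim_step(x_t, eps, alpha_bar_t=0.5, alpha_bar_prev=0.9)
```

Because t_prev need not be t−1, the sampler can skip across the schedule—this is what makes 20–50 step sampling possible.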
9 Classifier guidance? 🔥 hard
Answer: Use gradients from a classifier p(y|x_t) during sampling to steer generation—sharp but needs extra classifier.
10 Classifier-free guidance? 📊 medium
Answer: Train conditional and unconditional model together; interpolate scores at sample time—no separate classifier, widely used in SD.
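The score interpolation at sample time is a one-liner: extrapolate from the unconditional prediction toward the conditional one. A minimal sketch (function name is illustrative):

```python
import numpy as np

def cfg_eps(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the conditional one with guidance scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# w = 1 recovers the plain conditional model; w > 1 (e.g. ~7.5 in
# Stable Diffusion) pushes samples harder toward the prompt.
guided = cfg_eps(np.zeros(3), np.ones(3), w=7.5)
```

In practice both predictions come from the same network, run once with the text embedding and once with a null/empty embedding.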
11 Latent diffusion? 🔥 hard
Answer: Run diffusion in VAE latent space (lower res)—much cheaper; decode with VAE decoder (Stable Diffusion).
12 Stable Diffusion pieces? 📊 medium
Answer: CLIP text encoder, U-Net denoiser in latent space, VAE—plus schedulers and safety tooling around the stack.
13 vs GANs? 📊 medium
Answer: Diffusion: stable training, great diversity, slower sampling. GAN: fast one-shot but trickier mode coverage.
14 Video diffusion? 📊 medium
Answer: Extend an image backbone with temporal layers—3D convs or temporal attention across frames—data- and compute-heavy.
15 Inpainting? ⚡ easy
Answer: Condition on known regions by concatenating mask/channel inputs to U-Net—fill missing areas consistently.
16 Text conditioning? 📊 medium
Answer: Cross-attention from spatial features to text-token embeddings (CLIP/T5), as in transformers—the text sequence serves as keys and values.
17 SNR weighting? 🔥 hard
Answer: Different timesteps contribute unequally to loss—reweighting (v-prediction, Min-SNR) improves quality.
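With SNR_t = ᾱ_t / (1 − ᾱ_t), the Min-SNR-γ weight for the ε-prediction loss caps the contribution of easy low-noise steps. A hedged sketch (γ = 5 is the commonly cited default):

```python
import numpy as np

def min_snr_weight(alpha_bar_t, gamma=5.0):
    """Min-SNR-gamma weight for the eps-prediction loss:
    w_t = min(SNR_t, gamma) / SNR_t, with SNR_t = alpha_bar / (1 - alpha_bar)."""
    snr = alpha_bar_t / (1.0 - alpha_bar_t)
    return np.minimum(snr, gamma) / snr

w_mid = min_snr_weight(0.5)    # SNR = 1 < gamma, so weight = 1
w_low_noise = min_snr_weight(0.99)  # SNR ≈ 99, weight ≈ 5/99, heavily damped
```

High-noise steps (SNR below γ) keep full weight; nearly clean steps are downweighted, which empirically speeds convergence and improves quality.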
18 Flow matching? 🔥 hard
Answer: Learns a velocity field that transports noise to data along an ODE path—competes with diffusion on speed and quality in recent work.
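Conditional flow matching with the popular linear path can be sketched in a few lines: interpolate between data and noise, and regress the constant target velocity x1 − x0 (names are illustrative):

```python
import numpy as np

def linear_path(x0, x1, t):
    """Linear path from data x0 to noise x1: x_t = (1 - t) x0 + t x1."""
    return (1.0 - t) * x0 + t * x1

def fm_loss(velocity_fn, x0, x1, t):
    """Flow-matching regression: MSE between the predicted velocity
    and the true constant velocity x1 - x0 along the linear path."""
    x_t = linear_path(x0, x1, t)
    return np.mean((velocity_fn(x_t, t) - (x1 - x0)) ** 2)

# A predictor that returns the exact velocity achieves zero loss.
loss0 = fm_loss(lambda x, t: np.ones(4), np.zeros(4), np.ones(4), t=0.3)
```

Sampling then integrates the learned ODE from noise (t = 1) back to data (t = 0), typically in far fewer steps than ancestral diffusion sampling.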
19 Compute / data? ⚡ easy
Answer: Large image-text pairs for T2I; training is GPU-heavy; inference optimizes with TensorRT, FlashAttention, distilled samplers.
20 Evaluation? 📊 medium
Answer: FID, CLIP score for text alignment, human preference studies—no single metric captures all.

Diffusion Cheat Sheet

Train
  • Predict ε
Sample
  • Reverse steps
  • CFG
LDM
  • VAE latent

💡 Pro tip: Forward fixed, reverse learned; mention CFG and latent diffusion.

Full tutorial track

Go deeper with the matching tutorial chapter and code examples.