Computer Vision Interview · 20 Essential Q&A · Updated 2026

Diffusion Models: 20 Essential Q&A

Gradually destroy data with noise, then learn to reverse the process—state-of-the-art image and video generation.

~12 min read · 20 questions · Advanced
Tags: forward process · U-Net · guidance · latent
1 What is a diffusion model? ⚡ easy
Answer: Generative model that learns to reverse a gradual noising process—start from Gaussian noise and denoise into a sample.
2 Forward process? 📊 medium
Answer: Fixed Markov chain adding Gaussian noise over T steps until data ≈ pure noise—q(x_t|x_{t−1}) with known variances.
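The forward process has a convenient closed form: instead of noising step by step, x_t can be sampled directly from x_0. A minimal numpy sketch under an assumed linear β schedule (function and variable names are illustrative):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng=np.random.default_rng(0)):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

betas = np.linspace(1e-4, 0.02, 1000)  # linear schedule over T = 1000 steps
x0 = np.ones((4, 4))
xT = forward_diffuse(x0, t=999, betas=betas)
# At t = T-1, alpha_bar is tiny, so x_T is approximately pure Gaussian noise.
```

Because ᾱ_t (the cumulative product of 1−β_t) shrinks toward zero, the signal term vanishes and only noise remains at large t.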
3 Reverse process? 📊 medium
Answer: Learn p_θ(x_{t−1}|x_t) approximating true posterior—typically predict noise ε or x_0 with a neural net.
4 Training objective (DDPM)? 🔥 hard
Answer: Simplified ε-prediction MSE—the network predicts the noise added at step t; equivalent to a reweighted variational lower bound.
5 Noise schedule β_t? 📊 medium
Answer: How fast variance grows with t—linear, cosine, etc.; affects training stability and sample quality.
# ε-prediction: target is the true noise ε; loss = MSE(unet(x_t, t), ε)
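Putting the schedule and the ε-prediction objective together, one training-loss evaluation can be sketched in numpy; the `denoise_fn` here is a stand-in for the real U-Net, and the names are illustrative:

```python
import numpy as np

def ddpm_loss(denoise_fn, x0, t, betas, rng=np.random.default_rng(0)):
    """Simplified DDPM objective at one timestep: MSE between the true
    noise eps and the network's prediction from the noised input x_t."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    eps_pred = denoise_fn(x_t, t)  # stand-in for unet(x_t, t)
    return np.mean((eps_pred - eps) ** 2)

betas = np.linspace(1e-4, 0.02, 1000)
# A predictor that always outputs zeros has loss ≈ E[eps^2] ≈ 1.
loss = ddpm_loss(lambda x_t, t: np.zeros_like(x_t),
                 np.ones((8, 8)), t=500, betas=betas)
```

In real training, t is sampled uniformly per example and the loss is averaged over the batch.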
6 Why a U-Net? 📊 medium
Answer: Multi-scale spatial denoising with skip connections—preserves detail while aggregating context; time t injected via embeddings.
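The timestep injection mentioned above is commonly done with a transformer-style sinusoidal embedding, then mapped through an MLP and added (or used to scale/shift) inside residual blocks. A hedged sketch with illustrative names:

```python
import numpy as np

def timestep_embedding(t, dim):
    """Transformer-style sinusoidal embedding of a scalar timestep t.
    Each channel pair uses a different frequency from a geometric series."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb = timestep_embedding(250, dim=128)  # fed to the U-Net blocks via an MLP
```

Distinct timesteps get distinct embeddings, which lets one network denoise at every noise level.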
7 Sampling cost? 📊 medium
Answer: Sequential in time—hundreds to thousands of denoising steps make sampling slow; fast samplers (DDIM) and distillation cut the step count.
8 DDIM? 🔥 hard
Answer: Non-Markovian, deterministic sampler that reuses the DDPM training objective—far fewer steps with a modest quality tradeoff.
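One deterministic DDIM update (η = 0) can be sketched as: use the ε-prediction to estimate x_0, then re-noise that estimate to the earlier timestep. Names are illustrative; ᾱ values come from the schedule:

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM step (eta = 0):
    recover an x0 estimate, then deterministically map it to t_prev."""
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * x0_hat + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

# Toy check: if eps_pred is exactly the noise in x_t, x0 is recovered exactly.
x0 = np.ones(3)
eps = 0.5 * np.ones(3)
x_t = np.sqrt(0.5) * x0 + np.sqrt(0.5) * eps
x_prev = ddim_step(x_t, eps, alpha_bar_t=0.5, alpha_bar_prev=0.9)
```

Because t_prev need not be t−1, the sampler can skip across the schedule—this is what makes 20–50 step sampling possible.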
9 Classifier guidance? 🔥 hard
Answer: Use gradients from a classifier p(y|x_t) during sampling to steer generation—sharp but needs extra classifier.
10 Classifier-free guidance? 📊 medium
Answer: Train conditional and unconditional model together; interpolate scores at sample time—no separate classifier, widely used in SD.
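The score interpolation at sample time is a one-liner: extrapolate from the unconditional prediction toward the conditional one. A minimal sketch (function name is illustrative):

```python
import numpy as np

def cfg_eps(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the conditional one with guidance scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# w = 1 recovers the plain conditional model; w > 1 (e.g. ~7.5 in
# Stable Diffusion) pushes samples harder toward the prompt.
guided = cfg_eps(np.zeros(3), np.ones(3), w=7.5)
```

In practice both predictions come from the same network, run once with the text embedding and once with a null/empty embedding.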
11 Latent diffusion? 🔥 hard
Answer: Run diffusion in VAE latent space (lower res)—much cheaper; decode with VAE decoder (Stable Diffusion).
12 Stable Diffusion pieces? 📊 medium
Answer: CLIP text encoder, U-Net denoiser in latent space, VAE—plus schedulers and safety tooling around the stack.
13 vs GANs? 📊 medium
Answer: Diffusion: stable training, great diversity, slower sampling. GAN: fast one-shot but trickier mode coverage.
14 Video diffusion? 📊 medium
Answer: Extend an image backbone with temporal layers—3D convs or temporal attention across frames—data- and compute-heavy.
15 Inpainting? ⚡ easy
Answer: Condition on known regions by concatenating mask/channel inputs to U-Net—fill missing areas consistently.
16 Text conditioning? 📊 medium
Answer: Cross-attention from spatial features to text-token embeddings (CLIP/T5), as in transformers—the text sequence serves as keys and values.
17 SNR weighting? 🔥 hard
Answer: Different timesteps contribute unequally to loss—reweighting (v-prediction, Min-SNR) improves quality.
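With SNR_t = ᾱ_t / (1 − ᾱ_t), the Min-SNR-γ weight for the ε-prediction loss caps the contribution of easy low-noise steps. A hedged sketch (γ = 5 is the commonly cited default):

```python
import numpy as np

def min_snr_weight(alpha_bar_t, gamma=5.0):
    """Min-SNR-gamma weight for the eps-prediction loss:
    w_t = min(SNR_t, gamma) / SNR_t, with SNR_t = alpha_bar / (1 - alpha_bar)."""
    snr = alpha_bar_t / (1.0 - alpha_bar_t)
    return np.minimum(snr, gamma) / snr

w_mid = min_snr_weight(0.5)    # SNR = 1 < gamma, so weight = 1
w_low_noise = min_snr_weight(0.99)  # SNR ≈ 99, weight ≈ 5/99, heavily damped
```

High-noise steps (SNR below γ) keep full weight; nearly clean steps are downweighted, which empirically speeds convergence and improves quality.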
18 Flow matching? 🔥 hard
Answer: Learns a velocity field that transports noise to data along an ODE path—competes with diffusion on speed and quality in recent work.
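Conditional flow matching with the popular linear path can be sketched in a few lines: interpolate between data and noise, and regress the constant target velocity x1 − x0 (names are illustrative):

```python
import numpy as np

def linear_path(x0, x1, t):
    """Linear path from data x0 to noise x1: x_t = (1 - t) x0 + t x1."""
    return (1.0 - t) * x0 + t * x1

def fm_loss(velocity_fn, x0, x1, t):
    """Flow-matching regression: MSE between the predicted velocity
    and the true constant velocity x1 - x0 along the linear path."""
    x_t = linear_path(x0, x1, t)
    return np.mean((velocity_fn(x_t, t) - (x1 - x0)) ** 2)

# A predictor that returns the exact velocity achieves zero loss.
loss0 = fm_loss(lambda x, t: np.ones(4), np.zeros(4), np.ones(4), t=0.3)
```

Sampling then integrates the learned ODE from noise (t = 1) back to data (t = 0), typically in far fewer steps than ancestral diffusion sampling.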
19 Compute / data? ⚡ easy
Answer: Large image-text pairs for T2I; training is GPU-heavy; inference optimizes with TensorRT, FlashAttention, distilled samplers.
20 Evaluation? 📊 medium
Answer: FID, CLIP score for text alignment, human preference studies—no single metric captures all.

Diffusion Cheat Sheet

Train
  • Predict ε
Sample
  • Reverse steps
  • CFG
LDM
  • VAE latent

💡 Pro tip: Forward fixed, reverse learned; mention CFG and latent diffusion.

Full tutorial track

Go deeper with the matching tutorial chapter and code examples.