
Convolutional Neural Networks — 15 Interview Questions

Local receptive fields, parameter sharing, 1×1 convs, depthwise separable convs, and why CNNs beat flat MLPs on images.


Topics: convolution, pooling, receptive fields, channels.
1. What is convolution in a CNN? [Easy]
Answer: Slide a small learned filter over the input (with optional padding/stride), computing dot products at each position—local connectivity + weight sharing across space.
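A minimal sketch of that operation, assuming PyTorch (no framework is named in the card); channel counts and image size are illustrative.

```python
import torch
import torch.nn as nn

# One learned 3x3 filter bank slides over every spatial position;
# the same weights are reused everywhere (weight sharing).
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)
y = conv(x)
print(y.shape)                  # torch.Size([1, 16, 32, 32])
```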
2. Stride and output spatial size (1D intuition). [Medium]
Answer: Stride s subsamples outputs. Common formula (1D): out = ⌊(n + 2p − k)/s⌋ + 1 for input length n, padding p, kernel size k; in 2D the same formula applies independently per spatial axis, regardless of layout convention (e.g. NCHW).
out = ⌊(n + 2p − k) / s⌋ + 1
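The same formula in plain Python, as a hedged sketch; the helper name conv_out_len is ours, not from any library.

```python
def conv_out_len(n: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output length for input n, kernel k, padding p, stride s."""
    return (n + 2 * p - k) // s + 1

assert conv_out_len(n=32, k=3, p=1, s=1) == 32   # "same" padding keeps the size
assert conv_out_len(n=32, k=3, p=1, s=2) == 16   # stride 2 halves it
```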
3. Parameter count for a conv layer. [Medium]
Answer: Per filter: k_h × k_w × C_in weights + bias (if used). With C_out filters: multiply by C_out. Far fewer than fully connecting all pixels to all hidden units.
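A quick sketch of that count, cross-checked against a PyTorch layer (PyTorch is an assumed dependency here).

```python
import torch.nn as nn

def conv_params(k_h, k_w, c_in, c_out, bias=True):
    # Per filter: k_h * k_w * c_in weights (+1 bias), times c_out filters.
    return c_out * (k_h * k_w * c_in + (1 if bias else 0))

layer = nn.Conv2d(64, 128, kernel_size=3)
assert sum(p.numel() for p in layer.parameters()) == conv_params(3, 3, 64, 128)
print(conv_params(3, 3, 64, 128))   # 73856
```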
4. What does a 1×1 convolution do? [Medium]
Answer: Mixes channels at each spatial location without blending neighbors—changes depth (bottleneck/expansion) and, when followed by an activation, adds nonlinearity cheaply (Inception, ResNet bottlenecks).
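A sketch of a 1×1 bottleneck, assuming PyTorch; the 256→64 reduction echoes ResNet bottlenecks and is used here only as an example.

```python
import torch
import torch.nn as nn

bottleneck = nn.Conv2d(256, 64, kernel_size=1)   # mixes channels per pixel
x = torch.randn(1, 256, 14, 14)
print(bottleneck(x).shape)   # torch.Size([1, 64, 14, 14]); spatial dims untouched
```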
5. Max pooling vs average pooling. [Easy]
Answer: Max: strongest local activation—sharp features. Avg: smoother downsampling—used in some network heads (e.g. Global Average Pooling). Both reduce spatial size.
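Both poolings side by side, as a minimal PyTorch-assumed sketch:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)
print(nn.MaxPool2d(2)(x).shape)   # torch.Size([1, 8, 16, 16]); keeps strongest activations
print(nn.AvgPool2d(2)(x).shape)   # torch.Size([1, 8, 16, 16]); smooth local averages
```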
6. Global average pooling (GAP). [Medium]
Answer: Average each channel over full H×W → vector length C—replaces large FC layers, reduces parameters and overfitting (ResNet-style classifiers).
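A GAP sketch, assuming PyTorch; AdaptiveAvgPool2d(1) is one standard way to express it.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 512, 7, 7)            # typical ResNet final feature map
v = nn.AdaptiveAvgPool2d(1)(x).flatten(1)
print(v.shape)                            # torch.Size([1, 512]); no large FC layer needed
```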
7. Transposed convolution (deconv)—one line. [Hard]
Answer: Upsampling with a learned kernel—used in segmentation/decoders; can create checkerboard artifacts if kernel size and stride are chosen carelessly.
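A sketch of learned 2× upsampling, assuming PyTorch; kernel 4 / stride 2 / padding 1 is a common artifact-reducing choice.

```python
import torch
import torch.nn as nn

up = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)
x = torch.randn(1, 64, 16, 16)
print(up(x).shape)   # torch.Size([1, 32, 32, 32]); spatial size doubled
```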
8. Dilated (atrous) convolution—why? [Hard]
Answer: Spaces kernel taps with holes to increase receptive field without more parameters or losing resolution—common in segmentation (DeepLab).
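Sketch, assuming PyTorch: dilation=2 makes a 3×3 kernel cover a 5×5 window with the same nine weights, and padding=2 keeps the resolution.

```python
import torch
import torch.nn as nn

atrous = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)
x = torch.randn(1, 64, 28, 28)
print(atrous(x).shape)   # torch.Size([1, 64, 28, 28]); bigger receptive field, same size
```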
9. Depthwise separable convolution. [Medium]
Answer: Depthwise conv per channel + pointwise 1×1 to mix channels—far fewer parameters and FLOPs than a standard conv; MobileNet family.
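A parameter-count comparison as a sketch, assuming PyTorch; groups=C_in is how a depthwise conv is expressed there.

```python
import torch.nn as nn

c_in, c_out = 64, 128
standard  = nn.Conv2d(c_in, c_out, 3, padding=1)
depthwise = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in)  # one filter per channel
pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)             # 1x1 channel mixing

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(depthwise) + count(pointwise))   # 73856 vs 8960, ~8x fewer
```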
10. Translation equivariance—what does a CNN get? [Medium]
Answer: Shift input → feature maps shift correspondingly (before pooling). Equivariant to translation; pooling adds approximate invariance locally.
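A sketch that checks the property numerically, assuming PyTorch; torch.roll stands in for a translation, and borders are excluded because zero padding breaks exact equivariance there.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
x = torch.randn(1, 1, 16, 16)

y1 = conv(torch.roll(x, shifts=3, dims=-1))   # shift, then convolve
y2 = torch.roll(conv(x), shifts=3, dims=-1)   # convolve, then shift
print(torch.allclose(y1[..., 4:-4], y2[..., 4:-4]))   # True away from the borders
```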
11. CNN vs MLP on images—interview answer. [Easy]
Answer: CNN exploits locality and sharing—fewer parameters, better sample efficiency; MLP ignores spatial structure and scales poorly with resolution.
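A concrete parameter comparison as a sketch (PyTorch assumed; sizes illustrative):

```python
import torch.nn as nn

mlp_layer  = nn.Linear(224 * 224 * 3, 1024)              # dense on flattened pixels
conv_layer = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # shared local filters

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"{count(mlp_layer):,} vs {count(conv_layer):,}")  # 154,141,696 vs 1,792
```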
12. Receptive field—define. [Easy]
Answer: Region of input pixels that can affect one output activation—grows with depth, kernel size, and dilation; stride and pooling make it grow faster in subsequent layers.
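The standard receptive-field recurrence as a sketch; receptive_field is our helper name, not a library function.

```python
def receptive_field(layers):
    """layers: (kernel, stride, dilation) tuples, listed input to output."""
    rf, jump = 1, 1
    for k, s, d in layers:
        rf += d * (k - 1) * jump   # each layer adds taps scaled by accumulated stride
        jump *= s
    return rf

print(receptive_field([(3, 1, 1), (3, 1, 1), (3, 1, 1)]))   # 7: three stacked 3x3 convs
print(receptive_field([(3, 1, 1), (2, 2, 1), (3, 1, 1)]))   # 8: pooling speeds up growth
```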
13. What do deeper layers often represent? [Medium]
Answer: Hierarchical features: edges/textures in early layers → parts → object-level abstractions (an interpretation; not guaranteed per filter).
14. Data augmentation for CNNs—examples. [Easy]
Answer: Random crop/flip, color jitter, Cutout/RandAugment—reduces overfitting by simulating label-preserving transforms.
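A typical pipeline sketch, assuming torchvision; all of these are label-preserving transforms.

```python
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),       # random crop + rescale
    transforms.RandomHorizontalFlip(),       # mirror with p=0.5
    transforms.ColorJitter(0.4, 0.4, 0.4),   # brightness/contrast/saturation jitter
    transforms.ToTensor(),
])
```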
15. Name one classic and one modern CNN family. [Easy]
Answer: Classic: VGG / ResNet. Modern: EfficientNet, ConvNeXt, or a conv/ViT hybrid—shows you know the field has moved beyond plain conv stacks.
Be ready to sketch a conv → BN → ReLU → pool block.
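A minimal PyTorch-assumed sketch of that block, in case you are asked to write it rather than draw it:

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),  # bias is redundant before BN
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),                    # halves H and W
)
```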

Quick review checklist

  • Conv, stride, padding; params per layer; 1×1 and depthwise sep.
  • Pooling vs strided conv; GAP; receptive field.
  • Equivariance; CNN vs MLP; augmentation.