Computer Vision Interview
20 essential Q&A
Updated 2026
CNN
CNNs for Vision: 20 Essential Q&A
Why convolutions beat dense layers on images—and how pooling, padding, and depth build representations.
~11 min read
20 questions
Intermediate
conv · pool · ReLU · parameter sharing
1
Why CNNs for vision?
⚡ easy
Answer: Images have spatial structure; conv layers exploit local correlations and share weights—far fewer parameters and better generalization than huge FC layers on pixels.
nn.Conv2d(in_c, out_c, k, padding=1)  # PyTorch: standard conv layer
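To make the parameter savings concrete, here is a quick count; the layer sizes (3→64 channels, 224×224 RGB input) are illustrative assumptions, not taken from the text:

```python
# Parameters of one 3x3 conv layer vs. one dense layer on flattened pixels.
def conv_params(c_in, c_out, k):
    # each filter holds k*k*c_in weights; one bias per output channel
    return k * k * c_in * c_out + c_out

def fc_params(n_in, n_out):
    # every input connects to every output, plus biases
    return n_in * n_out + n_out

conv = conv_params(3, 64, 3)        # 3x3 conv, 3 -> 64 channels: 1,792 params
fc = fc_params(224 * 224 * 3, 64)   # dense layer on flattened 224x224x3: ~9.6M
print(conv, fc)
```

The conv layer also reuses those same 1,792 parameters at every spatial position, which is where the generalization benefit comes from.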
2
What does a conv layer do?
📊 medium
Answer: Slides learnable filters over the input; each output location is dot product of filter with local patch—detects patterns like edges/textures at many positions.
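A minimal sketch of that sliding dot product in NumPy; the 4×4 input and all-ones 3×3 filter are made-up examples (an all-ones filter simply sums each patch):

```python
import numpy as np

def cross_correlate2d(x, w):
    """Valid cross-correlation: slide w over x, dot product at each position."""
    H, W = x.shape
    k, _ = w.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)  # filter . local patch
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((3, 3))
print(cross_correlate2d(x, w))  # [[45. 54.] [81. 90.]]
```

A real conv layer learns the filter weights and runs many such filters in parallel; frameworks implement this far more efficiently, but the arithmetic is the same.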
4
Local receptive field?
⚡ easy
Answer: Each neuron sees only a small neighborhood—deeper layers indirectly see larger context via stacked convs.
5
Stride and padding?
📊 medium
Answer: Stride subsamples spatial size; same padding keeps H×W with zero border; valid shrinks without padding.
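The standard output-size formula makes this concrete (the 32-pixel input is an illustrative assumption):

```python
def conv_out_size(h, k, pad, stride):
    # floor((H + 2*pad - k) / stride) + 1: the number of valid filter positions
    return (h + 2 * pad - k) // stride + 1

print(conv_out_size(32, k=3, pad=1, stride=1))  # 32: "same" padding keeps H
print(conv_out_size(32, k=3, pad=0, stride=1))  # 30: "valid" shrinks by k-1
print(conv_out_size(32, k=3, pad=1, stride=2))  # 16: stride 2 halves H
```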
6
Purpose of pooling?
📊 medium
Answer: Reduces spatial resolution, adds slight translation tolerance, and lowers compute—max pool keeps strongest activations in each window.
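A sketch of 2×2 max pooling with stride 2 in NumPy (the input values are made up); each output keeps only the strongest activation in its window:

```python
import numpy as np

def max_pool2x2(x):
    # 2x2 max pooling, stride 2: reshape into windows, take the max of each
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1., 2., 0., 1.],
              [3., 4., 1., 0.],
              [0., 1., 5., 6.],
              [2., 0., 7., 8.]])
print(max_pool2x2(x))  # [[4. 1.] [2. 8.]]: spatial size halves, peaks survive
```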
7
Output depth?
⚡ easy
Answer: Number of filters = number of output channels—each filter produces one feature map.
8
Receptive field size?
🔥 hard
Answer: Grows with kernel size, stride, and depth: after stacking layers, each output neuron "sees" a correspondingly larger region of the input image (two stacked 3×3 convs cover 5×5; three cover 7×7).
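The growth can be computed with a simple recurrence; this is a sketch for a plain conv stack (dilation ignored), where "jump" tracks the input-space spacing between adjacent outputs:

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs. Returns RF size at the input."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the RF by (k-1) input steps
        jump *= s              # each stride multiplies the step between outputs
    return rf

print(receptive_field([(3, 1), (3, 1)]))          # 5: two 3x3 convs see 5x5
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7: three 3x3 convs see 7x7
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8: pooling grows RF faster
```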
9
Why 1×1 conv?
📊 medium
Answer: Mixes channels at each spatial location—cheap way to change depth (bottleneck), add nonlinearity, or implement MLP per pixel.
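A 1×1 conv is exactly one matrix multiply repeated at every pixel; this NumPy sketch (random shapes are illustrative assumptions) shows depth changing while spatial size stays fixed:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8, 8))   # (C_in, H, W) feature map
w = rng.standard_normal((16, 64))     # 1x1 conv = a C_out x C_in weight matrix

# Apply the same channel-mixing matmul at every (h, w) location
y = np.einsum('oc,chw->ohw', w, x)
print(y.shape)  # (16, 8, 8): depth 64 -> 16, H and W unchanged
```

Checking one pixel confirms the "MLP per pixel" view: `y[:, i, j]` equals `w @ x[:, i, j]`.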
10
CNN vs fully connected?
📊 medium
Answer: FC connects all inputs to each output—no locality; used at end (or as 1×1 conv) after spatial reduction for classification.
11
Translation equivariance?
🔥 hard
Answer: Shift input → shifted feature maps (before pooling)—CNN respects spatial structure; pooling adds limited invariance.
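Equivariance can be verified numerically. This 1-D sketch (made-up signal and filter) shifts the input by one step and checks that the valid-convolution output shifts by one step wherever both are defined:

```python
def corr1d(x, w):
    # valid 1-D cross-correlation
    return [sum(w[k] * x[i + k] for k in range(len(w)))
            for i in range(len(x) - len(w) + 1)]

x = [0., 1., 3., 2., 5., 4., 1., 0.]
w = [1., -1., 2.]
x_shift = [0.] + x[:-1]        # input shifted right by one step

y = corr1d(x, w)
y_shift = corr1d(x_shift, w)
print(y_shift[1:] == y[:-1])   # True: shifted input -> equally shifted output
```

Max pooling then discards the exact position within each window, which is what turns this equivariance into limited invariance.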
12
RGB input?
⚡ easy
Answer: First conv has 3 input channels per filter—depth matches image channels (or more for hyperspectral).
13
Role of batch norm?
📊 medium
Answer: Normalize activations per channel for stable training and higher learning rates—slight regularization effect.
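A sketch of the training-time normalization in NumPy, per channel over the batch and spatial dims; the learned scale-and-shift (gamma/beta) is omitted for brevity, and the tensor shape is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((32, 16, 8, 8)) * 3.0 + 2.0   # (N, C, H, W) batch

# Per-channel statistics over batch and spatial dimensions
mean = x.mean(axis=(0, 2, 3), keepdims=True)
var = x.var(axis=(0, 2, 3), keepdims=True)
x_hat = (x - mean) / np.sqrt(var + 1e-5)              # eps avoids divide-by-zero

print(x_hat.mean(), x_hat.std())  # ~0 and ~1 after normalization
```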
14
Dropout in CNNs?
📊 medium
Answer: More common in FC heads; sometimes spatial dropout drops whole feature maps—less standard than in MLPs.
15
Global average pooling?
📊 medium
Answer: Average each channel to one value—reduces params vs large FC layers before softmax (Network in Network / ResNet style).
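GAP is just a per-channel spatial mean; this tiny NumPy example (made-up 2-channel map) shows an H×W grid collapsing to one scalar per channel:

```python
import numpy as np

x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)  # (C, H, W) feature map
gap = x.mean(axis=(1, 2))   # one value per channel, regardless of H and W
print(gap)                  # [ 7.5 23.5]
```

The classifier head after GAP needs only C × num_classes weights, instead of H × W × C × num_classes for an FC layer on the flattened map.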
16
Classification loss?
⚡ easy
Answer: Cross-entropy with softmax over classes—multi-label uses sigmoid + BCE per class.
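A numerically stable sketch of softmax cross-entropy for a single example (the logit values are made up), using the log-sum-exp form:

```python
import math

def softmax_xent(logits, target):
    # cross-entropy = -log(softmax(logits)[target]) = logsumexp(logits) - logits[target]
    m = max(logits)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]

print(softmax_xent([0.0, 0.0, 0.0], 0))   # log(3) ~ 1.0986: uniform prediction
print(softmax_xent([10.0, 0.0, 0.0], 0))  # near 0: confident correct prediction
```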
17
Typical augmentation?
📊 medium
Answer: Random crop/flip, color jitter, mixup/cutmix—improves generalization and simulates viewpoint/light changes.
18
Transfer learning?
📊 medium
Answer: Initialize backbone from ImageNet pretrain, replace head, fine-tune—standard when labeled data is limited.
19
Estimate complexity?
🔥 hard
Answer: Conv: roughly O(H_out×W_out×C_in×C_out×k²)—depthwise separable reduces this (MobileNet).
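Plugging in numbers shows the depthwise-separable saving; the layer sizes (56×56, 128 channels, 3×3 kernel) are illustrative assumptions in the spirit of MobileNet:

```python
def conv_macs(h, w, c_in, c_out, k):
    # standard conv: a k*k*c_in dot product per output position, per filter
    return h * w * c_in * c_out * k * k

def sep_macs(h, w, c_in, c_out, k):
    # depthwise k x k per channel, then a 1x1 pointwise conv to mix channels
    return h * w * c_in * k * k + h * w * c_in * c_out

std = conv_macs(56, 56, 128, 128, 3)
sep = sep_macs(56, 56, 128, 128, 3)
print(sep / std)  # ~0.119, matching the 1/c_out + 1/k^2 reduction factor
```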
20
CNN vs Vision Transformer?
🔥 hard
Answer: CNN: local inductive bias and efficiency. ViT: global attention, needs more data—hybrids (ConvNeXt, Swin) blend ideas.
CNN Cheat Sheet
Conv: local connectivity + shared weights; stride/padding control output size
Pool: downsamples; adds slight translation invariance
Head: GAP + FC; softmax cross-entropy
💡 Pro tip: weight sharing is the core efficiency advantage over FC layers on raw pixels.