
Multi-Layer Perceptron — 15 Interview Questions

Feedforward fully connected stacks, hidden layers, capacity, parameters, and when MLPs shine or give way to CNNs.


Topics: Feedforward · Hidden layers · Fully connected · Depth & width
1. What is a multi-layer perceptron (MLP)? (Easy)
Answer: An MLP is a feedforward neural network: layers of neurons where each layer is typically fully connected to the next, with non-linear activations between affine transforms. Information flows input → hidden(s) → output without cycles.
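A minimal sketch of the idea, assuming PyTorch is available; the layer sizes (4, 16, 3) are arbitrary placeholders:

```python
import torch
import torch.nn as nn

# A 2-hidden-layer MLP: affine transform -> non-linearity, repeated, then an output layer.
mlp = nn.Sequential(
    nn.Linear(4, 16),   # input -> hidden 1 (affine)
    nn.ReLU(),          # non-linearity
    nn.Linear(16, 16),  # hidden 1 -> hidden 2
    nn.ReLU(),
    nn.Linear(16, 3),   # hidden 2 -> output (e.g. 3-class logits)
)

x = torch.randn(1, 4)   # one input vector
print(mlp(x).shape)     # torch.Size([1, 3]) -- information flows strictly forward
```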
2. What does “feedforward” mean? (Easy)
Answer: Activations are computed in one direction only—from input to output. There are no recurrent edges that feed a layer’s output back into earlier layers in the same forward pass (that would be an RNN or similar).
3. Why do we use hidden layers? (Easy)
Answer: Hidden layers let the network build intermediate representations. Stacking non-linear layers composes functions so the model can approximate non-linear boundaries (e.g. XOR) that a single linear layer cannot.
4. Depth vs width: how do interviewers expect you to compare them? (Medium)
Answer: Depth (more layers) increases compositional power and hierarchical features; it can improve sample efficiency for some tasks but risks optimization issues. Width (more units per layer) increases capacity per layer; very wide shallow nets can also approximate many functions. Trade-offs: data, compute, vanishing gradients, and inductive bias.
5. How many parameters are in a linear layer from d_in to d_out? (Easy)
Answer: Weights: d_in × d_out. Bias: d_out. Total: d_in × d_out + d_out. Mention that this scales quickly for large fully connected layers.
params = d_in × d_out + d_out
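A quick check of the formula, assuming PyTorch; d_in=784 and d_out=256 are just illustrative sizes:

```python
import torch.nn as nn

d_in, d_out = 784, 256
layer = nn.Linear(d_in, d_out)

n_params = sum(p.numel() for p in layer.parameters())
print(n_params)                  # 200960
print(d_in * d_out + d_out)      # 200960 = weights + bias
```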
6. What happens if you remove activations and only stack linear layers? (Easy)
Answer: A composition of linear maps is still linear. The entire deep stack collapses to a single affine transform—no extra expressive power vs one linear layer.
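A numerical illustration of the collapse, assuming NumPy; two stacked linear layers without an activation equal one affine map:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)
x = rng.normal(size=4)

# Two linear layers stacked directly...
two_layer = W2 @ (W1 @ x + b1) + b2

# ...are exactly one affine map with W = W2 W1 and b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True
```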
7. One-hidden-layer MLP: what can it approximate? (Hard)
Answer: With a suitable non-linearity and enough hidden units, a single-hidden-layer MLP can approximate many continuous functions on compact domains (universal approximation theme). In practice depth, data, and optimization matter—not only width.
8. How does a small MLP solve XOR? (Medium)
Answer: A hidden layer can form new features (e.g. AND-like combinations) so the output layer becomes linearly separable in that feature space. Classic example: 2 inputs → small hidden layer with non-linearity → output.
Tip: sketch two hidden units as combining half-spaces; interviewers reward the intuition, not memorizing exact weights.
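One possible hand-built solution, assuming NumPy and a hard-threshold activation; these specific weights are only one of many choices that work:

```python
import numpy as np

step = lambda z: (z > 0).astype(float)  # hard-threshold activation

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hidden layer: h1 behaves like OR(x1, x2), h2 like AND(x1, x2)
W_h = np.array([[1.0, 1.0],    # h1 weights
                [1.0, 1.0]])   # h2 weights
b_h = np.array([-0.5, -1.5])
H = step(X @ W_h.T + b_h)

# Output: OR minus AND gives XOR, which is linearly separable in (h1, h2) space
y = step(H @ np.array([1.0, -1.0]) - 0.5)
print(y)  # [0. 1. 1. 0.]
```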
9. How does a batch dimension change MLP math? (Medium)
Answer: For batch size B, input is B × d_in. Linear layer: Y = XW + b (broadcast bias). Same weights for all batch rows—this is why matrix multiply is efficient on GPUs.
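A shape-level sketch in NumPy; B=32, d_in=10, d_out=5 are arbitrary:

```python
import numpy as np

B, d_in, d_out = 32, 10, 5
X = np.random.randn(B, d_in)        # batch of inputs, one row per example
W = np.random.randn(d_in, d_out)    # the same weights are applied to every row
b = np.random.randn(d_out)

Y = X @ W + b                       # bias broadcasts across the batch dimension
print(Y.shape)                      # (32, 5)
```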
10. When should you prefer a CNN over an MLP for images? (Medium)
Answer: Images have local structure and translation patterns. CNNs use shared local filters—far fewer parameters and better inductive bias. A flat MLP on pixels ignores locality and scales poorly with resolution.
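A back-of-the-envelope comparison for a hypothetical 224×224 RGB input; the layer sizes are chosen only for illustration:

```python
# Flat MLP: flatten all pixels into the first fully connected layer.
d_in = 224 * 224 * 3                      # 150,528 input features
mlp_first_layer = d_in * 512 + 512        # ~77 million parameters

# Conv layer: 64 filters of size 3x3x3, shared across all spatial positions.
conv_first_layer = 64 * (3 * 3 * 3) + 64  # 1,792 parameters

print(mlp_first_layer, conv_first_layer)  # 77070848 vs 1792
```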
11. Why do large MLPs overfit easily? (Medium)
Answer: High parameter count vs data lets the network memorize noise. Mitigate with regularization (L2, dropout), more data, early stopping, or architecture better matched to the problem.
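A sketch of two common mitigations, assuming PyTorch; the dropout rate and weight_decay values are illustrative, not recommendations:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(100, 256), nn.ReLU(),
    nn.Dropout(p=0.5),            # randomly zero hidden units during training
    nn.Linear(256, 10),
)

# weight_decay adds an L2 penalty on the weights during optimization
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```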
12. Why does weight initialization matter in deep MLPs? (Hard)
Answer: Poor scaling can make activations explode or vanish layer-to-layer, giving useless gradients. Schemes like Xavier/He set variance based on fan-in/fan-out to keep signal scale stable at initialization.
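A small experiment, assuming NumPy, showing why fan-in scaling matters: a 20-layer ReLU stack with naive vs He initialization (width 256 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 256, 20
x = rng.normal(size=(1000, d))

for name, std in [("naive (std=1)", 1.0), ("He (sqrt(2/fan_in))", np.sqrt(2.0 / d))]:
    h = x
    for _ in range(depth):
        W = rng.normal(scale=std, size=(d, d))
        h = np.maximum(h @ W, 0.0)          # ReLU
    # Naive init explodes by orders of magnitude; He init keeps the scale near 1.
    print(name, "final activation std:", h.std())
```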
13. What is the typical output layer for multi-class classification? (Easy)
Answer: A linear layer producing logits, followed by softmax (often applied inside the loss for numerical stability). Training uses cross-entropy, typically computed from the logits via log-softmax.
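A sketch of the standard head, assuming PyTorch; CrossEntropyLoss applies log-softmax internally, so the model outputs raw logits:

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 10)               # batch of 8, 10 classes (raw scores)
targets = torch.randint(0, 10, (8,))      # integer class labels

loss = nn.CrossEntropyLoss()(logits, targets)  # log-softmax + NLL in one numerically stable op
probs = torch.softmax(logits, dim=-1)          # only needed if you want probabilities
print(loss.item(), probs.sum(dim=-1))          # probabilities sum to 1 per row
```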
14. MLP for regression: output and loss? (Easy)
Answer: Often a linear output (no squashing) with MSE or Huber loss for real-valued targets. For bounded outputs you might use sigmoid scaling or tanh to a range.
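A minimal regression-head sketch, assuming PyTorch (HuberLoss is available in recent versions); the sizes and random data are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))  # linear output, no squashing

x, y = torch.randn(32, 16), torch.randn(32, 1)
pred = model(x)

mse = nn.MSELoss()(pred, y)
huber = nn.HuberLoss()(pred, y)   # less sensitive to outliers than MSE
print(mse.item(), huber.item())
```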
15. When is an MLP still a reasonable first choice? (Medium)
Answer: Tabular or fixed-length feature vectors without strong spatial or sequential structure, baselines, or as a component inside larger models. For sequences use RNN/Transformer; for grids use CNN.

Quick review checklist

  • Define MLP, feedforward, and hidden layer roles.
  • Count parameters for a linear layer; explain linear-collapse without activations.
  • Contrast depth vs width; XOR + hidden layer story.
  • MLP vs CNN on images; regression vs classification heads.