
Multi-Layer Perceptron — 15 Interview Questions

Feedforward fully connected stacks, hidden layers, capacity, parameters, and when MLPs shine or give way to CNNs.


Topics: Feedforward · Hidden layers · Fully connected · Depth & width
1. What is a multi-layer perceptron (MLP)? (Easy)
Answer: An MLP is a feedforward neural network: layers of neurons where each layer is typically fully connected to the next, with non-linear activations between affine transforms. Information flows input → hidden(s) → output without cycles.
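A minimal sketch of the idea, assuming PyTorch is available; the layer sizes (4, 16, 3) are arbitrary placeholders:

```python
import torch
import torch.nn as nn

# A 2-hidden-layer MLP: affine transform -> non-linearity, repeated, then an output layer.
mlp = nn.Sequential(
    nn.Linear(4, 16),   # input -> hidden 1 (affine)
    nn.ReLU(),          # non-linearity
    nn.Linear(16, 16),  # hidden 1 -> hidden 2
    nn.ReLU(),
    nn.Linear(16, 3),   # hidden 2 -> output (e.g. 3-class logits)
)

x = torch.randn(1, 4)   # one input vector
print(mlp(x).shape)     # torch.Size([1, 3]) -- information flows strictly forward
```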
2. What does “feedforward” mean? (Easy)
Answer: Activations are computed in one direction only—from input to output. There are no recurrent edges that feed a layer’s output back into earlier layers in the same forward pass (that would be an RNN or similar).
3. Why do we use hidden layers? (Easy)
Answer: Hidden layers let the network build intermediate representations. Stacking non-linear layers composes functions so the model can approximate non-linear boundaries (e.g. XOR) that a single linear layer cannot.
4. Depth vs width: how do interviewers expect you to compare them? (Medium)
Answer: Depth (more layers) increases compositional power and hierarchical features; it can improve sample efficiency for some tasks but risks optimization issues. Width (more units per layer) increases capacity per layer; very wide shallow nets can also approximate many functions. Trade-offs: data, compute, vanishing gradients, and inductive bias.
5. How many parameters are in a linear layer from d_in to d_out? (Easy)
Answer: Weights: d_in × d_out. Bias: d_out. Total: d_in × d_out + d_out. Mention that this scales quickly for large fully connected layers.
params = d_in × d_out + d_out
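A quick check of the formula, assuming PyTorch; d_in=784 and d_out=256 are just illustrative sizes:

```python
import torch.nn as nn

d_in, d_out = 784, 256
layer = nn.Linear(d_in, d_out)

n_params = sum(p.numel() for p in layer.parameters())
print(n_params)                  # 200960
print(d_in * d_out + d_out)      # 200960 = weights + bias
```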
6. What happens if you remove activations and only stack linear layers? (Easy)
Answer: A composition of linear maps is still linear. The entire deep stack collapses to a single affine transform—no extra expressive power vs one linear layer.
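A numerical illustration of the collapse, assuming NumPy; two stacked linear layers without an activation equal one affine map:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)
x = rng.normal(size=4)

# Two linear layers stacked directly...
two_layer = W2 @ (W1 @ x + b1) + b2

# ...are exactly one affine map with W = W2 W1 and b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True
```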
7. One-hidden-layer MLP: what can it approximate? (Hard)
Answer: With a suitable non-linearity and enough hidden units, a single-hidden-layer MLP can approximate many continuous functions on compact domains (universal approximation theme). In practice depth, data, and optimization matter—not only width.
8. How does a small MLP solve XOR? (Medium)
Answer: A hidden layer can form new features (e.g. AND-like combinations) so the output layer becomes linearly separable in that feature space. Classic example: 2 inputs → small hidden layer with non-linearity → output.
Tip: sketch two hidden units as combining half-spaces; interviewers reward the intuition, not memorizing exact weights.
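One possible hand-built solution, assuming NumPy and a hard-threshold activation; these specific weights are only one of many choices that work:

```python
import numpy as np

step = lambda z: (z > 0).astype(float)  # hard-threshold activation

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hidden layer: h1 behaves like OR(x1, x2), h2 like AND(x1, x2)
W_h = np.array([[1.0, 1.0],    # h1 weights
                [1.0, 1.0]])   # h2 weights
b_h = np.array([-0.5, -1.5])
H = step(X @ W_h.T + b_h)

# Output: OR minus AND gives XOR, which is linearly separable in (h1, h2) space
y = step(H @ np.array([1.0, -1.0]) - 0.5)
print(y)  # [0. 1. 1. 0.]
```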
9. How does a batch dimension change MLP math? (Medium)
Answer: For batch size B, input is B × d_in. Linear layer: Y = XW + b (broadcast bias). Same weights for all batch rows—this is why matrix multiply is efficient on GPUs.
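A shape-level sketch in NumPy; B=32, d_in=10, d_out=5 are arbitrary:

```python
import numpy as np

B, d_in, d_out = 32, 10, 5
X = np.random.randn(B, d_in)        # batch of inputs, one row per example
W = np.random.randn(d_in, d_out)    # the same weights are applied to every row
b = np.random.randn(d_out)

Y = X @ W + b                       # bias broadcasts across the batch dimension
print(Y.shape)                      # (32, 5)
```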
10. When should you prefer a CNN over an MLP for images? (Medium)
Answer: Images have local structure and translation patterns. CNNs use shared local filters—far fewer parameters and better inductive bias. A flat MLP on pixels ignores locality and scales poorly with resolution.
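A back-of-the-envelope comparison for a hypothetical 224×224 RGB input; the layer sizes are chosen only for illustration:

```python
# Flat MLP: flatten all pixels into the first fully connected layer.
d_in = 224 * 224 * 3                      # 150,528 input features
mlp_first_layer = d_in * 512 + 512        # ~77 million parameters

# Conv layer: 64 filters of size 3x3x3, shared across all spatial positions.
conv_first_layer = 64 * (3 * 3 * 3) + 64  # 1,792 parameters

print(mlp_first_layer, conv_first_layer)  # 77070848 vs 1792
```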
11. Why do large MLPs overfit easily? (Medium)
Answer: High parameter count vs data lets the network memorize noise. Mitigate with regularization (L2, dropout), more data, early stopping, or architecture better matched to the problem.
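A sketch of two common mitigations, assuming PyTorch; the dropout rate and weight_decay values are illustrative, not recommendations:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(100, 256), nn.ReLU(),
    nn.Dropout(p=0.5),            # randomly zero hidden units during training
    nn.Linear(256, 10),
)

# weight_decay adds an L2 penalty on the weights during optimization
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```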
12. Why does weight initialization matter in deep MLPs? (Hard)
Answer: Poor scaling can make activations explode or vanish layer-to-layer, giving useless gradients. Schemes like Xavier/He set variance based on fan-in/fan-out to keep signal scale stable at initialization.
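A small experiment, assuming NumPy, showing why fan-in scaling matters: a 20-layer ReLU stack with naive vs He initialization (width 256 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 256, 20
x = rng.normal(size=(1000, d))

for name, std in [("naive (std=1)", 1.0), ("He (sqrt(2/fan_in))", np.sqrt(2.0 / d))]:
    h = x
    for _ in range(depth):
        W = rng.normal(scale=std, size=(d, d))
        h = np.maximum(h @ W, 0.0)          # ReLU
    # Naive init explodes by orders of magnitude; He init keeps the scale near 1.
    print(name, "final activation std:", h.std())
```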
13. What is the typical output layer for multi-class classification? (Easy)
Answer: A linear layer producing logits, followed by softmax (often applied inside the loss for numerical stability). Training uses cross-entropy, typically computed from the logits via log-softmax.
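A sketch of the standard head, assuming PyTorch; CrossEntropyLoss applies log-softmax internally, so the model outputs raw logits:

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 10)               # batch of 8, 10 classes (raw scores)
targets = torch.randint(0, 10, (8,))      # integer class labels

loss = nn.CrossEntropyLoss()(logits, targets)  # log-softmax + NLL in one numerically stable op
probs = torch.softmax(logits, dim=-1)          # only needed if you want probabilities
print(loss.item(), probs.sum(dim=-1))          # probabilities sum to 1 per row
```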
14. MLP for regression: output and loss? (Easy)
Answer: Often a linear output (no squashing) with MSE or Huber loss for real-valued targets. For bounded outputs you might use sigmoid scaling or tanh to a range.
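A minimal regression-head sketch, assuming PyTorch (HuberLoss is available in recent versions); the sizes and random data are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))  # linear output, no squashing

x, y = torch.randn(32, 16), torch.randn(32, 1)
pred = model(x)

mse = nn.MSELoss()(pred, y)
huber = nn.HuberLoss()(pred, y)   # less sensitive to outliers than MSE
print(mse.item(), huber.item())
```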
15. When is an MLP still a reasonable first choice? (Medium)
Answer: Tabular or fixed-length feature vectors without strong spatial or sequential structure, baselines, or as a component inside larger models. For sequences use RNN/Transformer; for grids use CNN.

Quick review checklist

  • Define MLP, feedforward, and hidden layer roles.
  • Count parameters for a linear layer; explain linear-collapse without activations.
  • Contrast depth vs width; XOR + hidden layer story.
  • MLP vs CNN on images; regression vs classification heads.