Neural Networks: 20 Interview Questions
Master the fundamentals of neural networks — from perceptrons to backpropagation, activation functions, regularization, and modern architectures. Concise answers for FAANG-level interviews.
1
What is a perceptron? How is it different from a neuron in deep learning?
⚡ Easy
Answer: A perceptron is the simplest artificial neuron, introduced by Frank Rosenblatt (1958). It takes inputs, applies weights, sums them, and passes the sum through a step activation function (output 0 or 1). Modern deep-learning neurons use continuous, differentiable activations (ReLU, sigmoid, tanh) and are stacked in multiple layers.
Perceptron: Step function, binary output, linear separator
Modern Neuron: Differentiable, continuous output, stackable
2
Why do we need non-linear activation functions?
⚡ Easy
Answer: Without non-linearity, stacking multiple linear layers collapses into a single linear transformation. Non-linear activations (ReLU, sigmoid, tanh) allow neural networks to approximate any complex, non-linear function (universal approximation theorem).
# Linear composition: W2(W1*x) = (W2*W1)*x → Still linear!
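A quick numerical check of this collapse (an illustrative NumPy sketch; the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # first "layer"
W2 = rng.standard_normal((2, 4))   # second "layer"
x = rng.standard_normal(3)

stacked = W2 @ (W1 @ x)            # two linear layers, no activation
collapsed = (W2 @ W1) @ x          # one equivalent linear layer

assert np.allclose(stacked, collapsed)  # identical: no extra expressive power
```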
3
Compare ReLU, Sigmoid, and Tanh activations. When to use each?
📊 Medium
Answer:
- ReLU (max(0,x)): Default for hidden layers. Fast, sparse, mitigates vanishing gradient. Dead neurons issue.
- Sigmoid (0 to 1): Output layer for binary classification. Prone to vanishing gradient.
- Tanh (-1 to 1): Zero-centered, often used in RNNs/classical nets. Still suffers saturation.
✅ ReLU: Most common for CNNs/Transformers
⚠️ Sigmoid/Tanh: Used in specific gates (LSTM) or binary output
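The three functions side by side, as a minimal sketch (helper names are illustrative):

```python
import numpy as np

def relu(x):    return np.maximum(0.0, x)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))      # [0. 0. 2.] — sparse: exact zeros for negative inputs
print(sigmoid(x))   # values in (0, 1); saturates for large |x|
print(np.tanh(x))   # values in (-1, 1); zero-centered
```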
4
Explain backpropagation in simple terms.
📊 Medium
Answer: Backpropagation computes gradients of the loss with respect to each weight using the chain rule. It propagates error backward from output to input, layer by layer. These gradients are used by optimizers (SGD, Adam) to update weights and minimize loss.
∂L/∂w = ∂L/∂ŷ · ∂ŷ/∂z · ∂z/∂w
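The chain rule above, worked out for a single sigmoid neuron with squared-error loss and verified against a numerical gradient (toy values, assumed for illustration):

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

w, x, y = 0.5, 2.0, 1.0          # weight, input, target (toy values)

# Forward: z = w*x, y_hat = sigmoid(z), L = (y_hat - y)^2
z = w * x
y_hat = sigmoid(z)

# Backward via the chain rule: dL/dw = dL/dy_hat * dy_hat/dz * dz/dw
dL_dyhat = 2.0 * (y_hat - y)
dyhat_dz = y_hat * (1.0 - y_hat)
dz_dw = x
analytic = dL_dyhat * dyhat_dz * dz_dw

# Sanity check with central finite differences
eps = 1e-6
L = lambda w: (sigmoid(w * x) - y) ** 2
numeric = (L(w + eps) - L(w - eps)) / (2 * eps)

assert abs(analytic - numeric) < 1e-6
```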
5
What is vanishing gradient? How do you fix it?
🔥 Hard
Answer: Vanishing gradient occurs when gradients become extremely small in early layers, preventing learning. Causes: deep networks with sigmoid/tanh. Fixes: Use ReLU, residual connections (ResNet), batch normalization, proper weight initialization (Xavier/He), LSTM gates.
✅ Fixes: ReLU, residual connections, BatchNorm
⚠️ Risky: deep sigmoid/tanh stacks
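Why deep sigmoid stacks vanish: sigmoid'(z) peaks at 0.25, and backprop multiplies one such factor per layer. A small sketch of the effect:

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

# Best case for sigmoid: derivative is maximal at z = 0
z = 0.0
d = sigmoid(z) * (1.0 - sigmoid(z))   # 0.25

grad_20_layers = d ** 20              # chain rule multiplies the per-layer factors
print(grad_20_layers)                 # ~9e-13: early layers get almost no signal

# ReLU passes gradient 1 for active units, so the product does not shrink
relu_grad = 1.0 ** 20
```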
6
What is the difference between batch gradient descent, SGD, and mini-batch?
⚡ Easy
Answer:
- Batch GD: Full dataset – accurate but slow, memory heavy.
- SGD: One sample at a time – fast updates, high variance.
- Mini-batch: Subset (e.g., 32, 64) – balance between speed and stability. Most common.
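A minimal mini-batch SGD loop on a least-squares toy problem (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                                     # noise-free targets

w = np.zeros(3)
batch_size, lr = 32, 0.1
for epoch in range(50):
    perm = rng.permutation(len(X))                 # shuffle each epoch
    for i in range(0, len(X), batch_size):
        idx = perm[i:i + batch_size]               # one mini-batch
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w -= lr * grad                             # one iteration = one update

print(w)  # close to true_w
```

Setting `batch_size = len(X)` recovers batch GD; `batch_size = 1` recovers plain SGD.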
7
Explain Dropout. Why does it work?
📊 Medium
Answer: Dropout randomly deactivates each neuron during training with probability p. It prevents co-adaptation of neurons, forces redundant representations, and acts as implicit ensemble learning. At inference all neurons are active; activations are rescaled to compensate (inverted dropout divides by the keep probability 1−p during training, so no change is needed at test time).
model.add(Dropout(0.5)) # each unit dropped with probability 0.5 during training
8
What is Batch Normalization? How does it help?
📊 Medium
Answer: BatchNorm normalizes layer outputs to zero mean and unit variance within each mini-batch. It stabilizes training, allows higher learning rates, reduces internal covariate shift, and acts as a regularizer.
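A sketch of the BatchNorm forward pass at training time (function name and values are illustrative; running statistics for inference are omitted):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                   # per-feature batch mean
    var = x.var(axis=0)                   # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta           # learnable scale and shift

rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.standard_normal((64, 4))   # shifted, scaled activations
out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0))  # ~0 per feature
print(out.std(axis=0))   # ~1 per feature
```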
9
What is weight initialization? Why is it important?
📊 Medium
Answer: Weight initialization sets initial values of weights. Poor init causes vanishing/exploding gradients. Xavier init for tanh/sigmoid, He init for ReLU. Proper init speeds convergence.
Xavier: Var = 1/n_in
He: Var = 2/n_in
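The two variance rules above as a quick sketch (helper names are illustrative):

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    # Xavier: Var = 1 / n_in, suited to tanh/sigmoid
    return rng.standard_normal((n_in, n_out)) * np.sqrt(1.0 / n_in)

def he_init(n_in, n_out, rng):
    # He: Var = 2 / n_in, suited to ReLU (half the units output zero)
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

rng = np.random.default_rng(0)
W = he_init(512, 256, rng)
print(W.var())  # ~2/512 ≈ 0.0039
```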
10
What is the Universal Approximation Theorem?
🔥 Hard
Answer: A feedforward network with a single hidden layer and non-linear activation can approximate any continuous function on a compact domain, given enough neurons. Depth improves parameter efficiency, not just theoretical capacity.
11
What is the difference between epoch, batch, and iteration?
⚡ Easy
Answer:
- Epoch: One full pass of entire training data.
- Batch: Number of samples processed before update.
- Iteration: One weight update on one batch. Iterations per epoch = dataset size ÷ batch size.
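The arithmetic in one short sketch (an MNIST-sized training set is assumed for illustration):

```python
import math

n_samples, batch_size, n_epochs = 60_000, 128, 10

iterations_per_epoch = math.ceil(n_samples / batch_size)   # 469 updates per epoch
total_iterations = n_epochs * iterations_per_epoch         # 4690 updates overall

print(iterations_per_epoch, total_iterations)
```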
12
What is cross-entropy loss? When do you use it?
⚡ Easy
Answer: Cross-entropy measures difference between predicted probability distribution and true labels. Used for classification (binary: binary cross-entropy, multi-class: categorical cross-entropy). Preferred over MSE for classification.
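Categorical cross-entropy in a few lines (function name and probabilities are illustrative):

```python
import numpy as np

def cross_entropy(p_true, p_pred, eps=1e-12):
    """Categorical cross-entropy: -sum(true * log(pred))."""
    return -np.sum(p_true * np.log(p_pred + eps))

y_true = np.array([0.0, 1.0, 0.0])            # one-hot label: class 1
confident = np.array([0.05, 0.90, 0.05])
uncertain = np.array([0.30, 0.40, 0.30])

print(cross_entropy(y_true, confident))  # ~0.105: confident + correct → low loss
print(cross_entropy(y_true, uncertain))  # ~0.916: hedging → higher loss
```

Note how the loss only depends on the probability assigned to the true class, and punishes confident wrong answers heavily.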
13
Explain underfitting and overfitting in neural networks.
⚡ Easy
Answer:
- Underfitting: Model too simple, high bias – fails on training data. Fix: increase capacity, train longer.
- Overfitting: Model memorizes noise, high variance – low train error, high test error. Fix: dropout, regularization, more data.
14
What is the role of the learning rate?
⚡ Easy
Answer: Learning rate controls step size during gradient descent. Too high: overshoot, divergence. Too low: slow convergence, gets stuck. Use learning rate schedules or adaptive optimizers (Adam).
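Both failure modes can be seen on the toy objective f(w) = w², whose gradient is 2w (an illustrative sketch):

```python
# Gradient descent on f(w) = w^2: update is w -= lr * 2w
def descend(lr, w=1.0, steps=20):
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(descend(lr=0.1))    # shrinks by factor 0.8 per step: converges toward 0
print(descend(lr=1.1))    # multiplies by -1.2 per step: oscillates and diverges
```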
15
Compare Adam vs SGD optimizer.
📊 Medium
Answer:
- SGD: Simple, requires manual LR tuning, may need momentum.
- Adam: Adaptive LR + momentum, works well out of the box, less sensitive to hyperparameters. Tends to generalize slightly worse than well-tuned SGD.
16
What is gradient clipping? When is it needed?
📊 Medium
Answer: Gradient clipping caps gradients to a threshold value during backprop. Prevents exploding gradients, common in RNNs and Transformers. Maintains stable training.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
17
What is the difference between a neural network and a deep neural network?
⚡ Easy
Answer: "Neural network" is a broad term. Deep neural network (DNN) typically has more than 2-3 hidden layers. Depth allows hierarchical feature learning. Shallow nets may suffice for simple tasks.
18
What are skip connections? Why are they useful?
🔥 Hard
Answer: Skip connections (ResNet) add input of a layer to its output (F(x) + x). They alleviate vanishing gradient, enable training of very deep networks (>100 layers), and act as gradient superhighways.
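A minimal residual block in NumPy (names and sizes are illustrative). With small inner weights the block starts out close to the identity, which is part of why very deep residual stacks remain trainable:

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = F(x) + x, with F a small two-layer ReLU transform."""
    h = np.maximum(0.0, x @ W1)   # inner ReLU layer
    return h @ W2 + x             # identity shortcut added back

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d)
W1 = 0.01 * rng.standard_normal((d, d))   # tiny weights: F(x) ≈ 0 at init
W2 = 0.01 * rng.standard_normal((d, d))

y = residual_block(x, W1, W2)
print(np.abs(y - x).max())  # small: the block is near-identity at init
```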
19
What is the F1 score? When is it better than accuracy?
📊 Medium
Answer: F1 is the harmonic mean of precision and recall. Better than accuracy for imbalanced datasets. For example, in fraud detection (99.9% negative), a model that always predicts "not fraud" has high accuracy but is useless; F1 reflects performance on the minority class.
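The fraud example in numbers (counts are illustrative):

```python
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 1000 transactions, 10 of them fraud.
# Model A always predicts "not fraud": accuracy 990/1000 = 99.0%, catches nothing.
# Model B catches 8 frauds (tp=8) with 4 false alarms (fp=4), misses 2 (fn=2):
accuracy_b = (8 + 986) / 1000     # 99.4% — barely distinguishable from Model A
print(f1_score(8, 4, 2))          # ~0.727 — reflects real detection skill
```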
20
How do you decide the number of layers and neurons?
🔥 Hard
Answer: No fixed rule. Start with architecture proven for similar tasks. Use validation error: increase capacity until overfitting, then add regularization. Heuristic: more data → deeper/wider. Automated via hyperparameter search (Grid/Random/Bayesian).
Start simple, scale up
Avoid guessing randomly
Quick Recap: Neural Networks Interview
Core Concepts
- Perceptron → Deep Neuron
- Non-linearity = Universal Approx
- Backprop = Chain Rule
- ReLU for hidden, Sigmoid for binary out
Problems & Fixes
- Vanishing Gradient → ReLU, ResNet
- Overfitting → Dropout, BatchNorm
- Exploding Gradient → Clipping
- Slow Convergence → Adam, He init
Verdict: Master these 20 Q&A to ace neural networks interviews.