PyTorch: 20 Interview Questions & Answers
Tensors, autograd, nn.Module, DataLoader, custom layers, GPU training, mixed precision, TorchScript, and deployment. Concise answers for FAANG-level interviews.
Topics: Tensor · Autograd · nn.Module · DataLoader · CUDA · TorchScript
1 What is PyTorch? ⚡ Easy
Answer: PyTorch is an open‑source deep learning framework based on Torch, built around dynamic computation graphs (define‑by‑run). It provides tensor computation with GPU acceleration, automatic differentiation via autograd, and a modular ecosystem (torch.nn, torch.optim, torch.utils.data).
2 PyTorch Tensor vs NumPy array – differences? 📊 Medium
Answer: Both share a similar API, but PyTorch tensors can run on GPUs, support automatic differentiation, and integrate with deep learning ops; NumPy arrays are CPU‑only. Convert via .numpy() and torch.from_numpy() (both share memory with the source array).

```python
import torch, numpy as np
a = np.array([1, 2])
t = torch.from_numpy(a)  # shares memory with a
b = t.numpy()            # shares memory with t
```
3 How does autograd work? 🔥 Hard
Answer: autograd records operations on tensors with requires_grad=True to build a dynamic computation graph. During the backward pass, it traverses the graph in reverse to compute gradients using the chain rule. Gradients accumulate in the .grad attribute. Example: y = x² → dy/dx = 2x, so y.backward() stores 2x in x.grad.
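The x² example above can be sketched in a few lines (variable names are illustrative):

```python
import torch

# requires_grad=True tells autograd to record operations on x
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2      # forward pass builds the graph: y = x^2

y.backward()    # backward pass applies the chain rule: dy/dx = 2x
print(x.grad)   # tensor(6.)
```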
4 What is nn.Module? 📊 Medium
Answer: The base class for all neural network layers/models. It encapsulates parameters (nn.Parameter), the forward method, and submodules, and provides .to(device), .train(), .eval(), and built‑in parameter tracking.
5 nn.Module vs nn.functional? 🔥 Hard
Answer: nn.Module holds state (parameters) and is the recommended way to build layers; nn.functional provides stateless functions (e.g., F.relu, F.cross_entropy). Typically you use nn.Linear (a module) for layers with weights but call F.relu inside forward.
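A minimal sketch of that split, using a hypothetical two‑layer net: the stateful nn.Linear layers are modules, while the stateless F.relu is called in forward.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNet(nn.Module):           # hypothetical example model
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)  # stateful: owns weight and bias
        self.fc2 = nn.Linear(8, 2)

    def forward(self, x):
        x = F.relu(self.fc1(x))     # stateless functional call
        return self.fc2(x)

net = TinyNet()
out = net(torch.randn(3, 4))        # parameters are tracked automatically
print(out.shape)                    # torch.Size([3, 2])
```

Parameter tracking comes for free: net.parameters() yields the weights and biases of both Linear submodules (4·8+8 + 8·2+2 = 58 scalars).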
6 Explain torch.utils.data.Dataset and DataLoader. 📊 Medium
Answer: A Dataset stores samples and their labels (implement __len__ and __getitem__). A DataLoader wraps a Dataset and provides batching, shuffling, and multi‑process loading (num_workers).
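A toy sketch of the pattern (the SquaresDataset name and data are made up for illustration):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):      # hypothetical toy dataset
    def __init__(self, n):
        self.x = torch.arange(n, dtype=torch.float32)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        return self.x[i], self.x[i] ** 2   # (sample, label)

# DataLoader handles batching; the last batch may be smaller
loader = DataLoader(SquaresDataset(10), batch_size=4, shuffle=False)
batches = [(xb, yb) for xb, yb in loader]
print([len(xb) for xb, _ in batches])      # [4, 4, 2]
```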
7 How do you use GPU in PyTorch? ⚡ Easy
Answer: device = torch.device('cuda' if torch.cuda.is_available() else 'cpu'), then model.to(device) and tensor = tensor.to(device). For tensors, .to() returns a new tensor, so always assign the result; model.to(device) moves parameters in place.
8 How does loss.backward() work? 📊 Medium
Answer: It computes gradients of the loss w.r.t. all tensors with requires_grad=True using backpropagation. Gradients are accumulated into .grad, so call optimizer.zero_grad() before each step.
9 What does optimizer.step() do? ⚡ Easy
Answer: Updates parameters using the gradients stored in their .grad fields, according to the optimization algorithm (SGD, Adam, etc.). Called after .backward().
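Putting zero_grad/backward/step together, a minimal training loop on a made‑up regression target (y = 3x):

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.randn(64, 1)
y = 3 * x                   # toy target: y = 3x

for _ in range(200):
    opt.zero_grad()         # clear accumulated gradients
    loss = loss_fn(model(x), y)
    loss.backward()         # compute gradients into .grad
    opt.step()              # apply the update rule

print(model.weight.item())  # close to 3.0
```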
10 How to create a custom autograd Function? 🔥 Hard
Answer: Subclass torch.autograd.Function and implement static forward(ctx, ...) and backward(ctx, grad_output) methods. Use ctx.save_for_backward to cache tensors needed for the backward pass.

```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)   # cache input for the backward pass
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        return grad_out * (x > 0).float()  # gradient is 1 where x > 0
```
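A custom Function is invoked via .apply, not called directly. A quick check of the gradient (restating the class from the answer so the snippet is self‑contained):

```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        return grad_out * (x > 0).float()

x = torch.tensor([-1.0, 2.0], requires_grad=True)
y = MyReLU.apply(x).sum()   # use .apply, never MyReLU()(x)
y.backward()
print(x.grad)               # tensor([0., 1.])
```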
11 How do you implement distributed training? 🔥 Hard
Answer: Use torch.nn.DataParallel (single‑node multi‑GPU, now discouraged) or torch.nn.parallel.DistributedDataParallel (single‑ or multi‑node). DDP is faster: it runs one process per GPU and synchronizes gradients via torch.distributed.
12 What is mixed precision training (AMP)? 📊 Medium
Answer: It uses both float16 and float32 to speed up training and reduce memory. PyTorch provides torch.cuda.amp: autocast() for the forward pass and GradScaler to prevent gradient underflow.
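A sketch of the canonical autocast + GradScaler pattern. To keep it runnable without a GPU, this version is hedged to fall back to bfloat16 autocast on CPU and disables the scaler there (the scaler is a no‑op when enabled=False):

```python
import torch
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = nn.Linear(8, 1).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
# GradScaler only matters for float16 on CUDA; no-op otherwise
scaler = torch.cuda.amp.GradScaler(enabled=(device == 'cuda'))

x = torch.randn(16, 8, device=device)
y = torch.randn(16, 1, device=device)

opt.zero_grad()
# autocast runs eligible ops in lower precision
with torch.autocast(device_type=device,
                    dtype=torch.float16 if device == 'cuda' else torch.bfloat16):
    loss = nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()   # scale loss to avoid fp16 underflow
scaler.step(opt)                # unscales grads, then calls opt.step()
scaler.update()
```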
13 When to use torch.no_grad()? ⚡ Easy
Answer: It disables gradient tracking, which is useful for inference and validation to save memory and computation, and when you need to modify tensors in place without affecting autograd.
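A minimal inference sketch; note that .eval() and no_grad() do different things (layer behavior vs. graph recording), so both are typically used:

```python
import torch

model = torch.nn.Linear(4, 2)
model.eval()                 # switch layers like dropout/BN to eval mode

with torch.no_grad():        # no graph is recorded inside this block
    out = model(torch.randn(5, 4))

print(out.requires_grad)     # False
```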
14 How to save/load a PyTorch model? 📊 Medium
Answer: Save: torch.save(model.state_dict(), 'path.pth'). Load: model.load_state_dict(torch.load('path.pth')) after rebuilding the model. Saving the full model with torch.save(model, ...) also works but is less flexible, since it pickles the class itself.
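A state_dict round trip, using an in‑memory buffer in place of 'path.pth' so the sketch is self‑contained:

```python
import io
import torch

model = torch.nn.Linear(4, 2)

buf = io.BytesIO()               # stands in for 'path.pth'
torch.save(model.state_dict(), buf)
buf.seek(0)

model2 = torch.nn.Linear(4, 2)   # must rebuild the same architecture
model2.load_state_dict(torch.load(buf))
print(torch.equal(model.weight, model2.weight))  # True
```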
15 Explain TorchScript. 🔥 Hard
Answer: A way to serialize PyTorch models for production (runnable from the C++ runtime). Two methods: tracing (torch.jit.trace), which records one concrete execution path, and scripting (torch.jit.script), which compiles the code and preserves data‑dependent control flow.
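A small sketch of why scripting matters: the branch below depends on the input values, so tracing would bake in whichever branch the example input took, while scripting keeps both.

```python
import torch

@torch.jit.script
def threshold(x: torch.Tensor) -> torch.Tensor:
    # data-dependent control flow: preserved by scripting
    if x.sum() > 0:
        return x * 2
    return -x

print(threshold(torch.tensor([1.0, 2.0])))    # tensor([2., 4.])
print(threshold(torch.tensor([-1.0, -2.0])))  # tensor([1., 2.])
```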
16 Dynamic computation graph advantage? 📊 Medium
Answer: The graph is built on the fly per forward pass, which gives easier debugging, variable‑length inputs, and Pythonic control flow. PyTorch has always been dynamic; TensorFlow 1.x used static graphs, and TF2 adopted eager execution.
17 How to do transfer learning? 📊 Medium
Answer: Load a pretrained model (e.g., from torchvision.models), freeze layers (param.requires_grad = False), replace the classifier head, and train only the new layers (optionally fine‑tune the rest later with a lower learning rate).
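A freeze‑and‑replace sketch. To stay self‑contained (no download), a tiny nn.Sequential stands in for the pretrained backbone; in practice you would load e.g. torchvision.models.resnet18(weights=...) and replace its .fc.

```python
import torch.nn as nn

# stand-in for a pretrained backbone (hypothetical; normally torchvision.models)
backbone = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 10))

for param in backbone.parameters():
    param.requires_grad = False      # freeze pretrained weights

backbone[-1] = nn.Linear(16, 3)      # new head for 3 classes (trainable by default)

# only the new head's parameters will receive gradient updates
trainable = [p for p in backbone.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))   # 16*3 + 3 = 51
```

When building the optimizer, pass only the trainable parameters, e.g. torch.optim.Adam(trainable, lr=1e-3).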
18 What is gradient accumulation? 🔥 Hard
Answer: Gradients are accumulated over several mini‑batches before updating the weights, simulating a larger batch size when memory is limited. Call loss.backward() every micro‑batch, and step()/zero_grad() every N steps, dividing the loss by N to keep the gradient scale consistent.
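The pattern as a sketch, with made‑up data and N = 4 micro‑batches per update:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
accum_steps = 4                      # effective batch = 4 micro-batches

opt.zero_grad()
for i in range(8):                   # 8 micro-batches -> 2 optimizer steps
    x, y = torch.randn(2, 4), torch.randn(2, 1)
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()  # scale so grads average over the window
    if (i + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad()
```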
19 How to use TensorBoard with PyTorch? 📊 Medium
Answer: Use torch.utils.tensorboard.SummaryWriter to log scalars, histograms, images, and the model graph. Alternatively, use third‑party tools like Weights & Biases (wandb).
20 Common PyTorch pitfalls? 🔥 Hard
Answer: Forgetting optimizer.zero_grad(), not moving all tensors with .to(device), in‑place ops after gradient computation, mixing NumPy and torch carelessly, and detaching the graph incorrectly.
PyTorch Interview Cheat Sheet
Core
- torch.Tensor – GPU + grad
- autograd – dynamic graph
- nn.Module – parameter container
- optim – SGD, Adam, etc.
Data
- Dataset + DataLoader
- torchvision / torchtext
- Samplers, collate_fn
Production
- TorchScript / JIT
- ONNX export
- Distributed (DDP)
- AMP (mixed precision)
“PyTorch uses define‑by‑run – you write the code as you would in numpy, with automatic differentiation.”