
PyTorch: 20 Interview Questions & Answers

Tensors, autograd, nn.Module, DataLoader, custom layers, GPU training, mixed precision, TorchScript, and deployment. Concise answers for FAANG-level interviews.

1 What is PyTorch? ⚡ Easy
Answer: PyTorch is an open‑source deep learning framework based on Torch, with dynamic computation graphs (define‑by‑run). It provides tensor computation with GPU acceleration, automatic differentiation via autograd, and a modular ecosystem (torch.nn, torch.optim, torch.utils.data).
2 PyTorch Tensor vs NumPy array – differences? 📊 Medium
Answer: Both share similar APIs, but PyTorch tensors can live on GPUs, support automatic differentiation, and integrate with deep learning ops; NumPy arrays are CPU‑only. Convert via .numpy() and torch.from_numpy() — both share memory with the underlying CPU data, so in‑place changes are visible on either side.
import torch, numpy as np
a = np.array([1,2])
t = torch.from_numpy(a)   # shares memory
b = t.numpy()
3 How does autograd work? 🔥 Hard
Answer: autograd records operations on tensors with requires_grad=True to build a dynamic computation graph. During backward pass, it traverses the graph in reverse to compute gradients using the chain rule. Gradients accumulate in the .grad attribute.
y = x² → dy/dx = 2x. y.backward() stores 2x in x.grad.
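The worked example above in a few runnable lines:

```python
import torch

# requires_grad=True tells autograd to record operations on this tensor
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2      # forward pass builds the graph node-by-node
y.backward()    # reverse pass applies the chain rule: dy/dx = 2x
print(x.grad)   # tensor(6.)
```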
4 What is nn.Module? 📊 Medium
Answer: Base class for all neural network layers/models. It encapsulates parameters (nn.Parameter), forward method, and submodules. Provides .to(device), .train(), .eval(), and built‑in parameter tracking.
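A minimal sketch of the subclassing pattern (TinyNet is an illustrative name):

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)  # registered as a submodule automatically

    def forward(self, x):
        return self.fc(x)

model = TinyNet()
# nn.Module tracks parameters of all submodules: 4*2 weights + 2 biases
print(sum(p.numel() for p in model.parameters()))  # 10
```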
5 nn.Module vs nn.functional? 🔥 Hard
Answer: nn.Module contains state (parameters) and is the recommended way to build layers. nn.functional provides stateless functions (e.g., F.relu, F.cross_entropy). Usually you use nn.Linear (module) but call F.relu inside forward.
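A sketch of that typical mix — stateful modules for layers with weights, stateless functional calls for activations:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)  # stateful: owns weight and bias
        self.fc2 = nn.Linear(16, 2)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))  # F.relu has no parameters

out = MLP()(torch.randn(3, 8))
```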
6 Explain torch.utils.data.Dataset and DataLoader. 📊 Medium
Answer: Dataset stores samples and their labels (implement __len__ and __getitem__). DataLoader wraps a Dataset and provides batching, shuffling, and parallel loading via worker processes (num_workers).
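A toy illustration (SquaresDataset is a made-up example dataset):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        # return a (sample, label) pair
        return torch.tensor(float(idx)), torch.tensor(float(idx ** 2))

loader = DataLoader(SquaresDataset(), batch_size=4, shuffle=True)
for xb, yb in loader:
    print(xb.shape)  # batches of up to 4 samples
```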
7 How do you use GPU in PyTorch? ⚡ Easy
Answer: device = torch.device('cuda' if torch.cuda.is_available() else 'cpu'). Then model.to(device) and tensor.to(device). Note that tensor.to(device) returns a new tensor rather than moving it in place, so assign the result (t = t.to(device)); model.to(device) moves the module's parameters in place.
8 How does loss.backward() work? 📊 Medium
Answer: Computes gradients of the loss w.r.t all tensors with requires_grad=True using backpropagation. Gradients are accumulated, so you need optimizer.zero_grad() before each step.
9 What does optimizer.step() do? ⚡ Easy
Answer: Updates parameters using the gradients stored in .grad fields, according to the optimization algorithm (SGD, Adam, etc.). Called after .backward().
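How zero_grad(), backward(), and step() fit together in one training step, sketched with a plain linear model:

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 3), torch.randn(8, 1)

opt.zero_grad()                              # clear accumulated .grad
loss = nn.functional.mse_loss(model(x), y)
loss.backward()                              # fill p.grad for every parameter
opt.step()                                   # p -= lr * p.grad (plain SGD)
```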
10 How to create a custom autograd Function? 🔥 Hard
Answer: Subclass torch.autograd.Function and implement static forward(ctx, ...) and backward(ctx, grad_output). Use ctx.save_for_backward to cache tensors.
class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)           # cache input for the backward pass
        return x.clamp(min=0)
    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        return grad_out * (x > 0).float()  # gradient flows only where x > 0
# Invoke via MyReLU.apply(x), never by calling forward() directly.
11 How do you implement distributed training? 🔥 Hard
Answer: Use torch.nn.parallel.DistributedDataParallel (DDP), which spawns one process per GPU, overlaps gradient synchronization with the backward pass via torch.distributed, and scales to multiple nodes. The older torch.nn.DataParallel (single‑process, single‑node multi‑GPU) is slower and generally discouraged; DDP is preferred even on a single machine.
12 What is mixed precision training (AMP)? 📊 Medium
Answer: Uses both float16 and float32 to speed up training and reduce memory. PyTorch provides torch.cuda.amp: autocast() for forward and GradScaler to prevent underflow.
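A sketch of the standard autocast + GradScaler pattern; here it is written to fall back to CPU (where the scaler becomes a no-op) when CUDA is unavailable:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
use_cuda = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # pass-through on CPU

x = torch.randn(8, 16)
with torch.autocast(device_type="cuda" if use_cuda else "cpu"):
    loss = model(x).pow(2).mean()   # forward runs in reduced precision
scaler.scale(loss).backward()       # scale loss to avoid fp16 underflow
scaler.step(opt)                    # unscales grads, then optimizer.step()
scaler.update()                     # adjust scale factor for next iteration
```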
13 When to use torch.no_grad()? ⚡ Easy
Answer: Disables gradient tracking, useful for inference and validation to save memory/computation. Also used when you need to modify tensors in‑place without affecting autograd.
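Tracking is disabled only inside the context manager:

```python
import torch

x = torch.randn(4, requires_grad=True)
with torch.no_grad():
    y = x * 2   # not recorded: y.requires_grad is False
z = x * 2       # outside the block, tracking resumes: z.requires_grad is True
```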
14 How to save/load a PyTorch model? 📊 Medium
Answer: Save: torch.save(model.state_dict(), 'path.pth'). Load: rebuild the model, then model.load_state_dict(torch.load('path.pth', map_location=device)). Saving the full model object (torch.save(model, ...)) also works but pickles the class definition, which is less portable.
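The round trip in full, using a temporary file for illustration:

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
path = os.path.join(tempfile.mkdtemp(), "model.pth")
torch.save(model.state_dict(), path)   # weights only, portable

restored = nn.Linear(4, 2)             # must rebuild the same architecture
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()                        # switch to inference mode
```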
15 Explain TorchScript. 🔥 Hard
Answer: A way to serialize PyTorch models for production (loadable from the C++ runtime without Python). Two methods: tracing (torch.jit.trace), which runs the model once and records only the ops executed for that example input, and scripting (torch.jit.script), which compiles the Python source and so preserves data‑dependent control flow (if/loops).
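A small sketch of the difference (the functions are illustrative):

```python
import torch

@torch.jit.script
def leaky(x):
    # scripting keeps this data-dependent branch in the compiled graph
    if bool(x.sum() > 0):
        return x
    return x * 0.1

# tracing records only the ops run for this one example input
traced = torch.jit.trace(lambda x: x * 2.0, torch.randn(3))
```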
16 Dynamic computation graph advantage? 📊 Medium
Answer: Graph is built on‑the‑fly per forward pass. Easier debugging, variable length inputs, Pythonic control flow. PyTorch uses dynamic; TensorFlow 1.x used static. TF2 adopted eager execution.
17 How to do transfer learning? 📊 Medium
Answer: Load a pretrained model (torchvision.models), freeze layers (param.requires_grad = False), replace the classifier head, and train only the new layers (or fine‑tune later).
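The pattern in code — shown here with a stand-in backbone so it runs anywhere; in practice you would load e.g. torchvision.models.resnet18 with pretrained weights:

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU())  # stand-in for a pretrained model
for p in backbone.parameters():
    p.requires_grad = False            # freeze pretrained weights

head = nn.Linear(64, 5)                # new task-specific classifier head
model = nn.Sequential(backbone, head)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)  # optimize only the head
```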
18 What is gradient accumulation? 🔥 Hard
Answer: Accumulates gradients over several mini‑batches before updating weights, simulating a larger batch size when memory is limited. Call loss.backward() on every micro‑batch (typically dividing the loss by N so the accumulated gradient matches a true large batch), then optimizer.step() and zero_grad() every N steps.
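A minimal sketch of the loop:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4

opt.zero_grad()
for i in range(8):
    x, y = torch.randn(2, 4), torch.randn(2, 1)
    # scale the loss so the summed gradient matches one big batch
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()                    # gradients add up in .grad
    if (i + 1) % accum_steps == 0:
        opt.step()                     # one update per accum_steps micro-batches
        opt.zero_grad()
```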
19 How to use TensorBoard with PyTorch? 📊 Medium
Answer: Use torch.utils.tensorboard.SummaryWriter to log scalars, histograms, images, and the model graph. Alternatively, use third‑party experiment trackers such as Weights & Biases (wandb).
20 Common PyTorch pitfalls? 🔥 Hard
Answer: Forgetting optimizer.zero_grad(), not calling .to(device) for all tensors, in‑place ops after gradient computation, mixing numpy and torch without care, detaching graph incorrectly.
Pros: debuggable, Pythonic · Cons: steeper CUDA setup

PyTorch Interview Cheat Sheet

Core
  • torch.Tensor – GPU + grad
  • autograd – dynamic graph
  • nn.Module – parameter container
  • optim – SGD, Adam, etc.
Data
  • Dataset + DataLoader
  • torchvision / torchtext
  • Samplers, collate_fn
Production
  • TorchScript / JIT
  • ONNX export
  • Distributed (DDP)
  • AMP (mixed precision)

“PyTorch uses define‑by‑run – you write the code as you would in numpy, with automatic differentiation.”
