Computational Graphs
A computational graph is a directed acyclic graph (DAG) whose nodes are operations (add, multiply, matmul, ReLU, log, …) and whose edges carry tensors (values flowing between ops). Your neural network forward pass is exactly such a graph. Automatic differentiation (“autodiff”) schedules a backward traversal that applies local Jacobian-vector products, which is what we colloquially call backpropagation in deep learning.
Nodes, Edges, and Evaluation Order
Each node knows how to compute its outputs from its inputs (forward) and how to propagate gradients from outputs back to inputs (backward). The graph must be acyclic so there is a clear topological order: you can run forward from inputs to loss, then backward from loss to parameters.
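To make the mechanics concrete, here is a minimal hand-rolled sketch; Node, mul, add, and backward are illustrative names, not any framework’s API. Forward calls build the DAG, and backward visits nodes in reverse topological order, applying each local gradient.
class Node:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value              # forward result
        self.parents = parents          # input nodes (incoming edges)
        self.local_grads = local_grads  # d(output)/d(input), one per parent
        self.grad = 0.0

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def add(a, b):
    return Node(a.value + b.value, (a, b), (1.0, 1.0))

def backward(loss):
    order, seen = [], set()
    def visit(n):                       # depth-first topological sort
        if id(n) not in seen:
            seen.add(id(n))
            for p in n.parents:
                visit(p)
            order.append(n)
    visit(loss)
    loss.grad = 1.0
    for n in reversed(order):           # walk from the loss back to inputs
        for p, g in zip(n.parents, n.local_grads):
            p.grad += n.grad * g        # chain rule, accumulated per parent

w, x = Node(1.0), Node(2.0)
y = add(mul(w, x), Node(3.0))           # y = w*x + 3
loss = mul(y, y)                        # loss = y**2
backward(loss)
print(w.grad)                           # 2*y*x = 2*(1*2+3)*2 = 20.0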
Shared subgraphs (the same tensor feeding multiple consumers) require gradient accumulation: several backward paths add their contributions to the same tensor’s gradient. Frameworks handle this with reference counting or explicit “backward hooks.”
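PyTorch performs this accumulation automatically; a quick check with one tensor feeding two consumers:
import torch
x = torch.tensor(2.0, requires_grad=True)
a = x * 3          # first consumer of x
b = x ** 2         # second consumer of x
(a + b).backward()
print(x.grad)      # tensor(7.) = 3 (from a) + 2*x (from b)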
Dynamic vs Static Graphs
Dynamic (define-by-run) graphs, typical of eager PyTorch, are built as Python executes each line. That makes debugging and control flow (loops, if statements) natural—the graph can change every iteration.
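For example, autograd happily differentiates through data-dependent control flow; the loop below builds a graph of a different depth depending on the input (a small sketch, with dynamic_forward as an illustrative name):
import torch

def dynamic_forward(x, w):
    while x.norm() < 10:    # data-dependent control flow
        x = w * x           # each iteration appends nodes to this run's graph
    return x.sum()

w = torch.tensor(1.5, requires_grad=True)
loss = dynamic_forward(torch.ones(3), w)
loss.backward()             # traverses whatever graph this particular run built
print(w.grad)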
Static (define-then-run) graphs, historically common in TensorFlow 1-style sessions, describe the whole program once, then feed data repeatedly. Modern TensorFlow 2 and JAX blur the line with tracing and compilation (e.g. tf.function, XLA) that specialize a graph for performance while keeping flexible front ends.
Neither style changes the underlying math: both need correct forward and backward rules per primitive op. Static compilation can fuse kernels and optimize memory; dynamic execution favors research velocity.
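On the PyTorch side, torch.compile (assuming PyTorch 2.x) gives a taste of the compiled path; mlp_step is an illustrative name:
import torch

def mlp_step(x, w1, w2):
    return torch.relu(x @ w1) @ w2

fast_step = torch.compile(mlp_step)     # trace and specialize; the backend may fuse kernels
x = torch.randn(8, 16)
w1 = torch.randn(16, 32, requires_grad=True)
w2 = torch.randn(32, 4, requires_grad=True)
fast_step(x, w1, w2).sum().backward()   # same gradients as the eager version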
Memory, Checkpointing, and Recomputation
Backward needs whatever forward saved: typically input tensors to each op (or enough to recompute them). Long sequences (RNNs) and huge vision models motivated gradient checkpointing: do not store every activation; re-run forward segments during backward. You trade compute for memory, which is essential for large-batch or long-context training.
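PyTorch exposes this through torch.utils.checkpoint; a minimal sketch:
import torch
from torch.utils.checkpoint import checkpoint

layers = torch.nn.Sequential(*[torch.nn.Linear(256, 256) for _ in range(8)])
x = torch.randn(32, 256, requires_grad=True)
# Activations inside `layers` are not stored; the segment is re-run during
# backward, trading extra compute for lower peak memory.
y = checkpoint(layers, x, use_reentrant=False)
y.sum().backward()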
Watch out for backward(retain_graph=True): it keeps the graph and its saved activations alive after the backward pass, a common source of accidental memory growth.
Higher-Order and Custom Ops
Hessian-vector products and meta-learning sometimes need derivatives of derivatives. Frameworks can extend autodiff to second order, but cost grows quickly. Custom CUDA kernels or fused ops still participate in the graph if wrapped with autograd rules.
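A double-backward sketch of a Hessian-vector product, computed without ever forming the Hessian:
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (w ** 3).sum()                     # gradient 3*w**2, Hessian diag(6*w)

(g,) = torch.autograd.grad(loss, w, create_graph=True)  # keep graph for second order
v = torch.tensor([1.0, 0.0])
(hvp,) = torch.autograd.grad(g, w, grad_outputs=v)      # H @ v via a second backward through g
print(hvp)   # tensor([6., 0.])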
When you implement a new layer, you provide forward and backward (or use autograd.Function in PyTorch). The graph stays consistent as long as local gradients are correct.
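For instance, a toy op with hand-written rules (SquarePlusOne is an illustrative name):
import torch

class SquarePlusOne(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)   # stash what backward will need
        return x * x + 1

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * 2 * x    # local gradient: d(x**2 + 1)/dx = 2*x

x = torch.tensor(3.0, requires_grad=True)
SquarePlusOne.apply(x).backward()
print(x.grad)                      # tensor(6.)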
Seeing the Idea in PyTorch
Every tensor operation that touches a leaf with requires_grad=True records a grad_fn linking to the graph. Inspecting loss.grad_fn after a forward pass reveals the backward function chain—useful for education, rarely needed day-to-day.
import torch
w = torch.tensor(1.0, requires_grad=True)   # leaf parameter, tracked by autograd
x = torch.tensor(2.0, requires_grad=False)  # plain input, no gradient needed
y = w * x + 3                               # records MulBackward0, then AddBackward0
loss = y ** 2                               # records PowBackward0
print(loss.grad_fn)                 # <PowBackward0 object at ...>
print(loss.grad_fn.next_functions)  # ((<AddBackward0 object at ...>, 0),)
Summary
- Networks are DAGs of differentiable ops; forward evaluates values, backward propagates gradients.
- Reverse-mode autodiff (backprop) is efficient when there are many inputs and one scalar loss.
- Graph style (dynamic vs compiled static) affects tooling and performance, not the chain rule.
- Memory management (checkpointing) is part of scaling real graphs.
Next: network design, and how width, depth, and structure connect to capacity and inductive bias.