Neural Networks Intro Tutorial
Beginner-friendly Python examples

What Are Neural Networks?

A practical introduction to artificial neural networks: how they compute, how they learn from data, and how the pieces—layers, activations, loss, and optimizers—fit into a training loop you can implement in a few lines of code.

At a glance
  • Layers: stacked transformations
  • Activations: nonlinearities
  • Loss: how wrong we are
  • Training: update weights

Introduction: From Programs That Follow Rules to Programs That Learn

A neural network is a parameterized function, usually built from many simple units (artificial neurons) organized in layers. You give it an input (for example, a vector of numbers representing pixels, words, or sensor readings). It produces an output (a class label, a score, a translated sentence, etc.). The “magic” is that the mapping is not hand-coded rule-by-rule; instead, weights and biases are adjusted automatically using data so that the network’s output gets closer to what you want.

In classical programming you write if/else and explicit formulas. In machine learning you choose a model family (here: a network shape), a loss function that measures error on examples, and an optimizer that nudges parameters to reduce that error. Deep learning simply means using neural networks with many layers (deep stacks), which can represent rich patterns—at the cost of needing more data and compute.

When are NNs a good fit?
  • Large labeled datasets (images, speech, text)
  • Complex patterns hard to engineer by hand
  • You can afford training time and tuning
  • Prediction quality matters more than interpretability
When to try simpler models first
  • Very small data or strict explainability needs
  • Strong linear structure (try linear / logistic regression)
  • Baselines like random forests are often still competitive on tabular data
Terminology. “Network,” “model,” and “function approximator” are often used interchangeably. A parameter is a learnable number (typically a weight or bias). Inference means running the forward pass with fixed parameters to compute outputs; training means updating parameters using gradients.

Building Blocks: Neurons, Layers, Weights, Biases

The simplest building block is a linear layer (fully connected / dense): it computes y = Wx + b. Here x is an input vector, W is a weight matrix, b is a bias vector, and y is a vector of pre-activations (logits before a nonlinearity). Stacking linear layers without nonlinearities between them would collapse to a single linear map—so we insert activation functions (e.g. ReLU, sigmoid) between layers to introduce nonlinearity.

Input x → [Linear: W₁, b₁] → σ(·) → [Linear: W₂, b₂] → σ(·) → … → output ŷ
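
Why the nonlinearity matters: composing two linear layers gives (xW₁ + b₁)W₂ + b₂ = x(W₁W₂) + (b₁W₂ + b₂), which is again a single linear map. A minimal sketch that verifies this numerically (the shapes are arbitrary, chosen only for illustration):

NumPy: two linear layers collapse to one
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(1, 4))
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=(1, 2))

# Two linear layers back to back, no activation in between
two_layers = (x @ W1 + b1) @ W2 + b2

# The equivalent single linear layer
W_eq = W1 @ W2            # shape (3, 2)
b_eq = b1 @ W2 + b2       # shape (1, 2)
one_layer = x @ W_eq + b_eq

print(np.allclose(two_layers, one_layer))  # True: same function, no extra power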

Mini example: one neuron by hand

Suppose input x = [2.0, -1.0], weights w = [0.5, 1.0], bias b = 0.25. The pre-activation is the dot product plus bias: z = 0.5×2 + 1.0×(-1) + 0.25 = 0.25. If we use ReLU, a = max(0, z) = 0.25. That single value could feed the next layer or, in a binary classifier, pass through a sigmoid to get a probability.

NumPy: one linear layer + ReLU
import numpy as np

# Input vector (batch size 1, two features)
x = np.array([[2.0, -1.0]])          # shape (1, 2)
W = np.array([[0.5], [1.0]])         # shape (2, 1) — maps 2 inputs → 1 output
b = np.array([[0.25]])               # shape (1, 1)

z = x @ W + b                        # linear: (1,2) @ (2,1) + (1,1) → (1,1)
a = np.maximum(0, z)                 # ReLU
print("pre-activation z:", z)
print("after ReLU a:   ", a)

Activation Functions (Short and Practical)

Activations introduce nonlinearity so the network can approximate curved decision boundaries and complex mappings. Common choices:

  • ReLU max(0, z) — the default for hidden layers in most MLPs and CNNs; cheap to compute, and its gradient is 1 for positive inputs, which helps avoid vanishing gradients.
  • Sigmoid 1/(1+e^{-z}) — squashes to (0,1); classic for binary probabilities at the output.
  • Tanh — squashes to (-1,1); zero-centered unlike sigmoid; the classic choice inside recurrent networks.
  • Softmax — turns a vector of logits into a probability distribution over classes (multi-class output).
ReLU and sigmoid in NumPy
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 0.5, 2.0])
print("ReLU:  ", relu(z))
print("Sigmoid:", sigmoid(z))
Softmax (stable)
import numpy as np

def softmax(logits):
    # subtract max for numerical stability
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
print("probs:", softmax(logits))
print("sum =", softmax(logits).sum())  # 1.0

Forward Pass: From Input to Prediction

The forward pass applies each layer in order: linear transforms, then activations, until you reach the output head. For a classifier, the last layer often produces one logit per class; softmax converts those logits to probabilities. For regression, the last layer may have a single linear output with no activation (or a bounded activation if outputs must stay in a range).

Tiny 2-layer MLP forward (NumPy)
import numpy as np

def relu(z): return np.maximum(0, z)

# Toy dimensions: 3 inputs → 4 hidden → 2 outputs (e.g. 2 classes)
rng = np.random.default_rng(42)
W1 = rng.normal(0, 0.5, (3, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(0, 0.5, (4, 2))
b2 = np.zeros((1, 2))

def forward(x):
    z1 = x @ W1 + b1
    a1 = relu(z1)
    z2 = a1 @ W2 + b2   # logits
    return z2, a1

x = np.array([[1.0, 0.5, -0.5]])   # one example
logits, h = forward(x)
print("hidden activations:", h)
print("output logits:", logits)

Loss, Optimization, and the Training Loop

Training means finding parameters that make the network’s predictions match labels (supervised case). You choose a loss that measures mismatch: e.g. mean squared error for regression, cross-entropy for classification. The optimizer (SGD, Adam, …) uses gradients of the loss with respect to each parameter—computed efficiently via backpropagation (chain rule through the network)—and updates weights in small steps controlled by the learning rate.

In practice you rarely implement backprop by hand: you define the forward computation, call loss.backward() (PyTorch) or run a training step (Keras/TensorFlow), and the framework tracks the computation graph and gradients for you.
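
Still, it helps to watch one parameter learn. Here is gradient descent by hand on a one-parameter least-squares problem; the data point, learning rate, and step count below are made up for illustration:

Gradient descent by hand (one parameter)
# Fit y = w * x to a single data point (x=2, y=6); the exact answer is w = 3
x, y = 2.0, 6.0
w = 0.0          # initial guess
lr = 0.1         # learning rate

for step in range(20):
    pred = w * x
    loss = (pred - y) ** 2        # squared-error loss
    grad = 2 * (pred - y) * x     # d(loss)/dw by the chain rule
    w -= lr * grad                # take a small step downhill

print("learned w:", w)            # approaches 3.0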

Typical supervised training loop

  1. Sample a mini-batch of inputs and labels.
  2. Forward pass → predictions.
  3. Compute loss.
  4. Backward pass → gradients.
  5. Optimizer step → update weights.
  6. Repeat until convergence or budget exhausted; validate on held-out data.
Minimal PyTorch training skeleton (concept)
import torch
import torch.nn as nn

# Toy data: 10 examples, 5 features, 3 classes
X = torch.randn(10, 5)
y = torch.randint(0, 3, (10,))

model = nn.Sequential(
    nn.Linear(5, 16),
    nn.ReLU(),
    nn.Linear(16, 3),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

model.train()
for epoch in range(50):
    optimizer.zero_grad()
    logits = model(X)
    loss = loss_fn(logits, y)
    loss.backward()
    optimizer.step()

model.eval()
with torch.no_grad():
    preds = model(X).argmax(dim=1)
print("final loss:", loss.item())
Overfitting. A big network can memorize training data. Use held-out validation, regularization (dropout, weight decay), more data, or simpler architectures. Early stopping and proper metrics (accuracy, F1, AUC) help you judge generalization—not just training loss.
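
In PyTorch, two of those regularizers are one-line additions to the skeleton above; a sketch (the dropout rate and decay strength below are illustrative, not tuned values):

Dropout and weight decay (PyTorch)
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(5, 16),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes 50% of activations during training
    nn.Linear(16, 3),
)
# weight_decay adds an L2 penalty that shrinks weights at every step
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-4)

model.train()   # dropout active while training
# ... run the training loop from the skeleton above ...
model.eval()    # dropout disabled for evaluation/inference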

Families of Neural Networks

The same training idea extends beyond plain feedforward MLPs:

  • Convolutional Neural Networks (CNNs) — share weights over space; ideal for images and spatial structure.
  • Recurrent networks (RNN, LSTM, GRU) — process sequences (time series, text) with hidden state.
  • Transformers — attention-based; dominant in modern NLP and increasingly in vision.

This series walks from fundamentals toward those architectures step by step.

Where Neural Networks Show Up

Vision

Object detection, segmentation, medical imaging, face recognition, quality inspection.

Language

Translation, summarization, chatbots, search ranking, code assistants.

Speech & audio

Speech recognition, synthesis, music tagging, noise suppression.

Suggested Learning Path (This Track)

After this overview, a natural order is: perceptron → MLP → activations (in depth) → forward propagation → loss functions → gradient descent → backpropagation → then regularization, optimizers, and specialized architectures (CNN, RNN, attention).

Key Takeaways

  • A neural network is a differentiable, layered function trained with data and gradients.
  • Nonlinear activations are essential; stacking linear layers alone is not enough.
  • Training = forward → loss → backward → optimizer step, repeated over mini-batches.
  • Choose depth, width, and architecture based on data, task, and compute—not “bigger is always better.”

Next in series: single artificial neuron and the perceptron rule.