Reinforcement Learning: Learn by Interaction
Reinforcement Learning is the science of decision making: an agent learns to achieve a goal by interacting with an environment, receiving rewards, and improving its policy. Applications range from classical control to mastering Go and robotics.
MDP
(S, A, P, R, γ)
Bellman Eq
Optimality
Deep RL
DQN, PPO, SAC
OpenAI Gym
Environments
The Reinforcement Learning Framework
RL is formalized as a Markov Decision Process (MDP): Agent observes state sₜ, takes action aₜ, receives reward rₜ₊₁, transitions to next state sₜ₊₁. Goal: maximize cumulative discounted reward.
The agent learns to map states to actions to maximize return Gₜ = Σ γᵏ rₜ₊ₖ₊₁.
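As a quick illustration of the return, a minimal sketch computing Gₜ for a short reward sequence (the reward values and γ are chosen arbitrarily for the example):

```python
def discounted_return(rewards, gamma):
    # G_t = sum_k gamma^k * r_{t+k+1}
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Example: three rewards of 1.0 with gamma = 0.5
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

Smaller γ makes the agent myopic (near-term rewards dominate); γ close to 1 values the far future almost as much as the present.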
Bellman Equations & Dynamic Programming
Bellman Expectation Equations
V^π(s) = Σ_a π(a|s) [R(s,a) + γ Σ_{s'} P(s'|s,a) V^π(s')]
Q^π(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) Σ_{a'} π(a'|s') Q^π(s',a')
Recursive decomposition of value.
Bellman Optimality Equations
V^*(s) = max_a [R(s,a) + γ Σ_{s'} P(s'|s,a) V^*(s')]
Q^*(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) max_{a'} Q^*(s',a')
Optimal values satisfy these fixed-point equations.
Policy Iteration
- Evaluate V^π (solve linear system)
- Improve π: greedy wrt V^π
- Repeat until convergence
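The evaluation step has a closed form: V^π = (I − γP^π)⁻¹ R^π. A minimal numpy sketch on a hypothetical two-state MDP (the transition probabilities and rewards are made up for illustration):

```python
import numpy as np

# Hypothetical 2-state MDP under a fixed policy π:
# P_pi[s, s'] = transition probabilities, R_pi[s] = expected reward
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
R_pi = np.array([1.0, 0.0])
gamma = 0.9

# Solve the linear system (I - γ P^π) V = R^π
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V)
```

For large state spaces the exact solve is replaced by iterative sweeps of the Bellman expectation backup.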
Value Iteration
- Initialize V(s)=0
- V(s) ← max_a [R(s,a) + γ Σ_{s'} P(s'|s,a) V(s')]
- Converges to V^*
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    # P[s, a, s2]: transition probability, R[s, a, s2]: reward
    n_states, n_actions = P.shape[0], P.shape[1]
    V = np.zeros(n_states)
    while True:
        delta = 0
        for s in range(n_states):
            v = V[s]
            # Bellman optimality backup
            V[s] = max(
                sum(P[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
                    for s2 in range(n_states))
                for a in range(n_actions)
            )
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # Extract the greedy policy with respect to V
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        policy[s] = np.argmax([
            sum(P[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
                for s2 in range(n_states))
            for a in range(n_actions)
        ])
    return policy, V
Model-Free Learning: Monte Carlo & TD
When dynamics (P,R) are unknown, learn from experience.
Monte Carlo (MC)
Complete episodes, average returns.
V(s) ← V(s) + α [Gₜ - V(s)]
High variance, unbiased.
Temporal Difference (TD(0))
Bootstrap: V(s) ← V(s) + α [r + γV(s') - V(s)]
Lower variance, biased.
TD Error: δ = r + γV(s') - V(s)
TD(λ) / Eligibility Traces
Unify MC and TD. Credit assignment over multiple steps.
V(s) ← V(s) + α δ e(s)
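A minimal sketch of TD(λ) with accumulating eligibility traces on a toy chain (the transitions and hyperparameters are illustrative; state 2 acts as a terminal state with value 0):

```python
import numpy as np

def td_lambda_episode(V, transitions, alpha=0.1, gamma=0.9, lam=0.8):
    # transitions: list of (s, r, s_next) tuples from one episode
    e = np.zeros_like(V)                       # eligibility traces
    for s, r, s_next in transitions:
        delta = r + gamma * V[s_next] - V[s]   # TD error
        e[s] += 1.0                            # accumulating trace
        V += alpha * delta * e                 # credit all recently visited states
        e *= gamma * lam                       # decay traces
    return V

V = np.zeros(3)
V = td_lambda_episode(V, [(0, 0.0, 1), (1, 1.0, 2)])
print(V)  # reward at state 1 also propagates credit back to state 0
```

With λ=0 this reduces to TD(0); with λ=1 (and no bootstrapping at episode end) it approaches Monte Carlo.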
Q-Learning & SARSA
Q-Learning (Off-Policy)
Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') - Q(s,a)]
Learns optimal Q* regardless of behavior policy. Uses max.
Exploration: ε-greedy
SARSA (On-Policy)
Q(s,a) ← Q(s,a) + α [r + γ Q(s',a') - Q(s,a)]
Learns Q for behavior policy. More stable, safer for live systems.
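The difference between the two updates is a single term in the TD target. A minimal side-by-side sketch (the Q-table values are made up; ε-greedy selection is omitted for clarity):

```python
import numpy as np

Q = np.array([[0.0, 1.0],    # toy Q-table: 2 states x 2 actions
              [0.5, 2.0]])
alpha, gamma = 0.1, 0.9
s, a, r, s_next = 0, 0, 1.0, 1

# Q-Learning: bootstrap with the greedy action (off-policy)
q_learning_target = r + gamma * np.max(Q[s_next])

# SARSA: bootstrap with the action a' actually taken (on-policy)
a_next = 0  # e.g. exploration happened to pick action 0
sarsa_target = r + gamma * Q[s_next, a_next]

print(q_learning_target, sarsa_target)  # 2.8 1.45
```

Because SARSA's target reflects exploratory actions, it learns a more conservative policy near dangerous states (the classic cliff-walking example).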
import gymnasium as gym
import numpy as np

env = gym.make('FrozenLake-v1', is_slippery=True)
n_states = env.observation_space.n
n_actions = env.action_space.n

Q = np.zeros((n_states, n_actions))
alpha = 0.1      # learning rate
gamma = 0.99     # discount factor
epsilon = 0.1    # exploration rate
episodes = 10000

for episode in range(episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # ε-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-Learning update; keep bootstrapping on truncation (time limit),
        # zero the bootstrap only on true termination
        best_next = np.max(Q[next_state])
        td_target = reward + gamma * best_next * (1 - terminated)
        td_error = td_target - Q[state, action]
        Q[state, action] += alpha * td_error
        state = next_state

# Evaluate the greedy policy for one episode
state, _ = env.reset()
done = False
total_reward = 0
while not done:
    action = np.argmax(Q[state])
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    total_reward += reward
print(f"Test reward: {total_reward}")
Deep Q-Networks (DQN)
When state space is continuous/high-dimensional, use neural networks as Q-function approximators.
DQN Innovations
- Experience Replay: Store transitions (s,a,r,s') in buffer, sample randomly. Breaks correlation.
- Target Network: Fixed Q_target for TD target. Updated periodically.
- Gradient Clipping: Huber loss for stability.
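A minimal replay buffer sketch using only the standard library (a production version would store tensors and handle device placement):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks temporal correlation
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))  # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for i in range(50):
    buf.push(i, 0, 1.0, i + 1, False)
states, actions, rewards, next_states, dones = buf.sample(8)
print(len(states))  # 8
```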
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )

    def forward(self, x):
        return self.net(x)
# Training step (assumes policy_net, target_net, replay_buffer,
# optimizer, gamma and batch_size are defined at module level)
def optimize_dqn():
    if len(replay_buffer) < batch_size:
        return
    states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)
    # Compute Q(s, a) for the actions actually taken
    q_values = policy_net(states).gather(1, actions)
    # Compute target: r + γ max_a' Q_target(s', a')
    with torch.no_grad():
        next_q_values = target_net(next_states).max(1, keepdim=True)[0]
        targets = rewards + gamma * next_q_values * (1 - dones)
    loss = nn.HuberLoss()(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
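The target network is refreshed either by a hard copy every N steps or by Polyak (soft) averaging. A framework-agnostic numpy sketch of the soft update (τ value is illustrative):

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.005):
    # θ_target ← τ θ_online + (1 - τ) θ_target
    return {name: tau * online_params[name] + (1 - tau) * target_params[name]
            for name in target_params}

target = {'w': np.zeros(3)}
online = {'w': np.ones(3)}
target = soft_update(target, online, tau=0.1)
print(target['w'])  # [0.1 0.1 0.1]
```

Keeping the target network slow-moving stabilizes the TD target and prevents the oscillations seen when bootstrapping from a network that is itself being updated.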
Policy Gradient: REINFORCE
Directly optimize policy π(a|s; θ) using gradient ascent on expected return.
Policy Gradient Theorem
∇J(θ) = E_π [∇log π(a|s; θ) · Q^π(s,a)]
REINFORCE: Monte Carlo estimate of Q^π using Gₜ.
# REINFORCE update (pseudocode): one gradient step per episode
for t in range(episode_len):
    # Return from time t to the end of the episode
    G = sum(gamma**k * r[t + k] for k in range(episode_len - t))
    # Gradient ascent on log-probability, weighted by the return
    loss = -log_prob(a[t], s[t]) * G
    loss.backward()
Advantage: Reduce Variance
Use baseline b(s): ∇log π · (Gₜ - b(s)). Common: state-value V(s).
A(s,a) = Q(s,a) - V(s) = advantage function.
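A quick numerical illustration of why subtracting a baseline reduces variance without changing which actions are reinforced (the returns and baseline values are made up):

```python
import numpy as np

# Hypothetical returns observed from two states, visited alternately:
# one high-value state (~10) and one low-value state (~1)
G = np.array([10.0, 0.5, 9.0, 1.5, 11.0, 1.0])
# State-dependent baseline b(s) ≈ V(s)
b = np.array([10.0, 1.0, 10.0, 1.0, 10.0, 1.0])

advantages = G - b
print(np.var(G), np.var(advantages))  # variance drops sharply
```

The baseline leaves the expected gradient unchanged because E[∇log π · b(s)] = 0, but the advantage signal is far less noisy.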
Actor-Critic Methods
Combine policy-based (actor) and value-based (critic) learning. Actor updates policy in direction suggested by critic.
A2C / A3C
Actor: ∇log π(a|s) * A(s,a)
Critic: TD error δ = r + γV(s') - V(s)
A3C: Asynchronous parallel workers. A2C: synchronous.
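One actor-critic step can be sketched with a tabular softmax policy in numpy (a hypothetical toy illustration of the mechanics, not A2C itself):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

theta = np.zeros(2)          # actor: logits over 2 actions in one state
V = np.array([0.0, 0.0])     # critic: value estimate per state
alpha_actor, alpha_critic, gamma = 0.1, 0.1, 0.9

s, a, r, s_next = 0, 1, 1.0, 1
# Critic update: TD error
delta = r + gamma * V[s_next] - V[s]
V[s] += alpha_critic * delta
# Actor update: ∇log π(a|s) for a softmax policy is one_hot(a) - π
pi = softmax(theta)
grad_log_pi = np.eye(2)[a] - pi
theta += alpha_actor * delta * grad_log_pi

print(theta)  # action 1 became more likely
```

A positive TD error pushes the taken action's logit up; a negative one pushes it down, exactly the "direction suggested by the critic".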
PPO – Proximal Policy Optimization
Clipped surrogate objective prevents too large policy updates.
L^CLIP(θ) = E[min(r(θ) A, clip(r(θ), 1-ε, 1+ε) A)], where r(θ) = π_θ(a|s) / π_θold(a|s)
A common default choice in industry and research labs (e.g., OpenAI, DeepMind).
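The clipping mechanism is easy to inspect directly. A numpy sketch of the per-sample clipped surrogate (ratio and advantage values are made up):

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    # Taking the min makes the objective pessimistic:
    # moving the policy further cannot increase it
    return np.minimum(unclipped, clipped)

print(clipped_surrogate(1.5, 1.0))   # 1.2: positive advantage, ratio capped at 1+ε
print(clipped_surrogate(0.5, -1.0))  # -0.8: negative advantage, ratio floored at 1-ε
```

Once the ratio leaves the [1-ε, 1+ε] band in the profitable direction, the gradient through the surrogate vanishes, which is what keeps updates proximal.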
SAC – Soft Actor-Critic
Maximize reward + entropy → better exploration.
J(π) = Σₜ E[r(sₜ, aₜ) + α H(π(·|sₜ))]
State-of-the-art for continuous control.
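The entropy bonus can be computed directly from the policy distribution. A numpy sketch showing why a uniform policy earns the maximum bonus (the distributions are illustrative):

```python
import numpy as np

def entropy(pi):
    # H(π) = -Σ π(a) log π(a)
    pi = np.asarray(pi)
    return -np.sum(pi * np.log(pi))

uniform = entropy([0.25, 0.25, 0.25, 0.25])   # log(4): maximal exploration
peaked = entropy([0.97, 0.01, 0.01, 0.01])    # near 0: almost deterministic
print(uniform, peaked)
```

The temperature α trades off reward against this bonus; SAC typically tunes α automatically toward a target entropy.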
DDPG / TD3
Deterministic policy gradients for continuous actions. DDPG + twin critics + target policy smoothing = TD3.
Practical RL with Stable-Baselines3
Industry-standard library for RL. Provides tested implementations of PPO, SAC, DQN, etc.
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Create 4 parallel environments
env = make_vec_env('CartPole-v1', n_envs=4)

# Initialize PPO
model = PPO(
    policy='MlpPolicy',
    env=env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    verbose=1
)

# Train
model.learn(total_timesteps=100000)

# Save and load
model.save("ppo_cartpole")
model = PPO.load("ppo_cartpole")

# Evaluate (VecEnv API: reset returns obs only, envs auto-reset on done)
obs = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
Multi-Agent & Advanced RL
MARL
Multiple agents: cooperative, competitive, or mixed.
VDN, QMIX, MADDPG.
Inverse RL
Infer reward function from expert demonstrations.
Hierarchical RL
Options, temporal abstraction.
RL Algorithm Comparison
| Algorithm | Type | Action Space | Policy | Stability | Sample Efficiency |
|---|---|---|---|---|---|
| Q-Learning | Value | Discrete | Off-policy | ⭐⭐ | ⭐⭐ |
| DQN | Value | Discrete | Off-policy | ⭐⭐⭐ | ⭐⭐⭐ |
| REINFORCE | Policy | Both | On-policy | ⭐ | ⭐ |
| A2C/A3C | Actor-Critic | Both | On-policy | ⭐⭐⭐ | ⭐⭐ |
| PPO | Actor-Critic | Both | On-policy | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| SAC | Actor-Critic | Continuous | Off-policy | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| TD3 | Actor-Critic | Continuous | Off-policy | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
RL in the Wild
Games
AlphaGo, OpenAI Five (Dota 2), AlphaStar (StarCraft II)
Robotics
Manipulation, locomotion
Drug Discovery
Molecule generation
Finance
Portfolio optimization