Probability Theory for Data Science & Machine Learning
Probability quantifies uncertainty. Many machine learning algorithms either use probability directly (Naive Bayes, Bayesian models) or are easier to understand from a probabilistic point of view.
Random Variables & Events
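A random variable is a quantity whose value depends on the outcome of a random process. We model it with a probability distribution, which assigns probabilities to its possible values. An event is a set of outcomes, such as "the ad received at least one click". Examples of random variables: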
- Number of clicks on an ad (discrete).
- Height of a person (continuous).
- Class label in classification (categorical).
import numpy as np
# Simulate 10,000 coin flips (0 = tails, 1 = heads)
np.random.seed(42)
flips = np.random.binomial(n=1, p=0.5, size=10_000)
prob_heads = flips.mean()
prob_tails = 1 - prob_heads
print("P(heads) ≈", round(prob_heads, 3))
print("P(tails) ≈", round(prob_tails, 3))
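The same simulation idea extends from a single random variable to an event defined on one. The sketch below is a hypothetical example, not part of the original tutorial: it rolls two fair dice, treats their sum as a random variable, and estimates the probability of the event "the sum equals 7" as the fraction of trials in which it occurs. The names `total`, `is_seven`, and `p_seven_estimate` are illustrative.

```python
import numpy as np

# Hypothetical example: two fair six-sided dice rolled 100,000 times.
rng = np.random.default_rng(42)
n_rolls = 100_000

die1 = rng.integers(1, 7, size=n_rolls)  # upper bound is exclusive, so values are 1..6
die2 = rng.integers(1, 7, size=n_rolls)
total = die1 + die2                      # random variable: sum of the two dice

is_seven = total == 7                    # event: the sum equals 7
p_seven_estimate = is_seven.mean()       # relative frequency of the event
p_seven_exact = 6 / 36                   # 6 of the 36 equally likely outcomes sum to 7

print("Estimated P(sum = 7):", round(p_seven_estimate, 3))
print("Exact P(sum = 7):    ", round(p_seven_exact, 3))
```

With this many trials the estimate should land close to the exact value of 1/6, though the precise figure depends on the random seed.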
Common Probability Distributions
Some distributions appear again and again in Data Science:
- Bernoulli / Binomial: binary outcomes and counts.
- Normal (Gaussian): continuous, “bell‑shaped” data.
- Poisson: counts of events in a fixed interval (e.g., events per minute).
- Exponential: waiting time between consecutive events that occur at a constant average rate.
import numpy as np
np.random.seed(0)
# Normal distribution
normal_data = np.random.normal(loc=0, scale=1, size=1000)
# Binomial distribution
binom_data = np.random.binomial(n=10, p=0.3, size=1000)
# Poisson distribution
poisson_data = np.random.poisson(lam=3, size=1000)
print("Normal mean/std:", round(normal_data.mean(), 3), round(normal_data.std(), 3))
print("Binomial mean:", round(binom_data.mean(), 3))
print("Poisson mean:", round(poisson_data.mean(), 3))
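Sampling is one way to work with these distributions. Another is to query their probabilities directly. The sketch below uses `scipy.stats` for that and also covers the exponential distribution from the list above. The binomial, Poisson, and normal parameters mirror the simulation, while the exponential `rate` of 3 events per unit time is an assumed value chosen for illustration.

```python
import numpy as np
from scipy import stats

# P(X = 3) for a Binomial(n=10, p=0.3) variable
p_binom_3 = stats.binom.pmf(3, n=10, p=0.3)

# P(X = 0) for a Poisson variable with mean 3
p_pois_0 = stats.poisson.pmf(0, mu=3)

# P(Z <= 1.96) for a standard normal variable
p_norm = stats.norm.cdf(1.96, loc=0, scale=1)

# Exponential waiting time: `rate` is an assumed value (3 events per unit time).
# SciPy and NumPy parameterize the exponential by scale = 1 / rate.
rate = 3.0
p_wait = stats.expon.cdf(0.5, scale=1 / rate)  # P(T <= 0.5)

# Sample waiting times and compare the sample mean with the theoretical mean 1 / rate
rng = np.random.default_rng(0)
expon_data = rng.exponential(scale=1 / rate, size=100_000)

print("Binomial P(X = 3):", round(p_binom_3, 4))
print("Poisson P(X = 0):", round(p_pois_0, 4))
print("Normal P(Z <= 1.96):", round(p_norm, 4))
print("Exponential P(T <= 0.5):", round(p_wait, 4))
print("Exponential sample mean:", round(expon_data.mean(), 3), "theory:", round(1 / rate, 3))
```

Note the parameterization: a rate of 3 events per unit time corresponds to `scale=1/3`, which is an easy detail to get backwards.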
Conditional Probability & Bayes' Theorem
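Conditional probability is the probability of an event given that another event has occurred, written \(P(A \mid B)\) and defined as \(P(A \cap B) / P(B)\) when \(P(B) > 0\). Bayes' theorem expresses the posterior \(P(A \mid B)\) in terms of the prior \(P(A)\) and the likelihood \(P(B \mid A)\):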
\[ P(A \mid B) = \frac{P(B \mid A) \; P(A)}{P(B)} \]
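The denominator \(P(B)\) is often not known directly. Assuming \(A\) and its complement \(A^c\) cover all possibilities, the law of total probability expands it in terms of quantities that are usually available:
\[ P(B) = P(B \mid A) \; P(A) + P(B \mid A^c) \; P(A^c), \qquad P(A^c) = 1 - P(A) \]
Substituting this into Bayes' theorem gives a form that needs only the prior and the two conditional probabilities:
\[ P(A \mid B) = \frac{P(B \mid A) \; P(A)}{P(B \mid A) \; P(A) + P(B \mid A^c) \; P(A^c)} \]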
# Simple Bayes theorem example in code
P_disease = 0.01  # 1% have the disease (prior)
P_positive_given_disease = 0.99  # sensitivity (true positive rate)
P_positive_given_healthy = 0.05  # false positive rate
P_healthy = 1 - P_disease
# Total probability of a positive test (law of total probability)
P_positive = (P_positive_given_disease * P_disease +
              P_positive_given_healthy * P_healthy)
# Posterior: P(disease | positive)
P_disease_given_positive = (P_positive_given_disease * P_disease) / P_positive
print("P(positive):", round(P_positive, 3))
print("P(disease | positive):", round(P_disease_given_positive, 3))
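The posterior of roughly 1 in 6 can feel surprisingly low for a test that is 99% sensitive. One way to check it is by simulation. The sketch below is an illustrative addition: it draws a hypothetical population of one million people with the same three probabilities, simulates each person's test result, and measures the fraction of positive tests that belong to people who actually have the disease. The names `has_disease`, `tests_positive`, and `posterior_estimate` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people = 1_000_000  # hypothetical population size

# Same probabilities as in the Bayes example
P_disease = 0.01
P_positive_given_disease = 0.99
P_positive_given_healthy = 0.05

# Step 1: decide who has the disease
has_disease = rng.random(n_people) < P_disease

# Step 2: each person tests positive with a probability that depends on their status
p_positive_each = np.where(has_disease,
                           P_positive_given_disease,
                           P_positive_given_healthy)
tests_positive = rng.random(n_people) < p_positive_each

# Step 3: among positive tests, what fraction truly have the disease?
posterior_estimate = has_disease[tests_positive].mean()

print("Simulated P(positive):", round(tests_positive.mean(), 3))
print("Simulated P(disease | positive):", round(posterior_estimate, 3))
```

The simulated value should sit close to the analytic posterior. Most positive results come from the much larger healthy group, which is why the posterior stays low even though the test is accurate.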