Descriptive Statistics for Exploratory Data Analysis (EDA)

Descriptive statistics summarize a dataset in a few numbers. They are the first tools to reach for in any data science project, because they show you what the data looks like before you model it.

Measures of Central Tendency

Central tendency measures tell you where the “center” of the data lies.

Mean: arithmetic average, sensitive to outliers.
Median: middle value, robust to outliers.
Mode: most frequent value.

import numpy as np
from scipy import stats

data = np.array([10, 12, 13, 13, 14, 100])  # 100 is an outlier

mean = data.mean()
median = np.median(data)
# keepdims=True (SciPy 1.9+) returns arrays, so [0] extracts the mode as a scalar
mode = stats.mode(data, keepdims=True).mode[0]

print("Data:", data)
print("Mean  :", mean)
print("Median:", median)
print("Mode  :", mode)
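
To make the outlier sensitivity concrete, here is a small sketch that recomputes the mean and median on the same values with and without the outlier. The helper name center is made up for this illustration and is not part of NumPy.

```python
import numpy as np

# Same values as above; the filtered copy drops the outlier (100)
with_outlier = np.array([10, 12, 13, 13, 14, 100])
without_outlier = with_outlier[with_outlier < 100]

def center(values):
    """Hypothetical helper: return (mean, median) of a 1-D array as floats."""
    return float(np.mean(values)), float(np.median(values))

mean_with, median_with = center(with_outlier)
mean_without, median_without = center(without_outlier)

print("With outlier    -> mean:", mean_with, "median:", median_with)
print("Without outlier -> mean:", mean_without, "median:", median_without)
```

A single extreme value more than doubles the mean here, while the median does not move at all.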

Measures of Spread (Dispersion)

Spread tells you how variable your data is. Two datasets can have the same mean with very different spreads.

Range: max − min.
Variance: average squared deviation from the mean.
Standard Deviation: square root of the variance, expressed in the same units as the data.
Percentiles & IQR: robust spread measures (IQR = Q3 − Q1).

import numpy as np

data = np.array([10, 12, 13, 13, 14, 100])

data_min, data_max = data.min(), data.max()
data_range = data_max - data_min
variance = np.var(data, ddof=1)          # sample variance
std_dev = np.std(data, ddof=1)           # sample standard deviation
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

print("Range      :", data_range)
print("Variance   :", round(variance, 2))
print("Std Dev    :", round(std_dev, 2))
print("Q1, Q3, IQR:", q1, q3, iqr)
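
The point that two datasets can share a mean yet differ widely in spread can be checked with a short sketch. The two arrays below are invented for illustration; both average 50.

```python
import numpy as np

# Two hypothetical datasets with the same mean (50) but different spread
tight = np.array([48, 49, 50, 51, 52])
wide = np.array([10, 30, 50, 70, 90])

# ddof=1 gives the sample standard deviation, matching the example above
tight_std = float(np.std(tight, ddof=1))
wide_std = float(np.std(wide, ddof=1))

print("tight -> mean:", tight.mean(), "std:", round(tight_std, 2))
print("wide  -> mean:", wide.mean(), "std:", round(wide_std, 2))
```

The means are identical, so only a spread measure such as the standard deviation reveals how different the two datasets are.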

Quick Summary with pandas.describe()

In real projects you rarely compute each statistic by hand. Instead, pandas.DataFrame.describe() reports the count, mean, standard deviation, minimum, quartiles, and maximum of every numeric column in one call.

import pandas as pd

df = pd.DataFrame({
    "age": [23, 25, 31, 40, 29, 37, 45],
    "salary": [35000, 42000, 50000, 70000, 48000, 65000, 90000]
})

print(df.describe())
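
Because describe() returns a DataFrame, individual statistics can be looked up by row and column label. The sketch below pulls a few values for the age column and cross-checks them against NumPy. It assumes pandas' default behavior, where std is the sample standard deviation (ddof=1) and quartiles use linear interpolation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [23, 25, 31, 40, 29, 37, 45],
    "salary": [35000, 42000, 50000, 70000, 48000, 65000, 90000],
})

summary = df.describe()

# Look up single statistics by row label and column name
age_std = summary.loc["std", "age"]
age_q1 = summary.loc["25%", "age"]
age_q3 = summary.loc["75%", "age"]
age_iqr = age_q3 - age_q1

# Cross-check against the NumPy functions used earlier
ages = df["age"].to_numpy()
std_matches = bool(np.isclose(age_std, np.std(ages, ddof=1)))
q1_matches = bool(np.isclose(age_q1, np.percentile(ages, 25)))

print("std matches NumPy (ddof=1):", std_matches)
print("Q1 matches NumPy          :", q1_matches)
print("IQR of age                :", age_iqr)
```

describe() does not report the IQR directly, but it follows from the 25% and 75% rows as shown.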

