
Cleaning Real‑World Data Before Modeling

Real datasets are messy: missing values, outliers, duplicates and inconsistent formats. Data cleaning turns raw data into something reliable enough for analysis and machine learning.

Handling Missing Values

Missing values can bias your analysis if not handled properly. Common strategies include removing rows, filling with simple statistics (mean/median/mode), or using model‑based imputations.

The right strategy depends on why the data is missing (Missing Completely At Random, Missing At Random, or Missing Not At Random). Dropping rows is usually safe only when a small fraction of values is missing completely at random; when missingness has structure, techniques like KNN or model‑based imputation are preferred because they use the remaining columns to estimate each missing value.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, np.nan, 30, 35, np.nan],
    "salary": [50000, 52000, np.nan, 60000, 58000]
})

print("Missing per column:")
print(df.isna().sum())

# Simple imputations: median is robust to outliers,
# mean suits roughly symmetric columns
df["age"] = df["age"].fillna(df["age"].median())
df["salary"] = df["salary"].fillna(df["salary"].mean())

print("After imputation:")
print(df)
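When missingness has structure, a model‑based approach can do better than a single column statistic. As a minimal sketch (assuming scikit‑learn is installed), KNNImputer fills each gap using the nearest rows measured on the columns that are present:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age": [25, np.nan, 30, 35, np.nan],
    "salary": [50000, 52000, np.nan, 60000, 58000]
})

# Each missing value becomes the mean of its 2 nearest neighbors,
# with distances computed on the non-missing columns.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```

Because distances are scale‑sensitive, columns on very different scales (like age and salary here) are usually standardized before KNN imputation.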

Detecting & Treating Outliers

Outliers can heavily influence means and linear models. Use the Interquartile Range (IQR) or z‑scores to detect unusual points, then decide whether to cap, transform or remove them based on domain knowledge.

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [30, 32, 35, 36, 37, 500]})

q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag points outside the 1.5 * IQR fences
outliers = df[(df["income"] < lower) | (df["income"] > upper)]
print("Outliers:\n", outliers)
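If removing rows would lose too much data, the same fences can instead cap extreme values in place (winsorizing). A minimal sketch using the IQR bounds from above:

```python
import pandas as pd

df = pd.DataFrame({"income": [30, 32, 35, 36, 37, 500]})

q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap (winsorize) instead of dropping: extreme values are pulled
# back to the IQR fences, and the row count is preserved.
df["income_capped"] = df["income"].clip(lower=lower, upper=upper)
print(df)
```

Capping keeps every observation but limits the leverage any single point has on means and linear models.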

Removing Duplicates & Fixing Formats

Duplicate rows and inconsistent formats (dates, categories, casing) are another common source of bad data. Pandas provides helpers for both.

df = pd.DataFrame({
    "email": ["a@example.com", "B@Example.com", "a@example.com"],
    "signup_date": ["2024-01-01", "01/02/2024", "2024-01-01"]
})

# Normalize email casing
df["email"] = df["email"].str.lower()

# Parse dates; unparseable values become NaT instead of raising
df["signup_date"] = pd.to_datetime(df["signup_date"], dayfirst=False, errors="coerce")

# Drop duplicate emails, keeping the first occurrence
df = df.drop_duplicates(subset=["email"])
print(df)
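Free‑text category columns often need the same normalization as the emails above. A small sketch (the plan column and its labels are made‑up examples) collapsing variants of the same label:

```python
import pandas as pd

df = pd.DataFrame({"plan": [" Pro", "pro", "FREE ", "Free", "pro "]})

# Strip stray whitespace and unify casing so equivalent labels
# collapse to a single canonical category.
df["plan"] = df["plan"].str.strip().str.lower()
print(df["plan"].value_counts())
```

After normalization, value_counts() reveals the true number of distinct categories, which matters for later encoding steps.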