
Dimensionality Reduction & PCA

Learn how to reduce the number of features while keeping most of the information, using Principal Component Analysis (PCA) in Python.

Why Dimensionality Reduction?

  • High-dimensional data can be noisy and hard to visualize.
  • Too many features can lead to overfitting.
  • Reducing dimensions can speed up training and improve generalization (see the sketch after this list).
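
For example, PCA is often placed as a preprocessing step in front of a classifier. The sketch below is a minimal illustration, not a benchmark: the synthetic dataset from make_classification and the logistic-regression classifier are arbitrary choices for this demo, and the exact numbers will vary.

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 500 samples, 100 features, only 10 of them informative
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same classifier, with and without a PCA step in front
plain = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
reduced = make_pipeline(StandardScaler(), PCA(n_components=10),
                        LogisticRegression(max_iter=1000))

for name, model in [("all 100 features", plain), ("10 components", reduced)]:
    model.fit(X_train, y_train)
    print(name, "accuracy:", model.score(X_test, y_test))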

Principal Component Analysis (PCA)

PCA finds new axes (principal components) that capture the maximum variance in the data. The components are ordered by how much variance they explain, so we can keep just the first few and still retain most of the information (a NumPy sketch after the list below shows the idea).

  • Unsupervised technique (does not use labels).
  • Transforms original features into orthogonal components.
  • Often used for visualization (2D/3D) and noise reduction.
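
Concretely, the principal components are the eigenvectors of the mean-centered data's covariance matrix, sorted by eigenvalue, i.e. by how much variance each direction captures. Here is a minimal NumPy sketch of that idea on random data; scikit-learn's PCA computes the same result internally via an SVD.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))           # 200 samples, 4 features

# 1. Center the data (PCA works on mean-centered features)
Xc = X - X.mean(axis=0)

# 2. Covariance matrix and its eigendecomposition
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: covariance is symmetric

# 3. Sort components by descending variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Fraction of total variance each component explains
print("explained variance ratio:", eigvals / eigvals.sum())

# 4. Project onto the top 2 components (same as PCA(n_components=2))
X_2d = Xc @ eigvecs[:, :2]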

Example: PCA on Iris Dataset

Reduce to 2D and Plot
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load Iris data (4 features)
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

# Reduce to 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print("Explained variance ratio:", pca.explained_variance_ratio_)

# Plot in 2D
plt.figure(figsize=(8, 6))
colors = ["navy", "turquoise", "darkorange"]

for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(
        X_pca[y == i, 0],
        X_pca[y == i, 1],
        color=color,
        alpha=0.7,
        label=target_name
    )

plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("PCA of Iris Dataset")
plt.legend()
plt.show()
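
The example above applies PCA to the raw measurements, which is reasonable here because all four Iris features are in centimeters. In general, PCA is sensitive to feature scale, so standardizing first is the safer default. scikit-learn also accepts a fraction instead of an integer for n_components: PCA(n_components=0.95) keeps however many components are needed to explain 95% of the variance. A short sketch combining both ideas:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# PCA is scale-sensitive, so standardize the features first
X_scaled = StandardScaler().fit_transform(X)

# n_components between 0 and 1 = fraction of variance to keep
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print("Components kept:", pca.n_components_)
print("Total variance explained:", pca.explained_variance_ratio_.sum())

Passing a variance fraction shifts the decision from "how many components?" to "how much information am I willing to lose?", which is often the more natural question.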