Feature Engineering Basics
Learn how to transform raw data into better features using scaling, encoding, and simple feature creation, with hands-on Python examples.
What is Feature Engineering?
Feature engineering is the process of transforming raw data into meaningful inputs that help machine learning models perform better. Well-chosen features often improve results more than switching to a more complex algorithm. Typical steps include:
- Handle missing values and inconsistent formats.
- Scale numeric features to similar ranges.
- Encode categorical variables into numbers.
- Create new features from existing ones.
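The first step in the list above, handling missing values and inconsistent formats, can be sketched as follows. This is a minimal illustration: the `raw` DataFrame and its values are hypothetical, and median imputation is only one of several reasonable choices.

```python
import pandas as pd

# Hypothetical raw data with a missing age and inconsistent city spellings
raw = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],
    "city": [" delhi", "Mumbai ", "DELHI", "mumbai"],
})

clean = raw.copy()

# Fill the missing age with the median of the observed ages
clean["age"] = clean["age"].fillna(clean["age"].median())

# Normalize text: strip surrounding whitespace and use one capitalization
clean["city"] = clean["city"].str.strip().str.title()

print(clean)
```

After cleaning, the four city spellings collapse into two categories, which keeps a later encoding step from creating a separate column for each spelling.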
Scaling Numeric Features
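Many algorithms (e.g., regularized linear models, KNN, neural networks) work better when features are on similar scales. Otherwise a feature with a large range, such as salary, can dominate distance calculations and gradient updates over a feature with a small range, such as age.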
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df = pd.DataFrame({
    "age": [18, 25, 40, 60],
    "salary": [20000, 35000, 60000, 120000]
})
print("Original data:")
print(df)

# Standardization: mean=0, std=1
std_scaler = StandardScaler()
df_std = pd.DataFrame(
    std_scaler.fit_transform(df),
    columns=df.columns
)
print("\nStandardized features:")
print(df_std.round(2))

# Min-max scaling: range [0, 1]
mm_scaler = MinMaxScaler()
df_mm = pd.DataFrame(
    mm_scaler.fit_transform(df),
    columns=df.columns
)
print("\nMin-max scaled features:")
print(df_mm.round(2))
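In a real project the scaler is usually fitted on the training data only and then reused on new data, so that statistics from the test set do not leak into training. The sketch below assumes a hypothetical split of the four rows above into three training rows and one test row, and also checks the result against the standardization formula z = (x - mean) / std.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical split: the first three rows act as training data, the last as test data
X_train = np.array([[18, 20000], [25, 35000], [40, 60000]], dtype=float)
X_test = np.array([[60, 120000]], dtype=float)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

# The same result by hand: z = (x - mean) / std, using the population std (ddof=0)
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(X_train_scaled.round(2))
print(X_test_scaled.round(2))
print(np.allclose(X_test_scaled, manual))
```

Because the test row lies outside the training range, its standardized values are well above zero. That is expected: the scaler describes the training distribution, not the new data.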
Encoding Categorical Variables
Most machine learning models require numeric input, so text categories must be converted to numbers first. One-hot encoding does this by creating one 0/1 column per category.
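import pandas as pd

df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Chennai"],
    "gender": ["M", "F", "F", "M"]
})
print("Original:")
print(df)

# One-hot encode categorical variables.
# drop_first=True drops the first category of each column, since it is
# implied when all remaining indicator columns are 0.
# dtype=int gives 0/1 columns (recent pandas versions default to True/False).
df_encoded = pd.get_dummies(
    df, columns=["city", "gender"], drop_first=True, dtype=int
)
print("\nEncoded:")
print(df_encoded)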
Creating New Features
Sometimes you can combine or transform existing columns to capture useful patterns.
Salary per Year of Experience
import pandas as pd

df = pd.DataFrame({
    "years_experience": [1, 3, 5, 10],
    "salary": [30000, 45000, 70000, 120000]
})

# Create a new feature: salary earned per year of experience
df["salary_per_year"] = df["salary"] / df["years_experience"]
print(df)
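The ratio above breaks down when `years_experience` is 0, because the division then yields infinity. A minimal sketch of one way to guard against that, together with a log transform as a second example of a derived feature, is shown below. The data is hypothetical, and treating the undefined ratio as missing is an assumption; another fill strategy may suit a given dataset better.

```python
import numpy as np
import pandas as pd

# Hypothetical data that includes a person with no experience yet
df = pd.DataFrame({
    "years_experience": [0, 3, 5, 10],
    "salary": [25000, 45000, 70000, 120000],
})

# Replace 0 with NaN so the division yields a missing value instead of infinity
safe_years = df["years_experience"].replace(0, np.nan)
df["salary_per_year"] = df["salary"] / safe_years

# Log transform compresses the long right tail of salary; log1p computes log(1 + x)
df["log_salary"] = np.log1p(df["salary"])

print(df)
```

The missing ratio can then be handled with the same imputation techniques used for any other missing value.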