Core Machine Learning Concepts & Terminology

Before diving into algorithms, it's critical to understand the common language of Machine Learning: datasets, features, labels, loss, overfitting, generalization, and more.

Datasets, Features & Labels

A typical ML dataset can be represented as a matrix \(X\) (rows = samples, columns = features) and a target vector \(y\) (labels).

Sample / Instance: a single row of data (e.g., one customer, one transaction).
Feature: an input variable (age, income, number of clicks).
Label / Target: what we want to predict (price, churn yes/no).
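
As a minimal sketch of the \(X\) / \(y\) layout described above, the snippet below builds a tiny tabular dataset with plain Python lists. The feature names and all values are invented purely for illustration; in practice this data would usually live in an array or data-frame library.

```python
# Toy tabular dataset: each row of X is one sample (here, one customer).
# Feature names and values are made up for illustration only.
feature_names = ["age", "income", "num_clicks"]

X = [
    [25, 40_000, 12],
    [47, 85_000, 3],
    [33, 52_000, 7],
    [51, 91_000, 1],
]
y = [0, 1, 0, 1]  # one label per sample: 1 = churned, 0 = stayed

n_samples = len(X)       # number of rows
n_features = len(X[0])   # number of columns

# Pair the first sample's feature values with their names.
first_sample = dict(zip(feature_names, X[0]))
print(n_samples, n_features)
print(first_sample, "->", y[0])
```

Row `i` of `X` and entry `i` of `y` always describe the same sample, which is why the two must have the same length.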


Training, Validation & Test Sets

We split the dataset to estimate how well our model will perform on unseen data:

Training set: used to fit the model parameters.
Validation set: used for model selection and hyperparameter tuning.
Test set: used once at the end to report final performance.

Rule of thumb: never use your test data to make modeling decisions. Doing so leaks information about the test set into the model, so the reported metrics become optimistically biased and unreliable.
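
One way to sketch the three-way split is to shuffle the sample indices once and slice them into disjoint groups. The helper name `train_val_test_split` and the 60/20/20 proportions are assumptions chosen for this example, not a fixed rule.

```python
import random

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle indices once, then slice them into three disjoint sets."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility

    n_test = int(len(indices) * test_frac)
    n_val = int(len(indices) * val_frac)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]

    def take(idx):
        return [X[i] for i in idx], [y[i] for i in idx]

    return take(train_idx), take(val_idx), take(test_idx)

# Toy data: 10 samples with a single feature each.
X = [[i] for i in range(10)]
y = [i % 2 for i in range(10)]

(X_train, y_train), (X_val, y_val), (X_test, y_test) = train_val_test_split(X, y)
print(len(X_train), len(X_val), len(X_test))  # 6 2 2
```

Because the split is done on shuffled indices, no sample can end up in more than one set. The test portion is then set aside until the final evaluation.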

Overfitting vs Underfitting

Models must balance fit and simplicity:

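Underfitting: the model is too simple to capture the pattern (high bias); both training and validation error are high.
Overfitting: the model memorizes noise in the training data (high variance); training error is low but validation error is noticeably higher.
Just right: low training error and low validation error.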
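
The three regimes can be illustrated with a small experiment. The sketch below assumes a made-up ground truth, \(y = 2x\) plus Gaussian noise, and compares three stand-in models: a constant mean predictor (too simple), a least-squares line (matches the assumed pattern), and a 1-nearest-neighbour lookup that memorizes every training point. These model choices are illustrative only.

```python
import random

rng = random.Random(0)

def make_data(xs):
    # Assumed ground truth for this toy example: y = 2x + Gaussian noise.
    return [(x, 2.0 * x + rng.gauss(0.0, 1.0)) for x in xs]

train = make_data([0.5 * i for i in range(20)])
val = make_data([0.5 * i + 0.25 for i in range(20)])

def fit_mean(data):
    # Ignores x entirely: always predicts the average training label.
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y

def fit_line(data):
    # Ordinary least squares for a single feature.
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxy = sum((x - mx) * (y - my) for x, y in data)
    sxx = sum((x - mx) ** 2 for x, _ in data)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def fit_1nn(data):
    # Memorizes the training set: returns the label of the closest training x.
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

results = {}
for name, fit in [("mean", fit_mean), ("line", fit_line), ("1-NN", fit_1nn)]:
    model = fit(train)
    results[name] = (mse(model, train), mse(model, val))
    print(f"{name:>5}: train MSE = {results[name][0]:.2f}, "
          f"val MSE = {results[name][1]:.2f}")
```

The mean predictor has high error on both sets, while the 1-nearest-neighbour model reaches exactly zero training error because each training point is its own nearest neighbour. Its validation error is necessarily higher than that, since it reproduces the noise it memorized. The line typically has similar, low error on both sets.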

Loss Functions & Evaluation Metrics

During training, we minimize a loss function. After training, we report metrics that are easier to interpret.

Regression: MSE, RMSE, MAE, \(R^2\).
Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC.
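
To make these metrics concrete, here is a small sketch that computes most of them from their standard definitions. The predictions and labels are made-up toy values. ROC-AUC is left out because it needs predicted scores rather than hard class labels.

```python
import math

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_t = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)             # residual sum of squares
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)  # total sum of squares
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae,
            "R2": 1.0 - ss_res / ss_tot}

def classification_metrics(y_true, y_pred):
    # Binary labels, with 1 as the positive class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

reg = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
clf = classification_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
print(reg)  # MSE 0.5, MAE 0.5, RMSE about 0.707, R2 about 0.9
print(clf)  # accuracy, precision, recall and F1 are all 0.75 here
```

In the classification example there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, which is why all four metrics happen to coincide at 0.75.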
