ML Practice Exercises
Hands-on Machine Learning Exercises

Use these exercises to reinforce your understanding of ML theory, algorithms and implementation details.

Topic‑1: ML Basics & Theory

  • Define supervised, unsupervised and reinforcement learning. For each, give two real‑world examples.
  • Explain the bias‑variance trade‑off. For three model families (linear regression, decision trees, deep networks), describe where each typically sits on the spectrum.
  • List at least five common sources of data leakage in ML projects and propose a mitigation for each.
  • For classification, compare Logistic Regression, k‑NN and Decision Trees in terms of interpretability, training speed and robustness to noise.

Topic‑2: Regression

  • On a housing dataset, split the data into training and validation sets. Train Linear Regression, Ridge and Lasso; compare RMSE and discuss which coefficients Ridge shrinks and which features Lasso removes entirely.
  • Implement gradient descent for univariate Linear Regression in NumPy and verify that the solution matches the closed‑form solution.
  • Create polynomial features (degree 2, 3) for a synthetic 1D dataset and show how training/validation error changes with degree and regularization.
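
For the gradient‑descent exercise, one possible NumPy sketch of the univariate case (the synthetic data, learning rate and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 1, 100)   # true slope 3, intercept 2

# Closed-form solution via the normal equations
X = np.column_stack([np.ones_like(x), x])
theta = np.linalg.solve(X.T @ X, X.T @ y)    # [intercept, slope]

# Gradient descent on mean squared error
w, b, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    pred = w * x + b
    w -= lr * 2 * np.mean((pred - y) * x)    # dMSE/dw
    b -= lr * 2 * np.mean(pred - y)          # dMSE/db
```

If the two solutions disagree, the learning rate is usually the culprit: too large and the loop diverges, too small and it has not converged yet.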

Topic‑3: Classification & Metrics

  • Write a function that, given y_true and y_pred, computes accuracy, precision, recall, F1‑score and confusion matrix (without using sklearn.metrics).
  • On the Titanic or a similar dataset, train at least three classifiers (Logistic Regression, Random Forest, SVM) and compare ROC‑AUC and PR‑AUC.
  • Plot ROC and Precision‑Recall curves for an imbalanced dataset and explain when PR‑AUC is more informative than ROC‑AUC.
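
For the metrics-from-scratch exercise, here is one possible shape for the function (binary 0/1 labels assumed; the example arrays are invented for illustration):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and confusion matrix, without sklearn."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    confusion = [[tn, fp], [fn, tp]]  # rows: true class, columns: predicted class
    return accuracy, precision, recall, f1, confusion

acc, prec, rec, f1, cm = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                                        [1, 0, 1, 0, 0, 1, 1, 0])
```

Cross-check your version against sklearn.metrics on a few hand-worked cases before trusting it.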

Topic‑4: Data Preprocessing & Feature Engineering

  • Given a mixed‑type tabular dataset, design a preprocessing pipeline that imputes missing values, scales numeric features and encodes categoricals. Implement it with ColumnTransformer + Pipeline.
  • Create at least five new domain‑inspired features for the House Price dataset and show how they impact model performance.
  • Demonstrate the effect of feature scaling on k‑NN and SVM by training with and without scaling and comparing results.
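
The preprocessing-pipeline exercise can be sketched like this (a tiny made-up DataFrame; the column names and the final classifier are placeholders for your own dataset and model):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age":    [25, np.nan, 47, 51, 36, 29],
    "income": [40_000, 52_000, np.nan, 88_000, 61_000, 45_000],
    "city":   ["A", "B", "A", np.nan, "B", "A"],
    "bought": [0, 0, 1, 1, 1, 0],
})

numeric, categorical = ["age", "income"], ["city"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(df.drop(columns="bought"), df["bought"])
preds = model.predict(df.drop(columns="bought"))
```

Keeping imputation and scaling inside the Pipeline matters: fitting them outside the cross-validation split is one of the leakage sources from Topic‑1.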

Topic‑5: Time Series

  • Take a univariate time series (e.g., daily sales). Create lag and rolling‑window features and train a tree‑based regressor for one‑step‑ahead forecasting using proper time‑based splits.
  • Implement a naive, seasonal naive and simple moving‑average forecast and compare them as baselines against your ML model.
  • Perform a train/validation backtest with a rolling window (e.g., 3 folds) and compute MAE/RMSE for each fold.
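
A small NumPy sketch of the baseline comparison (the synthetic weekly-seasonal series, window sizes and split point are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(120)
# trend + weekly seasonality + noise, standing in for e.g. daily sales
series = 10 + 0.05 * t + 3 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.5, len(t))

def mae(a, b):
    return float(np.mean(np.abs(a - b)))

# one-step-ahead baselines, evaluated on the last 30 points only
test_idx = np.arange(90, 120)
actual = series[test_idx]
mae_naive = mae(actual, series[test_idx - 1])        # repeat yesterday
mae_seasonal = mae(actual, series[test_idx - 7])     # repeat same weekday last week
mae_ma = mae(actual, np.array([series[i - 3:i].mean() for i in test_idx]))
```

On strongly seasonal data the seasonal naive baseline usually beats the plain naive one; your ML model has to beat all three to justify its complexity.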

Topic‑6: NLP

  • Build a simple spam/ham SMS classifier using bag‑of‑words + Naive Bayes; then upgrade to TF‑IDF and compare metrics.
  • Given a small text corpus, experiment with different tokenization strategies (word, subword, character) and discuss pros/cons.
  • Use a pre‑trained transformer (e.g., BERT via Hugging Face) and fine‑tune it for sentiment analysis on a small dataset; measure improvement over classical models.
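
The spam/ham exercise in miniature (the eight inline messages are invented stand-ins for a real SMS dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "win a free prize now", "free cash claim your reward",
    "urgent winner claim prize", "cheap meds free offer",
    "see you at lunch tomorrow", "meeting moved to 3pm",
    "can you review my report", "dinner at our place tonight",
]
labels = ["spam"] * 4 + ["ham"] * 4

# bag-of-words counts vs TF-IDF weights, same Naive Bayes classifier
bow = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)
tfidf = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(texts, labels)

pred_bow = bow.predict(["claim your free prize"])[0]
pred_tfidf = tfidf.predict(["claim your free prize"])[0]
```

On a corpus this tiny both variants agree; the TF-IDF upgrade only starts paying off once frequent-but-uninformative words dominate the counts.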

Topic‑7: Neural Networks & Deep Learning

  • Implement a fully‑connected neural network for MNIST using a deep learning framework of your choice; experiment with different activations and regularization (dropout, weight decay).
  • Plot training and validation loss curves; identify and fix overfitting using early stopping and data augmentation.
  • Re‑implement forward and backward passes for a simple 2‑layer network in pure NumPy to solidify your understanding of backpropagation.
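
For the pure-NumPy backprop exercise, here is a 2-layer network trained on XOR (hidden size, learning rate and iteration count are arbitrary choices, not requirements):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros((1, 8))
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

losses, lr = [], 0.5
for _ in range(5000):
    # forward pass: tanh hidden layer, sigmoid output, cross-entropy loss
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    losses.append(float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))))
    # backward pass: for sigmoid + cross-entropy, dL/dz2 = (p - y) / N
    dz2 = (p - y) / len(X)
    dW2, db2 = h.T @ dz2, dz2.sum(axis=0, keepdims=True)
    dz1 = (dz2 @ W2.T) * (1 - h ** 2)          # tanh'(z1) = 1 - tanh(z1)^2
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0, keepdims=True)
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2
```

A useful sanity check: verify each analytic gradient against a finite-difference estimate before trusting the training loop.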

Topic‑8: Pandas, NumPy & Scikit‑Learn

  • Using NumPy only, implement standardization and min‑max scaling functions and verify against sklearn’s StandardScaler and MinMaxScaler.
  • With pandas, load a raw CSV, perform exploratory analysis (missing values, distributions, correlations) and summarize key data quality issues.
  • Build a complete sklearn Pipeline (preprocessing + model), wrap it in GridSearchCV and report the best configuration and scores.
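
The scaling-verification exercise reduces to a few lines (the random matrix is illustrative; note that StandardScaler uses the population standard deviation, i.e. ddof=0, which NumPy's std matches by default):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def standardize(X):
    """Zero mean, unit variance per column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max_scale(X):
    """Rescale each column to [0, 1]."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

rng = np.random.default_rng(42)
X = rng.normal(5, 2, (50, 3))

ours_std = standardize(X)
ours_mm = min_max_scale(X)
ref_std = StandardScaler().fit_transform(X)
ref_mm = MinMaxScaler().fit_transform(X)
```

Both pairs should agree to floating-point precision; if they do not, the ddof convention is the first thing to check.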

Topic‑9: Mini Projects & MLOps

  • Build a small REST API (FastAPI or Flask) that serves predictions from a trained model, including basic input validation and logging.
  • Create a notebook that benchmarks multiple models on the same dataset with clear visualizations and a short written report of conclusions.
  • Take an existing Kaggle notebook, refactor it into reusable functions/modules, and add at least two improvements (better features, tuning, or evaluation).
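
Before reaching for FastAPI or Flask, it can help to write the framework-agnostic core that a route handler would wrap. A sketch assuming a two-feature model (the stand-in model, validation rules and response shapes are all illustrative):

```python
import logging

import numpy as np
from sklearn.linear_model import LogisticRegression

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model_api")

# stand-in for a real trained model loaded from disk
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X, y)
N_FEATURES = 2

def validate(payload):
    """Basic input validation a route handler would run before predicting."""
    if not isinstance(payload, dict) or "features" not in payload:
        raise ValueError("payload must be a dict with a 'features' key")
    feats = payload["features"]
    if len(feats) != N_FEATURES or not all(isinstance(v, (int, float)) for v in feats):
        raise ValueError(f"'features' must be a list of {N_FEATURES} numbers")
    return np.asarray(feats, dtype=float).reshape(1, -1)

def predict_endpoint(payload):
    """Body of a POST /predict handler: validate, predict, log, respond."""
    try:
        features = validate(payload)
    except ValueError as exc:
        logger.warning("rejected request: %s", exc)
        return {"error": str(exc)}, 422
    pred = int(model.predict(features)[0])
    logger.info("prediction=%s for features=%s", pred, payload["features"])
    return {"prediction": pred}, 200
```

A FastAPI or Flask route then only has to parse the JSON body, call predict_endpoint, and return its result with the matching status code.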