ML Practice Exercises
Hands-on Machine Learning Exercises

Use these exercises to reinforce your understanding of ML theory, algorithms and implementation details.

Topic‑1: ML Basics & Theory

  • Define supervised, unsupervised and reinforcement learning. For each, give two real‑world examples.
  • Explain the bias‑variance trade‑off. For three model families (linear regression, decision trees, deep networks), describe where each typically sits on the spectrum.
  • List at least five common sources of data leakage in ML projects and propose a mitigation for each.
  • For classification, compare Logistic Regression, k‑NN and Decision Trees in terms of interpretability, training speed and robustness to noise.

Topic‑2: Regression

  • On a housing dataset, split the data into training and validation sets. Train Linear Regression, Ridge and Lasso; compare RMSE and discuss which coefficients Ridge shrinks and which features Lasso removes entirely.
  • Implement gradient descent for univariate Linear Regression in NumPy and verify that the solution matches the closed‑form solution.
  • Create polynomial features (degree 2, 3) for a synthetic 1D dataset and show how training/validation error changes with degree and regularization.
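
For the gradient‑descent exercise, one possible NumPy sketch of the univariate case (the synthetic data, learning rate and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 1, 100)   # true slope 3, intercept 2

# Closed-form solution via the normal equations
X = np.column_stack([np.ones_like(x), x])
theta = np.linalg.solve(X.T @ X, X.T @ y)    # [intercept, slope]

# Gradient descent on mean squared error
w, b, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    pred = w * x + b
    w -= lr * 2 * np.mean((pred - y) * x)    # dMSE/dw
    b -= lr * 2 * np.mean(pred - y)          # dMSE/db
```

If the two solutions disagree, the learning rate is usually the culprit: too large and the loop diverges, too small and it has not converged yet.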

Topic‑3: Classification & Metrics

  • Write a function that, given y_true and y_pred, computes accuracy, precision, recall, F1‑score and confusion matrix (without using sklearn.metrics).
  • On the Titanic or a similar dataset, train at least three classifiers (Logistic Regression, Random Forest, SVM) and compare ROC‑AUC and PR‑AUC.
  • Plot ROC and Precision‑Recall curves for an imbalanced dataset and explain when PR‑AUC is more informative than ROC‑AUC.
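
For the metrics-from-scratch exercise, here is one possible shape for the function (binary 0/1 labels assumed; the example arrays are invented for illustration):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and confusion matrix, without sklearn."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    confusion = [[tn, fp], [fn, tp]]  # rows: true class, columns: predicted class
    return accuracy, precision, recall, f1, confusion

acc, prec, rec, f1, cm = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                                        [1, 0, 1, 0, 0, 1, 1, 0])
```

Cross-check your version against sklearn.metrics on a few hand-worked cases before trusting it.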

Topic‑4: Data Preprocessing & Feature Engineering

  • Given a mixed‑type tabular dataset, design a preprocessing pipeline that imputes missing values, scales numeric features and encodes categoricals. Implement it with ColumnTransformer + Pipeline.
  • Create at least five new domain‑inspired features for the House Price dataset and show how they impact model performance.
  • Demonstrate the effect of feature scaling on k‑NN and SVM by training with and without scaling and comparing results.
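
The preprocessing-pipeline exercise can be sketched like this (a tiny made-up DataFrame; the column names and the final classifier are placeholders for your own dataset and model):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age":    [25, np.nan, 47, 51, 36, 29],
    "income": [40_000, 52_000, np.nan, 88_000, 61_000, 45_000],
    "city":   ["A", "B", "A", np.nan, "B", "A"],
    "bought": [0, 0, 1, 1, 1, 0],
})

numeric, categorical = ["age", "income"], ["city"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(df.drop(columns="bought"), df["bought"])
preds = model.predict(df.drop(columns="bought"))
```

Keeping imputation and scaling inside the Pipeline matters: fitting them outside the cross-validation split is one of the leakage sources from Topic‑1.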

Topic‑5: Time Series

  • Take a univariate time series (e.g., daily sales). Create lag and rolling‑window features and train a tree‑based regressor for one‑step‑ahead forecasting using proper time‑based splits.
  • Implement a naive, seasonal naive and simple moving‑average forecast and compare them as baselines against your ML model.
  • Perform a train/validation backtest with a rolling window (e.g., 3 folds) and compute MAE/RMSE for each fold.
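
A small NumPy sketch of the baseline comparison (the synthetic weekly-seasonal series, window sizes and split point are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(120)
# trend + weekly seasonality + noise, standing in for e.g. daily sales
series = 10 + 0.05 * t + 3 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.5, len(t))

def mae(a, b):
    return float(np.mean(np.abs(a - b)))

# one-step-ahead baselines, evaluated on the last 30 points only
test_idx = np.arange(90, 120)
actual = series[test_idx]
mae_naive = mae(actual, series[test_idx - 1])        # repeat yesterday
mae_seasonal = mae(actual, series[test_idx - 7])     # repeat same weekday last week
mae_ma = mae(actual, np.array([series[i - 3:i].mean() for i in test_idx]))
```

On strongly seasonal data the seasonal naive baseline usually beats the plain naive one; your ML model has to beat all three to justify its complexity.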

Topic‑6: NLP

  • Build a simple spam/ham SMS classifier using bag‑of‑words + Naive Bayes; then upgrade to TF‑IDF and compare metrics.
  • Given a small text corpus, experiment with different tokenization strategies (word, subword, character) and discuss pros/cons.
  • Use a pre‑trained transformer (e.g., BERT via Hugging Face) and fine‑tune it for sentiment analysis on a small dataset; measure improvement over classical models.
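
The spam/ham exercise in miniature (the eight inline messages are invented stand-ins for a real SMS dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "win a free prize now", "free cash claim your reward",
    "urgent winner claim prize", "cheap meds free offer",
    "see you at lunch tomorrow", "meeting moved to 3pm",
    "can you review my report", "dinner at our place tonight",
]
labels = ["spam"] * 4 + ["ham"] * 4

# bag-of-words counts vs TF-IDF weights, same Naive Bayes classifier
bow = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)
tfidf = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(texts, labels)

pred_bow = bow.predict(["claim your free prize"])[0]
pred_tfidf = tfidf.predict(["claim your free prize"])[0]
```

On a corpus this tiny both variants agree; the TF-IDF upgrade only starts paying off once frequent-but-uninformative words dominate the counts.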

Topic‑7: Neural Networks & Deep Learning

  • Implement a fully‑connected neural network for MNIST using a deep learning framework of your choice; experiment with different activations and regularization (dropout, weight decay).
  • Plot training and validation loss curves; identify and fix overfitting using early stopping and data augmentation.
  • Re‑implement forward and backward passes for a simple 2‑layer network in pure NumPy to solidify your understanding of backpropagation.
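
For the pure-NumPy backprop exercise, here is a 2-layer network trained on XOR (hidden size, learning rate and iteration count are arbitrary choices, not requirements):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros((1, 8))
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

losses, lr = [], 0.5
for _ in range(5000):
    # forward pass: tanh hidden layer, sigmoid output, cross-entropy loss
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    losses.append(float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))))
    # backward pass: for sigmoid + cross-entropy, dL/dz2 = (p - y) / N
    dz2 = (p - y) / len(X)
    dW2, db2 = h.T @ dz2, dz2.sum(axis=0, keepdims=True)
    dz1 = (dz2 @ W2.T) * (1 - h ** 2)          # tanh'(z1) = 1 - tanh(z1)^2
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0, keepdims=True)
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2
```

A useful sanity check: verify each analytic gradient against a finite-difference estimate before trusting the training loop.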

Topic‑8: Pandas, NumPy & Scikit‑Learn

  • Using NumPy only, implement standardization and min‑max scaling functions and verify against sklearn’s StandardScaler and MinMaxScaler.
  • With pandas, load a raw CSV, perform exploratory analysis (missing values, distributions, correlations) and summarize key data quality issues.
  • Build a complete sklearn Pipeline (preprocessing + model), wrap it in GridSearchCV and report the best configuration and scores.
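
The scaling-verification exercise reduces to a few lines (the random matrix is illustrative; note that StandardScaler uses the population standard deviation, i.e. ddof=0, which NumPy's std matches by default):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def standardize(X):
    """Zero mean, unit variance per column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max_scale(X):
    """Rescale each column to [0, 1]."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

rng = np.random.default_rng(42)
X = rng.normal(5, 2, (50, 3))

ours_std = standardize(X)
ours_mm = min_max_scale(X)
ref_std = StandardScaler().fit_transform(X)
ref_mm = MinMaxScaler().fit_transform(X)
```

Both pairs should agree to floating-point precision; if they do not, the ddof convention is the first thing to check.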

Topic‑9: Mini Projects & MLOps

  • Build a small REST API (FastAPI or Flask) that serves predictions from a trained model, including basic input validation and logging.
  • Create a notebook that benchmarks multiple models on the same dataset with clear visualizations and a short written report of conclusions.
  • Take an existing Kaggle notebook, refactor it into reusable functions/modules, and add at least two improvements (better features, tuning, or evaluation).
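
Before reaching for FastAPI or Flask, it can help to write the framework-agnostic core that a route handler would wrap. A sketch assuming a two-feature model (the stand-in model, validation rules and response shapes are all illustrative):

```python
import logging

import numpy as np
from sklearn.linear_model import LogisticRegression

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model_api")

# stand-in for a real trained model loaded from disk
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X, y)
N_FEATURES = 2

def validate(payload):
    """Basic input validation a route handler would run before predicting."""
    if not isinstance(payload, dict) or "features" not in payload:
        raise ValueError("payload must be a dict with a 'features' key")
    feats = payload["features"]
    if len(feats) != N_FEATURES or not all(isinstance(v, (int, float)) for v in feats):
        raise ValueError(f"'features' must be a list of {N_FEATURES} numbers")
    return np.asarray(feats, dtype=float).reshape(1, -1)

def predict_endpoint(payload):
    """Body of a POST /predict handler: validate, predict, log, respond."""
    try:
        features = validate(payload)
    except ValueError as exc:
        logger.warning("rejected request: %s", exc)
        return {"error": str(exc)}, 422
    pred = int(model.predict(features)[0])
    logger.info("prediction=%s for features=%s", pred, payload["features"])
    return {"prediction": pred}, 200
```

A FastAPI or Flask route then only has to parse the JSON body, call predict_endpoint, and return its result with the matching status code.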