Data Science Lifecycle (CRISP‑DM & Modern Practice)
A successful Data Science project is not only about building a model. It follows a structured lifecycle from understanding the business problem to deploying and monitoring the solution in production.
What is the Data Science Lifecycle?
The Data Science lifecycle describes the standard steps a data project goes through. The most famous framework is CRISP‑DM (Cross‑Industry Standard Process for Data Mining), which is still widely used and has been extended with MLOps concepts.
- Provides a repeatable, reliable process for data projects.
- Helps teams avoid “model in a notebook” that never reaches production.
- Makes communication with business stakeholders more structured.
CRISP‑DM in 6 Phases
1. Business Understanding
Translate vague ideas into clear, measurable objectives.
- Define the problem: “Reduce churn by 10%” not just “use AI”.
- Identify stakeholders and success metrics (KPIs).
- List constraints: data availability, time, budget, regulations.
2. Data Understanding
Collect and explore the available data sources.
- Connect to databases, data lakes, files, APIs.
- Perform initial EDA (distributions, missing values, anomalies).
- Check data quality and potential data leakage issues.
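An initial EDA pass like the one above can be sketched in a few lines of pandas. The dataset and column names here are purely illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical customer table; columns are illustrative only.
df = pd.DataFrame({
    "age": [25, 34, np.nan, 45, 52],
    "monthly_spend": [120.0, 80.5, 95.0, np.nan, 300.0],
    "churned": [0, 0, 1, 0, 1],
})

print(df.shape)                  # rows x columns
missing_counts = df.isna().sum() # missing values per column
summary = df.describe()          # distributions of numeric columns
print(missing_counts)
print(summary)
```

Checks like `isna().sum()` and `describe()` quickly surface the missing values and anomalies that the later Data Preparation phase must handle.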
3. Data Preparation
Turn raw data into clean, modeling‑ready datasets.
- Handle missing values, outliers and inconsistent formats.
- Feature engineering (aggregations, encodings, scaling).
- Split data into train / validation / test sets.
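The preparation steps above can be sketched with scikit-learn on synthetic data (the 60/20/20 split ratio is just one common choice, not a rule):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
X[::10, 0] = np.nan                 # simulate missing values
y = rng.integers(0, 2, size=100)

# Handle missing values, then scale features.
X_imputed = SimpleImputer(strategy="median").fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_imputed)

# 60/20/20 train / validation / test split.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X_scaled, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42)
```

In a real project the imputer and scaler should be fitted on the training split only and then applied to validation and test data, to avoid leakage.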
4. Modeling
Train, validate and select algorithms.
- Try baseline models first (logistic/linear regression, trees).
- Tune hyperparameters (GridSearch, RandomSearch, Bayesian).
- Use cross‑validation to avoid overfitting.
5. Evaluation
Check if the model meets business and technical criteria.
- Use correct metrics (ROC‑AUC, F1, RMSE, etc.).
- Perform error analysis and fairness checks.
- Present results with clear visualizations and recommendations.
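For a binary classifier, the metrics listed above can be computed directly from held-out labels and predicted probabilities. The labels and probabilities below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

# Hypothetical held-out labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.8, 0.7, 0.2, 0.9, 0.6, 0.3])
y_pred = (y_prob >= 0.5).astype(int)  # default decision threshold

auc = roc_auc_score(y_true, y_prob)   # threshold-free ranking quality
f1 = f1_score(y_true, y_pred)         # balance of precision and recall
cm = confusion_matrix(y_true, y_pred) # where the errors actually are
```

The confusion matrix is a good starting point for the error analysis step: it shows which kinds of mistakes the model makes, not just how many.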
6. Deployment & Monitoring
Put the solution into real‑world use and monitor it.
- Deploy as batch jobs, APIs, or embedded models.
- Monitor data drift, model performance and business KPIs.
- Plan regular retraining and continuous improvement.
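One common way to monitor data drift is the Population Stability Index (PSI), which compares a feature's training distribution to its live distribution. A minimal sketch, with synthetic data standing in for the real feature:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Compare a live feature distribution against its training baseline."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid log(0) for empty bins.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0, 1, 5000)
live_same = rng.normal(0, 1, 5000)     # no drift
live_shifted = rng.normal(1, 1, 5000)  # drifted distribution

psi_ok = population_stability_index(train_feature, live_same)
psi_drift = population_stability_index(train_feature, live_shifted)
```

A common rule of thumb treats PSI above roughly 0.2 as significant drift and a trigger for investigation or retraining, though the threshold should be tuned per feature.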
Modern Data & ML Pipeline (MLOps View)
In production environments, the lifecycle is implemented as an automated pipeline. Tools like Airflow, Kubeflow, MLflow, and cloud platforms help orchestrate the steps.
Raw Data ──► Ingestion ──► Data Lake / Warehouse
            (batch / stream)

Data Lake / Warehouse ──► Feature Engineering ──► Feature Store

Feature Store ──► Model Training ──► Model Registry
      ▲                                   │
      │                                   └──► Deployment (API, batch, edge)
      │                                               │
      │                                               ▼
      └────── retraining ────── Monitoring & Alerts
                          (data drift, model drift, KPIs)
Best Practices Across the Lifecycle
Reproducibility
- Use version control (Git) for code & configs.
- Fix random seeds for experiments.
- Track datasets, models and metrics (MLflow, DVC).
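Seed-fixing is the simplest of these practices to demonstrate. A toy sketch (the `run_experiment` function is hypothetical):

```python
import numpy as np

def run_experiment(seed):
    """Stand-in for a training run whose result depends on randomness."""
    rng = np.random.default_rng(seed)
    data = rng.normal(size=100)
    return data.mean()

# Same seed, same result: the run is reproducible.
result_a = run_experiment(42)
result_b = run_experiment(42)
result_c = run_experiment(7)   # a different seed gives a different run
```

In practice every source of randomness (NumPy, the ML framework, data shuffling) needs its own seed, and the seed values belong in the versioned experiment config.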
Collaboration
- Document assumptions and decisions.
- Share notebooks and dashboards with the team.
- Align frequently with business stakeholders.
Responsibility
- Check bias and fairness of models.
- Respect privacy and data regulations (GDPR, etc.).
- Monitor models and roll back if one misbehaves.