ML Fundamentals

ML Concepts & Types

# Types of Machine Learning
# Supervised Learning - Labeled data
- Classification: Predict categorical labels
- Regression: Predict continuous values

# Unsupervised Learning - No labels
- Clustering: Group similar instances
- Dimensionality Reduction: Simplify data
- Anomaly Detection: Find unusual instances

# Semi-supervised Learning - Some labels
- Combination of labeled and unlabeled data

# Reinforcement Learning - Learn from feedback
- Agent learns to make decisions
- Rewards and punishments

# Key ML Concepts
- Features: Input variables (X)
- Labels: Output variables (y) - supervised learning
- Training set: Data used to train the model
- Test set: Data used to evaluate the model
- Validation set: Data used to tune hyperparameters
- Overfitting: Model fits the training data too closely (noise included) and generalizes poorly
- Underfitting: Model is too simple to capture the underlying patterns
- Bias: Error from erroneous assumptions
- Variance: Error from sensitivity to fluctuations
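
The bias/variance trade-off above can be seen directly by fitting polynomials of different degrees to noisy data. This is a minimal sketch using NumPy only, with synthetic quadratic data (the data, seed, and degrees are illustrative assumptions): a degree-1 fit underfits (high bias), while a degree-15 fit drives training error down at the cost of chasing noise (high variance).

```python
import numpy as np

rng = np.random.default_rng(42)

# Noisy samples from a quadratic function (synthetic, for illustration)
x = rng.uniform(-1, 1, 40)
y = x**2 + rng.normal(0, 0.05, 40)

# Hold out part of the data for evaluation
x_train, y_train = x[:30], y[:30]
x_test, y_test = x[30:], y[30:]

def fit_error(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (1, 2, 15):
    train_mse, test_mse = fit_error(degree)
    print(f'degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}')
```

Degree 1 has high error on both sets (underfitting); degree 15 has the lowest training error but typically a worse test error than degree 2 (overfitting).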

# Model Evaluation Concepts
- Accuracy: Percentage of correct predictions
- Precision: True positives / (True positives + False positives)
- Recall: True positives / (True positives + False negatives)
- F1 Score: Harmonic mean of precision and recall
- ROC Curve: Visualize classification performance
- AUC: Area Under the ROC Curve
- Cross-validation: Robust evaluation technique
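
The definitions above can be checked by hand on a tiny binary example. This sketch (labels are made up for illustration) counts the confusion-matrix cells directly and applies the formulas:

```python
# Hand-computed metrics for a small binary example (illustrative labels)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Count the four confusion-matrix cells
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)  # all 0.8 for this example
```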

Data Preprocessing

# Import libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Load data
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Handle missing values
# Remove rows with missing values
df.dropna(inplace=True)

# Impute missing values
imputer = SimpleImputer(strategy='mean') # or 'median', 'most_frequent'
X_imputed = imputer.fit_transform(X)

# Encode categorical variables
# Label encoding (for ordinal data)
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# One-hot encoding (for nominal data)
X_encoded = pd.get_dummies(X, columns=['categorical_column'])

# Feature scaling
# Standardization (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Normalization (min=0, max=1)
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
X_normalized = minmax_scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Feature selection
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
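
The preprocessing steps above are usually chained in a scikit-learn Pipeline so that each step is fit on the training split only and applied consistently at prediction time. A minimal sketch, using synthetic data as a stand-in for `df` (the data shape, missing-value rate, and classifier are assumptions):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with some missing values (stand-in for a real dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan  # inject ~5% missing values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Chain imputation, scaling, selection, and a classifier;
# fit() learns every step from the training data only
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
    ('select', SelectKBest(score_func=f_classif, k=10)),
    ('clf', LogisticRegression(random_state=42)),
])
pipe.fit(X_train, y_train)
print('test accuracy:', pipe.score(X_test, y_test))
```

Because the pipeline is a single estimator, it also slots directly into cross-validation and grid search without leaking test data into the preprocessing steps.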

Scikit-Learn

Classification Algorithms

# Import classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Logistic Regression
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

# Decision Tree
tree_clf = DecisionTreeClassifier(max_depth=5, random_state=42)
tree_clf.fit(X_train, y_train)

# Random Forest
forest_clf = RandomForestClassifier(
    n_estimators=100, max_depth=5, random_state=42
)
forest_clf.fit(X_train, y_train)

# Support Vector Machine
svm_clf = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
svm_clf.fit(X_train, y_train)

# K-Nearest Neighbors
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)

# Gradient Boosting
gb_clf = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, random_state=42
)
gb_clf.fit(X_train, y_train)

# Naive Bayes
nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)

# Model evaluation
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, classification_report
)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
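
Cross-validation, listed among the evaluation concepts above, averages the score over several train/test folds instead of relying on a single split. A minimal sketch with `cross_val_score` on synthetic data (the data and fold count are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (stand-in for X, y)
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = (X[:, 0] > 0).astype(int)

# 5-fold stratified CV: each fold serves exactly once as the test set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring='accuracy')
print(f'accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')
```

Reporting the mean and standard deviation across folds gives a more robust estimate of generalization than a single train/test split.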

Regression & Clustering

# Regression algorithms
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

# Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Ridge Regression (L2 regularization)
ridge_reg = Ridge(alpha=1.0, random_state=42)
ridge_reg.fit(X_train, y_train)

# Lasso Regression (L1 regularization)
lasso_reg = Lasso(alpha=0.1, random_state=42)
lasso_reg.fit(X_train, y_train)

# Random Forest Regressor
forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
forest_reg.fit(X_train, y_train)

# Support Vector Regression
svr_reg = SVR(kernel='rbf', C=1.0, gamma='scale')
svr_reg.fit(X_train, y_train)

# Regression evaluation metrics
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score
)

y_pred = lin_reg.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)


# Clustering algorithms
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_

# DBSCAN (Density-Based Clustering)
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X)

# Hierarchical Clustering
agg_clustering = AgglomerativeClustering(n_clusters=3)
agg_clustering.fit(X)

# Gaussian Mixture Model
gmm = GaussianMixture(n_components=3, random_state=42)
gmm.fit(X)
labels = gmm.predict(X)

# Clustering evaluation metrics
from sklearn.metrics import silhouette_score, calinski_harabasz_score
silhouette_avg = silhouette_score(X, labels)
ch_score = calinski_harabasz_score(X, labels)
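
A common use of the silhouette score above is choosing the number of clusters: sweep k and keep the value that scores highest. A minimal sketch on synthetic blobs (the blob centers, spread, and k range are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs (stand-in for real X)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(50, 2))
    for center in [(0, 0), (5, 5), (0, 5)]
])

# Sweep k and record the silhouette score of each clustering
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print('best k:', best_k)
```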

Neural Networks

TensorFlow & Keras

# Import TensorFlow and Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Define a simple sequential model
model = keras.Sequential([
    # Input layer
    keras.Input(shape=(10,)),
    layers.Dense(64, activation='relu'),
    # Hidden layers
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(64, activation='relu'),
    # Output layer
    layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Display the model architecture
model.summary()

# Train the model
history = model.fit(
    X_train, y_train,
    batch_size=32,
    epochs=50,
    validation_split=0.2,
    verbose=1
)

# Evaluate the model
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f'Test accuracy: {test_acc}')

# Make predictions
predictions = model.predict(X_test)

# Save and load models
model.save('my_model.keras')  # native Keras format ('.h5' is the legacy format)
loaded_model = keras.models.load_model('my_model.keras')

# Functional API for complex models
inputs = keras.Input(shape=(10,))
x = layers.Dense(64, activation='relu')(inputs)
x = layers.Dense(128, activation='relu')(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(inputs=inputs, outputs=outputs)

PyTorch

# Import PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Define a neural network
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNet, self).__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size, hidden_size)
        self.layer3 = nn.Linear(hidden_size, num_classes)
        self.dropout = nn.Dropout(0.2)
    
    def forward(self, x):
        out = self.layer1(x)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.layer2(out)
        out = self.relu(out)
        out = self.layer3(out)
        return out

# Instantiate the model
model = NeuralNet(input_size=10, hidden_size=128, num_classes=1)

# Define loss function and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Convert data to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32)

# Create DataLoader
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Training loop
num_epochs = 50
for epoch in range(num_epochs):
    for i, (inputs, labels) in enumerate(train_loader):
        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels.view(-1, 1))
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Model evaluation
model.eval()
with torch.no_grad():
    outputs = model(X_test_tensor)
    predicted = (torch.sigmoid(outputs) > 0.5).float()
    accuracy = (predicted == y_test_tensor.view(-1, 1)).float().mean()
    print(f'Test Accuracy: {accuracy.item()}')
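
Trained PyTorch models are usually persisted by saving the `state_dict` (the learned parameters) rather than the whole object. A minimal sketch (the tiny Sequential model and file name are illustrative assumptions, not the NeuralNet above):

```python
import torch
import torch.nn as nn

# A tiny model, shrunk for brevity (any nn.Module works the same way)
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))

# Save only the learned parameters
torch.save(model.state_dict(), 'model_weights.pt')

# Reload: rebuild the same architecture, then load the weights into it
restored = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
restored.load_state_dict(torch.load('model_weights.pt'))
restored.eval()  # switch off dropout / batch-norm training behavior

# Both models now produce identical outputs
x = torch.randn(4, 10)
with torch.no_grad():
    same = torch.allclose(model(x), restored(x))
print('outputs match:', same)
```

Saving the `state_dict` keeps the checkpoint independent of the exact class definition on disk, which makes it more portable across code versions than pickling the whole model.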

Additional Resources

Useful Resources & References

# Popular ML Libraries
- Scikit-learn: Machine learning in Python
- TensorFlow: Open source ML framework by Google
- PyTorch: Open source ML framework by Meta (formerly Facebook)
- Keras: High-level neural networks API
- XGBoost: Optimized distributed gradient boosting
- LightGBM: Gradient boosting framework by Microsoft
- CatBoost: High-performance gradient boosting
- OpenCV: Computer vision library
- NLTK: Natural Language Toolkit
- SpaCy: Industrial-strength NLP
- Hugging Face Transformers: State-of-the-art NLP

# Online Courses & Tutorials
- Coursera: Machine Learning by Andrew Ng
- fast.ai: Practical deep learning for coders
- Kaggle Learn: Hands-on data science courses
- Google Machine Learning Crash Course
- Stanford CS229: Machine Learning course
- MIT Introduction to Deep Learning

# Books
- "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow" by Aurélien Géron
- "Pattern Recognition and Machine Learning" by Christopher M. Bishop
- "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
- "The Hundred-Page Machine Learning Book" by Andriy Burkov
- "Machine Learning Yearning" by Andrew Ng
- "Python Machine Learning" by Sebastian Raschka and Vahid Mirjalili

# Communities & Platforms
- Kaggle: Data science competitions and datasets
- GitHub: Open source ML projects
- arXiv: Latest research papers
- Towards Data Science: ML articles and tutorials
- Stack Overflow: Q&A for ML practitioners
- Reddit: r/MachineLearning, r/datascience

# Dataset Repositories
- UCI Machine Learning Repository
- Kaggle Datasets
- Google Dataset Search
- AWS Open Data Registry
- Microsoft Research Open Data
- Government open data portals
- Academic dataset collections