ML Fundamentals

ML Concepts & Types

# Types of Machine Learning
# Supervised Learning - Labeled data
- Classification: Predict categorical labels
- Regression: Predict continuous values

# Unsupervised Learning - No labels
- Clustering: Group similar instances
- Dimensionality Reduction: Simplify data
- Anomaly Detection: Find unusual instances

# Semi-supervised Learning - Some labels
- Combination of labeled and unlabeled data

# Reinforcement Learning - Learn from feedback
- Agent learns to make decisions
- Rewards and punishments

# Key ML Concepts
- Features: Input variables (X)
- Labels: Output variables (y) - supervised learning
- Training set: Data used to train the model
- Test set: Data used to evaluate the model
- Validation set: Data used to tune hyperparameters
- Overfitting: Model fits the training data too closely (noise included) and generalizes poorly
- Underfitting: Model is too simple to capture the underlying patterns
- Bias: Error from erroneous assumptions
- Variance: Error from sensitivity to fluctuations
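
The bias/variance trade-off above can be seen directly by fitting polynomials of different degrees to noisy data. This is a minimal sketch using NumPy only, with synthetic quadratic data (the data, seed, and degrees are illustrative assumptions): a degree-1 fit underfits (high bias), while a degree-15 fit drives training error down at the cost of chasing noise (high variance).

```python
import numpy as np

rng = np.random.default_rng(42)

# Noisy samples from a quadratic function (synthetic, for illustration)
x = rng.uniform(-1, 1, 40)
y = x**2 + rng.normal(0, 0.05, 40)

# Hold out part of the data for evaluation
x_train, y_train = x[:30], y[:30]
x_test, y_test = x[30:], y[30:]

def fit_error(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (1, 2, 15):
    train_mse, test_mse = fit_error(degree)
    print(f'degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}')
```

Degree 1 has high error on both sets (underfitting); degree 15 has the lowest training error but typically a worse test error than degree 2 (overfitting).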

# Model Evaluation Concepts
- Accuracy: Percentage of correct predictions
- Precision: True positives / (True positives + False positives)
- Recall: True positives / (True positives + False negatives)
- F1 Score: Harmonic mean of precision and recall
- ROC Curve: Visualize classification performance
- AUC: Area Under the ROC Curve
- Cross-validation: Robust evaluation technique
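
The definitions above can be checked by hand on a tiny binary example. This sketch (labels are made up for illustration) counts the confusion-matrix cells directly and applies the formulas:

```python
# Hand-computed metrics for a small binary example (illustrative labels)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Count the four confusion-matrix cells
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)  # all 0.8 for this example
```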

Data Preprocessing

# Import libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Load data
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Handle missing values
# Remove rows with missing values
df.dropna(inplace=True)

# Impute missing values
imputer = SimpleImputer(strategy='mean') # or 'median', 'most_frequent'
X_imputed = imputer.fit_transform(X)

# Encode categorical variables
# Label encoding (for ordinal data)
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# One-hot encoding (for nominal data)
X_encoded = pd.get_dummies(X, columns=['categorical_column'])

# Feature scaling
# Standardization (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Normalization (min=0, max=1)
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
X_normalized = minmax_scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Feature selection
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
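
The preprocessing steps above are usually chained in a scikit-learn Pipeline so that each step is fit on the training split only and applied consistently at prediction time. A minimal sketch, using synthetic data as a stand-in for `df` (the data shape, missing-value rate, and classifier are assumptions):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with some missing values (stand-in for a real dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan  # inject ~5% missing values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Chain imputation, scaling, selection, and a classifier;
# fit() learns every step from the training data only
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
    ('select', SelectKBest(score_func=f_classif, k=10)),
    ('clf', LogisticRegression(random_state=42)),
])
pipe.fit(X_train, y_train)
print('test accuracy:', pipe.score(X_test, y_test))
```

Because the pipeline is a single estimator, it also slots directly into cross-validation and grid search without leaking test data into the preprocessing steps.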

Scikit-Learn

Classification Algorithms

# Import classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Logistic Regression
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

# Decision Tree
tree_clf = DecisionTreeClassifier(max_depth=5, random_state=42)
tree_clf.fit(X_train, y_train)

# Random Forest
forest_clf = RandomForestClassifier(
    n_estimators=100, max_depth=5, random_state=42
)
forest_clf.fit(X_train, y_train)

# Support Vector Machine
svm_clf = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
svm_clf.fit(X_train, y_train)

# K-Nearest Neighbors
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)

# Gradient Boosting
gb_clf = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, random_state=42
)
gb_clf.fit(X_train, y_train)

# Naive Bayes
nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)

# Model evaluation
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, classification_report
)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
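
Cross-validation, listed among the evaluation concepts above, averages the score over several train/test folds instead of relying on a single split. A minimal sketch with `cross_val_score` on synthetic data (the data and fold count are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (stand-in for X, y)
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = (X[:, 0] > 0).astype(int)

# 5-fold stratified CV: each fold serves exactly once as the test set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring='accuracy')
print(f'accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')
```

Reporting the mean and standard deviation across folds gives a more robust estimate of generalization than a single train/test split.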

Regression & Clustering

# Regression algorithms
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

# Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Ridge Regression (L2 regularization)
ridge_reg = Ridge(alpha=1.0, random_state=42)
ridge_reg.fit(X_train, y_train)

# Lasso Regression (L1 regularization)
lasso_reg = Lasso(alpha=0.1, random_state=42)
lasso_reg.fit(X_train, y_train)

# Random Forest Regressor
forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
forest_reg.fit(X_train, y_train)

# Support Vector Regression
svr_reg = SVR(kernel='rbf', C=1.0, gamma='scale')
svr_reg.fit(X_train, y_train)

# Regression evaluation metrics
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score
)

y_pred = lin_reg.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)


# Clustering algorithms
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_

# DBSCAN (Density-Based Clustering)
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X)

# Hierarchical Clustering
agg_clustering = AgglomerativeClustering(n_clusters=3)
agg_clustering.fit(X)

# Gaussian Mixture Model
gmm = GaussianMixture(n_components=3, random_state=42)
gmm.fit(X)
labels = gmm.predict(X)

# Clustering evaluation metrics
from sklearn.metrics import silhouette_score, calinski_harabasz_score
silhouette_avg = silhouette_score(X, labels)
ch_score = calinski_harabasz_score(X, labels)
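
A common use of the silhouette score above is choosing the number of clusters: sweep k and keep the value that scores highest. A minimal sketch on synthetic blobs (the blob centers, spread, and k range are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs (stand-in for real X)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(50, 2))
    for center in [(0, 0), (5, 5), (0, 5)]
])

# Sweep k and record the silhouette score of each clustering
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print('best k:', best_k)
```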

Neural Networks

TensorFlow & Keras

# Import TensorFlow and Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Define a simple sequential model
model = keras.Sequential([
    # Input layer
    keras.Input(shape=(10,)),
    layers.Dense(64, activation='relu'),
    # Hidden layers
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(64, activation='relu'),
    # Output layer
    layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Display the model architecture
model.summary()

# Train the model
history = model.fit(
    X_train, y_train,
    batch_size=32,
    epochs=50,
    validation_split=0.2,
    verbose=1
)

# Evaluate the model
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f'Test accuracy: {test_acc}')

# Make predictions
predictions = model.predict(X_test)

# Save and load models
model.save('my_model.keras')  # native Keras format ('.h5' is the legacy format)
loaded_model = keras.models.load_model('my_model.keras')

# Functional API for complex models
inputs = keras.Input(shape=(10,))
x = layers.Dense(64, activation='relu')(inputs)
x = layers.Dense(128, activation='relu')(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(inputs=inputs, outputs=outputs)

PyTorch

# Import PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Define a neural network
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNet, self).__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size, hidden_size)
        self.layer3 = nn.Linear(hidden_size, num_classes)
        self.dropout = nn.Dropout(0.2)
    
    def forward(self, x):
        out = self.layer1(x)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.layer2(out)
        out = self.relu(out)
        out = self.layer3(out)
        return out

# Instantiate the model
model = NeuralNet(input_size=10, hidden_size=128, num_classes=1)

# Define loss function and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Convert data to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32)

# Create DataLoader
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Training loop
num_epochs = 50
for epoch in range(num_epochs):
    for i, (inputs, labels) in enumerate(train_loader):
        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels.view(-1, 1))
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Model evaluation
model.eval()
with torch.no_grad():
    outputs = model(X_test_tensor)
    predicted = (torch.sigmoid(outputs) > 0.5).float()
    accuracy = (predicted == y_test_tensor.view(-1, 1)).float().mean()
    print(f'Test Accuracy: {accuracy.item()}')
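
Trained PyTorch models are usually persisted by saving the `state_dict` (the learned parameters) rather than the whole object. A minimal sketch (the tiny Sequential model and file name are illustrative assumptions, not the NeuralNet above):

```python
import torch
import torch.nn as nn

# A tiny model, shrunk for brevity (any nn.Module works the same way)
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))

# Save only the learned parameters
torch.save(model.state_dict(), 'model_weights.pt')

# Reload: rebuild the same architecture, then load the weights into it
restored = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
restored.load_state_dict(torch.load('model_weights.pt'))
restored.eval()  # switch off dropout / batch-norm training behavior

# Both models now produce identical outputs
x = torch.randn(4, 10)
with torch.no_grad():
    same = torch.allclose(model(x), restored(x))
print('outputs match:', same)
```

Saving the `state_dict` keeps the checkpoint independent of the exact class definition on disk, which makes it more portable across code versions than pickling the whole model.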

Additional Resources

Useful Resources & References

# Popular ML Libraries
- Scikit-learn: Machine learning in Python
- TensorFlow: Open source ML framework by Google
- PyTorch: Open source ML framework by Meta (formerly Facebook)
- Keras: High-level neural networks API
- XGBoost: Optimized distributed gradient boosting
- LightGBM: Gradient boosting framework by Microsoft
- CatBoost: High-performance gradient boosting
- OpenCV: Computer vision library
- NLTK: Natural Language Toolkit
- SpaCy: Industrial-strength NLP
- Hugging Face Transformers: State-of-the-art NLP

# Online Courses & Tutorials
- Coursera: Machine Learning by Andrew Ng
- fast.ai: Practical deep learning for coders
- Kaggle Learn: Hands-on data science courses
- Google Machine Learning Crash Course
- Stanford CS229: Machine Learning course
- MIT Introduction to Deep Learning

# Books
- "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow" by Aurélien Géron
- "Pattern Recognition and Machine Learning" by Christopher M. Bishop
- "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
- "The Hundred-Page Machine Learning Book" by Andriy Burkov
- "Machine Learning Yearning" by Andrew Ng
- "Python Machine Learning" by Sebastian Raschka and Vahid Mirjalili

# Communities & Platforms
- Kaggle: Data science competitions and datasets
- GitHub: Open source ML projects
- arXiv: Latest research papers
- Towards Data Science: ML articles and tutorials
- Stack Overflow: Q&A for ML practitioners
- Reddit: r/MachineLearning, r/datascience

# Dataset Repositories
- UCI Machine Learning Repository
- Kaggle Datasets
- Google Dataset Search
- AWS Open Data Registry
- Microsoft Research Open Data
- Government open data portals
- Academic dataset collections