Evaluation Metrics

Accuracy (correct / total) is easy to interpret but misleading under class imbalance: on a fraud dataset that is 99% negative, a trivial “always negative” classifier scores 99% accuracy. Precision asks: of the positive predictions, how many were right? Recall asks: of the actual positives, how many did we catch? F1 is their harmonic mean. ROC-AUC summarizes the tradeoff across all thresholds; PR-AUC often suits rare positive classes better.
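
A minimal sketch of the imbalance trap, using hypothetical counts (990 negatives, 10 positives):

# Hypothetical imbalanced data: the trivial classifier looks great on
# accuracy but catches zero positives.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1] * 10 + [0] * 990       # 10 frauds among 1000 transactions
y_pred = [0] * 1000                 # "always negative" classifier

print(accuracy_score(y_true, y_pred))                    # 0.99
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0 -- no fraud caught
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0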

Confusion Matrix & Classification

For binary problems, counts fall into true positive, true negative, false positive, false negative. Precision = TP / (TP + FP); recall = TP / (TP + FN). Choose the metric that reflects business cost: missing fraud (FN) vs annoying users (FP). For multi-class, use macro (average per class, treats classes equally) or micro (pool all decisions—closer to accuracy).
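
A short sketch of macro vs micro averaging on hypothetical multi-class labels:

# Macro averages per-class F1 (each class weighs equally); micro pools all
# decisions first (for single-label multi-class it equals accuracy).
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2]

print(f1_score(y_true, y_pred, average="macro"))
print(f1_score(y_true, y_pred, average="micro"))  # = accuracy here: 6/8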

ROC, AUC, and Calibration

The ROC curve plots true positive rate against false positive rate as you vary the decision threshold. AUC is the area under the ROC curve; it measures how well the scores rank positives above negatives, independent of any single threshold. When positives are rare, inspect the precision–recall curve too. Well-calibrated probabilities matter when outputs drive decisions: among examples scored around 0.7, the fraction of actual positives should be ≈ 0.7.
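
As a sketch, scikit-learn exposes each of these diagnostics directly; y_true and y_proba here are hypothetical stand-ins for your labels and model scores:

from sklearn.metrics import roc_curve, roc_auc_score, average_precision_score
from sklearn.calibration import calibration_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_proba = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_proba)   # one point per threshold
print("ROC-AUC:", roc_auc_score(y_true, y_proba))
print("PR-AUC:", average_precision_score(y_true, y_proba))

# Calibration: fraction of actual positives per predicted-probability bin;
# well-calibrated scores satisfy frac_pos ≈ mean_pred in each bin.
frac_pos, mean_pred = calibration_curve(y_true, y_proba, n_bins=4)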

Always report metrics on a held-out validation or test set that you did not tune on; otherwise the numbers will be optimistically biased.
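
A minimal sketch of the held-out discipline, on synthetic data (all names hypothetical):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Imbalanced synthetic problem; hold out 20% and never tune on it.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f1_score(y_te, model.predict(X_te)))  # reported on untouched data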

Regression

MAE (mean absolute error) penalizes all errors linearly, which makes it relatively robust to outliers. MSE / RMSE penalize large errors quadratically, so a few big misses dominate the score. R² describes variance explained relative to a constant (predict-the-mean) baseline. When communicating with stakeholders, use a metric on the same scale as your target (e.g. dollars, degrees), such as MAE or RMSE.
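
A brief sketch with hypothetical values (note that RMSE is back on the target's scale, unlike MSE):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # e.g. prices in dollars
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)           # linear penalty
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # quadratic penalty, same units
r2 = r2_score(y_true, y_pred)                       # vs constant-mean baseline
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R^2={r2:.2f}")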

Sklearn Example

# Binary classification report (toy labels for illustration)
from sklearn.metrics import classification_report, roc_auc_score

y_true = [0, 1, 1, 0, 1, 0]                # ground-truth labels
y_pred = [0, 1, 0, 0, 1, 1]                # hard predictions at some threshold
y_proba = [0.2, 0.8, 0.4, 0.1, 0.9, 0.6]   # positive-class scores

print(classification_report(y_true, y_pred, digits=4))
print("ROC-AUC:", roc_auc_score(y_true, y_proba))

Summary

  • Accuracy alone is insufficient for imbalance; use precision, recall, F1, PR/ROC.
  • Pick metrics aligned with error costs and whether you care about ranking or hard labels.
  • Regression: pick MAE vs RMSE based on outlier sensitivity; R² for variance explained.
  • Next: PyTorch workflow for building and training nets.

Turn theory into code with PyTorch—modules, tensors, and training loops.