Evaluation Metrics

Accuracy (correct / total) is easy to interpret but misleading under class imbalance: on a fraud dataset that is 99% negative, a trivial “always negative” classifier scores 99% accuracy. Precision asks: of the positive predictions, how many were right? Recall asks: of the actual positives, how many did we catch? F1 is their harmonic mean. ROC-AUC summarizes the tradeoff across all thresholds; PR-AUC often suits rare positive classes better.
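
A minimal sketch of the imbalance trap, using hypothetical counts (990 negatives, 10 positives):

# Hypothetical imbalanced data: the trivial classifier looks great on
# accuracy but catches zero positives.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1] * 10 + [0] * 990       # 10 frauds among 1000 transactions
y_pred = [0] * 1000                 # "always negative" classifier

print(accuracy_score(y_true, y_pred))                    # 0.99
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0 -- no fraud caught
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0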

Confusion Matrix & Classification

For binary problems, counts fall into true positive, true negative, false positive, false negative. Precision = TP / (TP + FP); recall = TP / (TP + FN). Choose the metric that reflects business cost: missing fraud (FN) vs annoying users (FP). For multi-class, use macro (average per class, treats classes equally) or micro (pool all decisions—closer to accuracy).
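
A short sketch of macro vs micro averaging on hypothetical multi-class labels:

# Macro averages per-class F1 (each class weighs equally); micro pools all
# decisions first (for single-label multi-class it equals accuracy).
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2]

print(f1_score(y_true, y_pred, average="macro"))
print(f1_score(y_true, y_pred, average="micro"))  # = accuracy here: 6/8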

ROC, AUC, and Calibration

The ROC curve plots true positive rate against false positive rate as you vary the decision threshold. AUC is the area under the ROC curve; it measures how well the scores rank positives above negatives, independent of any single threshold. When positives are rare, inspect the precision–recall curve too. Well-calibrated probabilities matter when outputs drive decisions: among examples scored around 0.7, the fraction of actual positives should be ≈ 0.7.
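
As a sketch, scikit-learn exposes each of these diagnostics directly; y_true and y_proba here are hypothetical stand-ins for your labels and model scores:

from sklearn.metrics import roc_curve, roc_auc_score, average_precision_score
from sklearn.calibration import calibration_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_proba = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_proba)   # one point per threshold
print("ROC-AUC:", roc_auc_score(y_true, y_proba))
print("PR-AUC:", average_precision_score(y_true, y_proba))

# Calibration: fraction of actual positives per predicted-probability bin;
# well-calibrated scores satisfy frac_pos ≈ mean_pred in each bin.
frac_pos, mean_pred = calibration_curve(y_true, y_proba, n_bins=4)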

Always report metrics on a held-out validation or test set that you did not tune on; otherwise the numbers will be optimistically biased.
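
A minimal sketch of the held-out discipline, on synthetic data (all names hypothetical):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Imbalanced synthetic problem; hold out 20% and never tune on it.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f1_score(y_te, model.predict(X_te)))  # reported on untouched data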

Regression

MAE (mean absolute error) penalizes all errors linearly, which makes it relatively robust to outliers. MSE / RMSE penalize large errors quadratically, so a few big misses dominate the score. R² describes variance explained relative to a constant (predict-the-mean) baseline. When communicating with stakeholders, use a metric on the same scale as your target (e.g. dollars, degrees), such as MAE or RMSE.
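
A brief sketch with hypothetical values (note that RMSE is back on the target's scale, unlike MSE):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # e.g. prices in dollars
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)           # linear penalty
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # quadratic penalty, same units
r2 = r2_score(y_true, y_pred)                       # vs constant-mean baseline
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R^2={r2:.2f}")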

Sklearn Example

# Binary classification report (toy labels for illustration)
from sklearn.metrics import classification_report, roc_auc_score

y_true = [0, 1, 1, 0, 1, 0]                # ground-truth labels
y_pred = [0, 1, 0, 0, 1, 1]                # hard predictions at some threshold
y_proba = [0.2, 0.8, 0.4, 0.1, 0.9, 0.6]   # positive-class scores

print(classification_report(y_true, y_pred, digits=4))
print("ROC-AUC:", roc_auc_score(y_true, y_proba))

Summary

  • Accuracy alone is insufficient for imbalance; use precision, recall, F1, PR/ROC.
  • Pick metrics aligned with error costs and whether you care about ranking or hard labels.
  • Regression: pick MAE vs RMSE based on outlier sensitivity; R² for variance explained.
  • Next: PyTorch workflow for building and training nets.

Turn theory into code with PyTorch—modules, tensors, and training loops.