Classification Metrics

✅
accuracy / precision / recall / f1 / roc_auc / average_precision
Choose the right metric for your imbalance level and business cost structure
precision · recall · ROC-AUC · F1 · imbalance
Syntax
Example
Which Metric?
Internals
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, classification_report,
    confusion_matrix, precision_recall_curve, roc_curve
)
classification_report(y_true, y_pred, zero_division=0)  # shows P/R/F1 per class
roc_auc_score(y_true, y_prob)                           # threshold-free
average_precision_score(y_true, y_prob)                 # area under PR curve
python
from sklearn.metrics import (
    classification_report, roc_auc_score,
    average_precision_score, precision_recall_curve
)
import numpy as np

y_prob = model.predict_proba(X_test)[:, 1]
y_pred = model.predict(X_test)

# Full report: P, R, F1 per class
print(classification_report(y_test, y_pred,
      target_names=['No Churn', 'Churn'], zero_division=0))

# Threshold-free metrics
auc   = roc_auc_score(y_test, y_prob)
ap    = average_precision_score(y_test, y_prob)
print(f'ROC-AUC: {auc:.4f}  |  Avg Precision: {ap:.4f}')

# Find optimal threshold using PR curve
precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob)
f1_scores = 2 * precisions * recalls / (precisions + recalls + 1e-8)
best_thresh = thresholds[np.argmax(f1_scores[:-1])]
print(f'Best threshold: {best_thresh:.3f}')

# Apply custom threshold
y_pred_custom = (y_prob >= best_thresh).astype(int)

# Multi-class ROC-AUC
auc_mc = roc_auc_score(y_test_mc, y_prob_mc,
    multi_class='ovr', average='macro')
Scenario                    | Use                            | Reason
Balanced classes            | Accuracy, F1 (macro)           | Equal weight per class
Imbalanced (5–20% positive) | ROC-AUC, F1 (weighted)         | AUC stable under moderate imbalance
Rare positives (<5%)        | Average Precision (AUCPR)      | Focuses on positive class; AUC misleading
Asymmetric costs            | Custom threshold on PR curve   | Business cost guides precision vs recall tradeoff
Ranking tasks               | NDCG, MRR (custom)             | Order matters, not binary labels
  • ROC-AUC: probability that a random positive ranks higher than a random negative. Threshold-independent, but can be misleading under extreme class imbalance (the many true negatives inflate the curve).
  • Average Precision: area under the precision-recall curve. Focuses entirely on the positive class, so it is preferred for rare-event detection (fraud, cancer).
  • F1: harmonic mean of precision and recall. Penalizes extreme cases (precision or recall of 0 gives F1 = 0). Use F1-macro for multi-class problems with balanced interest across classes.
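A quick synthetic sketch of the AUC-vs-AP point above (scores and class ratio are made up, 0.5% positives): the many true negatives keep ROC-AUC looking healthy while average precision reveals how hard the positives actually are to isolate.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n_neg, n_pos = 10_000, 50                      # 0.5% positive rate
y_true = np.r_[np.zeros(n_neg), np.ones(n_pos)]
# Overlapping score distributions: positives shifted up by 1.5
y_score = np.r_[rng.normal(0.0, 1.0, n_neg), rng.normal(1.5, 1.0, n_pos)]

auc = roc_auc_score(y_true, y_score)
ap  = average_precision_score(y_true, y_score)
print(f'ROC-AUC: {auc:.3f}  |  Avg Precision: {ap:.3f}')
# AUC looks strong; AP is far lower, for the same scores on the same data.
```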

Regression Metrics

📏
MAE / MSE / RMSE / MAPE / R² / adjusted R²
Choose the right error metric: scale sensitivity, outlier robustness, and interpretability
MAE · RMSE · R² · regression
Syntax
Example
Which Metric?
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error,
    mean_absolute_percentage_error, r2_score
)
mean_absolute_error(y_true, y_pred)              # MAE: same unit as y
mean_squared_error(y_true, y_pred)               # MSE: penalizes large errors
np.sqrt(mean_squared_error(y_true, y_pred))      # RMSE
mean_absolute_percentage_error(y_true, y_pred)   # MAPE: scale-free
r2_score(y_true, y_pred)                         # R²: proportion of variance explained
python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

y_pred = model.predict(X_test)

mae  = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)

# Adjusted R²: penalizes extra features
n, p = X_test.shape
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f'MAE:  {mae:.2f}')
print(f'RMSE: {rmse:.2f}')
print(f'R²:   {r2:.4f}  |  Adj-R²: {r2_adj:.4f}')

# Residual analysis
residuals = y_test - y_pred
print(f'Residual mean: {residuals.mean():.4f}')   # near 0 = unbiased
print(f'Residual std:  {residuals.std():.4f}')
# Plot residuals vs predicted (check heteroscedasticity)
import matplotlib.pyplot as plt
plt.scatter(y_pred, residuals, alpha=0.4)
plt.axhline(0, color='r', ls='--')
Metric | Outlier Sensitivity | Interpretable     | Use When
MAE    | ✅ Robust           | ✅ Same unit as y | Median prediction; symmetric cost
RMSE   | ❌ Sensitive        | ✅ Same unit as y | Large errors costly; Gaussian noise assumed
MAPE   | ❌ Sensitive        | ✅ % relative     | Relative errors matter; never use with y ≈ 0
R²     | ❌ Sensitive        | ✅ % variance     | Baseline comparison; can be negative
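A tiny made-up example of the outlier rows above: when every error equals 1, MAE and RMSE agree; replace one prediction with a large miss and RMSE jumps far more than MAE, because squaring amplifies the big residual.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred = np.array([11.0, 11.0, 12.0, 12.0, 13.0])   # every error is 1
y_bad  = y_pred.copy()
y_bad[0] = 21.0                                     # one error of 11

mae_clean  = mean_absolute_error(y_true, y_pred)           # 1.0
rmse_clean = np.sqrt(mean_squared_error(y_true, y_pred))   # 1.0
mae_out    = mean_absolute_error(y_true, y_bad)            # 3.0
rmse_out   = np.sqrt(mean_squared_error(y_true, y_bad))    # 5.0
print(f'clean:   MAE={mae_clean:.1f}  RMSE={rmse_clean:.1f}')
print(f'outlier: MAE={mae_out:.1f}  RMSE={rmse_out:.1f}')
```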

ROC & Precision-Recall Curves

📈
roc_curve / precision_recall_curve: plotting and threshold selection
Visualize classifier performance across all thresholds and select operating points
ROC · PR curve · threshold · Youden
Example
python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, precision_recall_curve, auc

fpr, tpr, thresholds_roc = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

precision, recall, thresholds_pr = precision_recall_curve(y_test, y_prob)
pr_auc = auc(recall, precision)  # trapezoidal; can be slightly optimistic vs average_precision_score

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# ROC Curve
ax1.plot(fpr, tpr, label=f'AUC={roc_auc:.4f}', lw=2)
ax1.plot([0,1], [0,1], 'k--', lw=1)
# Youden's J: max(TPR - FPR)
J = tpr - fpr
best_idx = np.argmax(J)
ax1.scatter(fpr[best_idx], tpr[best_idx],
            color='red', s=80, zorder=5,
            label=f'Best t={thresholds_roc[best_idx]:.3f}')
ax1.set_xlabel('FPR'); ax1.set_ylabel('TPR')
ax1.set_title('ROC Curve'); ax1.legend()

# PR Curve
ax2.plot(recall, precision, label=f'PR-AUC={pr_auc:.4f}', lw=2)
# Random classifier baseline (= positive rate)
baseline = y_test.mean()
ax2.axhline(baseline, ls='--', color='gray',
            label=f'Baseline ({baseline:.3f})')
ax2.set_xlabel('Recall'); ax2.set_ylabel('Precision')
ax2.set_title('Precision-Recall Curve'); ax2.legend()
plt.tight_layout()
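A caveat on summarizing the PR curve, illustrated on made-up scores: `auc(recall, precision)` interpolates linearly between PR points, which can overstate the area, while `average_precision_score` uses a step-wise sum. The two usually differ slightly for the same probabilities.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, 500)
# Informative but noisy synthetic scores
y_prob = np.clip(0.3 * y_true + rng.uniform(0.0, 0.7, 500), 0.0, 1.0)

precision, recall, _ = precision_recall_curve(y_true, y_prob)
pr_auc_trap = auc(recall, precision)            # trapezoidal interpolation
ap = average_precision_score(y_true, y_prob)    # step-wise sum
print(f'trapezoidal PR-AUC: {pr_auc_trap:.4f}  |  AP: {ap:.4f}')
```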

Probability Calibration

🌡️
CalibratedClassifierCV / calibration_curve / brier_score_loss
When predicted probabilities need to match true frequencies: Platt/isotonic calibration
calibration · Platt scaling · Brier score · reliability diagram
Example
python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import brier_score_loss

# BEFORE calibration: raw probabilities from the uncalibrated model
y_prob_raw = base_model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
brier_raw = brier_score_loss(y_test, y_prob_raw)  # lower = better

# Calibrate with Platt scaling (method='sigmoid') or Isotonic Regression
calibrated = CalibratedClassifierCV(
    base_model, method='isotonic', cv=5
)
calibrated.fit(X_train, y_train)
y_prob_cal = calibrated.predict_proba(X_test)[:, 1]
brier_cal  = brier_score_loss(y_test, y_prob_cal)
print(f'Brier before: {brier_raw:.4f}  |  after: {brier_cal:.4f}')

# Reliability diagram
fig, ax = plt.subplots(figsize=(6, 5))
for probs, label in [(y_prob_raw, 'Uncalibrated'),
                      (y_prob_cal, 'Calibrated')]:
    prob_true, prob_pred = calibration_curve(y_test, probs, n_bins=10)
    ax.plot(prob_pred, prob_true, marker='o', label=label)
ax.plot([0,1],[0,1], 'k--', label='Perfect')
ax.legend(); ax.set_xlabel('Mean predicted prob')
ax.set_ylabel('Fraction of positives')
ax.set_title('Reliability Diagram')
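The reliability diagram can also be summarized as one number, expected calibration error (ECE): the bin-weighted gap between mean predicted probability and observed frequency. A minimal sketch on synthetic, deliberately miscalibrated probabilities (all data below is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.0, 1.0, 2000)          # predicted probabilities
y = rng.uniform(0.0, 1.0, 2000) < p**2   # true frequency is p**2, not p

bins = np.linspace(0.0, 1.0, 11)
bin_ids = np.digitize(p, bins[1:-1])     # assign to 10 equal-width bins
ece = 0.0
for b in range(10):
    mask = bin_ids == b
    if mask.any():
        # bin weight * |observed frequency - mean predicted probability|
        ece += mask.mean() * abs(y[mask].mean() - p[mask].mean())
print(f'ECE: {ece:.3f}')  # a well-calibrated model approaches 0
```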

Learning & Validation Curves

📉
learning_curve / validation_curve
Diagnose bias vs variance: do you need more data or more regularization?
learning curve · bias · variance · diagnosis
Example
python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve, validation_curve

train_sizes, train_scores, val_scores = learning_curve(
    model, X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='roc_auc', n_jobs=-1
)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(train_sizes, train_scores.mean(axis=1), label='Train')
ax.fill_between(train_sizes,
    train_scores.mean(axis=1) - train_scores.std(axis=1),
    train_scores.mean(axis=1) + train_scores.std(axis=1), alpha=0.1)
ax.plot(train_sizes, val_scores.mean(axis=1), label='Validation')
ax.fill_between(train_sizes,
    val_scores.mean(axis=1) - val_scores.std(axis=1),
    val_scores.mean(axis=1) + val_scores.std(axis=1), alpha=0.1)
ax.legend(); ax.set_xlabel('Training Size')
ax.set_ylabel('ROC-AUC'); ax.set_title('Learning Curve')
# High train / low val gap → overfit → regularize or get more data
# Both low → underfit → more complex model or more features

# Validation curve: effect of one hyperparameter
param_range = np.logspace(-4, 2, 10)
tr, vr = validation_curve(
    model, X_train, y_train,
    param_name='C', param_range=param_range,
    cv=5, scoring='roc_auc', n_jobs=-1
)
fig2, ax2 = plt.subplots(figsize=(8, 4))
ax2.semilogx(param_range, tr.mean(axis=1), label='Train')
ax2.semilogx(param_range, vr.mean(axis=1), label='Validation')
ax2.legend(); ax2.set_xlabel('C')
ax2.set_ylabel('ROC-AUC')  # optimal C at the validation-score peak