# Sklearn: Evaluation

Classification and regression metrics, ROC/PR curves, calibration, learning curves, and model selection with proper train/test/val splits.
## Regression Metrics

MAE / MSE / RMSE / MAPE / R² / adjusted-R²

Choose the right error metric: scale sensitivity, outlier robustness, and interpretability.

### Syntax
```python
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error,
    mean_absolute_percentage_error, r2_score
)

mean_absolute_error(y_true, y_pred)             # MAE: same unit as y
mean_squared_error(y_true, y_pred)              # MSE: penalizes large errors
np.sqrt(mean_squared_error(y_true, y_pred))     # RMSE
mean_absolute_percentage_error(y_true, y_pred)  # MAPE: scale-free
r2_score(y_true, y_pred)                        # R²: proportion of variance explained
```
### Example

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

# Adjusted R²: penalizes extra features
n, p = X_test.shape
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f'MAE: {mae:.2f}')
print(f'RMSE: {rmse:.2f}')
print(f'R²: {r2:.4f} | Adj-R²: {r2_adj:.4f}')

# Residual analysis
residuals = y_test - y_pred
print(f'Residual mean: {residuals.mean():.4f}')  # near 0 = unbiased
print(f'Residual std: {residuals.std():.4f}')

# Plot residuals vs predicted (check for heteroscedasticity)
plt.scatter(y_pred, residuals, alpha=0.4)
plt.axhline(0, color='r', ls='--')
```
### Which Metric?

| Metric | Outlier Sensitivity | Interpretable | Use When |
|---|---|---|---|
| MAE | ✓ Robust | ✓ Same unit as y | Median prediction; symmetric cost |
| RMSE | ✗ Sensitive | ✓ Same unit as y | Large errors costly; Gaussian noise assumed |
| MAPE | ✗ Sensitive | ✓ % relative | Relative errors matter; never use with y ≈ 0 |
| R² | ✗ Sensitive | ✓ % of variance | Baseline comparison; can be negative |
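The table's outlier-sensitivity claim is easy to verify on synthetic numbers (the values below are made up for illustration): a single large miss moves RMSE far more than MAE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Four small errors of 0.5 plus one 30-unit outlier miss
y_true = np.array([10.0, 12.0, 11.0, 13.0, 50.0])
y_pred = np.array([10.5, 11.5, 11.5, 12.5, 20.0])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# RMSE is dominated by the single large error; MAE dilutes it
print(f'MAE:  {mae:.2f}')   # (0.5*4 + 30)/5 = 6.40
print(f'RMSE: {rmse:.2f}')  # sqrt((0.25*4 + 900)/5) ≈ 13.42
```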
## ROC & Precision-Recall Curves

roc_curve / precision_recall_curve: plotting and threshold selection

Visualize classifier performance across all thresholds and select operating points.

### Example
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, precision_recall_curve, auc

fpr, tpr, thresholds_roc = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
precision, recall, thresholds_pr = precision_recall_curve(y_test, y_prob)
pr_auc = auc(recall, precision)  # trapezoidal PR-AUC (average_precision_score is an alternative)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# ROC curve
ax1.plot(fpr, tpr, label=f'AUC={roc_auc:.4f}', lw=2)
ax1.plot([0, 1], [0, 1], 'k--', lw=1)

# Youden's J: maximize TPR - FPR
J = tpr - fpr
best_idx = np.argmax(J)
ax1.scatter(fpr[best_idx], tpr[best_idx], color='red', s=80, zorder=5,
            label=f'Best t={thresholds_roc[best_idx]:.3f}')
ax1.set_xlabel('FPR'); ax1.set_ylabel('TPR')
ax1.set_title('ROC Curve'); ax1.legend()

# PR curve
ax2.plot(recall, precision, label=f'PR-AUC={pr_auc:.4f}', lw=2)

# Random-classifier baseline (= positive rate)
baseline = y_test.mean()
ax2.axhline(baseline, ls='--', color='gray', label=f'Baseline ({baseline:.3f})')
ax2.set_xlabel('Recall'); ax2.set_ylabel('Precision')
ax2.set_title('Precision-Recall Curve'); ax2.legend()
plt.tight_layout()
```
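Youden's J targets the ROC curve; when precision matters more, a common alternative is to pick the threshold that maximizes F1 along the PR curve. The sketch below uses a synthetic imbalanced dataset rather than the `y_test`/`y_prob` from above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, f1_score

# Synthetic imbalanced binary problem (~90% negatives)
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
prob = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_te, prob)
# F1 at each threshold (the final precision/recall point has no threshold)
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_t = thresholds[np.argmax(f1)]

print(f'best threshold={best_t:.3f}, F1={f1.max():.3f}')
print(f'F1 at default 0.5 cut: {f1_score(y_te, (prob >= 0.5).astype(int)):.3f}')
```

Because the curve sweeps every distinct score, the F1-optimal threshold is never worse than the default 0.5 cut on the same data.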
## Probability Calibration

CalibratedClassifierCV / calibration_curve / brier_score_loss

When predicted probabilities need to match true frequencies: Platt or isotonic calibration.

### Example
```python
import matplotlib.pyplot as plt
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import brier_score_loss

# BEFORE calibration (y_prob_raw = the uncalibrated model's probabilities)
brier_raw = brier_score_loss(y_test, y_prob_raw)  # lower = better

# Calibrate with Platt scaling (method='sigmoid') or isotonic regression
calibrated = CalibratedClassifierCV(base_model, method='isotonic', cv=5)
calibrated.fit(X_train, y_train)
y_prob_cal = calibrated.predict_proba(X_test)[:, 1]
brier_cal = brier_score_loss(y_test, y_prob_cal)
print(f'Brier before: {brier_raw:.4f} | after: {brier_cal:.4f}')

# Reliability diagram
fig, ax = plt.subplots(figsize=(6, 5))
for probs, label in [(y_prob_raw, 'Uncalibrated'),
                     (y_prob_cal, 'Calibrated')]:
    prob_true, prob_pred = calibration_curve(y_test, probs, n_bins=10)
    ax.plot(prob_pred, prob_true, marker='o', label=label)
ax.plot([0, 1], [0, 1], 'k--', label='Perfect')
ax.legend(); ax.set_xlabel('Mean predicted prob')
ax.set_ylabel('Fraction of positives')
ax.set_title('Reliability Diagram')
```
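`calibration_curve` returns per-bin averages but not bin counts, so a single-number summary such as expected calibration error (ECE) has to be computed by hand. The helper below is our own sketch, not a scikit-learn function: it bins predicted probabilities and takes a count-weighted average of |observed frequency - mean predicted probability|.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE sketch: count-weighted average |accuracy - confidence| per bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ids = np.clip(np.digitize(y_prob, bins) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = ids == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

# Perfectly calibrated probabilities (labels drawn from the stated probs)
rng = np.random.default_rng(0)
p = rng.uniform(size=50_000)
y = (rng.uniform(size=p.size) < p).astype(int)
print(f'ECE: {expected_calibration_error(y, p):.4f}')  # close to 0
```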
## Learning & Validation Curves

learning_curve / validation_curve

Diagnose bias vs variance: do you need more data or more regularization?

### Example
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve, validation_curve

train_sizes, train_scores, val_scores = learning_curve(
    model, X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='roc_auc', n_jobs=-1
)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(train_sizes, train_scores.mean(axis=1), label='Train')
ax.fill_between(train_sizes,
                train_scores.mean(axis=1) - train_scores.std(axis=1),
                train_scores.mean(axis=1) + train_scores.std(axis=1), alpha=0.1)
ax.plot(train_sizes, val_scores.mean(axis=1), label='Validation')
ax.fill_between(train_sizes,
                val_scores.mean(axis=1) - val_scores.std(axis=1),
                val_scores.mean(axis=1) + val_scores.std(axis=1), alpha=0.1)
ax.legend(); ax.set_xlabel('Training Size')
ax.set_ylabel('ROC-AUC'); ax.set_title('Learning Curve')
# High train / low val gap -> overfit -> regularize or get more data
# Both low -> underfit -> more complex model or more features

# Validation curve: effect of one hyperparameter
# (model must expose the named param, e.g. C on LogisticRegression)
param_range = np.logspace(-4, 2, 10)
tr, vr = validation_curve(
    model, X_train, y_train,
    param_name='C', param_range=param_range,
    cv=5, scoring='roc_auc', n_jobs=-1
)
fig2, ax2 = plt.subplots(figsize=(8, 4))
ax2.semilogx(param_range, tr.mean(axis=1), label='Train')
ax2.semilogx(param_range, vr.mean(axis=1), label='Validation')  # optimal C at the peak
ax2.legend(); ax2.set_xlabel('C'); ax2.set_ylabel('ROC-AUC')
```
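The intro also mentions model selection with proper train/test/val splits. One standard pattern, sketched below on synthetic data, is to hold out a final test set once and let the cross-validation inside `GridSearchCV` play the validation role, so the test set is scored exactly once.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out a final test set; CV inside GridSearchCV selects hyperparameters
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 0.1, 1, 10]},
    cv=5, scoring='roc_auc', n_jobs=-1
)
search.fit(X_trainval, y_trainval)
print('best C:', search.best_params_['C'])

# Report on the untouched test set exactly once
print('test ROC-AUC:', search.score(X_test, y_test))
```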