Linear Regression Family

📈
LinearRegression / Ridge / Lasso / ElasticNet
OLS and regularized regression — when to use each and key parameters
regression · Ridge · Lasso · regularization
```python
from sklearn.linear_model import (
    LinearRegression, Ridge, Lasso, ElasticNet, RidgeCV, LassoCV
)

LinearRegression(fit_intercept=True, n_jobs=None)

Ridge(alpha=1.0,      # L2 penalty strength
      solver='auto')  # 'svd','cholesky','lsqr','saga'

Lasso(alpha=1.0,      # L1 penalty — induces sparsity
      max_iter=1000)

ElasticNet(alpha=1.0,     # overall regularization
           l1_ratio=0.5)  # 0=Ridge, 1=Lasso

# Built-in CV variants (efficient)
RidgeCV(alphas=[0.1, 1, 10, 100])
LassoCV(cv=5, n_alphas=100)
```
  • LinearRegression — minimizes ‖Xw − y‖₂² in closed form; the textbook solution is w = (XᵀX)⁻¹Xᵀy (scikit-learn solves it via SVD-based least squares). No hyperparameters. Coefficients become unstable under multicollinearity (ill-conditioned XᵀX).
  • Ridge (L2) — adds α‖w‖₂² penalty. Invertible even with collinearity. All features kept; coefficients shrunk toward zero but never exactly zero.
  • Lasso (L1) — adds α‖w‖₁ penalty. Produces sparse solutions (zero coefficients = feature selection). Solved via coordinate descent.
  • ElasticNet — combines L1+L2. Better than Lasso alone for grouped correlated features (Lasso arbitrarily picks one).
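The shrinkage behaviour above is easy to see on synthetic data (a sketch — `make_regression` and the alpha values are illustrative, not recommendations):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# 100 features, only 10 carry signal
X, y = make_regression(n_samples=200, n_features=100,
                       n_informative=10, noise=5.0, random_state=0)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=1.0, max_iter=10_000).fit(X, y)

# Ridge shrinks but keeps every feature; Lasso zeroes most of them
print('Ridge zero coefs:', np.sum(ridge.coef_ == 0))
print('Lasso zero coefs:', np.sum(lasso.coef_ == 0))
```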
```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# LassoCV: auto-tunes alpha with cross-validation
lasso_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model',  LassoCV(cv=5, max_iter=5000, n_jobs=-1))
])
lasso_pipe.fit(X_train, y_train)
best_alpha = lasso_pipe['model'].alpha_
print(f'Best alpha: {best_alpha:.4f}')

# Feature selection: non-zero coefficients
coefs = lasso_pipe['model'].coef_
selected = np.where(coefs != 0)[0]
print(f'{len(selected)} features selected out of {len(coefs)}')

# RidgeCV: efficient k-fold CV (leave-one-out when cv=None)
ridge = RidgeCV(alphas=np.logspace(-4, 4, 50), cv=5)
ridge.fit(X_train_scaled, y_train)
print(f'Best Ridge alpha: {ridge.alpha_:.4f}')
```
| Model | Multicollinearity | Feature Selection | Best When |
|---|---|---|---|
| LinearRegression | ❌ Unstable | ❌ No | n ≫ p, no collinearity |
| Ridge | ✅ Handles | ❌ No | Many correlated features |
| Lasso | ⚠️ Picks one | ✅ Yes | Sparse true signal |
| ElasticNet | ✅ Handles | ✅ Partial | Grouped correlated features |
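The "Lasso arbitrarily picks one" behaviour can be demonstrated with two perfectly correlated columns (a sketch on synthetic data; the duplicated-column setup is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
X = np.hstack([x, x])  # two identical (perfectly correlated) columns
y = 3.0 * x.ravel() + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
enet  = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print('Lasso coefs:', lasso.coef_)      # weight concentrates on one column
print('ElasticNet coefs:', enet.coef_)  # weight spread across the group
```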

Logistic Regression

🎯
LogisticRegression
Probabilistic linear classifier — solver selection, multi-class strategies, calibration
classification · solver · multi-class
```python
from sklearn.linear_model import LogisticRegression

LogisticRegression(
    penalty='l2',        # 'l1','l2','elasticnet',None
    C=1.0,               # inverse regularization (smaller = stronger)
    solver='lbfgs',      # 'saga' for L1; 'liblinear' for small data
    multi_class='auto',  # 'ovr','multinomial'; deprecated in sklearn >= 1.5
    max_iter=100,        # increase if ConvergenceWarning
    class_weight=None,   # 'balanced' for imbalanced classes
    n_jobs=-1
)
```
```python
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV

# Built-in CV tuning of C
lr = LogisticRegressionCV(
    Cs=20,                 # 20 values of C on a log scale
    cv=5, scoring='roc_auc',
    solver='lbfgs', max_iter=500,
    class_weight='balanced', n_jobs=-1
)
lr.fit(X_train_scaled, y_train)
print(f'Best C: {lr.C_[0]:.4f}')

# Predict probabilities (apply the same scaling as training)
y_prob = lr.predict_proba(X_test_scaled)[:, 1]

# Coefficient analysis (after standardization: comparable magnitudes)
feat_coef = pd.Series(lr.coef_[0], index=feature_names)
feat_coef.abs().sort_values(ascending=False).head(10)
```
| Solver | Penalty | Best For |
|---|---|---|
| lbfgs | L2 | Default; multiclass; medium datasets |
| liblinear | L1, L2 | Small datasets; binary |
| saga | L1, L2, ElasticNet | Large datasets; sparse features |
| sag | L2 | Large dense datasets |
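As the table shows, an L1 penalty requires `liblinear` or `saga`; a minimal sketch of sparse logistic regression on synthetic data (dataset and C value are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=5, random_state=0)
X = StandardScaler().fit_transform(X)

# L1 needs 'liblinear' or 'saga'; 'lbfgs' supports only L2/None
clf = LogisticRegression(penalty='l1', C=0.1, solver='saga',
                         max_iter=5000).fit(X, y)
print('non-zero coefficients:', np.sum(clf.coef_ != 0), 'of', clf.coef_.size)
```

A strong penalty (small C) zeroes most of the 45 uninformative coefficients, giving feature selection for free.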

Tree Ensembles

🌲
RandomForestClassifier / RandomForestRegressor
Bagging of decision trees — OOB score, feature importance, and variance reduction
Random Forest · bagging · OOB · feature importance
```python
from sklearn.ensemble import RandomForestClassifier

RandomForestClassifier(
    n_estimators=300,     # more trees = lower variance (plateaus ~300)
    max_depth=None,       # None = full depth; set 10-30 to reduce overfit
    max_features='sqrt',  # features tried per split: sqrt(n) for clf
    min_samples_leaf=2,   # min samples in a leaf (regularizes)
    oob_score=True,       # free OOB validation estimate
    class_weight='balanced',
    n_jobs=-1, random_state=42
)
```
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                            n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

# OOB score: unbiased estimate without touching the test set
print(f'OOB accuracy: {rf.oob_score_:.4f}')

# Feature importances (MDI — Mean Decrease in Impurity)
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances.nlargest(15).plot(kind='barh')

# Permutation importances (more reliable than MDI for high-cardinality features)
from sklearn.inspection import permutation_importance
perm = permutation_importance(rf, X_test, y_test,
                              n_repeats=20, scoring='roc_auc', n_jobs=-1)
pi = pd.Series(perm.importances_mean, index=feature_names)
pi.sort_values().plot(kind='barh')
```
  • n_jobs — Set n_jobs=-1 to use all cores. Trees are built independently — perfect parallelism. Training time ≈ O(n_estimators / n_jobs).
  • MDI vs Permutation — MDI (built-in importance) overestimates high-cardinality and numeric features. For unbiased importance, use permutation_importance on the test set.
🚀
HistGradientBoostingClassifier / GradientBoostingClassifier
State-of-the-art on tabular data — histogram-based boosting with native NaN/categoricals support
GBM · boosting · HistGBM · tabular
```python
from sklearn.ensemble import HistGradientBoostingClassifier

HistGradientBoostingClassifier(
    max_iter=500,               # n_estimators equivalent
    learning_rate=0.05,         # shrinkage (lower = more iters needed)
    max_depth=6,                # depth of each tree
    max_leaf_nodes=31,          # controls complexity (LightGBM default=31)
    min_samples_leaf=20,        # regularize leaf size
    l2_regularization=0.1,      # L2 on leaf weights
    early_stopping=True,        # auto early stopping on a validation set
    validation_fraction=0.1,
    n_iter_no_change=20,        # patience
    categorical_features=None,  # list of categorical column indices
    random_state=42
)
```
```python
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform, randint

hgb = HistGradientBoostingClassifier(
    early_stopping=True, n_iter_no_change=30, random_state=42
)

param_dist = {
    'learning_rate':     loguniform(0.01, 0.3),
    'max_leaf_nodes':    randint(20, 80),
    'min_samples_leaf':  randint(10, 50),
    'l2_regularization': loguniform(1e-4, 1),
    'max_iter': [500]
}
rs = RandomizedSearchCV(hgb, param_dist, n_iter=40,
                        scoring='roc_auc', cv=5, n_jobs=-1, random_state=42)
rs.fit(X_train, y_train)
print(f'Best AUC: {rs.best_score_:.4f}')
print(rs.best_params_)
```
  • Histogram binning: HistGBM bins continuous values into max 255 bins. Dramatically reduces split-finding cost vs GBM (O(n) vs O(n log n)).
  • Native NaN support: HistGBM handles NaN natively — learns the best direction for missing values at each split. No imputation needed.
  • HistGBM vs GBM: HistGBM is 10-100× faster, especially on large datasets. Use HistGBM by default; switch to XGBoost/LightGBM for GPU or distributed training.

Support Vector Machines

✂️
SVC / LinearSVC / SVR
Maximum-margin classifier — kernels, C parameter, and scaling requirements
SVM · kernel · C · scaling required
```python
from sklearn.svm import SVC, LinearSVC, SVR

SVC(
    C=1.0,              # penalty; larger = less regularization
    kernel='rbf',       # 'linear','poly','rbf','sigmoid'
    gamma='scale',      # 'scale'=1/(n_feat*X.var()); 'auto'=1/n_feat
    probability=False,  # True enables predict_proba (Platt scaling, slow)
    class_weight='balanced'
)
LinearSVC(C=1.0, max_iter=2000)  # much faster than SVC(kernel='linear')
```
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Always scale before SVM!
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc',    SVC(C=10, kernel='rbf', gamma='scale',
                   class_weight='balanced'))
])
pipe.fit(X_train, y_train)

# Grid search over C and gamma
from sklearn.model_selection import GridSearchCV
param_grid = {
    'svc__C':     [0.1, 1, 10, 100],
    'svc__gamma': ['scale', 'auto', 0.001, 0.01]
}
gs = GridSearchCV(pipe, param_grid, cv=5,
                  scoring='roc_auc', n_jobs=-1)
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)
  • Scaling critical: SVM optimizes on distances. Unscaled features with large ranges dominate the kernel. Always use StandardScaler or MinMaxScaler before SVM.
  • RBF kernel bandwidth (gamma): Too small = underfitting (wide Gaussian); too large = overfitting (narrow Gaussian). Start with gamma='scale'.
  • probability=True cost: Enables predict_proba by fitting Platt scaling (extra 5-fold CV). Adds significant training time.
  • Scalability: SVC is O(n²–n³). Use LinearSVC or SGDClassifier for n > 50k samples.
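Why scaling is critical can be shown by blowing up one uninformative feature's scale (a sketch; the synthetic dataset and the ×1000 factor are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# shuffle=False keeps informative columns first; the last column is noise
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           n_redundant=2, shuffle=False, random_state=0)
X[:, -1] *= 1000  # put a noise feature on a huge scale

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unscaled: RBF distances are dominated by the noise column
raw = SVC(kernel='rbf').fit(X_tr, y_tr).score(X_te, y_te)
scaled = make_pipeline(StandardScaler(),
                       SVC(kernel='rbf')).fit(X_tr, y_tr).score(X_te, y_te)
print(f'unscaled accuracy: {raw:.3f}   scaled accuracy: {scaled:.3f}')
```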

K-Nearest Neighbors

🔵
KNeighborsClassifier / KNeighborsRegressor
Non-parametric lazy learner — distance metrics, k selection, and scaling requirements
KNN · non-parametric · distance
```python
from sklearn.neighbors import KNeighborsClassifier

KNeighborsClassifier(
    n_neighbors=5,       # k — tune via CV
    weights='uniform',   # 'distance': closer neighbors matter more
    metric='minkowski',  # 'euclidean','manhattan','cosine'
    p=2,                 # Minkowski p (2=Euclidean, 1=Manhattan)
    algorithm='auto',    # 'ball_tree','kd_tree','brute'
    leaf_size=30,        # for ball_tree/kd_tree
    n_jobs=-1
)
```
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Always scale — KNN is distance-based
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn',    KNeighborsClassifier(weights='distance'))
])

# Tune k via cross-validation
k_vals = range(1, 31)
cv_scores = []
for k in k_vals:
    pipe.set_params(knn__n_neighbors=k)
    score = cross_val_score(pipe, X_train, y_train,
                            cv=5, scoring='roc_auc').mean()
    cv_scores.append(score)

best_k = k_vals[np.argmax(cv_scores)]
print(f'Best k: {best_k}, AUC: {max(cv_scores):.4f}')

# Use for anomaly scoring: large mean neighbor distance = anomalous
# (NearestNeighbors is unsupervised — no labels needed)
nbrs = NearestNeighbors(n_neighbors=5).fit(X_train)
dists, _ = nbrs.kneighbors(X_test)
anomaly_score = dists.mean(axis=1)
```

Clustering

🌀
KMeans / DBSCAN / AgglomerativeClustering
Unsupervised segmentation — choosing k, density-based clustering, and cluster evaluation
KMeans · DBSCAN · clustering · elbow
```python
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

KMeans(
    n_clusters=8,
    init='k-means++',
    n_init=10,      # run n_init times, keep best inertia
    max_iter=300,
    random_state=42
)
DBSCAN(
    eps=0.5,             # max distance between neighbors
    min_samples=5,       # minimum neighbors to form a core point
    metric='euclidean'   # or 'cosine','manhattan'
)  # label -1 → noise/outlier
```
```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

# Elbow method + silhouette for choosing k
inertias, silhouettes = [], []
K_range = range(2, 12)
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    silhouettes.append(silhouette_score(X_scaled, labels))

# Use cluster labels as features
km_final = KMeans(n_clusters=5, n_init=10, random_state=42)
df['cluster'] = km_final.fit_predict(X_scaled)

# DBSCAN — discovers the number of clusters, labels outliers as -1
db = DBSCAN(eps=0.4, min_samples=10)
db_labels = db.fit_predict(X_scaled)
n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
n_noise = (db_labels == -1).sum()
print(f'Clusters: {n_clusters}, Noise: {n_noise}')

# Tune DBSCAN eps via the k-distance graph
from sklearn.neighbors import NearestNeighbors
nn = NearestNeighbors(n_neighbors=10).fit(X_scaled)
dists, _ = nn.kneighbors(X_scaled)
kth_dists = np.sort(dists[:, -1])[::-1]
# Plot kth_dists — the "elbow" marks a good eps value
```
  • k-means++: Smart initialization — seeds each new centroid with probability proportional to the squared distance from the nearest existing centroid. Reduces iterations and improves convergence vs random init.
  • DBSCAN eps selection: Plot sorted k-NN distances. The "elbow" (sharp bend) indicates the natural cluster density threshold.
  • DBSCAN vs KMeans: DBSCAN handles arbitrary shapes and finds outliers; KMeans assumes spherical clusters and is sensitive to outliers.
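The shape assumption shows up clearly on interlocking crescents (a sketch using `make_moons`; eps and noise values are illustrative):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

# DBSCAN traces the crescents; KMeans cuts them with a straight boundary
km_ari = adjusted_rand_score(y, km_labels)
db_ari = adjusted_rand_score(y, db_labels)
print(f'KMeans ARI: {km_ari:.3f}   DBSCAN ARI: {db_ari:.3f}')
```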

Cross-Validation & Hyperparameter Tuning