# Sklearn — Models

Linear models, tree ensembles, SVMs, neighbors, clustering, and hyperparameter search — implementation internals and practical guidance.

## Linear Regression Family

### LogisticRegression

Probabilistic linear classifier — solver selection, multi-class strategies, calibration.

**Syntax**
```python
from sklearn.linear_model import LogisticRegression

LogisticRegression(
    penalty='l2',        # 'l1', 'l2', 'elasticnet', None
    C=1.0,               # inverse regularization (smaller = stronger)
    solver='lbfgs',      # 'saga' for L1; 'liblinear' for small data
    multi_class='auto',  # 'ovr', 'multinomial' (lbfgs uses multinomial)
    max_iter=100,        # increase if ConvergenceWarning
    class_weight=None,   # 'balanced' for imbalanced classes
    n_jobs=-1
)
```
**Example**

```python
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV

# Built-in CV tuning of C
lr = LogisticRegressionCV(
    Cs=20,  # 20 values of C on a log scale
    cv=5, scoring='roc_auc',
    solver='lbfgs', max_iter=500,
    class_weight='balanced', n_jobs=-1
)
lr.fit(X_train_scaled, y_train)
print(f'Best C: {lr.C_[0]:.4f}')

# Predict probabilities
y_prob = lr.predict_proba(X_test)[:, 1]

# Coefficient analysis (after standardization: comparable magnitudes)
feat_coef = pd.Series(lr.coef_[0], index=feature_names)
feat_coef.abs().sort_values(ascending=False).head(10)
```

**Internals**
| Solver | Penalty | Best For |
|---|---|---|
| lbfgs | L2 | Default; multiclass; medium datasets |
| liblinear | L1, L2 | Small datasets; binary |
| saga | L1, L2, ElasticNet | Large datasets; sparse features |
| sag | L2 | Large dense datasets |
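Per the table, the ElasticNet penalty is only available with the saga solver. A minimal sketch on synthetic data (the dataset and all parameter values here are illustrative assumptions, not from the source):

```python
# ElasticNet-penalized logistic regression requires solver='saga';
# the L1 component can zero out uninformative coefficients.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=42)
X = StandardScaler().fit_transform(X)  # saga converges faster on scaled data

lr = LogisticRegression(penalty='elasticnet', solver='saga',
                        l1_ratio=0.5,   # mix of L1 (sparsity) and L2 (shrinkage)
                        C=0.1, max_iter=5000, random_state=42)
lr.fit(X, y)
n_zero = (lr.coef_ == 0).sum()  # coefficients eliminated by the L1 part
print(f'Zeroed coefficients: {n_zero} / {lr.coef_.size}')
```

Passing `penalty='elasticnet'` to any other solver raises a `ValueError`, which is why the solver/penalty table above matters in practice.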
## Tree Ensembles

### RandomForestClassifier / RandomForestRegressor

Bagging of decision trees — OOB score, feature importance, and variance reduction.

**Syntax**
```python
from sklearn.ensemble import RandomForestClassifier

RandomForestClassifier(
    n_estimators=300,      # more trees = lower variance (plateaus ~300)
    max_depth=None,        # None = full depth; set 10-30 to reduce overfitting
    max_features='sqrt',   # features tried per split: sqrt(n) for classification
    min_samples_leaf=2,    # min samples per leaf (regularizes)
    oob_score=True,        # free out-of-bag validation estimate
    class_weight='balanced',
    n_jobs=-1, random_state=42
)
```
**Example**

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                            n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

# OOB score: unbiased estimate without touching the test set
print(f'OOB accuracy: {rf.oob_score_:.4f}')

# Feature importances (MDI — Mean Decrease in Impurity)
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances.nlargest(15).plot(kind='barh')

# Permutation importances (more reliable than MDI for high-cardinality features)
from sklearn.inspection import permutation_importance
perm = permutation_importance(rf, X_test, y_test,
                              n_repeats=20, scoring='roc_auc', n_jobs=-1)
pi = pd.Series(perm.importances_mean, index=feature_names)
pi.sort_values().plot(kind='barh')
```
**Performance**

- n_jobs: set n_jobs=-1 to use all cores. Trees are built independently — perfect parallelism. Training time ~ O(n_estimators / n_jobs).
- MDI vs permutation: MDI (the built-in importance) overestimates high-cardinality and numeric features. For unbiased importance, use permutation_importance on the test set.

### HistGradientBoostingClassifier / GradientBoostingClassifier
State-of-the-art on tabular data — histogram-based boosting with native NaN/categorical support.

**Syntax**
```python
from sklearn.ensemble import HistGradientBoostingClassifier

HistGradientBoostingClassifier(
    max_iter=500,               # n_estimators equivalent
    learning_rate=0.05,         # shrinkage (lower = more iterations needed)
    max_depth=6,                # depth of each tree
    max_leaf_nodes=31,          # controls complexity (LightGBM default = 31)
    min_samples_leaf=20,        # regularizes leaf size
    l2_regularization=0.1,      # L2 on leaf weights
    early_stopping=True,        # automatic early stopping on a validation set
    validation_fraction=0.1,
    n_iter_no_change=20,        # patience
    categorical_features=None,  # list of categorical column indices
    random_state=42
)
```
**Example**

```python
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform, randint

hgb = HistGradientBoostingClassifier(
    early_stopping=True, n_iter_no_change=30, random_state=42
)
param_dist = {
    'learning_rate': loguniform(0.01, 0.3),
    'max_leaf_nodes': randint(20, 80),
    'min_samples_leaf': randint(10, 50),
    'l2_regularization': loguniform(1e-4, 1),
    'max_iter': [500]
}
rs = RandomizedSearchCV(hgb, param_dist, n_iter=40,
                        scoring='roc_auc', cv=5, n_jobs=-1, random_state=42)
rs.fit(X_train, y_train)
print(f'Best AUC: {rs.best_score_:.4f}')
print(rs.best_params_)
```

**Internals**
- Histogram binning: HistGBM bins continuous values into max 255 bins. Dramatically reduces split-finding cost vs GBM (O(n) vs O(n log n)).
- Native NaN support: HistGBM handles NaN natively — learns the best direction for missing values at each split. No imputation needed.
- HistGBM vs GBM: HistGBM is 10-100× faster, especially on large datasets. Use HistGBM by default; switch to XGBoost/LightGBM for GPU or distributed training.
## Support Vector Machines

### SVC / LinearSVC / SVR

Maximum-margin classifier — kernels, the C parameter, and scaling requirements.

**Syntax**
```python
from sklearn.svm import SVC, LinearSVC, SVR

SVC(
    C=1.0,              # penalty; larger = less regularization
    kernel='rbf',       # 'linear', 'poly', 'rbf', 'sigmoid'
    gamma='scale',      # 'scale' = 1/(n_features * X.var()); 'auto' = 1/n_features
    probability=False,  # True enables predict_proba (Platt scaling, slow)
    class_weight='balanced'
)

LinearSVC(C=1.0, max_iter=2000)  # much faster than SVC(kernel='linear')
```
**Example**

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Always scale before SVM!
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(C=10, kernel='rbf', gamma='scale',
                class_weight='balanced'))
])
pipe.fit(X_train, y_train)

# Grid search over C and gamma
from sklearn.model_selection import GridSearchCV
param_grid = {
    'svc__C': [0.1, 1, 10, 100],
    'svc__gamma': ['scale', 'auto', 0.001, 0.01]
}
gs = GridSearchCV(pipe, param_grid, cv=5,
                  scoring='roc_auc', n_jobs=-1)
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)
```

**Internals**
- Scaling is critical: SVM optimizes over distances, so unscaled features with large ranges dominate the kernel. Always use StandardScaler or MinMaxScaler before SVM.
- RBF kernel bandwidth (gamma): too small = underfitting (wide Gaussian); too large = overfitting (narrow Gaussian). Start with gamma='scale'.
- probability=True cost: enables predict_proba by fitting Platt scaling (an extra internal 5-fold CV), which adds significant training time.
- Scalability: SVC training is O(n²)–O(n³). Use LinearSVC or SGDClassifier for n > 50k samples.
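When LinearSVC is chosen for scalability, probabilities can still be obtained by wrapping it in CalibratedClassifierCV (the same Platt-scaling idea mentioned above). A sketch on synthetic data — the dataset and parameter values are illustrative assumptions:

```python
# LinearSVC has no predict_proba; CalibratedClassifierCV fits a
# sigmoid (Platt) calibration on top, avoiding SVC's O(n^2) cost.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
pipe = Pipeline([
    ('scaler', StandardScaler()),      # scaling still matters for linear SVMs
    ('svc', LinearSVC(C=1.0, max_iter=5000)),
])
clf = CalibratedClassifierCV(pipe, method='sigmoid', cv=3)  # Platt scaling
clf.fit(X, y)
proba = clf.predict_proba(X[:5])       # now available, shape (5, 2)
print(proba)
```

This is also a common alternative to `SVC(probability=True)`, since it makes the calibration step explicit instead of hiding it inside the fit.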
## K-Nearest Neighbors

### KNeighborsClassifier / KNeighborsRegressor

Non-parametric lazy learner — distance metrics, k selection, and scaling requirements.

**Syntax**
```python
from sklearn.neighbors import KNeighborsClassifier

KNeighborsClassifier(
    n_neighbors=5,       # k — tune via CV
    weights='uniform',   # 'distance': closer neighbors matter more
    metric='minkowski',  # 'euclidean', 'manhattan', 'cosine'
    p=2,                 # Minkowski p (2 = Euclidean, 1 = Manhattan)
    algorithm='auto',    # 'ball_tree', 'kd_tree', 'brute'
    leaf_size=30,        # for ball_tree/kd_tree
    n_jobs=-1
)
```
**Example**

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Always scale — KNN is distance-based
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(weights='distance'))
])

# Tune k via cross-validation
k_vals = range(1, 31)
cv_scores = []
for k in k_vals:
    pipe.set_params(knn__n_neighbors=k)
    score = cross_val_score(pipe, X_train, y_train,
                            cv=5, scoring='roc_auc').mean()
    cv_scores.append(score)
best_k = k_vals[np.argmax(cv_scores)]
print(f'Best k: {best_k}, AUC: {max(cv_scores):.4f}')

# Anomaly scoring with unsupervised NearestNeighbors:
# low mean neighbor distance = normal, high = anomalous
nbrs = NearestNeighbors(n_neighbors=5)
nbrs.fit(X_train)
dists, _ = nbrs.kneighbors(X_test)
anomaly_score = dists.mean(axis=1)
```
## Clustering

### KMeans / DBSCAN / AgglomerativeClustering

Unsupervised segmentation — choosing k, density-based clustering, and cluster evaluation.

**Syntax**
```python
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

KMeans(
    n_clusters=8, init='k-means++',
    n_init=10,           # run n_init times, keep best inertia
    max_iter=300,
    random_state=42
)

DBSCAN(
    eps=0.5,             # max distance between neighbors
    min_samples=5,       # minimum neighbors to form a core point
    metric='euclidean'   # or 'cosine', 'manhattan'
)  # label == -1 → noise/outlier
```
**Example**

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

# Elbow method + silhouette for choosing k
inertias, silhouettes = [], []
K_range = range(2, 12)
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    silhouettes.append(silhouette_score(X_scaled, labels))

# Use cluster labels as features
km_final = KMeans(n_clusters=5, n_init=10, random_state=42)
df['cluster'] = km_final.fit_predict(X_scaled)

# DBSCAN — discovers the number of clusters itself, labels outliers as -1
db = DBSCAN(eps=0.4, min_samples=10)
db_labels = db.fit_predict(X_scaled)
n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
n_noise = (db_labels == -1).sum()
print(f'Clusters: {n_clusters}, Noise: {n_noise}')

# Tune DBSCAN eps via the k-distance graph
from sklearn.neighbors import NearestNeighbors
nn = NearestNeighbors(n_neighbors=10).fit(X_scaled)
dists, _ = nn.kneighbors(X_scaled)
kth_dists = np.sort(dists[:, -1])[::-1]
# Plot kth_dists — the "elbow" marks a good eps value
```

**Internals**
- k-means++: Smart initialization — seeds centroids proportional to distance from existing centroids. Reduces iterations and improves convergence vs random init.
- DBSCAN eps selection: Plot sorted k-NN distances. The "elbow" (sharp bend) indicates the natural cluster density threshold.
- DBSCAN vs KMeans: DBSCAN handles arbitrary shapes and finds outliers; KMeans assumes spherical clusters and is sensitive to outliers.
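The shape-sensitivity contrast above is easy to demonstrate on the classic two-moons dataset — a sketch with illustrative, non-tuned parameter values:

```python
# KMeans assumes spherical clusters and cuts the moons in half;
# DBSCAN follows the density and recovers the true shapes.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=500, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand Index vs the true moon membership (1.0 = perfect)
print(f'KMeans ARI: {adjusted_rand_score(y, km_labels):.2f}')
print(f'DBSCAN ARI: {adjusted_rand_score(y, db_labels):.2f}')
```

On data like this, DBSCAN's ARI is far higher than KMeans'; the trade-off is that eps must match the data's density scale, which is what the k-distance graph above is for.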
## Cross-Validation & Hyperparameter Tuning

### cross_validate / GridSearchCV / RandomizedSearchCV / HalvingRandomSearchCV

Reliable model evaluation and efficient hyperparameter search strategies.

**Syntax**
```python
from sklearn.model_selection import (
    cross_validate, GridSearchCV,
    RandomizedSearchCV, StratifiedKFold
)

cross_validate(
    estimator, X, y,
    cv=5,
    scoring=['roc_auc', 'f1', 'average_precision'],
    return_train_score=True,
    n_jobs=-1
)  # returns a dict of arrays

RandomizedSearchCV(
    estimator, param_distributions,
    n_iter=50,    # random combinations to try
    scoring='roc_auc', cv=5,
    refit=True,   # refit best on full training data
    n_jobs=-1, random_state=42
)
```
**Example**

```python
from sklearn.model_selection import cross_validate, StratifiedKFold
from scipy.stats import loguniform, randint

# Multiple metrics in one CV run
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(pipe, X_train, y_train, cv=cv,
                         scoring={'auc': 'roc_auc', 'ap': 'average_precision',
                                  'f1': 'f1'},
                         return_train_score=True, n_jobs=-1)
print(f"Val AUC: {results['test_auc'].mean():.4f} ± {results['test_auc'].std():.4f}")

# Detect overfitting: a large train-validation gap = overfit
gap = results['train_auc'].mean() - results['test_auc'].mean()
print(f'Train-Val gap: {gap:.4f}')

# HalvingRandomSearchCV: faster search via successive halving
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingRandomSearchCV
hrsv = HalvingRandomSearchCV(
    hgb, param_dist, factor=3,  # 3× more resources each round
    scoring='roc_auc', cv=5, n_jobs=-1, random_state=42
)
hrsv.fit(X_train, y_train)
print(hrsv.best_params_)
```

**Strategy Guide**
| Strategy | Grid Size | Speed | Best For |
|---|---|---|---|
| GridSearchCV | Small (≤100) | 🐢 Slow | When exact grid is known; exhaustive |
| RandomizedSearchCV | Large continuous | 🚀 Fast | Most cases; use loguniform for LR, C |
| HalvingRandomSearch | Very large | ⚡ Fastest | Large datasets; many hyperparams |
| Optuna / BOHB | Any | 🏎️ Smart | Bayesian optimization; complex search spaces |
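Whichever search strategy is chosen, its `best_score_` is optimistically biased because the same folds both selected and scored the winner. Nested CV avoids this; a sketch on synthetic data (the dataset, grid, and fold counts are illustrative assumptions):

```python
# Nested CV: the inner GridSearchCV tunes C, the outer loop scores
# the tuned model on folds the search never saw.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=600, n_features=20, random_state=42)

inner = GridSearchCV(LogisticRegression(max_iter=1000),
                     {'C': [0.01, 0.1, 1, 10]},
                     scoring='roc_auc', cv=3)          # inner: model selection
scores = cross_val_score(inner, X, y, cv=5,
                         scoring='roc_auc')            # outer: unbiased estimate
print(f'Nested CV AUC: {scores.mean():.3f} ± {scores.std():.3f}')
```

Report the outer-loop mean as the generalization estimate; the inner search's `best_score_` should only guide hyperparameter choice.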