Sklearn — Preprocessing
train_test_split, Scalers, Encoders, Imputers, ColumnTransformer, and Pipeline — with internals, parameters, and data-leakage prevention.
Data Splitting
Feature Scaling
StandardScaler
Standardize features: zero mean, unit variance — required by most linear models and neural networks
▾
Syntax
Internals
Example
When to Use
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(
    copy=True,        # False modifies data in-place
    with_mean=True,   # subtract mean
    with_std=True     # divide by std
)
scaler.fit(X_train)                        # learns mean_ and scale_
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)        # use SAME scaler
# Or combined:
X_train_s = scaler.fit_transform(X_train)  # fit+transform on train
Formula per feature: z = (x − mean) / std
- fit(X_train) — stores mean_ and scale_ (std) per feature, computed only on training data.
- transform(X) — applies the stored mean/std. Never call fit on test data.
- Outliers inflate the std, pulling scaled values toward zero — use RobustScaler when outliers are present.
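A quick sanity check of the formula and the stored attributes, on a toy two-feature array (illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])

scaler = StandardScaler().fit(X_train)   # learns mean_ and scale_ per column
manual = (X_train - scaler.mean_) / scaler.scale_

# transform() is exactly the stored-formula application
assert np.allclose(scaler.transform(X_train), manual)
print(scaler.mean_)   # [ 2. 20.]
print(scaler.scale_)  # population std (ddof=0): [0.8165 8.1650]
```

Note that scale_ is the population standard deviation (ddof=0), which is why it differs slightly from numpy's default `np.std(..., ddof=1)`.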
Data Leakage
NEVER call fit_transform(X_all) before splitting. NEVER call scaler.fit(X_test). The scaler must only see training data.
python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train) # fit on train, transform train
X_test_s = scaler.transform(X_test) # transform test with SAME params
# Inspect learned parameters
print(scaler.mean_) # mean of each feature
print(scaler.scale_) # std of each feature
# Invert scaling for predictions in original units
y_pred_scaled = model.predict(X_test_s)
# If target was also scaled:
target_scaler = StandardScaler().fit(y_train.values.reshape(-1, 1))
y_pred_orig = target_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1))
| Scaler | Formula | Best For | Outlier Sensitive? |
|---|---|---|---|
| StandardScaler | (x-μ)/σ | Linear models, SVM, Neural nets, PCA | YES |
| MinMaxScaler | (x-min)/(max-min) | Neural nets when bounded [0,1] needed | YES (extreme) |
| RobustScaler | (x-median)/IQR | Data with significant outliers | NO |
| None needed | — | Tree models (RF, GBM, XGBoost) | N/A |
MinMaxScaler / RobustScaler
Scale to fixed range [0,1] or use robust median/IQR statistics
▾
Syntax
Example
from sklearn.preprocessing import MinMaxScaler, RobustScaler
MinMaxScaler(feature_range=(0, 1), copy=True)
# Formula: (x - min) / (max - min) → scales to [0,1]
# feature_range=(−1,1) for tanh-activated networks
RobustScaler(
with_centering=True,
with_scaling=True,
quantile_range=(25.0, 75.0) # IQR range (default = Q1–Q3)
)
# Formula: (x - median) / IQR → outliers don't skew scale
python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import numpy as np
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 5))
# Inject outliers
X[0, :] = 100
# StandardScaler — distorted by outlier
ss = StandardScaler().fit(X)
print('STD scale:', ss.scale_) # inflated by outlier
# RobustScaler — resistant to outlier
rs = RobustScaler().fit(X)
print('Robust scale (IQR):', rs.scale_)
# MinMaxScaler for pixels/images → [0, 1] (pixel_array = your raw image data)
mm = MinMaxScaler().fit(pixel_array)
pixels_norm = mm.transform(pixel_array)
Categorical Encoding
LabelEncoder / OrdinalEncoder
Convert string categories to integer codes — with important usage constraints
▾
Syntax
Example
Mistakes
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
# LabelEncoder: for TARGET variable only (1D)
le = LabelEncoder()
y_enc = le.fit_transform(y)           # 'cat'→0, 'dog'→1
y_orig = le.inverse_transform(y_enc)  # decode back
# OrdinalEncoder: for FEATURES (2D), supports multiple columns
oe = OrdinalEncoder(
categories='auto',
handle_unknown='use_encoded_value',
unknown_value=-1
)
X_enc = oe.fit_transform(X[cat_cols])
python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test) # use fitted encoder on test
print(le.classes_) # ['cat' 'dog' 'fish']
print(le.transform(['dog'])) # [1]
# Decode model predictions back to labels
y_pred = model.predict(X_test)
y_pred_labels = le.inverse_transform(y_pred)
- LabelEncoder on features (not target): assigns arbitrary alphabetical integers (cat=0, dog=1, fish=2) — implies cat < dog < fish, which is false for nominal data. Use OneHotEncoder for features.
- LabelEncoder with 2D arrays: it is 1D only. Calling it on a single DataFrame column (a Series) works; calling it on a whole DataFrame raises an error.
- Unseen categories at test time: raise a ValueError. Use OrdinalEncoder with handle_unknown='use_encoded_value'.
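A minimal sketch of the safe configuration — categories never seen during fit map to -1 instead of raising:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

train = np.array([['cat'], ['dog'], ['fish']])
test = np.array([['dog'], ['lizard']])  # 'lizard' never seen in training

oe = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
oe.fit(train)  # categories sorted alphabetically: cat=0, dog=1, fish=2
print(oe.transform(test))  # [[ 1.] [-1.]] — unseen category becomes -1
```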
OneHotEncoder
Encode nominal categories as binary vectors — the correct way for ML features
▾
Syntax
Example
Performance
from sklearn.preprocessing import OneHotEncoder
OneHotEncoder(
    categories='auto',       # or list of arrays per feature
    drop=None,               # 'first' to avoid dummy trap
    sparse_output=True,      # use sparse matrix output
    handle_unknown='error',  # 'ignore' for prod inference
    min_frequency=None       # group rare categories into 'infrequent'
)
python
from sklearn.preprocessing import OneHotEncoder
import numpy as np
cats = np.array([['cat'], ['dog'], ['cat'], ['fish']])
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_enc = ohe.fit_transform(cats)
print(X_enc)
# [[1. 0. 0.] ← cat
# [0. 1. 0.] ← dog
# [1. 0. 0.] ← cat
# [0. 0. 1.]] ← fish
print(ohe.categories_) # [array(['cat', 'dog', 'fish'])]
# Production: use inside Pipeline to handle unknown categories
ohe_prod = OneHotEncoder(
handle_unknown='ignore', # unknown → all zeros at inference
min_frequency=100, # rare cats < 100 occurrences → 'infrequent_sklearn'
sparse_output=True # save memory for high-cardinality
)
Sparse vs Dense
sparse_output=True (default) uses scipy sparse matrices — up to 100× less memory for high-cardinality features.
High Cardinality
Use min_frequency to group rare categories. For 10K+ unique values, consider target encoding or hashing instead.
Missing Value Imputation
SimpleImputer / KNNImputer
Fill missing values using statistical strategies — always inside a Pipeline
▾
Syntax
Example
Mistakes
from sklearn.impute import SimpleImputer, KNNImputer
SimpleImputer(
    missing_values=np.nan,
    strategy='mean',           # 'mean' | 'median' | 'most_frequent' | 'constant'
    fill_value=None,           # used when strategy='constant'
    keep_empty_features=False
)
KNNImputer(
    n_neighbors=5,             # k nearest neighbors to average
    weights='uniform',         # 'distance' weights closer neighbors more
    metric='nan_euclidean'
)
python
from sklearn.impute import SimpleImputer, KNNImputer
# Numeric imputer — median is outlier-robust
num_imputer = SimpleImputer(strategy='median')
X_train_imp = num_imputer.fit_transform(X_train[num_cols])
X_test_imp = num_imputer.transform(X_test[num_cols])
# Categorical imputer
cat_imputer = SimpleImputer(strategy='most_frequent')
# KNN imputer — better accuracy, higher cost
knn_imp = KNNImputer(n_neighbors=5)
X_knn = knn_imp.fit_transform(X_train) # uses kNN to fill NaN
# ✅ Always use inside Pipeline to prevent leakage
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
('model', LogisticRegression())
])
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
- Fitting on full data: Computing mean/median on (train+test) before split is data leakage. Fit imputer only on train, transform both.
- KNNImputer on large data: O(n²) complexity — very slow for datasets with 100K+ rows. Use SimpleImputer or IterativeImputer for large datasets.
- strategy='mean' on skewed distributions gives a biased fill. Prefer 'median' for robustness.
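A toy illustration of the mean-vs-median point — a single large value drags the mean fill far from the bulk of the data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Right-skewed column (think income) with one missing entry
x = np.array([[1.0], [2.0], [2.0], [3.0], [100.0], [np.nan]])

mean_imp = SimpleImputer(strategy='mean').fit(x)
median_imp = SimpleImputer(strategy='median').fit(x)

print(mean_imp.statistics_)    # [21.6] — dragged up by the outlier
print(median_imp.statistics_)  # [2.0]  — robust to the outlier
```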
Pipeline & ColumnTransformer
Pipeline / ColumnTransformer
Chain transformers and estimators — the correct way to avoid data leakage in production
▾
Syntax
Internals
Full Example
Mistakes
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
Pipeline(
    steps=[                  # list of (name, estimator) tuples
        ('step_name', estimator),
        ('model', classifier)
    ],
    memory=None              # path to cache fitted transformers
)
ColumnTransformer(
    transformers=[           # list of (name, transformer, columns)
        ('num', numeric_pipeline, num_cols),
        ('cat', cat_pipeline, cat_cols)
    ],
    remainder='drop',        # 'passthrough' to keep unspecified cols
    n_jobs=-1
)
How Pipeline.fit() works:
- Calls fit_transform(X, y) on each step except the last.
- On the last step (the final estimator), calls fit(X_transformed, y).
- During predict(X_test), calls transform(X_test) on every step except the last, then predict.
- Because fit and transform are always called in sequence on the same data, leakage is architecturally impossible within the pipeline.
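The mechanics above can be sketched as a hand-rolled loop — a simplified illustration of the call sequence, not sklearn's actual implementation:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def manual_pipeline_fit(steps, X, y):
    """What Pipeline.fit does: fit_transform every step but the last."""
    for name, est in steps[:-1]:
        X = est.fit_transform(X, y)
    steps[-1][1].fit(X, y)  # final estimator: fit only

def manual_pipeline_predict(steps, X):
    """What Pipeline.predict does: transform (never fit!) then predict."""
    for name, est in steps[:-1]:
        X = est.transform(X)
    return steps[-1][1].predict(X)

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 3))
y = (X[:, 0] > 0).astype(int)
X[::10, 1] = np.nan  # inject some missing values

steps = [('imputer', SimpleImputer(strategy='median')),
         ('scaler', StandardScaler()),
         ('model', LogisticRegression())]
manual_pipeline_fit(steps, X, y)
preds_manual = manual_pipeline_predict(steps, X)

# The hand-rolled loop matches sklearn's Pipeline on the same data
pipe = Pipeline([('imputer', SimpleImputer(strategy='median')),
                 ('scaler', StandardScaler()),
                 ('model', LogisticRegression())]).fit(X, y)
assert np.array_equal(preds_manual, pipe.predict(X))
```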
GridSearchCV + Pipeline
When using GridSearchCV with a Pipeline, parameter names use double underscore notation:
{'model__C': [0.1, 1, 10], 'scaler__with_mean': [True]}
python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
df = pd.read_csv('titanic.csv')
X = df.drop('survived', axis=1)
y = df['survived']
num_cols = ['age', 'fare', 'pclass']
cat_cols = ['sex', 'embarked']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Numeric pipeline
num_pipe = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# Categorical pipeline
cat_pipe = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
# ColumnTransformer: apply different preprocessing per column type
preprocessor = ColumnTransformer([
('num', num_pipe, num_cols),
('cat', cat_pipe, cat_cols)
])
# Full pipeline: preprocessing + model
full_pipeline = Pipeline([
('preprocessor', preprocessor),
('model', RandomForestClassifier(n_estimators=100, random_state=42))
])
full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
# GridSearchCV with pipeline params
from sklearn.model_selection import GridSearchCV
param_grid = {
'model__n_estimators': [50, 100, 200],
'model__max_depth': [None, 5, 10],
'preprocessor__num__scaler__with_mean': [True]
}
gs = GridSearchCV(full_pipeline, param_grid, cv=5, n_jobs=-1)
gs.fit(X_train, y_train)
print(gs.best_params_)
print(gs.best_score_)
- Forgetting to use Pipeline: Manual fit_transform on full data before cross-validation leaks information into every fold.
- Calling fit_transform on test data: Every transformer in the pipeline must only be fitted on training data — Pipeline enforces this automatically.
- Wrong step order: Imputation must come before Scaling. Scaling must come before model fitting. Pipeline enforces sequential execution.
- ColumnTransformer drops unlisted columns by default — set remainder='passthrough' to keep them.
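A minimal sketch of the remainder behavior, with made-up column names:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'age': [20.0, 30.0, 40.0],
                   'fare': [7.0, 14.0, 21.0],
                   'cabin_id': [1, 2, 3]})  # NOT listed in transformers

dropper = ColumnTransformer([('num', StandardScaler(), ['age', 'fare'])])
keeper = ColumnTransformer([('num', StandardScaler(), ['age', 'fare'])],
                           remainder='passthrough')

print(dropper.fit_transform(df).shape)  # (3, 2) — cabin_id silently dropped
print(keeper.fit_transform(df).shape)   # (3, 3) — cabin_id kept unchanged
```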
Dimensionality Reduction
PCA — Principal Component Analysis
Project data to its most informative directions — for visualization, noise removal, and collinearity removal
▾
Syntax
Example
Internals
from sklearn.decomposition import PCA
pca = PCA(
    n_components=None,   # int = exact count; float in (0,1) = keep that fraction of variance
    whiten=False,        # normalize components to unit variance
    svd_solver='auto',   # 'full' | 'arpack' | 'randomized' | 'auto'
    random_state=42
)
pca.fit(X_train_scaled)                # ALWAYS fit on SCALED data
X_pca = pca.transform(X_train_scaled)
# Key attributes after fit:
pca.explained_variance_ratio_   # per-component variance fraction
pca.components_                 # eigenvectors (loadings)
pca.n_components_               # actual n if auto-selected
python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
# 1. Scale first (PCA is variance-based)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
# 2. Fit PCA: keep 95% of variance
pca = PCA(n_components=0.95, random_state=42)
X_pca = pca.fit_transform(X_scaled)
print(f'Original: {X_scaled.shape[1]} features → PCA: {pca.n_components_}')
# Scree plot — find elbow
evr = pca.explained_variance_ratio_
cum_evr = np.cumsum(evr)
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(range(len(evr)), evr, label='Individual')
ax.plot(range(len(evr)), cum_evr, 'r--', marker='o', label='Cumulative')
ax.axhline(0.95, ls='--', color='gray', label='95% threshold')
ax.legend(); ax.set_xlabel('Component'); ax.set_ylabel('Explained Variance')
# Use in Pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=0.95)),
('model', LogisticRegression())
])
pipe.fit(X_train, y_train)
- Algorithm: Centers the data, then runs Singular Value Decomposition (SVD) on it — equivalent to an eigendecomposition of the covariance matrix. Principal components = eigenvectors of the covariance matrix, sorted by eigenvalue (variance) descending.
- svd_solver choices: 'full' = exact (slow for large n); 'randomized' = Halko et al. approximation — use for large datasets (n_components ≪ n_features); 'arpack' = sparse data.
- Whitening: whiten=True divides each component by its standard deviation → unit variance per component. Helps algorithms sensitive to feature scale (e.g., SVM, Gaussian mixture models).
- Inverse transform: pca.inverse_transform(X_pca) projects back to original space — useful for visualization and anomaly detection (large reconstruction error = anomaly).
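A sketch of reconstruction-error anomaly detection via inverse_transform, on synthetic data lying near a 2D plane inside 5D space:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Data that (almost) lives on a random 2D plane in 5D
latent = rng.standard_normal((200, 2))
mixing = rng.standard_normal((2, 5))
X = latent @ mixing + 0.01 * rng.standard_normal((200, 5))

pca = PCA(n_components=2).fit(X)
X_rec = pca.inverse_transform(pca.transform(X))
normal_err = np.linalg.norm(X - X_rec, axis=1).max()  # tiny: just the noise

# A point far off the learned plane reconstructs poorly → flag as anomaly
outlier = np.full((1, 5), 10.0)
out_rec = pca.inverse_transform(pca.transform(outlier))
outlier_err = np.linalg.norm(outlier - out_rec)
print(outlier_err > normal_err)  # the outlier's reconstruction error dominates
```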
Feature Selection
SelectKBest / SelectFromModel / RFE
Remove irrelevant features — reduce overfitting and training time, and improve interpretability
▾
Syntax
Example
Which Method?
from sklearn.feature_selection import (
SelectKBest, f_classif, mutual_info_classif,
SelectFromModel, RFE, RFECV
)
# Univariate: score each feature independently
SelectKBest(score_func=f_classif, k=10)
# Model-based: use model's feature_importances_
SelectFromModel(estimator, threshold='mean')
# Recursive elimination: iteratively remove weakest
RFE(estimator, n_features_to_select=20, step=1)
RFECV(estimator, cv=5, scoring='roc_auc') # auto-selects k
python
from sklearn.feature_selection import (
SelectKBest, f_classif, mutual_info_classif,
SelectFromModel, RFECV
)
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
# ── Method 1: Univariate (ANOVA F-test)
selector = SelectKBest(f_classif, k=15)
X_sel = selector.fit_transform(X_train, y_train)
selected_mask = selector.get_support()
selected_feats = X_train.columns[selected_mask].tolist()
print('Selected:', selected_feats)
# ── Method 2: Model-based (tree feature importances)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
sfm = SelectFromModel(rf, threshold='0.5*mean')
sfm.fit(X_train, y_train)
X_sfm = sfm.transform(X_train)
feat_imp = pd.Series(sfm.estimator_.feature_importances_, index=X_train.columns)  # sfm fits a clone; rf itself stays unfitted
print(feat_imp.sort_values(ascending=False).head(10))
# ── Method 3: RFECV — cross-validated RFE
rfecv = RFECV(
estimator=RandomForestClassifier(n_estimators=50, random_state=42),
cv=5, scoring='roc_auc', step=1, n_jobs=-1
)
rfecv.fit(X_train, y_train)
print(f'Optimal features: {rfecv.n_features_}')
X_rfecv = rfecv.transform(X_train)
# Use inside Pipeline to prevent leakage
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
('selector', SelectKBest(f_classif, k=10)),
('scaler', StandardScaler()),
('model', LogisticRegression())
])
pipe.fit(X_train, y_train)
| Method | Speed | Considers Interactions | Best For |
|---|---|---|---|
| SelectKBest (f_classif) | ⚡ Fast | ❌ No | Quick baseline, linear relationships |
| SelectKBest (mutual_info) | Medium | Partially | Non-linear relationships, classification |
| SelectFromModel | Medium | ✅ Yes | Tree-based importance, fast model selection |
| RFE / RFECV | 🐢 Slow | ✅ Yes | Best accuracy, high-stakes feature selection |
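A small illustration of the f_classif vs mutual_info row — a symmetric, nonlinear dependence that the F-test (which compares class means) cannot see, on synthetic data:

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 1000)
y = (np.abs(x) > 1.5).astype(int)  # nonlinear, symmetric relationship
X = x.reshape(-1, 1)

# Both class means sit near zero, so the F-score is small...
f_scores, _ = f_classif(X, y)
# ...but mutual information detects the dependence clearly
mi_scores = mutual_info_classif(X, y, random_state=0)
print(f_scores, mi_scores)
```

The feature fully determines the label here, so the mutual information estimate is substantial while the F-statistic stays at noise level — exactly the gap the table's "Considers Interactions / Non-linear" column describes.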