Data Splitting

✂️
train_test_split()
Split arrays into train/test subsets — the most critical preprocessing step
splitting · stratify · random_state
from sklearn.model_selection import train_test_split

train_test_split(
    *arrays,
    test_size=None,      # float (0-1) or int
    train_size=None,
    random_state=None,
    shuffle=True,
    stratify=None
)
Parameters:

  • *arrays (array-like, required): Any number of arrays with the same first dimension (n_samples). Always pass X and y together so the splits stay aligned.
  • test_size (float or int, default None → 0.25): Float = proportion (0.0–1.0), int = absolute number of samples. If None, the complement of train_size. Standard ML ratios: 0.2 (80/20) or 0.1 (90/10 for large datasets).
  • random_state (int or None, default None): Seeds the random number generator. Always set this for reproducibility. Any integer works; 42 is community convention.
  • shuffle (bool, default True): Shuffle before splitting. Set False only for time-series data where temporal ordering must be preserved.
  • stratify (array-like or None, default None): If provided, the split preserves class proportions from this array. Essential for imbalanced classification. Pass y for standard stratification.

How it works internally:

  • Step 1 — Validate: Checks all arrays have the same first dimension (n_samples).
  • Step 2 — Shuffle: Creates a permuted index array using np.random.permutation(n) seeded with random_state.
  • Step 3 — Split: Slices the permuted indices at the split boundary (a minimal NumPy sketch of steps 2–3 follows after this list).
  • Stratified split: Uses StratifiedShuffleSplit internally — splits each class separately, then concatenates to maintain proportions exactly.
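A minimal NumPy sketch of steps 2–3 above (the unstratified path). This illustrates the idea only, not scikit-learn's actual implementation, and it assumes NumPy arrays rather than DataFrames:

python
import numpy as np

def toy_train_test_split(X, y, test_size=0.25, random_state=None):
    # Step 2: permute the row indices
    rng = np.random.default_rng(random_state)
    idx = rng.permutation(len(X))
    # Step 3: slice the permuted indices at the split boundary
    n_test = int(np.ceil(len(X) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_tr, X_te, y_tr, y_te = toy_train_test_split(X, y, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)   # (8, 2) (2, 2)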
💡
Returns copies, not views
The output arrays are always new copies, never views of the original. Modifying them won't affect the source data.
python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('dataset.csv')
X = df.drop('target', axis=1)
y = df['target']

# Standard 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stratified split (classification with imbalanced classes)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Verify class distribution is preserved
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))

# 3-way split: train / val / test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.111, random_state=42, stratify=y_temp  # 0.111 ≈ 1/9 of the remaining 90%
)
# Result: ~80% train, ~10% val, ~10% test
Common mistakes:

  • Missing random_state: Every run produces a different split → different scores → unreproducible experiments.
  • EDA before splitting: Any statistics (mean, std, class counts) computed on the full dataset contaminate the test data. Split first, analyze train only.
  • Forgetting stratify for imbalanced data: A 1% minority class could end up entirely in train or test by chance.
  • shuffle=True with time-series: This leaks future data into training. Always use shuffle=False for temporal data and TimeSeriesSplit for cross-validation (see the sketch after this list).
  • Using test set for model selection: Test set = final evaluation only. Use a validation set / cross-validation for hyperparameter tuning.
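For the time-series point above, the ordering-preserving alternative looks like this (toy data, purely illustrative):

python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(24).reshape(-1, 1)   # stand-in for time-ordered features
tscv = TimeSeriesSplit(n_splits=4)

# Each fold trains on the past and validates on the block that follows it.
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    print(f'fold {fold}: train {train_idx[0]}-{train_idx[-1]}, validate {val_idx[0]}-{val_idx[-1]}')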

Feature Scaling

📏
StandardScaler
Standardize features: zero mean, unit variance — required by most linear models and neural networks
scaling · z-score · normalization
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler(
    copy=True,        # False modifies data in-place
    with_mean=True,   # subtract mean
    with_std=True     # divide by std
)
scaler.fit(X_train)                        # learns mean_ and scale_
X_train_s = scaler.transform(X_train)
X_test_s  = scaler.transform(X_test)       # use SAME scaler

# Or combined:
X_train_s = scaler.fit_transform(X_train)  # fit+transform on train

Formula per feature: z = (x − mean) / std

  • fit(X_train) — stores mean_ and scale_ (std) per feature, computed only on training data (a quick numerical check follows after this list).
  • transform(X) — applies the stored mean/std. Must never call fit on test data.
  • Outliers inflate std, pulling scaled values toward zero — use RobustScaler when outliers are present.
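A quick numerical check of the formula and the stored attributes on a toy array (note scikit-learn uses the population std, ddof=0):

python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
scaler = StandardScaler().fit(X)

manual = (X - X.mean(axis=0)) / X.std(axis=0)    # population std, matching scaler.scale_
print(np.allclose(scaler.transform(X), manual))  # True
print(scaler.mean_, scaler.scale_)               # [2.5] [1.118...]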
🚨
Data Leakage
NEVER call fit_transform(X_all) before splitting. NEVER call scaler.fit(X_test). The scaler must only see training data.
python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on train, transform train
X_test_s  = scaler.transform(X_test)       # transform test with SAME params

# Inspect learned parameters
print(scaler.mean_)    # mean of each feature
print(scaler.scale_)   # std of each feature

# Invert scaling to report predictions in original units.
# Only needed if the TARGET was also scaled before training:
target_scaler = StandardScaler().fit(y_train.values.reshape(-1, 1))
y_pred_scaled = model.predict(X_test_s)   # model here was trained on the scaled target
y_pred_orig = target_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1))
Scaler comparison:

  • StandardScaler: (x − μ) / σ. Best for linear models, SVM, neural nets, PCA. Outlier-sensitive: yes.
  • MinMaxScaler: (x − min) / (max − min). Best for neural nets when a bounded [0, 1] range is needed. Outlier-sensitive: yes (extremely).
  • RobustScaler: (x − median) / IQR. Best for data with significant outliers. Outlier-sensitive: no.
  • No scaler needed: tree models (RF, GBM, XGBoost). Outlier sensitivity not applicable.
📐
MinMaxScaler / RobustScaler
Scale to fixed range [0,1] or use robust median/IQR statistics
MinMax · RobustScaler · outliers
from sklearn.preprocessing import MinMaxScaler, RobustScaler

MinMaxScaler(feature_range=(0, 1), copy=True)
# Formula: (x - min) / (max - min) → scales to [0, 1]
# feature_range=(-1, 1) for tanh-activated networks

RobustScaler(
    with_centering=True,
    with_scaling=True,
    quantile_range=(25.0, 75.0)   # IQR range (default = Q1–Q3)
)
# Formula: (x - median) / IQR → outliers don't skew the scale
python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((100, 5))
# Inject outliers
X[0, :] = 100

# StandardScaler — distorted by outlier
ss = StandardScaler().fit(X)
print('STD scale:', ss.scale_)   # inflated by outlier

# RobustScaler — resistant to outlier
rs = RobustScaler().fit(X)
print('Robust scale (IQR):', rs.scale_)

# MinMaxScaler for pixel/image data → [0, 1]
pixel_array = rng.integers(0, 256, size=(100, 784)).astype(float)  # stand-in for flattened image data
mm = MinMaxScaler().fit(pixel_array)
pixels_norm = mm.transform(pixel_array)

Categorical Encoding

🏷️
LabelEncoder / OrdinalEncoder
Convert string categories to integer codes — with important usage constraints
encoding · categorical · ordinal
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder: for the TARGET variable only (1D)
le = LabelEncoder()
y_enc = le.fit_transform(y)            # 'cat' → 0, 'dog' → 1
y_orig = le.inverse_transform(y_enc)   # decode back

# OrdinalEncoder: for FEATURES (2D), supports multiple columns
oe = OrdinalEncoder(
    categories='auto',
    handle_unknown='use_encoded_value',
    unknown_value=-1
)
X_enc = oe.fit_transform(X[cat_cols])
python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc  = le.transform(y_test)  # use fitted encoder on test

print(le.classes_)          # ['cat' 'dog' 'fish']
print(le.transform(['dog']))  # [1]

# Decode model predictions back to labels
y_pred = model.predict(X_test)
y_pred_labels = le.inverse_transform(y_pred)
Common mistakes:

  • LabelEncoder on features (not the target): It assigns integer codes in sorted order (cat=0, dog=1, fish=2), which implies a spurious ordering cat < dog < fish that is meaningless for nominal data. Use OneHotEncoder for features.
  • LabelEncoder with 2D arrays: It is strictly 1D. Calling it on a single DataFrame column works; calling it on a multi-column DataFrame raises an error.
  • Unseen categories at test time: transform raises a ValueError. Use OrdinalEncoder with handle_unknown='use_encoded_value' (see the sketch after this list).
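A minimal sketch of the unseen-category handling mentioned in the last bullet (the category values are made up):

python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

train = np.array([['red'], ['green'], ['blue']])
test  = np.array([['green'], ['purple']])       # 'purple' was never seen during fit

oe = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
oe.fit(train)
print(oe.transform(test))   # [[ 1.] [-1.]] → the unseen 'purple' maps to -1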
🔢
OneHotEncoder
Encode nominal categories as binary vectors — the correct way for ML features
one-hot · dummy · sparse
from sklearn.preprocessing import OneHotEncoder

OneHotEncoder(
    categories='auto',        # or list of arrays per feature
    drop=None,                # 'first' to avoid the dummy-variable trap
    sparse_output=True,       # use sparse matrix output
    handle_unknown='error',   # 'ignore' for prod inference
    min_frequency=None        # group rare categories into 'infrequent'
)
python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

cats = np.array([['cat'], ['dog'], ['cat'], ['fish']])
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_enc = ohe.fit_transform(cats)
print(X_enc)
# [[1. 0. 0.]   ← cat
#  [0. 1. 0.]   ← dog
#  [1. 0. 0.]   ← cat
#  [0. 0. 1.]]  ← fish

print(ohe.categories_)  # [array(['cat', 'dog', 'fish'])]

# Production: use inside Pipeline to handle unknown categories
ohe_prod = OneHotEncoder(
    handle_unknown='ignore',   # unknown → all zeros at inference
    min_frequency=100,         # rare cats < 100 occurrences → 'infrequent_sklearn'
    sparse_output=True          # save memory for high-cardinality
)
Sparse vs Dense
sparse_output=True (default) uses scipy sparse matrices — up to 100× less memory for high-cardinality features.
High Cardinality
Use min_frequency to group rare categories. For 10K+ unique values, consider target encoding or hashing instead.
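A small sketch of how min_frequency groups rare categories (the counts are arbitrary; needs a recent scikit-learn, roughly 1.1+ for min_frequency and 1.2+ for sparse_output):

python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([['red']] * 50 + [['blue']] * 45 + [['teal']] * 3 + [['mauve']] * 2)

ohe = OneHotEncoder(min_frequency=5, sparse_output=False).fit(colors)
print(ohe.get_feature_names_out())
# ['x0_blue' 'x0_red' 'x0_infrequent_sklearn'] → teal and mauve share one 'infrequent' column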

Missing Value Imputation

🩹
SimpleImputer / KNNImputer
Fill missing values using statistical strategies — always inside a Pipeline
imputation · NaN · SimpleImputer
from sklearn.impute import SimpleImputer, KNNImputer

SimpleImputer(
    missing_values=np.nan,
    strategy='mean',            # 'mean' | 'median' | 'most_frequent' | 'constant'
    fill_value=None,            # used when strategy='constant'
    keep_empty_features=False
)

KNNImputer(
    n_neighbors=5,              # k nearest neighbors to average
    weights='uniform',          # 'distance' weights closer neighbors more
    metric='nan_euclidean'
)
python
from sklearn.impute import SimpleImputer, KNNImputer

# Numeric imputer — median is outlier-robust
num_imputer = SimpleImputer(strategy='median')
X_train_imp = num_imputer.fit_transform(X_train[num_cols])
X_test_imp  = num_imputer.transform(X_test[num_cols])

# Categorical imputer
cat_imputer = SimpleImputer(strategy='most_frequent')

# KNN imputer — better accuracy, higher cost
knn_imp = KNNImputer(n_neighbors=5)
X_knn = knn_imp.fit_transform(X_train)   # uses kNN to fill NaN

# ✅ Always use inside Pipeline to prevent leakage
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler()),
    ('model',   LogisticRegression())
])
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
Common mistakes:

  • Fitting on full data: Computing the mean/median on (train + test) before the split is data leakage. Fit the imputer only on train, transform both.
  • KNNImputer on large data: O(n²) complexity — very slow for datasets with 100K+ rows. Use SimpleImputer or IterativeImputer for large datasets (see the sketch after this list).
  • strategy='mean' on skewed distributions gives a biased fill. Prefer 'median' for robustness.
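IterativeImputer, mentioned above, models each feature with missing values as a function of the other features; note it still sits behind an experimental enable flag. A minimal sketch on toy data:

python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (this import activates IterativeImputer)
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [np.nan, 8.0]])

imp = IterativeImputer(max_iter=10, random_state=42)
print(imp.fit_transform(X))   # each NaN predicted from the other column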

Pipeline & ColumnTransformer

🔗
Pipeline / ColumnTransformer
Chain transformers and estimators — the correct way to avoid data leakage in production
Pipeline · ColumnTransformer · production
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer

Pipeline(
    steps=[                      # list of (name, estimator) tuples
        ('step_name', estimator),
        ('model', classifier)
    ],
    memory=None                  # path to cache fitted transformers
)

ColumnTransformer(
    transformers=[               # list of (name, transformer, columns)
        ('num', numeric_pipeline, num_cols),
        ('cat', cat_pipeline, cat_cols)
    ],
    remainder='drop',            # 'passthrough' to keep unspecified cols
    n_jobs=-1
)

How Pipeline.fit() works:

  • Calls fit_transform(X, y) on each step except the last.
  • On the last step (final estimator), calls fit(X_transformed, y).
  • During predict(X_test), calls transform(X_test) on every step except the last, then predict.
  • Because every transformer is fitted only on the data passed to fit() (and, under cross-validation, only on each training fold), test data never leaks into the fitted parameters (a minimal equivalence sketch follows below).
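A minimal sketch of that sequence, comparing the Pipeline against the equivalent manual calls (synthetic data, two steps only):

python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
pipe.fit(X_train, y_train)

# Manual equivalent of pipe.fit / pipe.predict:
scaler = StandardScaler().fit(X_train)                              # fit_transform on every step but the last
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)  # plain fit on the final estimator
manual_pred = clf.predict(scaler.transform(X_test))                 # transform, then predict, at inference

print((pipe.predict(X_test) == manual_pred).all())   # True: identical predictions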
💡
GridSearchCV + Pipeline
When using GridSearchCV with a Pipeline, parameter names use double underscore notation: {'model__C': [0.1, 1, 10], 'scaler__with_mean': [True]}
python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

df = pd.read_csv('titanic.csv')
X = df.drop('survived', axis=1)
y = df['survived']

num_cols = ['age', 'fare', 'pclass']
cat_cols = ['sex', 'embarked']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Numeric pipeline
num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler())
])

# Categorical pipeline
cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# ColumnTransformer: apply different preprocessing per column type
preprocessor = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols)
])

# Full pipeline: preprocessing + model
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42))
])

full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

# GridSearchCV with pipeline params
from sklearn.model_selection import GridSearchCV

param_grid = {
    'model__n_estimators': [50, 100, 200],
    'model__max_depth': [None, 5, 10],
    'preprocessor__num__scaler__with_mean': [True]
}
gs = GridSearchCV(full_pipeline, param_grid, cv=5, n_jobs=-1)
gs.fit(X_train, y_train)
print(gs.best_params_)
print(gs.best_score_)
Common mistakes:

  • Forgetting to use Pipeline: Manual fit_transform on the full data before cross-validation leaks information into every fold.
  • Calling fit_transform on test data: Every transformer must be fitted on training data only; inside a Pipeline, transform/predict on new data never refits.
  • Wrong step order: Imputation must come before scaling, and scaling before the model. The Pipeline runs steps strictly in the order you list them, so list them correctly.
  • ColumnTransformer drops unlisted columns by default: set remainder='passthrough' to keep them (see the sketch below).
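A small illustration of the remainder behaviour from the last bullet (made-up column names):

python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'age': [20, 30, 40], 'fare': [7.0, 80.0, 15.0], 'cabin_flag': [0, 1, 0]})

ct_drop = ColumnTransformer([('num', StandardScaler(), ['age', 'fare'])])   # remainder='drop' (default)
ct_keep = ColumnTransformer([('num', StandardScaler(), ['age', 'fare'])], remainder='passthrough')

print(ct_drop.fit_transform(df).shape)   # (3, 2) → cabin_flag silently dropped
print(ct_keep.fit_transform(df).shape)   # (3, 3) → cabin_flag kept, passed through unchanged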

Dimensionality Reduction

📉
PCA — Principal Component Analysis
Project data to its most informative directions — for visualization, noise removal, and collinearity removal
PCA · SVD · variance · dimensionality
from sklearn.decomposition import PCA

pca = PCA(
    n_components=None,    # int = exact count, float (0-1) = keep that fraction of variance
    whiten=False,         # normalize components to unit variance
    svd_solver='auto',    # 'full' | 'arpack' | 'randomized' | 'auto'
    random_state=42
)
pca.fit(X_train_scaled)                 # ALWAYS fit on SCALED data
X_pca = pca.transform(X_train_scaled)

# Key attributes after fit:
pca.explained_variance_ratio_   # per-component variance fraction
pca.components_                 # eigenvectors (loadings)
pca.n_components_               # actual n if auto-selected
python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt

# 1. Scale first (PCA is variance-based)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# 2. Fit PCA: keep 95% of variance
pca = PCA(n_components=0.95, random_state=42)
X_pca = pca.fit_transform(X_scaled)
print(f'Original: {X_scaled.shape[1]} features → PCA: {pca.n_components_}')

# Scree plot — find elbow
evr = pca.explained_variance_ratio_
cum_evr = np.cumsum(evr)
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(range(len(evr)), evr, label='Individual')
ax.plot(range(len(evr)), cum_evr, 'r--', marker='o', label='Cumulative')
ax.axhline(0.95, ls='--', color='gray', label='95% threshold')
ax.legend(); ax.set_xlabel('Component'); ax.set_ylabel('Explained Variance')

# Use in Pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca',    PCA(n_components=0.95)),
    ('model',  LogisticRegression())
])
pipe.fit(X_train, y_train)
  • Algorithm: Centers the data, then runs Singular Value Decomposition (SVD) on the centered matrix, which is equivalent to an eigendecomposition of the covariance matrix. Principal components = eigenvectors of the covariance matrix, sorted by eigenvalue (variance) descending.
  • svd_solver choices: 'full' = exact (slow for large n); 'randomized' = Halko et al. approximation — use for large datasets (n_components ≪ n_features); 'arpack' = sparse data.
  • Whitening: whiten=True divides each component by its standard deviation → unit variance per component. Helps algorithms sensitive to feature scale (e.g., SVM, Gaussian mixture models).
  • Inverse transform: pca.inverse_transform(X_pca) projects back to the original feature space. Useful for visualization and for anomaly detection, where a large reconstruction error flags an anomaly (a sketch follows below).
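A minimal sketch of the reconstruction-error idea from the last bullet, on synthetic data with genuine low-dimensional structure:

python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
latent = rng.standard_normal((200, 2))
W = rng.standard_normal((2, 10))
X = latent @ W + 0.05 * rng.standard_normal((200, 10))   # points lie near a 2-D plane
X[0] += 5.0 * rng.standard_normal(10)                    # push one point far off the plane

pca = PCA(n_components=2).fit(X)
X_rec = pca.inverse_transform(pca.transform(X))           # project down, then back up
recon_error = np.mean((X - X_rec) ** 2, axis=1)           # per-sample reconstruction error

print(recon_error[0], recon_error[1:].mean())   # the off-plane point reconstructs far worse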

Feature Selection

🎯
SelectKBest / SelectFromModel / RFE
Remove irrelevant features — reduce overfitting, training time, and improve interpretability
SelectKBest · RFE · feature importance
from sklearn.feature_selection import (
    SelectKBest, f_classif, mutual_info_classif,
    SelectFromModel, RFE, RFECV
)

# Univariate: score each feature independently
SelectKBest(score_func=f_classif, k=10)

# Model-based: use the model's feature_importances_
SelectFromModel(estimator, threshold='mean')

# Recursive elimination: iteratively remove the weakest features
RFE(estimator, n_features_to_select=20, step=1)
RFECV(estimator, cv=5, scoring='roc_auc')   # auto-selects k
python
from sklearn.feature_selection import (
    SelectKBest, f_classif, mutual_info_classif,
    SelectFromModel, RFECV
)
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# ── Method 1: Univariate (ANOVA F-test)
selector = SelectKBest(f_classif, k=15)
X_sel = selector.fit_transform(X_train, y_train)
selected_mask  = selector.get_support()
selected_feats = X_train.columns[selected_mask].tolist()
print('Selected:', selected_feats)

# ── Method 2: Model-based (tree feature importances)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
sfm = SelectFromModel(rf, threshold='0.5*mean')
sfm.fit(X_train, y_train)
X_sfm = sfm.transform(X_train)
feat_imp = pd.Series(sfm.estimator_.feature_importances_, index=X_train.columns)  # fitted clone lives on sfm.estimator_
print(feat_imp.sort_values(ascending=False).head(10))

# ── Method 3: RFECV — cross-validated RFE
rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    cv=5, scoring='roc_auc', step=1, n_jobs=-1
)
rfecv.fit(X_train, y_train)
print(f'Optimal features: {rfecv.n_features_}')
X_rfecv = rfecv.transform(X_train)

# Use inside Pipeline to prevent leakage
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
pipe = Pipeline([
    ('selector', SelectKBest(f_classif, k=10)),
    ('scaler',   StandardScaler()),
    ('model',    LogisticRegression())
])
pipe.fit(X_train, y_train)
Choosing a method:

  • SelectKBest (f_classif): ⚡ fast; does not consider feature interactions. Good for a quick baseline and linear relationships.
  • SelectKBest (mutual_info_classif): medium speed; captures non-linear relationships but still scores each feature independently. Good for non-linear classification problems (a usage sketch follows after this list).
  • SelectFromModel: medium speed; considers interactions through the fitted model. Good for tree-based importances and fast model-driven selection.
  • RFE / RFECV: 🐢 slow; considers interactions. Best accuracy, for high-stakes feature selection.
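The table mentions mutual_info_classif, which the examples above do not demonstrate; a minimal usage sketch on synthetic data:

python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)

mi_selector = SelectKBest(mutual_info_classif, k=5)
X_mi = mi_selector.fit_transform(X, y)

print(X_mi.shape)                     # (500, 5)
print(mi_selector.scores_.round(3))   # mutual information score per original feature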