Pandas
Data loading, inspection, cleaning, indexing, feature engineering, groupby, merging, time-series, and performance patterns for ML workflows.
Data Loading
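read_csv / read_parquet
Get data into memory with dtypes and dates handled at load time (a minimal sketch: the file names, columns, and dtype choices below are illustrative)
Example
```python
import pandas as pd

# CSV: parse dates and pin dtypes at load time to save memory
df = pd.read_csv(
    'data.csv',
    parse_dates=['date'],                    # parse to datetime64
    dtype={'user_id': 'int32',               # downcast integer ids
           'category': 'category'},          # categorical for low-cardinality strings
    usecols=['user_id', 'date', 'category', 'amount'],  # read only needed columns
)

# Parquet: preserves dtypes and is much faster to re-read than CSV
df.to_parquet('data.parquet')
df = pd.read_parquet('data.parquet', columns=['user_id', 'amount'])
```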
Inspection & EDA
head / info / describe / value_counts / dtypes / nunique
Know your data before modelling: the essential EDA toolkit
Example
```python
df.head(5) # first rows
df.tail(5) # last rows
df.sample(10) # random rows (more representative than head)
df.info() # dtypes, non-null counts, memory
df.describe() # numeric stats
df.describe(include='object') # categorical stats
df.shape # (rows, cols)
df.dtypes # column types
df.nunique() # unique count per column
df['col'].value_counts(normalize=True) # freq table
df.isnull().sum() # missing per column
```
EDA Script
```python
# Full EDA summary function
def eda_summary(df):
    print(f"Shape: {df.shape}")
    summary = pd.DataFrame({
        'dtype': df.dtypes,
        'n_null': df.isnull().sum(),
        'pct_null': df.isnull().mean().mul(100).round(2),
        'n_unique': df.nunique(),
        'example': df.iloc[0]
    })
    print(summary)
    print("\nClass balance:")
    print(df['target'].value_counts(normalize=True))
    return summary

eda_summary(df)
```
Missing Values
isnull / dropna / fillna / interpolate
Detect, drop, fill, and flag missing data, with data-leakage awareness
Syntax
df.isnull() / df.notnull()
df.dropna(axis=0, how='any', subset=None, thresh=None)
df.fillna(value)                     # scalar, dict, or Series of fill values
df.ffill() / df.bfill()              # forward / backward fill (fillna(method=...) is deprecated)
df['col'].interpolate(method='linear') # for time-series
Example
```python
# Flag missing as feature BEFORE filling
df['income_missing'] = df['income'].isnull().astype(int)
# Fill numeric with TRAIN median (compute on train only!)
train_median = X_train['age'].median()
X_train['age'] = X_train['age'].fillna(train_median)
X_test['age'] = X_test['age'].fillna(train_median) # same value!
# Drop cols with > 40% missing
thresh = int(0.6 * len(df))
df.dropna(axis=1, thresh=thresh, inplace=True)
# Forward-fill for time-series (last known value)
df.sort_values('date', inplace=True)
df['price'] = df['price'].ffill()  # or .bfill()
```
Mistakes
- fillna(df.mean()): if computed on the full dataset (train + test), you leak test information into the train imputation. Always compute fill values on train only.
- inplace=True: doesn't save memory and is deprecated in spirit. Prefer df = df.dropna(...).
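A minimal sketch of the train-only imputation pattern the bullets above describe, reusing the X_train / X_test split from the example (column choice is illustrative):
```python
# Compute fill values on the TRAIN split only, then reuse them everywhere
fill_values = X_train.median(numeric_only=True)   # one median per numeric column

X_train = X_train.fillna(fill_values)
X_test = X_test.fillna(fill_values)   # same train-derived values: no leakage
```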
Indexing: loc / iloc / query
loc / iloc / at / query / eval
Label vs. position indexing: view/copy semantics, boolean filtering, and fast query expressions
Syntax
df.loc[row_label, col_label]       # label-based, inclusive stop
df.iloc[row_int, col_int]          # position-based, exclusive stop
df.at[label, col]                  # single scalar, fastest
df.iat[row_int, col_int]           # single scalar by position
df.query('col > 5 and label == 1')              # string expression
df.eval('new_col = col1 + col2', inplace=True)
Internals
View vs Copy rulebook:
- Chained indexing (df[col][mask]) always produces a copy, so assignment to it is silently ignored. df.loc[mask, col] = val modifies in place (safe).
- After boolean filtering, the result is usually a copy; call .copy() to be explicit.
- Copy-on-Write (pandas 2.0+) makes this predictable: any derived object behaves as a copy, and chained assignment never modifies the original (see the sketch below).
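A small demo of the rulebook, on toy data:
```python
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3], 'flag': [0, 1, 1]})
mask = df['flag'] == 1

# Chained indexing assigns into a temporary copy: the original is unchanged
df[mask]['col'] = 99          # SettingWithCopyWarning; a no-op under Copy-on-Write
print(df['col'].tolist())     # [1, 2, 3]: the assignment was lost

# Single .loc call modifies df in place
df.loc[mask, 'col'] = 99
print(df['col'].tolist())     # [1, 99, 99]
```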
Example
```python
# loc: label-based (index value, not position)
df.loc[100:200, ['age', 'income']]
# iloc: positional (rows 0-1999, columns 1-9)
X_train = df.iloc[:2000, 1:10]
# Boolean filter via loc
high_earners = df.loc[df['income'] > 100_000].copy()
# Multiple conditions
mask = (df['age'] > 30) & (df['churn'] == 1)
df_subset = df.loc[mask, ['age', 'income']]
# query: cleaner for multi-condition filters
df.query('age > 30 and churn == 1 and income > 50_000')
# eval: vectorized column creation (can be faster than assign)
df.eval('ratio = income / (age + 1)', inplace=True)
# at: fastest single-cell access
df.at[idx, 'label'] = 1
```
Mistakes
- SettingWithCopyWarning: df[mask]['col'] = val is chained indexing and fails silently. Always use df.loc[mask, 'col'] = val.
- loc vs iloc after reset_index: after reset_index(), integer labels equal positions, so loc[5] and iloc[5] agree; on a filtered frame whose index has gaps they can point to different rows.
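A quick demonstration of the gap pitfall, on toy data:
```python
import pandas as pd

df = pd.DataFrame({'val': range(10, 20)})   # index labels 0..9
filtered = df[df['val'] > 14]               # keeps labels 5..9 (gaps below 5)

print(filtered.iloc[0])    # first row by POSITION  -> val = 15
print(filtered.loc[5])     # row with LABEL 5       -> val = 15 (same row here)
# filtered.loc[0]          -> KeyError: label 0 was filtered out
filtered = filtered.reset_index(drop=True)  # labels == positions again
```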
Transformation & Feature Engineering
apply / map / assign / cut / qcut / get_dummies
Create engineered features: row/column functions, binning, and encoding
Syntax
df['col'].map(dict_or_func)        # element-wise on Series
df.apply(func, axis=0) # 0=column-wise, 1=row-wise
df.assign(new_col=lambda x: ...) # returns new df, chainable
pd.cut(series, bins, labels=None) # equal-width binning
pd.qcut(series, q, labels=None) # quantile binning
pd.get_dummies(df, columns=None, drop_first=True, dtype=np.uint8)
Example
```python
# Encode ordinal with map
edu_map = {'high_school': 0, 'bachelor': 1, 'master': 2, 'phd': 3}
df['edu_enc'] = df['education'].map(edu_map)
# Feature engineering with assign (chainable)
df = (df
.assign(age_sq=lambda x: x['age']**2)
.assign(log_income=lambda x: np.log1p(x['income']))
.assign(income_per_year=lambda x: x['income'] / x['tenure'].clip(1))
)
# Age binning (equal-width)
df['age_group'] = pd.cut(df['age'],
bins=[0, 25, 40, 60, 100],
labels=['young', 'adult', 'senior', 'elder'])
# Quantile binning (income quartiles)
df['income_q'] = pd.qcut(df['income'], q=4,
labels=['Q1', 'Q2', 'Q3', 'Q4'])
# One-hot encode (drop one for linear models)
df_ohe = pd.get_dummies(df, columns=['job', 'education'],
                        drop_first=True, dtype=np.uint8)
```
Performance
- apply is slow: df.apply(func, axis=1) (row-wise) iterates in Python and can be 100× slower than vectorized code. Prefer vectorized column ops or df.eval().
- Series.map vs apply: Series.map(dict) is fastest for encoding, a pure C-level dict lookup. For numeric ops, use vectorized NumPy directly on the Series.
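A rough benchmark sketch of the claim above (absolute timings vary by machine and data size):
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000_000, 2), columns=['a', 'b'])

# Slow: row-wise apply calls a Python function once per row
slow = df.apply(lambda row: row['a'] / (row['b'] + 1), axis=1)

# Fast: vectorized arithmetic runs in C over whole columns
fast = df['a'] / (df['b'] + 1)

assert np.allclose(slow, fast)   # identical results; speed differs by orders of magnitude
```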
String Feature Engineering
.str accessor
Vectorized string operations on text columns: extract, clean, and encode
Syntax
s.str.lower() / upper() / strip() / replace(pat, repl)
s.str.contains(pat, regex=True, na=False)
s.str.startswith(prefix) / endswith(suffix)
s.str.split(pat, expand=True) # expand=True β DataFrame
s.str.extract(r'(\d+)') # regex capture group
s.str.len()                        # character count
s.str.count(pat)                   # regex occurrences
Example
```python
# Clean text column
df['name'] = df['name'].str.strip().str.lower()
# Extract numeric from text ('$42,000' → 42000)
df['salary_num'] = (df['salary_text']
.str.replace(r'[\$,]', '', regex=True)
.astype(float))
# Binary feature: does text contain keyword?
df['has_degree'] = df['notes'].str.contains(
r'bachelor|master|phd', regex=True, case=False, na=False).astype(int)
# Split 'First Last' into two columns
df[['first', 'last']] = df['full_name'].str.split(' ', n=1, expand=True)
# Compute text length as feature
df['review_len'] = df['review'].str.len()
df['word_count'] = df['review'].str.split().str.len()
```
GroupBy & Aggregation
groupby / agg / transform / pivot_table
Aggregate, compute group statistics, and create powerful group-level features
Syntax
df.groupby(by, sort=True, observed=False)[cols].agg(func)
# agg with multiple functions
.agg(['mean', 'std', 'count'])
.agg({'col1': 'sum', 'col2': ['min', 'max']})
.agg(new_name=pd.NamedAgg(column='col', aggfunc='mean'))
# transform: returns same-shape Series (for feature eng.)
df.groupby('group')['val'].transform('mean')
pd.pivot_table(df, values, index, columns, aggfunc, fill_value)
Example
```python
# ML feature: mean encoding (leakage-safe on train only)
group_means = X_train.groupby('category')['target'].mean()
X_train['cat_mean_enc'] = X_train['category'].map(group_means)
X_test['cat_mean_enc'] = X_test['category'].map(group_means)
# Aggregation: user-level feature engineering
user_feats = df.groupby('user_id').agg(
n_transactions=('amount', 'count'),
total_spend=('amount', 'sum'),
avg_spend=('amount', 'mean'),
max_spend=('amount', 'max'),
days_active=('date', 'nunique')
).reset_index()
# transform: add group mean as new column (same length)
df['dept_avg_salary'] = df.groupby('dept')['salary'].transform('mean')
df['salary_vs_dept'] = df['salary'] - df['dept_avg_salary']
# pivot_table: churn rate by segment × age group
pivot = pd.pivot_table(df,
values='churn', index='segment', columns='age_group',
    aggfunc='mean', fill_value=0).round(3)
```
Internals
groupby internals:
- Builds a hash map from group keys to row indices; multi-key groupby hashes the composite key.
- transform is equivalent to groupby().agg() followed by a merge back, but faster (no join); see the sketch below.
- observed=True matters for Categorical dtypes: it skips categories that never appear in the data.
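A sketch of the transform vs. agg-plus-broadcast equivalence, on toy data:
```python
import pandas as pd

df = pd.DataFrame({'dept': ['a', 'a', 'b'], 'salary': [10, 20, 40]})

# transform: the group statistic, broadcast back to the original shape
t = df.groupby('dept')['salary'].transform('mean')

# equivalent agg + map back (the join transform saves you)
means = df.groupby('dept')['salary'].mean()   # agg: one row per group
m = df['dept'].map(means)                     # broadcast onto original rows

assert t.equals(m)   # same values: [15.0, 15.0, 40.0]
```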
Merging & Concatenation
pd.merge / pd.concat / df.join
Database-style joins and combining DataFrames: join types, key validation
Syntax
pd.merge(
    left, right,
    how='inner',                  # 'left', 'right', 'outer', 'cross'
    on=None,                      # shared key col(s)
    left_on=None, right_on=None,  # different key names
    validate=None,                # '1:1', '1:m', 'm:1': catches duplicates
    indicator=False               # adds '_merge' column (left_only etc.)
)
pd.concat([df1, df2], axis=0, ignore_index=True)
Example
```python
# Enrich features: join user demographics to transactions
df_full = pd.merge(transactions, users,
on='user_id', how='left',
validate='m:1') # ensure users are unique
# Indicator to detect unmatched rows
df_check = pd.merge(X, y_labels, on='id',
how='outer', indicator=True)
unmatched = df_check[df_check['_merge'] != 'both']
# Stack monthly datasets vertically
monthly = [pd.read_parquet(f'data_{m}.parquet') for m in months]
df_all = pd.concat(monthly, axis=0, ignore_index=True)
# Concatenate features horizontally (after separate preprocessing)
df_combined = pd.concat([num_feats, cat_feats], axis=1)
```
Mistakes
- Row explosion on m:m merge: if both sides have duplicate keys, merge creates a Cartesian product. Use validate='m:1' or '1:m' to catch this early.
- concat without reset_index: the index may contain duplicate labels after stacking. Use ignore_index=True or df.reset_index(drop=True).
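A short demonstration of validate catching a fan-out merge, on toy data:
```python
import pandas as pd

users = pd.DataFrame({'user_id': [1, 1, 2], 'city': ['NY', 'LA', 'SF']})  # duplicate key!
tx = pd.DataFrame({'user_id': [1, 2], 'amount': [10, 20]})

# Silent row explosion: user 1 matches twice, so 3 rows come back instead of 2
print(pd.merge(tx, users, on='user_id', how='left').shape)   # (3, 3)

# validate='m:1' raises MergeError instead of exploding
try:
    pd.merge(tx, users, on='user_id', how='left', validate='m:1')
except pd.errors.MergeError as e:
    print(e)   # right keys are not unique
```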
Window Functions
rolling / expanding / ewm / shift / diff / pct_change
Time-series feature engineering: lagged, rolling, and exponential window statistics
Syntax
df['col'].shift(n)          # lag by n periods
df['col'].diff(n)           # difference from n periods ago
df['col'].pct_change(n)     # % change from n periods ago
df['col'].rolling(
    window=7,          # window size
    min_periods=1,     # allow partial window at start
    center=False       # True = centered window (causes leakage!)
).mean() / .std() / .sum()
df['col'].expanding().mean()    # cumulative mean
df['col'].ewm(span=10).mean()   # exponential weighted mean
Example
```python
# Sort by time first!
df.sort_values(['user_id', 'date'], inplace=True)
# Lag features (yesterday's sales β feature for today)
df['sales_lag1'] = df.groupby('user_id')['sales'].shift(1)
df['sales_lag7'] = df.groupby('user_id')['sales'].shift(7)
# Rolling 7-day average (within group, no leakage)
df['rolling_7d_mean'] = (
df.groupby('user_id')['sales']
.transform(lambda x: x.rolling(7, min_periods=1).mean())
)
# Rate of change
df['mom_growth'] = df.groupby('user_id')['sales'].pct_change(30)  # 30-day growth, per user
# EWM for trend (more weight on recent), per user
df['ewm_7'] = (df.groupby('user_id')['sales']
                 .transform(lambda x: x.ewm(span=7, adjust=False).mean()))
```
Mistakes
- center=True causes data leakage: a centered window uses future values. For predictive ML, always use a trailing window (center=False).
- Rolling without groupby: if the data has multiple entities (users), rolling must be applied per group; otherwise the window mixes data from different users.
- NaN at the window head: rolling(7) produces NaN for the first 6 rows. Use min_periods=1 for partial windows, or impute/drop afterward.
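When the rolled column is also the prediction target, even a trailing window leaks the current row's value. A minimal sketch, continuing the user_id/sales example above: shift before rolling so today's value never enters today's feature.
```python
# Trailing 7-day mean that EXCLUDES the current row (leakage-safe target feature)
df['sales_trailing_7d'] = (
    df.groupby('user_id')['sales']
      .transform(lambda s: s.shift(1).rolling(7, min_periods=1).mean())
)
```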