Data Loading

📂
pd.read_csv / read_parquet / read_excel / read_sql
Efficient data ingestion with memory control, chunking, and type inference
Tags: I/O · memory · chunking
Syntax

import pandas as pd

pd.read_csv(
    filepath,
    sep=',',
    dtype=None,        # dict of col→dtype; avoids inference cost
    usecols=None,      # list of cols to load — skip unwanted cols
    nrows=None,        # load first N rows for exploration
    chunksize=None,    # returns iterator of DataFrames
    parse_dates=False,
    na_values=None,
    low_memory=False   # avoid mixed-type columns
)
pd.read_parquet(path, columns=None)  # preferred binary format
pd.read_sql(query, con)              # SQLAlchemy connection
Example
# Minimal memory: specify dtypes + load only needed cols
import numpy as np
dtype_map = {'age': np.int8, 'salary': np.float32, 'label': np.int8}
df = pd.read_csv('data.csv', dtype=dtype_map,
                 usecols=['age', 'salary', 'label'])

# Preview large file without loading all
df_head = pd.read_csv('big.csv', nrows=1000)

# Chunked processing for OOM situations
results = []
for chunk in pd.read_csv('huge.csv', chunksize=50_000):
    chunk_feats = chunk.groupby('user_id')['amount'].sum()
    results.append(chunk_feats)
agg = pd.concat(results).groupby(level=0).sum()

# Check memory usage
print(df.memory_usage(deep=True).sum() / 1e6, 'MB')

# Save as parquet (5-10× smaller than CSV, faster to read)
df.to_parquet('data.parquet', index=False)
Performance

Format Speed
Parquet > Feather > CSV. For ML datasets over 100 MB, always store as Parquet; pd.read_parquet is typically 5-10× faster than reading the equivalent CSV.
dtype=str
Force string dtype to skip mixed-type inference: dtype={'id': str}. This prevents the silent int→float conversion on ID columns that contain NaN.
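The int→float pitfall above can be reproduced in a few lines. A minimal sketch using io.StringIO to stand in for a file (the toy column names are made up):

```python
import io
import pandas as pd

csv = "id,val\n1001,a\n,b\n1003,c\n"

# Default inference: the missing id forces float64, so 1001 becomes 1001.0
df_bad = pd.read_csv(io.StringIO(csv))
print(df_bad['id'].tolist())   # [1001.0, nan, 1003.0]

# dtype=str keeps present IDs as exact strings (missing values stay NaN)
df_ok = pd.read_csv(io.StringIO(csv), dtype={'id': str})
print(df_ok['id'].tolist())    # ['1001', nan, '1003']
```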

Inspection & EDA

🔍
head / info / describe / value_counts / dtypes / nunique
Know your data before modelling — the essential EDA toolkit
Tags: EDA · inspection · profiling
Example
df.head(5)               # first rows
df.tail(5)               # last rows
df.sample(10)            # random rows (more representative than head)
df.info()                # dtypes, non-null counts, memory
df.describe()            # numeric stats
df.describe(include='object')           # categorical stats
df.shape                 # (rows, cols)
df.dtypes                # column types
df.nunique()             # unique count per column
df['col'].value_counts(normalize=True)  # freq table
df.isnull().sum()        # missing per column
EDA Script
# Full EDA summary function
def eda_summary(df):
    print(f"Shape: {df.shape}")
    summary = pd.DataFrame({
        'dtype':    df.dtypes,
        'n_null':   df.isnull().sum(),
        'pct_null': df.isnull().mean().mul(100).round(2),
        'n_unique': df.nunique(),
        'example':  df.iloc[0]
    })
    print(summary)
    if 'target' in df.columns:
        print("\nClass balance:")
        print(df['target'].value_counts(normalize=True))
    return summary

eda_summary(df)

Missing Values

❓
isnull / dropna / fillna / interpolate
Detect, drop, fill, and flag missing data — with data-leakage awareness
Tags: NaN · imputation · leakage
Syntax

df.isnull() / df.notnull()
df.dropna(axis=0, how='any', subset=None, thresh=None)
df.fillna(value=None, method=None)      # method: 'ffill','bfill'
df['col'].interpolate(method='linear')  # for time-series
Example
# Flag missing as feature BEFORE filling
df['income_missing'] = df['income'].isnull().astype(int)

# Fill numeric with TRAIN median (compute on train only!)
train_median = X_train['age'].median()
X_train['age'] = X_train['age'].fillna(train_median)
X_test['age']  = X_test['age'].fillna(train_median)  # same value!

# Drop cols with > 40% missing
thresh = int(0.6 * len(df))
df.dropna(axis=1, thresh=thresh, inplace=True)

# Forward-fill for time-series (last known value)
df.sort_values('date', inplace=True)
df['price'] = df['price'].ffill()  # or .bfill()
Mistakes

  • fillna(df.mean()): If the mean is computed on the full dataset (train+test), test information leaks into the train imputation. Always compute fill values on train only.
  • inplace=True: Doesn't save memory and is discouraged in modern pandas. Prefer df = df.dropna(...).

Indexing: loc / iloc / query

🎯
loc / iloc / at / query / eval
Label vs position indexing — view/copy semantics, boolean filtering, and fast query expressions
Tags: loc · iloc · query · view vs copy
Syntax

df.loc[row_label, col_label]        # label-based, inclusive stop
df.iloc[row_int, col_int]           # position-based, exclusive stop
df.at[label, col]                   # single scalar — fastest
df.iat[row_int, col_int]            # single scalar by position
df.query('col > 5 and label == 1')  # string expression
df.eval('new_col = col1 + col2', inplace=True)

Internals

View vs Copy rulebook:

  • Chained indexing (df[col][mask]) → always a copy. Assignment may warn but does not reliably modify the original.
  • df.loc[mask, col] = val → modifies df in place (safe).
  • After boolean filtering: the result is a copy. Use .copy() to make that explicit.
  • Copy-on-Write (opt-in since pandas 2.0, the default in 3.0) makes this predictable: every derived object behaves as a copy, and chained assignment never modifies the parent.
Example

# loc — label-based (index value, not position)
df.loc[100:200, ['age', 'income']]

# iloc — positional (rows 0-1999, cols 1-9)
X_train = df.iloc[:2000, 1:10]

# Boolean filter via loc
high_earners = df.loc[df['income'] > 100_000].copy()

# Multiple conditions
mask = (df['age'] > 30) & (df['churn'] == 1)
df_subset = df.loc[mask, ['age', 'income']]

# query — cleaner for multi-condition filters
df.query('age > 30 and churn == 1 and income > 50_000')

# eval — expression-based column creation (numexpr-accelerated on large frames)
df.eval('ratio = income / (age + 1)', inplace=True)

# at — fastest single-cell access
df.at[idx, 'label'] = 1
Mistakes

  • SettingWithCopyWarning: df[mask]['col'] = val — chained indexing fails silently (or only warns). Always use df.loc[mask, 'col'] = val.
  • loc vs iloc after reset_index: After reset_index(), integer index labels equal positions, so loc[5] and iloc[5] agree — but on an index with gaps they differ (and loc's stop is inclusive).
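The chained-indexing failure can be reproduced on a toy DataFrame (pandas emits a SettingWithCopyWarning or a chained-assignment warning depending on version, but either way df is not reliably changed):

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 45], 'label': [0, 0]})

# Chained indexing: the boolean mask returns a copy, so this assignment
# hits a temporary object; df itself stays unchanged
df[df['age'] > 30]['label'] = 1
print(df['label'].tolist())   # still [0, 0]

# Single loc call: writes through to df itself
df.loc[df['age'] > 30, 'label'] = 1
print(df['label'].tolist())   # [0, 1]
```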

Transformation & Feature Engineering

⚙️
apply / map / assign / cut / qcut / get_dummies
Create engineered features — row/column functions, binning, and encoding
Tags: apply · map · binning · encoding
Syntax

df['col'].map(dict_or_func)        # element-wise on Series
df.apply(func, axis=0)             # 0=column-wise, 1=row-wise
df.assign(new_col=lambda x: ...)   # returns new df, chainable
pd.cut(series, bins, labels=None)  # equal-width binning
pd.qcut(series, q, labels=None)    # quantile binning
pd.get_dummies(df, columns=None, drop_first=True, dtype=np.uint8)
Example
# Encode ordinal with map
edu_map = {'high_school': 0, 'bachelor': 1, 'master': 2, 'phd': 3}
df['edu_enc'] = df['education'].map(edu_map)

# Feature engineering with assign (chainable)
df = (df
  .assign(age_sq=lambda x: x['age']**2)
  .assign(log_income=lambda x: np.log1p(x['income']))
  .assign(income_per_year=lambda x: x['income'] / x['tenure'].clip(1))
)

# Age binning (equal-width)
df['age_group'] = pd.cut(df['age'],
    bins=[0, 25, 40, 60, 100],
    labels=['young', 'adult', 'senior', 'elder'])

# Quantile binning (income quartiles)
df['income_q'] = pd.qcut(df['income'], q=4,
    labels=['Q1', 'Q2', 'Q3', 'Q4'])

# One-hot encode (drop one for linear models)
df_ohe = pd.get_dummies(df, columns=['job', 'education'],
                          drop_first=True, dtype=np.uint8)
Performance

apply is slow
df.apply(func, axis=1) (row-wise) iterates in Python and can be ~100× slower than vectorized code. Prefer vectorized column ops or df.eval().

Series.map vs apply
Series.map(dict) is fastest for encoding — a C-level dict lookup. For numeric ops, use vectorized NumPy directly on the Series.
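A rough way to see the apply-vs-vectorized gap (synthetic 50k-row frame; absolute timings depend on hardware, so only the ratio matters):

```python
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({'income': np.random.rand(50_000) * 1e5,
                   'age': np.random.randint(18, 80, 50_000)})

# Row-wise apply: a Python-level loop over every row
t_apply = timeit.timeit(
    lambda: df.apply(lambda r: r['income'] / (r['age'] + 1), axis=1),
    number=1)

# Vectorized: one C-level operation over whole columns
t_vec = timeit.timeit(lambda: df['income'] / (df['age'] + 1), number=1)

print(f"apply: {t_apply:.3f}s  vectorized: {t_vec:.5f}s")
```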
🔀
str accessor — String Feature Engineering
Vectorized string operations on text columns — extract, clean, and encode
Tags: str accessor · regex · text features
Syntax

s.str.lower() / upper() / strip() / replace(pat, repl)
s.str.contains(pat, regex=True, na=False)
s.str.startswith(prefix) / endswith(suffix)
s.str.split(pat, expand=True)  # expand=True → DataFrame
s.str.extract(r'(\d+)')        # regex capture group
s.str.len()                    # character count
s.str.count(pat)               # regex occurrences
Example
# Clean text column
df['name'] = df['name'].str.strip().str.lower()

# Extract numeric from text ('$42,000' → 42000)
df['salary_num'] = (df['salary_text']
    .str.replace(r'[\$,]', '', regex=True)
    .astype(float))

# Binary feature: does text contain keyword?
df['has_degree'] = df['notes'].str.contains(
    r'bachelor|master|phd', regex=True, case=False, na=False).astype(int)

# Split 'First Last' into two columns
df[['first', 'last']] = df['full_name'].str.split(' ', n=1, expand=True)

# Compute text length as feature
df['review_len'] = df['review'].str.len()
df['word_count'] = df['review'].str.split().str.len()

GroupBy & Aggregation

📦
groupby / agg / transform / pivot_table
Aggregate, compute group statistics, and create powerful group-level features
Tags: groupby · agg · transform · pivot_table
Syntax

df.groupby(by, sort=True, observed=False)[cols].agg(func)

# agg with multiple functions
.agg(['mean', 'std', 'count'])
.agg({'col1': 'sum', 'col2': ['min', 'max']})
.agg(new_name=pd.NamedAgg(column='col', aggfunc='mean'))

# transform: returns same-shape Series (for feature eng.)
df.groupby('group')['val'].transform('mean')

pd.pivot_table(df, values, index, columns, aggfunc, fill_value)
Example
# ML feature: mean encoding (leakage-safe on train only)
group_means = X_train.groupby('category')['target'].mean()
X_train['cat_mean_enc'] = X_train['category'].map(group_means)
X_test['cat_mean_enc']  = X_test['category'].map(group_means)

# Aggregation: user-level feature engineering
user_feats = df.groupby('user_id').agg(
    n_transactions=('amount', 'count'),
    total_spend=('amount', 'sum'),
    avg_spend=('amount', 'mean'),
    max_spend=('amount', 'max'),
    days_active=('date', 'nunique')
).reset_index()

# transform: add group mean as new column (same length)
df['dept_avg_salary'] = df.groupby('dept')['salary'].transform('mean')
df['salary_vs_dept']  = df['salary'] - df['dept_avg_salary']

# pivot_table: churn rate by segment × age group
pivot = pd.pivot_table(df,
    values='churn', index='segment', columns='age_group',
    aggfunc='mean', fill_value=0).round(3)

Internals

groupby internals:

  • Creates a hash map from group keys to row indices. Multi-key groupby creates a composite hash.
  • transform is equivalent to groupby().agg() + merge back — but faster (no explicit join).
  • observed=True — important for Categorical dtypes, to skip unobserved categories.
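The transform ≡ agg + merge equivalence above can be checked on a toy frame (column names are made up):

```python
import pandas as pd

df = pd.DataFrame({'dept': ['a', 'a', 'b'], 'salary': [10, 30, 50]})

# transform: group mean broadcast back onto the original rows
via_transform = df.groupby('dept')['salary'].transform('mean')

# the equivalent agg + merge back on the key
means = df.groupby('dept')['salary'].mean().rename('dept_mean').reset_index()
via_merge = df.merge(means, on='dept', how='left')['dept_mean']

print(via_transform.tolist())  # [20.0, 20.0, 50.0]
print(via_merge.tolist())      # [20.0, 20.0, 50.0]
```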

Merging & Concatenation

🔗
pd.merge / pd.concat / df.join
Database-style joins and combining DataFrames — join types, key validation
Tags: merge · concat · join
Syntax

pd.merge(
    left, right,
    how='inner',                  # 'left','right','outer','cross'
    on=None,                      # shared key col(s)
    left_on=None, right_on=None,  # different key names
    validate=None,                # '1:1','1:m','m:1' — catches duplicates
    indicator=False               # adds '_merge' column (left_only etc.)
)
pd.concat([df1, df2], axis=0, ignore_index=True)
Example
# Enrich features: join user demographics to transactions
df_full = pd.merge(transactions, users,
    on='user_id', how='left',
    validate='m:1')     # ensure users are unique

# Indicator to detect unmatched rows
df_check = pd.merge(X, y_labels, on='id',
    how='outer', indicator=True)
unmatched = df_check[df_check['_merge'] != 'both']

# Stack monthly datasets vertically
monthly = [pd.read_parquet(f'data_{m}.parquet') for m in months]
df_all = pd.concat(monthly, axis=0, ignore_index=True)

# Concatenate features horizontally (after separate preprocessing)
df_combined = pd.concat([num_feats, cat_feats], axis=1)
Mistakes

  • Row explosion on m:m merge: If both sides have duplicate keys, merge creates a per-key Cartesian product. Use validate='m:1' or '1:m' to catch this early.
  • concat without reset_index: The index may contain duplicates after stacking. Use ignore_index=True or df.reset_index(drop=True).
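A minimal reproduction of the m:m row explosion, and how validate catches it before it corrupts a feature table:

```python
import pandas as pd

left  = pd.DataFrame({'key': ['a', 'a'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['a', 'a'], 'y': [3, 4]})

# Duplicate keys on both sides: 2 x 2 = 4 rows (Cartesian product per key)
print(len(pd.merge(left, right, on='key')))   # 4

# validate raises MergeError instead of silently exploding
try:
    pd.merge(left, right, on='key', validate='m:1')
except pd.errors.MergeError as err:
    print('caught:', err)
```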

Window Functions

📅
rolling / expanding / ewm / shift / diff / pct_change
Time-series feature engineering — lagged, rolling, and exponential window statistics
Tags: rolling · lag features · time-series · ewm
Syntax

df['col'].shift(n)        # lag by n periods
df['col'].diff(n)         # difference from n periods ago
df['col'].pct_change(n)   # % change from n periods ago
df['col'].rolling(
    window=7,        # window size
    min_periods=1,   # allow partial window at start
    center=False     # True = centered window (causes leakage!)
).mean() / .std() / .sum()
df['col'].expanding().mean()    # cumulative mean
df['col'].ewm(span=10).mean()   # exponential weighted mean
Example
# Sort by time first!
df.sort_values(['user_id', 'date'], inplace=True)

# Lag features (yesterday's sales → feature for today)
df['sales_lag1'] = df.groupby('user_id')['sales'].shift(1)
df['sales_lag7'] = df.groupby('user_id')['sales'].shift(7)

# Rolling 7-day average (within group, no leakage)
df['rolling_7d_mean'] = (
    df.groupby('user_id')['sales']
    .transform(lambda x: x.rolling(7, min_periods=1).mean())
)

# Rate of change
df['mom_growth'] = df['sales'].pct_change(30)  # 30-day growth

# EWM for trend (more weight on recent)
df['ewm_7'] = df['sales'].ewm(span=7, adjust=False).mean()
Mistakes

  • center=True causes data leakage: A centered window includes future values. For predictive ML, always use a trailing window (center=False).
  • Rolling without groupby: If the data has multiple entities (e.g. users), apply rolling per group; otherwise a window can span rows from different users.
  • NaN at the window head: rolling(7) produces NaN for the first 6 rows. Use min_periods=1 for partial windows, or impute/drop afterward.
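The NaN-head behaviour and the min_periods fix can be seen on a tiny Series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Default: the first window-1 rows are NaN
print(s.rolling(3).mean().tolist())
# [nan, nan, 2.0, 3.0]

# min_periods=1: partial windows at the head instead of NaN
print(s.rolling(3, min_periods=1).mean().tolist())
# [1.0, 1.5, 2.0, 3.0]
```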