Data Cleaning & Imputation
Master the NaN float "virus", interpolation, and Machine Learning prep-work.
Machine Learning models are fundamentally just mathematical calculators. If you feed the equation a missing spreadsheet cell, the math either crashes or silently produces garbage. The golden rule of AI is Garbage In, Garbage Out (GIGO). 80% of a Data Scientist's job is not building Neural Networks; it's scrubbing corrupt, missing, and duplicate data from millions of rows before the neural network ever sees them.
Pandas provides a suite of C-optimized tools to detect missing data fragments (NaN), surgically remove them, or intelligently guess what the data should have been through Mathematical Imputation.
Imagine a patient medical record sheet where the "Blood Pressure" box is left blank.
You have three choices:
1. Deletion (`dropna()`): Throw the entire patient's medical record in the trash. The cleanest approach, but you lose massive amounts of data.
2. Imputation (`fillna()`): Look at the other 1,000 patients, calculate the statistical Average blood pressure, and boldly write that Average into the missing box. You keep the patient record, but you injected artificial "fake" data into your model.
3. Interpolation (`interpolate()`): If the patient had a reading on Monday of 120, and Wednesday of 140, logically infer that the missing Tuesday box was exactly 130.
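The three choices above can be sketched on a toy blood-pressure series (the day labels and numbers here are invented for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical readings: Tuesday's box was left blank.
bp = pd.Series([120.0, np.nan, 140.0], index=['Mon', 'Tue', 'Wed'])

# 1. Deletion: the blank day vanishes entirely.
print(bp.dropna())           # only Mon and Wed survive

# 2. Imputation: stamp the mean of the known readings into the blank.
print(bp.fillna(bp.mean()))  # Tue becomes 130.0

# 3. Interpolation: infer the midpoint between the neighbours.
print(bp.interpolate())      # Tue also becomes 130.0 (the two happen to agree here)
```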
```python
import pandas as pd
import numpy as np

# Scenario: Cleaning IoT Sensor data before feeding to an LSTM.
# Sensor B disconnected at 02:00, creating a NaN void.
data = {
    'Time': ['01:00', '02:00', '03:00'],
    'Sensor_A': [10.5, 10.6, 10.7],
    'Sensor_B': [50.0, np.nan, 70.0]
}
df = pd.DataFrame(data)

# 1. Check severity of corruption (missing values per column)
print("Missing values per column:\n", df.isna().sum())

# 2. We CANNOT delete the 02:00 row, because we need Sensor_A's valid data.
# We CANNOT fill with 0.0, because a sudden 0.0 spike will mislead the ML model.
# We must use Interpolation to draw a straight line between 50.0 and 70.0.
df['Sensor_B'] = df['Sensor_B'].interpolate(method='linear')

print("\nCleaned Output:")
print(df)
# Sensor_B at 02:00 is repaired to exactly 60.0
```
| Code Line | Explanation |
|---|---|
| `np.nan` | The literal Python manifestation of a missing value. `NaN` stands for "Not a Number", as defined by the IEEE-754 floating-point standard. |
| `df.isna().sum()` | Chained operations. First, `isna()` creates a boolean DataFrame of identical shape. Because Python treats `True` as `1` and `False` as `0`, `.sum()` simply walks down each column adding up the 1s, giving you a count of missing values per column. |
| `.interpolate(method='linear')` | Pandas detects the void, looks at the valid value before it (50.0) and the valid value after it (70.0), calculates the slope across the missing index, and injects exactly `60.0` into the underlying array. |
Why do missing values break Pandas Integer columns?
Under the IEEE-754 standard, `NaN` is only defined for floating-point numbers. There is literally no such thing as an "Integer NaN".
If you have a column of 10 million memory-efficient 8-bit Integers, and a single `NaN` creeps into one row... Pandas panics. Because NumPy refuses to mix types, Pandas is forced to rip all 10 million Integers out of RAM, upcast them entirely into 64-bit Floating Point values, and then inject the 64-bit `NaN` Float virus. That column's RAM usage instantly balloons to 8x its original size (int8 to float64) because of 1 missing cell.
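A quick way to watch the upcast happen is to `reindex` a lean `int8` Series so that a missing label appears (a sketch; the sizes shown are the data buffers only, index excluded):

```python
import pandas as pd

# A lean 8-bit integer column: 1 byte per value.
s = pd.Series([1, 2, 3], dtype='int8')
print(s.dtype, s.memory_usage(index=False))    # int8 3

# Reindexing to a label that doesn't exist forces a NaN into the column.
# NumPy has no integer NaN, so the WHOLE array is rebuilt as float64.
s2 = s.reindex([0, 1, 2, 3])
print(s2.dtype, s2.memory_usage(index=False))  # float64 32
```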
By default, `df.fillna()` does NOT alter your original DataFrame. It allocates new memory, builds a cleaned copy, and returns it. To overwrite the original DataFrame instead, use `df.fillna(0, inplace=True)`. Be aware, though, that even `inplace=True` usually builds a temporary copy internally, so it rarely saves as much RAM as its name suggests.
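A minimal demonstration of the copy-versus-inplace behaviour:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Funds': [1000.0, np.nan]})

# Default: a cleaned COPY comes back; the original still holds its NaN.
cleaned = df.fillna(0)
print(df['Funds'].isna().sum())       # 1 (original untouched)
print(cleaned['Funds'].isna().sum())  # 0

# inplace=True mutates df itself and returns None.
result = df.fillna(0, inplace=True)
print(result)                         # None
print(df['Funds'].isna().sum())       # 0
```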
The Mathematical Black Hole of NaN:
By definition, `NaN` is infectious: any math operation involving `NaN` evaluates to `NaN`.
`10.5 + NaN = NaN`
`NaN == NaN` is `False`!
Wait, `NaN` does not equal `NaN`? Because `NaN` represents "The Unknown", the computer has no idea whether Unknown A is the exact same number as Unknown B. Therefore, `df['Col'] == np.nan` will FAIL silently (every comparison returns `False`). You must explicitly use `df.isna()`.
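The comparison trap is easy to reproduce (a small sketch):

```python
import pandas as pd
import numpy as np

print(np.nan == np.nan)      # False: the Unknown never equals the Unknown
print(10.5 + np.nan)         # nan:   NaN infects any arithmetic it touches

s = pd.Series([50.0, np.nan, 70.0])
print((s == np.nan).sum())   # 0: the equality check silently finds NOTHING
print(s.isna().sum())        # 1: the correct detector
```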
Forward Fill (`ffill`) and Backward Fill (`bfill`):
If a bank account balance is recorded on Monday as $1,000, and Tuesday is accidentally left
blank, we shouldn't guess. We just drag Monday's value forward to fill Tuesday.
`df['Funds'].ffill()` takes the last known valid value and stamps it sequentially downwards over every `NaN` void it encounters until it hits a new valid number. `bfill()` is the mirror image: the next valid value is dragged backwards to fill the gaps above it.
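A sketch with an invented bank-balance series:

```python
import pandas as pd
import numpy as np

funds = pd.Series([1000.0, np.nan, np.nan, 1250.0],
                  index=['Mon', 'Tue', 'Wed', 'Thu'])

# ffill drags the last valid value forward over every NaN.
print(funds.ffill())  # Tue and Wed both become 1000.0

# bfill is the mirror image: the NEXT valid value is pulled backwards.
print(funds.bfill())  # Tue and Wed both become 1250.0
```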
Mistake: Data Leakage during Mean Imputation in Machine Learning.
Why is this disastrous?: You split your data into Training (80%) and
Validation (20%). Then you run `df.fillna(df.mean())` on the ENTIRE dataset. You
just destroyed your model! The generated Mean was mathematically polluted by the
Validation data. When your AI model trains on the Training data, it is subtly cheating,
because it has seen hints of the Validation data inside the imputed blanks.
Fix: Compute the Mean ONLY on the Training split, and apply
that exact single number to both splits.
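A leakage-free version of the fix, sketched with an invented blood-pressure column and a simple positional 80/20 split (a real pipeline would shuffle first):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'bp': [120.0, np.nan, 130.0, 140.0, np.nan]})

# Split FIRST, before computing any statistic.
train = df.iloc[:4].copy()   # 80%
valid = df.iloc[4:].copy()   # 20%

# The mean is computed from the Training rows only...
train_mean = train['bp'].mean()   # (120 + 130 + 140) / 3 = 130.0

# ...and that single frozen number fills the blanks in BOTH splits.
train['bp'] = train['bp'].fillna(train_mean)
valid['bp'] = valid['bp'].fillna(train_mean)
print(train_mean)  # 130.0
```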
The PyArrow Nullable Backend (Pandas 2.0+):
The Integer-to-Float "NaN virus" described above drove Data Engineers insane for years. In Pandas 2.0, the Apache Arrow memory format was integrated. Arrow keeps your massive integer array perfectly intact and stores a tiny parallel validity bitmap, one bit per value, flagging each slot as "Missing (0)" or "Valid (1)". You can now use `df['Col'].astype("int64[pyarrow]")` (with the `pyarrow` package installed) to get true, memory-efficient integer missing values without exploding into floats.
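The sketch below uses pandas' built-in NumPy-backed nullable `Int64` dtype, which needs no extra install; swapping the dtype string for `"int64[pyarrow]"` gives the Arrow-backed equivalent when the `pyarrow` package is installed:

```python
import pandas as pd
import numpy as np

# Classic NumPy backend: one NaN silently upcasts the whole column to float64.
legacy = pd.Series([1, 2, np.nan])
print(legacy.dtype)           # float64

# Nullable backend: integers stay integers, the hole becomes <NA>.
nullable = pd.Series([1, 2, pd.NA], dtype='Int64')
print(nullable.dtype)         # Int64, still an integer column
print(nullable.isna().sum())  # 1
```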