Missing Data
Contents
Missing Data#
Most data in the world is messy. It might be in a format that you will have trouble reading from or it might have errors in the data! One of the most common types of errors in datasets is missing data, where a row might have some of its column entries just missing!
For example, we have a dataset of fMRI (brain scan) data that, like most real-life datasets, is a little messy. The dataset has the following columns:
subject
: An identifier for the person being measuredtimepoint
: The time in the studyevent
: What type of stimulus the subject was givenregion
: Where the fMRI measurement was takensignal
: The measurement value
If we tried to load it into a DataFrame
, we would see the following.
import pandas as pd
df = pd.read_csv('fmri.csv')
print(df.tail()) # Prints the last rows
Notice that row 1059 has this weird value NaN
for both the timepoint and the signal!
NaN
, or “Not a Number”, is a common value in Python to represent the absence of a data-value. You should think of it a lot like None
but has some slightly different properties that make it a little more usable in some numerical processing applications (will be explained in the next section).
The fact that we see NaN
values in our data means the dataset is messy and we will have to deal with it. There is a myriad of ways to try to handle missing data, some more complicated than others. For right now, we will focus on how to detect missing data in the dataset and how to filter it out!
By default most pandas
operations just ignore missing values so this isn’t a problem if you are just in the pandas
world. However, we will see in the next module that other libraries for data visualization and machine learning will crash if we give them datasets that contain missing values, so we should learn how to deal with that now.
NaN
vs. None
#
NaN
is some kind of numerical equivalent to None
. It represents a number that is, in some sense, invalid or missing.
In Python, NaN
operates by two rules:
Any arithmetic operation on
NaN
, evaluates toNaN
Any boolean comparison on
NaN
, evaluates toFalse
We can access the value NaN
most easily by using the library numpy
(commonly imported as np
). We will learn more about numpy
in Module 7!
import numpy as np
print(np.nan) # nan
print(1 + np.nan) # nan
print(np.nan * 1) # nan
print(1 == np.nan) # False
print(np.nan == np.nan) # False
That last line is pretty surprising since we compared np.nan
to np.nan
. Remember though, one of the rules of NaN
is that every boolean comparison on NaN
is False
!
How is NaN
different than None
? None
doesn’t allow any numeric operations on it, it will cause an error!
print(1 + None)
NaN
in pandas
#
So now that we know what this magic-value NaN
is in our dataset, let’s see how to handle it in pandas
. Let’s start by taking the average of the 'signal'
column (that contains NaN
values).
import pandas as pd
df = pd.read_csv('fmri.csv')
print(df['signal'].mean())
Luckily for us, pandas
has some logic built into it to skip NaN
values for many of the simple operations like mean
! However, this won’t always work so we will need some special pandas
methods for finding and removing NaN
(although, they have a weird naming convention).
To detect if there are missing values:
isnull()
returns abool
Series
, whereTrue
marksNaN
values.notnull()
returns abool
Series
, whereTrue
marks non-NaN
values.
To return a new
DataFrame
withNaN
removed:dropna()
removes all rows with missing data.fillna(val)
replaces missing data with the given value
The following code block shows how these operations work.
import pandas as pd
df = pd.read_csv('fmri.csv')
# Returns all the rows in df where the signal is not NaN
df[df['signal'].notnull()]
# Returns all the rows in df that did not contain a NaN value
df.dropna()
# Replaces all NaN signals with 0
df['signal'] = df['signal'].fillna(0)
print(df)
Warning
Notice that the first two examples don’t modify df
. Why is that? Almost every operation on pandas
returns a new DataFrame
or Series
! For example, df.dropna()
returns a new DataFrame
with the NaN
rows missing. It’s not until the last example, do we actually save the return value into
df
so the result stays.
Warning
On take-home assessment 3, we will ask for you to deal with missing data. Think very carefully about which data you want to remove and how you will remove it. A common bug students run into involves removing too many rows or too few rows that have missing data for the relevant columns.