Practice/Reading: Pandas Tutorial
Contents
Practice/Reading: Pandas Tutorial#
Jupyter Notebooks
Reminder, that on this site the Jupyter Notebooks are read-only and you can’t interact with them. Click the button above to launch an interactive version of this notebook.
With Binder, you get a temporary Jupyter Notebook website that opens with this notebook. Any code you write will be lost when you close the tab. Make sure to download the notebook so you can save it for later!
With Colab, it will open Google Colaboratory. You can save the notebook there to your Google Drive. If you don’t save to your Drive, any code you write will be lost when you close the tab. You can find the data files for this notebook below:
You will need to run all the cells of the notebook to see the output. You can do this with hitting Shift-Enter
on each cell or clickin the “Run All” button above.
The first thing we will do is use the import
command to load the pandas
library. We will use this syntax shown below to “rename” pandas
to pd
so in your cells below, we only have to write out pd
whenever we want to use a pandas
feature.
import pandas as pd
Next, we will load the data from the CSV file tas.csv
that has the example data we were working with before. We will save it in a variable called df
(stands for data frame which is a common pandas
term). We do this with a provided function from pandas
called read_csv
.
df = pd.read_csv('tas.csv')
df
Name | Salary | |
---|---|---|
0 | Madrona | 3 |
1 | Ken | 1 |
2 | Ryan | 3 |
Notice that this shows the CSV in a tabular format! What is df
? It’s a pandas
object called a DataFrame
which stores a table of values, much like an Excel table.
Notice on the top row, it shows the name of the columns (Name
and Salary
) and on the left-most side, it shows an index for each row (0
, 1
, and 2
).
DataFrame
s are powerful because they provide lots of ways to access and perform computations on your data without you having to write much code!
Accessing a Column#
For example, you can get all of the TAs’ names with the following call.
df['Name']
0 Madrona
1 Ken
2 Ryan
Name: Name, dtype: object
df['Name']
returns another pandas
object called a Series
that represents a single column or row of a DataFrame
. A Series
is very similar to a list
from Python, but has many extra features that we will explore later.
Students sometimes get a little confused because this looks like df
is a dict
and it is trying to access a key named Name
. This is not the case! One of the reasons Python is so powerful is it lets people who program libraries “hook into” the syntax of the language to make their own custom meaning of the []
syntax! df
in this cell is really this special object defined by pandas
called a DataFrame
.
Problem 0#
In the cell below, write the code to access the Salary
column of the data and store it in a variable named ans0
! For testing purposes, your variable name has to exactly be ans0
.
# Write your answer here!
Now, pandas
is useful because it not only lets you access this data conveniently, but also perform computations on them.
A Series
object has many methods you can call on them to perform computation. Here is a list of some of the most useful ones:
mean
: Calculates the average value of theSeries
min
: Calculates the minimum value of theSeries
max
: Calculates the maximum value of theSeries
idxmin
: Calculates the index of the minimum value of theSeries
idxmax
: Calculates the index of the maximum value of theSeries
count
: Calculates the number of values in theSeries
unique
: Returns a newSeries
with all the unique values from theSeries
.And many more!
For example, if I wanted to compute the average Salary
of the TAs, I would write:
average_salary = df['Salary'].mean()
average_salary
2.3333333333333335
Reminder: Types matter#
When first learning pandas
, it’s easy to mix up DataFrame
and Series
.
A
DataFrame
is a 2-dimensional structure (it has rows and columns like a grid)Series
is 1-dimensional (it only has “one direction” like a single row or a single column).
When you access a single column (or as we will see later, a single row) of a DataFrame
, it returns a Series
.
Problem 1#
For this problem, you should compute the “range” of TA salaries (the maximum value - the minimum value
). For testing purposes, save the result in a variable called ans1
.
Hint: You might need to make two separate calls to pandas
to compute this since you need both the min and the max.
# Write your answer here!
Element-wise Operations#
For the rest of this slide, let’s consider a slightly more complex dataset that has a few more columns. This dataset tracks the emissions for cities around the world (but only has a few rows).
df2 = pd.read_csv('emissions.csv')
df2
city | country | emissions | population | |
---|---|---|---|---|
0 | New York | USA | 200 | 1500 |
1 | Paris | France | 48 | 42 |
2 | Beijing | China | 300 | 2000 |
3 | Nice | France | 40 | 60 |
4 | Seattle | USA | 100 | 1000 |
If we wanted to access the emissions column, we could write:
df2['emissions']
0 200
1 48
2 300
3 40
4 100
Name: emissions, dtype: int64
Or if we wanted to access the population columm, we could write:
df2['population']
0 1500
1 42
2 2000
3 60
4 1000
Name: population, dtype: int64
One useful feature of pandas
is it lets you combine values from different Series
. For example, if we wanted to, we could add the values of the emissions column and the population column.
df2['emissions'] + df2['population']
0 1700
1 90
2 2300
3 100
4 1100
dtype: int64
Notice, this returns a new Series
that represents the sum of those two columns. The first value in the Series
is the sum of the first values in the two that were added, the second is the sum of the second two, etc. It does not modify any of the columns of the dataset (you will need to do an assignment to change a value).
Problem 2#
In the cell below, find the maximum “emissions per capita” (emissions divided by population). Start by computing this value for each city and then find the maximum value of that Series
(using one of the Series
methods shown above). For testing purposes, save the result in a variable called ans2
.
Hint: You can save a Series
in a variable! It’s just like any other Python value!
# Write your answer here!
These element-wise computations also work if a one of the values is a single value rather than a Series
. For example, the following cell adds 4 to each of the populations. Notice this doesn’t modify the original DataFrame
, it just returns a new Series
with the old values plus 4.
df2['population'] + 4
0 1504
1 46
2 2004
3 64
4 1004
Name: population, dtype: int64
You can see here that the output of the Series
actually tells you a bit about the values to help you out! The dtype
property tells you the type of the data. In this case it uses a specialized integer type called int64
, but for all intents and purposes that’s really just like an int
. As a minor detail, it also stores the Name of the column the Series
came from for reference.
Another useful case for something like this is to compare the values of a column to a value. For example, the following cell computes which cities have an emissions value of 200 or more. Notice that the dtype
here is bool
since each value is a True/False
.
df2['emissions'] >= 200
0 True
1 False
2 True
3 False
4 False
Name: emissions, dtype: bool
Filtering Data#
You might have wondered why being able to compare a Series
to some value is something we deemed “useful” since it doesn’t seem like it does anything helpful. The power comes from using this bool
Series
to filter the DataFrame
to the rows you want.
For example, what if I wanted to print the names of the cities that have an emissions of 200 or more? I can use this bool
Series
to filter which rows I want! The syntax looks like the following cell.
df3 = df2[df2['emissions'] >= 200]
df3['city']
0 New York
2 Beijing
Name: city, dtype: object
That’s pretty cool how we can get this result without having to write any loops!
Notice the return value has type DataFrame
, so we can then use the syntax we learned at the beginning to grab a single column from that DataFrame
(thus returning a Series
).
The way this works is the indexing-notation for DataFrames
has special cases for which type of value you pass it.
If you pass it a
str
(e.g.,df2['emissions']
), it returns that column as aSeries
.If you pass it a
Series
withdtype=bool
(e.g.,df2[df2['emissions'] >= 200]
), it will return aDataFrame
of all the rows thatSeries
had aTrue
value for!
There is no magic with this, they just wrote an if-statement in their code to do different things based on the type provided!
We commonly call a Series
with dtype=bool
used for this context a mask. It usually makes your program more readable to save those masks in a variable. The following cell shows the exact same example, but adding a variable for readability for the mask.
high_emissions = df2['emissions'] >= 200
df3 = df2[high_emissions]
df3['city']
0 New York
2 Beijing
Name: city, dtype: object
Filtering on Multiple Conditions#
You can combine masks using logical operators to make complex queries. There are three logical operators for masks (like and
, or
, and not
but with different symbols).
&
does an element-wiseand
to combine two masks|
does an element-wiseor
to combine two masks~
does an element-wisenot
of a single mask
For example, if you want to find all cities that have high emissions or are in the US, you would probably try writing the following (but you’ll run into a bug).
df2[df2['emissions'] >= 200 | df2['country'] == 'USA']
The problem comes from precedence (order of operations). Just like how *
gets evaluated before +
, |
gets evaluated first because it has the highest precedence (so does &
). This makes Python interpret the first sub-expression as (200 | df['country']
), which causes an error since this operator is not defined for these types.
Whenever you run into ambiguities from precedence, one way you can always fix it is to put the sub-expressions in parentheses like in the following cell.
df2[(df2['emissions'] >= 200) | (df2['country'] == 'USA')]
city | country | emissions | population | |
---|---|---|---|---|
0 | New York | USA | 200 | 1500 |
2 | Beijing | China | 300 | 2000 |
4 | Seattle | USA | 100 | 1000 |
A much more readable solution involves saving each mask in a variable so you don’t have to worry about this precedence. This has an added benefit of giving each condition a human-readable name if you use good variable names!
high_emissions = df2['emissions'] >= 200
is_usa = df2['country'] == 'USA'
df2[high_emissions | is_usa]
city | country | emissions | population | |
---|---|---|---|---|
0 | New York | USA | 200 | 1500 |
2 | Beijing | China | 300 | 2000 |
4 | Seattle | USA | 100 | 1000 |
Problem 3#
In the cell below, write code to select all rows from the dataset that are in France and have a population greater than 50. For testing purposes, save the result in a variable called ans3
# Write your answer here!
Location#
We’ve shown you how to select specific columns or select specific rows based on a mask. In some sense, it’s a little confusing that df[val]
can be used to grab columns or rows depending on what is passed. This is because this syntax we have shown below, is really just special cases of a more generic syntax that lets you specify some location in the DataFrame
. pandas
provides this shorthand for convenience in some cases, but this more general syntax below works in many more!
In its most general form, the loc
property lets you specify a row indexer and a column indexer to specify which rows/columns you want. The syntax looks like the following (where things in <...>
are placeholders)
df.loc[<row indexer>, <column indexer>]
The row indexer refers to the index of the DataFrame
. Recall, when we display a DataFrame
, it shows values to the left of each row to identify each row in the DataFrame
.
It turns out the the column indexer is optional, so you can leave that out. For example, if I want to get the first row (row with index 0), I could write:
df2.loc[0]
city New York
country USA
emissions 200
population 1500
Name: 0, dtype: object
Interestingly, this actually returns a Series
! It looks different than the Series
returned from something like df['name']
since now it has an index that are the column names themselves! This means I could index into a specifc column by doing something like:
s = df2.loc[0]
s['city']
'New York'
Now this was a bit tedious to have to use double []
to access the column as well, which is exactly why loc
lets you specify a column as a “column indexer”. Instead, it’s more common to write:
df2.loc[0, 'city']
'New York'
You might be wondering: I’ve used the word “indexer” a few times but haven’t defined what that means! By indexer, I mean some value to indicate which rows/columns you want. So far, I have shown how to specify a single value as an indexer, but there are actually many options to choose from! You can always mix-and-match these and use different ones for the rows/cols.
List of indices and slices#
For example, you can use a list of values as an indexer to select many rows or many columns:
df2.loc[[1,2,3], ['city', 'country', 'emissions']]
city | country | emissions | |
---|---|---|---|
1 | Paris | France | 48 |
2 | Beijing | China | 300 |
3 | Nice | France | 40 |
Notice now it returns a DataFrame
instead of a single value.
You can also use slice syntax like you could for list
/str
to access a range of values. There are a couple oddities about this:
The start/stop points are both inclusive which is different than for
list
/str
where the stop point is exclusive.They do some fancy “magic” that let you use ranges with strings to get a range of column names.
For example
df2.loc[1:3, 'city':'emissions']
city | country | emissions | |
---|---|---|---|
1 | Paris | France | 48 |
2 | Beijing | China | 300 |
3 | Nice | France | 40 |
The way to read this loc
access is “all the rows starting at index 1 and to index 3 (both inclusive) and all the columns starting at city and going to emissions (both inclusive)”.
How does it define the “range of strings”? It uses the order of the columns in the DataFrame
.
Mask#
You can also use a bool
Series as an indexer to grab all the rows or columns that are marked True
. This is similar to masking we saw before, but you can now put the mask as a possible indexer.
high_emissions = df2['emissions'] >= 200
is_usa = df2['country'] == 'USA'
df2.loc[high_emissions | is_usa]
city | country | emissions | population | |
---|---|---|---|---|
0 | New York | USA | 200 | 1500 |
2 | Beijing | China | 300 | 2000 |
4 | Seattle | USA | 100 | 1000 |
Notice in the last cell, I left out the column indexer and it gave me all the columns (that is the default for the column indexer).
:
for everything#
Instead of relying on defaults, you can explicitly ask for “all of the columns” using the special range :
. This is a common syntax for many numerical processing libraries so pandas
adopts it too. It looks like the following
df2.loc[[0, 4, 2], :]
city | country | emissions | population | |
---|---|---|---|---|
0 | New York | USA | 200 | 1500 |
4 | Seattle | USA | 100 | 1000 |
2 | Beijing | China | 300 | 2000 |
You can also do this for the rows as well!
df2.loc[:, 'city']
0 New York
1 Paris
2 Beijing
3 Nice
4 Seattle
Name: city, dtype: object
A tip to help you read these in your head is to read :
by itself as “all”.
Recap Indexers#
So we saw the .loc
property here is kind of like a universal way of asking for your data. You can specify a row indexer and a column indexer to select your data. We saw the following things used as indexers:
A single value (row index for rows, column name for columns)
A list of values or a slice (row index for for rows, column names for columns)
A mask
:
to select all values
Return Values#
One thing that is also complex about .loc
is the type of the value returned depends on the types of the indexers. Recall that a pandas
DataFrame
is a 2-dimensional structure (rows and columns) while a Series
is a single row
or single column
.
To tell what the return type of a .loc
call is, you need to look for the “single value” type of indexer.
If both the row and column indexers are a single value, returns a single value. This will be whatever the value is at the location so its type will be the same as the
dtype
of the column it comes from.If only one of the row and column indexers is a single value (meaning the other is multiple values), returns a
Series
.If neither of the row and column indexers are single values (meaning both are multiple values), returns a
DataFrame
.