apply#
import pandas as pd
df = pd.read_csv('earthquakes.csv')
df
| | id | year | month | day | latitude | longitude | name | magnitude |
---|---|---|---|---|---|---|---|---|
0 | nc72666881 | 2016 | 7 | 27 | 37.672333 | -121.619000 | California | 1.43 |
1 | us20006i0y | 2016 | 7 | 27 | 21.514600 | 94.572100 | Burma | 4.90 |
2 | nc72666891 | 2016 | 7 | 27 | 37.576500 | -118.859167 | California | 0.06 |
3 | nc72666896 | 2016 | 7 | 27 | 37.595833 | -118.994833 | California | 0.40 |
4 | nn00553447 | 2016 | 7 | 27 | 39.377500 | -119.845000 | Nevada | 0.30 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
8389 | nc72685246 | 2016 | 8 | 25 | 36.515499 | -121.099831 | California | 2.42 |
8390 | ak13879193 | 2016 | 8 | 25 | 61.498400 | -149.862700 | Alaska | 1.40 |
8391 | nc72685251 | 2016 | 8 | 25 | 38.805000 | -122.821503 | California | 1.06 |
8392 | ci37672328 | 2016 | 8 | 25 | 34.308000 | -118.635333 | California | 1.55 |
8393 | ci37672360 | 2016 | 8 | 25 | 34.119167 | -116.933667 | California | 0.89 |
8394 rows × 8 columns
Last time, we learned that we can use regular arithmetic operators on pandas DataFrames or Series to transform them. For example, if we were working with the earthquakes data, we could multiply each magnitude by 2 using the following syntax to do an element-wise computation.
df['magnitude'] * 2
0 2.86
1 9.80
2 0.12
3 0.80
4 0.60
...
8389 4.84
8390 2.80
8391 2.12
8392 3.10
8393 1.78
Name: magnitude, Length: 8394, dtype: float64
What if we wanted to find the length of each value in the name column? You might try something like the following and hope it does an element-wise computation as well.
len(df['name'])
8394
That doesn’t look right… Last time we saw that you can use the len function to find the number of elements in a structure, so this is actually returning the number of elements in the Series df['name']!
For the most part, you can only do element-wise operations with:
- Arithmetic operators (e.g., +, -, *, etc.)
- Comparison operators (e.g., ==, <, etc.)
- Logical operators (&, |, ~)
This means anything else will act on the Series itself, just like the len function did!
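To make the three operator families concrete, here is a minimal sketch on a tiny Series. The values are hypothetical samples typed in by hand rather than loaded from earthquakes.csv, and the variable names are chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample magnitudes (not read from earthquakes.csv)
magnitudes = pd.Series([1.43, 4.90, 0.06])

# Arithmetic operator: computed once per element
doubled = magnitudes * 2

# Comparison operator: one True/False per element
is_big = magnitudes > 1.0

# Logical operator: combines two boolean Series element by element
in_range = (magnitudes > 1.0) & (magnitudes < 4.0)

print(is_big.tolist())    # [True, True, False]
print(in_range.tolist())  # [True, False, False]

# A regular function such as len acts on the whole Series instead
print(len(magnitudes))    # 3
```

Note the parentheses around each comparison when using & or |. They are needed because those operators bind more tightly than < and >.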
Built-in Functions in pandas#
The syntax looks a bit weird at first, but if you want to call the len function on each str, you have to use the syntax below.
df['name'].str.len()
0 10
1 5
2 10
3 10
4 6
..
8389 10
8390 6
8391 10
8392 10
8393 10
Name: name, Length: 8394, dtype: int64
This reads “Take the name column, and apply the len function defined for strs to each element in the Series”. It looks really odd at first, but it’s actually a nice syntax because it lets you be explicit about what type you want to treat the data as and which function to call on it! We won’t look at other types now, but there is a similar syntax for those as well.
You aren’t limited to just calling len here. You can call pretty much any str function using this syntax. For example, the following cell shows how to convert each name to its upper-case version.
df['name'].str.upper()
0 CALIFORNIA
1 BURMA
2 CALIFORNIA
3 CALIFORNIA
4 NEVADA
...
8389 CALIFORNIA
8390 ALASKA
8391 CALIFORNIA
8392 CALIFORNIA
8393 CALIFORNIA
Name: name, Length: 8394, dtype: object
Do note that this does not modify the original name column, but rather returns a new Series with all the names upper-cased.
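As a small self-contained sketch of the same idea, the cell below uses a hand-written Series of three names (hypothetical sample data, not the full dataset) and calls a few str functions on it. It also checks that the original Series is left untouched.

```python
import pandas as pd

# Hypothetical sample names (not read from earthquakes.csv)
names = pd.Series(['California', 'Burma', 'Nevada'])

lengths = names.str.len()                  # length of each name
upper = names.str.upper()                  # upper-cased copy of each name
starts_with_c = names.str.startswith('C')  # True/False for each name

print(lengths.tolist())        # [10, 5, 6]
print(upper.tolist())          # ['CALIFORNIA', 'BURMA', 'NEVADA']
print(starts_with_c.tolist())  # [True, False, False]

# The original Series was not modified by any of the calls above
print(names.tolist())          # ['California', 'Burma', 'Nevada']
```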
Apply#
What if you wanted to write your own function to transform a value and apply it to each element in a Series? For example, what if I wanted to grab the first two characters from each name?
This is where we will need the more general apply function defined for pandas objects. apply is more general than the specific str functions we saw above, since it lets you use almost any function for your data transformation.
Before we show how to do the specific example of grabbing the first two characters from the names, let’s use this new approach to find the len of each name. We first show how to do this, and then explain what is happening.
df['name'].apply(len)
0 10
1 5
2 10
3 10
4 6
..
8389 10
8390 6
8391 10
8392 10
8393 10
Name: name, Length: 8394, dtype: int64
The first part, df['name'].apply(, should probably make some sense to you. We are calling some function named apply on the Series df['name']. What’s very strange about this is that it seems to be passing len as a parameter to this apply function!
While this does look very strange, this is totally allowed in Python. A function is, in some sense, just like any other value in Python. In fact, the name of a function is treated the same as any variable name!
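Here is a small plain-Python sketch of that idea, with no pandas involved. The names shout, measure, and call_on are hypothetical and exist only for this illustration.

```python
def shout(s):
    """Returns an upper-cased version of s with an exclamation mark added."""
    return s.upper() + '!'

# No parentheses after len: this refers to the function itself
# instead of calling it, so measure is now another name for len.
measure = len
print(measure('dogs'))  # 4

def call_on(fun, value):
    """Calls the given one-parameter function on value and returns the result."""
    return fun(value)

# Functions can be passed as arguments like any other value
print(call_on(len, 'love'))    # 4
print(call_on(shout, 'love'))  # LOVE!
```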
So the authors of pandas who wrote the apply function wrote it to take a parameter that is ANOTHER function. They then call that function on each element in the Series.
The cell below implements something sort of like this behavior, but using lists instead.
def list_apply(values, fun):
    """
    Takes a list of values and a function, and applies that function
    to each value in values. The given function must take one parameter
    as input and the returned list will be the result of calling that
    function once for each value in the list.
    """
    # It's not necessary to use a list comprehension here,
    # but it's the easiest way to write this method!
    return [fun(v) for v in values]

list_apply(['I', 'love', 'dogs'], len)
[1, 4, 4]
There is no restriction to only passing in the len function as a parameter here. You can pass any function that takes a single argument.
In the cell below, we will define a new function first_two that takes a str and returns the first two characters, and then pass that to apply.
def first_two(s):
    """
    Returns the first two characters of the given str as a str.
    Assumes there are at least two characters in s.
    """
    return s[:2]

df['name'].apply(first_two)
0 Ca
1 Bu
2 Ca
3 Ca
4 Ne
..
8389 Ca
8390 Al
8391 Ca
8392 Ca
8393 Ca
Name: name, Length: 8394, dtype: object
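As a rough comparison, the sketch below runs the same transformation three ways on a tiny hand-written Series (hypothetical sample data): with a named function, with a lambda, and with slicing through the str syntax from earlier. All three should produce the same values. The lambda and str-slicing variants are additions for illustration, not something the cells above rely on.

```python
import pandas as pd

# Hypothetical sample names (not read from earthquakes.csv)
names = pd.Series(['California', 'Burma', 'Nevada'])

def first_two(s):
    """Returns the first two characters of the given str as a str."""
    return s[:2]

# 1. Pass a named function to apply
by_function = names.apply(first_two)

# 2. Pass an unnamed (lambda) function that does the same thing
by_lambda = names.apply(lambda s: s[:2])

# 3. Slice each str through the .str syntax, without apply
by_accessor = names.str[:2]

print(by_function.tolist())  # ['Ca', 'Bu', 'Ne']
print(by_lambda.tolist())    # ['Ca', 'Bu', 'Ne']
print(by_accessor.tolist())  # ['Ca', 'Bu', 'Ne']
```

When a matching str function already exists, it is usually the shorter option. apply is the tool to reach for when the transformation is something you have to write yourself.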
Saving Results#
Remember that apply doesn’t modify any data in the DataFrame or Series, but rather returns a new one. It’s common to want to save the result of an apply to your dataset so you can use those values later. Just like how you can use the [] syntax to select columns from a DataFrame, you can use it to set columns in a DataFrame.
Below, we create a new column in the dataset by assigning to the new column name. Notice that df now has this extra column.
df['first_two_letters'] = df['name'].apply(first_two)
df
| | id | year | month | day | latitude | longitude | name | magnitude | first_two_letters |
---|---|---|---|---|---|---|---|---|---|
0 | nc72666881 | 2016 | 7 | 27 | 37.672333 | -121.619000 | California | 1.43 | Ca |
1 | us20006i0y | 2016 | 7 | 27 | 21.514600 | 94.572100 | Burma | 4.90 | Bu |
2 | nc72666891 | 2016 | 7 | 27 | 37.576500 | -118.859167 | California | 0.06 | Ca |
3 | nc72666896 | 2016 | 7 | 27 | 37.595833 | -118.994833 | California | 0.40 | Ca |
4 | nn00553447 | 2016 | 7 | 27 | 39.377500 | -119.845000 | Nevada | 0.30 | Ne |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8389 | nc72685246 | 2016 | 8 | 25 | 36.515499 | -121.099831 | California | 2.42 | Ca |
8390 | ak13879193 | 2016 | 8 | 25 | 61.498400 | -149.862700 | Alaska | 1.40 | Al |
8391 | nc72685251 | 2016 | 8 | 25 | 38.805000 | -122.821503 | California | 1.06 | Ca |
8392 | ci37672328 | 2016 | 8 | 25 | 34.308000 | -118.635333 | California | 1.55 | Ca |
8393 | ci37672360 | 2016 | 8 | 25 | 34.119167 | -116.933667 | California | 0.89 | Ca |
8394 rows × 9 columns
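The same pattern can be checked on a tiny DataFrame. This is a minimal sketch with made-up rows and a hypothetical column name, name_length. It shows that calling apply alone leaves the columns unchanged, and that the new column only appears after the assignment.

```python
import pandas as pd

# Hypothetical sample rows (not read from earthquakes.csv)
quakes = pd.DataFrame({
    'name': ['California', 'Burma', 'Nevada'],
    'magnitude': [1.43, 4.90, 0.30],
})

# apply returns a new Series; quakes itself is not changed yet
result = quakes['name'].apply(len)
print(list(quakes.columns))  # ['name', 'magnitude']

# Assigning with [] stores the result as a new column
quakes['name_length'] = result
print(list(quakes.columns))  # ['name', 'magnitude', 'name_length']
print(quakes['name_length'].tolist())  # [10, 5, 6]
```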