apply

import pandas as pd
df = pd.read_csv('earthquakes.csv')
df
              id  year  month  day   latitude    longitude        name  magnitude
0     nc72666881  2016      7   27  37.672333  -121.619000  California       1.43
1     us20006i0y  2016      7   27  21.514600    94.572100       Burma       4.90
2     nc72666891  2016      7   27  37.576500  -118.859167  California       0.06
3     nc72666896  2016      7   27  37.595833  -118.994833  California       0.40
4     nn00553447  2016      7   27  39.377500  -119.845000      Nevada       0.30
...          ...   ...    ...  ...        ...          ...         ...        ...
8389  nc72685246  2016      8   25  36.515499  -121.099831  California       2.42
8390  ak13879193  2016      8   25  61.498400  -149.862700      Alaska       1.40
8391  nc72685251  2016      8   25  38.805000  -122.821503  California       1.06
8392  ci37672328  2016      8   25  34.308000  -118.635333  California       1.55
8393  ci37672360  2016      8   25  34.119167  -116.933667  California       0.89

8394 rows × 8 columns

Last time, we learned that we can use regular arithmetic operators on pandas DataFrames or Series to transform them. For example, if we were working with the earthquakes data, we could multiply each magnitude by 2 using the following syntax to do an element-wise computation.

df['magnitude'] * 2
0       2.86
1       9.80
2       0.12
3       0.80
4       0.60
        ... 
8389    4.84
8390    2.80
8391    2.12
8392    3.10
8393    1.78
Name: magnitude, Length: 8394, dtype: float64

What if we wanted to find the length of each value in the name column? You might try something like the following and hope it does an element-wise computation as well.

len(df['name'])
8394

That doesn’t look right… Last time we saw you can use the len function to find the number of elements in a structure, so this is actually returning the number of elements in the Series df['name']!

For the most part, you can only do element-wise operations with:

  • Arithmetic operators (e.g., +, -, *, etc.)

  • Comparison operators (e.g., ==, <, etc.)

  • Logical operators (&, |, ~)

This means anything else will act on the Series itself, just like this len function did!
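To make the list above concrete, here is a small sketch using a made-up Series of three magnitudes (the values and the variable names are assumptions for illustration, not read from earthquakes.csv). Each of the three kinds of operators produces one result per element, while len still answers a question about the whole Series.

```python
import pandas as pd

# Hypothetical sample values, not read from earthquakes.csv
magnitudes = pd.Series([1.43, 4.90, 0.06])

# Arithmetic operator: one result per element
doubled = magnitudes * 2

# Comparison operator: one True/False per element
is_large = magnitudes > 1.0

# Logical operator: combines two boolean Series element by element
in_range = (magnitudes > 1.0) & (magnitudes < 4.0)

print(is_large.tolist())   # [True, True, False]
print(in_range.tolist())   # [True, False, False]

# len is not element-wise: it counts the elements in the Series
print(len(magnitudes))     # 3
```

Note the parentheses around each comparison when using &. Without them, Python's operator precedence would group the expression differently.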

Built-in Functions in pandas

The syntax looks a bit weird at first, but if you want to call the len function on each str, you can use the syntax below.

df['name'].str.len()
0       10
1        5
2       10
3       10
4        6
        ..
8389    10
8390     6
8391    10
8392    10
8393    10
Name: name, Length: 8394, dtype: int64

This reads “Take the name column, and apply the len function defined for strs to each element in the Series”. It looks really odd at first, but it’s actually a nice syntax because it lets you be explicit about which type you want to treat the data as and which function to call on it! We won’t look at other types now, but there is a similar syntax for those as well.

You aren’t limited to just calling len here; you can call pretty much any str function using this syntax. For example, the following cell shows how to convert each name to its upper-case version.

df['name'].str.upper()
0       CALIFORNIA
1            BURMA
2       CALIFORNIA
3       CALIFORNIA
4           NEVADA
           ...    
8389    CALIFORNIA
8390        ALASKA
8391    CALIFORNIA
8392    CALIFORNIA
8393    CALIFORNIA
Name: name, Length: 8394, dtype: object

Do note that this does not modify the original name column, but rather returns a new Series with all the names upper-cased.
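If you want to convince yourself of this, the sketch below uses a tiny made-up Series (the names variable and its three values are assumptions for illustration) and checks the original after calling .str.upper().

```python
import pandas as pd

# Hypothetical sample of place names
names = pd.Series(['California', 'Burma', 'Nevada'])

# .str.upper() builds and returns a new Series
upper_names = names.str.upper()

print(upper_names.tolist())  # ['CALIFORNIA', 'BURMA', 'NEVADA']
print(names.tolist())        # ['California', 'Burma', 'Nevada']
```

The original Series is unchanged, so if you want to keep the upper-cased values you need to store the returned Series somewhere.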

Apply

What if you wanted to write your own function to transform a value and apply it to each element in a Series? For example, what if I wanted to grab the first two characters from each name?

This is where we will need the more general apply function defined for pandas objects. apply is more general than the specific str functions we saw above, since it lets you use almost any function for your data transformation.

Before we show how to do the specific example of grabbing the first two characters from the names, let’s use this new approach to find the len of each name. We first show how to do this, and then explain what is happening.

df['name'].apply(len)
0       10
1        5
2       10
3       10
4        6
        ..
8389    10
8390     6
8391    10
8392    10
8393    10
Name: name, Length: 8394, dtype: int64

The first part, df['name'].apply(, should probably make some sense to you. We are calling some function named apply on the Series df['name']. What’s very strange about this is it seems to be passing len as a parameter to this apply function!!!

While this does look very strange, this is totally allowed in Python. A function is, in some sense, just like any other value in Python. In fact, the name of a function is treated the same as any variable name!
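Here is a minimal sketch of what “a function is just a value” means in plain Python, with no pandas involved. The names measure and call_on are made up for this illustration.

```python
# No parentheses after len: this refers to the function instead of calling it
measure = len

print(measure('Nevada'))   # 6
print(measure is len)      # True, both names refer to the same function


def call_on(fun, value):
    """Hypothetical helper: calls the given function on the given value."""
    return fun(value)


# len is passed as an argument, just like any other value
print(call_on(len, 'Burma'))  # 5
```

The key detail is the missing parentheses: len refers to the function, while len('Burma') calls it.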

So the authors of pandas wrote the apply function to take a parameter that is ANOTHER function. apply then calls that function on each element in the Series.

The cell below implements something sort of like this behavior but using lists instead.

def list_apply(values, fun):
    """
    Takes a list of values and a function, and applies that function
    to each value in values. The given function must take one parameter
    as input and the returned list will be the result of calling that
    function once for each value in the list.
    """
    # It's not necessary to use a list comprehension here,
    # but it's the easiest way to write this function!
    return [fun(v) for v in values]

list_apply(['I', 'love', 'dogs'], len)
[1, 4, 4]

There is no restriction to only passing in the len function as a parameter here. You can pass any function that takes a single argument.
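As a quick sketch of that flexibility, the cell below repeats list_apply so it runs on its own, then passes it a few different one-argument functions. add_exclamation is a made-up helper used only for this illustration.

```python
def list_apply(values, fun):
    """Same idea as the list_apply defined above."""
    return [fun(v) for v in values]


def add_exclamation(s):
    """Hypothetical helper: returns s with a '!' added to the end."""
    return s + '!'


words = ['I', 'love', 'dogs']

# A str method, referred to through the str type
print(list_apply(words, str.upper))        # ['I', 'LOVE', 'DOGS']

# A function we wrote ourselves
print(list_apply(words, add_exclamation))  # ['I!', 'love!', 'dogs!']

# Another built-in function, this time on numbers
print(list_apply([-3, 2, -1], abs))        # [3, 2, 1]
```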

In the cell below, we will define a new function first_two that takes a str and returns the first two characters and then pass that to apply.

def first_two(s):
    """
    Returns the first two characters of the given str as a str.
    
    Assumes there are at least two characters in s.
    """
    return s[:2]

df['name'].apply(first_two)
0       Ca
1       Bu
2       Ca
3       Ca
4       Ne
        ..
8389    Ca
8390    Al
8391    Ca
8392    Ca
8393    Ca
Name: name, Length: 8394, dtype: object
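For this particular transformation, the .str syntax from earlier also supports slicing, so it can serve as a cross-check. The sketch below uses a small made-up Series (an assumption for illustration, not the earthquakes data) and compares the two approaches.

```python
import pandas as pd


def first_two(s):
    """Returns the first two characters of the given str as a str."""
    return s[:2]


# Hypothetical sample of place names
names = pd.Series(['California', 'Burma', 'Nevada'])

with_apply = names.apply(first_two)  # our own function, called once per element
with_str = names.str[:2]             # the same slice written with .str

print(with_apply.tolist())                       # ['Ca', 'Bu', 'Ne']
print(with_apply.tolist() == with_str.tolist())  # True
```

apply is still the more flexible tool: it works with functions that have no .str equivalent.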

Saving Results

Remember this apply function doesn’t modify any data in the DataFrame or Series, but rather returns a new one. It’s common that you want to save the result of an apply to your dataset to use those values later. Just like how you can use the [] syntax to select columns from a DataFrame, you can use it to set columns in a DataFrame.

Below, we create a new column in the dataset by assigning to the new column name. Notice that df now has this extra column.

df['first_two_letters'] = df['name'].apply(first_two)
df
              id  year  month  day   latitude    longitude        name  magnitude  first_two_letters
0     nc72666881  2016      7   27  37.672333  -121.619000  California       1.43                 Ca
1     us20006i0y  2016      7   27  21.514600    94.572100       Burma       4.90                 Bu
2     nc72666891  2016      7   27  37.576500  -118.859167  California       0.06                 Ca
3     nc72666896  2016      7   27  37.595833  -118.994833  California       0.40                 Ca
4     nn00553447  2016      7   27  39.377500  -119.845000      Nevada       0.30                 Ne
...          ...   ...    ...  ...        ...          ...         ...        ...                ...
8389  nc72685246  2016      8   25  36.515499  -121.099831  California       2.42                 Ca
8390  ak13879193  2016      8   25  61.498400  -149.862700      Alaska       1.40                 Al
8391  nc72685251  2016      8   25  38.805000  -122.821503  California       1.06                 Ca
8392  ci37672328  2016      8   25  34.308000  -118.635333  California       1.55                 Ca
8393  ci37672360  2016      8   25  34.119167  -116.933667  California       0.89                 Ca

8394 rows × 9 columns
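The same pattern can be tried without the data file. The sketch below builds a tiny made-up DataFrame (the quakes variable and its values are assumptions for illustration) and shows that the new column only appears once the result of apply is assigned.

```python
import pandas as pd


def first_two(s):
    """Returns the first two characters of the given str as a str."""
    return s[:2]


# Hypothetical stand-in for the earthquakes data
quakes = pd.DataFrame({
    'name': ['California', 'Burma', 'Nevada'],
    'magnitude': [1.43, 4.90, 0.30],
})

# apply returns a new Series; quakes itself is not changed yet
result = quakes['name'].apply(first_two)
print(list(quakes.columns))   # ['name', 'magnitude']

# Assigning to a new column name stores the result in the DataFrame
quakes['first_two_letters'] = result
print(list(quakes.columns))   # ['name', 'magnitude', 'first_two_letters']
print(quakes['first_two_letters'].tolist())  # ['Ca', 'Bu', 'Ne']
```

Assigning to a column name that already exists would replace that column's values, so pick a new name if you want to keep the original data.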