apply

Jupyter Info

Reminder: on this site, the Jupyter Notebooks are read-only and you can’t interact with them. Click the button above to launch an interactive version of this notebook.

With Binder, you get a temporary Jupyter Notebook website that opens with this notebook. Any code you write will be lost when you close the tab. Make sure to download the notebook so you can save it for later!
With Colab, the notebook opens in Google Colaboratory. You can save the notebook there to your Google Drive. If you don’t save to your Drive, any code you write will be lost when you close the tab.

You can find the data files for this notebook below:
- earthquakes.csv

You will need to run all the cells of the notebook to see the output. You can do this by pressing Shift-Enter on each cell or clicking the “Run All” button above.

import pandas as pd

df = pd.read_csv('earthquakes.csv')
df

	id	year	month	day	latitude	longitude	name	magnitude
0	nc72666881	2016	7	27	37.672333	-121.619000	California	1.43
1	us20006i0y	2016	7	27	21.514600	94.572100	Burma	4.90
2	nc72666891	2016	7	27	37.576500	-118.859167	California	0.06
3	nc72666896	2016	7	27	37.595833	-118.994833	California	0.40
4	nn00553447	2016	7	27	39.377500	-119.845000	Nevada	0.30
...	...	...	...	...	...	...	...	...
8389	nc72685246	2016	8	25	36.515499	-121.099831	California	2.42
8390	ak13879193	2016	8	25	61.498400	-149.862700	Alaska	1.40
8391	nc72685251	2016	8	25	38.805000	-122.821503	California	1.06
8392	ci37672328	2016	8	25	34.308000	-118.635333	California	1.55
8393	ci37672360	2016	8	25	34.119167	-116.933667	California	0.89

8394 rows × 8 columns

Last time, we learned that we can use regular arithmetic operators on pandas DataFrames or Series to transform them. For example, if we were working with the earthquakes data, we could multiply each magnitude by 2 using the following syntax to do an element-wise computation.

df['magnitude'] * 2

0       2.86
1       9.80
2       0.12
3       0.80
4       0.60
        ... 
8389    4.84
8390    2.80
8391    2.12
8392    3.10
8393    1.78
Name: magnitude, Length: 8394, dtype: float64

What if we wanted to find the length of each value in the name column? You might try something like the following and hope it does an element-wise computation as well.

len(df['name'])

8394

That doesn’t look right… Last time we saw you can use the len function to find the number of elements in a structure, so this is actually returning the number of elements in the Series df['name']!

For the most part, you can only do element-wise operations with:

- Arithmetic operators (e.g., +, -, *)
- Comparison operators (e.g., ==, <)
- Logical operators (&, |, ~)

This means anything else will act on the Series itself, just like this len function did!
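To make the distinction concrete, here is a minimal sketch on a tiny made-up Series. The name `mags` and its values are illustrative assumptions, not rows from earthquakes.csv. The three operator families each produce one result per element, while `len` produces a single number for the whole Series.

```python
import pandas as pd

# Hypothetical magnitudes, chosen only for illustration
mags = pd.Series([1.5, 4.0, 0.5])

doubled = mags * 2                      # arithmetic: one result per element
is_big = mags > 1.0                     # comparison: one bool per element
in_range = (mags > 1.0) & (mags < 2.0)  # logical: combines bools element-wise
count = len(mags)                       # built-in function: acts on the whole Series

print(doubled.tolist())   # [3.0, 8.0, 1.0]
print(is_big.tolist())    # [True, True, False]
print(in_range.tolist())  # [True, False, False]
print(count)              # 3
```

Note the parentheses around each comparison in `in_range`: `&` binds more tightly than `>` and `<`, so leaving them out changes the meaning of the expression.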

Built-in Functions in `pandas`

The syntax looks a bit weird at first, but if you want to call the len function on each str, you have to use this syntax below.

df['name'].str.len()

0       10
1        5
2       10
3       10
4        6
        ..
8389    10
8390     6
8391    10
8392    10
8393    10
Name: name, Length: 8394, dtype: int64

This reads “Take the name column, and apply the len function defined for strs to each element in the Series”. It looks really odd at first, but it’s actually a nice syntax because it lets you be explicit about which type you want to treat the data as and which function to call on it! We won’t look at other types now, but there is a similar syntax for some of them as well (for example, .dt for dates and times).

You aren’t limited to just calling len here; you can call pretty much any str function using this syntax. For example, the following cell shows how to convert each name to its upper-case version.

df['name'].str.upper()

0       CALIFORNIA
1            BURMA
2       CALIFORNIA
3       CALIFORNIA
4           NEVADA
           ...    
8389    CALIFORNIA
8390        ALASKA
8391    CALIFORNIA
8392    CALIFORNIA
8393    CALIFORNIA
Name: name, Length: 8394, dtype: object

Do note that this does not modify the original name column, but rather returns a new Series with all the names upper-cased.
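A quick way to convince yourself of this is to compare the original and the result side by side. The sketch below uses a small made-up Series (the name `names` and its three values are illustrative assumptions) rather than the full dataset.

```python
import pandas as pd

# Hypothetical names, chosen only for illustration
names = pd.Series(['California', 'Burma', 'Nevada'])

# .str.upper() builds and returns a brand new Series
upper_names = names.str.upper()

print(upper_names.tolist())  # ['CALIFORNIA', 'BURMA', 'NEVADA']
print(names.tolist())        # ['California', 'Burma', 'Nevada'] (unchanged)
```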

Apply

What if you wanted to write your own function to transform a value and apply it to each element in a Series? For example, what if I wanted to grab the first two characters from each name?

This is where we will need the more general apply function defined for pandas objects. apply is more general than the specific str functions we saw above, since it lets you use almost any function for your data transformation.

Before we show how to do the specific example of grabbing the first two characters from the names, let’s use this new approach to find the len of each name. We first show how to do this, and then explain what is happening.

df['name'].apply(len)

0       10
1        5
2       10
3       10
4        6
        ..
8389    10
8390     6
8391    10
8392    10
8393    10
Name: name, Length: 8394, dtype: int64

The first part, df['name'].apply(, should probably make some sense to you. We are calling some function named apply on the Series df['name']. What’s very strange about this is it seems to be passing len as a parameter to this apply function!!!

While this does look very strange, this is totally allowed in Python. A function is, in some sense, just like any other value in Python. In fact, the name of a function is treated the same as any variable name!
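The sketch below illustrates this idea in plain Python, with no pandas involved. The names `measure`, `shout`, and `call_it` are hypothetical and exist only for this example.

```python
def shout(s):
    """Hypothetical example function: upper-cases s and adds '!'."""
    return s.upper() + '!'


def call_it(fun, value):
    """Hypothetical helper: calls the given function on the given value."""
    return fun(value)


# No parentheses after len: we are not calling it, just giving it another name
measure = len
print(measure('dogs'))         # 4, same as len('dogs')

# A function can be passed as an argument like any other value
print(call_it(len, 'dogs'))    # 4
print(call_it(shout, 'dogs'))  # DOGS!
```

The key detail is the missing parentheses: `len` refers to the function itself, while `len('dogs')` calls it and produces a number.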

So the authors of pandas wrote the apply function to take a parameter that is ANOTHER function. They then call that function on each element in the Series.

The cell below implements something sort of like this behavior but using lists instead.

def list_apply(values, fun):
    """
    Takes a list of values and a function, and applies that function
    to each value in values. The given function must take one parameter
    as input and the returned list will be the result of calling that
    function once for each value in the list.
    """
    # It's not necessary to use a list comprehension here, 
    # but it's the easiest way to write this method!
    return [fun(v) for v in values]

list_apply(['I', 'love', 'dogs'], len)

[1, 4, 4]

There is no restriction to only passing in the len function as a parameter here. You can pass any function that takes a single argument.

In the cell below, we will define a new function first_two that takes a str and returns the first two characters and then pass that to apply.

def first_two(s):
    """
    Returns the first two characters of the given str as a str.
    
    Assumes there are at least two characters in s.
    """
    return s[:2]

df['name'].apply(first_two)

0       Ca
1       Bu
2       Ca
3       Ca
4       Ne
        ..
8389    Ca
8390    Al
8391    Ca
8392    Ca
8393    Ca
Name: name, Length: 8394, dtype: object
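For this particular transformation, the str syntax from earlier also supports slicing, so the same result can be reached without writing a function. The sketch below compares the two approaches on a small made-up Series (the name `names` and its values are illustrative assumptions). apply remains the more general tool, since it works with functions that have no str equivalent.

```python
import pandas as pd


def first_two(s):
    """Returns the first two characters of the given str as a str."""
    return s[:2]


# Hypothetical names, chosen only for illustration
names = pd.Series(['California', 'Burma', 'Nevada'])

via_apply = names.apply(first_two)  # our own function, called once per element
via_str = names.str[:2]             # slicing through the str syntax

print(via_apply.tolist())                      # ['Ca', 'Bu', 'Ne']
print(via_apply.tolist() == via_str.tolist())  # True
```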

Saving Results

Remember this apply function doesn’t modify any data in the DataFrame or Series, but rather returns a new one. It’s common that you want to save the result of an apply to your dataset to use those values later. Just like how you can use the [] syntax to select columns from a DataFrame, you can use it to set columns in a DataFrame.

Below, we create a new column in the dataset by assigning to the new column name. Notice that df now has this extra column.

df['first_two_letters'] = df['name'].apply(first_two)
df

	id	year	month	day	latitude	longitude	name	magnitude	first_two_letters
0	nc72666881	2016	7	27	37.672333	-121.619000	California	1.43	Ca
1	us20006i0y	2016	7	27	21.514600	94.572100	Burma	4.90	Bu
2	nc72666891	2016	7	27	37.576500	-118.859167	California	0.06	Ca
3	nc72666896	2016	7	27	37.595833	-118.994833	California	0.40	Ca
4	nn00553447	2016	7	27	39.377500	-119.845000	Nevada	0.30	Ne
...	...	...	...	...	...	...	...	...	...
8389	nc72685246	2016	8	25	36.515499	-121.099831	California	2.42	Ca
8390	ak13879193	2016	8	25	61.498400	-149.862700	Alaska	1.40	Al
8391	nc72685251	2016	8	25	38.805000	-122.821503	California	1.06	Ca
8392	ci37672328	2016	8	25	34.308000	-118.635333	California	1.55	Ca
8393	ci37672360	2016	8	25	34.119167	-116.933667	California	0.89	Ca

8394 rows × 9 columns
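The same pattern can be tried on a tiny DataFrame to see the difference between discarding and saving a result. In the sketch below, the DataFrame `quakes` and the column name `name_length` are made up for illustration and are not part of the earthquakes dataset.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration
quakes = pd.DataFrame({
    'name': ['California', 'Burma', 'Nevada'],
    'magnitude': [1.43, 4.90, 0.30],
})

# Calling apply without assigning the result leaves quakes unchanged
quakes['name'].apply(len)
print(list(quakes.columns))  # ['name', 'magnitude']

# Assigning to a new column name saves the result in the DataFrame
quakes['name_length'] = quakes['name'].apply(len)
print(list(quakes.columns))            # ['name', 'magnitude', 'name_length']
print(quakes['name_length'].tolist())  # [10, 5, 6]
```

Assigning to an existing column name works the same way, but it overwrites the old values, so a new name is the safer choice when the original data is still needed.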
