Joins#


We’re going to take a quick step away from geospatial data for the next flew slides, but this concept will be very relevant for the example we will be doing with geospatial data.

Let’s go back to the world of CSV data that we would process with pandas . Suppose I had two different DataFrames shown below. The left one tas stores information about the TAs and grading on the right stores information about students and which TA will be grading them ( ta_id and grader_id will match if that TA is grading that student). Our final goal will be trying to find the number of students each TA will grade.

tas

ta_name

ta_id

Ryan

1

James

2

Nicole

3

grading

grader_id

student_name

2

Flora

3

Paul

1

Wen

3

Andrew

You might wonder why I would represent data in this way such that they are in separate tables. This is a very common situation when you have data coming from many sources! You might have gotten two datasets from completely different sources, but you know there is an identifier column that relates information in one table to that of another.

The process of trying to combine two datasets in such a way to “align” them based on some column values is called a join. If we were to join these columns on ta_id and grader_id, you would imagine that the process looks something like this pseudo-code (this is not real Python!).

for t in tas:
  for g in grading:
    if t['ta_id'] == g['grader_id']:
        output(t, g)

To explain in English, the process of joining them on ta_id and grader_id finds all pairs of rows from each table, and keeps them if they match on ta_id and grader_id.

Now this pseudo-code isn’t going to actually look like what we will write for pandas. Joining is such a common operation with data that pandas provides a function to do so that looks like the following (there is a bit of setup to create the datasets):

import pandas as pd

# Make the tas DataFrame
tas = pd.DataFrame([
    {'ta_name': 'Ryan',   'ta_id': 1},
    {'ta_name': 'James',  'ta_id': 2},
    {'ta_name': 'Nicole', 'ta_id': 3},
])
print('tas')
print(tas)
print()

# Make the grading DataFrame
grading = pd.DataFrame([
    {'grader_id': 2, 'student_name': 'Flora'},
    {'grader_id': 3, 'student_name': 'Paul'},
    {'grader_id': 1, 'student_name': 'Wen'},
    {'grader_id': 3, 'student_name': 'Andrew'},

])
print('grading')
print(grading)
print()

# Join the datasets by ta_id and grader_id
merged = tas.merge(grading, left_on='ta_id', right_on='grader_id')
print('merged')
print(merged)

Notice that the output contains all pairs of rows that match up on ta_id and grader_id! There are now two rows for Nicole because there were two rows in the original grading table that had her ID. At this point merged is a new DataFrame with the four columns shown in the output above. To compute how many students each TA was grading, we could easily solve this with a groupby on this new DataFrame !

You might be wondering, what would happen to rows that don’t “line up”? What if there was a TA who didn’t have anyone they were grading or a student had a grader_id for a grader that doesn’t exist in the other table? Suppose we had the following dataset instead. We removed James from the left table and changed Wen to be graded by Nicole on the right.

tas

ta_name

ta_id

Ryan

1

Nicole

3

grading

grader_id

student_name

2

Flora

3

Paul

3

Wen

3

Andrew

Let’s see what happens when we run merge on these datasets.

import pandas as pd

# Make the tas DataFrame
tas = pd.DataFrame([
    {'ta_name': 'Ryan',   'ta_id': 1},
    {'ta_name': 'Nicole', 'ta_id': 3},
])
print('tas')
print(tas)
print()

# Make the grading DataFrame
grading = pd.DataFrame([
    {'grader_id': 2, 'student_name': 'Flora'},
    {'grader_id': 3, 'student_name': 'Paul'},
    {'grader_id': 3, 'student_name': 'Wen'},
    {'grader_id': 3, 'student_name': 'Andrew'},

])
print('grading')
print(grading)
print()

# Join the datasets by ta_id and grader_id
merged = tas.merge(grading, left_on='ta_id', right_on='grader_id')
print('merged')
print(merged)

Nicole is the only TA represented in this result! The rows that have IDs that did not “line up” with any other row got tossed out. This makes sense in the context of the pseudo-code we showed above since they didn’t have any matches. The topic of what to do with rows with missing values is very interesting and is the topic of the next slide.