Practice: Joins#

Suppose we were working with the following two datasets shown below. For each question, report the number of rows in the resulting DataFrame (remember, that the number of rows does not include the column header).

movies

movie_name

year

movie_id

directed_by

Lady Bird

2017

51

23

Grand Budapest Hotel

2014

47

16

Parasite

2019

103

14

Frozen

2013

34

18

Moonrise Kingdom

2012

37

16

directors

director_name

director_id

Bong Joon Ho

14

Greta Gerwig

23

Wes Anderson

16

Quentin Tarantino

21

Kathryn Bigelow

27

Question 0#

How many data rows are there in result ?

result = movies.merge(directors, left_on='directed_by',
                      right_on='director_id')

Your Task

Write your answer down in your own space.

Question 1#

How many data rows are there in result ?

result = directors.merge(movies, left_on='director_id',
                         right_on='directed_by', how='left')

Your Task

Write your answer down in your own space.

Question 2#

How many data rows are there in result ?

result = movies.merge(directors, left_on='directed_by',
                      right_on='director_id', how='outer')

Your Task

Write your answer down in your own space.

Question 3#

Suppose we wanted to compute the earliest movie directed by each director. Any director who has directed no movies should have NaN as a result.

Which of the following code blocks helps us answer that question? Select all that apply.

Your Task

Select one or more options. Write your answer down in your own space.

Option 0

merged = movies.merge(directors,
                      left_on='directed_by',
                      right_on='director_id',
                      how='right')
result = merged.groupby('director_id')['year'].min()

Option 1

merged = movies.merge(directors,
                      left_on='directed_by',
                      right_on='director_id',
                      how='left')
result = merged.groupby('director_id')['year'].min()

Option 2

merged = directors.merge(movies,
                         left_on='director_id',
                         right_on='directed_by',
                         how='right')
result = merged.groupby('director_id')['year'].min()

Option 3

merged = directors.merge(movies,
                         left_on='director_id',
                         right_on='directed_by',
                         how='left')
result = merged.groupby('director_id')['year'].min()

Option 4

merged = movies.merge(directors,
                      left_on='directed_by',
                      right_on='director_id')
result = merged.groupby('director_id')['year'].min()

Question 4#

Consider the following two datasets.

  • The first is our countries dataset. Each row has the name, continent, and geometry of a country (there are other columns, but we don’t need to think about them for this problem).

    NAME

    CONTINENT

    geometry

    United States

    North America

    MultiPolgyon(…)

    Ethiopia

    Africa

    Polygon(…)

    …

    …

    …

  • The second is a carbon emissions dataset. The dataset has two columns, the name of the country and the amount of carbon emissions produced by that country last year.

    Country

    Carbon

    United States

    16.50

    Ethiopia

    0.12

    …

    …

Suppose we wanted to make a plot of the average carbon emissions by continent. Before we do any aggregation to make this map to group countries together by continent, we need to join these datasets so that we can later answer for each continent, what is the average carbon emissions there.

For this problem, we will say any country in the country dataset that does not have an associated row in the carbon emissions dataset should have 0 carbon emissions.

Which of the following is the correct way to combine these datasets by their name? Assume the countries dataset is stored in a variable called countries and the carbon emissions data is in a variable called carbon and that we have run any relevant imports. There is only one correct answer in the list below. It is a useful exercise to make sure you can explain why the other answers are incorrect.

Your Task

Select one option. Write your answer down in your own space.

Option 0

countries.merge(carbon,
                left_on='NAME', right_on='Country')
                how='inner')

Option 1

countries.merge(carbon,
                left_on='NAME', right_on='Country')
                how='left')

Option 2

countries.merge(carbon,
                left_on='NAME', right_on='Country')
                how='right')

Option 3

gpd.sjoin(countries, carbon,
          op='intersects', how='inner')

Option 4

gpd.sjoin(countries, carbon,
          op='intersects', how='left')

Option 5

gpd.sjoin(countries, carbon,
          op='intersects', how='right')