Dissolve#

Data

You can download the dataset used in this reading here:


Consider our world dataset from last time. Recall we use geopandas to process and plot datasets that contain information about the location of an event. As a reminder, the following snippet shows a preview of the data and a plot of the world’s GDP.

import geopandas as gpd
import matplotlib.pyplot as plt

# Load data
df = gpd.read_file('geo_data/ne_110m_admin_0_countries.shp')

# Preview data
print(df.columns)
print(df.head())

# Plot data
df.plot(column='GDP_MD_EST', legend=True)
plt.savefig('world_gdp.png')

GDP by Continent#

Suppose we wanted to compute the total population for each continent. Since our dataset is tabular, you might suspect that since we are trying to compute a value “for each” group, that we would want to use a group-by! This is exactly the right idea! Remember, a group-by operation is one where we want to put each row into a group (e.g., the continent the country belongs to) and compute an aggregate for each group (e.g., the sum of the population). It turns out while GeoDataFrame s do have a groupby function, it is not going to behave as we want it to!

The big problem for groupby here is that it comes from pandas and is not geospatially aware. What this means is that if we were to use a groupby here, it would not know how to handle the geometry column of the GeoDataFrame ! This means there is not a well-defined answer for what the resulting geometry for the continent should be.

However, don’t worry! geopandas provides another function called dissolve that behaves exactly like groupby , but has added logic to combine all the geometries for the group into one. This means you can still compute aggregates like sum , min , max , mean for the columns of interest. Additionally, it will combine the geometry column in a special way to make one geometry for the group. The default (and most common) thing to do is to just take the overlap of all the geometries for that group.

Below, we run a full example that dissolves by the continent to show the total population in each continent, and then below explain the syntax. When you run the snippet, you should see an output that looks like we would expect: each continent is shown in its own color and its value is the sum of all the countries’ populations in that continent.

import geopandas as gpd
import matplotlib.pyplot as plt

df = gpd.read_file('geo_data/ne_110m_admin_0_countries.shp')

# Filter down to just the columns of interest
populations = df[['POP_EST', 'CONTINENT', 'geometry']]
# Run the dissolve (groupby) operation
populations = populations.dissolve(by='CONTINENT', aggfunc='sum')

# Then plot the result
populations.plot(column='POP_EST', legend=True)
plt.savefig('plot.png')

Notice this dissolve call has a lot of the same components as a groupby , but the syntax looks quite different. Instead of saying df.groupby('col1')['col2'].sum() , you say df.dissolve(by='col1', aggfunc='sum') . The dissolve operation is applied to ALL columns of the GeoDataFrame . As a result, it is common that you will need to filter down to just the columns of interest before doing a dissolve. In our example with col1 and col2 , you would need to filter df down to ['col1', 'col2', 'geometry'] since then the dissolve only happens on those columns.

Don’t believe us when we say that you can’t use groupby here? Try it out in the following snippet and see what the resulting plot looks like! The problem comes from the fact that our groupby call throws away the geometry column, making this a non-geospatial dataset; it would not be easy to modify this to account for the geometry in the way that dissolve is designed to do!

import geopandas as gpd
import matplotlib.pyplot as plt

df = gpd.read_file('geo_data/ne_110m_admin_0_countries.shp')
populations = df.groupby('CONTINENT')['POP_EST'].sum()

populations.plot()
plt.savefig('wrong_figure.png')