{"metadata":{"kernelspec":{"name":"python3","display_name":"Python 3.8.3 64-bit ('base': conda)"},"language_info":{"name":"python","version":"3.8.3","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"interpreter":{"hash":"5acdcae90f63eb9c18802b5ea7e22742e6b0757228a832b417d73e29fa596335"}},"nbformat":4,"nbformat_minor":2,"cells":[{"cell_type":"markdown","source":["# Practice/Reading: Pandas Tutorial"],"metadata":{}},{"cell_type":"markdown","source":["
\n"," \n","
\n","\n","```{admonition} Jupyter Notebooks\n","Reminder that on this site the Jupyter Notebooks are read-only and you can't interact with them. Click the button above to launch\n","an interactive version of this notebook.\n","\n","* With Binder, you get a temporary Jupyter Notebook website that opens with this notebook. Any code you write will be lost when you close the tab. Make sure to download the notebook so you can save it for later!\n","* With Colab, it will open Google Colaboratory. You can save the notebook there to your Google Drive. If you don't save to your Drive, any code you write will be lost when you close the tab. You can find the data files for this notebook below:\n"," * {download}`tas.csv <./tas.csv>`\n"," * {download}`emissions.csv <./emissions.csv>`\n","\n","\n","You will need to run all the cells of the notebook to see the output. You can do this by hitting `Shift-Enter` on each cell or by clicking the \"Run All\" button above.\n","```\n","\n","The first thing we will do is use the `import` command to load the `pandas` library. We will use the syntax shown below to \"rename\" `pandas` to `pd` so that in the cells below, we only have to write out `pd` whenever we want to use a `pandas` feature."],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":1,"source":["import pandas as pd"],"outputs":[],"metadata":{}},{"cell_type":"markdown","source":["Next, we will load the data from the CSV file `tas.csv` that has the example data we were working with before. We will save it in a variable called `df` (short for data frame, a common `pandas` term). We do this with a provided function from `pandas` called `read_csv`."],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":2,"source":["df = pd.read_csv('tas.csv')\n","df"],"outputs":[{"output_type":"execute_result","data":{"text/plain":[" Name Salary\n","0 Madrona 3\n","1 Ken 1\n","2 Ryan 3"],"text/html":["
"]},"metadata":{},"execution_count":2}],"metadata":{}},{"cell_type":"markdown","source":["Notice that this shows the CSV in a tabular format! What is `df`? It's a `pandas` object called a **`DataFrame`** which stores a table of values, much like an Excel table. \n","\n","Notice on the top row, it shows the name of the columns (`Name` and `Salary`) and on the left-most side, it shows an index for each row (`0`, `1`, and `2`). \n","\n","`DataFrame`s are powerful because they provide lots of ways to access and perform computations on your data without you having to write much code! \n","\n","## Accessing a Column\n","For example, you can get all of the TAs' names with the following call."],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":3,"source":["df['Name']"],"outputs":[{"output_type":"execute_result","data":{"text/plain":["0 Madrona\n","1 Ken\n","2 Ryan\n","Name: Name, dtype: object"]},"metadata":{},"execution_count":3}],"metadata":{}},{"cell_type":"markdown","source":["`df['Name']` returns another `pandas` object called a **`Series`** that represents a single column or row of a `DataFrame`. A `Series` is very similar to a `list` from Python, but has many extra features that we will explore later.\n","\n","Students sometimes get a little confused because this looks like `df` is a `dict` and it is trying to access a key named `Name`. This is not the case! One of the reasons Python is so powerful is it lets people who program libraries \"hook into\" the syntax of the language to make their own custom meaning of the `[]` syntax! `df` in this cell is really this special object defined by `pandas` called a `DataFrame`.\n","\n","\n","### Problem 0\n","In the cell below, write the code to access the `Salary` column of the data and store it in a variable named `ans0`! 
**For testing purposes, your variable name must be exactly `ans0`.**"],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":4,"source":["# Write your answer here!"],"outputs":[],"metadata":{}},{"cell_type":"markdown","source":["Now, `pandas` is useful because it not only lets you access this data conveniently, but also lets you perform computations on it. \n","\n","A `Series` object has many methods you can call on it to perform computations. Here is a list of some of the most useful ones:\n","* `mean`: Calculates the average value of the `Series`\n","* `min`: Calculates the minimum value of the `Series`\n","* `max`: Calculates the maximum value of the `Series`\n","* `idxmin`: Calculates the index of the minimum value of the `Series`\n","* `idxmax`: Calculates the index of the maximum value of the `Series`\n","* `count`: Calculates the number of values in the `Series`\n","* `unique`: Returns a new `Series` with all the unique values from the `Series`\n","* And many more!\n","\n","For example, if I wanted to compute the average `Salary` of the TAs, I would write:"],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":5,"source":["average_salary = df['Salary'].mean()\n","average_salary"],"outputs":[{"output_type":"execute_result","data":{"text/plain":["2.3333333333333335"]},"metadata":{},"execution_count":5}],"metadata":{}},{"cell_type":"markdown","source":["\n","### Reminder: Types matter\n","When first learning `pandas`, it's easy to mix up `DataFrame` and `Series`. \n","* A `DataFrame` is a 2-dimensional structure (it has rows and columns like a grid)\n","* A `Series` is 1-dimensional (it only has \"one direction\" like a single row or a single column)\n","\n","When you access a single column (or, as we will see later, a single row) of a `DataFrame`, it returns a `Series`. \n","\n","### Problem 1\n","For this problem, you should compute the \"range\" of TA salaries (`the maximum value - the minimum value`). 
**For testing purposes, save the result in a variable called `ans1`.**\n","\n","*Hint: You might need to make two separate calls to `pandas` to compute this since you need both the min and the max.*"],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":6,"source":["# Write your answer here!\n"],"outputs":[],"metadata":{}},{"cell_type":"markdown","source":["## Element-wise Operations\n","For the rest of this reading, let's consider a slightly more complex dataset that has a few more columns. This dataset tracks the emissions for cities around the world (but only has a few rows)."],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":7,"source":["df2 = pd.read_csv('emissions.csv')\n","df2"],"outputs":[{"output_type":"execute_result","data":{"text/plain":[" city country emissions population\n","0 New York USA 200 1500\n","1 Paris France 48 42\n","2 Beijing China 300 2000\n","3 Nice France 40 60\n","4 Seattle USA 100 1000"],"text/html":["
"]},"metadata":{},"execution_count":7}],"metadata":{}},{"cell_type":"markdown","source":["If we wanted to access the emissions column, we could write:"],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":8,"source":["df2['emissions']"],"outputs":[{"output_type":"execute_result","data":{"text/plain":["0 200\n","1 48\n","2 300\n","3 40\n","4 100\n","Name: emissions, dtype: int64"]},"metadata":{},"execution_count":8}],"metadata":{}},{"cell_type":"markdown","source":["Or if we wanted to access the population column, we could write:"],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":9,"source":["df2['population']"],"outputs":[{"output_type":"execute_result","data":{"text/plain":["0 1500\n","1 42\n","2 2000\n","3 60\n","4 1000\n","Name: population, dtype: int64"]},"metadata":{},"execution_count":9}],"metadata":{}},{"cell_type":"markdown","source":["One useful feature of `pandas` is that it lets you combine values from different `Series`. For example, if we wanted to, we could add the values of the emissions column and the population column."],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":10,"source":["df2['emissions'] + df2['population']"],"outputs":[{"output_type":"execute_result","data":{"text/plain":["0 1700\n","1 90\n","2 2300\n","3 100\n","4 1100\n","dtype: int64"]},"metadata":{},"execution_count":10}],"metadata":{}},{"cell_type":"markdown","source":["Notice that this returns a new `Series` that represents the sum of those two columns. The first value in the `Series` is the sum of the first values of the two columns, the second is the sum of the second values, and so on. It does not modify any of the columns of the dataset (you will need to do an assignment to change a value).\n","\n","### Problem 2\n","In the cell below, find the maximum \"emissions per capita\" (emissions divided by population). 
Start by computing this value for each city and then find the maximum value of that `Series` (using one of the `Series` methods shown above). **For testing purposes, save the result in a variable called `ans2`.**\n","\n","*Hint: You can save a `Series` in a variable! It's just like any other Python value!*"],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":11,"source":["# Write your answer here!"],"outputs":[],"metadata":{}},{"cell_type":"markdown","source":["These element-wise computations also work if one of the values is a single value rather than a `Series`. For example, the following cell adds 4 to each of the populations. Notice that this doesn't modify the original `DataFrame`; it just returns a new `Series` with the old values plus 4."],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":12,"source":["df2['population'] + 4"],"outputs":[{"output_type":"execute_result","data":{"text/plain":["0 1504\n","1 46\n","2 2004\n","3 64\n","4 1004\n","Name: population, dtype: int64"]},"metadata":{},"execution_count":12}],"metadata":{}},{"cell_type":"markdown","source":["You can see here that the printed output of the `Series` actually tells you a bit about the values to help you out! The `dtype` property tells you the type of the data. In this case it uses a specialized integer type called `int64`, but for all intents and purposes that's really just like an `int`. As a minor detail, it also stores the name of the column the `Series` came from for reference.\n","\n","Another useful application is comparing the values of a column to a single value. For example, the following cell computes which cities have an emissions value of 200 or more. 
Notice that the `dtype` here is `bool` since each value is either `True` or `False`."],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":13,"source":["df2['emissions'] >= 200"],"outputs":[{"output_type":"execute_result","data":{"text/plain":["0 True\n","1 False\n","2 True\n","3 False\n","4 False\n","Name: emissions, dtype: bool"]},"metadata":{},"execution_count":13}],"metadata":{}},{"cell_type":"markdown","source":["## Filtering Data \n","You might have wondered why being able to compare a `Series` to some value is something we deemed \"useful\" since it doesn't seem like it does anything helpful. The power comes from using this `bool` `Series` to **filter** the `DataFrame` down to the rows you want.\n","\n","For example, what if I wanted to print the names of the cities that have emissions of 200 or more? I can use this `bool` `Series` to filter which rows I want! The syntax looks like the following cell."],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":14,"source":["df3 = df2[df2['emissions'] >= 200]\n","df3['city']"],"outputs":[{"output_type":"execute_result","data":{"text/plain":["0 New York\n","2 Beijing\n","Name: city, dtype: object"]},"metadata":{},"execution_count":14}],"metadata":{}},{"cell_type":"markdown","source":["It's pretty cool that we can get this result without having to write any loops!\n","\n","Notice that the return value has type `DataFrame`, so we can then use the syntax we learned at the beginning to grab a single column from that `DataFrame` (thus returning a `Series`). 
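To see what the mask is doing for us, here is roughly the same filter written with plain Python lists (the values are copied by hand from the table above; this is just a sketch for intuition, not how you would write it with `pandas`):

```python
# Plain-Python sketch of df2[df2['emissions'] >= 200]['city']
cities = ['New York', 'Paris', 'Beijing', 'Nice', 'Seattle']
emissions = [200, 48, 300, 40, 100]

high_emission_cities = []
for city, emission in zip(cities, emissions):
    if emission >= 200:  # the same condition the mask checks for every row
        high_emission_cities.append(city)
# high_emission_cities is now ['New York', 'Beijing']
```

The mask syntax gets the same answer in one line, and `pandas` does the looping for us.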
\n","\n","\n","The way this works is that the indexing notation for `DataFrame`s has special cases based on which type of value you pass it.\n","* If you pass it a `str` (e.g., `df2['emissions']`), it returns that column as a `Series`.\n","* If you pass it a `Series` with `dtype=bool` (e.g., `df2[df2['emissions'] >= 200]`), it will return a `DataFrame` of all the rows that `Series` had a `True` value for!\n","\n","There is no magic here; the `pandas` authors just wrote an if-statement in their code to do different things based on the type provided!\n","\n","A `Series` with `dtype=bool` used in this context is commonly called a **mask**. It usually makes your program more readable to save those masks in a variable. The following cell shows the exact same example, but with the mask saved in a variable for readability."],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":15,"source":["high_emissions = df2['emissions'] >= 200\n","df3 = df2[high_emissions]\n","df3['city']"],"outputs":[{"output_type":"execute_result","data":{"text/plain":["0 New York\n","2 Beijing\n","Name: city, dtype: object"]},"metadata":{},"execution_count":15}],"metadata":{}},{"cell_type":"markdown","source":["### Filtering on Multiple Conditions\n","You can combine masks using logical operators to make complex queries. There are three logical operators for masks (like `and`, `or`, and `not` but with different symbols).\n","* `&` does an element-wise `and` to combine two masks\n","* `|` does an element-wise `or` to combine two masks\n","* `~` does an element-wise `not` of a single mask\n","\n","For example, if you want to find all cities that have high emissions or are in the US, you would probably try writing the following (but you'll run into a bug).\n","\n","```{snippet}\n","df2[df2['emissions'] >= 200 | df2['country'] == 'USA']\n","```\n","\n","The problem comes from **precedence** (order of operations). 
Just like how `*` gets evaluated before `+`, `|` gets evaluated before comparison operators like `>=` and `==` because it has higher precedence (and the same is true for `&`). This makes Python interpret the first sub-expression as (`200 | df2['country']`), which causes an error since this operator is not defined for these types.\n","\n","Whenever you run into ambiguities from precedence, one way you can always fix it is to put the sub-expressions in parentheses like in the following cell."],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":16,"source":["df2[(df2['emissions'] >= 200) | (df2['country'] == 'USA')]"],"outputs":[{"output_type":"execute_result","data":{"text/plain":[" city country emissions population\n","0 New York USA 200 1500\n","2 Beijing China 300 2000\n","4 Seattle USA 100 1000"],"text/html":["
"]},"metadata":{},"execution_count":16}],"metadata":{}},{"cell_type":"markdown","source":["A much more readable solution involves saving each mask in a variable so you don't have to worry about this precedence. This has an added benefit of giving each condition a human-readable name if you use good variable names!"],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":17,"source":["high_emissions = df2['emissions'] >= 200\n","is_usa = df2['country'] == 'USA'\n","df2[high_emissions | is_usa]"],"outputs":[{"output_type":"execute_result","data":{"text/plain":[" city country emissions population\n","0 New York USA 200 1500\n","2 Beijing China 300 2000\n","4 Seattle USA 100 1000"],"text/html":["
"]},"metadata":{},"execution_count":17}],"metadata":{}},{"cell_type":"markdown","source":["### Problem 3\n","In the cell below, write code to select all rows from the dataset that are in France and have a population greater than 50. **For testing purposes, save the result in a variable called `ans3`.**"],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":18,"source":["# Write your answer here!"],"outputs":[],"metadata":{}},{"cell_type":"markdown","source":["## Location\n","We've shown you how to select specific columns or select specific rows based on a mask. In some sense, it's a little confusing that `df[val]` can be used to grab columns or rows depending on what is passed. This is because the syntax we have shown so far is really just a set of special cases of a more generic syntax that lets you specify some location in the `DataFrame`. `pandas` provides this shorthand for convenience in some cases, but the more general syntax below works in many more!\n","\n","In its most general form, the `loc` property lets you specify a **row indexer** and a **column indexer** to specify which rows/columns you want. The syntax looks like the following (where things in `<...>` are placeholders)\n","\n","```\n","df.loc[<row indexer>, <column indexer>]\n","```\n","\n","The row indexer refers to the index of the `DataFrame`. Recall that when we display a `DataFrame`, it shows values to the left of each row to identify each row in the `DataFrame`.\n","\n","It turns out the column indexer is optional, so you can leave that out. For example, if I want to get the first row (the row with index 0), I could write:"],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":19,"source":["df2.loc[0]"],"outputs":[{"output_type":"execute_result","data":{"text/plain":["city New York\n","country USA\n","emissions 200\n","population 1500\n","Name: 0, dtype: object"]},"metadata":{},"execution_count":19}],"metadata":{}},{"cell_type":"markdown","source":["Interestingly, this actually returns a `Series`! 
It looks different from the `Series` returned from something like `df2['city']` since its index is now the column names themselves! This means I could index into a specific column by doing something like:"],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":20,"source":["s = df2.loc[0]\n","s['city']"],"outputs":[{"output_type":"execute_result","data":{"text/plain":["'New York'"]},"metadata":{},"execution_count":20}],"metadata":{}},{"cell_type":"markdown","source":["Now, it was a bit tedious to have to use double `[]` to access the column, which is exactly why `loc` lets you specify a column as a \"column indexer\". Instead, it's more common to write:"],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":21,"source":["df2.loc[0, 'city']"],"outputs":[{"output_type":"execute_result","data":{"text/plain":["'New York'"]},"metadata":{},"execution_count":21}],"metadata":{}},{"cell_type":"markdown","source":["You might be wondering: I've used the word \"indexer\" a few times but haven't defined what that means! By indexer, I mean some value to indicate which rows/columns you want. So far, I have shown how to specify a single value as an indexer, but there are actually many options to choose from! You can always mix-and-match these and use different ones for the rows/columns.\n","\n","### List of indices and slices\n","For example, you can use a list of values as an indexer to select many rows or many columns:"],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":22,"source":["df2.loc[[1,2,3], ['city', 'country', 'emissions']]"],"outputs":[{"output_type":"execute_result","data":{"text/plain":[" city country emissions\n","1 Paris France 48\n","2 Beijing China 300\n","3 Nice France 40"],"text/html":["
"]},"metadata":{},"execution_count":22}],"metadata":{}},{"cell_type":"markdown","source":["Notice that now it returns a `DataFrame` instead of a single value.\n","\n","You can also use slice syntax like you could for `list`/`str` to access a range of values. There are a couple of oddities about this:\n","* The start/stop points are **both inclusive**, which is different from `list`/`str` where the stop point is exclusive.\n","* They do some fancy \"magic\" that lets you use ranges with strings to get a range of column names.\n","\n","For example:"],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":23,"source":["df2.loc[1:3, 'city':'emissions']"],"outputs":[{"output_type":"execute_result","data":{"text/plain":[" city country emissions\n","1 Paris France 48\n","2 Beijing China 300\n","3 Nice France 40"],"text/html":["
"]},"metadata":{},"execution_count":23}],"metadata":{}},{"cell_type":"markdown","source":["The way to read this `loc` access is \"all the rows starting at index 1 and going to index 3 (both inclusive) and all the columns starting at city and going to emissions (both inclusive)\".\n","\n","How does it define the \"range of strings\"? It uses the order of the columns in the `DataFrame`.\n","\n","### Mask\n","\n","You can also use a `bool` `Series` as an indexer to grab all the rows or columns that are marked `True`. This is similar to the masking we saw before, but now the mask is used as an indexer."],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":24,"source":["high_emissions = df2['emissions'] >= 200\n","is_usa = df2['country'] == 'USA'\n","df2.loc[high_emissions | is_usa]"],"outputs":[{"output_type":"execute_result","data":{"text/plain":[" city country emissions population\n","0 New York USA 200 1500\n","2 Beijing China 300 2000\n","4 Seattle USA 100 1000"],"text/html":["
"]},"metadata":{},"execution_count":24}],"metadata":{}},{"cell_type":"markdown","source":["Notice in the last cell, I left out the column indexer and it gave me all the columns (that is the default for the column indexer).\n","\n","### `:` for everything\n","\n","Instead of relying on defaults, you can explicitly ask for \"all of the columns\" using the special range `:`. This is a common syntax for many numerical processing libraries so `pandas` adopts it too. It looks like the following"],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":25,"source":["df2.loc[[0, 4, 2], :]"],"outputs":[{"output_type":"execute_result","data":{"text/plain":[" city country emissions population\n","0 New York USA 200 1500\n","4 Seattle USA 100 1000\n","2 Beijing China 300 2000"],"text/html":["
"]},"metadata":{},"execution_count":25}],"metadata":{}},{"cell_type":"markdown","source":["You can also do this for the rows! "],"metadata":{},"attachments":{}},{"cell_type":"code","execution_count":26,"source":["df2.loc[:, 'city']"],"outputs":[{"output_type":"execute_result","data":{"text/plain":["0 New York\n","1 Paris\n","2 Beijing\n","3 Nice\n","4 Seattle\n","Name: city, dtype: object"]},"metadata":{},"execution_count":26}],"metadata":{}},{"cell_type":"markdown","source":["A tip to help you read these in your head is to read `:` by itself as \"all\".\n","\n","### Recap Indexers\n","So the `.loc` property is a kind of universal way of asking for your data. You can specify a row indexer and a column indexer to select your data. We saw the following things used as indexers:\n","* A single value (row index for rows, column name for columns)\n","* A list of values or a slice (row indices for rows, column names for columns)\n","* A mask\n","* `:` to select all values\n","\n","\n","### Return Values\n","One thing that is also complex about `.loc` is that the type of the value returned depends on the types of the indexers. Recall that a `pandas` `DataFrame` is a 2-dimensional structure (rows and columns) while a `Series` is a single row or a single column.\n","\n","To tell what the return type of a `.loc` call is, you need to look for the \"single value\" type of indexer.\n","* If both the row and column indexers are single values, it returns a single value. This will be whatever value is at that location, so its type will be the same as the `dtype` of the column it comes from.\n","* If only one of the row and column indexers is a single value (meaning the other is multiple values), it returns a `Series`.\n","* If neither of the row and column indexers is a single value (meaning both are multiple values), it returns a `DataFrame`."],"metadata":{},"attachments":{}}]}
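The return-value rules above can be checked with a short sketch (assuming `pandas` is installed; the `DataFrame` is built inline with the same values as `emissions.csv` so the example is self-contained):

```python
import pandas as pd

# Same values as emissions.csv, built inline for a self-contained example
df2 = pd.DataFrame({
    'city': ['New York', 'Paris', 'Beijing', 'Nice', 'Seattle'],
    'country': ['USA', 'France', 'China', 'France', 'USA'],
    'emissions': [200, 48, 300, 40, 100],
    'population': [1500, 42, 2000, 60, 1000],
})

value = df2.loc[0, 'city']       # single row, single column -> single value
row = df2.loc[0, :]              # single row, all columns   -> Series
sub = df2.loc[[0, 2], ['city']]  # list of rows and columns  -> DataFrame

print(type(value))  # <class 'str'>
print(type(row))    # <class 'pandas.core.series.Series'>
print(type(sub))    # <class 'pandas.core.frame.DataFrame'>
```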