# Web Scraping
<div style="position: relative; padding-bottom: 56.25%; height: 0;"><iframe src="https://www.loom.com/embed/821641d5fc364d9ebae67b30c556d2ea" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>

---

<div style="position: relative; padding-bottom: 56.25%; height: 0;"><iframe src="https://www.loom.com/embed/30e65e12cdf4469d8b0ddf64776b3550" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>

```{jupyter-info}
```

## APIs
The first web technology we talked about is the Application Programming Interface (API). An API is generally a URL you can visit that returns data (as opposed to a web page like facebook.com). We looked at an API that returns the position of the International Space Station. This API is located at http://api.open-notify.org/iss-now.json and returns data in the JSON format. JSON is a text format whose structure closely mirrors Python lists and dictionaries with keys and values.
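
As a small illustration of how JSON maps onto Python objects, the sketch below parses a JSON string with Python's built-in `json` module. The string is a copy of an ISS response shown later in this section, so no network access is needed; the names `raw_text` and `parsed` are illustrative only.

```python
import json

# A JSON document as plain text (copied from a sample response of the ISS API).
raw_text = '{"message": "success", "timestamp": 1591032742, "iss_position": {"longitude": "-101.1647", "latitude": "-15.1609"}}'

# json.loads turns JSON text into Python objects:
# JSON objects become dicts, arrays become lists, strings become str,
# and numbers become int or float.
parsed = json.loads(raw_text)

print(type(parsed))                  # the outer JSON object is a dict
print(parsed['message'])             # a JSON string becomes a Python str
print(type(parsed['iss_position']))  # nested objects become nested dicts
```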

`requests` is a very popular Python library that lets you fetch data from a URL.

In [None]:
import requests

In [None]:
response = requests.get('http://api.open-notify.org/iss-now.json')

This returns a "response" object that holds information about the response, such as its status code and data.

In [None]:
response.status_code

200
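
The value `200` is the standard HTTP code for a successful request. As a rough guide to reading other codes, the sketch below uses Python's built-in `http` module to look up the standard phrase for a few of them; the helper `is_success` is a hypothetical name introduced only for this example.

```python
from http import HTTPStatus

# Look up the standard reason phrase for a few common status codes.
for code in (200, 404, 500):
    print(code, HTTPStatus(code).phrase)

def is_success(code):
    """Codes in the 200s indicate success; 400s and 500s indicate errors."""
    return 200 <= code < 300

print(is_success(200))
print(is_success(404))
```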

You can view the data with the `content` attribute, but this returns raw bytes, which are not very convenient to work with. Instead, we can use the `json` method to parse the response body into a Python dictionary.

In [None]:
response.content

b'{"message": "success", "timestamp": 1591032742, "iss_position": {"longitude": "-101.1647", "latitude": "-15.1609"}}'

In [None]:
d = response.json()
print(d)
print(d['timestamp'])

{'message': 'success', 'timestamp': 1591032742, 'iss_position': {'longitude': '-101.1647', 'latitude': '-15.1609'}}
1591032742


How did we know there was a key called `'timestamp'`? This is part of the documentation of the API that can be found [here](http://open-notify.org/Open-Notify-API/ISS-Location-Now/).  
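
Once the keys are known, the values usually need a little conversion before they are useful. The sketch below works on a copy of the response printed above and assumes, as the API documentation describes, that `timestamp` is Unix time (seconds since 1970-01-01 UTC). The names `sample`, `when`, `latitude` and `longitude` are illustrative.

```python
from datetime import datetime, timezone

# Sample response copied from the output above, so no request is needed.
sample = {'message': 'success', 'timestamp': 1591032742,
          'iss_position': {'longitude': '-101.1647', 'latitude': '-15.1609'}}

# Assumption: the timestamp is Unix time, so it converts to a UTC datetime.
when = datetime.fromtimestamp(sample['timestamp'], tz=timezone.utc)

# The coordinates arrive as strings, so convert them before doing arithmetic.
latitude = float(sample['iss_position']['latitude'])
longitude = float(sample['iss_position']['longitude'])

print(when.isoformat())
print(latitude, longitude)
```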

To get more up-to-date data, we would have to make the request again. The code below makes 20 calls to the API and prints the latitude and longitude of the ISS.

In [None]:
for i in range(20):
  response = requests.get('http://api.open-notify.org/iss-now.json')
  if response.status_code == 200:
    print(response.json()['iss_position'])
  else:
    print('Error')

{'longitude': '-101.1452', 'latitude': '-15.1361'}
{'longitude': '-101.1452', 'latitude': '-15.1361'}
{'longitude': '-101.1257', 'latitude': '-15.1113'}
{'longitude': '-101.1063', 'latitude': '-15.0864'}
{'longitude': '-101.1063', 'latitude': '-15.0864'}
{'longitude': '-101.0868', 'latitude': '-15.0616'}
{'longitude': '-101.0868', 'latitude': '-15.0616'}
{'longitude': '-101.0674', 'latitude': '-15.0369'}
{'longitude': '-101.0674', 'latitude': '-15.0369'}
{'longitude': '-101.0480', 'latitude': '-15.0120'}
{'longitude': '-101.0286', 'latitude': '-14.9872'}
{'longitude': '-101.0286', 'latitude': '-14.9872'}
{'longitude': '-101.0092', 'latitude': '-14.9624'}
{'longitude': '-101.0092', 'latitude': '-14.9624'}
{'longitude': '-100.9898', 'latitude': '-14.9376'}
{'longitude': '-100.9898', 'latitude': '-14.9376'}
{'longitude': '-100.9704', 'latitude': '-14.9128'}
{'longitude': '-100.9510', 'latitude': '-14.8880'}
{'longitude': '-100.9510', 'latitude': '-14.8880'}
{'longitude': '-100.9315', 'lat
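
Several consecutive readings in the output above are identical, most likely because the loop asks again before the reported position has changed. One possible refinement, sketched below, pauses between calls and keeps only readings that differ from the previous one. The function names and the `delay_seconds` parameter are assumptions for this example; `fetch_iss_position` needs network access, so the demonstration at the bottom uses made-up readings instead.

```python
import time

def fetch_iss_position():
    """One call to the API from the text; requires network access."""
    import requests  # imported here so the rest of the sketch runs offline
    response = requests.get('http://api.open-notify.org/iss-now.json', timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.json()['iss_position']

def poll_positions(fetch, count, delay_seconds):
    """Call fetch() count times, pausing between calls, keeping only changed readings."""
    positions = []
    for i in range(count):
        position = fetch()
        if not positions or position != positions[-1]:
            positions.append(position)
        if i < count - 1:
            time.sleep(delay_seconds)
    return positions

# Offline demonstration with made-up readings standing in for the API.
fake_readings = iter([{'latitude': '1'}, {'latitude': '1'}, {'latitude': '2'}])
demo_positions = poll_positions(lambda: next(fake_readings), count=3, delay_seconds=0)
print(demo_positions)
```

With network access, `poll_positions(fetch_iss_position, count=20, delay_seconds=5)` would be the equivalent of the loop above; the five-second delay is an arbitrary choice.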

## Web Scraping
APIs are fantastic because they return useful data that is generally well formatted! One problem is that an API has to be written by someone, and one doesn't always exist for the data you want. If there is no API that provides the data nicely, another approach is to "scrape" the data off a webpage.

In the rest of this example, we will try to scrape [this webpage](https://forecast.weather.gov/MapClick.php?lat=47.6036&lon=-122.3294) with the weather forecast so we can gather data about the weather for the rest of the week. This example is a bit lacking since we don't show **what** you would do with this data, just **how** to get it. To know what you can do with the data, refer to the rest of the course where we learned how to process data we had; web-scraping is just a tool to gather more data.

To understand how to scrape a page, you have to understand what a webpage looks like. Please refer to the slides to see what a webpage looks like. 
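
As a rough illustration of that structure, the sketch below feeds a tiny made-up page to Python's built-in `html.parser` and lists each opening tag with its nesting depth and attributes. The page content and the `TagLister` class are invented for this example; the real forecast page is far larger, but it is built from the same pieces: nested tags, some carrying `id` and `class` attributes.

```python
from html.parser import HTMLParser

# A made-up page: tags nest inside each other and may carry id and class attributes.
tiny_page = """
<html>
  <body>
    <div id="forecast">
      <p class="period-name">Today</p>
      <p class="temp">High: 66 °F</p>
    </div>
  </body>
</html>
"""

class TagLister(HTMLParser):
    """Collect every opening tag with its nesting depth and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

lister = TagLister()
lister.feed(tiny_page)
for depth, tag, attrs in lister.seen:
    print('  ' * depth + tag, attrs)
```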

In [None]:
page = requests.get('https://forecast.weather.gov/MapClick.php?lat=47.6036&lon=-122.3294')
print(page.content)



This would be a pain to parse by hand, so instead we use a library that lets us look at the contents of the page. This library is called Beautiful Soup, and it can be used as shown below:

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

To find the first paragraph in the page, we can write:

In [None]:
soup.find('p')

<p>
<input checked="checked" id="nws" name="affiliate" type="radio" value="nws.noaa.gov"/>
<label class="search-scope" for="nws">NWS</label>
<input id="noaa" name="affiliate" type="radio" value="noaa.gov"/>
<label class="search-scope" for="noaa">All NOAA</label>
</p>

To find all the paragraphs, we would use `find_all`:

In [None]:
soup.find_all('p')

[<p>
 <input checked="checked" id="nws" name="affiliate" type="radio" value="nws.noaa.gov"/>
 <label class="search-scope" for="nws">NWS</label>
 <input id="noaa" name="affiliate" type="radio" value="noaa.gov"/>
 <label class="search-scope" for="noaa">All NOAA</label>
 </p>, <p>Your local forecast office is</p>, <p>
             Dangerously hot temperatures will continue early this week over the Desert Southwest, with critical fire weather threats from the same region northward into the Great Basin. Further north, severe thunderstorms will be possible across the northern Rockies and Plains today, expanding into the Upper Midwest on Tuesday.
             <a href="http://www.wpc.ncep.noaa.gov/discussions/hpcdiscussions.php?disc=pmdspd" target="_blank">Read More &gt;</a>
 </p>, <p class="myforecast-current">NA</p>, <p class="myforecast-current-lrg">59°F</p>, <p class="myforecast-current-sm">15°C</p>, <p class="moreInfo"><b>More Information:</b></p>, <p><a href="https://www.weather.gov/sew"
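
The difference between `find` and `find_all` is easier to see on a page small enough to read at a glance. The sketch below uses a made-up HTML snippet (not the weather page) and illustrative variable names; it also shows that `find` returns `None` when nothing matches.

```python
from bs4 import BeautifulSoup

# A made-up page with two paragraphs and no table.
small_html = '<div><p>first</p><p class="note">second</p></div>'
small_soup = BeautifulSoup(small_html, 'html.parser')

first = small_soup.find('p')               # a single Tag: the first match
all_paragraphs = small_soup.find_all('p')  # a list-like collection of every match
missing = small_soup.find('table')         # None, because there is no <table>

print(first.get_text())
print(len(all_paragraphs))
print(missing)
```

Because `find` can return `None`, it is worth checking the result before calling methods on it when scraping a page whose layout may change.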

We learned you can also specify an ID or a class to identify a tag in HTML. First, we select the element with the id "seven-day-forecast" and then try to find all of the items with the class "tombstone-container" **inside** that element.

In [None]:
seven_day = soup.find(id='seven-day-forecast')
seven_day

<div class="panel panel-default" id="seven-day-forecast">
<div class="panel-heading">
<b>Extended Forecast for</b>
<h2 class="panel-title">
	    	    Downtown Seattle WA	</h2>
</div>
<div class="panel-body" id="seven-day-forecast-body">
<div id="seven-day-forecast-container"><ul class="list-unstyled" id="seven-day-forecast-list"><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Today<br/><br/></p>
<p><img alt="Today: Sunny, with a high near 66. Calm wind becoming west northwest around 6 mph in the afternoon. " class="forecast-icon" src="newimages/medium/few.png" title="Today: Sunny, with a high near 66. Calm wind becoming west northwest around 6 mph in the afternoon. "/></p><p class="short-desc">Sunny</p><p class="temp temp-high">High: 66 °F</p></div></li><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img alt="Tonight: Mostly clear, with a low around 50. North northwest wind aroun

In [None]:
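forecast_items = seven_day.find_all(class_='tombstone-container')
# The first container is the earliest period on the page ("Today" in the output
# below), even though the variable is named tonight.
tonight = forecast_items[0]
tonight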

<div class="tombstone-container">
<p class="period-name">Today<br/><br/></p>
<p><img alt="Today: Sunny, with a high near 66. Calm wind becoming west northwest around 6 mph in the afternoon. " class="forecast-icon" src="newimages/medium/few.png" title="Today: Sunny, with a high near 66. Calm wind becoming west northwest around 6 mph in the afternoon. "/></p><p class="short-desc">Sunny</p><p class="temp temp-high">High: 66 °F</p></div>

To print it out a little nicer, we can use the `prettify` method:

In [None]:
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Today
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Today: Sunny, with a high near 66. Calm wind becoming west northwest around 6 mph in the afternoon. " class="forecast-icon" src="newimages/medium/few.png" title="Today: Sunny, with a high near 66. Calm wind becoming west northwest around 6 mph in the afternoon. "/>
 </p>
 <p class="short-desc">
  Sunny
 </p>
 <p class="temp temp-high">
  High: 66 °F
 </p>
</div>


Elements returned by `find` sometimes have attributes that you can inspect. For example, the `img` tags in the forecast have a "title" attribute with information about the forecast.

In [None]:
tonight.find('img')['title']

'Today: Sunny, with a high near 66. Calm wind becoming west northwest around 6 mph in the afternoon. '
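
Indexing with `['title']` raises a `KeyError` when the attribute is missing. The sketch below, on a made-up `img` tag, shows two gentler options: `.get`, which returns `None` for a missing attribute, and `.attrs`, which exposes all attributes as a dictionary.

```python
from bs4 import BeautifulSoup

# A made-up image tag with three attributes and no width.
img_html = '<p><img alt="Sunny" src="few.png" title="Today: Sunny"/></p>'
img = BeautifulSoup(img_html, 'html.parser').find('img')

print(img['title'])      # indexing raises KeyError if the attribute is absent
print(img.get('width'))  # .get returns None instead of raising
print(img.attrs)         # all attributes as a dictionary
```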

Finding elements within an element can be tedious to do manually with `find`. Beautiful Soup also provides the `select` method, which lets you find elements using a syntax called "CSS selectors".

The following line finds all the elements with the class "period-name" that are inside elements with the class "tombstone-container".

In [None]:
seven_day.select('.tombstone-container .period-name')

[<p class="period-name">Today<br/><br/></p>,
 <p class="period-name">Tonight<br/><br/></p>,
 <p class="period-name">Tuesday<br/><br/></p>,
 <p class="period-name">Tuesday<br/>Night</p>,
 <p class="period-name">Wednesday<br/><br/></p>,
 <p class="period-name">Wednesday<br/>Night</p>,
 <p class="period-name">Thursday<br/><br/></p>,
 <p class="period-name">Thursday<br/>Night</p>,
 <p class="period-name">Friday<br/><br/></p>]
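
A few more selector forms are sketched below on a made-up page that imitates the forecast markup (the `temp-low` class and the paragraph outside the forecast are invented for the example). In a selector, `#name` matches an id, `.name` matches a class, and a space means "anywhere inside".

```python
from bs4 import BeautifulSoup

# A made-up page imitating the structure of the forecast.
selector_html = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Today</p>
    <p class="temp temp-high">High: 66 °F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p class="temp temp-low">Low: 50 °F</p>
  </div>
</div>
<p class="period-name">Outside the forecast</p>
"""
demo = BeautifulSoup(selector_html, 'html.parser')

# Elements with class period-name anywhere inside the element with this id.
inside = demo.select('#seven-day-forecast .period-name')
# Chaining classes with no space requires both classes on the same element.
highs = demo.select('p.temp.temp-high')
# select_one returns only the first match (or None if there is none).
first_temp = demo.select_one('.tombstone-container .temp')

print([p.get_text() for p in inside])
print([p.get_text() for p in highs])
print(first_temp.get_text())
```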

To get the text inside each tag returned by `select`, we can use a loop to call the `get_text` method on each tag.

Below we get the following information for each time in the forecast:
*  The name of the forecast time (called the period)
*  The description of the forecast, taken from the title attribute of the image inside the forecast
*  The forecast temperature



In [None]:
periods = [pt.get_text() for pt in seven_day.select('.tombstone-container .period-name')]
print(periods)

['Today', 'Tonight', 'Tuesday', 'TuesdayNight', 'Wednesday', 'WednesdayNight', 'Thursday', 'ThursdayNight', 'Friday']
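
Notice that `'TuesdayNight'` has lost the space that the `<br/>` tag provided visually. One way to keep the words apart, sketched below on two tags copied from the output above, is to pass `separator` and `strip` arguments to `get_text`. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Two period tags copied from the select output above.
period_html = ('<p class="period-name">Tuesday<br/>Night</p>'
               '<p class="period-name">Today<br/><br/></p>')
period_tags = BeautifulSoup(period_html, 'html.parser').select('.period-name')

# Default behaviour: pieces of text are joined with nothing between them.
plain = [tag.get_text() for tag in period_tags]
# With a separator, pieces are joined by a space; strip=True trims each piece
# and skips empty ones.
spaced = [tag.get_text(separator=' ', strip=True) for tag in period_tags]

print(plain)
print(spaced)
```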


In [None]:
titles = [img['title'] for img in seven_day.select('.tombstone-container img')]
print(titles)

['Today: Sunny, with a high near 66. Calm wind becoming west northwest around 6 mph in the afternoon. ', 'Tonight: Mostly clear, with a low around 50. North northwest wind around 7 mph becoming east northeast in the evening. ', 'Tuesday: Partly sunny, with a high near 66. East northeast wind around 5 mph becoming calm  in the afternoon. ', 'Tuesday Night: Mostly cloudy, with a low around 53. North wind 5 to 7 mph becoming calm. ', 'Wednesday: Partly sunny, with a high near 66. Northeast wind 5 to 8 mph becoming northwest in the morning. ', 'Wednesday Night: Mostly cloudy, with a low around 51.', 'Thursday: Mostly sunny, with a high near 70.', 'Thursday Night: Mostly cloudy, with a low around 52.', 'Friday: A chance of showers after 11am.  Partly sunny, with a high near 70.']


In [None]:
temps = [tt.get_text() for tt in seven_day.select('.tombstone-container .temp')]
print(temps)

['High: 66 °F', 'Low: 50 °F', 'High: 66 °F', 'Low: 53 °F', 'High: 66 °F', 'Low: 51 °F', 'High: 70 °F', 'Low: 52 °F', 'High: 70 °F']
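
These temperatures are still text. A minimal sketch of turning them into numbers with a regular expression follows; it assumes every entry has the `'High: 66 °F'` or `'Low: 50 °F'` shape seen above and raises an error otherwise. The names `temps_sample`, `temp_pattern` and `parse_temp` are hypothetical.

```python
import re

# A few entries copied from the output above.
temps_sample = ['High: 66 °F', 'Low: 50 °F', 'High: 70 °F']

# Assumed format: "High" or "Low", a colon, a whole number, then the unit.
temp_pattern = re.compile(r'(High|Low):\s*(-?\d+)\s*°F')

def parse_temp(text):
    """Split a forecast temperature into its kind and a whole number of degrees."""
    match = temp_pattern.fullmatch(text)
    if match is None:
        raise ValueError(f'unexpected temperature format: {text!r}')
    kind, degrees = match.groups()
    return kind, int(degrees)

parsed_temps = [parse_temp(t) for t in temps_sample]
print(parsed_temps)
```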


All of the data together is shown below

In [None]:
print(titles)
print(temps)
print(periods)

['Today: Sunny, with a high near 66. Calm wind becoming west northwest around 6 mph in the afternoon. ', 'Tonight: Mostly clear, with a low around 50. North northwest wind around 7 mph becoming east northeast in the evening. ', 'Tuesday: Partly sunny, with a high near 66. East northeast wind around 5 mph becoming calm  in the afternoon. ', 'Tuesday Night: Mostly cloudy, with a low around 53. North wind 5 to 7 mph becoming calm. ', 'Wednesday: Partly sunny, with a high near 66. Northeast wind 5 to 8 mph becoming northwest in the morning. ', 'Wednesday Night: Mostly cloudy, with a low around 51.', 'Thursday: Mostly sunny, with a high near 70.', 'Thursday Night: Mostly cloudy, with a low around 52.', 'Friday: A chance of showers after 11am.  Partly sunny, with a high near 70.']
['High: 66 °F', 'Low: 50 °F', 'High: 66 °F', 'Low: 53 °F', 'High: 66 °F', 'Low: 51 °F', 'High: 70 °F', 'Low: 52 °F', 'High: 70 °F']
['Today', 'Tonight', 'Tuesday', 'TuesdayNight', 'Wednesday', 'WednesdayNight', 'Th
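
The three lists printed above line up by position: index 0 of each describes the same forecast. That only holds if every forecast contributed to all three lists, so a length check is a cheap safeguard. The sketch below uses shortened sample values and illustrative names, and combines the lists with `zip` into one dictionary per forecast.

```python
# Shortened samples standing in for the full lists above.
periods_sample = ['Today', 'Tonight']
temps_sample = ['High: 66 °F', 'Low: 50 °F']
titles_sample = ['Today: Sunny, with a high near 66.',
                 'Tonight: Mostly clear, with a low around 50.']

# The lists only describe the same forecasts if they have the same length.
assert len(periods_sample) == len(temps_sample) == len(titles_sample)

# zip walks the lists in step, yielding one (period, temp, title) triple per forecast.
records = [
    {'period': period, 'temp': temp, 'desc': title}
    for period, temp, title in zip(periods_sample, temps_sample, titles_sample)
]
print(records[0])
```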

The data is now stored as "parallel arrays" (here, Python lists), where the values at index 0 of each list together describe one forecast. This would be a bit annoying to work with, so we put it in a `pandas` `DataFrame`.

In [None]:
import pandas as pd
weather = pd.DataFrame({
    'period': periods,
    'temp': temps,
    'desc': titles
})
weather

| | period | temp | desc |
|---|---|---|---|
| 0 | Today | High: 66 °F | Today: Sunny, with a high near 66. Calm wind b... |
| 1 | Tonight | Low: 50 °F | Tonight: Mostly clear, with a low around 50. N... |
| 2 | Tuesday | High: 66 °F | Tuesday: Partly sunny, with a high near 66. Ea... |
| 3 | TuesdayNight | Low: 53 °F | Tuesday Night: Mostly cloudy, with a low aroun... |
| 4 | Wednesday | High: 66 °F | Wednesday: Partly sunny, with a high near 66. ... |
| 5 | WednesdayNight | Low: 51 °F | Wednesday Night: Mostly cloudy, with a low aro... |
| 6 | Thursday | High: 70 °F | Thursday: Mostly sunny, with a high near 70. |
| 7 | ThursdayNight | Low: 52 °F | Thursday Night: Mostly cloudy, with a low arou... |
| 8 | Friday | High: 70 °F | Friday: A chance of showers after 11am. Partl... |
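
With the data in a `DataFrame`, the text in the `temp` column can be converted to numbers for analysis. The sketch below is one possible approach on a two-row sample rather than the full table; the column names `temp_num` and `is_high` are invented for this example.

```python
import pandas as pd

# A two-row sample with the same columns as the table above.
weather_sample = pd.DataFrame({
    'period': ['Today', 'Tonight'],
    'temp': ['High: 66 °F', 'Low: 50 °F'],
})

# Pull the digits out of the text and convert them to integers.
weather_sample['temp_num'] = (
    weather_sample['temp'].str.extract(r'(-?\d+)', expand=False).astype(int)
)
# Flag daytime highs versus overnight lows.
weather_sample['is_high'] = weather_sample['temp'].str.startswith('High')

print(weather_sample)
print(weather_sample['temp_num'].mean())
```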
