Practice: Coding Machine Learning

Jupyter Info

Reminder: the Jupyter Notebooks on this site are read-only, so you can’t interact with them. Click the button above to launch an interactive version of this notebook.

  • With Binder, you get a temporary Jupyter Notebook website that opens with this notebook. Any code you write will be lost when you close the tab. Make sure to download the notebook so you can save it for later!

  • With Colab, it will open Google Colaboratory. You can save the notebook there to your Google Drive. If you don’t save to your Drive, any code you write will be lost when you close the tab. You can find the data files for this notebook below:

You will need to run all the cells of the notebook to see the output. You can do this by pressing Shift-Enter on each cell or by clicking the “Run All” button above.

In this notebook, we will practice predicting the weather. We won’t predict it in the sense you are familiar with, where meteorologists forecast what the weather will be a week from now. Instead, we will work through a simpler example where we look at various information about a day and try to predict the maximum temperature that day.

The data is stored in weather.csv and has the following columns.

  • STA: A code representing what station the measurements were taken from

  • YR: Which year this measurement was taken

  • MO: Which month this measurement was taken

  • DA: Which day this measurement was taken

  • MAX (our target): The maximum temperature that was reached that day

  • MIN: The minimum temperature that was reached that day

Since the target we want to predict is a number, this will be a regression task rather than a classification task. Almost all the code you will write will be the same as we saw in the lesson, except:

  • You will use a DecisionTreeRegressor from sklearn.tree instead of a DecisionTreeClassifier

  • You will use the mean_squared_error function from the sklearn.metrics module instead of accuracy_score. It behaves similarly in that it takes the true labels and the predicted labels, but it returns the error of the predictions rather than their accuracy. Formally, it returns the mean squared error (MSE) between your predictions and the true values: find the difference for each example, square each difference, and average the results. A higher MSE means the model did worse, while an MSE of 0 means there were no errors!
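To make the difference-square-average recipe concrete, here is a minimal sketch on a few made-up numbers. The values in y_true and y_pred are hypothetical and are not taken from weather.csv; the sketch only shows that computing the MSE by hand with NumPy should agree with mean_squared_error.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical toy values for illustration only (NOT from weather.csv)
y_true = np.array([30.0, 25.0, 20.0])
y_pred = np.array([28.0, 25.0, 23.0])

# By hand: find the differences, square them, and average them
manual_mse = np.mean((y_true - y_pred) ** 2)

# The same quantity computed by scikit-learn (true values first, predictions second)
sklearn_mse = mean_squared_error(y_true, y_pred)

# Predictions that match the true values exactly give an MSE of 0
perfect_mse = mean_squared_error(y_true, y_true)

print(manual_mse, sklearn_mse, perfect_mse)
```

Here the differences are 2, 0, and -3, so the squared differences are 4, 0, and 9, and their average is 13/3.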

As a recommendation, you may use the following variable names for the parts of the problem:

  • data should store the DataFrame of all the data stored in weather.csv.

  • features should store the DataFrame of just the features.

  • labels should store the Series of labels.

  • model should store the DecisionTreeRegressor.

  • error should store the error of the trained model on the whole dataset.
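As a rough illustration of what data, features, and labels might look like, here is a sketch on a tiny hand-made DataFrame. The rows are hypothetical and are not from weather.csv, and treating every column other than MAX as a feature is an assumption made for this sketch. Note that no loops are needed to separate the columns.

```python
import pandas as pd

# Hypothetical rows for illustration only; these values are NOT from weather.csv
data = pd.DataFrame({
    'STA': [10001, 10001, 10002],
    'YR': [42, 42, 43],
    'MO': [7, 7, 1],
    'DA': [1, 2, 15],
    'MAX': [25.6, 27.2, 3.9],
    'MIN': [18.3, 19.4, -2.2],
})

# Assumption: every column other than the target (MAX) is used as a feature
features = data.loc[:, data.columns != 'MAX']

# The target column on its own is a Series
labels = data['MAX']

print(features.columns.tolist())
print(labels.name)
```

In the actual exercise, data would come from reading weather.csv rather than from hand-typed values.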

We don’t spell out each step so that you practice referring back to your notes and the process you saw in the notebook earlier in the lesson. Use those for the steps to train the model (accounting for the differences highlighted above). Remember to import all the necessary libraries!

As a hint for correctness on this task, your model should get 0 error on this dataset. We will discuss in Lesson 12 why getting 0 error might be a sign that something is actually wrong with our model, but for this lesson, we will consider that correct!

For these problems, you should not use any loops!

# Write your code here!