Practice: Coding Machine Learning#

Jupyter Info

Reminder: on this site, the Jupyter Notebooks are read-only and you can’t interact with them. Click the button above to launch an interactive version of this notebook.

  • With Binder, you get a temporary Jupyter Notebook website that opens this notebook. Any code you write will be lost when you close the tab, so make sure to download the notebook if you want to save your work for later!

  • With Colab, the notebook will open in Google Colaboratory, where you can save it to your Google Drive. If you don’t save a copy to your Drive, any code you write will be lost when you close the tab. You can find the data files for this notebook below:

You will need to run all the cells of the notebook to see the output. You can do this by hitting Shift-Enter on each cell or by clicking the “Run All” button above.

In this notebook, we will practice predicting whether or not a mushroom is edible based on its features, using the data in mushrooms.csv. This dataset has many columns, which we will not attempt to explain in full, but you can look at them here if you are interested. Our target will be the class column, which takes on the value e for edible or p for poisonous. All the features we will use (described below) are categorical. Some values are missing, so we will need to handle that.

For this task, we will use only a subset of the columns for prediction. You should use the columns cap-shape, cap-surface, and cap-color as features and class as the target. All other columns should be ignored for this analysis.

In this problem, you should follow the machine learning pipeline to do the following steps:

  • Drop all columns that are not relevant to the analysis.

  • Remove all rows that have missing values for the columns of interest. There is no need to throw out rows that have missing values outside of these 4 columns (since those columns will not be used by the model).

  • Separate the data into usable features and labels.

  • Split the dataset into 70% training data and 30% test data.

  • Train a decision tree model on the data.

  • Evaluate the model’s training and test accuracy.

Remember to import anything you need from pandas or sklearn!
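
If you want a head start, one possible set of imports that covers every step of this pipeline is sketched below. This is just a sketch, assuming you use pandas for data handling and scikit-learn’s train_test_split, DecisionTreeClassifier, and accuracy_score; other choices are possible.

# One possible set of imports for this pipeline (a sketch, not the only option)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score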

Problem 0: Load the Data#

Load in the dataset into a DataFrame named data. Don’t preprocess the data in any way for this problem.

# Write your code here!
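
For reference, a minimal sketch of this step might look like the following, assuming mushrooms.csv sits in the same directory as the notebook and pandas has been imported as pd.

# A minimal sketch, assuming mushrooms.csv is in the same directory as this notebook
data = pd.read_csv('mushrooms.csv')
data.head()  # peek at the first few rows to confirm the load worked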

Problem 1: Process the Data#

Do the data processing parts of the ML pipeline. Namely, do the following steps:

  • Drop all columns that are not relevant to the analysis.

  • Remove all rows that have missing values for the columns of interest. There is no need to throw out rows that have missing values outside of these 4 columns (since those columns will not be used by the model).

  • Separate the data into usable features and labels.

  • Split the dataset into 70% training data and 30% test data.

Save the train and test features and labels into variables named features_train, features_test, labels_train, and labels_test, as we did before.

# Write your code here!
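
If you get stuck, one possible sketch of these steps is shown below. It assumes the missing values are already represented as NaN when the CSV is loaded (if they are encoded some other way, such as '?', you would need to convert them first) and that the categorical features are one-hot encoded with pd.get_dummies, since scikit-learn’s decision trees expect numeric inputs. The 70%/30% split uses train_test_split with test_size=0.3.

# A sketch of the processing steps (assumes missing values are already NaN)
columns = ['class', 'cap-shape', 'cap-surface', 'cap-color']
data = data[columns]   # drop all columns we are not using
data = data.dropna()   # remove rows missing any of the 4 columns of interest

features = data[['cap-shape', 'cap-surface', 'cap-color']]
features = pd.get_dummies(features)  # one-hot encode the categorical features
labels = data['class']

features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3)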

Problem 2: Train the Model#

Write code to create a decision tree model and train it. Make sure you save it in a variable called model.

# Write your code here!
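
A minimal sketch of this step, assuming scikit-learn’s DecisionTreeClassifier with its default settings:

# A minimal sketch: create and train a decision tree with default settings
model = DecisionTreeClassifier()
model.fit(features_train, labels_train)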

Problem 3: Assess the Model#

Write code to compute the training and test accuracy of the model in the cell below. Save the train accuracy in a variable called train_acc and the test accuracy in a variable called test_acc.

For reference, each of your train and test accuracies should be around 70%.

Check your Understanding: If both the train and test accuracy are near 70%, would we say that the model is overfit? Why or why not?

# Write your code here!
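
One possible sketch of the evaluation, using accuracy_score on the model’s predictions for the train and test sets:

# A sketch of computing train and test accuracy with accuracy_score
train_predictions = model.predict(features_train)
train_acc = accuracy_score(labels_train, train_predictions)

test_predictions = model.predict(features_test)
test_acc = accuracy_score(labels_test, test_predictions)

print('Train accuracy:', train_acc)
print('Test accuracy:', test_acc)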