CS100 Homework #3, Part II
Overview
The topic of this homework assignment is supervised learning. The first half is concerned with linear regression, and the second half, classification.
Part II: Classification
The second part of this assignment is concerned with classification. In class, you learned (or will be learning) about four different classification methods:
K-NN, decision trees, logistic regression, and naive Bayes. Likewise, in studio you have explored some of these in the context of binary classification, where data are classified in only one of two ways (e.g., Clinton supporters, or Trump supporters). In this assignment, you’re going to explore multiclass classification, where more than two classes may apply (e.g., in addition to the above two, Jill Stein supporters, or Gary Johnson supporters). All the binary classification algorithms you learned about (except logistic regression) generalize to multiple classes. Thus, in this assignment, you will be building classifiers using k-NN, decision trees, and naive Bayes.
Libraries
Before you can begin working on this assignment, you must install, and then load, the necessary libraries:
library(class) # k-NN
library(rpart) # Decision trees
library(klaR)  # Naive Bayes
library(caret) # Cross-validation
This assignment is relatively free form compared to other CS 100 assignments, in the sense that there are no exact answers. We’ll tell you what accuracy ranges we expect your classifiers to achieve so that you can gauge your performance. If you can achieve those accuracies, you will have convinced us that you possess the necessary skills to perform basic classification tasks. At the end, you’ll also be asked a few questions about both the data and the classification methods.
Data
The data set for this assignment contains nutritional measurements of several (260, to be exact) items on the McDonald’s menu, such as an Egg McMuffin and a Big Mac. The variables include serving size, calories, sodium content, and various other features (24 in total) of McDonald’s menu items.
Features
Category: Breakfast, Beef & Pork, Chicken & Fish, Salads, Snacks & Sides, Desserts, Beverages, Coffee & Tea, and Smoothies & Shakes
Item: Name of the menu item
Serving Size: Amount of food in one serving. Solid foods are described in grams, while liquids are described in milliliters.
Calories
Calories from Fat
Total Fat
Total Fat (% Daily Value)
Saturated Fat
Saturated Fat (% Daily Value)
Trans Fat
Cholesterol
Cholesterol (% Daily Value)
Sodium
Sodium (% Daily Value)
Carbohydrates
Carbohydrates (% Daily Value)
Dietary Fiber
Dietary Fiber (% Daily Value)
Sugars
Protein
Vitamin A (% Daily Value)
Vitamin C (% Daily Value)
Calcium (% Daily Value)
Iron (% Daily Value)
The variable that your classifiers should aim to predict is the first one, Category, which ranges over eight possible values: Breakfast, Beef & Pork, Chicken & Fish, Salads, Snacks & Sides, Desserts, Beverages, and Coffee & Tea.
Cleaning
The data are located here.
Load them, view them, summarize them, and think about how to prepare them for analysis. Be sure to take into account the differing requirements of the three learning algorithms. E.g.,
k-NN predicts only factors, from only numeric predictors.
Aside: Some classifiers, like the implementation of k-NN in the class library, cannot handle missing values. Thus, to use k-NN, the data should be stripped of incomplete cases. The rpart and klaR libraries include an option called na.action, which takes the values na.omit or na.pass. The former omits observations with missing data; the latter includes them. Decision trees handle missing values (when na.action = na.pass) via something called surrogate splits. When naive Bayes encounters an observation with a missing attribute value, and na.action = na.pass, it simply ignores that observation when computing probabilities pertaining to that attribute.
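As an illustration of stripping incomplete cases, here is a minimal sketch. It uses R's built-in iris data as a stand-in, since your McDonald's data frame and its column names may differ:

```r
# Stand-in data: iris with a couple of values removed to mimic missing entries
df <- iris
df[c(3, 7), "Sepal.Length"] <- NA

# complete.cases() flags rows with no missing values;
# keep only those rows before handing the data to class::knn
df_complete <- df[complete.cases(df), ]
nrow(df)           # 150
nrow(df_complete)  # 148
```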
Accuracy
A quick spoiler: a naive Bayes model that predicts Category using all the features in this data set can achieve 90% accuracy.
Regardless, we’re going to work on building models with fewer features, so that they are easier to interpret. We’ll begin with a single-feature model, and work our way up from there. We also want you to build each model with the exact number of features specified. So if you happen to achieve the next section’s accuracy with fewer features (e.g., reaching the three-feature milestone with a two-feature model), you should still add at least one new feature.
Grading
You will be graded on two things: the accuracy of your models and your writeup. For each problem, we will specify an accuracy milestone that you need to reach for credit. You will not get any credit for accuracy below what we specify. However, only five of your models have to achieve this accuracy, not all of them. That said, the rest of your models will probably achieve this same accuracy ±5%. For each model, you should explain which features you chose and why. Note: To explain “why” will probably require that you produce descriptive statistics (including visualizations) of a feature’s relationship to Category.
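One quick way to produce such descriptive statistics is to compare a feature's distribution across classes. A sketch using iris as a stand-in (swap in your chosen feature and Category with the real data):

```r
# Per-class summary statistics: mean of a numeric feature within each class
tapply(iris$Sepal.Length, iris$Species, mean)

# Side-by-side boxplots make class separation (or the lack of it) easy to see
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Sepal length by species")
```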
Training and Test Data
In order to standardize grading, we have divided the data into training and test sets. You can find the training data here, and the test data here.
Remember to evaluate accuracy on the testing data, not the training data. Otherwise, you will think you are getting a much higher accuracy than you really are. Analogously, you will think you are getting a much higher grade than you really are.
Learning Algorithms
Many learning algorithms have what are called hyperparameters, which are parameters that are used to specify a learning model. For example, the k in k-NN is a hyperparameter. Likewise, the depth of a decision tree is a hyperparameter. (Naive Bayes does not have any hyperparameters to speak of.)
The rpart library allows you to control many hyperparameters of a decision tree, including:
maxdepth: the maximum depth of a tree, meaning the maximum number of levels it can have
minsplit: the minimum number of observations that a node must contain for another split to be attempted
minbucket: the minimum number of observations that a terminal bucket (a leaf) must contain (i.e., further splits are not considered on features that would spawn children with too few observations)
To control these hyperparameters, you should call the rpart function with all the usual suspects (described below, for completeness), as well as a fourth argument called control:
# Decision tree
library(rpart.plot) # Install the rpart.plot package to visualize your decision tree models

tree <- rpart(y ~ x, data = my_data, method = 'class')
rpart.plot(tree)

pruned_tree <- rpart(y ~ x, data = my_data, method = 'class',
                     control = rpart.control(maxdepth = 3)) # an example depth limit
rpart.plot(pruned_tree)
The “usual suspects” (i.e., the basic arguments to rpart) are:
formula is the formula to use for the tree in the form of label ~ feature_1 + feature_2 + ....
data specifies the data frame
method is either class for a classification tree or anova for a regression tree
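Putting the pieces together, here is a runnable sketch of a shallow classification tree, again using iris as a stand-in for the menu data:

```r
library(rpart)

# A classification tree capped at depth 2 via the control argument
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(maxdepth = 2, minsplit = 20, minbucket = 7))

preds <- predict(fit, iris, type = "class")
mean(preds == iris$Species)  # training accuracy of the shallow tree
```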
Model Building
For each section (i.e., number of features), you are tasked with building three models: k-NN, decision trees, and naive Bayes. To measure prediction accuracy for k-NN, you should proceed more or less as you did in studio.
# For kNN
my_predictions <- knn( ... )
mean(my_predictions == mcdonalds_test$Category) # Testing accuracy
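If it helps, here is a filled-in version of that pattern with iris standing in for the menu data. Note that k-NN is distance-based, so scaling the numeric predictors usually matters:

```r
library(class)

set.seed(1)
idx <- sample(nrow(iris), 100)   # rows used for training
X   <- scale(iris[, 1:4])        # scaled numeric predictors

my_predictions <- knn(train = X[idx, ], test = X[-idx, ],
                      cl = iris$Species[idx], k = 5)
mean(my_predictions == iris$Species[-idx])  # testing accuracy
```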
Fordecision trees and naive Bayes, you should runthe predict function onyour model and your test data, as shown below. Then you can check whether the predictions output by your model match the Category values in the test data using:
# For Decision Trees
my_dt <- rpart( ... )
dt_preds <- predict(my_dt, mcdonalds_test, type = "class")
mean(dt_preds == mcdonalds_test$Category) # Testing accuracy
# For Naive Bayes
my_nb <- NaiveBayes( ... )
nb_preds <- predict(my_nb, mcdonalds_test)$class
mean(nb_preds == mcdonalds_test$Category) # Testing accuracy
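And the analogous filled-in naive Bayes sketch (klaR's NaiveBayes, with iris as a stand-in for the menu data):

```r
library(klaR)

my_nb <- NaiveBayes(Species ~ ., data = iris)
nb_preds <- predict(my_nb, iris)$class
mean(nb_preds == iris$Species)  # training accuracy here; use your test set instead
```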
Feel free to make use of an accuracy function to save on typing, and at the same time limit opportunities for bugs!
acc <- function(x) mean(x == mcdonalds_test$Category)
1-feature model
Start off simple by building one-feature models. Generate descriptive statistics to compare the relationships between various features of interest and Category. Use your intuition to guide your search. Percent of daily recommended intake of vitamin A may not significantly impact an item’s category. On the other hand, the serving size or the amount of saturated fat may have a significant impact.
For full credit achieve 67% or higher.
2-feature model
Building off of your one-feature model, add a new feature. For some models, it may actually be better to use a new combination of features, but you should still be able to add one new feature in total and increase your accuracy.
For full credit achieve 69% or higher.
3-feature model
Next, building off your two-feature model or not (i.e., feel free to start over from scratch), build a three-feature model.
For full credit achieve 71% or higher.
4-feature model
Next, build a model using four features. For full credit achieve 73% or higher.
5-feature model
Build a model using five features. For full credit achieve 82% or higher.
6-feature model
Build a model using six features. For full credit achieve 85% or higher.
7-feature model
Finish up by building a model using seven features. For full credit achieve 87% or higher.
Accuracy and explanatory power
At this point, we’ll turn you loose (temporarily).
Build the best model you can that trades off accuracy for explanatory power. That is, using a reasonable number of features (possibly more than four, but definitely fewer than 24), build a model that achieves relatively high accuracy, but at the same time is easy to explain. Report the accuracy of your model, and explain how/why it works.
Cross-validation
One shortcoming of the above analysis of your various models is that you evaluated them on only one partition of the data into training and test sets. We asked you to do this only so that we could standardize grading. It is always good practice to cross-validate your models on multiple partitions of the data. The final step in this homework is to complete this cross-validation. You can do so using the train function, just as you did in studio, with the method argument set equal to "knn", "rpart", or "nb". Do you expect your accuracies to go up or down, and why?
Note: To test the accuracy of predictions using a model built by the train function, the type argument to the predict function should be set equal to "raw" not "class".
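A minimal cross-validation sketch with caret's train function (iris again as a stand-in; substitute Category ~ your features and your McDonald's training data):

```r
library(caret)

set.seed(1)
ctrl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation
fit  <- train(Species ~ ., data = iris,
              method = "knn", trControl = ctrl)

fit$results                               # cross-validated accuracy per candidate k
preds <- predict(fit, iris, type = "raw") # note type = "raw", not "class"
mean(preds == iris$Species)
```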
Follow Up
Now that you’ve finished classifying the category of McDonald’s food items, let’s talk about the classifiers. For each algorithm—
k-NN, decision trees, and naive Bayes—list their pros and cons, and provide examples of where and when you’d use each of them. (Feel free to surf the web for help with this discussion question. But please cite all your sources.)