

CS100 Homework #3, Part II

Overview

The topic of this homework assignment is supervised learning. The first half is concerned with linear regression, and the second half, classification.

Part II: Classification

The second part of this assignment is concerned with classification. In class, you learned (or will be learning) about four different classification methods:

k-NN, decision trees, logistic regression, and naive Bayes. Likewise, in studio you have explored some of these in the context of binary classification, where data are classified in only one of two ways (e.g., Clinton supporters, or Trump supporters). In this assignment, you're going to explore multiclass classification, where more than two classes may apply (e.g., in addition to the above two, Jill Stein supporters, or Gary Johnson supporters). All the binary classification algorithms you learned about (except logistic regression) generalize to multiple classes. Thus, in this assignment, you will be building classifiers using k-NN, decision trees, and naive Bayes.

Libraries

Before you can begin working on this assignment, you must install, and then load, the necessary libraries:

```r
library(class)  # k-NN
library(rpart)  # Decision trees
library(klaR)   # Naive Bayes
library(caret)  # Cross-validation
```

This assignment is relatively free form compared to other CS 100 assignments, in the sense that there are no exact answers. We’ll tell you what accuracy ranges we expect your classifiers to achieve so that you can gauge your performance. If you can achieve those accuracies, you will have convinced us that you possess the necessary skills to perform basic classification tasks. At the end, you’ll also be asked a few questions about both the data and the classification methods.

Data

The data set for this assignment contains nutritional measurements of several (260, to be exact) items on the McDonald’s menu, such as an Egg McMuffin and a Big Mac. The variables include serving size, calories, sodium content, and various other features (24 in total) of McDonald’s menu items.

Features

1. Category: Breakfast, Beef & Pork, Chicken & Fish, Salads, Snacks & Sides, Desserts, Beverages, Coffee & Tea, and Smoothies & Shakes

2. Item: Name of the menu item

3. Serving Size: Amount of food in one serving. Solid foods are described by grams, while liquids are described by milliliters.

4. Calories

5. Calories from Fat

6. Total Fat

7. Total Fat (% Daily Value)

8. Saturated Fat

9. Saturated Fat (% Daily Value)

10. Trans Fat

11. Cholesterol

12. Cholesterol (% Daily Value)

13. Sodium

14. Sodium (% Daily Value)

15. Carbohydrates

16. Carbohydrates (% Daily Value)

17. Dietary Fiber

18. Dietary Fiber (% Daily Value)

19. Sugars

20. Protein

21. Vitamin A (% Daily Value)

22. Vitamin C (% Daily Value)

23. Calcium (% Daily Value)

24. Iron (% Daily Value)

The variable that your classifiers should aim to predict is the first one, Category, which ranges over eight possible values: Breakfast, Beef & Pork, Chicken & Fish, Salads, Snacks & Sides, Desserts, Beverages, and Coffee & Tea.

Cleaning

The data are located here.

Load them, view them, summarize them, and think about how to prepare them for analysis. Be sure to take into account the differing requirements of the three learning algorithms. E.g., k-NN predicts only factors, from only numeric predictors.

Aside: Some classifiers, like the implementation of k-NN in the class library, cannot handle missing values. Thus, to use k-NN, the data should be stripped of incomplete cases. The rpart and klaR libraries include an option called na.action, which takes the value na.omit or na.pass. The former omits observations with missing data; the latter includes them. Decision trees handle missing values (when na.action = na.pass) via something called surrogate splits. When naive Bayes encounters an observation with a missing attribute value, and na.action = na.pass, it simply ignores that observation when computing probabilities pertaining to that attribute.
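As a minimal sketch of the stripping step, using a tiny made-up data frame as a stand-in for the real menu data:

```r
# Toy stand-in for the menu data (the real data come from the files linked above)
menu <- data.frame(
  Category = factor(c("Breakfast", "Desserts", "Beverages", "Breakfast")),
  Calories = c(300, 280, NA, 450),
  Sodium   = c(750, 170, 0, 1300)
)

# knn() from the class library cannot handle NAs, so keep complete cases only
menu_complete <- menu[complete.cases(menu), ]
nrow(menu_complete)  # 3: the row with the missing Calories value is dropped
```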

Accuracy

A quick spoiler: a naive Bayes model that predicts Category using all the features in this data set can achieve 90% accuracy.

Regardless, we're going to work on building models with fewer features, so that they are easier to interpret. We'll begin with a single-feature model, and work our way up. We also want you to build a model with the exact number of features specified. So if you happen to achieve the next section's accuracy with fewer features (e.g., reaching the three-feature milestone with a two-feature model), you should still add at least one new feature.

You will be graded on two things: the accuracy of your models and your writeup. For each problem, we will specify a milestone that you need to reach for credit. You will not get any credit for accuracy below what we specify. However, only five of your models have to achieve this accuracy, not all of them. That said, the rest of your models will probably achieve this same accuracy ±5%. For each model, you should explain which features you chose and why. Note: To explain "why" will probably require that you produce descriptive statistics (including visualizations) of a feature's relationship to Category.

Training and Test Data

In order to standardize grading, we have divided the data into training and test sets. You can find the training data here, and the test data here.

Remember to evaluate accuracy on the testing data, not the training data. Otherwise, you will think you are getting a much higher accuracy than you really are. Analogously, you will think you are getting a much higher grade than you really are.

Learning Algorithms

Many learning algorithms have what are called hyperparameters, which are parameters that are used to specify a learning model. For example, the k in k-NN is a hyperparameter. Likewise, the depth of a decision tree is a hyperparameter. (Naive Bayes does not have any hyperparameters to speak of.)

The rpart library allows you to control many hyperparameters of a decision tree, including:

• maxdepth: the maximum depth of a tree, meaning the maximum number of levels it can have

• minsplit: the minimum number of observations that a node must contain for another split to be attempted

• minbucket: the minimum number of observations that a terminal bucket (a leaf) must contain (i.e., further splits are not considered on features that would spawn children with too few observations)

To control these hyperparameters, you should call the rpart function with all the usual suspects (described below, for completeness), as well as a fourth argument called control:

```r
# Decision tree
library(rpart.plot) # Install the rpart.plot package to visualize your decision tree models

tree <- rpart(y ~ x, data = my_data, method = 'class')
rpart.plot(tree)

# The maxdepth value below is illustrative; the value in the original handout was cut off
pruned_tree <- rpart(y ~ x, data = my_data, method = 'class',
                     control = rpart.control(maxdepth = 3))
rpart.plot(pruned_tree)
```

The "usual suspects" (i.e., the basic arguments to rpart) are:

• formula is the formula to use for the tree, in the form label ~ feature_1 + feature_2 + ...

• data specifies the data frame

• method is either class for a classification tree or anova for a regression tree
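To make the hyperparameters concrete, here is a hedged sketch on R's built-in iris data, standing in for the menu data (the hyperparameter values here are arbitrary, chosen only for illustration):

```r
library(rpart)

# Fit a shallow classification tree, constrained by the three hyperparameters above
fit <- rpart(Species ~ ., data = iris, method = 'class',
             control = rpart.control(maxdepth = 2, minsplit = 10, minbucket = 5))

# Training accuracy of the constrained tree
preds <- predict(fit, iris, type = 'class')
mean(preds == iris$Species)
```

Tightening maxdepth or raising minbucket trades training accuracy for a smaller, more interpretable tree.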

Model Building

For each section (i.e., number of features), you are tasked with building three models: k-NN, decision trees, and naive Bayes. To measure prediction accuracy for k-NN, you should proceed more or less as you did in studio.

```r
# For k-NN
my_predictions <- knn( ... )
mean(my_predictions == mcdonalds_test$Category) # Testing accuracy
```
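A self-contained sketch of the knn() call, with made-up, well-separated points standing in for the real training and test sets:

```r
library(class)

# Made-up training points: low-calorie drinks vs. high-calorie entrees
train_x <- data.frame(Calories = c(100, 110, 500, 520),
                      Sodium   = c(10, 12, 900, 950))
train_y <- factor(c("Beverages", "Beverages", "Beef & Pork", "Beef & Pork"))
test_x  <- data.frame(Calories = c(105, 510),
                      Sodium   = c(11, 920))

# Each test point is assigned the majority class among its k = 3 nearest neighbors
preds <- knn(train = train_x, test = test_x, cl = train_y, k = 3)
preds  # "Beverages" "Beef & Pork": each test point lands with its nearest cluster
```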

For decision trees and naive Bayes, you should run the predict function on your model and your test data, as shown below. Then you can check whether the predictions output by your model match the Category values in the test data using:

```r
# For decision trees (note: avoid naming the result "predict", which shadows the function)
my_dt <- rpart( ... )
dt_predictions <- predict(my_dt, mcdonalds_test, type = "class")
mean(dt_predictions == mcdonalds_test$Category) # Testing accuracy
```

```r
# For naive Bayes
my_nb <- NaiveBayes( ... )
nb_predictions <- predict(my_nb, mcdonalds_test)$class
mean(nb_predictions == mcdonalds_test$Category) # Testing accuracy
```

Feel free to make use of an accuracy function to save on typing, and at the same time limit opportunities for bugs!

```r
acc <- function(x) mean(x == mcdonalds_test$Category)
```
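A slightly more general variant takes the truth vector as an argument, which makes the helper reusable across data sets (the vectors below are made up for illustration):

```r
# Accuracy helper that takes both predictions and true labels
acc2 <- function(pred, truth) mean(pred == truth)

acc2(c("Salads", "Desserts", "Desserts"),
     c("Salads", "Desserts", "Beverages"))  # 2 of 3 match, so 0.667
```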

1-feature model

Start off simple by building one-feature models. Generate descriptive statistics to compare the relationships between various features of interest and Category. Use your intuition to guide your search. Percent of daily recommended intake of vitamin A may not significantly impact an item's category. On the other hand, the serving size or the amount of saturated fat may have a significant impact.

For full credit achieve 67% or higher.

2-feature model

Building off of your one-feature model, add a new feature. For some models, it may actually be better to use a new combination of features, but you should still be able to add one new feature in total and increase your accuracy.

For full credit achieve 69% or higher.

3-feature model

Next, building off your two-feature model or not (i.e., feel free to start over from scratch), build a three-feature model.

For full credit achieve 71% or higher.

4-feature model

Continue by building a model using four features. For full credit achieve 73% or higher.

5-feature model

Continue by building a model using five features. For full credit achieve 82% or higher.

6-feature model

Continue by building a model using six features. For full credit achieve 85% or higher.

7-feature model

Finish up by building a model using seven features. For full credit achieve 87% or higher.

Accuracy and explanatory power

At this point, we’ll turn you loose (temporarily).

Build the best model you can that trades off accuracy for explanatory power. That is, using a reasonable number of features (possibly more than four, but definitely fewer than 24), build a model that achieves relatively high accuracy, but at the same time is easy to explain. Report the accuracy of your model, and explain how/why it works.

Cross-validation

One shortcoming of the above analysis of your various models is that you evaluated them on only one partition of the data into training and test sets. We asked you to do this only so that we could standardize grading. It is always good practice to cross-validate your models on multiple partitions of the data. The final step in this homework is to complete this cross-validation. You can do so using the train function, just as you did in studio, with the method argument set equal to "knn", "rpart", or "nb". Do you expect your accuracies to go up or down, and why?

Note: To test the accuracy of predictions using a model built by the train function, the type argument to the predict function should be set equal to "raw" not "class".
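Putting the train call and the type = "raw" note together, here is a hedged sketch on the built-in iris data (standing in for the menu data; assumes the caret library is installed):

```r
library(caret)
set.seed(100)

# 5-fold cross-validated k-NN; caret tunes k over a small default grid
fit <- train(Species ~ ., data = iris, method = "knn",
             trControl = trainControl(method = "cv", number = 5))

# Note the type = "raw" (not "class") when predicting from a caret model
preds <- predict(fit, iris, type = "raw")
mean(preds == iris$Species)
```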

Now that you've finished classifying the category of McDonald's food items, let's talk about the classifiers. For each algorithm (k-NN, decision trees, and naive Bayes), list their pros and cons, and provide examples of where and when you'd use each of them. (Feel free to surf the web for help with this discussion question. But please cite all your sources.)
