


CSCI 141 - Spring 2020

Assignment 5: Cancer Classification using Machine Learning 

1 Overview

The goal of this project is to gain more practice with using functions, lists and dictionaries and gain some intuition for Machine Learning, the field of computer science concerned with writing algorithms that allow computers to “learn” from data. One field where these techniques are making a difference is medicine.

The problem we’ll be solving is as follows: Given a data file containing hundreds of patient records with values describing measurements of cancer tumors and whether each tumor is malignant or benign, develop a simple rule-based classifier that can be used to predict whether an as-yet-unseen tumor is malignant or benign.

The general idea is that malignant tumors are different from benign tumors. Malignant tumors tend to have larger radii, to be smoother, to be more symmetric, etc. Measurements have been taken on many tumors whose class (malignant or benign) is known. The code you are going to write will compute the average value of an attribute (e.g. ‘area’) across all the malignant tumors, as well as the average value of that attribute across the benign tumors. Let’s say that the average area for malignant tumors is 100, and for benign tumors is 50. We can then use that information to try to predict whether a given tumor is malignant or benign.

Imagine you are presented with a new tumor and told the area was 99. All else being equal, we would have reason to think this tumor is more likely to be malignant than had its area been 51. Based on this intuition, we are going to create a simple classification scheme. We will calculate the midpoint between the malignant average and the benign average (75 in our hypothetical example), and simply say that for each new tumor, if its value for that attribute is greater than or equal to the midpoint value for that attribute, that is one vote for the tumor being malignant. Each attribute that we are using produces a vote, and at the end of counting votes for each attribute, if the malignant votes are greater than or equal to the benign votes, we predict that the tumor is malignant.
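The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the single-attribute rule: the averages are the hypothetical ones from the text, and the function name vote is made up for illustration (it is not part of the skeleton code).

```python
# Hypothetical class averages for 'area', taken from the example above.
malignant_avg = 100
benign_avg = 50

# The midpoint lies halfway between the two class averages.
midpoint = (malignant_avg + benign_avg) / 2


def vote(value, midpoint):
    """Return "M" if value is at or above the midpoint, otherwise "B"."""
    return "M" if value >= midpoint else "B"


print(midpoint)            # 75.0
print(vote(99, midpoint))  # M
print(vote(51, midpoint))  # B
print(vote(75, midpoint))  # M (a value equal to the midpoint counts as malignant)
```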

2 Machine Learning Framework 

“Machine learning” is a popular buzzword that might evoke computer brain simulations, or robots walking among humans. In reality (for now, anyway), machine learning refers to something less fanciful: algorithms that use previously observed data to make predictions about new data. It may sound less glamorous than fully sentient robots, but that’s exactly what was described above! Machine learning allows us to solve problems by considering hundreds or thousands of attributes (and their combinations) - far more than a human alone could do. You can get more sophisticated about the specifics of how you go about this, but that’s the core of what machine learning really means.

If using data to make predictions on new data is our goal, you might think it makes sense to use all the data we have to learn from. But in fact, if we truly don’t know the labels (e.g., malignant or benign) of the data we’re testing our algorithm on, we won’t have any idea whether it’s doing a good job! For this reason, it makes sense to split the data we have labels for into a training set, which we’ll use to “learn” from, and a test set, which we’ll use to evaluate how well the algorithm does on new data (i.e., data it wasn’t trained on). We will take about 80% of the data as our training set, and use the remaining 20% as our test set.
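In this assignment the split has already been done for you (the data arrives as two separate files), but it may help to see roughly what such a split involves. The sketch below assumes the records are held in a list; the function name split_records, the fixed seed, and the stand-in records are all illustrative choices, not part of the skeleton code.

```python
import random


def split_records(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of records and split it into (training, test) lists."""
    shuffled = list(records)               # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cutoff = int(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]


records = list(range(10))  # stand-in for ten labeled patient records
training, test = split_records(records)
print(len(training), len(test))  # 8 2
```

Shuffling before cutting matters: if the file happened to list all malignant tumors first, an unshuffled split could leave one class missing from the test set.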

 

2.1 Training Phase 

Here’s how our classifier will work: In the training phase, we will “learn” (read: compute) the average value of each attribute (e.g. area, smoothness, etc.) among the malignant tumors. We will also “learn” (again: compute) the average value of each attribute among benign tumors. Then we’ll compute the midpoint for each attribute. This collection of midpoints, one for each attribute, is our classifier.
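One way the training phase could look in Python is sketched below. It assumes each record is a dictionary with a "class" key holding "M" or "B", and it stores the classifier as a dictionary mapping attribute names to midpoints. The function names average and train_classifier, and the toy records, are invented for illustration; the skeleton code may organize this differently.

```python
def average(training_set, attribute, label):
    """Mean of `attribute` over the records whose class equals `label`."""
    values = [rec[attribute] for rec in training_set if rec["class"] == label]
    return sum(values) / len(values)


def train_classifier(training_set, attributes):
    """Return a dictionary mapping each attribute to its midpoint."""
    classifier = {}
    for attr in attributes:
        malignant_avg = average(training_set, attr, "M")
        benign_avg = average(training_set, attr, "B")
        classifier[attr] = (malignant_avg + benign_avg) / 2
    return classifier


# Made-up records with only two attributes, for illustration.
toy_training_set = [
    {"ID": 1, "radius": 20.0, "area": 120.0, "class": "M"},
    {"ID": 2, "radius": 16.0, "area": 80.0, "class": "M"},
    {"ID": 3, "radius": 10.0, "area": 60.0, "class": "B"},
    {"ID": 4, "radius": 12.0, "area": 40.0, "class": "B"},
]
toy_classifier = train_classifier(toy_training_set, ["radius", "area"])
print(toy_classifier)  # {'radius': 14.5, 'area': 75.0}
```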

2.2 Testing Phase 

Having trained our classifier, we can now use it to make an educated guess about the label of a new tumor if we have the measurements of all of its attributes. Our educated guess will be pretty simple:

  • If the tumor’s value for an attribute is greater than or equal to the midpoint value for that attribute, cast one vote for the tumor being malignant.

  • If the tumor’s attribute value is less than the midpoint, cast one vote for the tumor being benign.

  • Tally up the votes cast according to these rules for each of the ten attributes. If the malignant votes are greater than or equal to the benign votes, we predict that the tumor is malignant. Otherwise, we predict that it is benign.

If we want to use this classifier to diagnose people, we have an important question to answer: how good are our guesses? To answer this question, we’ll test our algorithm on the 20% of our data that we held out as the test set, which we didn’t use to train the classifier but for which we do know the correct labels. Our rate of accuracy on these data should be indicative of how well our classifier will do on new, unlabeled tumors.
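The voting rules and the accuracy check can be sketched as follows. The sketch assumes the classifier is a dictionary of midpoints keyed by attribute name and that each test record carries its true label under "class". The names classify and accuracy and the toy data (two attributes rather than ten) are hypothetical.

```python
def classify(record, classifier):
    """Predict "M" or "B" for one record by counting one vote per attribute."""
    malignant_votes = 0
    benign_votes = 0
    for attr, midpoint in classifier.items():
        if record[attr] >= midpoint:
            malignant_votes += 1
        else:
            benign_votes += 1
    # Ties go to malignant, matching the "greater than or equal to" rule.
    return "M" if malignant_votes >= benign_votes else "B"


def accuracy(test_set, classifier):
    """Fraction of test records whose prediction matches the true class."""
    correct = sum(1 for rec in test_set if classify(rec, classifier) == rec["class"])
    return correct / len(test_set)


# Made-up midpoints and test records, for illustration only.
toy_classifier = {"radius": 14.5, "area": 75.0}
toy_test_set = [
    {"radius": 19.0, "area": 110.0, "class": "M"},  # 2 malignant votes: "M"
    {"radius": 11.0, "area": 45.0, "class": "B"},   # 2 benign votes: "B"
    {"radius": 15.0, "area": 70.0, "class": "B"},   # 1-1 tie: "M" (a miss)
    {"radius": 13.0, "area": 60.0, "class": "B"},   # 2 benign votes: "B"
]
print(accuracy(toy_test_set, toy_classifier))  # 0.75
```

With ten attributes a 5-5 tie is possible, which is why the tie-breaking direction has to be stated explicitly.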

3 Dataset Description 

You have been provided with cancerTrainingData.txt, a text file containing the 80% of the data that we’ll use as our training set.

The file has many numbers per patient record, some of which refer to attributes of the tumor. The skeleton code includes the function make_training_set(), which reads in the important information from this file and produces a list of dictionaries. Each dictionary contains attributes for a single tumor as follows:

  0. ID

  1. radius

  2. texture

  3. perimeter

  4. area

  5. smoothness

  6. compactness

  7. concavity

  8. concave

  9. symmetry

  10. fractal

  11. class

The middle 10 attributes (numbered 1 through 10) are the numbers that describe the tumor. The first attribute is just the patient ID number, and the last attribute is the actual real life state of the tumor, namely, malignant (represented by “M”) or benign (represented by “B”).

We don’t need to know what these attributes mean: all we need to know is that they are measurements of the tumors, and that benign and malignant tumors tend to have different attribute values. For all 10 of these tumor attributes, higher numbers indicate malignancy when compared to the midpoint values. Pictorially, the list of dictionaries looks like this (two are shown, but the list contains many more than that):

[Figure: two example dictionaries from the list, one per tumor]

The dictionary stored in the 0th spot in the list gives the attributes for the 0th tumor: training_set[0]["class"] gives the true class label (in this case, “B” for benign) of the 0th tumor.
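As a rough illustration of this structure, here is a hand-written list with two dictionaries. The keys follow the attribute list above, but every number is a made-up placeholder (not a value from cancerTrainingData.txt), and storing the ID as an integer is an assumption.

```python
# Two made-up records in the shape described above.
training_set = [
    {"ID": 101, "radius": 12.3, "texture": 17.5, "perimeter": 78.9,
     "area": 462.0, "smoothness": 0.09, "compactness": 0.08,
     "concavity": 0.04, "concave": 0.02, "symmetry": 0.17,
     "fractal": 0.06, "class": "B"},
    {"ID": 102, "radius": 19.8, "texture": 22.1, "perimeter": 130.4,
     "area": 1203.0, "smoothness": 0.11, "compactness": 0.16,
     "concavity": 0.20, "concave": 0.13, "symmetry": 0.21,
     "fractal": 0.06, "class": "M"},
]

print(training_set[0]["class"])   # B
print(training_set[1]["radius"])  # 19.8
print(len(training_set[0]))       # 12
```

Indexing the list picks a tumor, and indexing the resulting dictionary by name picks one of its attributes.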

4 Getting Started

Download the skeleton code (cancer_classifier.py), training set (cancerTrainingData.txt), and the test set (cancerTestingData.txt). Make sure all three files are in the same directory, or the main program will not be able to load the data from the files.

In some browsers, clicking the link to each data file simply opens the file in your browser, which isn’t helpful. To download the data files, I recommend right-clicking the link from Canvas or the course webpage and selecting “Save File As...”, or your browser’s equivalent. Choose the same location as you’ve saved the skeleton code and save the files without changing their names to be sure that the program will be able to read them correctly.
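If you want to double-check that everything landed in one place, a short script like the following can help. It is only a convenience sketch, not part of the assignment: the helper name missing_files is made up, and it assumes you run it from the directory where you saved the files.

```python
from pathlib import Path

# File names as given in the instructions above.
REQUIRED_FILES = [
    "cancer_classifier.py",
    "cancerTrainingData.txt",
    "cancerTestingData.txt",
]


def missing_files(directory, names=REQUIRED_FILES):
    """Return the names from `names` that are not files inside `directory`."""
    folder = Path(directory)
    return [name for name in names if not (folder / name).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("Missing from this directory:", ", ".join(missing))
    else:
        print("All three files are here.")
```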
