logo Use CA10RAM to get 10%* Discount.
Order Nowlogo
(5/5)

We have a Canvas quiz that is meant to test that you have read the library documentation for the packages we use for this class.

INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS

Introduction

You may be asking yourself “what is the importance of learning about Data Science and Machine Learning in a cybersecurity class?”, The short answer is that data science is a useful set of tools to handle the massive amount of data that flow through IT systems and it is used by many security teams either explicitly or within tools/programs they use so it is important to get a basic understanding of how it works. This Project will go through a simplified scenario where data science can be used, if this sparks your interest there are plenty of other ML focused classes at GaTech that you may be interested in taking, as well as a wealth of training materials on Youtube, Coursera, Udacity, Udemy, DataCamp etc that you could use to go deeper into the field.

Scenario:

You are an analyst on a security team for a midsized software company that runs a messaging app (a slack, gchat, microsoft teams competitor). It is Monday morning and you see an email from your manager setting up a meeting to discuss a new security feature that needs to be implemented in the product ASAP. You join the meeting and learn that recently there has been a big uptick in malicious executable files being sent over the chat app and it is starting to generate bad press for the company. A few analysts on the team already worked on analyzing a set of files sent over the app and classifying them as malicious or benign. They also used a python library (pefile) to get some attributes of each executable file and have created a CSV with those extracted attributes and a column with the name class with a 1 denoting a malicious file and a 0 denoting a benign files. They documented their preprocessing work in a readme in the git repo (urwithajit9/ClaMP: A Malware classifier dataset built with header fields’ values of Portable Executable files (github.com)) and shared the repo with software engineers so they can get to work writing code that will generate those features for every executable file sent over the messaging app. Your boss turns to you and says I would like you to help us to understand a bit more about how big of a problem this is on our app and write a model that takes in these features and produces a propensity score from 0 to 1 where scores closer to 0 mean a low likelihood of the file being malicious and closer to a 1 means a higher likelihood of a file being malicious. Also since the team may want to reuse this type of work in the future for different types of files or with different extracted attributes you should create functions that can be used in the future with minimal rework. Once you produce a model, you will share your code and the trained model file with the software engineers who will integrate the model into the messaging app and will score all files uploaded to the app.

General Advice

Develop locally then test in the autograder when you are confident your code runs without errors. You can run the python files locally, develop in a local vscode/jupyter notebook or on a hosted web notebook like google colab.

Do not use print statements in your gradescope submissions. While print statements are useful to debug issues locally in an autograder context they can leak sensitive answer information. We have detections in gradescope that will block you from viewing scores/outputs from your code if you use print statements in any of your submitted code. If you try to bruteforce/hack/game the autograder or extract information we will give you a 0 for the whole assignment. 

Read the python library documentation. You will be using pandas, scikit-learn, yellowbrick 

Do not hard code solutions (see screenshot below for what not to do):

 

 

Task 0 (20 points)

We have a Canvas quiz that is meant to test that you have read the library documentation for the packages we use for this class. It is not meant to be tricky and can be completed before you start the project or after you finish it. 

Useful Links:

scikit-learn Documentation

Deliverables:

Complete Canvas Quiz

 

Task 1 (5 points)

Lets first get familiar with some pandas basics. Pandas is a library that handles data frames which you can think of as a python class that handles tabular data. In this section you will make a very simple function that takes in a pandas dataframe of the file attributes and classifications and returns some simple attributes. See the function skeleton and implement a count of rows, count of columns, count of rows where the classification is 1 (positive), count of rows where the classification is 0 (negative) and a percentage of classification of 1 (percent positive) in the dataset. Generally in the real world you would also use plotting tools like PowerBi, Tableau, Data Studio, Matplotlib etc to create graphics and other visuals to better understand the dataset you are working with, this step is generally known as Exploratory Data Analysis. Since we are using an autograder for this class we will skip the plotting for this project.

 

Useful Links:

pandas documentation — pandas 1.5.3 documentation (pydata.org)

What is Exploratory Data Analysis? | IBM

Top Data Visualization Tools | KDnuggets 

 

Deliverables:

Complete find_dataset_statistics function in task1.py

Submit task1.py to gradescope

 

 

Task 2 (10 points)

Now that you have a basic understanding of pandas and the dataset it is time to dive into some more complex data processing tasks. The first subtask in this task is splitting your dataset into both features and targets (columns) and splitting your dataset into training and test sets (rows). These are basic concepts in model building but at a high level it is important to hold out a subset of your data when you train a model so you can see what the expected performance is on unseen samples and so you can determine if the resulting model is overfit (performs much better on training data vs test data). Preprocessing data is important since most models only take in numerical values so categorical features need to be “encoded” to numerical values so models can use them. Numerical scaling can be more or less useful depending on the type of model used but is especially important in linear models. These preprocessing techniques will provide you options to augment your dataset and improve model performance. 

For one hot encoding and scaling functions you should return a dataframe with the encoded/scaled columns concatenated to the ‘other’ columns you did not transform. For the PCA functions you should return just the PCA dataframe. For the feature engineering dataframe you should return the feature engineered column attached to the input dataframe.

Example Output for one hot encoding (where color and version are encoded):

 

Note: For these functions (and in data science/ML in general) you should be training/fitting to the train set and predicting/transforming on the train and test sets. 

Note: for onehot encoding please use the format columnName_columnValue for the new encoded column names. Ie in the sklearn example of a column named `gender` and values `male` and `female`, you would put `gender_male` and `gender_female` as your new column names after one hot encoding. If the test set has a third gender then you would denote that value you didn’t see at training time with 0’s for each encoded column. 

Note: When using sklearn.preprocessing.OneHotEncoder check the output from the transformation it may cause issues if it is not a format that pandas dataframes expect. To see this and to debug it, try running your code locally (or on a google colab notebook) using the training csv with a few categorical columns.

 

 

(5/5)
Attachments:

Related Questions

. Introgramming & Unix Fall 2018, CRN 44882, Oakland University Homework Assignment 6 - Using Arrays and Functions in C

DescriptionIn this final assignment, the students will demonstrate their ability to apply two ma

. The standard path finding involves finding the (shortest) path from an origin to a destination, typically on a map. This is an

Path finding involves finding a path from A to B. Typically we want the path to have certain properties,such as being the shortest or to avoid going t

. Develop a program to emulate a purchase transaction at a retail store. This program will have two classes, a LineItem class and a Transaction class. The LineItem class will represent an individual

Develop a program to emulate a purchase transaction at a retail store. Thisprogram will have two classes, a LineItem class and a Transaction class. Th

. SeaPort Project series For this set of projects for the course, we wish to simulate some of the aspects of a number of Sea Ports. Here are the classes and their instance variables we wish to define:

1 Project 1 Introduction - the SeaPort Project series For this set of projects for the course, we wish to simulate some of the aspects of a number of

. Project 2 Introduction - the SeaPort Project series For this set of projects for the course, we wish to simulate some of the aspects of a number of Sea Ports. Here are the classes and their instance variables we wish to define:

1 Project 2 Introduction - the SeaPort Project series For this set of projects for the course, we wish to simulate some of the aspects of a number of

Ask This Question To Be Solved By Our ExpertsGet A+ Grade Solution Guaranteed

expert
Um e HaniScience

802 Answers

Hire Me
expert
Muhammad Ali HaiderFinance

982 Answers

Hire Me
expert
Husnain SaeedComputer science

621 Answers

Hire Me
expert
Atharva PatilComputer science

648 Answers

Hire Me

Get Free Quote!

320 Experts Online