
INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS
  1. Hadoop MapReduce – Sampling a dataset 50 points

Imagine you’re working with a terabyte-scale dataset and you have a MapReduce application you want to test with that dataset. Running your MapReduce application against the dataset may take hours, and constantly iterating with code refinements and rerunning against it isn’t an optimal workflow.

To solve this problem you look to sampling, which is a statistical methodology for extracting a relevant subset of a population. In the context of MapReduce, sampling provides an opportunity to work with large datasets without the overhead of having to wait for the entire dataset to be read and processed.

TO-DO – In Hadoop MapReduce code:

  • Write an input format that wraps the actual input format used to read data. Your code should work with TextInputFormat.

  • The input format that you’ll write should be configured, via arguments, with the number of samples that should be extracted from the wrapped input.

  • The input test data for this problem is the same as homework 1: cookbook_text/*.txt.
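The core of the wrapping input format is choosing a fixed number of records from a stream of unknown length. One standard way to do this in a single pass is reservoir sampling; the sketch below illustrates the idea in plain Python (the assignment itself asks for a Hadoop `InputFormat`/`RecordReader` wrapper, so treat this only as the sampling logic you would embed there — the function name and signature are my own).

```python
import random

def reservoir_sample(records, n, seed=None):
    """Keep a uniform random sample of n records from a stream,
    without knowing the stream's length in advance."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < n:
            # Fill the reservoir with the first n records.
            reservoir.append(record)
        else:
            # Replace an existing sample with probability n/(i+1).
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = record
    return reservoir

# Sample 10 "lines" from a large simulated input stream.
sample = reservoir_sample((f"line-{i}" for i in range(100_000)), 10, seed=42)
print(sample)
```

In the Hadoop version, the wrapper's record reader would delegate to the wrapped `TextInputFormat` reader and stop (or sub-sample) once the configured number of records has been emitted.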

  2. Spark DataFrame operations (3 points each) – Spark Scala, Spark Java, or PySpark

 

  1. Create a single dataframe from all CSV files in the zip, with header information

  2. Show the dataframe columns

  3. Show the first 20 rows, sorted by (capacity descending, model ascending)

  4. Count the total number of rows

  5. Count the total number of rows, grouped by capacity

  6. Get the dataframe summary statistics

  7. Select the following columns: date, model, capacity

  8. Select the number of distinct models

  9. Calculate the pairwise frequency of these two columns (i.e., a crosstab): capacity, smart_1_normalized

  10. Find the mean value of column capacity

 

3. Spark Anomaly Detection – Hard Drive Failures 60 points

Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behavior, called outliers.

Anomalies can be broadly categorized as:

  1. Point anomalies: A single instance of data is anomalous if it’s too far off from the rest. Business use case: Detecting credit card fraud based on “amount spent”.

  2. Contextual anomalies: The abnormality is context-specific. This type of anomaly is common in time-series data. Business use case: Spending $100 on food every day during the holiday season is normal, but may be odd otherwise.

  3. Collective anomalies: A set of data instances collectively helps in detecting anomalies. Business use case: Someone unexpectedly copying data from a remote machine to a local host, an anomaly that would be flagged as a potential cyber attack.

TO-DO, in Spark Scala, Spark Java, or PySpark (** no Pandas **):

 Given the hard drive logs for 2019 Q1, implement a K-Nearest Neighbors (KNN) point anomaly detector for:

a)      Normalized Read Error Rate, SMART attribute 1.

b)      Annualized Failure Rate (by model)

ETL: Computations/Transformations and data labeling (20 points): For an explanation of hard drive SMART attributes, refer to:

For an explanation of scripts required for these computations, refer to docs_Q1_2019.zip.
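Annualized failure rate is conventionally computed from drive-days: AFR = failures / drive-days × 365. A minimal pure-Python sketch of the per-model computation (the `(model, failure)` row shape mirrors the Backblaze log format, where each drive contributes one row per day and `failure` is 1 only on the day it fails; in Spark this would be a `groupBy("model")` aggregation):

```python
from collections import defaultdict

def annualized_failure_rate(rows):
    """rows: iterable of (model, failure), one row per drive per day.
    Returns AFR per model: failures / drive_days * 365."""
    drive_days = defaultdict(int)
    failures = defaultdict(int)
    for model, failure in rows:
        drive_days[model] += 1   # each row is one drive-day of exposure
        failures[model] += failure
    return {m: failures[m] / drive_days[m] * 365 for m in drive_days}

# One drive observed for a full year, failing on the last day -> AFR of 1.0.
rows = [("ST4000DM000", 0)] * 364 + [("ST4000DM000", 1)]
print(annualized_failure_rate(rows))  # {'ST4000DM000': 1.0}
```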

ANOMALY DETECTION: Machine Learning training and testing (40 points)

Implement KNN (supervised training) for:

a)      Normalized Read Error Rate, SMART attribute 1.

b)      Annualized Failure Rate (by model)

For generating labels, use a threshold of 100 for a) and 2% for b).
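One reading of the task: generate labels by thresholding the metric (anomalous when SMART 1 reaches the cut-off), then train a KNN classifier on the labeled values and classify new points by majority vote of their k nearest neighbours. The sketch below shows that logic in plain Python for a single numeric feature; the threshold interpretation and function names are my assumptions, and a Spark implementation would distribute the neighbour search (e.g., via a join or an ML library) rather than sort in memory.

```python
def label(values, threshold):
    """Generate supervised labels: 1 (anomalous) at or above the threshold."""
    return [(v, 1 if v >= threshold else 0) for v in values]

def knn_predict(train, x, k=5):
    """Classify x by majority vote among the k training points nearest to it."""
    neighbours = sorted(train, key=lambda vl: abs(vl[0] - x))[:k]
    votes = sum(lbl for _, lbl in neighbours)
    return 1 if votes * 2 > len(neighbours) else 0

# Labels from the assignment's cut-off of 100 for normalized SMART attribute 1.
train = label([10, 20, 30, 40, 120, 130, 150], threshold=100)
print(knn_predict(train, 25, k=3))   # 0 -> normal
print(knn_predict(train, 140, k=3))  # 1 -> anomalous
```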

Anomaly Detection Reference: https://www.datascience.com/blog/python-anomaly-detection

  4. Spark Bloom Filters and Broadcast Joins 50 points

Suppose you are interested in records in one dataset, Table A, based on values in another dataset, Table B. Generally, an inner join is used as a form of filtering.

Consider, for example, a case where Table A has hundreds of millions of rows while Table B has only a few thousand.

In cases like this, you might want to avoid the shuffle that the join operation introduces, especially if the dataset you want to use for filtering is significantly smaller than the main dataset on which you will perform your further computation.
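One shuffle-free approach is to build a Bloom filter over the small table's keys and broadcast it, so each Table A partition can discard non-matching rows locally (Spark's `broadcast()` join hint achieves a similar effect for the join itself). A minimal pure-Python Bloom filter sketch follows; the sizes and the SHA-256-based hashing scheme are illustrative choices, not a production design.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: probabilistic set membership with no false
    negatives and a false-positive rate tunable via bit count m and
    hash count k."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # Derive k bit positions by salting one cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

# Build the filter from the small table's keys, then filter the big table's
# rows locally -- in Spark, the filter object would be broadcast to executors.
bf = BloomFilter()
for key in ["ST4000DM000", "ST12000NM0007"]:
    bf.add(key)

rows = [("ST4000DM000", 1), ("WD30EFRX", 2)]
filtered = [r for r in rows if bf.might_contain(r[0])]
print(filtered)
```

Rows that survive the Bloom filter may still include occasional false positives, so an exact join on the reduced data (now much smaller) finishes the filtering.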
