
INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS
  1. Hadoop MapReduce – Sampling a dataset 50 points

Imagine you’re working with a terabyte-scale dataset and you have a MapReduce application you want to test with that dataset. Running your MapReduce application against the dataset may take hours, and constantly iterating with code refinements and rerunning against it isn’t an optimal workflow.

To solve this problem you look to sampling, which is a statistical methodology for extracting a relevant subset of a population. In the context of MapReduce, sampling provides an opportunity to work with large datasets without the overhead of having to wait for the entire dataset to be read and processed.

TO-DO – In Hadoop MapReduce code:

  • Write an input format that wraps the actual input format used to read data. Your code should work with TextInputFormat.

  • The input format that you’ll write should be configured, via arguments, with the number of samples that should be extracted from the wrapped input.

  • The input test data for this problem is the same as homework 1: cookbook_text/*.txt.
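The core of the wrapping input format is choosing a fixed number of records from a stream of unknown length. One standard way to do this in a single pass is reservoir sampling; the sketch below illustrates the idea in plain Python (the assignment itself asks for a Hadoop `InputFormat`/`RecordReader` wrapper, so treat this only as the sampling logic you would embed there — the function name and signature are my own).

```python
import random

def reservoir_sample(records, n, seed=None):
    """Keep a uniform random sample of n records from a stream,
    without knowing the stream's length in advance."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < n:
            # Fill the reservoir with the first n records.
            reservoir.append(record)
        else:
            # Replace an existing sample with probability n/(i+1).
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = record
    return reservoir

# Sample 10 "lines" from a large simulated input stream.
sample = reservoir_sample((f"line-{i}" for i in range(100_000)), 10, seed=42)
print(sample)
```

In the Hadoop version, the wrapper's record reader would delegate to the wrapped `TextInputFormat` reader and stop (or sub-sample) once the configured number of records has been emitted.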

  2. Spark DataFrame operations (3 points each) – Spark Scala, Spark Java, or PySpark

 

  1. Create a single dataframe from all CSV files in the zip, with header information

  2. Show the dataframe columns

  3. Show the first 20 rows, sorted by (capacity descending, model ascending)

  4. Count the total number of rows

  5. Count the total number of rows, grouped by capacity

  6. Get the dataframe summary statistics

  7. Select the following columns: date, model, capacity

  8. Select the number of distinct models

  9. Calculate the pairwise frequency of these two columns (i.e., a crosstab): capacity, smart_1_normalized

  10. Find the mean value of column capacity

 

3. Spark Anomaly Detection – Hard Drive Failures 60 points

Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behavior, called outliers.

Anomalies can be broadly categorized as:

  1. Point anomalies: A single instance of data is anomalous if it’s too far off from the rest. Business use case: Detecting credit card fraud based on “amount spent”.

  2. Contextual anomalies: The abnormality is context-specific. This type of anomaly is common in time-series data. Business use case: Spending $100 on food every day during the holiday season is normal, but may be odd otherwise.

  3. Collective anomalies: A set of data instances collectively helps in detecting anomalies. Business use case: Someone unexpectedly copying data from a remote machine to a local host, an anomaly that would be flagged as a potential cyber attack.

TO-DO, in Spark Scala, Spark Java, or PySpark (** no Pandas **):

 Given the hard drive logs for 2019 Q1, implement a K-Nearest Neighbors (KNN) point anomaly detector for:

a)      Normalized Read Error Rate, SMART attribute 1.

b)      Annualized Failure Rate (by model)

ETL: Computations/Transformations and data labeling (20 points): For an explanation of hard drive SMART attributes, refer to:

For an explanation of scripts required for these computations, refer to docs_Q1_2019.zip.
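Annualized failure rate is conventionally computed from drive-days: AFR = failures / drive-days × 365. A minimal pure-Python sketch of the per-model computation (the `(model, failure)` row shape mirrors the Backblaze log format, where each drive contributes one row per day and `failure` is 1 only on the day it fails; in Spark this would be a `groupBy("model")` aggregation):

```python
from collections import defaultdict

def annualized_failure_rate(rows):
    """rows: iterable of (model, failure), one row per drive per day.
    Returns AFR per model: failures / drive_days * 365."""
    drive_days = defaultdict(int)
    failures = defaultdict(int)
    for model, failure in rows:
        drive_days[model] += 1   # each row is one drive-day of exposure
        failures[model] += failure
    return {m: failures[m] / drive_days[m] * 365 for m in drive_days}

# One drive observed for a full year, failing on the last day -> AFR of 1.0.
rows = [("ST4000DM000", 0)] * 364 + [("ST4000DM000", 1)]
print(annualized_failure_rate(rows))  # {'ST4000DM000': 1.0}
```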

ANOMALY DETECTION: Machine Learning training and testing (40 points)

Implement KNN (supervised training) for:

a)      Normalized Read Error Rate, SMART attribute 1.

b)      Annualized Failure Rate (by model)

For generating labels, use a threshold of 100 for a) and 2% for b).
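One reading of the task: generate labels by thresholding the metric (anomalous when SMART 1 reaches the cut-off), then train a KNN classifier on the labeled values and classify new points by majority vote of their k nearest neighbours. The sketch below shows that logic in plain Python for a single numeric feature; the threshold interpretation and function names are my assumptions, and a Spark implementation would distribute the neighbour search (e.g., via a join or an ML library) rather than sort in memory.

```python
def label(values, threshold):
    """Generate supervised labels: 1 (anomalous) at or above the threshold."""
    return [(v, 1 if v >= threshold else 0) for v in values]

def knn_predict(train, x, k=5):
    """Classify x by majority vote among the k training points nearest to it."""
    neighbours = sorted(train, key=lambda vl: abs(vl[0] - x))[:k]
    votes = sum(lbl for _, lbl in neighbours)
    return 1 if votes * 2 > len(neighbours) else 0

# Labels from the assignment's cut-off of 100 for normalized SMART attribute 1.
train = label([10, 20, 30, 40, 120, 130, 150], threshold=100)
print(knn_predict(train, 25, k=3))   # 0 -> normal
print(knn_predict(train, 140, k=3))  # 1 -> anomalous
```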

Anomaly Detection Reference: https://www.datascience.com/blog/python-anomaly-detection

  4. Spark Bloom Filters and Broadcast Joins 50 points

Suppose you are interested in records in one dataset, Table A, based on values in another dataset, Table B. Generally, an inner join is used as a form of filtering.

Consider, for example, a case where Table A has hundreds of millions of rows while Table B has only a few thousand.

In cases like this, you might want to avoid the shuffle that the join operation introduces, especially if the dataset you want to use for filtering is significantly smaller than the main dataset on which you will perform your further computation.
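One shuffle-free approach is to build a Bloom filter over the small table's keys and broadcast it, so each Table A partition can discard non-matching rows locally (Spark's `broadcast()` join hint achieves a similar effect for the join itself). A minimal pure-Python Bloom filter sketch follows; the sizes and the SHA-256-based hashing scheme are illustrative choices, not a production design.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: probabilistic set membership with no false
    negatives and a false-positive rate tunable via bit count m and
    hash count k."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # Derive k bit positions by salting one cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

# Build the filter from the small table's keys, then filter the big table's
# rows locally -- in Spark, the filter object would be broadcast to executors.
bf = BloomFilter()
for key in ["ST4000DM000", "ST12000NM0007"]:
    bf.add(key)

rows = [("ST4000DM000", 1), ("WD30EFRX", 2)]
filtered = [r for r in rows if bf.might_contain(r[0])]
print(filtered)
```

Rows that survive the Bloom filter may still include occasional false positives, so an exact join on the reduced data (now much smaller) finishes the filtering.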
