1. Hadoop MapReduce – Sampling a dataset. 50 points
Imagine you’re working with a terabyte-scale dataset and you have a MapReduce application you want to test with that dataset. Running your MapReduce application against the dataset may take hours, and constantly iterating with code refinements and rerunning against it isn’t an optimal workflow.
To solve this problem you look to sampling, which is a statistical methodology for extracting a relevant subset of a population. In the context of MapReduce, sampling provides an opportunity to work with large datasets without the overhead of having to wait for the entire dataset to be read and processed.
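A common way to draw a fixed-size sample in a single pass is reservoir sampling (Algorithm R), which is the same pattern a sampling record reader can use while iterating over the wrapped input format. The sketch below is plain Python with illustrative names, not the required Hadoop code:

```python
import random

def reservoir_sample(records, k, seed=None):
    """Return up to k records chosen uniformly at random from an
    iterable, reading it exactly once (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep each later record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir

# Sample 10 "records" out of a million without holding them all.
sample = reservoir_sample(range(1_000_000), 10, seed=42)
```

In the MapReduce setting, the wrapped record reader would apply the same replacement logic per split, so only the sampled records ever reach the mapper.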
TO-DO – In Hadoop MapReduce code:
Write an input format that wraps the actual input format used to read data. Your code should work with TextInputFormat.
The input format you write should be configurable, via arguments, with the number of samples to extract from the wrapped input format.
The input test data for this problem is the same as homework 1: cookbook_text/*.txt.
2. Spark DataFrames (3 points each), in Spark Scala, Spark Java, or PySpark
Create a single dataframe from all CSV files in the zip, with header information
Show the dataframe columns
Show the first 20 rows, sorted by (capacity descending, model ascending)
Count the total number of rows
Count the total number of rows, grouped by capacity
Get the dataframe summary statistics
Select the following columns: date, model, capacity
Select the number of distinct models
Calculate the pairwise frequency of these two columns (i.e., a crosstab): capacity, smart_1_normalized
Find the mean value of column capacity
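Most of these tasks map directly onto DataFrame calls (df.columns, df.count(), df.describe(), df.stat.crosstab(), etc.). The crosstab item is the least obvious, so here is a plain-Python sketch of what a pairwise frequency table computes; column names follow the assignment's CSV schema, but this is not a Spark solution:

```python
from collections import Counter

def crosstab(rows, col_a, col_b):
    """Pairwise frequency of two columns: counts how often each
    (value_a, value_b) pair occurs, like DataFrame.stat.crosstab."""
    return Counter((row[col_a], row[col_b]) for row in rows)

# Illustrative rows shaped like the drive-stats CSVs.
rows = [
    {"capacity": 4000, "smart_1_normalized": 117},
    {"capacity": 4000, "smart_1_normalized": 117},
    {"capacity": 8000, "smart_1_normalized": 100},
]
freq = crosstab(rows, "capacity", "smart_1_normalized")
```

Spark's version returns the same counts pivoted into a DataFrame with one row per `capacity` value and one column per `smart_1_normalized` value.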
3. Spark Anomaly detection – Hard Drive Failures 60 points
Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behavior; such instances are called outliers.
Anomalies can be broadly categorized as:
Point anomalies: A single instance of data is anomalous if it’s too far off from the rest. Business use case: Detecting credit card fraud based on “amount”.
Contextual anomalies: The abnormality is context specific. This type of anomaly is common in time-series data. Business use case: Spending $100 on food every day during the holiday season is normal, but may be odd otherwise.
Collective anomalies: A set of data instances collectively helps in detecting anomalies. Business use case: Someone unexpectedly copying data from a remote machine to a local host, an anomaly that would be flagged as a potential cyber attack.
TO-DO, in Spark Scala, Spark Java, or PySpark (** no Pandas **):
Given the hard drive logs for 2019 Q1, implement a K-Nearest Neighbors (KNN) point anomaly detector for:
a) Normalized Read Error Rate, SMART attribute 1
b) Annualized Failure Rate (by model)
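As a conceptual sketch only (plain Python, not the required Spark code; k, the data, and the function name are illustrative), a KNN point-anomaly score can be computed as the mean distance from each point to its k nearest neighbors, with the largest scores flagged as outliers:

```python
def knn_anomaly_scores(values, k=5):
    """Score each value by the mean distance to its k nearest
    neighbors; larger scores indicate more anomalous points."""
    scores = []
    for i, x in enumerate(values):
        dists = sorted(abs(x - y) for j, y in enumerate(values) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Illustrative normalized SMART-1 readings; 45 is the obvious outlier.
readings = [100, 101, 99, 100, 102, 100, 45]
scores = knn_anomaly_scores(readings, k=3)
```

A Spark implementation would do the same neighbor-distance computation over the labeled 2019 Q1 data, using a join or an MLlib nearest-neighbor approach rather than the quadratic loop shown here.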
ETL: Computations/Transformations and data labeling (20 points): For an explanation of hard drive SMART attributes, refer to:
For an explanation of scripts required for these computations, refer to docs_Q1_2019.zip.
ANOMALY DETECTION: Machine Learning training and testing (40 points)
Implement KNN (supervised training) for:
a) Normalized Read Error Rate, SMART attribute 1
b) Annualized Failure Rate (by model)
For generating labels, use a threshold of 100 for a) and 2% for b).
Anomaly Detection Reference: https://www.datascience.com/blog/python-anomaly-detection
4. Spark Bloom Filters and Broadcast Joins 50 points
Suppose you are interested in records in one dataset, Table A, based on values in another dataset, Table B. Generally, an inner join is used as a form of filtering.
Consider, for example, the case where Table A has hundreds of millions of rows, while Table B has only a few thousand.
In cases like this, you might want to avoid the shuffle that the join operation introduces, especially if the dataset you want to use for filtering is significantly smaller than the main dataset on which you will perform your further computation.
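One shuffle-free pattern is to build a Bloom filter over Table B's join keys on the driver, broadcast it, and filter Table A locally on each executor (Spark's Scala API exposes this via df.stat.bloomFilter). A Bloom filter has no false negatives and a tunable false-positive rate. Below is a minimal plain-Python sketch of the data structure itself; the hash scheme and sizes are illustrative, not Spark's implementation:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: membership test with no false negatives
    and a small, size-dependent false-positive rate."""
    def __init__(self, num_bits=1 << 16, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Build the filter from the small table's keys (Table B), then use it
# to pre-filter the large table (Table A) without shuffling it.
bf = BloomFilter()
for key in ["model-A", "model-B"]:
    bf.add(key)
```

Because the filter can report false positives, rows that pass it may still need a final exact check, but the bulk of Table A is discarded without any shuffle.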