Assignment 1-MapReduce Warm-up (20 points)
This assignment has three short coding problems. Please read the instructions carefully and submit all the required files on Blackboard.
Problem 1- Processing Yelp Review Dataset (10 pts)
For this lab, you will take a sample of the Yelp review dataset, described below, and for each business id find the number of reviews and the average stars given to that business.
The data is made available to the public by Yelp and is in JSON format. Please go to https://www.yelp.com/dataset and click on “Download Dataset”, then enter your name, email address, and initials, check the box to agree to the dataset license, and click on “Download”. You will be redirected to another page. Click on “Download JSON” to download the dataset. You can read the documentation of the JSON dataset here: https://www.yelp.com/dataset/documentation/json. The file that you will be using for this assignment is called review.json and it is about 3.6 GB compressed. This file contains a sample of reviews given by users to each business and includes the business and user ids.
A JSON object is very similar to an XML tag in that it consists of a set of attributes and their values. Below is an example of a JSON object in the review.json file. Each line of review.json contains a single JSON object (I pasted each attribute on a separate line for readability here, but in the dataset these are all on one line).
// string, 22 character unique review id
// string, 22 character unique user id, maps to the user in user.json
// string, 22 character business id, maps to business in business.json
// integer, star rating
// string, date formatted YYYY-MM-DD
To process JSON in Java, you can add the following library to your pom.xml:
Then in your map function, you can create a json object and extract the attributes you want by calling the method “get” on the json object. For example, you can extract the attribute “stars” in your map function as follows (suppose that “value” is the value passed as an argument to your map function):
JSONObject jsn = new JSONObject(value.toString());
int stars = (Integer) jsn.get("stars");
You can extract the other attributes you need in a similar fashion.
What you need to do:
You need to write a MapReduce program that takes review.json as input and, for each business_id, outputs the number of reviews given to that business together with its average stars. Each output line should have the following format:
business_id average_stars review_count
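The per-business computation that the reducer performs can be sketched in plain Java, without Hadoop on the classpath. This is a minimal illustration rather than the required solution: the class name ReviewStats, the method summarize, the tab separator, and the two-decimal rounding are all assumptions made for the example.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not part of Hadoop or of any assignment skeleton):
// mirrors what a reducer would do with all star values of one business_id.
final class ReviewStats {

    // Returns "business_id<TAB>average_stars<TAB>review_count".
    // The tab separator and two-decimal rounding are assumptions.
    static String summarize(String businessId, List<Integer> stars) {
        long sum = 0;
        long count = 0;
        for (int s : stars) {
            sum += s;
            count++;
        }
        // Guard against an empty group so we never divide by zero.
        double average = (count == 0) ? 0.0 : (double) sum / count;
        return String.format(Locale.ROOT, "%s\t%.2f\t%d", businessId, average, count);
    }
}
```

For instance, calling ReviewStats.summarize("b1", Arrays.asList(5, 4, 3)) produces the fields b1, 4.00, and 3 separated by tabs.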
You can first create a smaller sample of the review.json file and test your program on this sample. For example, you can use the following Unix shell command to copy the first 100K lines of review.json into another file called review_sample.json and run your program locally on this sample:
head -n 100000 review.json > review_sample.json
Once you are confident that your program works correctly on the smaller sample, right-click on your project folder in Eclipse, click on Export, and click on “runnable jar” to export your program as a runnable jar. The reason we are exporting it as a runnable jar is that we would like to package the JSON library we used with the jar file; that way all nodes running the map function will have access to that library.
Attention: You do not need to specify the name of your driver class in the hadoop jar command if your MapReduce program is exported as a runnable jar. In that case you only need to specify the paths to your input and output files. That is,
hadoop jar <path to your runnable jar on local> <input path on hdfs> <output path on hdfs>
You can emit multiple attributes as a key from your map or reduce function by appending them together and sending them as a Text object. For example, if you want to send both A1 and A2 as the key from your reducer, you can emit new Text(A1 + "," + A2) as the key.
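The string side of this composite-key approach can be tried out in plain Java. The sketch below is only an illustration: CompositeKey, join, and split are hypothetical names, and the comma separator assumes that the first attribute never contains a comma itself.

```java
// Hypothetical helper illustrating composite keys built by string concatenation.
final class CompositeKey {
    // Assumption: the first attribute never contains this separator.
    static final String SEP = ",";

    // Builds the string that would be wrapped in a Text object.
    static String join(String a1, String a2) {
        return a1 + SEP + a2;
    }

    // Recovers the two attributes; the limit of 2 keeps any further
    // separators inside the second attribute.
    static String[] split(String key) {
        return key.split(SEP, 2);
    }
}
```

In a Hadoop job, the joined string would be emitted as new Text(CompositeKey.join(a1, a2)), and the receiving side would call CompositeKey.split(key.toString()).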
What you need to submit:
The source code for your mapper, reducer, driver, and combiner. Please name your driver class as YelpAverageStar.java
Problem 2- Increase the performance of your program for Problem 1 with a custom combiner (10 pts)
Modify your solution to Problem 1 and use a custom combiner to increase the performance of your MapReduce program (please refer to the lecture “more on MapReduce”, slides 29-44). Run and debug your program on a smaller dataset. Once you are sure that your program works correctly, copy the Yelp review data to HDFS, create a jar file, and run the program on your three-node YARN cluster. Once your job is completed, record the job elapsed time and the reduce shuffle size (the reduce shuffle size is printed on the terminal once the job is completed; you can find the job elapsed time in the YARN GUI, under the application history). Then go back to your program, comment out the line that sets the combiner, and run your program again on the cluster without the combiner. Record the job elapsed time and the reduce shuffle size again. Does your program run faster when using a combiner? What is the reduce shuffle size with and without a combiner?
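Hadoop may run a combiner zero, one, or several times on partial map output, so a combiner cannot simply average the stars it sees: an average of averages is generally not the overall average. A common way around this (which may differ from what the lecture slides prescribe) is to pass partial (sum, count) pairs through the combiner and let only the reducer divide. The plain-Java sketch below illustrates the arithmetic; PartialAverage and its methods are hypothetical names, not Hadoop classes.

```java
// Hypothetical value type carrying a partial sum of stars and a review count.
final class PartialAverage {
    final long sum;
    final long count;

    PartialAverage(long sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    // What a mapper would emit for a single review.
    static PartialAverage ofSingle(int stars) {
        return new PartialAverage(stars, 1);
    }

    // What a combiner (or reducer) does: add sums and counts. The operation is
    // associative and commutative, so applying it any number of times is safe.
    PartialAverage merge(PartialAverage other) {
        return new PartialAverage(sum + other.sum, count + other.count);
    }

    // Only the final reducer should divide.
    double average() {
        return count == 0 ? 0.0 : (double) sum / count;
    }
}
```

Suppose one map task sees stars 5 and 5 for a business and another sees a single 2. Merging the partial pairs gives (12, 3) and an average of 4.0, whereas averaging the two local averages (5.0 and 2.0) would wrongly give 3.5.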
What you need to submit:
The source code for your mapper, reducer, driver, and combiner. Please name your driver class as YelpAverageReviewWithCombiner.java
A document that compares the shuffle size and elapsed time of the job with and without the combiner.
Problem 3- Finding the pair of airports with the maximum number of cancellations per carrier (Optional, +5 pts)
If you want to get more practice with MapReduce and some extra points, then this problem is for you. You are given a large dataset of flight arrival and departure details for all commercial flights within the USA between 1987 and 2000. The original dataset is about 5.5 GB and is extracted from here: stat-computing.org/dataexpo/2009/the-data.html. Click on the link, download the files for 1987 to 2000, and copy them into a folder on your virtual machine. You can name the folder anything you want, for example, flightdata. The goal is to write a simple MapReduce program to find which origin/destination pair had the largest number of cancelled flights for each unique carrier.
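Before writing the MapReduce version, it can help to check the logic on a handful of records in memory. The sketch below is one possible illustration, not the expected solution: CancelledRoutes and worstRoutePerCarrier are hypothetical names, and the four-field record layout is an assumption for the example, not the column order of the CSV files.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory version of the Problem 3 logic; a MapReduce job
// would spread the same counting across mappers and reducers.
final class CancelledRoutes {

    // Each record is {carrier, origin, destination, cancelledFlag}, where a
    // flag of "1" marks a cancelled flight (assumed layout for this example).
    // Returns carrier -> "ORIGIN-DEST,count"; ties are broken arbitrarily.
    static Map<String, String> worstRoutePerCarrier(String[][] records) {
        // carrier -> (route -> number of cancelled flights)
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (String[] r : records) {
            if (!"1".equals(r[3])) {
                continue; // skip flights that were not cancelled
            }
            counts.computeIfAbsent(r[0], k -> new HashMap<>())
                  .merge(r[1] + "-" + r[2], 1, Integer::sum);
        }
        Map<String, String> best = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            String bestRoute = null;
            int bestCount = -1;
            for (Map.Entry<String, Integer> rc : e.getValue().entrySet()) {
                if (rc.getValue() > bestCount) {
                    bestCount = rc.getValue();
                    bestRoute = rc.getKey();
                }
            }
            best.put(e.getKey(), bestRoute + "," + bestCount);
        }
        return best;
    }
}
```

In a MapReduce setting, the counting step corresponds to a job keyed on the carrier and route together, and the per-carrier maximum corresponds to grouping those counts by carrier.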
What is the format of the dataset and how can you access it?
Each file contains flight information for a particular year. For example, 2000.csv contains all flight information for the year 2000.