
INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS

This assignment is designed to give students hands-on experience with designing an application using big data and some of the components of the Hadoop ecosystem.

Objectives

This assignment supports the following objectives:

 

  • Design toy applications using Sqoop, Flume, or Storm
  • Create a small Hadoop application

Details

ZoomZoomCom, a marketing company, has hired you to design a data analysis project based on Hadoop. They want to determine, for each of thousands of product categories, whether there is a correlation between the sentiment of the context in which a link to a page for a product of that category appears and the likelihood of the link being clicked on. For example, is a link to a cosmetic product more likely to be clicked on in an upbeat context than in a downbeat one? Is the opposite true for dieting products?

Additional Details

You are provided a database containing the URLs of a large number of product pages and the category of each product. (You may assume there is only one product on each product page.) Let us call it the ‘URL DB’.

Assume you have access to a web crawler that crawls web pages. You also have at your disposal a web scraping program that can determine whether a web page has a link to one of the product pages and, for each such link, can extract the paragraph in which it appears (or a roughly 50-word window around the occurrence of the link; don't worry about this kind of detail).

You have access to a log file which shows, for each such page, how many times it was served in a given week and how many times each link that points to one of our product pages was clicked on during that week. Based on the data in the log file you can compute the hit ratio of each link anchor: the number of times the link anchor was clicked on divided by the number of times the page was viewed. For example, an anchor clicked 30 times on a page served 1,000 times has a hit ratio of 3%.

You also have at your disposal a sentiment analysis program called Sentimenter for determining whether a paragraph is Upbeat, Downbeat, or Neutral.

 

So, based on the hit ratio of a link anchor and the sentiment of the context in which it appears, you may come up with something like this: for link anchor ‘XYZ’, the hit ratio was 30% when the context was upbeat, 2% when it was downbeat, and 7% when it was neutral.

You will have to group the link anchors by the product categories they point to. For example, different occurrences of “truly wonderful device” may point to different products.

Then you will have to aggregate these hit ratios by product category. So let us say, for cosmetics, we found a hundred link anchors and we have the hit ratios for each of our three sentiment categories. We can then simply average them out, and similarly for the other product categories. This will give us the correlations that we seek.

 

Question 1

 

First, the URL DB has to be imported into Hadoop.

 

  • Which component of Hadoop should your system use to ingest the URL DB into Hadoop? (3 points)
  • Which component should your system use to interact with the URL DB? (3 points)
  • Why have you chosen those components? (2 points)
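
For orientation, here is a minimal sketch of how one candidate component, Sqoop, could pull a relational URL DB into HDFS. The JDBC URL, credentials, table name, and target directory are all hypothetical, and the table is assumed to hold (url, product_category) rows.

    # Hypothetical connection details for a relational URL DB.
    sqoop import \
      --connect jdbc:mysql://dbhost/zoomzoom \
      --username analyst \
      --password-file /user/analyst/.dbpass \
      --table url_db \
      --target-dir /data/url_db \
      --num-mappers 4

Sqoop runs the import as parallel map tasks (four here), which is why it suits bulk transfers from relational stores into Hadoop.

Question 2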

Second, the log file has to be ingested into Hadoop.

 

  • Which component of Hadoop should your system use to ingest the log file into Hadoop? (3 points)
  • Why that component? What might be some alternatives? (2 points)
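
For reference, a minimal Flume agent configuration that tails a log file into HDFS might look like the sketch below; the agent name, log path, and HDFS directory are hypothetical.

    # Hypothetical Flume agent: tail the weekly hit log into HDFS.
    agent1.sources = logsrc
    agent1.channels = memch
    agent1.sinks = hdfssink

    agent1.sources.logsrc.type = exec
    agent1.sources.logsrc.command = tail -F /var/log/zoomzoom/weekly_hits.log
    agent1.sources.logsrc.channels = memch

    agent1.channels.memch.type = memory
    agent1.channels.memch.capacity = 10000

    agent1.sinks.hdfssink.type = hdfs
    agent1.sinks.hdfssink.hdfs.path = /data/logs/%Y-%m-%d
    agent1.sinks.hdfssink.hdfs.fileType = DataStream
    agent1.sinks.hdfssink.hdfs.useLocalTimeStamp = true
    agent1.sinks.hdfssink.channel = memch

Question 3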

Third, the crawler has to retrieve the content of web pages. For simplicity, let us assume that there is only one instance of the crawler operating and that it downloads and processes only one page at a time. We can also assume that the web pages to be downloaded belong only to certain media and blog sites (so, not the entire web).

 

  • Which Hadoop component should be used to ingest this content? (3 points)
  • Why that component? (2 points)
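
Because a single crawler handles one page at a time, one simple possibility is to write each fetched page straight to HDFS through the FileSystem API, as in this Java sketch (the class name, paths, and URL-hash naming scheme are hypothetical):

    // Minimal sketch: store one crawled page in HDFS.
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PageStore {
        public static void storePage(String url, String html) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Hypothetical naming scheme: hex hash of the URL as the file name.
            Path out = new Path("/data/crawl/" + Integer.toHexString(url.hashCode()));
            try (OutputStream os = fs.create(out, true)) {
                os.write(html.getBytes(StandardCharsets.UTF_8));
            }
        }
    }

Question 4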

 

Fourth, for each page we have to find all the links in each page which point to a URL in our URL DB. Let us say we are provided a program to do that called LinkFinderAnalyzer.

Fortunately, LinkFinderAnalyzer performs a number of additional tasks. It extracts the paragraph in which the link appears, invokes Sentimenter to determine the sentiment of that paragraph, and then computes the hit score of each link_anchor.

The output of running LinkFinderAnalyzer is a set of tuples of the form <Product_type, <Sentiment:hit_score>>. We want to reduce these tuples to tuples of the form <Product_type, <Sentiment:hit_score, Sentiment:hit_score, …, Sentiment:hit_score>>. We can use MapReduce to do these tasks, where the mapping function is LinkFinderAnalyzer. This is similar to the MapReduce exercise from your Week 11 assignment.

Keep in mind that for each product_type there can be many such tuples because there could be several products in a product type and there could be several URLs pointing to a product in that product type.

By ‘product type’ is meant the type of product it is (e.g., lipstick, of which there can be several brands). This is to be distinguished from product category which may contain many product types. For example, both lipstick and mascara would fall under the product category of cosmetics.
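
To make step 4 concrete, here is a minimal sketch of the map side in plain Java MapReduce. It assumes LinkFinderAnalyzer is callable from Java as LinkFinderAnalyzer.analyze(pageHtml), returning LinkResult objects with productType, sentiment, and hitScore fields; that interface is an assumption, since the assignment only says the program is provided.

    // Sketch of the step-4 mapper: one crawled page per input record.
    // LinkFinderAnalyzer and its LinkResult record are the provided program,
    // accessed here through an assumed Java interface.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LinkFinderMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text page, Context ctx)
                throws IOException, InterruptedException {
            // For each link into the URL DB, emit the product type as key and
            // the context sentiment plus anchor hit score as value.
            for (LinkResult r : LinkFinderAnalyzer.analyze(page.toString())) {
                ctx.write(new Text(r.productType),
                          new Text(r.sentiment + ":" + r.hitScore));
            }
        }
    }

The shuffle then groups these pairs by Product_type, and the reducer concatenates each group into the <Sentiment:hit_score, …, Sentiment:hit_score> list described above.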

Fifth, for each product category we want to combine the tuples produced by step 4. So, suppose we had two tuples <Cosmetics, <Upbeat:0.3, …, Downbeat:0.01>> and <Cosmetics, <Neutral:0.011, …, Upbeat:0.04>>; we want to combine them into one tuple: <Cosmetics, <Upbeat:0.3, …, Downbeat:0.01, Neutral:0.011, …, Upbeat:0.04>>.

This may also be a MapReduce job: the mapper does nothing (an identity mapper), the shuffle and sort phase groups the tuples by product category, and the reducer combines the lists for each product category.
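
A minimal sketch of that step-5 reduce side, assuming keys are product categories and values are the comma-separated Sentiment:hit_score lists produced by step 4:

    // Sketch of the step-5 reducer: concatenate the lists for one category.
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CombineListsReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text category, Iterable<Text> lists, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder combined = new StringBuilder();
            for (Text list : lists) {
                if (combined.length() > 0) combined.append(", ");
                combined.append(list);
            }
            ctx.write(category, new Text(combined.toString()));
        }
    }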

Sixth, for each tuple of step 5 we want to compute the average hit score for each sentiment type. The output of this step should be <Product_Category, <Upbeat:Average_hit_score, Neutral:Average_hit_score, Downbeat:Average_hit_score>>.

This is the final output of the system.
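
As a sketch of step 6, the averaging could itself be a reducer. This assumes the step-6 mapper re-emits each Sentiment:hit_score element of a step-5 list as a separate value, so the reducer sees individual pairs per product category:

    // Sketch of the step-6 reducer: average hit scores per sentiment.
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class AverageHitScoreReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text category, Iterable<Text> pairs, Context ctx)
                throws IOException, InterruptedException {
            Map<String, double[]> stats = new HashMap<>(); // sentiment -> {sum, count}
            for (Text pair : pairs) {
                String[] parts = pair.toString().split(":");
                double[] sc = stats.computeIfAbsent(parts[0], s -> new double[2]);
                sc[0] += Double.parseDouble(parts[1]);
                sc[1] += 1;
            }
            StringBuilder out = new StringBuilder();
            for (Map.Entry<String, double[]> e : stats.entrySet()) {
                if (out.length() > 0) out.append(", ");
                out.append(e.getKey()).append(':')
                   .append(e.getValue()[0] / e.getValue()[1]);
            }
            ctx.write(category, new Text(out.toString()));
        }
    }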

 

  • Which components of the Hadoop ecosystem will you use to do each of the jobs of steps 4-6? Would you use MapReduce or some interface to MapReduce (Hive? Pig? Spark?) (4 points)
  • Would you use the same component for each of steps 4-6, or would you use different components? You must justify your answer. (4 points)
  • Keep in mind that we are chaining together the operations of steps 4-6. How should the output of the previous step in the sequence be passed to the next step? What kind of Hadoop architecture is best suited for that? Will your system store the intermediate results? If so, in what sort of structure? You must justify your answer. (4 points)
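
Whichever components you pick, one common pattern to be aware of is a driver that chains plain MapReduce jobs by pointing each job's input at the previous job's HDFS output directory. The sketch below assumes that pattern; the paths are hypothetical and the per-step mapper/reducer classes are omitted:

    // Sketch of a driver chaining steps 4-6 through HDFS directories.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PipelineDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            runStep(conf, "step4-link-analysis", "/data/crawl", "/tmp/step4");
            runStep(conf, "step5-combine", "/tmp/step4", "/tmp/step5");
            runStep(conf, "step6-average", "/tmp/step5", "/results/final");
        }

        // Configure and run one job; setMapperClass/setReducerClass per step.
        private static void runStep(Configuration conf, String name,
                                    String in, String out) throws Exception {
            Job job = Job.getInstance(conf, name);
            FileInputFormat.addInputPath(job, new Path(in));
            FileOutputFormat.setOutputPath(job, new Path(out));
            if (!job.waitForCompletion(true)) System.exit(1);
        }
    }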
