
INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS

Part 1/3: Reading Invoice Data in Spark

1)   To get started, create a Spark cluster in the Databricks console.  Once your cluster is up and running, take a screenshot and post it below.

Step 1: Creating a new Spark cluster using Scala 2.11, Spark 2.4.3, and Python 3

 

Step 2: Take a screenshot of the running cluster

 

2)   Next, create a new notebook and execute the following code to print a sample of the invoice dataset (provided by Databricks).  Again, provide a screenshot.

/dbfs/databricks-datasets/online_retail/data-001/data.csv

 

    # Read the CSV through the DBFS local-file mount and print its contents
    with open("/dbfs/databricks-datasets/online_retail/data-001/data.csv") as f:
        x = f.read()
    print(x)

Step 3: Creating a new notebook

 

Step 4: Displaying the built-in dataset inside the Python notebook

 

3)   Read the invoice CSV into a resilient distributed dataset (RDD).  Collect the first five rows and print them.  Take a screenshot of both the code and printed output and include it here.

 

Step 5: Creating an RDD and displaying the first five rows of the dataset
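A minimal sketch of what the Step 5 cell might look like. The Spark calls are shown in comments because they need a live cluster; the take-first-five logic is demonstrated against a small in-memory sample (the assumed column layout for the online_retail dataset), so the snippet runs anywhere.

```python
# In the Databricks notebook (where `sc` is the provided SparkContext):
#
#     rdd = sc.textFile("/databricks-datasets/online_retail/data-001/data.csv")
#     for row in rdd.take(5):
#         print(row)
#
# Plain-Python equivalent of take(5) on a toy sample of the same CSV:
SAMPLE_CSV = """InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/10 8:26,2.55,17850,United Kingdom
536365,71053,WHITE METAL LANTERN,6,12/1/10 8:26,3.39,17850,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/10 8:26,2.75,17850,United Kingdom
536366,22633,HAND WARMER UNION JACK,6,12/1/10 8:28,1.85,17850,United Kingdom
536367,84879,ASSORTED COLOUR BIRD ORNAMENT,32,12/1/10 8:34,1.69,13047,United Kingdom"""

# Take the first five lines, exactly as rdd.take(5) would
first_five = SAMPLE_CSV.split("\n")[:5]
for row in first_five:
    print(row)
```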

Part 2/3: Answer the following questions regarding invoice data

 

For each question below, please:

  • Use map and reduce functions to answer the question.

  • Provide the snippet of Spark code that you used to answer the question.

  • Include a screenshot of your notebook that includes both the code and the printed answer.

 

1)   Which customer in the dataset has spent the most on products?  The quantity multiplied by the unit price will give you the total dollar amount spent per invoice line.
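The map/reduce logic for this question can be sketched with Python's built-in `map` and `functools.reduce` on a toy sample of (CustomerID, Quantity, UnitPrice) tuples; in the notebook, the same lambdas would plug into `rdd.map(...)` and `reduceByKey(...)`. The sample values are illustrative, not real totals.

```python
from functools import reduce

rows = [
    ("17850", 6, 2.55),
    ("17850", 6, 3.39),
    ("13047", 32, 1.69),
]

# map: one (customer, quantity * unit_price) pair per invoice line
pairs = map(lambda r: (r[0], r[1] * r[2]), rows)

# reduce: fold the pairs into per-customer totals
def merge(totals, pair):
    cust, amount = pair
    totals[cust] = totals.get(cust, 0.0) + amount
    return totals

totals = reduce(merge, pairs, {})
top_customer = max(totals.items(), key=lambda kv: kv[1])
print(top_customer)
```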

 

2)   What is the product description for the best-selling product in the dataset?  We will define "Best Selling" as the product with the highest quantity sold.
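The same pattern applies here, keyed on the product description and summing quantity. A plain-Python sketch on toy (Description, Quantity) pairs; in the notebook this is a `reduceByKey` over `rdd.map(line -> (description, quantity))`.

```python
from functools import reduce

lines = [
    ("WHITE METAL LANTERN", 6),
    ("ASSORTED COLOUR BIRD ORNAMENT", 32),
    ("WHITE METAL LANTERN", 12),
]

# reduce: sum quantities per description
def add_quantity(totals, pair):
    desc, qty = pair
    totals[desc] = totals.get(desc, 0) + qty
    return totals

qty_by_product = reduce(add_quantity, lines, {})
best_selling = max(qty_by_product.items(), key=lambda kv: kv[1])
print(best_selling[0])  # description of the best-selling product
```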

 

3)   How much has each country spent on products?  The output should have two columns, one being the country and the other being the gross dollar amount spent across all products.  Sort the output by the dollar amount, descending.  Print the entire output, showing a gross dollar amount for each country.
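A hedged sketch of the per-country aggregation and descending sort, again on a toy sample; in the notebook the equivalent chain would be `rdd.map(...).reduceByKey(...).sortBy(lambda kv: kv[1], ascending=False).collect()`.

```python
from functools import reduce

rows = [
    # (Country, Quantity, UnitPrice)
    ("United Kingdom", 6, 2.55),
    ("France", 10, 1.00),
    ("United Kingdom", 6, 3.39),
]

# reduce: accumulate quantity * unit_price per country
def add_amount(totals, row):
    country, qty, price = row
    totals[country] = totals.get(country, 0.0) + qty * price
    return totals

by_country = reduce(add_amount, rows, {})

# sort by gross amount, descending, and print every country
ranked = sorted(by_country.items(), key=lambda kv: kv[1], reverse=True)
for country, gross in ranked:
    print(country, round(gross, 2))
```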

 

4)   What is the highest-grossing day in the dataset?  Again, use quantity multiplied by unit price to get the revenue per line.
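For this question the key is the date portion of InvoiceDate (e.g. "12/1/10 8:26" becomes "12/1/10"). A plain-Python sketch of the map/reduce on toy (InvoiceDate, Quantity, UnitPrice) tuples:

```python
from functools import reduce

rows = [
    ("12/1/10 8:26", 6, 2.55),
    ("12/1/10 8:26", 6, 3.39),
    ("12/2/10 9:01", 4, 1.25),
]

# map: (day, quantity * unit_price), keeping only the date part of the timestamp
pairs = map(lambda r: (r[0].split(" ")[0], r[1] * r[2]), rows)

# reduce: sum revenue per day
def add_revenue(totals, pair):
    day, amount = pair
    totals[day] = totals.get(day, 0.0) + amount
    return totals

revenue_by_day = reduce(add_revenue, pairs, {})
best_day = max(revenue_by_day.items(), key=lambda kv: kv[1])
print(best_day)
```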

 

5)   Finally, try out one of Databricks' visualizations.  Note that you will need to convert back to a DataFrame in order to visualize the data (hint: look at rdd.toDF()).  Create an appropriate DataFrame for visualization and call display on it.  Take a screenshot of your code and the resulting visualization.  You can find available visualizations by expanding the icon at the bottom of a cell.
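A minimal sketch of the conversion step. The `toDF`/`display` calls only work inside a Databricks notebook, so they are shown in comments; the runnable part just previews the shape of the DataFrame (column names are assumptions, not the assignment's required names).

```python
# In the notebook, an RDD of (Country, GrossAmount) pairs converts to a
# DataFrame for the built-in chart widget:
#
#     df = country_totals.toDF(["Country", "GrossAmount"])
#     display(df)
#
# Preview of that DataFrame's shape as plain columns-plus-rows:
columns = ["Country", "GrossAmount"]
data = [("United Kingdom", 35.64), ("France", 10.0)]
print(columns)
for row in data:
    print(row)
```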

 

Part 3/3: Kafka Questions

1)   In one sentence or less, what is the purpose of Kafka?

Kafka is open-source software that provides a framework for storing, reading, and analyzing streaming data.

 

2)   Describe two ways in which Kafka differentiates itself from other messaging systems.

  1. Kafka is an active-active system; with replication configured, it also supports an active-passive pattern for high availability. Most traditional message brokers do not support active-active deployments.

  2. Kafka scales well, and it scales horizontally: you can add more nodes to handle increasing load. I have seen systems scale to billions of messages and tens of terabytes of data per day on a few nodes of commodity hardware. Traditional JMS brokers, by contrast, can only scale vertically and quickly hit garbage-collection limits.

 

3)   Describe one architectural decision that has contributed to Kafka’s scalability and performance at scale.  Think specifically about how it interacts with the underlying Operating System and/or Java Virtual Machine (JVM).

 

Kafka's Performance

Kafka relies heavily on the OS kernel to move data around quickly, applying the principle of zero-copy transfer. It also lets you batch data records into chunks, and these batches flow end to end from the producer, to the file system (the Kafka topic log), to the consumer. Batching allows for more efficient data compression and reduces I/O latency. Kafka writes its immutable commit log to disk sequentially, avoiding random disk access and slow disk seeks. Finally, Kafka scales horizontally through sharding: it shards a topic log into hundreds, potentially thousands, of partitions spread across thousands of servers, which allows Kafka to handle massive load.
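The batching and replication behavior described above is controlled by standard Kafka producer settings. As an illustration only (shown as a plain dict, not a live producer; the broker address is an assumption), they might look like:

```python
# Hypothetical producer settings illustrating the batching/replication knobs
producer_config = {
    "bootstrap.servers": "localhost:9092",  # assumed local broker
    "batch.size": 65536,        # bytes buffered per partition before a send
    "linger.ms": 10,            # wait up to 10 ms so batches can fill
    "compression.type": "lz4",  # whole batches compress better than single records
    "acks": "all",              # the write completes only after replication
}
print(producer_config)
```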

 

Kafka's Scalability

Kafka is a good storage system for records and messages; it acts like a high-speed file system for commit-log storage and replication. These characteristics make Kafka useful for all manner of applications. Records written to Kafka topics are persisted to disk and replicated to other servers for fault tolerance. Since modern drives are fast and quite large, this fits well and is very useful. Kafka producers can wait on acknowledgment, so messages are durable: the producer's write does not complete until the message is replicated. The Kafka disk structure scales well, since modern disk drives have very high throughput when writing in large streaming batches.
