Earn Higher Grades With Instant Assignment Help.Ask Question!

Software Engineering
(5/5)

Reading Invoice Data in SparkTo get started, create a spark cluster in the Databricks console.Once your cluster is up and running

INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS

Part 1/3: Reading Invoice Data in Spark

1)   To get started, create a spark cluster in the Databricks console.  Once your cluster is up and running, take a screenshot and post it below. 

Step1: Creating a new spark cluster using Scala 2.11, Spark 2.4.3 and Python 3.0

 

Step2: Take a Screenshot of the running cluster

 

2)   Next, create a new notebook and execute the following code to print a sample of the

invoice dataset (provided by Databricks).  Again, provide a screenshot.

/dbfs/databricks-datasets/online_retail/data-001/data.csv

 

    with open("/dbfs/databricks-datasets/online_retail/data-001/data.csv") as f:

        x = ''.join(f.readlines())

    print(x)

Step3: Creating a new notebook

 

Step4: Displaying the built-in dataset inside of the python notebook

 

3)   Read the invoice CSV into a resilient distributed dataset (RDD).  Collect the first five rows and print them.  Take a screenshot of both the code and printed output and include it here.

 

Step5: Creating an RDD and displaying the first five rows of the dataset

Part 2/3: Answer the following questions regarding invoice data

 

For each question below, please:

  • Use map and reduce functions to answer the question.

  • Provide the snippet of Spark code that you used to answer the question.

  • Include a screenshot of your notebook that includes both the code and the printed

answer.

 

1)   Which customer in the dataset has spent the most on products?  The quantity

multiplied by the unit price will give you the total dollar amount spent per invoice line.

 

2)   What is the product description for the best-selling product in the dataset?  We will

define "Best Selling" as the product with the highest quantity sold.

 

3)   How much has each country spent on products?  The output should have two columns,

one being the country and the other being the gross dollar amount spent across all

products.  Sort the output by the dollar amount, descending.  Print the entire output,

showing a gross dollar amount for each country.

 

4)   What is the highest-grossing day in the dataset?  Again, use quantity multiplied by unit

price to get the revenue per line.

 

5)   Finally, try out one of Databrick's visualizations.  Note that you will need to convert

back to a DataFrame in order to visualize the data (hint: look at rdd.toDF()).  Create an

appropriate DataFrame for visualization and call display on it. 

Take a screenshot of your code and the resulting visualization.  You can find available

visualizations by expanding this icon at the bottom of a cell:

 

Part 3/3: Kafka Questions

1)   In one sentence or less, what is the purpose of Kafka.     

Kafka is an open source software which provides a framework for storing, reading and analyzing streaming data

 

2)   Describe two ways in which Kafka differentiates itself from other messaging

systems.

  1. Kafka is an active-active system. With replication configured it supports active-passive pattern for high availability. Most of the traditonal message brokers do not support active-active.

  2. Kafka scales well and scales horizontally, you can add more nodes to handle increasing load. I have seen systems scale upto billion msgs and 10s of TBs of data/day on few nodes of commodity hardware. JMS brokers in question can only scale vertically and will quickly hit gc limits 

 

3)   Describe one architectural decision that has contributed to Kafka’s scalability and

performance at scale.  Think specifically about how it interacts with the underlying

Operating System and/or Java Virtual Machine (JVM).

 

Kafkas Performance

Kafka relies heavily on the OS kernel to move data around quickly. It relies on the principals of Zero Copy. Kafka enables you to batch data records into chunks. These batches of data can be seen end to end from Producer to file system (Kafka Topic Log) to the Consumer. Batching allows for more efficient data compression and reduces I/O latency. Kafka writes to the immutable commit log to the disk sequential; thus, avoids random disk access, slow disk seeking. Kafka provides horizontal Scale through sharding. It shards a Topic Log into hundreds potentially thousands of partitions to thousands of servers. This sharding allows Kafka to handle massive load.

 

Kafkas Scalability

Kafka is a good storage system for records/messages. Kafka acts like high-speed file system for commit log storage and replication. These characteristics make Kafka useful for all manners of applications. Records written to Kafka topics are persisted to disk and replicated to other servers for fault-tolerance. Since modern drives are fast and quite large, this fits well and is very useful. Kafka Producers can wait on acknowledgment, so messages are durable as the producer write not complete until the message replicates. The Kafka disk structure scales well. Modern disk drives have very high throughput when writing in large streaming batches.

Attachments:
(5/5)

Related Questions

CSI 1420 Introduction to C Programming & Unix Fall 2018, CRN 44882, Oakland University Homework Assignment 6 - Using Arrays and Functions in C

DescriptionIn this final assignment, the students will demonstrate their ability to apply two majorconstructs of the C programming language – Fu

The standard path finding involves finding the (shortest) path from an origin to a destination, typically on a map. This is an

Path finding involves finding a path from A to B. Typically we want the path to have certain properties,such as being the shortest or to avoid going t

Develop a program to emulate a purchase transaction at a retail store. This program will have two classes, a LineItem class and a Transaction class. The LineItem class will represent an individual

Develop a program to emulate a purchase transaction at a retail store. Thisprogram will have two classes, a LineItem class and a Transaction class. Th

SeaPort Project series For this set of projects for the course, we wish to simulate some of the aspects of a number of Sea Ports. Here are the classes and their instance variables we wish to define:

1 Project 1 Introduction - the SeaPort Project series For this set of projects for the course, we wish to simulate some of the aspects of a number of

Project 2 Introduction - the SeaPort Project series For this set of projects for the course, we wish to simulate some of the aspects of a number of Sea Ports. Here are the classes and their instance variables we wish to define:

1 Project 2 Introduction - the SeaPort Project series For this set of projects for the course, we wish to simulate some of the aspects of a number of

Ask This Assignment To Be Done By Our ExpertsGet A+ Grade Solution Guaranteed

expert
joyComputer science
(4/5)
12 Answers Hire Me
expert
Robert DLaw
(4.8/5)
626 Answers Hire Me
expert
Dr Samuel BarberaStatistics
(5/5)
543 Answers Hire Me
expert
Tutor For YouEconomics
(5/5)
528 Answers Hire Me