Part 1/3: Reading Invoice Data in Spark
1) To get started, create a spark cluster in the Databricks console. Once your cluster is up and running, take a screenshot and post it below.
Step1: Creating a new spark cluster using Scala 2.11, Spark 2.4.3 and Python 3.0
Step2: Take a Screenshot of the running cluster
2) Next, create a new notebook and execute the following code to print a sample of the
invoice dataset (provided by Databricks). Again, provide a screenshot.
/dbfs/databricks-datasets/online_retail/data-001/data.csv
with open("/dbfs/databricks-datasets/online_retail/data-001/data.csv") as f:
x = ''.join(f.readlines())
print(x)
Step3: Creating a new notebook
Step4: Displaying the built-in dataset inside of the python notebook
3) Read the invoice CSV into a resilient distributed dataset (RDD). Collect the first five rows and print them. Take a screenshot of both the code and printed output and include it here.
Step5: Creating an RDD and displaying the first five rows of the dataset
Part 2/3: Answer the following questions regarding invoice data
For each question below, please:
Use map and reduce functions to answer the question.
Provide the snippet of Spark code that you used to answer the question.
Include a screenshot of your notebook that includes both the code and the printed
answer.
1) Which customer in the dataset has spent the most on products? The quantity
multiplied by the unit price will give you the total dollar amount spent per invoice line.
2) What is the product description for the best-selling product in the dataset? We will
define "Best Selling" as the product with the highest quantity sold.
3) How much has each country spent on products? The output should have two columns,
one being the country and the other being the gross dollar amount spent across all
products. Sort the output by the dollar amount, descending. Print the entire output,
showing a gross dollar amount for each country.
4) What is the highest-grossing day in the dataset? Again, use quantity multiplied by unit
price to get the revenue per line.
5) Finally, try out one of Databrick's visualizations. Note that you will need to convert
back to a DataFrame in order to visualize the data (hint: look at rdd.toDF()). Create an
appropriate DataFrame for visualization and call display on it.
Take a screenshot of your code and the resulting visualization. You can find available
visualizations by expanding this icon at the bottom of a cell:
Part 3/3: Kafka Questions
1) In one sentence or less, what is the purpose of Kafka.
Kafka is an open source software which provides a framework for storing, reading and analyzing streaming data
2) Describe two ways in which Kafka differentiates itself from other messaging
systems.
Kafka is an active-active system. With replication configured it supports active-passive pattern for high availability. Most of the traditonal message brokers do not support active-active.
Kafka scales well and scales horizontally, you can add more nodes to handle increasing load. I have seen systems scale upto billion msgs and 10s of TBs of data/day on few nodes of commodity hardware. JMS brokers in question can only scale vertically and will quickly hit gc limits
3) Describe one architectural decision that has contributed to Kafka’s scalability and
performance at scale. Think specifically about how it interacts with the underlying
Operating System and/or Java Virtual Machine (JVM).
Kafkas Performance
Kafka relies heavily on the OS kernel to move data around quickly. It relies on the principals of Zero Copy. Kafka enables you to batch data records into chunks. These batches of data can be seen end to end from Producer to file system (Kafka Topic Log) to the Consumer. Batching allows for more efficient data compression and reduces I/O latency. Kafka writes to the immutable commit log to the disk sequential; thus, avoids random disk access, slow disk seeking. Kafka provides horizontal Scale through sharding. It shards a Topic Log into hundreds potentially thousands of partitions to thousands of servers. This sharding allows Kafka to handle massive load.
Kafkas Scalability
Kafka is a good storage system for records/messages. Kafka acts like high-speed file system for commit log storage and replication. These characteristics make Kafka useful for all manners of applications. Records written to Kafka topics are persisted to disk and replicated to other servers for fault-tolerance. Since modern drives are fast and quite large, this fits well and is very useful. Kafka Producers can wait on acknowledgment, so messages are durable as the producer write not complete until the message replicates. The Kafka disk structure scales well. Modern disk drives have very high throughput when writing in large streaming batches.
DescriptionIn this final assignment, the students will demonstrate their ability to apply two ma
Path finding involves finding a path from A to B. Typically we want the path to have certain properties,such as being the shortest or to avoid going t
Develop a program to emulate a purchase transaction at a retail store. Thisprogram will have two classes, a LineItem class and a Transaction class. Th
1 Project 1 Introduction - the SeaPort Project series For this set of projects for the course, we wish to simulate some of the aspects of a number of
1 Project 2 Introduction - the SeaPort Project series For this set of projects for the course, we wish to simulate some of the aspects of a number of