INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS
Learning outcomes of this assessment
The learning outcomes covered by this assignment are:
- Provide a broad overview of the general field of ‘big data systems’.
- Develop specialised knowledge in areas that demonstrate the interaction and synergy between ongoing research and practical deployment in this field of study.
Key skills to be assessed
This assignment aims to assess your skills in:
- The usage of common big data tools and techniques
- Your ability to implement a standard data analysis process:
  - Loading the data
  - Cleansing the data
  - Analysis
  - Visualisation / Reporting
- Use of Python, SQL and Linux terminal commands
The module notes, complemented by tools and techniques covered in other modules, are sufficient literature for completing this assignment successfully.
For reference documentation:
- Spark documentation (https://spark.apache.org/documentation.html)
- Hive documentation (https://cwiki.apache.org/confluence/display/Hive/Home)
- Impala documentation (https://www.cloudera.com/documentation/enterprise/latest/topics/impala.html)
- Sqoop documentation (https://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html)
- MySQL documentation (https://dev.mysql.com/doc/refman/5.5/en/)
- Python documentation (https://developers.google.com/edu/python/introduction and https://matplotlib.org/users/intro.html)
Equipment and Facilities to be Used
For this assignment the Cloudera Virtual Machine provided for this module must be used. All processing must be done via scripts and code, and these must be stored and included in the submission. Terminal commands must be stored in shell scripts, and language-specific code must be stored in separate files (for example, SQL code must be stored in SQL scripts and Python code must be stored in Python scripts). The solution has to be implemented using both SQL and Python.
For the successful completion of this assignment, a total of 80 hours should be budgeted.
You will be given a dataset and a set of problem statements. Where possible, you are required to implement the solution in both SQL (using either Hive or Impala) and Spark (using pyspark or spark-shell). If you do not supply both solutions, you will need to explain the reasons carefully.
You will follow a typical data analysis process:
1. Load / ingest the data to be analysed
2. Prepare / clean the data
3. Analyse the data
4. Visualise results / generate report
For steps 1, 2 and 3 you will use the virtual machine (and the software installed on it) that has been provided as part of this module. The data necessary for this assignment will be provided in a MySQL dump format which you will need to copy onto the virtual machine and start working with it from there.
The virtual machine has a MySQL server running and you will need to load the data into the MySQL server. From there you will be required to use Sqoop to get the data into Hadoop.
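As an illustration of the Sqoop step, the following is a minimal sketch (written for Python 3) that only builds and prints a Sqoop import command, so that it can be reviewed before it is placed in a shell script. The database name, table name, user name and HDFS paths are hypothetical placeholders and are not taken from the supplied dataset; the Sqoop options used are standard import arguments.

```python
import shlex


def build_sqoop_import(database, table, target_dir, username, password_file):
    """Return the argument list for a Sqoop import of one MySQL table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", "jdbc:mysql://localhost/{}".format(database),
        "--username", username,
        "--password-file", password_file,  # keeps the password off the command line
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", "1",  # a single mapper needs no split column
    ]


# Hypothetical names: replace them with the ones used in your own MySQL database.
command = build_sqoop_import(
    database="football",
    table="tweets",
    target_dir="/user/cloudera/football/tweets",
    username="cloudera",
    password_file="/user/cloudera/.mysql-password",
)

# Print a shell-safe version of the command instead of running it.
print(" ".join(shlex.quote(part) for part in command))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; the printed line could then be copied into the shell script that is submitted with the work.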
For the cleansing, preparation and analysis you will implement the solution twice (where possible): first in SQL using either Hive or Impala, and then in Spark using either pyspark or spark-shell.
For the visualisation of the results you are free to use any tool that fulfils the requirements. This can be a tool you have learned about, such as Python’s matplotlib, SAS or Qlik, or any other free open source tool you find suitable.
Extra features to be implemented
To get more than a “Satisfactory” mark, a number of extra features should be implemented. Features include, but are not limited to:
- Creation of a single script that executes the entire process, from loading the supplied data to exporting the result data required for visualisation.
- The Spark implementation is done in Scala as opposed to Python.
- Usage of parametrised scripts which allow you to pass parameters to the queries to dynamically set data selection criteria. For instance, passing datetime parameters to select tweets in that time period.
- Plotting of extra graphs visualising the discovery of useful information based on your own exploration which is not covered by the other problem statements.
- Extraction of statistical information from the data.
- The usage of file formats other than plain text.
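The parametrised-script feature could be approached along the following lines. This is a minimal sketch, assuming Python 3 and a timestamp format of `YYYY-MM-DD HH:MM:SS`; the option names `--start` and `--end`, the format and the sample values are illustrative assumptions rather than requirements of the assignment.

```python
import argparse
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format; match it to the dataset


def parse_timestamp(text):
    """Convert a command-line string into a datetime, rejecting bad input."""
    try:
        return datetime.strptime(text, TIMESTAMP_FORMAT)
    except ValueError:
        raise argparse.ArgumentTypeError(
            "expected a timestamp like 2016-01-31 15:00:00, got %r" % text
        )


def parse_args(argv=None):
    """Read the selection window from argv (or sys.argv when argv is None)."""
    parser = argparse.ArgumentParser(description="Select tweets in a time window.")
    parser.add_argument("--start", type=parse_timestamp, required=True)
    parser.add_argument("--end", type=parse_timestamp, required=True)
    args = parser.parse_args(argv)
    if args.end <= args.start:
        parser.error("--end must be later than --start")
    return args


def in_window(created_at, args):
    """True when a tweet timestamp falls inside the half-open window [start, end)."""
    return args.start <= created_at < args.end


# Hypothetical values; a real run would read them from the command line.
args = parse_args(["--start", "2016-01-31 15:00:00", "--end", "2016-01-31 17:00:00"])
print(in_window(datetime(2016, 1, 31, 15, 30), args))
```

The same two values could equally be substituted into a SQL script, so that the SQL and Spark implementations select the same time period.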
You will be given a dataset containing simplified Twitter data pertaining to a number of football games. The dataset will be supplied in compressed format and will be made available online for download or can be supplied by USB memory stick. Further information regarding each game, including the teams playing and their official hashtags, start and end times, as well as the times of any goals, will also be provided.
You are a data analyst / data scientist working for an event security company that monitors real-time events to analyse the level of potential disturbance. In order to assess commotion at an event, the company monitors the Twitter feeds pertaining to the event. They would like answers to the following questions (in all the following, you should consider the half time and overtime as ‘during-game’).
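One possible reading of ‘during-game’ is sketched below in Python 3: a tweet counts when its timestamp lies between the start and end times supplied for the game, which automatically includes half time and overtime. Treating the window as half-open (start included, end excluded) is an assumption, and the timestamps are made-up sample values.

```python
from datetime import datetime


def is_during_game(created_at, kickoff, final_whistle):
    """True when the tweet falls in the half-open window [kickoff, final_whistle).

    Half time and overtime lie inside this window, so they count as during-game.
    """
    return kickoff <= created_at < final_whistle


def average_tweets_per_minute(timestamps, kickoff, final_whistle):
    """Average number of during-game tweets per minute of the game window."""
    minutes = (final_whistle - kickoff).total_seconds() / 60.0
    if minutes <= 0:
        raise ValueError("final_whistle must be later than kickoff")
    during = sum(1 for t in timestamps if is_during_game(t, kickoff, final_whistle))
    return during / minutes


# Made-up sample data for illustration only.
kickoff = datetime(2016, 1, 31, 15, 0)
final_whistle = datetime(2016, 1, 31, 16, 50)  # a 110-minute window
tweets = [
    datetime(2016, 1, 31, 14, 59),  # before kickoff: excluded
    datetime(2016, 1, 31, 15, 0),   # at kickoff: included
    datetime(2016, 1, 31, 15, 47),  # half time: included
    datetime(2016, 1, 31, 16, 49),  # last minute: included
    datetime(2016, 1, 31, 17, 5),   # after the game: excluded
]
average = average_tweets_per_minute(tweets, kickoff, final_whistle)
print(average)
```

Whichever boundary convention is chosen, it should be stated in the report and applied identically in the SQL and Spark implementations.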
Questions / problem statements:
- Extract and present the average number of tweets per ‘during-game’ minute for the top 10 (i.e. most tweeted about during the event) games.
- Rank the games according to number of distinct users tweeting ‘during-game’ and present the information for the top 10 games, including the number of distinct users for each.
- Find the top 3 teams that played in the most games. Rank their games in order of highest number of ‘during-game’ tweets (include the frequency in your output).
- Find the top 10 (ordered by number of tweets) games which have the highest ‘during-game’ tweeting spike in the last 10 minutes of the game.
- As well as the official hashtags, each tweet may be labelled with other hashtags. Restricting the data to ‘during-game’ tweets, list the top 10 most common non-official hashtags over the whole dataset with their frequencies.
- Draw the graph of the progress of one of the games (the game you choose should have a complete set of tweets for the entire duration of the game). It may be useful to summarize the tweet frequencies in 1 minute intervals.
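The 1 minute summary suggested in the last problem statement can be sketched as follows in Python 3. This is an illustration on made-up timestamps, not a prescribed method; the function name and the half-open game window are assumptions.

```python
import math
from collections import Counter
from datetime import datetime


def tweets_per_minute(timestamps, kickoff, final_whistle):
    """Return (minute_offset, count) pairs covering the whole game window.

    Minutes without tweets are kept with a count of 0 so that a plotted
    line has no gaps.
    """
    counts = Counter(
        int((t - kickoff).total_seconds() // 60)  # whole minutes since kickoff
        for t in timestamps
        if kickoff <= t < final_whistle
    )
    total_minutes = int(math.ceil((final_whistle - kickoff).total_seconds() / 60.0))
    return [(minute, counts.get(minute, 0)) for minute in range(total_minutes)]


# Made-up sample data for illustration only.
kickoff = datetime(2016, 1, 31, 15, 0, 0)
final_whistle = datetime(2016, 1, 31, 15, 5, 0)
tweets = [
    datetime(2016, 1, 31, 15, 0, 10),
    datetime(2016, 1, 31, 15, 0, 50),
    datetime(2016, 1, 31, 15, 2, 30),
    datetime(2016, 1, 31, 15, 4, 59),
    datetime(2016, 1, 31, 15, 5, 0),  # at the end time: outside the window
]
series = tweets_per_minute(tweets, kickoff, final_whistle)
print(series)  # [(0, 2), (1, 0), (2, 1), (3, 0), (4, 1)]
```

The resulting pairs can be written to a CSV file and plotted with any suitable tool, with the minute offset on the horizontal axis and the tweet count on the vertical axis.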
A 4000-5000 word report that documents your solution.
Additional advice to the client will earn marks above the “Satisfactory” grade. This could include, but is not limited to:
- Other findings based on your analysis of the data
- Outline of algorithms which would extract further information from the data
- Discussion of alternative visualizations that could prove useful
Along with the report, you are also expected to fill in a self-assessment form.
Requirements / Marking Scheme
Requirement                           Assessment Method         Weight (%)
Data load and preparation             Report & Demonstration    20
Data analysis                         Report & Demonstration    30
Report                                Report                    30
Demonstration of the work             Demonstration             10
Satisfactory response to questions    Demonstration             10
Notes
- The assignment must be completed on your own.
- The assignment must be completed on time. If you submit work late, it will be marked according to the University’s late submission policy.
The University has strict policies on unfair means. It is your responsibility to ensure that you both understand these and adhere to them in the production of your assignment. Any submitted work in which such content is identified will be penalised in accordance with the University regulations.
Your submission should be a single ZIP file upload. The file should be named as:
<<Your surname>> <<Your name>>.zip – for example: Smith John.zip
All items in the zip file should also be prefixed with your surname. (Ensure you replace “Smith” with your surname in the names below.)
The following items must be included in your submission:
- The file(s) in CSV format containing the data that resulted from your analysis and is being visualised in the report.
- A folder named scripts containing the script files for Spark, Python, SQL (for either Hive or Impala), etc., as well as terminal commands. All scripts must contain comments where appropriate.
- A PDF document named Smith report.pdf containing your report.
It is assumed that you will also address any social / legal and ethical issues surrounding the implementation of the project such as copyright, references, licenses, and web law.
You will need to demonstrate your working scripts and be prepared to discuss functionality and implementation. Demonstrations will be held in a room and at a time to be arranged after the submission deadline, most likely in week 12.
The following assessment criteria are provided as a guide to the criteria that you need to satisfy in order to get a grade within each of the following ranges.
Extremely poor (0-9)
- Totally inadequate demonstration of required knowledge.
- Not able to apply the practical and analytical skills from their programmes.
- No appropriate design methodology.
- No demonstration of analysis, evaluation or synthesis.
- No evidence of the ability to self-manage a significant piece of work and critical self-evaluation of the process.
- Little academic value; presentation is extremely poor; work has no structure or clarity; extremely poor use of language; no references; no attempt to provide evidence of sources used.
Very Poor (10-19)
- Virtually no relevant knowledge demonstrated.
- Fails to adequately apply the practical and analytical skills from their programme.
- Very poor use of design methodology.
- No meaningful analysis or evaluation or synthesis.
- Unable to self-manage a significant piece of work and to identify appropriate issues for critical self-evaluation of the process for reflection.
- Academic arguments presented are inappropriate or very poorly linked; presentation is very poor; work has little discernible structure or clarity; very poor use of language; lack of ability to source adequate material; very poor referencing.
Poor (20-29)
- Inconsistent or inaccurate knowledge.
- Limited, inappropriate and inaccurate application of the practical and analytical skills from their programme.
- Poor use of methodology.
- Descriptive, with occasional attempts to analyse or evaluate material, but lacks a critical approach to evaluation or synthesis.
- Identifies issues for reflection but lacks evidence of reflective processes.
- Some but inconsistent ability to self-manage a significant piece of work or critical self-evaluation of the process.
- Confusion or weakness in academic argument; presentation is poor; work is disorganised and lacks clarity; poor use of language; poor use of reference material; inappropriate or outdated sources with numerous referencing errors.
Inadequate (30-39)
- Limited evidence of knowledge.
- Inappropriate application of the practical and analytical skills from their programme.
- Unsatisfactory design methodology.
- Mainly descriptive evidence of analysis, inconsistent critical approach, little evaluation or synthesis.
- Follows processes of reflection but fails to demonstrate insight; lacks coherence in the self-management of a significant piece of work.
- Presentation is unsatisfactory; work is limited in terms of structure, coherence or clarity; limitations in academic style; unsatisfactory referencing with errors; limited ability to support content with relevant sources.
Unsatisfactory (40-49)
- Basic knowledge with occasional inaccuracies.
- Appropriate yet basic application of the practical and analytical skills from their programme.
- Superficial depth or limited breadth, but an overall adequate identification of design methodology.
- Critical analysis evident, with some evaluation and synthesis, although limited evidence of reflection.
- Some evidence of an ability to self-manage a significant piece of work and critical self-evaluation of the process.
- Some appropriate academic argument although not well applied and lacking in clarity; presentation of work is adequate in terms of structure, coherence, clarity and academic style; some inconsistencies; some grammar and syntax errors which detract from the content; narrow range of sources; referencing in presented work is adequate with some inconsistencies or inaccuracies; over-utilises secondary sources; references used are inappropriate in terms of currency.
Satisfactory (50-59)
- Mostly accurate knowledge with satisfactory depth and breadth of knowledge.
- Solid application of the practical and analytical skills from their programme.
- Fair use of design methodology.
- Sound critical analysis and evaluation or synthesis.
- Demonstrates basic ability to synthesise information in order to formulate appropriate questions and conclusions; reflective process is utilised, with insight demonstrating planning for future practice; shows the ability to self-manage a significant piece of work and critical self-evaluation of the process.
- Relevant academic argument; presentation of work is fair in terms of structure, coherence, clarity and academic style; some inconsistencies in grammar and syntax; fair range of sources identified with appropriate referencing and few inaccuracies; appropriate use of primary and secondary sources.
Good (60-69)
- Consistently relevant accurate knowledge with good depth and breadth.
- Clear and relevant application of the practical and analytical skills from their programme.
- Good use of design methodology.
- Clear, in-depth critical analysis, evaluation and academic argument with synthesis of different ideas and perspectives.
- Utilises reflection to develop self and practice; aware of the influence of varied perspectives and time frames; demonstrates an ability to self-manage a significant piece of work and critical self-evaluation of the process.
- Presentation of work is well organised with good use of language to express ideas or argument; very few inconsistencies in grammar and syntax; good range of sources; well referenced with very few inaccuracies; good use of primary and secondary sources.
Very Good (70-79)
- Comprehensive knowledge demonstrating very good depth and breadth.
- Clear insight into links between the practical and analytical skills from their programme.
- Strong use of design methodology.
- Very good analysis and synthesis of material with evidence of critical and independent thought.
- Demonstrates ability to transfer knowledge between different contexts appropriately; balanced and mature approach to reflection used to enhance practice and performance; clear ability to self-manage a significant piece of work and critical self-evaluation of the process.
- Presentation is of a very good standard, demonstrating a scholarly style. Very good grammar and syntax. Clear evidence of referencing to a wide range of primary and secondary sources which are used effectively in supporting the work.
Excellent (80-89)
- Excellent depth of knowledge in a variety of contexts.
- Coherent and systematic application of the practical and analytical skills from their programme.
- Excellent use of design methodology.
- Excellent critical analysis and synthesis.
- Integrates the complexity of a range of knowledge and excellent understanding of its relevance; confident in their ability to self-manage a significant piece of work and critical self-evaluation of the process.
- Arguments handled skilfully with imaginative interpretation of material; presentation is excellent, well-structured and logical; demonstrates a scholarly style; excellent grammar and syntax.
Outstanding (90-100)
- Outstanding knowledge.
- Exceptional application of the practical and analytical skills from their programme.
- Excellent professional execution of design methodology.
- Outstanding critical analysis and synthesis.
- Excels in self-managing a significant piece of work and critical self-evaluation of the process; shows an aptitude to formulate new questions, ideas or challenges.
- Incorporates evidence of original thinking; presentation is outstanding, demonstrating a fluent academic style.