This coursework assignment requires you to design, analyse, implement and test algorithms for the task of comparing text documents

INSTRUCTIONS TO CANDIDATES

ANSWER ALL QUESTIONS

Overview

This coursework assignment requires you to design, analyse, implement and test algorithms for the task of comparing text documents. This will be done by comparing all words in a document to a given dictionary of words to create a feature vector, and a distance measure will be implemented to compare documents by their resulting feature vectors.

The task is described in detail with informal descriptions of the functionality that is required; it is your task to first design and formally describe algorithms to solve this problem. You will then analyse your algorithms and implement them in Java. Following this, you will run a series of timing experiments to visualise the run-time of your algorithms, presenting and briefly discussing your results.

Please note: do not use advanced data structures in this assignment (such as hash maps, trees, tries, etc.). You will cover these later in the module. The purpose of this assignment is to reinforce your knowledge of describing, analysing, implementing and timing algorithms. You may use arrays when you implement your algorithms to represent lists but please do not use any Java data structure classes, such as those that implement the Collection interface (for example, do not use ArrayList).

Description

Documents

A plain text document A containing w words can be thought of as a list data structure, where:

A =< a1, a2, ..., aw >,

and ai is the ith word in the document. Note that A may include duplicate words. For simplicity, all punctuation has been removed, and you may assume that words are made up only of lower-case letters. An example would be:

A =,

where in this example w = 20 and a4 = example.

Dictionaries

A dictionary Q is also a list of s words:

Q =< q1, q2, ..., qs >,

However, unlike documents, a dictionary may not include duplicate words.

Feature Vectors

A feature vector F can be computed for a given document A and dictionary Q by counting the number of occurrences of each dictionary word within the document. For example, given a document:

A =< hello to you and hello to the world >

and a dictionary:

Q =< cat hello world the apple >,

the resulting feature vector would be:

F =< 0, 2, 1, 1, 0 >,

where the length of F is the same as the length of the dictionary Q. For example, q2 = hello appears twice within A, hence f2 = 2, whereas f5 = 0 as the word apple does not appear at all in A.
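The counting rule above can be checked against the worked example with a short Java sketch. This is one straightforward way to do it using only arrays (a nested loop over dictionary and document), not a prescribed solution. The class name FeatureVectorSketch is a hypothetical name chosen for illustration.

```java
import java.util.Arrays;

// Hypothetical class name, used only for this illustration.
class FeatureVectorSketch {

    // Counts, for each dictionary word, how many times it occurs in the document.
    // The result has the same length as the dictionary.
    static int[] calculateFeatureVector(String[] document, String[] dictionary) {
        int[] featureVector = new int[dictionary.length];
        for (int i = 0; i < dictionary.length; i++) {
            for (int j = 0; j < document.length; j++) {
                if (dictionary[i].equals(document[j])) {
                    featureVector[i]++;
                }
            }
        }
        return featureVector;
    }

    public static void main(String[] args) {
        // The document A and dictionary Q from the example above.
        String[] a = {"hello", "to", "you", "and", "hello", "to", "the", "world"};
        String[] q = {"cat", "hello", "world", "the", "apple"};
        System.out.println(Arrays.toString(calculateFeatureVector(a, q)));
        // prints [0, 2, 1, 1, 0]
    }
}
```

Arrays.toString is used here only to print the result; the counting itself relies on plain arrays.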

Document Similarity Distance (DSD)

The Document Similarity Distance (DSD) can be computed to compare two documents using feature vectors derived from a common dictionary. More formally, given two documents, A and B, and a dictionary Q, feature vectors FA and FB can be calculated for documents A and B using the steps discussed above. These feature vectors can then be used to calculate the Document Similarity Distance between A and B as:

Exercises

Your task is to complete the following exercises to formally describe, analyse, implement (and test) the functionality that has been discussed in the problem definition and then carry out timing experiments.

Reminder: all implementation must be done in Java. Please do not use built-in Java data structure classes such as those that implement the Collection interface (for example, ArrayList), or more advanced data structures such as hash tables. You may use arrays wherever necessary, however.

Part A: Calculating Feature Vectors

1. Design and write a formal description of an algorithm to calculate a feature vector when given a document and a dictionary as inputs. (10 marks)
2. Analyse the run-time complexity of your algorithm from question 1 to calculate document feature vectors. (10 marks)
3. Implement your algorithm from question 1 in Java and call it calculateFeatureVector. The input to your algorithm should be two String arrays (one for a document and one for a dictionary) and the output should be a single array of integers. (10 marks)
4. Design and conduct timing experiments using your implementation from question 3. How you design your experiments is your decision but you should perform multiple resamples using various document and dictionary lengths. Present your results appropriately, and briefly comment on your results and how they compare to your expectations following your run-time complexity analysis. (10 marks)

For the purpose of testing your algorithm some example code has been provided in CourseworkUtilities.java on Blackboard. The generateDictionary method will randomly generate a dictionary and generateDocument will generate a document when passed a dictionary. Please note that, for simplicity, the words that are generated will be random characters, but this is sufficient for testing and timing experiments.

Note: all code to generate your experimental results should be included in your submission (with informative comments) to demonstrate how you designed and performed your experiments. You do not need to submit CourseworkUtilities.java as this is given to you.
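As an illustration of what a resampled timing experiment for Part A might look like, here is a minimal Java sketch. It is only one possible design, under stated assumptions: the helpers randomWords and sampleDocument are hypothetical stand-ins written for this example (the signatures of the methods in CourseworkUtilities.java are not given here, so they are not called), and the sizes and number of resamples are arbitrary choices.

```java
import java.util.Random;

// Hypothetical class name; every helper below is an assumption made for this sketch.
class TimingSketch {

    static long sink = 0; // accumulates results so the timed call is not optimised away

    // Same nested-loop counting as described in the Feature Vectors section.
    static int[] calculateFeatureVector(String[] document, String[] dictionary) {
        int[] featureVector = new int[dictionary.length];
        for (int i = 0; i < dictionary.length; i++) {
            for (int j = 0; j < document.length; j++) {
                if (dictionary[i].equals(document[j])) {
                    featureVector[i]++;
                }
            }
        }
        return featureVector;
    }

    // Hypothetical stand-in generator: an array of random lower-case words.
    // Duplicates are possible but very unlikely, which is acceptable for timing purposes.
    static String[] randomWords(int count, int wordLength, Random rng) {
        String[] words = new String[count];
        for (int i = 0; i < count; i++) {
            char[] chars = new char[wordLength];
            for (int j = 0; j < wordLength; j++) {
                chars[j] = (char) ('a' + rng.nextInt(26));
            }
            words[i] = new String(chars);
        }
        return words;
    }

    // Hypothetical stand-in generator: a document drawn at random from the dictionary.
    static String[] sampleDocument(String[] dictionary, int length, Random rng) {
        String[] document = new String[length];
        for (int i = 0; i < length; i++) {
            document[i] = dictionary[rng.nextInt(dictionary.length)];
        }
        return document;
    }

    // Mean run time in nanoseconds over several freshly generated inputs.
    static double meanRunTimeNanos(int documentLength, int dictionaryLength,
                                   int resamples, Random rng) {
        long total = 0;
        for (int r = 0; r < resamples; r++) {
            String[] dictionary = randomWords(dictionaryLength, 8, rng);
            String[] document = sampleDocument(dictionary, documentLength, rng);
            long start = System.nanoTime();
            int[] f = calculateFeatureVector(document, dictionary);
            total += System.nanoTime() - start;
            sink += f[0];
        }
        return (double) total / resamples;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        System.out.println("documentLength,dictionaryLength,meanNanos");
        for (int w = 500; w <= 2000; w *= 2) {
            System.out.println(w + ",200," + meanRunTimeNanos(w, 200, 10, rng));
        }
    }
}
```

The comma-separated output can be pasted into a plotting tool. Early measurements on the JVM are often inflated by just-in-time compilation, so a few untimed warm-up runs before recording results may be worth considering.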

Part B: Comparing Documents

5. Design and write a formal description of an algorithm to compute the Document Similarity Distance (DSD) of two feature vectors (you can assume the input feature vectors were generated using the same dictionary and do not need to check for this). Specifically, the input of this algorithm should be two integer arrays (for the two feature vectors) and the output should be the DSD between them. (12 marks)
6. Design and write a formal description of an algorithm that finds the closest match for all documents within an input list.

Specifically, the inputs of your algorithm should be:

- a list of documents D of length n, and
- a dictionary, Q.

Your algorithm should go through each document in D and compute the DSD to each other document (you may reference your algorithm from question 5 here; there is no need to describe this logic again). The algorithm should record the index of the document that was the closest match to each document (excluding itself) within D and return a list of these indices.

For example, given D =< d1, d2, d3 >, suppose the closest match to d1 was d2, the closest match to d2 was d3 and the closest match to d3 was d2. The output would therefore be < 2, 3, 2 > (hint: remember that we index from 1 in formal descriptions but index from 0 in code). (12 marks)

7. Analyse the run-time complexity of your algorithm in question 6 to find the closest documents in a list of documents. (12 marks)
8. Implement your algorithm from question 6 in Java to find the indices of the best matches to each document in an input list of documents. Your method should be called findNearestDocuments and the inputs to your method should be:
- a String[][] to store a list of documents, and
- a String[] to store a dictionary.

Your method should return a single integer array. (12 marks)

9. Design and conduct timing experiments using your implementation from question 8. Present your results on a graph and briefly comment on your results and how they compare to your expectations following your run-time complexity analysis. (12 marks)

Note: again, all code to generate your experimental results should be included in your submission (with informative comments) to demonstrate how you designed and performed your experiments. You are welcome to reuse the utility methods provided for question 4 to assist with your experiments.
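The closest-match search described in Part B can be sketched in Java as follows. This is a hedged illustration of the loop structure only. Important assumption: the DSD formula is not reproduced in this text, so the sketch uses a stand-in distance (the sum of absolute differences between feature vectors), named placeholderDistance to make clear that it is NOT the DSD. The class name and the tie-breaking rule (lowest index wins) are also assumptions.

```java
import java.util.Arrays;

// Hypothetical class name, used only for this illustration.
class NearestDocumentsSketch {

    // Counts how often each dictionary word occurs in the document.
    static int[] calculateFeatureVector(String[] document, String[] dictionary) {
        int[] featureVector = new int[dictionary.length];
        for (int i = 0; i < dictionary.length; i++) {
            for (int j = 0; j < document.length; j++) {
                if (dictionary[i].equals(document[j])) {
                    featureVector[i]++;
                }
            }
        }
        return featureVector;
    }

    // ASSUMPTION: stand-in distance (sum of absolute differences), not the DSD.
    // Swap this body for the DSD defined in the assignment.
    static double placeholderDistance(int[] fa, int[] fb) {
        double sum = 0;
        for (int i = 0; i < fa.length; i++) {
            sum += Math.abs(fa[i] - fb[i]);
        }
        return sum;
    }

    // Returns, for each document, the 0-based index of its closest other document.
    // An entry is -1 if there is no other document to compare against.
    static int[] findNearestDocuments(String[][] documents, String[] dictionary) {
        int n = documents.length;

        // Compute each feature vector once, rather than once per pair.
        int[][] features = new int[n][];
        for (int i = 0; i < n; i++) {
            features[i] = calculateFeatureVector(documents[i], dictionary);
        }

        int[] nearest = new int[n];
        for (int i = 0; i < n; i++) {
            int best = -1;
            double bestDistance = Double.POSITIVE_INFINITY;
            for (int j = 0; j < n; j++) {
                if (j == i) {
                    continue; // a document is never its own match
                }
                double distance = placeholderDistance(features[i], features[j]);
                if (distance < bestDistance) { // strict '<' keeps the lowest index on ties
                    bestDistance = distance;
                    best = j;
                }
            }
            nearest[i] = best;
        }
        return nearest;
    }

    public static void main(String[] args) {
        String[] dictionary = {"a", "b"};
        String[][] documents = {
            {"a", "a"},           // feature vector [2, 0]
            {"a", "a", "b"},      // feature vector [2, 1]
            {"b", "b", "b", "b"}  // feature vector [0, 4]
        };
        System.out.println(Arrays.toString(findNearestDocuments(documents, dictionary)));
        // prints [1, 0, 1] (0-based indices, under the stand-in distance)
    }
}
```

Note that the indices returned here are 0-based, whereas the formal description in the example above indexes from 1.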

Relationship to formative work

Please see the following lectures and lab exercises for background on each of the specified tasks:

Describing algorithms: the lectures in week 2 introduced informal and formal descriptions of algorithms. The lab in week 2 included a tutorial and exercises on writing formal descriptions of algorithms, and this has been reinforced with further examples throughout the lectures and labs.
Analysing algorithms: the lectures in weeks 3 and 4 introduced the concept and structure of analysing algorithms. Further examples were included in the lab in week 4, and an additional optional seminar sheet (with solutions) has been uploaded to Blackboard for further

Timing experiments: some informal timing experiments have been shown in the lectures, but see the lab sheets in weeks 3 and 4 for specific exercises to practice creating timing experiments for running code in
Algorithm design: this has been taught throughout the prerequisite modules for this module. It was further reinforced in the second part of the lecture in week 4 with a case study of designing an algorithm for a specific problem, starting with an inefficient solution and iterating through more efficient

Deliverables

You must submit a single .pdf electronically on e:vision that contains all of your work. This .pdf should be prepared using the PASS application that is installed on all UEA CMP lab machines. Please ensure that you include answers to all questions in the assignment that you have attempted and also include all source code used to implement and run your experiments. Your code should be clear and include comments to aid the marker where necessary (a full javadoc is not necessary, however).

It is acceptable to do your analyses on paper by hand if that is easier for you, but make sure to scan them, save as a .pdf, and include them in your final .pdf. Please do not use spaces in any file names that you include in your PASS submission. aFileName.pdf would be fine, but a File Name.pdf will not work correctly with PASS.

An example of using PASS to format this coursework with instructions will be provided on Blackboard nearer the submission date to help anyone who is unfamiliar with the system.

Finally, please make sure that everything in your submission is clear and legible. If anything is not clear and visible in your final .pdf then we cannot mark it. It is your responsibility to make sure that the .pdf that you submit includes all of your work and that it is a fair representation of the effort that you have put into this assignment.

Resources

Previous exercises: if you get stuck when completing the coursework please revisit the lab exercises that are listed in the Relationship to formative work section during your allocated weekly lab. The teaching assistants in the labs will not be able to help you with your coursework directly, but they will be more than happy to help you understand how to answer the (very) related exercises in the lab sheets. You will then be able to apply this knowledge and understanding to the new problems in this coursework assignment.
Discussion board: if you have clarification questions about what is required then please ask these questions on the Blackboard discussion board. This will enable other students to also benefit from the question/answer. Please check that your question has not been asked previously before starting a new thread.
Course text: Goodrich, M. T., Tamassia, R. (2005) Data Structures and Algorithms in Java, 4th edition. As mentioned in lectures, this text book is very useful and this older edition is freely available online if you do a quick search for it. Chapter 4 of this edition in particular will be helpful for analysis of algorithms, but any edition of the book will include helpful information.

Marking Scheme

Itemised marks are provided throughout the assignment description. To summarise:

1. Part 1: Calculating Feature Vectors (40 marks)

1.1. Design and description (10 marks)
1.2. Analysis (10 marks)
1.3. Java implementation and testing (10 marks)
1.4. Timing experiments and discussion (10 marks)

2. Part 2: Comparing Documents (60 marks)

2.1. Design and description of DSD (12 marks)
2.2. Design and description of finding closest documents (12 marks)
2.3. Analysis of finding closest documents (12 marks)
2.4. Java implementation and testing (12 marks)
2.5. Timing experiments and discussion (12 marks)

Total: 100 Marks

Further information:

Formal descriptions: marks will be awarded for the correctness of your algorithms and your ability to describe them accurately in pseudocode. Make sure that you clearly describe all elements of your algorithm and use comments where
Algorithm analyses: marks will be awarded for correctness of your evaluation and for following the correct procedures when analysing algorithms. Using LaTeX/Overleaf for writing up your analysis is encouraged, but it is acceptable to do them by hand or by other means instead if you prefer (such as scanning analyses done by hand and including them in your final submission). If you do this, however, then it is your responsibility to make sure your writing and presentation are clear and legible: we cannot give you marks if we cannot understand it!
Implementation: you should submit your Java code for the required implementation questions and marks will be awarded for correctness and comprehensibility of your code. This includes using standard programming conventions (such as correct indentation, sensible variable names, etc.) as well as informative and reasonable comments to explain complex code. There is no need to define a full javadoc but there should be helpful comments where necessary to make it easy for a reader to understand your code.
Timing experiments: marks will be awarded for experimental design, presentation of results, and discussion of results. Make sure that you perform many resamples using various input sizes when conducting your experiments, and make sure to present results using clear, well-formatted graphs. As mentioned in the question, you should also briefly discuss your results in terms of whether they agree with your run-time complexity analysis. This does not have to be a long discussion but it should demonstrate your understanding of how your practical results and algorithm analyses are
