# Your input will be strings, in fact, all the strings that are defined in the systemβs dictionary.

INSTRUCTIONS TO CANDIDATES
###### ANSWER ALL QUESTIONS

In this assignment you are going to test various hash functions to see how good they are, in terms of how many collisions they have.

Your input will be strings, in fact, all the strings that are defined in the system’s dictionary.

This is located in /usr/share/dict/words; you can download a copy to your local computer for testing here. (If you try to download it your browser may say that its a binary file, but it is in fact a text file, with one word per line.) Note that if you try to open the system copy of the dictionary in your code on the server, you will have to open it for reading only, otherwise opening the file will fail.

This file contains just under 100,000 English words, which we’re going to use to test the uniformity of various hash functions. Your hash functions will hash strings into 16-bit (not 32-bit) ints. This is important, because we’re going to keep a table of the number of collisions for each hash value.

With 16-bit ints there are only 65,536 possible hash values, so this table will easily fit in memory. If we used 32-bit ints then there would be 4,294,967,296 possble hashes, a more troublesome amount. For C++, you can get a 16-bit unsigned int type by doing #include uint16_t x; // x has exactly 16 bits, and is unsigned For each of the possible hash functions your program should:

Create a vector hashes of size 65,536 Process the list of words, and for each word, compute its hash h Increment the entry in the table for that hash: hashes.at(h)++ When finished, use Pearson’s χ² test to determine the probability that the resulting distribution is uniformly distributed. (See below for the deets.) Hash functions to test The hash (non)functions you should test are: String length (modulo 216 2 16 ) First character Additive checksum (add all characters together), modulo 216 2 16 Remainder (use a modulo of 65413, this is the first prime that is smaller than the table size).

Remember that you cannot just add up all the characters and then take the mod of the result; you have to thread the modulo through the loop that computes the sum. Multiplicative (using the scheme described in class/in the lecture notes). Again, remember that you can’t just use the final sum; you have to incorporate the multiplicative calculation into hashing loop. Also remember that in C++ char may be signed or unsigned; on your computers and on the server it is signed, so character values actually range from -128 to +127.

Normally this doesn’t matter, as all of the standard characters are in the range 0-127. However, a few of the words in the dictionary use extended foreign characters in the range 128-255, which will map to negative values if char is signed, and negative values will throw off the hash functions. A hacky way to compensate for this is to just assume that char is signed and pass the character value c through the transformation c < 0 ? int(255 + c) : int(c) before using it. (This will leave standard character codes alone, but will map extended characters to positive values above 127.)

The correct (cross-platform compatible) way is to get the unsigned version of char (which will be either unsigned char or just char itself) and then convert the value to that, using something like this: #include unsigned char char_to_unsigned(char c) { return static_cast<std::make_unsigned::type>(c); } (I’d also suggest making your “hash testing” function take a function pointer, so that you can just write all the above hash functions and then test them one by one.) Pearson’s χ² test The χ² test is a statistical method for determining the probability that a given set of observed results (in this case, the number of hashes per “bin”) could have been generated from a given distribution. In our case, we want our distribution to be uniform, meaning that every bin should have expected=Number of words65,536 expected = Number of words 65 , 536 items in it (this will be a fractional value).

To perform the test, you will take the hashes vector and from it compute the value X2 X 2 : ππΈ=∑i(expected−ππππππ[i])2expected c 2 = ∑ i ( expected − h a s h e s [ i ] ) 2 expected We also need to know the “degrees of freedom”, which is just the number of bins, minus 1: 65,535. We’ll use the Boost.Math library for the statistical functions: #include <boost/math/distributions/chi_squared.hpp> (If you’re working on your own computer,

Boost.Math is a header-only library, so you can just download the full Boost distribution and then extract boost_1_63_0/boost/math directory somewhere in the include path for your compiler or project.) Create a χ² distribution, with the desired degrees of freedom: boost::math::chi_squared c2d(65535.0); and then use the cumulative distribution function to get the probability: float p = boost::math::cdf(c2d, c2); p tells us the probability that c2 is not derived from a uniform distribution, so the smaller this is, the better! (You can also negate it and convert to a percentage, for a kind of “good distribution” measure: (1 - p) * 100.0.)

Implementation This assignment doesn’t have a test runner. Just run each of the above hash functions through the full dictionary and then report on the p p value you get for each. You should get p=1 p = 1 for the really bad hash functions (indicating that they are very far from uniform) but p<1 p < 1 for at least the remainder method. (I got p = 0.251 for remainder hashing; multiplicative should give you a p<1 p < 1 but it’s tricky to get right, so don’t worry about it too much if you get p=1 p = 1 .)

Submission

Save your work in a directory on the server named cs133/assign5/. Extra Credit If you make your code print out nice-looking histograms of each hash function’s distribution, then I’ll count this assignment as two assignments.

## Expert's Answer

804 Times Downloaded

## Related Questions

##### . Introgramming & Unix Fall 2018, CRN 44882, Oakland University Homework Assignment 6 - Using Arrays and Functions in C

DescriptionIn this final assignment, the students will demonstrate their ability to apply two ma

##### . The standard path finding involves finding the (shortest) path from an origin to a destination, typically on a map. This is an

Path finding involves finding a path from A to B. Typically we want the path to have certain properties,such as being the shortest or to avoid going t

##### . Develop a program to emulate a purchase transaction at a retail store. This program will have two classes, a LineItem class and a Transaction class. The LineItem class will represent an individual

Develop a program to emulate a purchase transaction at a retail store. Thisprogram will have two classes, a LineItem class and a Transaction class. Th

##### . SeaPort Project series For this set of projects for the course, we wish to simulate some of the aspects of a number of Sea Ports. Here are the classes and their instance variables we wish to define:

1 Project 1 Introduction - the SeaPort Project series For this set of projects for the course, we wish to simulate some of the aspects of a number of

##### . Project 2 Introduction - the SeaPort Project series For this set of projects for the course, we wish to simulate some of the aspects of a number of Sea Ports. Here are the classes and their instance variables we wish to define:

1 Project 2 Introduction - the SeaPort Project series For this set of projects for the course, we wish to simulate some of the aspects of a number of

804 Times Downloaded

### Ask This Question To Be Solved By Our ExpertsGet A+ Grade Solution Guaranteed

807 Answers

Hire Me

976 Answers

Hire Me

924 Answers

Hire Me
##### Atharva PatilComputer science

933 Answers

Hire Me

Get Free Quote!

337 Experts Online