CMSC701 Homework 3: Evaluating two different AMQs

This homework is due by 11:59PM ET on Thurs. May 11. NOTE: This project is substantially less implementation-heavy than the prior two homeworks, and I expect it will take folks less time. However, this means you should focus even more on your writeup, as the implementation should be less burdensome. Also, unlike homeworks 1 and 2, you may implement this homework in Python if you choose.

This homework consists of 3 programming tasks with an associated writeup. Programming task (3) builds upon task (2), so they should be done in order. Each programming task consists of (1) an implementation and (2) a writeup. Each implementation should be accompanied by a README (or a section in a global README) containing instructions for building and using the implementation (e.g. a description of the command-line interface). In addition to the code you submit directly, you should also make a GitHub repository to which you post the code, and you should link to this in the README (it’s OK if you keep it private, but remember to invite Peyman and me).

The writeups for all three tasks should be put into a single document (either a PDF or a Word document). Each writeup should include the following 3 components:

  1. A short (no more than a paragraph) prose description of your implementation.

  2. A brief (again, no more than a paragraph) summary of what you found to be the most difficult part of implementing each task.

  3. The plots as described in each task.

Programming Tasks

Note: For the programming tasks of this assignment, there is no specific language requirement you have to follow. Moreover, unlike projects 1 and 2, you may choose to implement this in Python if you wish. If you do end up deriving some piece of your code from a source you encounter (either code or a paper), make sure to cite this source in both your code and the homework you turn in.

Task 1 — Empirical evaluation of the Bloom filter

In this task, you will perform an empirical evaluation of the Bloom filter. For this task you are encouraged to use an existing Bloom filter implementation rather than implement your own! I can provide recommendations for specific languages, but I encourage you to share implementations you find with each other (ideally on Piazza). Given the Bloom filter implementation, you should write a wrapper program that (1) builds the Bloom filter on a set of keys K, and (2) queries this Bloom filter on a set of keys K’ such that K’ is not equal to K. That is, K’ should contain a mix of keys from K and keys not present in the initial set K.
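To make the expected shape of this wrapper concrete, here is a minimal sketch in Python. It assumes a Bloom filter library with an interface like pybloom_live's BloomFilter(capacity, error_rate); the library choice is an assumption, not a requirement, so adapt the calls to whichever implementation you actually use.

```python
# Minimal sketch of the wrapper, assuming a library with an interface like
# pybloom_live's BloomFilter(capacity=..., error_rate=...). Swap in whatever
# Bloom filter implementation you choose.
from pybloom_live import BloomFilter  # assumed library; substitute your own


def build_filter(K, fp_rate):
    """Insert every key in K into a Bloom filter built for the target FP rate."""
    bf = BloomFilter(capacity=len(K), error_rate=fp_rate)
    for k in K:
        bf.add(k)
    return bf


def query_filter(bf, K_prime):
    """Return the set of queried keys that the filter claims are present."""
    return {k for k in K_prime if k in bf}
```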

Writeup: For this programming task, test your implementation by generating several (3 or 4) pairs of sets K and K’ of various sizes and with various mixtures of “positive” and “negative” keys. For each set of keys, consider building your Bloom filter at an expected false-positive rate of (1/(2^7)), (1/(2^8)), and (1/(2^10)) — the reason for these specific numbers will become clear in Task 3. For each input dataset, and for each target false-positive rate, measure (a) the observed false-positive rate (also, as a sanity check, ensure you have no false negatives), (b) the total time spent querying K’, and (c) the total size of the Bloom filter. Note that for the purposes of this project you don’t have to worry about serializing and deserializing your filter if you don’t want to; you can simply time, from within your program, just the queries. Comment on how your observations match the theory in terms of Bloom filter false-positive rate and size. Also comment on how the false-positive rate affects the lookup time.
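One possible way to generate the K / K’ pairs and collect measurements (a)–(c) is sketched below; the random string keys, the positive/negative mixture parameter, and the helper names are all illustrative choices, not part of the assignment.

```python
import random
import string
import time


def random_key(length=20):
    # hypothetical key generator; any fixed-length random strings will do
    return ''.join(random.choices(string.ascii_lowercase, k=length))


def make_key_sets(n_total, n_query, positive_fraction):
    """Build K and a query set K' containing the requested mix of keys from K
    and keys guaranteed not to be in K."""
    K = {random_key() for _ in range(n_total)}
    n_pos = int(positive_fraction * n_query)
    positives = random.sample(sorted(K), n_pos)
    negatives = []
    while len(negatives) < n_query - n_pos:
        k = random_key()
        if k not in K:
            negatives.append(k)
    return K, positives + negatives


def evaluate(bf, K, K_prime):
    """Measure the observed FP rate, check for false negatives, and time the queries."""
    start = time.perf_counter()
    answers = [k in bf for k in K_prime]
    elapsed = time.perf_counter() - start
    n_negatives = sum(1 for k in K_prime if k not in K)
    false_pos = sum(1 for k, a in zip(K_prime, answers) if a and k not in K)
    false_neg = sum(1 for k, a in zip(K_prime, answers) if (not a) and k in K)
    return {
        "observed_fp_rate": false_pos / max(1, n_negatives),
        "false_negatives": false_neg,  # sanity check: should always be 0
        "query_seconds": elapsed,
    }
```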

Task 2 — Empirical evaluation of a minimal perfect hash

In this task, you will perform the same test as you did above with the Bloom filter, except, rather than a Bloom filter, you will build a minimal perfect hash function over each test set K. The main purpose of this task is to compare the size, lookup speed, and “false positive” rate of the MPHF to those of the Bloom filter. Note that there is no way to tweak the false-positive rate of the MPHF, so you do not have to sweep over that parameter in testing. Just as with the Bloom filter above, you do not need to implement the MPHF construction and lookup yourself. I recommend using the BBHash MPHF we discussed in class, though you are free to use others if you want. The BBHash MPHF has implementations in C++, Rust, Go, Python and maybe other languages too.
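A sketch of what the MPHF side might look like, assuming the bbhash Python bindings. The PyMPHF constructor and lookup calls shown here are an assumption based on those bindings, so double-check the API of whichever implementation (and language) you actually use. BBHash operates on 64-bit integer keys, so string keys are hashed to integers first.

```python
# Sketch assuming the `bbhash` Python bindings; the constructor signature and
# lookup behavior below are assumptions -- consult the documentation of the
# MPHF library you choose.
import hashlib

import bbhash  # assumed package


def key_to_u64(k):
    # any well-mixed 64-bit hash of the key will do
    return int.from_bytes(hashlib.blake2b(k.encode(), digest_size=8).digest(), "little")


def build_mphf(K, num_threads=1, gamma=1.0):
    hashes = [key_to_u64(k) for k in K]
    return bbhash.PyMPHF(hashes, len(hashes), num_threads, gamma)  # assumed signature


def mphf_index(mph, key):
    # For keys in K this is a unique index in [0, |K|); for other keys the
    # result is arbitrary (and may be None or out of range, depending on the binding).
    return mph.lookup(key_to_u64(key))
```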

Writeup: There are no false-positive rates to set here, but otherwise test the same things you tested above for the Bloom filter (i.e. use the same collection of K and K’ sets). Check the observed false-positive rate (i.e. what fraction of keys not in K return an index less than the number of elements of K when queried), the total time spent querying each K’, and the size of the MPHF. How do these compare to the Bloom filter? How does the false-positive rate compare to the expectations you had before you performed this experiment?
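Continuing the hypothetical sketch from above, the “false positive” rate described here could be measured roughly like this (mphf_index is the assumed helper from the previous sketch):

```python
def mphf_false_positive_rate(mph, K, K_prime):
    """Fraction of keys not in K whose MPHF lookup lands in [0, len(K))."""
    negatives = [k for k in K_prime if k not in K]
    hits = 0
    for k in negatives:
        idx = mphf_index(mph, k)
        if idx is not None and 0 <= idx < len(K):
            hits += 1
    return hits / max(1, len(negatives))
```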

Task 3 — Augment the MPHF with a “fingerprint array”

There is a simple way to turn the MPHF into a static AMQ with a controllable false-positive rate: add a fingerprint array. That is, let MPHF(k) be the index associated with element k in K by our MPHF. We compute, using some (good) hash function, h(k), and in an auxiliary array at index MPHF(k) we store the last b bits of h(k). This is essentially the idea behind the quasi-dictionary data structure, if you wish to read more about it. Assuming that h() is a good hash function, storing the last b bits of h(k) should provide us a false-positive rate close to (1/(2^b)). That is, if we stored only 1 bit (either 0 or 1), then, on average, half of the false-positive keys mapping to index MPHF(k) would have a hash h() ending in 0 and half ending in 1; thus, we could reject half of the false-positive keys. If we stored the last 2 bits, then we could reject 3/4 of these false positives, etc.
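A small sketch of this idea, reusing the hypothetical build_mphf / mphf_index / key_to_u64 helpers from the Task 2 sketch and an ordinary Python list as the fingerprint array (the note below about bit-packed integer vectors applies to the real implementation):

```python
FINGERPRINT_BITS = 8  # b; expected false-positive rate of about 1/2^8


def fingerprint(key, b=FINGERPRINT_BITS):
    # last b bits of a good hash of the key
    return key_to_u64(key) & ((1 << b) - 1)


def build_fingerprint_amq(K, b=FINGERPRINT_BITS):
    """Build the MPHF over K and fill the fingerprint array slot by slot."""
    mph = build_mphf(K)
    fps = [0] * len(K)
    for k in K:
        fps[mphf_index(mph, k)] = fingerprint(k, b)
    return mph, fps


def query_fingerprint_amq(mph, fps, key, b=FINGERPRINT_BITS):
    idx = mphf_index(mph, key)
    if idx is None or not (0 <= idx < len(fps)):
        return False  # definitely not in K
    return fps[idx] == fingerprint(key, b)  # false positive with prob. ~ 1/2^b
```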

So, in this programming task, you should pair your MPHF from Task 2 with this auxiliary fingerprint array. Then, using the same K and K’ sets as above, you will compare the false-positive rate, query speed, and size of the MPHF + fingerprint array against the Bloom filter. Note: for this part of the assignment, it is ideal if you use an integer vector to pack precisely the right number of bits into the fingerprint array for each element (e.g. the sdsl int vector or the compact vector you may have used in previous projects); however, if you cannot find an integer-vector library for the language you are using, you may use a standard array type. Just note in your writeup what your data structure’s actual size is, as well as what it would be if you used the minimum number of bits for each entry in your fingerprint array.
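If you cannot find an integer-vector library, one rough way to account for the “minimum number of bits” size, and to bit-pack the fingerprints by hand, is sketched below; the helper names are illustrative, and a library such as sdsl's int_vector would replace all of this.

```python
def packed_size_bits(n_keys, b):
    """Minimum fingerprint-array size if each entry uses exactly b bits."""
    return n_keys * b


def pack_fingerprints(fps, b):
    """Pack b-bit fingerprints into a bytearray, as a stand-in for a
    bit-packed integer vector."""
    packed = bytearray((len(fps) * b + 7) // 8)
    for i, fp in enumerate(fps):
        bit = i * b
        for j in range(b):
            if (fp >> j) & 1:
                packed[(bit + j) // 8] |= 1 << ((bit + j) % 8)
    return packed


def get_fingerprint(packed, i, b):
    """Read back the i-th b-bit fingerprint."""
    val, bit = 0, i * b
    for j in range(b):
        if packed[(bit + j) // 8] & (1 << ((bit + j) % 8)):
            val |= 1 << j
    return val
```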

Writeup: For this programming task, test your implementation with the same sets K and K’ as used in parts 1 and 2. For each set of keys, consider building your MPHF and fingerprint array with fingerprints consisting of 7, 8, and 10 bits. For each input dataset, and for each fingerprint size, measure (a) the observed false-positive rate (also, as a sanity check, ensure you have no false negatives), (b) the total time spent querying each K’, and (c) the total size of the MPHF + fingerprint array. Note that for the purposes of this project you don’t have to worry about serializing and deserializing your filter if you don’t want to; you can simply time, from within your program, just the queries. Comment on how your observations match the theory in terms of MPHF + fingerprint array false-positive rate and size. Also comment on how the false-positive rate affects the lookup time.
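Putting the pieces together, a hypothetical driver for this task might sweep the fingerprint widths and record quantities (a)–(c) for each K / K’ pair, reusing the sketches above:

```python
import time


def run_task3(K, K_prime, bits=(7, 8, 10)):
    """Sweep fingerprint widths and collect the measurements for one K / K' pair."""
    results = {}
    for b in bits:
        mph, fps = build_fingerprint_amq(K, b)
        start = time.perf_counter()
        answers = [query_fingerprint_amq(mph, fps, k, b) for k in K_prime]
        elapsed = time.perf_counter() - start
        n_negatives = sum(1 for k in K_prime if k not in K)
        false_pos = sum(1 for k, a in zip(K_prime, answers) if a and k not in K)
        false_neg = sum(1 for k, a in zip(K_prime, answers) if (not a) and k in K)
        results[b] = {
            "observed_fp_rate": false_pos / max(1, n_negatives),
            "false_negatives": false_neg,        # sanity check: expect 0
            "query_seconds": elapsed,
            "min_fingerprint_bits": len(K) * b,  # size if perfectly bit-packed
        }
    return results
```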