CMSC701 Homework 2 : Implementing bitvector rank and select, and applying them to a sparse array

This homework is due by 11:59PM ET on Tues. March 28. It consists of 3 programming tasks with associated writeup. The programming tasks build upon each other, and so should be implemented in order. Each programming tasks consists of (1) an implementation and (2) a writeup. Each implementation should be accopanied by a README (or section in a global README) containing instructions for building and using the implementation (e.g. a description of the library interface, and the command-line interface where applicable). In addition to the code you submit directly, you should also make a GitHub repository to which you post the code, and you should link to this in the README (it’s OK if you keep it private, but remember to invite Peyman and me).

The writeups for all three tasks should be put into a single document (either a PDF or a Word Document). Each writeup should include the following 3 components:

A short (no more than a paragraph) prose description of your implementation.
A brief (again, no more than a paragraph) summary of what you found to be the most difficult part of implementing each task.
The plots as described in each task.

Programming Tasks

Note : For the programming tasks of this assignment, there is not a specific language requirement you have to follow. However, as with Project 1 tasks should be completed in a compiled language that gives you direct access to memory management and bit-manipulation primitives. Good candidates for implementing this project would be C, C++ (ideally modern variants such as C++11/14/17), Rust or Go (though e.g. Java and C# would also work). If you are working in a managed memory language, please mind the memory usage, as the point of this project is to measure the overhead of these succinct data structures. You may not use a succinct data structure library to implement these tasks (the point is for you to implement them yourself). However, you may choose to use a library to provide the basic bit-vector representation and direct access operations if you wish (i.e. you may use the library to allow creation, population and direct access to a bit-vector, but not for rank, select, or any other “higher-level” operations). Also, note that you will find implementations of rank and select all over the place (some even from the previous iteration of this course). The point of the project is for you to implement these yourself, so please avoid the desire to comb through that existing code.

If you do end up deriving some piece of your code from a source you encounter (either code, or a paper), make sure to cite this source in both your code and in the homework you turn in.

Task 1 — bit-vector rank

Implement a succinct, constant-time, bit-vector rank operation. The exact design of the implementation is up to you, but the following requirements are put in place. In addition to turning your code in on ELMS, you should upload your code to a GitHub repository and include the link in your submission.

Your rank support should be a class / structure that “wraps” the underlying bit-vector. In C++ lingo, something like:

 bit_vector b;
 // do some stuff with b
 rank_support r(&b); // let r access b via a pointer
 auto x = r(i); // x now holds the value of rank1(b, i);
                // I'm using r(i) (operator()) as a shorthand
                // for rank1(.) as defined below.

You should implement the following methods for your rank_support class (type signatures are given for C++, use your judgement if implementing in another language):
- uint64_t rank1(uint64_t i) : Returns the number of 1s in the underlying bit-vector up to position i (exclusive).
- uint64_t overhead() : Returns the size of the rank data structure (in bits) that is required to support constant-time rank on the current bitvector.
- save(string& fname) : Saves the rank data structure for this bit vector to the file fname (your bit vector should also have a save() function).
- load(string& fname) : Loads the rank data structure for this bit vector from the file fname (your bit vector should also have a load() function).

Writeup: For this programming task, test your implementation by invoking it for bit vectors of various sizes, and plotting the bit-vector size (say N) versus the time requried to do some fixed number of rank operations. Also, plot the bit-vector size (say N) versus the result of calling the overhead() function. Does your implementation match the expected theoretical bounds?

Task 2 — bit-vector select

Implement a succinct, (at most) log-time, bit-vector select operation. The exact design of the implementation is up to you, but the following requirements are put in place. In addition to turning your code in on ELMS, you should upload your code to a GitHub repository and include the link in your submission.

Your select support should be a class / structure that “wraps” the underlying bit-vector. In C++ lingo, something like:

 bit_vector b;
 // do some stuff with b
 rank_support r(&b); // let r access b via a pointer
 select_support s(&r); // let s access r via a pointer

 auto x = s(i); // x now holds the value of select1(b, i);
                // I'm using r(i) (operator()) as a shorthand
                // for select1(.) as defined below.

You should implement the following methods for your rank_support class (type signatures are given for C++, use your judgement if implementing in another language):
- uint64_t select1(uint64_t i) : Returns the position, in the underlying bit-vector, of the FIRST index, j for which rank1(j) = 1.
- uint64_t overhead() : Returns the size of the select data structure (in bits) that is required to support log-time select on the current bitvector (how does this relate to the size of the rank data structure built above).
- save(string& fname) : Saves the select data structure for this bit vector to the file fname (your bit vector should also have a save() function).
- load(string& fname) : Loads the select data structure for this bit vector from the file fname (your bit vector should also have a load() function).

Writeup: For this programming task, test your implementation by invoking it for bit vectors of various sizes, and plotting the bit-vector size (say N) versus the time requried to do some fixed number of select operations. Also, plot the bit-vector size (say N) versus the result of calling the overhead() function. Does your implementation match the expected theoretical bounds? If you feel ambitious, you can additionally implement a constant-time bit-vector select, though this is not required.

Task 3 — Implementing a sparse array using your bitvector rank and select

Using the bitvector rank and select implementation you have built above, you will build a “sparse array” that stores a collection of objects and associates with each one an index. The idea here is that the values themselves will be tightly packed, but the index associated with each will be derived from a (potentially sparse) bitvector. Consider the following example:

bitvector:

| 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |

values:

["foo", "bar", "baz"]

Note that we have a bitvector of length 10, but only 3 non-zero values. This representation will be much more efficient than storing e.g. a list of 10 strings, only 3 of which are non-empty, and even more efficient at this sparsity than storing a list of string, index tuples. As we will see later in the course, this type of representation is very useful for representing e.g. a sampled suffix array, which we will use for full-text search with the FM-index.

Critically, the sparse array should be able to efficiently return the i-th present element of the array, along with its index of occurrence, it should be able to return the element present at a given index (or an empty indicator if no such element is present).

The exact design of your sparse array class is up to you, but it should implement at least the following functions:

void create(uint64_t size) : Creates an empty sparse array of length size (the size of the underlying bitvector you will create).
void append(string elem, uint64_t pos) : Appends the element elem at index pos of the sparse array. You may assume this is the only way to insert elements, and so you will always insert element in-order and never will insert two element with the same position, further you may assume that you will always have pos < size (but you should probably guard against this anyway).
void finalize() : This will “finalize” the sparse array. At this point, no more elements will be added. This is when you should take the opportunity to create any rank or select data structures.
bool get_at_rank(uint64_t r, std::string& elem) : This function places a reference to the r-th present item in the array in the reference elem. It returns true if there was >= r items in the sparse array and false otherwise.
bool get_at_index(uint64_t r, std::string& elem): This function looks at the r-th index in the sparse bitvector; if that bit is 1, it fetches the corresponding value and binds it to the reference elem and returns true, if that bit is a 0, it simply returns false.
uint64_t get_index_of(uin64_t r) : This function takes as its argument a rank r and returns the index in the sparse array where the r-th present element appears. Note : this is basically just select on the bitvector, except that it returns the position where this element occurs, not one index past that position. If r > the weight of the bitvector (i.e. greater than the number of present elements) then this function should return a sentinel value (e.g. in C++ this might be something like std::numeric_limits<uint64_t>()).
uint64_t num_elem_at(uint64_t r): This function returns the count of present elements (1s in the bit vector) up to and including index r (Note: This is just rank on the bitvector, but it is inclusive rather than exclusive of index r).
uint64_t size() : Returns the size of the sparse array.
uint64_t num_elem() : Returns the number of present elements in the sparse array (i.e. the number of 1s in the bitvector).
save(string& fname) : Saves the sparse array to the file fname.
load(string& fname) : Loads the sparse array from the file fname.

So, in our example above, sparse_array.get_at_rank(1, e) would return true, and after the function call the string e would have the value "bar". Likewise sparse_array.get_at_index(3, e) would return false and sparse_array.get_at_index(5, e) would return true and also set e to have value "bar". Finally, sparse_array.get_index_of(2) would return 5.

Implementation note: To keep this simple, I’m only requiring that you implement the sparse array for values of type string. However, note that if you are at all familiar with generic programming, it is very easy to make this work for values of any type — so if you are so inclined I encourage you to make this a “generic” sparse array.

Writeup: For this programming task, test your implementation by generating sparse arrays of a few different lengths (e.g. 1000, 10000, 100000, 1000000) and having various sparsity (e.g. 1%, 5%, 10%). How does the speed of the different functions vary as a factor of the overall size? How about as a function of the overall sparsity? Finally, try and estimate how the size of your sparse array in memory compares to what the size would be if all of the 0 elements were instead explicitly stored as “empty” elements (e.g. as empty strings). How much space do you save? How do your savings depend on sparsity?