Homework 2
Download
CMSC701 Homework 2 : Implementing bitvector rank and select, and applying them to a sparse array
This homework is due by 11:59PM ET on Tues. March 28. It consists of 3 programming tasks with associated writeup. The programming tasks build upon each other, and so should be implemented in order. Each programming tasks consists of (1) an implementation and (2) a writeup. Each implementation should be accopanied by a README (or section in a global README) containing instructions for building and using the implementation (e.g. a description of the library interface, and the command-line interface where applicable). In addition to the code you submit directly, you should also make a GitHub repository to which you post the code, and you should link to this in the README (it’s OK if you keep it private, but remember to invite Peyman and me).
The writeups for all three tasks should be put into a single document (either a PDF or a Word Document). Each writeup should include the following 3 components:
-
A short (no more than a paragraph) prose description of your implementation.
-
A brief (again, no more than a paragraph) summary of what you found to be the most difficult part of implementing each task.
-
The plots as described in each task.
Programming Tasks
Note : For the programming tasks of this assignment, there is not a specific language requirement you have to follow. However, as with Project 1 tasks should be completed in a compiled language that gives you direct access to memory management and bit-manipulation primitives. Good candidates for implementing this project would be C, C++ (ideally modern variants such as C++11/14/17), Rust or Go (though e.g. Java and C# would also work). If you are working in a managed memory language, please mind the memory usage, as the point of this project is to measure the overhead of these succinct data structures. You may not use a succinct data structure library to implement these tasks (the point is for you to implement them yourself). However, you may choose to use a library to provide the basic bit-vector representation and direct access operations if you wish (i.e. you may use the library to allow creation, population and direct access to a bit-vector, but not for rank, select, or any other “higher-level” operations). Also, note that you will find implementations of rank and select all over the place (some even from the previous iteration of this course). The point of the project is for you to implement these yourself, so please avoid the desire to comb through that existing code.
If you do end up deriving some piece of your code from a source you encounter (either code, or a paper), make sure to cite this source in both your code and in the homework you turn in.
Task 1 — bit-vector rank
Implement a succinct, constant-time, bit-vector rank operation. The exact design of the implementation is up to you, but the following requirements are put in place. In addition to turning your code in on ELMS, you should upload your code to a GitHub repository and include the link in your submission.
-
Your rank support should be a class / structure that “wraps” the underlying bit-vector. In C++ lingo, something like:
bit_vector b; // do some stuff with b rank_support r(&b); // let r access b via a pointer auto x = r(i); // x now holds the value of rank1(b, i); // I'm using r(i) (operator()) as a shorthand // for rank1(.) as defined below.
-
You should implement the following methods for your rank_support class (type signatures are given for C++, use your judgement if implementing in another language):
uint64_t rank1(uint64_t i)
: Returns the number of 1s in the underlying bit-vector up to position i (exclusive).uint64_t overhead()
: Returns the size of the rank data structure (in bits) that is required to support constant-time rank on the current bitvector.save(string& fname)
: Saves the rank data structure for this bit vector to the filefname
(your bit vector should also have asave()
function).load(string& fname)
: Loads the rank data structure for this bit vector from the filefname
(your bit vector should also have aload()
function).
Writeup: For this programming task, test your implementation by invoking it for bit vectors of various sizes, and plotting the bit-vector size (say N) versus the time requried to do some fixed number of rank operations. Also, plot the bit-vector size (say N) versus the result of calling the overhead()
function. Does your implementation match the expected theoretical bounds?
Task 2 — bit-vector select
Implement a succinct, (at most) log-time, bit-vector select operation. The exact design of the implementation is up to you, but the following requirements are put in place. In addition to turning your code in on ELMS, you should upload your code to a GitHub repository and include the link in your submission.
-
Your select support should be a class / structure that “wraps” the underlying bit-vector. In C++ lingo, something like:
bit_vector b; // do some stuff with b rank_support r(&b); // let r access b via a pointer select_support s(&r); // let s access r via a pointer auto x = s(i); // x now holds the value of select1(b, i); // I'm using r(i) (operator()) as a shorthand // for select1(.) as defined below.
-
You should implement the following methods for your rank_support class (type signatures are given for C++, use your judgement if implementing in another language):
uint64_t select1(uint64_t i)
: Returns the position, in the underlying bit-vector, of the FIRST index, j for which rank1(j) = 1.uint64_t overhead()
: Returns the size of the select data structure (in bits) that is required to support log-time select on the current bitvector (how does this relate to the size of therank
data structure built above).save(string& fname)
: Saves the select data structure for this bit vector to the filefname
(your bit vector should also have asave()
function).load(string& fname)
: Loads the select data structure for this bit vector from the filefname
(your bit vector should also have aload()
function).
Writeup: For this programming task, test your implementation by invoking it for bit vectors of various sizes, and plotting the bit-vector size (say N) versus the time requried to do some fixed number of select operations. Also, plot the bit-vector size (say N) versus the result of calling the overhead()
function. Does your implementation match the expected theoretical bounds? If you feel ambitious, you can additionally implement a constant-time bit-vector select, though this is not required.
Task 3 — Implementing a sparse array using your bitvector rank and select
Using the bitvector rank and select implementation you have built above, you will build a “sparse array” that stores a collection of objects and associates with each one an index. The idea here is that the values themselves will be tightly packed, but the index associated with each will be derived from a (potentially sparse) bitvector. Consider the following example:
bitvector:
| 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
values:
["foo", "bar", "baz"]
Note that we have a bitvector of length 10, but only 3 non-zero values. This representation will be much more efficient than storing e.g. a list of 10 strings, only 3 of which are non-empty, and even more efficient at this sparsity than storing a list of string, index tuples. As we will see later in the course, this type of representation is very useful for representing e.g. a sampled suffix array, which we will use for full-text search with the FM-index.
Critically, the sparse array should be able to efficiently return the i-th present element of the array, along with its index of occurrence, it should be able to return the element present at a given index (or an empty indicator if no such element is present).
The exact design of your sparse array class is up to you, but it should implement at least the following functions:
-
void create(uint64_t size)
: Creates an empty sparse array of lengthsize
(the size of the underlying bitvector you will create). -
void append(string elem, uint64_t pos)
: Appends the elementelem
at indexpos
of the sparse array. You may assume this is the only way to insert elements, and so you will always insert element in-order and never will insert two element with the same position, further you may assume that you will always havepos
<size
(but you should probably guard against this anyway). -
void finalize()
: This will “finalize” the sparse array. At this point, no more elements will be added. This is when you should take the opportunity to create any rank or select data structures. -
bool get_at_rank(uint64_t r, std::string& elem)
: This function places a reference to the r-th present item in the array in the referenceelem
. It returnstrue
if there was >= r items in the sparse array and false otherwise. -
bool get_at_index(uint64_t r, std::string& elem)
: This function looks at the r-th index in the sparse bitvector; if that bit is 1, it fetches the corresponding value and binds it to the referenceelem
and returns true, if that bit is a 0, it simply returns false. -
uint64_t get_index_of(uin64_t r)
: This function takes as its argument a rankr
and returns the index in the sparse array where ther
-th present element appears. Note : this is basically just select on the bitvector, except that it returns the position where this element occurs, not one index past that position. Ifr
> the weight of the bitvector (i.e. greater than the number of present elements) then this function should return a sentinel value (e.g. in C++ this might be something likestd::numeric_limits<uint64_t>()
). -
uint64_t num_elem_at(uint64_t r)
: This function returns the count of present elements (1s in the bit vector) up to and including indexr
(Note: This is just rank on the bitvector, but it is inclusive rather than exclusive of index r). -
uint64_t size()
: Returns the size of the sparse array. -
uint64_t num_elem()
: Returns the number of present elements in the sparse array (i.e. the number of 1s in the bitvector). -
save(string& fname)
: Saves the sparse array to the filefname
. -
load(string& fname)
: Loads the sparse array from the filefname
.
So, in our example above, sparse_array.get_at_rank(1, e)
would return true, and after the function call the string e
would have the value "bar"
. Likewise sparse_array.get_at_index(3, e)
would return false and sparse_array.get_at_index(5, e)
would return true and also set e to have value "bar"
. Finally, sparse_array.get_index_of(2)
would return 5
.
Implementation note: To keep this simple, I’m only requiring that you implement the sparse array for values of type string
. However, note that if you are at all familiar with generic programming, it is very easy to make this work for values of any type — so if you are so inclined I encourage you to make this a “generic” sparse array.
Writeup: For this programming task, test your implementation by generating sparse arrays of a few different lengths (e.g. 1000, 10000, 100000, 1000000) and having various sparsity (e.g. 1%, 5%, 10%). How does the speed of the different functions vary as a factor of the overall size? How about as a function of the overall sparsity? Finally, try and estimate how the size of your sparse array in memory compares to what the size would be if all of the 0 elements were instead explicitly stored as “empty” elements (e.g. as empty strings). How much space do you save? How do your savings depend on sparsity?