CMSC 701 Homework 1 : Overview

This assignment deals with the construction and querying of the suffix array. In particular, however, we will be much more concerned with the details of querying the suffix array then with construction. Thus, you are free to use a 3rd party library of your choice to implement the actual construction of the suffix array.

In order to make the construction of the suffix array generally useful, it will be necessary to not only build this index, but also to be able to save it to disk and load it from disk. To this end, this project will also require you to be able to serialize and deserialize your index. Your project will consist of 2 executables buildsa and querysa which are described in more detail below. You can think of the project as being broken into two parts.

In the first part of the project, you will implement a program to read a reference sequence from a FASTA file, to construct the suffix array on this sequence, and then to write the string and suffix array (and possibly an extra lookup table structure) to file in a binary format. You will also implement a program to read the saved file from disk, compute some statistics about the suffix array, and write those to a text format output file.

In the second part of the project, you will implement a program to read your serialized suffix array from file, as well as to read an input FASTA file containing many queries. Your program will then produce an output file with the query results in a well-specified output format.

In addition to the code, you will return in a brief report, with a subsection on both parts of the project.

Overall structure

Your project should contain (at least) two top-level executables. One called buildsa and the other called querysa that will work as described below.

There should be a README.md file in the top level directory. This README file should contain the following information.
- What language have you written your solution in?
- What resources did you consult in working on this assignment (view this as a form of citation; you shouldn’t copy code directly from anywhere in your assignment, but if you consulted other sources please list them here).
Finally, your submission should also include a report with the information outlined in each of the parts below.

Part (a), constructing the suffix array and an auxiliary index

In this part of the assignment, you will write the buildsa program. This program will read in a “genome” (in FASTA) format, build the suffix array on this reference, and write the string and suffix array to a binary file. Additionally (when an extra option is provided) it will build a secondary index to allow for the improved search heuristic we discussed in class. When invoked with this extra option, it will also write this data structure to the serialized file.

`buildsa`

`buildsa`: Input

The input consists of one optional parameter, as well as 2 required arguments, given in this order:

--preftab <k> - if the option --preftab is passed to the buildsa executable (with the parameter k), then a prefix table will be built atop the suffix array, capable of jumping to the suffix array interval corresponding to any prefix of length k.
reference - the path to a FASTA format file containing the reference of which you will build the suffix array. Note: The FASTA format allows a record to be split over multiple input lines, make sure your program is parsing the reference input correctly.
output - the program will write a single binary output file to a file with this name, that contains a serialized version of the input string and the suffix array.

`buildsa`: Output

Your program will output a file with the name given by the output argument above. This should be a binary file holding everything necessary to perform query using your suffix array. Specifically, it should include the input string itself (probably with the sentinel $ appended) and it should also include an encoding of the entries of the suffix array. Aditionally, if your program created a prefix table (was given the --preftab option), then this file will also contain the serialzation of the prefix table.

Note: The specific binary encoding is up to you — it can be as simple as an integer representing the string length followed by the bytes of the string and an integer representing the number of bytes in the suffix array followed by the entries, or something more complex. You are allowed to use an external serialization library for this component, but the serialization must be to a binary (not text) format. In C++ you could use something like cereal or bitsery; in Rust you could use soemthing like rkyv or serde with bincode.

NOTE : You may make use of an existing 3rd party library for the actual suffix array construction algorithm so that it scales to reasonably large reference sequences. Such libraries exist and are readily available in C, C++ and Rust, and I am happy to provide specific recommendations if folks would like.

Report Items :

What did you find to be the most challenging part of implementing the buildsa program?
For references of various size:
- How long does building the suffix array take?
- How large is the resulting serialized file?
- For the times and sizes above, how do they change if you use the --preftab option with some different values of k?
- Given the scaling above, how large of a genome do you think you could construct the suffix array for on a machine with 32GB of RAM, why?

Part (b), querying the suffix array

In the second part of the assignment, you will implement a program that reads in your serialized data structures and then performs a series of queries against the suffix array.

Your program for this part should be called querysa. This program will take as input 4 arguments, your serialized suffix array, a FASTA file containing queries, a query mode parameter, and an output file name. It will perform query in the SA and report the results in the output file specified in the format specified below.

`querysa`: Input

index - the path to the binary file containing your serialized suffix array (as written by buildsa above).
queries - the path to an input file in FASTA format containing a set of records. You will need to care about both the name and sequence of these fasta records, as you will report the output using the name that appears for a record. Note, query sequences can span more than one line (headers will always be on one line).
query mode - this argument should be one of two strings; either naive or simpaccel. If the string is naive you should perform your queries using the naive binary search algorithm. If the string is simpaccel you should perform your queries using the “simple accelerant” algorithm we covered in class. Note: If you are reading a serialized input file with no prefix lookup table, then these algorithms are to be run on each query on the full suffix array, and if you are reading in a prefix lookup table as well, then this is the algorithm that should be used on the relevant interval for each query.
output - the name to use for the resulting output.

`querysa`: Output

output - the output file of your program. This file should contain the results of your queries in the following format. Each line should contain a tab-separated list containing the following information:
- query_name, k, hit_1, hit_2, hit_k

Here, the query_name is simply the header of the corresponding FASTA entry (the string after the > — not including the > on the header line). The value k is the number of occurrences of the query string in the underlying text on which the suffix array is built. Finally hit_1 through hit_k are the positions in the original text (0-indexed) where the query string occurs. If a query string does not occur in the text, then you should report k = 0, and there will be no hit_1, … etc. entries for that query.

Notes : For the purposes of measuring timing for the report (as mentioned below), you should be careful to not let the reading of input or the writing of output affect your timing measurements. There are a few ways to accomplish this. One simple way is to simply read all of input queries into memory before you start your internal timer — time only the actual queries themselves and the creation of the query output in memory — and then end your internal timer before you write the output to file. To enable even more accurate timing of the query, you could implement a --quite or --no-output flag for your executable that performs the queries but doesn’t actually write output to file (or collect it in memory).

Also, please be aware that the simple accelerant that you are implementing here is not the complex variant that requires us to pre-compute the LCPs between suffix array entries! It is simply the one that computes and propagates the LCP of the pattern and the current suffix array interval bounds. You do not need to implement the “super accelerant” (the variant that guarantees O(n+log m) search), only the “simple accelerant” (the variant that usually achieves O(n + log m) search in practice).

Report Items :

What did you find to be the most challenging part of implementing the query program?
For references of various size:
- How long does query take on references of different size, and on queries of different length?
- How does the speed of the naive lookup algorithm compare to the speed of the simpleaccel lookup algorithm?
- How does the speed further compare when not using a prefix lookup table versus using a prefix lookup table with different values of k?
- Given the scaling above, and the memory requirements of each type of index, what kind of tradeoff do you personally think makes sense in terms of using more memory in exchange for faster search.

CMSC 701 Homework 1 : Overview

Overall structure

Part (a), constructing the suffix array and an auxiliary index

buildsa

buildsa: Input

buildsa: Output

Report Items :

Part (b), querying the suffix array

querysa: Input

querysa: Output

Report Items :

`buildsa`

`buildsa`: Input

`buildsa`: Output

`querysa`: Input

`querysa`: Output