Approximate String Matching using FM-index

Master thesis data

Specialization: Algorithmics and Programming
Thesis advisor: Paolo Ribeca
Orientation: Research
Student: Santiago Marco Sola

Thesis Description

New sequencing platforms (Illumina, Solexa, ...) are producing more and more genomic data each
day. This data requires complex analysis which strongly relies on short sequence alignment.
Nowadays, the most successful aligners rely on the use of the Burrows Wheeler Transform (BWT)
and FM-index to map sequences. Such techniques allow to build compact indexes which are able
to perform exact searches in linear time. However, the reads obtained from the sequencing
process contain errors. In order to obtain accurate results, approximate searches must be
performed instead. Nonetheless, the cost of approximate searches is very sensitive to the number
of mismatches. In particular, the cost becomes exponential as the number of mismatches
increases. To mitigate this problem, several algorithmic techniques have been developed.
This master thesis is intended to analyze and develop mixed strategies to cope with the problem of
approximate string matching using FM-indexes. First, it will focus on analyzing the data structures
involved in the FM-index and how their design affects performance and space cost. Also many
techniques, used to speedup the alignment process avoiding the exponential cost explosion, will be
analyzed. In addition, we want to determine how they behave and how scalable they are with the
increase of the number of mismatches.
Finally, the techniques and ideas developed during this thesis will be implemented as a part of the
GEM library (a software toolbox for the high-throughput analysis of short-read sequences). In this
way, the work of this thesis will be made easily available for applied use.