Diploma / Master's Thesis: Distributed data structures for
efficient molecular sequence analysis
| Archive | Size | Version |
| ptpan-thesis [PS.gz] [PDF] | 451 KB, 1092 KB | 15-Nov-03 |

Abstract:
With the increasing demand for fast query algorithms and data structures for huge genome
databases, many of the older indexing techniques have become unsuitable. In this thesis, we
propose and implement PTPan, a novel indexing structure based on suffix trees.
Its main purpose is the fast lookup of DNA/RNA substrings in very large databases.
Creation and use of this structure can easily be distributed and parallelised across multiple
machines in a cluster. The index is designed to fit in main memory. References to the
source database are not necessary for generating a candidate set. Compression is used to
keep the disk requirements as small as possible without heavily affecting the O(|Q|)
lookup time (Q being the query string). After compression, space requirements are better
than or close to those of suffix arrays (using suitable parameters for our application). PTPan
has been integrated into the ARB software package.
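To illustrate why a suffix-tree-style index allows substring lookup in time proportional to the query length, the following is a minimal sketch of an uncompressed suffix trie in Python. It is not the PTPan implementation: the names `SuffixTrieNode`, `build_suffix_trie` and `find_occurrences` are hypothetical, the structure is neither compressed nor distributed, and this naive construction needs quadratic space, so it is only suitable for short example sequences.

```python
class SuffixTrieNode:
    """One node of a naive (uncompressed) suffix trie."""

    def __init__(self):
        self.children = {}   # maps a character to the child node
        self.positions = []  # start positions of all suffixes passing through


def build_suffix_trie(text):
    """Insert every suffix of `text`; O(n^2) time and space (toy only)."""
    root = SuffixTrieNode()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, SuffixTrieNode())
            node.positions.append(start)
    return root


def find_occurrences(root, query):
    """Walk one edge per query character, i.e. O(|Q|) dictionary steps.

    Returns the start positions of `query` in the indexed text,
    or an empty list if the query does not occur (or is empty).
    """
    node = root
    for ch in query:
        node = node.children.get(ch)
        if node is None:
            return []
    return list(node.positions)


if __name__ == "__main__":
    index = build_suffix_trie("ACGTACGA")
    print(find_occurrences(index, "ACG"))
    print(find_occurrences(index, "TT"))
```

The lookup never consults the original sequence: the walk itself yields the candidate positions, which mirrors the abstract's remark that references to the source database are not needed to generate a candidate set.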