ICT Innovations 2010 Web Proceedings ISSN 1857-7288

Next-generation DNA sequencing technology, challenges and bioinformatics approaches for sequence alignment

Aleksandra Bogojeska, Slobodan Kalajdziski, Ljupco Kocarev

Department of Computer Science and Informatics, Faculty of Electrical Engineering and Information Technologies, Karpos II bb, 1000 Skopje
{aleksandra.bogojeska,skalaj,lkocarev}@feit.ukim.edu.mk

Abstract. The advent of high-throughput sequencing platforms has brought bioinformatics to a new level. This so-called 'next-generation' sequencing technology has opened the doors of research to every laboratory, making experiments of previously unimaginable scale and cost feasible. As a result, novel research areas have emerged, providing huge amounts of new data ready to be analyzed. In parallel with this progress, a variety of tools designed for sequencing data analysis has been published. Sequence alignment is the central challenge in this analysis, providing the primary results of an experiment. Several alignment methods and a diversity of tools have been developed and published in recent years, and their common goal is to balance performance against accuracy. This review presents the new NGS technologies and platforms, the current alignment approaches applied in data analysis, and some commonly used implementations of these methods.

Keywords: next-generation sequencing, alignment algorithm, short reads

1 Introduction

DNA sequencing is the process of determining the order of the nucleotides in a molecule of DNA. Novel approaches to genome sequencing are rapidly transforming genomics research, supplanting the earlier Sanger method of sequencing. This next-generation sequencing (NGS) technology offers low-cost, accurate DNA sequencing, which has led to the appearance of new fields in bioinformatics and molecular biology and to advances in the existing ones.
Research areas which have emerged and benefit from the use of NGS include metagenomics, single nucleotide polymorphism (SNP) detection, gene expression, chromatin immunoprecipitation (ChIP) sequencing, discovery of non-coding RNAs, and de novo and ancient DNA sequencing [1], [2]. Each of these fields is expected to provide new and significant information for future development and research in genomics. There are two fundamental computational analyses performed on the data after sequencing: assembly and alignment. Assembly is essential for organisms without a sequenced reference genome and is the process of joining the sequenced reads into a whole-genome sequence. Alignment, on the other hand, remains the fundamental analysis that confirms the success of an experiment; this process is analyzed in detail later. In the end, both types of analysis rely on graphical tools for data viewing and representation.

M. Gusev (Editor): ICT Innovations 2010, Web Proceedings, ISSN 1857-7288 © ICT ACT – http://ictinnovations.org/2010, 2010

The paper is organized as follows. Section II presents the NGS technologies and the sequencing platforms that use them. Section III takes a closer look at the sequence alignment methods and tools developed for NGS. Section IV concludes the paper and gives future directions for NGS technology and the development of bioinformatics tools.

2 Next Generation Sequencing Technologies

Since 2004, when the first NGS platform, the Roche 454, was presented by 454 Life Sciences, many new approaches and producers of NGS platforms have appeared.
For the time being, there exist 8 producers of NGS platforms: Applied Biosystems (www3.appliedbiosystems.com), Complete Genomics (www.completegenomics.com), Helicos (www.helicosbio.com), Illumina (www.illumina.com), Polonator (www.polonator.org), Roche (www.roche.com), Pacific Biosciences (www.pacificbiosciences.com) and Ion Torrent (www.iontorrent.com). Each of the platforms embodies a complex symbiosis of enzymology, chemistry, high-resolution optics, and hardware and software engineering. They are characterized by short read lengths, massive volumes of generated data and low sequencing cost. The NGS platforms produce reads of length between 30 and 400 base pairs (bp) and generate 500 megabase pairs (Mb) to 200 gigabase pairs (Gb) per run, compared to the Sanger method, where at most 70 kilobase pairs (kb) are generated per run with read lengths of about 900bp. These characteristics for the listed platforms are presented in Table 1.

Table 1 NGS platforms and their features.

Platform | Technology | Read len. (bp) | Data gen. per run | Time per run | Accuracy
Roche 454 | Sequencing-by-synthesis | 400 | 500Mb | 10h | M:99.5% (1), C:99.99% (2)
Illumina | Reversible dye termination | 2x100 | 150-200Gb | 8 days | /
ABi SOLiD | Sequencing by ligation | 50 | 18-35Gb | 6-7 days | M:99.94%, C:99.999%
Complete Genomics | Sequencing by hybridization and ligation | 70 | 100Gb | 1.5 days | 99.95%
Pacific Biosciences | Sequencing-by-synthesis (single mol.) | ~1000 | 10 bases per sec | 15min | 99.9999%
Helicos | Sequencing-by-synthesis (single mol.) | 25-55 | 21-28Gb | 8 days | C:99.995%
Polonator | Polony sequencing by ligation | 26 | 4-5Gb | 4 days | 98%
Sanger | Capillary sequencing | 800 | 70kb | 3h | 99.9%

(1) M – measured accuracy, produced by the sequencing platform
(2) C – consensus accuracy, gained by the alignment tool

All platforms differ mainly in their approach to DNA manipulation: they either amplify the DNA molecule or work with a single DNA molecule. In NGS, amplification of the molecules is done in vitro using the polymerase chain reaction (PCR), and parallelism is achieved by executing PCR on multiple individual DNA molecules. This method is used in the Illumina, Roche 454, Polonator, Ion Torrent and ABi SOLiD platforms. For the time being, only the Helicos and Pacific Biosciences instruments are regarded as 'single-molecule' sequencers. It is expected that single-molecule sequencing will produce read lengths of thousands of bases, which will provide simplified and improved data analysis. The methodologies used in the NGS platforms also differ in the way nucleotides are detected and reported. Roche platforms use the pyrosequencing method combined with emulsion PCR, Illumina platforms use reversible dye terminators to distinguish bases, and ABI platforms use the sequencing-by-ligation method. The new Ion Torrent platform uses a sequencing-by-synthesis approach in which the hydrogen ion released during nucleotide incorporation is detected by a special semiconductor chip. Detailed reviews of the chemistry behind these platforms appear elsewhere [1], [2], [3], as well as on the companies' websites, which are supplied with up-to-date information for each of the platforms.

3 Alignment methods and tools used for NGS data analysis

Alignment is the process of determining the source of a sequenced DNA read. The read can be mapped against a given reference genome or against multiple genomes of the species the sequence comes from. The alignment process can also be applied to other genomes, provided that the evolutionary distance between the species of the reads and the genome is appropriate.
The foregoing NGS technologies used for DNA sequencing, with their specific characteristics, have led to the development of new alignment tools. BLAST [5], the gold-standard tool of the former generation of programs, used for the analysis of sequencing data produced by the Sanger technology, performs searches through large databases to find the sequences that best match a query. The new alignment tools, on the other hand, perform alignment against the reference genome of the species of interest. These tools also have to deal with many different technologies, each with a unique error model that has to be reflected in the algorithm design, and with the species-specific rate of polymorphisms, which has to be taken into account when setting the expected number of mismatches during alignment. These design assumptions result in faster algorithms capable of accurately processing the massive data volumes produced by the NGS technologies. In recent years there has been a growing number of implementations of tools that perform short-read alignment, but the number of distinct methodologies employed is much smaller. The alignment tools can be grouped by methodology into three categories: hash table-based algorithms, algorithms based on suffix trees and their modifications, and merge-sorting-based algorithms. There is only one implementation in the third category, the Slider tool [24]; accordingly, this review focuses on the first two techniques. The first discussed method is the hash table-based implementation, where there are two possible ways of index creation: using the reference genome, or using the set of sequenced reads.
Then the Burrows-Wheeler transform (BWT)-based algorithms are presented, where an efficient index of the reference genome is created first, which results in a fast search with a low memory footprint. In order to map a sequence accurately, alignment programs follow a multistep algorithm. In the first step they use heuristic techniques to find the most likely places in the reference where the read can be mapped. Then, on this smaller subset of possible mapping locations, more accurate and sensitive algorithms are run, such as the Smith-Waterman local alignment algorithm [6] and its modifications, giving the top n places where the read maps against the reference.

3.1 Hash-based alignment methods

The first alignment tools developed and presented for NGS short reads used the same methodology for creating the search index as the BLAST [5] generation of algorithms - a hash table. This hash table is created from the input query data and is used for structuring the index and scanning through the database sequences. The method is appropriate for DNA sequencing data, where duplicate sequences are common and all possible combinations of nucleotides are unlikely to be present; these two features of the data match the ability of hash tables to index complex and non-sequential data. The hash table index can be created from the reference genome or from the input reads, and the two approaches trade off memory against time differently. Hash tables created from the reference genome have a constant memory requirement regardless of the size of the input read set; this memory is usually large, depending on the size of the reference genome. Hash tables based on the sequenced reads usually have smaller, variable memory requirements, which depend on the number and diversity of the reads, but need more processing time, since the entire reference genome has to be scanned against every input read.
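As an illustration of the reference-indexed variant, the following Python sketch (the function names and the choice of k are ours, not taken from any particular tool) builds a k-mer hash table over a small reference and uses it to collect candidate mapping positions for a read; a real aligner would then rescore these candidates with a sensitive algorithm such as Smith-Waterman.

```python
from collections import defaultdict

def build_index(reference, k):
    """Hash every k-mer of the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def candidate_positions(index, read, k):
    """Use each k-mer of the read as a seed; a hit at reference position p
    for the seed starting at read offset o suggests the read maps at p - o."""
    candidates = set()
    for o in range(len(read) - k + 1):
        for p in index.get(read[o:o + k], []):
            if p - o >= 0:
                candidates.add(p - o)
    return sorted(candidates)

reference = "ACGTACGTTAGCCGATTACA"
index = build_index(reference, 4)
print(candidate_positions(index, "TAGCCGAT", 4))  # [8]
```

Note the trade-off discussed above: the index size depends only on the reference, while every read is looked up against it.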
Fig. 1 Hash table-based method. The regions used for seeding are marked with capital letters and the matched reads for the seeds 100 and 110 are shown

Algorithms that use the hash table-based method are: MAQ [7], SOAP [8], SHRiMP [9], SSAHA2 [10], RMAP [11], RazerS [12], SeqMap [13], ZOOM [14], BFAST [15], MOSAIK (http://bioinformatics.bc.edu/marthlab/Mosaik) and Illumina's commercial ELAND algorithm. These hash table-based algorithms implement the spaced seed paradigm [16] to allow mismatches and gaps in the alignment. A template of the form '101101', for example, represents a seed that requires 4 matches at the '1' positions and tolerates 2 mismatches at the '0' positions. In Fig. 1 there are two seeds, 100 and 110, where the capital letters in the input reads represent the seed regions, matched against the corresponding seeds in the hash index. An improvement of the spaced seed policy is the q-gram approach [17], where multiple spaced seeds per read are used for gap detection. Below is a short overview of the MAQ and SOAP alignment tools, their hash index implementations, their mismatch handling, and the advantages and disadvantages of each algorithm.

MAQ

MAQ [7] (Mapping and Assembly with Quality) is an alignment and variant-calling tool that uses a hash table for data indexing. A characteristic feature of MAQ is its consideration of the quality scores (Phred values) of the reads; according to this information and the number of mismatches, it assigns a mapping quality value to each mapped read. This feature helps in distinguishing between platform errors and true SNPs. The algorithm always reports one alignment, and the
repeated read is aligned randomly, with a mapping quality of zero reported. To implement the search, MAQ uses multiple hash tables created from the input reads, whose number depends on the number of mismatches allowed in the alignment. By default, six hash tables are created, allowing up to 2 mismatches in a read using the spaced seed technique. The number of mismatches can be greater, at the cost of more hash tables and longer processing time. MAQ guarantees to find alignments with up to two mismatches in the first 28bp of the read. It requires little memory, less than 1Gb per processor, but a longer time to achieve accurate results. Along with alignment, MAQ offers many other possibilities for short-read data analysis, such as different format conversions, SNP detection and an alignment viewer. MAQ is capable of processing Illumina and ABi SOLiD reads, single or mate-pair, no longer than 63bp, which is the greatest disadvantage of the algorithm, as each of the technologies aims to produce longer reads.

SOAP

SOAP [8] (Short Oligonucleotide Analysis Package) is another alignment tool designed for short-read mapping. It is also hash table-based but does not use quality information for alignment. To perform the alignment, SOAP uses a seed and hash table searching algorithm. Unlike in MAQ, here the hash table index is created from the reference sequences. The input reads and the reference genome are encoded as numeric data using 2 bits per base; this value is used as a suffix when searching the look-up table, in order to find the number of differing bases. SOAP reports the alignments with the minimal number of mismatches or the smallest gap. By default, 2 mismatches and gaps between 1 and 3bp are allowed.
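The spaced-seed filtration used by these tools can be sketched in a few lines of Python (the template and sequences below are invented for illustration): only the '1' positions of the template contribute to the hash key, so two sequences that differ only at '0' positions still produce the same key and survive the filter.

```python
def seed_key(seq, template):
    """Extract the characters at the '1' positions of the spaced-seed template."""
    return "".join(c for c, t in zip(seq, template) if t == "1")

template = "101101"
# The two 6-mers below differ at positions 1 and 4, both '0' in the template,
# so they hash to the same key and the candidate match is kept.
print(seed_key("ACGTAC", template))  # AGTC
print(seed_key("AGGTTC", template))  # AGTC
```

A mismatch at a '1' position, by contrast, changes the key and the candidate is discarded before any expensive verification.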
Allowing 2 mismatches is achieved by splitting the read into four fragments, so that all 6 combinations of two fragments in which the mismatches may fall can be taken as seeds. To remove the contaminated end regions of Illumina reads, SOAP can trim a few base pairs from the ends of the reads. To lower memory usage, SOAP loads the reference sequences as an unsigned 3-byte array. SOAP is also capable of mate-pair alignment, but not of color-space alignment. It is intended for the alignment of short reads of up to 60bp. Additional features of the algorithm are dedicated alignment modes for mRNA and small RNA reads.

3.2 Suffix array, FM-index and Burrows-Wheeler transform alignment methods

This category of alignment algorithms has been developed in recent years and connects suffix array methods with the special FM-index. The connection is made through a data structure originating from compression theory, the Burrows-Wheeler transform. The combination of these methods yields algorithms that perform read matching 10 times faster than the hash table-based aligners [18], [19]. A suffix array is a data structure that contains all suffixes of a string in sorted order; it is designed for efficient searching of large texts. The Burrows-Wheeler transform [4] is a technique for string compression which implements a reversible transformation: given the resulting string, one can easily decode the input string. It uses a so-called zero character to mark the string end. A table of all rotations of the string is created and sorted, and the last column of the sorted matrix is the BWT string, Fig. 2. Afterwards, using the LF (last-to-first) mapping, the initial text can be recovered. These features make the BWT suitable for the manipulation of genetic data, where the strings consist of a four-letter alphabet. The FM (Full-text, Minute-space) index [22] is a data structure based on the BWT that provides an efficient algorithm for data indexing.
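The construction just described, and its inversion, can be reproduced in a few lines of Python; this is a naive sketch that sorts all rotations explicitly (matching Fig. 2's GOOGOL example), whereas real tools build the index with dedicated suffix-array construction algorithms.

```python
def bwt(s):
    """Burrows-Wheeler transform via sorted rotations; '$' marks the string end."""
    s += "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(bw):
    """Invert the BWT by repeatedly prepending bw and re-sorting (naive decoding)."""
    table = [""] * len(bw)
    for _ in range(len(bw)):
        table = sorted(bw[i] + table[i] for i in range(len(bw)))
    return next(row for row in table if row.endswith("$"))[:-1]

print(bwt("GOOGOL"))           # LO$OOGG
print(inverse_bwt("LO$OOGG"))  # GOOGOL
```

Note how the three O's and the two G's of GOOGOL end up adjacent in the transform, which is exactly the clustering property that makes the string compressible and indexable.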
The creators of the FM-index, Ferragina and Manzini, propose a special index structure consisting of three parts: the superbuckets section (SB), the bucket directory (BD), and the body of the FM-index. The superbuckets section stores the number of occurrences of every character in the preceding superbuckets, the bucket directory stores the starting position of each compressed bucket in the body of the FM-index, and the body of the index stores the compressed image of each bucket. The data structure creation includes two steps. The first step performs the BWT on the reference genome; this process is reversible, and within this step sequences that occur multiple times appear together in the data structure. The second, memory-intensive step is the final index creation according to the FM-index structure. This data structure offers fast search times while the index is the same size as or smaller than the input genome. For the human genome, the index requires approximately 2-5GB, depending on the other algorithmic techniques implemented [18-20]. This small index size allows storing the index on disk and loading it into memory on any standard computing cluster.

Fig. 2 BWT construction for the string GOOGOL

The trade-off between speed and sensitivity in the BWT-based algorithms lies in the number of mismatches allowed. Whereas the hash table-based methods have spaced seeds, here there is no equally efficient way of dealing with mismatches, and each algorithm employs a different method for detecting and managing mismatches and gaps. As the sequencing technology develops and becomes more accurate, this limitation will become less important for species with a low rate of polymorphisms. Next, a short description is given of three BWT-based methods, the founders of this generation of algorithms.
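Before turning to the individual tools, the exact-matching backward search that they all share can be made concrete with a minimal Python sketch; the occ() scan below is linear, whereas a real FM-index answers it in constant time from the bucket tables described above.

```python
from collections import Counter

def bwt(s):
    s += "$"
    return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))

def backward_search(bw, pattern):
    """Count occurrences of pattern in the original text using only its BWT."""
    counts = Counter(bw)
    c_table, total = {}, 0
    for ch in sorted(counts):       # C[ch] = # characters in bw smaller than ch
        c_table[ch] = total
        total += counts[ch]
    def occ(ch, i):                 # occurrences of ch in bw[:i]
        return bw[:i].count(ch)     # O(n) here; O(1) with FM-index buckets
    lo, hi = 0, len(bw)             # current interval of sorted rotations
    for ch in reversed(pattern):    # extend the match one character leftwards
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ(ch, lo)
        hi = c_table[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo

bw = bwt("GOOGOL")
print(backward_search(bw, "GO"))  # 2
print(backward_search(bw, "OO"))  # 1
```

Each step of the loop is one LF-mapping update, so the search cost depends on the pattern length, not on the genome length.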
BWA and BWA-SW

These two algorithms were published by the same developer as the MAQ aligner, exploiting the speed-up that the BWT provides. The BWA [18] version is intended for short reads of up to 200bp, and the enhanced version BWA-SW [21] for long reads of up to 100kbp. They both use a suffix array combined with the BWT and FM-index. As in MAQ, a mapping quality is reported for each aligned read. The BWA algorithm adds a few other concepts for the alignment of Illumina platform reads, considering their characteristics: handling of ambiguous bases, paired-end mapping, determination of the maximum allowed number of mismatches, and generation of mapping quality scores. For exact matching, the FM-index backward search is used. For inexact matching, a breadth-first search (BFS) with a heap-like structure for tracking differences between strings is used; it finds all alignments within a given number n of differences. Acceleration is achieved by using an iterative strategy over the top repetitive intervals and unique top intervals with n differences. BWA is 6-18 times faster than MAQ but reports a slightly higher error rate, which again shows the trade-off between speed and accuracy. NGS platform development aims at producing reads of 500-1000bp. The limitations of BWA in dealing with longer data, and with data with high error rates, led to the development of the BWA-SW algorithm, the only BWT-based algorithm that can process reads longer than 1kb. Compared to short-read alignment, where full-length alignment of the read is required, long-read alignment prefers local matches, because long reads are more likely to contain a large number of mismatches and gaps.
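The local scoring that long-read aligners fall back on is the Smith-Waterman recurrence [6]; the sketch below is a textbook dynamic-programming fill with invented scoring parameters, not BWA-SW's actual implementation, and returns only the best local score.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            # Local alignment: a cell never goes below zero, so a bad region
            # simply restarts the alignment instead of dragging the score down.
            score[i][j] = max(0, diag, score[i-1][j] + gap, score[i][j-1] + gap)
            best = max(best, score[i][j])
    return best

print(smith_waterman("ACGTTAG", "CGTTA"))  # 10: 'CGTTA' matches exactly (5 x 2)
```

The zero floor in the recurrence is precisely what lets a long, error-rich read contribute a strong local match without being penalized for its divergent flanks.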
BWA-SW builds FM-indices for both the reference genome and the read. The reference genome is represented as a prefix trie, and the read as a prefix directed acyclic word graph (DAWG); using dynamic programming, the best local match between the two is found. Mismatches and gaps are handled using seed templates between the two FM-indices. This algorithm reports 5-20 times faster processing than SSAHA2 and BLAST, with the same accuracy as the SSAHA2 alignment algorithm.

BOWTIE

Bowtie [19] is the first published short-read alignment algorithm that uses the BWT technique. It uses the index structure proposed by Kärkkäinen [23], combined with the FM-index structure, which can be configured to trade off speed against memory. For exact alignment, pure LF mapping is used. In order to handle sequencing errors and polymorphisms, Bowtie introduces two extensions: a quality-aware backtracking algorithm and a double indexing strategy. The first extension allows mismatches and favors high-quality alignments, using a greedy depth-first search (DFS) to find the best alignment with n mismatches. The second extension avoids excessive backtracking on reads with low quality values. Using the MAQ strategy, a mapping quality is reported as well. A disadvantage is that Bowtie is not able to report gapped alignments, i.e. to detect insertions and deletions. Recent Bowtie versions are able to perform paired-end and color-space alignment and to process reads of up to 1024bp.

SOAP2

SOAP2 [20] is the improved version of the SOAP alignment algorithm, with reduced memory usage and increased alignment speed. These improvements are achieved by using the BWT compression index instead of the hash table-based seed algorithm. It uses the proposed FM data structure; for example, hashing 13-mers splits the reference genome into 4^13 = 2^26 blocks, and within a few iterations the exact location of the search string can be found inside a block.
To deal with mismatches, a 'split-read strategy' is used, as in the original SOAP algorithm: the read is split into fragments in order to detect mismatches. As in the previous version of the algorithm, the best alignment reported is the one with the minimal number of mismatches and gaps. An improvement has been made in the maximum length of the input reads: the algorithm can now handle reads of up to 1024bp. Using the BWT, this algorithm is 12 times faster than the first version in the index-creation step for the human genome, and 20 times faster for the same amount of reads aligned. The SOAP group of algorithms is evolving, and there are companion algorithms for de novo assembly and SNP detection.

4 Conclusion and NGS future

Bioinformatics and genomics have shifted as a result of the advancement of NGS technologies. New projects inspired by NGS, such as the 1000 Genomes Project (http://1000genomes.org) and HapMap (http://hapmap.org), will provide significant information about genome structure and function, and open new perspectives in the treatment of specific diseases and cancer. The bioinformatics community is intrigued and responds quickly to the challenges coming from the NGS projects. As a consequence, the existing alignment methods have been improved and adjusted to deal with the massive volumes of short-read data. The competition between alignment tools is still running, and there is as yet no answer to the question of which tool is the most accurate and suitable to use while making the most effective use of computational resources. Table 2 gives an overview of the presented alignment tools. The current NGS trend of producing longer reads requires the modification of many of the developed short-read analysis tools, or the development of new ones.
Long reads will have their primary use when de novo assembly is performed or genomes with a highly repetitive structure are sequenced. In contrast, short-read sequencing will play the main role when the sequencing of specific small regions is required, as in ChIP sequencing, RNA sequencing and SNP detection.

Table 2 NGS alignment tools and their features.

Tool | Method | Platform | Indels | Read len.
MAQ | Hash | Illumina, ABi SOLiD | y | 63bp
SOAP | Hash | Illumina | y | 60bp
BWA | BWT | Illumina, ABi SOLiD | y | 200bp
SOAP2 | BWT | Illumina | y | 1024bp
BWA-SW | BWT | All | y | 100kbp
BOWTIE | BWT | Illumina, Roche | n | 1024bp

NGS technology is at an early stage of development, and the following years will bring improvements and novelties in this research area, along with a continuing stimulus for research in bioinformatics. Developers will be continuously challenged as new data types are presented by new and enhanced NGS platforms. The future development of analysis tools and management systems will have to incorporate information about sequencing errors, biases and genome polymorphism rates. The methods and their algorithmic implementations described above provide a first approach to the existing and upcoming challenges in the sequencing field.

Acknowledgments. We are thankful to Zlatko Trajanoski and Gernot Stocker from the Institute of Genomics and Bioinformatics, Graz, Austria, for their unselfish sharing of information, resources and experience in the field of bioinformatics.

References

1. Mardis, E.R.: The impact of next-generation sequencing technology on genetics. Trends Genet. 24, 133--141 (2008)
2. Mardis, E.R.: Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum. Genet. 9, 387--401 (2008)
3. Voelkerding, K., Dames, S.A., Durtschi, J.D.: Next-Generation Sequencing: From Basic Research to Diagnostics. Clin. Chem.
55, 641--658 (2009)
4. Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Tech. Report 124, Digital Equipment Corporation, Palo Alto, CA (1994)
5. Altschul, S.F., Gish, W., Miller, W. et al.: Basic local alignment search tool. J. Mol. Biol. 215, 403--410 (1990)
6. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147, 195--197 (1981)
7. Li, H., Ruan, J., Durbin, R.: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851--1858 (2008)
8. Li, R., Li, Y., Kristiansen, K., Wang, J.: SOAP: short oligonucleotide alignment program. Bioinformatics 24, 713--714 (2008)
9. Rumble, S.M. et al.: SHRiMP: accurate mapping of short color-space reads. PLoS Comput. Biol. 5, e1000386 (2009)
10. Ning, Z., Cox, A.J., Mullikin, J.C.: SSAHA: a fast search method for large DNA databases. Genome Res. 11, 1725--1729 (2001)
11. Smith, A.D., Xuan, Z., Zhang, M.Q.: Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics 9, 128 (2008)
12. Weese, D., Emde, A.K., Rausch, T., Doring, A., Reinert, K.: RazerS--fast read mapping with sensitivity control. Genome Res. 19, 1646--1654 (2009)
13. Jiang, H., Wong, W.H.: SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics 24, 2395--2396 (2008)
14. Lin, H., Zhang, Z., Zhang, M.Q., Ma, B., Li, M.: ZOOM! Zillions of oligos mapped. Bioinformatics 24, 2431--2437 (2008)
15. Homer, N., Merriman, B., Nelson, S.F.: BFAST: an alignment tool for large scale genome resequencing. PLoS One 4, e7767 (2009)
16. Ma, B., Tromp, J., Li, M.: PatternHunter: faster and more sensitive homology search. Bioinformatics 18, 440--445 (2002)
17. Rasmussen, K.R., Stoye, J., Myers, E.W.: Efficient q-gram filters for finding all epsilon-matches over a given length. J. Comput. Biol. 13, 296--308 (2006)
18. Li, H., Durbin, R.: Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform.
Bioinformatics 25, 1754--1760 (2009)
19. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009)
20. Li, R., Yu, C., Li, Y., Lam, T.W., Yiu, S.M., Kristiansen, K., Wang, J.: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25, 1966--1967 (2009)
21. Li, H., Durbin, R.: Fast and accurate long read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589--595 (2010)
22. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proceedings of the 41st Symposium on Foundations of Computer Science (FOCS 2000), Redondo Beach, CA, USA, 390--398 (2000)
23. Kärkkäinen, J.: Fast BWT in small space by blockwise suffix sorting. Theor. Comput. Sci. 387, 249--257 (2007)
24. Malhis, N., Butterfield, Y., Ester, M., Jones, S.J.M.: Slider--maximum use of probability information for alignment of short sequence reads and SNP detection. Bioinformatics 25, 6--13 (2009)