ICT Innovations 2010 Web Proceedings ISSN 1857-7288
Next-generation DNA sequencing technology,
challenges and bioinformatics approaches
for sequence alignment
Aleksandra Bogojeska, Slobodan Kalajdziski, Ljupco Kocarev
Department of Computer Science and Informatics,
Faculty of Electrical Engineering and Information Technologies,
Karpos II bb, 1000 Skopje
{aleksandra.bogojeska,skalaj,lkocarev}@feit.ukim.edu.mk
Abstract. The advent of high-throughput sequencing platforms has brought
bioinformatics to a new level. This so-called 'next-generation' sequencing
technology opened the doors of every research laboratory to experiments that
were previously unimaginable in scale or cost. As a result, novel research
areas have emerged, providing huge amounts of new data ready to be analyzed.
In parallel with this progress, a variety of tools designed for sequencing
data analysis has been published. Sequence alignment is the central challenge
in this analysis, providing the primary representative results of the
experiments. A few alignment methods and a diversity of tools have been
published and developed in recent years, all of them aiming to balance
performance and accuracy. This review presents the new NGS technologies and
platforms, surveys the current alignment approaches applied in data analysis,
and describes some commonly used implementations of these methods.
Keywords: next-generation sequencing, alignment algorithm, short reads
1 Introduction
The process of determining the order of the nucleotides in a molecule of DNA is
called DNA sequencing. Novel approaches to genome sequencing are changing
genomics research, rapidly supplanting the former Sanger method of sequencing.
This next-generation sequencing (NGS) technology offers low-cost, accurate DNA
sequencing, which has led to the appearance of new fields in bioinformatics and
molecular biology and to advances in existing ones. Research areas which have
emerged and benefit from the use of NGS are metagenomics, Single Nucleotide
Polymorphism (SNP) detection, gene expression, chromatin immunoprecipitation
(ChIP) sequencing, non-coding RNA discovery, de novo and ancient DNA sequencing
[1], [2]. Each of these fields is expected to provide new and significant
information for future development and research in genomics.
There are two fundamental computational analyses performed on the data after
sequencing: assembly and alignment. Assembly is essential for organisms without
a sequenced reference genome and is the process of joining the sequenced reads
into a whole genome sequence. Alignment, on the other hand, remains the
fundamental analysis that confirms the success of the experiment; this process
is analyzed in detail later. Both types of analysis end with graphical tools
for data viewing and representation.
M. Gusev (Editor): ICT Innovations 2010, Web Proceedings, ISSN 1857-7288
© ICT ACT – http://ictinnovations.org/2010, 2010
The paper is organized as follows. Section II presents the NGS technologies and
the sequencing platforms using them. Section III takes a closer look at the
sequence alignment methods and tools developed for NGS. Section IV concludes
the paper and gives future directions for NGS technology and for the
development of bioinformatics tools.
2 Next Generation Sequencing Technologies
Since 2004, when the first NGS platform, the Roche 454, was presented by 454
Life Sciences, many new approaches and producers of NGS platforms have
appeared. At the time of writing there are 8 producers of NGS platforms:
Applied Biosystems (www3.appliedbiosystems.com), Complete Genomics
(www.completegenomics.com), Helicos (www.helicosbio.com), Illumina
(www.illumina.com), Polonator (www.polonator.org), Roche (www.roche.com),
Pacific Biosciences (www.pacificbiosciences.com) and Ion Torrent
(www.iontorrent.com).
Each of the platforms embodies a complex symbiosis of enzymology, chemistry,
high-resolution optics, and hardware and software engineering. They are
characterized by short read lengths, massive volumes of generated data and
low-cost sequencing. The NGS platforms produce reads with lengths between 30
and 400 base pairs (bp) and generate 500 megabases (Mb) to 200 gigabases (Gb)
per run, compared to the Sanger method, where a maximum of 70 kilobases (kb)
per run is generated with read lengths of 900bp. These characteristics of the
listed platforms are presented in Table 1.
Table 1 NGS platforms and their features.

Platform            | Technology                                | Read len. (bp) | Data per run | Time per run | Accuracy
Roche 454           | Sequencing-by-synthesis                   | 400            | 500Mb        | 10h          | M: 99.5%, C: 99.99%
Illumina            | Reversible dye termination                | 2x100          | 150-200Gb    | 8 days       | M: 99.94%, C: 99.999%
ABI SOLiD           | Sequencing by ligation                    | 50             | 18-35Gb      | 6-7 days     | 99.95%
Complete Genomics   | Sequencing by hybridization and ligation  | 70             | 100Gb        | 1.5 days     | 99.9999%
Pacific Biosciences | Sequencing-by-synthesis (single mol.)     | ~1000 (10 bases per sec) | 8Gb | 15min       | /
Helicos             | Sequencing-by-synthesis (single mol.)     | 25-55          | 21-28Gb      | 8 days       | C: 99.995%
Polonator           | Polony sequencing by ligation             | 26             | 4-5Gb        | 4 days       | 98%
Sanger              | Capillary sequencing                      | 800            | 70kb         | 3h           | 99.9%

M – measured accuracy, produced by the sequencing platform
C – consensus accuracy, gained by the alignment tool
The platforms mainly differ in their approach to DNA manipulation: they either
amplify the DNA molecule or sequence a single DNA molecule. In NGS,
amplification of the molecules is done in vitro using the polymerase chain
reaction (PCR); parallelism is achieved by executing PCR on multiple individual
molecules of DNA. This method is used in the Illumina, Roche 454, Polonator,
Ion Torrent and ABI SOLiD platforms. At present, only the Helicos and Pacific
Biosciences instruments are regarded as 'single molecule' sequencers. It is
expected that single-molecule sequencing will produce read lengths of thousands
of bases, which will allow simplified and improved data analysis.
The methodologies used in the NGS platforms differ in how nucleotides are
detected and reported. Roche platforms use the pyrosequencing method combined
with emulsion PCR, Illumina platforms use reversible dye terminators to
distinguish bases, and ABI platforms use the sequencing-by-ligation method. The
new Ion Torrent platform uses a related sequencing-by-synthesis approach, in
which the hydrogen ion released on base incorporation is detected by a special
semiconductor chip. Detailed reviews of the chemistry behind these platforms
appear elsewhere [1], [2], [3], as well as on the companies' websites, which
carry up-to-date information for each of the platforms.
3 Alignment, methods and tools used for NGS data analysis
Alignment is the process of determining the source of a sequenced DNA read. The
read can be mapped against a given reference genome, or against multiple
genomes of the species the sequence has come from. The alignment process can
also be applied to other genomes, assuming that the evolutionary distance
between the species of the reads and the genome is appropriate.
The foregoing NGS technologies used for DNA sequencing, and their specific
characteristics, have led to the development of new alignment tools. BLAST, the
gold-standard tool used for analyzing sequencing data produced by the Sanger
technology, and the former generation of programs, align nucleotide or protein
sequences by searching through large databases to find the best-matched
sequences [5]. The new alignment tools, on the other hand, perform alignment
against the genome reference of the species of interest. These tools also have
to deal with many different technologies, each with a unique error model that
has to be reflected in the algorithm design, and with the species-specific rate
of polymorphisms, which has to be factored into the expected number of
mismatches during the alignment.
These design assumptions result in faster algorithms capable of accurately
processing the massive data volumes produced by the NGS technologies.
In recent years there has been a growing number of implementations of tools
that perform short-read alignment, but the number of distinct methodologies
behind them is much smaller. The alignment tools can be grouped by methodology
into three categories: hash table-based algorithms, algorithms based on suffix
trees and their modifications, and merge-sort-based algorithms. There is only
one implementation in the third category, the Slider tool [24], so this review
focuses on the first two techniques. The first discussed method is the hash
table-based implementation, where there are two possible ways of index
creation: using the reference genome, or using the set of sequenced reads. The
Burrows-Wheeler transform (BWT)-based algorithms are presented next; these
first create an efficient index of the reference genome, which later allows a
fast search with a low memory footprint.
In order to map a sequence accurately, alignment programs follow a multistep
algorithm. In the first step they use heuristic techniques to find the most
likely places in the reference where the read can be mapped. Then, on this
smaller subset of possible mapping locations, more accurate and sensitive
algorithms are run, like the Smith-Waterman local alignment algorithm [6] and
its modifications, giving the top n places where the read maps against the
reference.
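The second, sensitive step can be illustrated with a minimal Smith-Waterman sketch in Python; the scoring values (match +2, mismatch -1, gap -2) are illustrative assumptions, not those of any particular tool:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

A read that occurs as an exact substring of the candidate region scores its full length times the match reward, which is why the heuristic first step can afford to pass only a few candidate locations to this quadratic-time refinement.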
3.1 Hash-based alignment methods
The first alignment tools developed and presented for NGS short reads used the
same methodology for creating the search index as the BLAST [5] generation of
algorithms: a hash table. The hash table is created from the input query data
and is used for structuring the index and scanning through the database
sequences. This method is appropriate for DNA sequencing data, where duplicate
sequences are common and not all possible combinations of nucleotides are
likely to be present. Those two features of the data match the ability of hash
tables to index complex and non-sequential data.
The hash table index can be created from the reference genome or from the input
reads; the difference between the approaches lies in the tradeoff between
memory and time. Hash tables created from the reference genome have a constant
memory requirement regardless of the size of the input reads; this memory is
usually large, depending on the size of the reference genome. Hash tables based
on the sequenced reads usually have smaller, variable memory requirements, but
processing is slower because the entire reference genome must be scanned
against every input read; here the memory depends on the number and diversity
of the reads.
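A minimal sketch of the reference-indexed variant, assuming a simple exact k-mer lookup (the function names are illustrative, not taken from any tool):

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def candidate_positions(index, read, k):
    """Seed step: positions where the read's first k-mer occurs exactly."""
    return index.get(read[:k], [])
```

The index size here is fixed by the reference, as described above; the read-indexed variant would instead hash the reads and stream the reference past them, trading memory for scan time.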
Fig. 1 Hash table-based method. The regions used for seeding are marked with
capital letters and the matched reads for the seeds 100 and 110 are shown.
Algorithms that use the hash table-based method are: MAQ [7], SOAP [8], SHRiMP
[9], SSAHA2 [10], RMAP [11], RazerS [12], SeqMap [13], ZOOM [14], BFAST [15],
MOSAIK (http://bioinformatics.bc.edu/marthlab/Mosaik) and Illumina's commercial
ELAND algorithm.
These hash table-based algorithms implement the spaced seed paradigm [16] to
allow mismatches and gaps in the alignment. A template of the form, for
example, '101101' represents a seed that requires 4 matches at the '1'
positions and allows 2 mismatches at the '0' positions. In Fig. 1 there are two
seeds, 100 and 110, where the capital letters in the input reads represent the
seed region and, in the hash index, the matched seeds. An improvement of the
spaced seed policy is the q-gram approach [17], where multiple spaced seeds per
read are used for gap detection. Below is a short overview of the MAQ and SOAP
alignment tools, their hash index implementations, their mismatch handling, and
the algorithms' advantages and disadvantages.
MAQ
MAQ [7] (Mapping/Alignment Quality) is an alignment and variant calling tool
that uses a hash table for data indexing. A characteristic feature of MAQ is
its consideration of the quality scores (Phred values) of the reads; according
to this information and the number of mismatches, it assigns a mapping quality
value to each mapped read. This feature helps in distinguishing between
platform error and a true SNP. The algorithm always reports one alignment, and
a repeated read is aligned randomly, with a reported mapping quality of zero.
To implement the search, MAQ uses multiple hash tables created from the input
reads, whose number depends on the number of mismatches allowed through the
alignment. By default, six hash tables are created, allowing up to 2 mismatches
in a read using the spaced seed technique. The number of mismatches can be
greater, increasing the number of hash tables and the processing time. MAQ
guarantees to find alignments with up to two mismatches in the first 28bp of
the reads. It requires little memory, less than 1Gb per processor, but a longer
time to achieve accurate results. Along with the alignment, MAQ provides many
other possibilities for short-read data analysis, such as format conversion,
SNP detection and an alignment viewer.
MAQ is capable of processing Illumina and ABI SOLiD reads, single or mate pair,
no longer than 63bp, which is the greatest disadvantage of the algorithm, as
each of the technologies aims to produce longer reads.
SOAP
SOAP [8] (Short Oligonucleotide Analysis Package) is another alignment tool
designed for short-read mapping. It is also hash table-based, but does not use
the quality information for alignment. To perform the alignment, SOAP uses a
seed-and-hash-table search algorithm. Unlike in MAQ, here the hash table index
is created from the reference sequences. The input reads and the reference
genome are encoded as numeric data using 2 bits per base; this value is used as
a key when searching the look-up table, in order to find the number of
differing bases. SOAP reports the alignments with the minimal number of
mismatches or the smallest gap; by default, 2 mismatches and gaps between 1 and
3bp are allowed. Allowing 2 mismatches is achieved by splitting the read into
four fragments; any two fragments may contain the mismatches, so all 6
combinations of two fragments are taken as seeds. To remove the contaminated
end regions of Illumina reads, SOAP can trim a few base pairs from the ends of
the reads. To reduce memory, SOAP loads the reference sequences as an unsigned
3-byte array.
SOAP is also capable of mate-pair alignment, but not of color space alignment.
It is intended for the alignment of short reads up to 60bp. Additional features
of the algorithm are specific alignment modes for mRNA and small RNA reads.
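The split-read seeding described above can be sketched as follows; with 4 fragments and at most 2 mismatch-carrying fragments there are C(4,2) = 6 fragment pairs to enumerate (names are illustrative):

```python
from itertools import combinations

def split_read_seed_pairs(read, pieces=4):
    """Split the read into equal fragments and enumerate the fragment pairs.

    If at most 2 of the 4 fragments may carry the mismatches, the other two
    must match exactly, so each of the C(4,2) = 6 pairs yields one seed
    combination to try against the index.
    """
    n = len(read) // pieces
    fragments = [read[i * n:(i + 1) * n] for i in range(pieces)]
    pairs = list(combinations(range(pieces), 2))
    return fragments, pairs
```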
3.2 Suffix array, FM-index and Burrows-Wheeler transform alignment methods
This category of alignment algorithms has been developed in recent years and
connects suffix array methods with the special FM-index. The connection is made
through a data structure originating from compression theory, the
Burrows-Wheeler transform. The combination of these methods gives algorithms
that perform read matching 10 times faster than the hash table-based aligners
[18], [19].
A suffix array is a data structure that contains all suffixes of a string in
sorted order, designed for efficient searching of a large text. The
Burrows-Wheeler transform [4] is a reversible string transformation used in
compression; from the transformed string the input string can easily be
decoded. A sentinel character is used to mark the string end. A table of all
rotations of the string is created and sorted; the last column of the sorted
matrix is the BWT string, Fig. 2. Afterwards, using the LF (last-to-first)
mapping, the initial text can be recovered. These features make the BWT
suitable for genetic data, where the strings are drawn from a four-letter
alphabet. The FM (Full-text, Minute space) index [22] is a data structure based
on the BWT that provides efficient data indexing. Its creators, Ferragina and
Manzini, propose an index consisting of three parts: a superbuckets section
(SB), a bucket directory (BD), and the body of the FM-index. The superbuckets
section stores the number of occurrences of every character in the previous
SBs, the bucket directory stores the starting position of each compressed
bucket in the body of the FM-index, and the body stores a compressed image of
each bucket. The data structure creation includes two steps. The first step
performs the BWT on the reference genome; this process is reversible, and
within this step sequences that occur multiple times appear together in the
data structure. The second, memory-intensive step is the final index creation
given by the FM-index structure. The resulting index supports faster searching
while being the same size as or smaller than the input genome. For the human
genome the index requires approximately 2-5GB, depending on the other
algorithmic techniques implemented [18-20]. This small index size allows
storing the index on disk and loading it into memory on any standard computing
cluster.
Fig. 2 BWT construction for the string GOOGOL [30]
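The construction in Fig. 2 can be reproduced with a short sketch (a naive rotation sort; production tools build the BWT via suffix sorting instead):

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform: sort all rotations of text + sentinel
    and read off the last column of the sorted rotation matrix."""
    s = text + sentinel                      # sentinel marks the string end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

# bwt("GOOGOL") groups equal characters together: "LO$OOGG"
```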
For BWT-based algorithms, the tradeoff between speed and sensitivity lies in
the number of mismatches allowed. Whereas the hash table-based methods had
seeds, here there is no equally efficient way of dealing with mismatches; each
algorithm employs its own method for detecting and managing mismatches and
gaps. As the sequencing technology develops and becomes more accurate, this
limitation will become less important for species with a low rate of
polymorphisms. Next, a short description is given of three BWT-based methods,
the founders of this generation of algorithms.
BWA and BWA-SW
These two algorithms were published by the developer of the MAQ aligner,
exploiting the speed-up that the BWT gives. The BWA [18] version is intended
for short reads up to 200bp in length, and the enhanced version BWA-SW [21] for
long reads up to 100kbp. Both use a suffix array combined with the BWT and the
FM-index. As in MAQ, a mapping quality is reported for each aligned read.
The BWA algorithm uses a few other concepts for aligning Illumina platform
reads, considering their characteristics: ambiguous bases, paired-end mapping,
determining the maximum allowed number of mismatches, and generating mapping
quality scores. For exact matching, the FM-index backward search is used. For
inexact matching, a breadth-first search (BFS) with a heap-like structure for
tracking differences between strings is used; this algorithm finds all
alignments within a given number n of differences. Acceleration is achieved
with an iterative strategy over the top repetitive intervals and the unique top
intervals with n differences. BWA is 6-18 times faster than MAQ, but reports a
slightly higher error rate, illustrating the tradeoff between speed and
accuracy.
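The exact-matching backward search mentioned above can be sketched over a plain BWT string; real FM-indices replace the naive counting below with compressed rank structures, so this is only an illustration of the interval-narrowing idea:

```python
def backward_search(bwt, pattern):
    """Count occurrences of pattern in the original text, given its BWT."""
    alphabet = sorted(set(bwt))
    # C[c]: number of characters in the text lexicographically smaller than c
    C = {c: sum(bwt.count(d) for d in alphabet if d < c) for c in alphabet}
    lo, hi = 0, len(bwt)                 # suffix-array interval [lo, hi)
    for c in reversed(pattern):          # extend the match one base leftwards
        if c not in C:
            return 0
        lo = C[c] + bwt[:lo].count(c)    # rank of c before lo
        hi = C[c] + bwt[:hi].count(c)    # rank of c before hi
        if lo >= hi:
            return 0                     # interval empty: no occurrence
    return hi - lo

# "LO$OOGG" is the BWT of "GOOGOL$"; the pattern "GO" occurs twice.
```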
NGS platform development is moving towards producing reads of 500-1000bp in
order to increase the read length. The limitations of BWA in dealing with
longer data, and with data with high error rates, led to the development of the
BWA-SW algorithm. This algorithm is the only one that can process reads longer
than 1kb using the BWT. Compared to short-read alignment, where full-length
alignment of the read is required, long-read alignment prefers local matches,
because these reads are more likely to contain a large number of mismatches and
gaps. BWA-SW builds FM-indices for both the reference genome and the read. The
reference genome is represented as a prefix trie, and the read as a prefix
directed acyclic word graph (DAWG); using dynamic programming, the best local
match is found. Mismatches and gaps are handled using seed templates between
the two FM-indices. This algorithm reports 5-20 times faster processing
compared to SSAHA2 and BLAST, with the same accuracy as the SSAHA2 alignment
algorithm.
BOWTIE
Bowtie [19] is the first published short-read alignment algorithm that uses the
BWT technique. It uses the index structure proposed by Kärkkäinen [23],
combined with the FM-index structure, and can be configured to trade off
between speed and memory. For exact alignment, pure LF mapping is used. To
handle sequencing errors and polymorphisms, Bowtie introduces two extensions: a
quality-aware backtracking algorithm and a double indexing strategy. The first
extension allows mismatches and favors high-quality alignments, using a greedy
depth-first search (DFS) to find the best alignment with n mismatches. The
second extension avoids excessive backtracking for reads with low quality
values. Using the MAQ strategy, a mapping quality is reported, too. A
disadvantage is that Bowtie is not able to report gapped alignments, i.e. to
detect insertions and deletions. Recent Bowtie versions are able to perform
paired-end and color space alignment and to process reads up to 1024bp in
length.
SOAP2
SOAP2 [20] is the improved version of the SOAP alignment algorithm, with
reduced memory usage and increased alignment speed. These improvements are
achieved by using a BWT compression index instead of the hash table-based seed
algorithm. It uses the proposed FM data structure; for example, hashing 13-mers
splits the reference genome into 2^26 blocks, and in a few iterations the exact
location of the search string can be found inside a block. To deal with
mismatches, a 'split-read strategy' is used, the same as in the SOAP algorithm:
the read is split into fragments in order to detect mismatches. As in the
previous version of the algorithm, the best alignment reported is the one with
the minimal number of mismatches and gaps. An improvement is made in the
maximum length of the input reads: the algorithm can now handle reads up to
1024bp. Using the BWT, this algorithm is 12 times faster than the first version
in the index-creation step for the human genome, and 20 times faster for the
same amount of reads aligned. The SOAP group of algorithms is evolving, and
there are companion algorithms for de novo assembly and SNP detection.
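The 2-bit base encoding shared by SOAP and SOAP2 can be sketched as follows; a 13-mer then occupies 26 bits, giving the 2^26 hash blocks mentioned above (the coding A=0, C=1, G=2, T=3 is one conventional assumption, not necessarily the tool's):

```python
def encode_2bit(seq):
    """Pack a DNA string into an integer using 2 bits per base."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    value = 0
    for base in seq:
        value = (value << 2) | code[base]   # shift in 2 bits per base
    return value

# A 13-mer key ranges over 4**13 == 2**26 possible values (blocks).
```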
4 Conclusion and NGS future
Bioinformatics and genomics science have shifted as a result of the advancement
of NGS technologies. New projects inspired by NGS, such as the 1000 Genomes
Project (http://1000genomes.org) and HapMap (http://hapmap.org), will provide
significant information on genome structure and function, and open new
perspectives in the treatment of specific diseases and cancer. The
bioinformatics community has been intrigued and responds quickly to the
challenges coming from the NGS projects. As a consequence, the existing
alignment methods have been improved and adjusted to deal with the massive
volumes of short-read data. The competition between alignment tools is still
running, and there is as yet no answer to the question of which tool is the
most accurate and suitable to use while making the most effective use of
computational resources. Table 2 gives an overview of the presented alignment
tools.
The current NGS improvement, the production of longer reads, requires the
modification of many developed short-read analysis tools or the development of
new ones. Long reads will be primarily used when de novo assembly is performed,
or when genomes with a highly repetitive structure are sequenced. Short-read
sequencing, on the other hand, will play the main role when sequencing of
specific small regions is required, as in ChIP sequencing, RNA sequencing and
SNP detection.
Table 2 NGS alignment tools and their features.

Tool   | Method | Platform            | Indels | Read Len.
MAQ    | Hash   | Illumina, ABI SOLiD | y      | 63bp
SOAP   | Hash   | Illumina            | y      | 60bp
BWA    | BWT    | Illumina, ABI SOLiD | y      | 200bp
SOAP2  | BWT    | Illumina            | y      | 1024bp
BWA-SW | BWT    | All                 | y      | 100kbp
BOWTIE | BWT    | Illumina, Roche     | n      | 1024bp
NGS technology is at an early stage of development, and the following years
will bring improvements and novelties in this research area, as well as a
continuing stimulus for research in bioinformatics. Developers will be
continuously challenged as new data types are presented by new and enhanced NGS
platforms. The future development of analysis tools and management systems will
have to incorporate information about sequencing errors, biases and genome
polymorphism rates. The methods and their algorithmic implementations described
above provide a first approach to the existing and upcoming challenges in the
sequencing field.
Acknowledgments. We are thankful to Zlatko Trajanoski and Gernot Stocker from
the Institute of Genomics and Bioinformatics, Graz, Austria, for their unselfish
sharing of information, resources and experience in the field of bioinformatics.
References
1. Mardis, E.R.: The impact of next-generation sequencing technology on genetics. Trends
Genet. 24, 133--141 (2008)
2. Mardis, E.R.: Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum.
Genet. 9, 387--401 (2008)
3. Voelkerding, K., Dames, S.A., Durtschi, J.D.: Next-Generation Sequencing: From Basic
Research to Diagnostics. Clin. Chem. 55, 641--658 (2009)
4. Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Tech.
Report 124, Digital Equipment Corporation, Palo Alto, CA (1994)
5. Altschul, S.F., Gish, W., Miller, W. et al.: Basic local alignment search tool. J. Mol. Biol.
215, 403--410 (1990)
6. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol.
Biol. 147, 195--197 (1981)
7. Li, H., Ruan, J., Durbin, R.: Mapping short DNA sequencing reads and calling variants
using mapping quality scores. Genome Res. 18, 1851--1858 (2008)
8. Li, R., Li, Y., Kristiansen, K., Wang, J.: SOAP: short oligonucleotide alignment program.
Bioinformatics 24, 713--714 (2008)
9. Rumble, S.M. et al.: SHRiMP: accurate mapping of short color-space reads. PLoS Comput.
Biol. 5, e1000386 (2009)
10. Ning, Z., Cox, A.J., Mullikin, J.C.: SSAHA: a fast search method for large DNA databases.
Genome Res. 11, 1725--1729 (2001)
11. Smith, A.D., Xuan, Z., Zhang, M.Q.: Using quality scores and longer reads improves
accuracy of Solexa read mapping. BMC Bioinformatics 9, 128 (2008)
12. Weese, D., Emde, A.K., Rausch, T., Döring, A., Reinert, K.: RazerS--fast read mapping
with sensitivity control. Genome Res. 19, 1646--1654 (2009)
13. Jiang, H., Wong, W.H.: SeqMap: mapping massive amount of oligonucleotides to the
genome. Bioinformatics 24, 2395--2396 (2008)
14. Lin, H., Zhang, Z., Zhang, M.Q., Ma, B., Li, M.: ZOOM! Zillions of oligos mapped.
Bioinformatics 24, 2431--2437 (2008)
15. Homer, N., Merriman, B., Nelson, S.F.: BFAST: an alignment tool for large scale genome
resequencing. PLoS ONE 4, e7767 (2009)
16. Ma, B., Tromp, J., Li, M.: PatternHunter: faster and more sensitive homology search.
Bioinformatics 18, 440--445 (2002)
17. Rasmussen, K.R., Stoye, J., Myers, E.W.: Efficient q-gram filters for finding all
epsilon-matches over a given length. J. Comput. Biol. 13, 296--308 (2006)
18. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler
transform. Bioinformatics 25, 1754--1760 (2009)
19. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L.: Ultrafast and memory-efficient
alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009)
20. Li, R., Yu, C., Li, Y., Lam, T.W., Yiu, S.M., Kristiansen, K., Wang, J.: SOAP2: an
improved ultrafast tool for short read alignment. Bioinformatics 25, 1966--1967 (2009)
21. Li, H., Durbin, R.: Fast and accurate long read alignment with Burrows-Wheeler
transform. Bioinformatics 26, 589--595 (2010)
22. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In:
Proceedings of the 41st Symposium on Foundations of Computer Science (FOCS 2000),
Redondo Beach, CA, USA, 390--398 (2000)
23. Kärkkäinen, J.: Fast BWT in small space by blockwise suffix sorting. Theor. Comput.
Sci. 387, 249--257 (2007)
24. Malhis, N., Butterfield, Y., Ester, M., Jones, S.J.M.: Slider--maximum use of probability
information for alignment of short sequence reads and SNP detection. Bioinformatics 25,
6--13 (2009)