Algorithms for Next-Generation Sequencing

  1. 350 pages
  2. English

About This Book

Advances in sequencing technology have allowed scientists to study the human genome in greater depth and on a larger scale than ever before – as many as hundreds of millions of short reads in the course of a few days. But what are the best ways to deal with this flood of data?

Algorithms for Next-Generation Sequencing is an invaluable tool for students and researchers in bioinformatics and computational biology, and for biologists seeking to process and manage the data generated by next-generation sequencing. It can serve as a textbook or a self-study resource. In addition to offering an in-depth description of the algorithms for processing sequencing data, it also presents useful case studies describing the applications of this technology.

Algorithms for Next-Generation Sequencing is written by Wing-Kin Sung and is available in PDF and/or ePUB format.

Information

Year
2017
ISBN
9781498752985
Edition
1

Chapter 1

Introduction

DNA stands for deoxyribonucleic acid. It was first discovered in 1869 by Friedrich Miescher [58]. However, it was not until 1944 that Avery, MacLeod and McCarty [12] demonstrated that DNA is the major carrier of genetic information, not protein. In 1953, James Watson and Francis Crick discovered the basic structure of DNA, which is a double helix [310]. After that, people started to work on DNA intensively.
DNA sequencing sprang to life in 1972, when Frederick Sanger (at the University of Cambridge, England) began to work on the genome sequence using a variation of the recombinant DNA method. The full DNA sequence of a viral genome (bacteriophage ϕX174) was completed by Sanger in 1977 [259, 260]. Based on the power of sequencing, Sanger established genomics, which is the study of the entirety of an organism’s hereditary information, encoded in DNA (or RNA for certain viruses). Note that it is different from molecular biology or genetics, whose primary focus is to investigate the roles and functions of single genes.
Over the last few decades, DNA sequencing has improved rapidly. We can now sequence a whole human genome within a day and compare the genomes of multiple individuals. This book is devoted to understanding the bioinformatics issues related to DNA sequencing. In this introduction, we briefly review DNA, RNA and protein. Then, we describe various sequencing technologies. Lastly, we describe the applications of sequencing technologies.

1.1 DNA, RNA, protein and cells

Deoxyribonucleic acid (DNA) is used as the genetic material (with the exception that certain viruses use RNA as the genetic material). The basic building block of DNA is the DNA nucleotide. There are 4 types of DNA nucleotides: adenine (A), guanine (G), cytosine (C) and thymine (T). The DNA nucleotides can be chained together to form a strand of DNA. Each strand of DNA is asymmetric: it begins at the 5′ end and ends at the 3′ end.
FIGURE 1.1: The double-stranded DNA. The two strands show a complementary base pairing.
When two opposing DNA strands satisfy the Watson-Crick rule, they can be interwoven by hydrogen bonds to form a double-stranded DNA. The Watson-Crick rule (or complementary base pairing rule) requires that the two nucleotides at each position in the opposing strands be a complementary base pair, that is, they must be an (A, T) pair or a (C, G) pair. (Note that A = T and C ≡ G are bound with the help of two and three hydrogen bonds, respectively.) Figure 1.1 gives an example of a double-stranded DNA. One strand is ACGTAGCT while the other strand is its reverse complement, i.e., AGCTACGT.
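The reverse complement used in this example can be computed mechanically. The following is a minimal Python sketch, assuming strands are written 5′ to 3′ as uppercase strings over A, C, G and T; the names COMPLEMENT, reverse_complement and can_pair are illustrative and do not come from the book.

```python
# Complementary base pairs under the Watson-Crick rule.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the opposing strand, written in its own 5'-to-3' direction."""
    # Reverse the strand, then swap each base for its complementary partner.
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def can_pair(strand_a: str, strand_b: str) -> bool:
    """True if two 5'-to-3' strands pair perfectly along their full length."""
    return strand_b == reverse_complement(strand_a)


print(reverse_complement("ACGTAGCT"))  # AGCTACGT, the second strand in Figure 1.1
```

The reversal is needed because the two strands run in opposite directions, so the base that pairs with the first nucleotide of one strand is the last nucleotide of the other.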
Double-stranded DNA is located in the nucleus (and mitochondria) of every cell. A cell can contain multiple pieces of double-stranded DNA, each of which is called a chromosome. As a whole, the collection of chromosomes is called a genome; the human genome consists of 23 pairs of chromosomes, and its total length is roughly 3 billion base pairs.
The genome provides the instructions for the cell to perform daily life functions. Through the process of transcription, the machine RNA polymerase transcribes genes (the basic functional units) in our genome into transcripts (or RNA molecules). This process is known as gene expression. The complete set of transcripts in a cell is denoted as its transcriptome.
Each transcript is a chain of 4 different ribonucleic acid (RNA) nucleotides: adenine (A), guanine (G), cytosine (C) and uracil (U). The main difference between the DNA nucleotide and the RNA nucleotide is that the RNA nucleotide has an extra OH group. This extra OH group enables the RNA nucleotide to form more hydrogen bonds. Transcripts are usually single stranded instead of double stranded.
There are two types of transcripts: non-coding RNA (ncRNA) and messenger RNA (mRNA). ncRNAs are transcripts that are not translated into proteins. They can be classified into transfer RNAs (tRNAs), ribosomal RNAs (rRNAs), short ncRNAs (of length < 30 bp, including miRNA, siRNA and piRNA) and long ncRNAs (of length > 200 bp; examples include Xist and HOTAIR).
mRNA is the intermediate between DNA and protein. Each mRNA consists of three parts: a 5′ untranslated region (5′ UTR), a coding region and a 3′ untranslated region (3′ UTR). The length of the coding region is a multiple of 3. It is a sequence of triplets of nucleotides called codons. Each codon corresponds to an amino acid.
Through translation, the machine ribosome translates each mRNA into a protein, which is the sequence of amino acids corresponding to the sequence of codons in the mRNA. Proteins fold into complex 3D structures. Each protein is a biological nanomachine that performs a specialized function. For example, enzymes are proteins that work as catalysts to promote chemical reactions for generating energy or digesting food. Other proteins, called transcription factors, interact with the genome to turn transcription on or off. Through the interaction among DNA, RNA and protein, our genome dictates which cells should grow, when cells should die, how cells should be structured, and how the various body parts are formed.
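The path from a coding region to a protein can be sketched in a few lines of Python. This is a simplified illustration, not the book's method: it assumes the standard genetic code, treats the mRNA as a copy of the coding DNA strand with U in place of T, and stops at the first stop codon (stop codons are not discussed above). The names transcribe, codons and translate are hypothetical.

```python
from itertools import product

# Standard genetic code, listed with bases in the order U, C, A, G
# (third codon position varying fastest). "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """Simplified transcription: copy the coding strand, replacing T with U."""
    return coding_strand.replace("T", "U")


def codons(coding_region: str) -> list[str]:
    """Split a coding region into consecutive triplets of nucleotides."""
    if len(coding_region) % 3 != 0:
        raise ValueError("coding region length must be a multiple of 3")
    return [coding_region[i:i + 3] for i in range(0, len(coding_region), 3)]


def translate(coding_region: str) -> str:
    """Map each codon to its amino acid (one-letter code) until a stop codon."""
    protein = []
    for codon in codons(coding_region):
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # assumption: translation ends at the first stop
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCTTAA")
print(mrna, codons(mrna), translate(mrna))
```

Here the nine-nucleotide coding region splits into the codons AUG, GCU and UAA, which correspond to methionine (M), alanine (A) and a stop signal.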
All cells in our body are developed from a single cell through cell division. When a cell divides, the double helix genome is separated into single-stranded DNA molecules. An enzyme called DNA polymerase uses each single-stranded DNA molecule as the template to replicate the genome into two identical double helixes. By this replication process, all cells within the same individual will have the same genome. However, due to errors in copying, some variations (called mutations) might happen in some cells. Those variations or mutations may cause diseases such as cancer.
Different individuals have similar genomes, but they also have genome variations that contribute to different phenotypes. For example, the colors of our hair and eyes are controlled by differences in our genomes. By studying and comparing the genomes of different individuals, researchers develop an understanding of the factors that cause different phenotypes and diseases. Such knowledge ultimately helps to gain insights into the mystery of life and contributes to improving human health.

1.2 Sequencing technologies

DNA sequencing is a process that determines the order of the nucleotide bases. It translates the DNA of a specific organism into a format that is decipherable by researchers and scientists. DNA sequencing has allowed scientists to better understand genes and their roles within our body. Such knowledge has become indispensable for understanding biological processes, as well as in application fields such as diagnostic or forensic research. The advent of DNA sequencing has significantly accelerated biological research and discovery.
To facilitate genomics studies, we need to sequence the genomes of different species or different individuals. A number of sequencing technologies have been developed over the last few decades. Roughly speaking, the development of sequencing technologies consists of three phases:
• First-generation sequencing: Sequencing based on chemical degradation and gel electrophoresis.
• Second-generation sequencing: Sequencing many DNA fragments in parallel. It has higher yield and lower cost, but shorter reads.
• Third-generation sequencing: Sequencing a single DNA molecule without the need to halt between read steps.
In this section, we will discuss the three phases in detail.

1.3 First-generation sequencing

Sanger and Coulson proposed first-generation sequencing in 1975 [259, 260]. It enables us to sequence a DNA template of length 500 to 1000 bases within a few hours. The detailed steps are as follows (see Figure 1.3).
1. Amplify the DNA template by cloning.
2. Generate all possible prefixes of the DNA template.
3. Separate the fragments by electrophoresis.
4. Read out the sequence with fluorescent tags.
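The logic behind steps 2 to 4 can be illustrated with a toy Python model. It is only a sketch under strong assumptions: exactly one error-free fragment exists for every prefix length, electrophoresis is modeled as sorting fragments by length, and the fluorescent tag is modeled as reading the last base of each fragment. The function names all_prefixes and read_by_length are hypothetical.

```python
import random


def all_prefixes(template: str) -> list[str]:
    """Step 2 (idealized): one fragment for every prefix of the template."""
    return [template[:k] for k in range(1, len(template) + 1)]


def read_by_length(fragments: list[str]) -> str:
    """Steps 3 and 4 (idealized): order fragments by length, read each last base."""
    ordered = sorted(fragments, key=len)  # stands in for electrophoresis
    return "".join(fragment[-1] for fragment in ordered)  # stands in for the tag readout


template = "ACGTAGCT"
fragments = all_prefixes(template)
random.shuffle(fragments)  # the fragments start out mixed together
print(read_by_length(fragments))  # ACGTAGCT
```

Because the prefix of length k ends with the k-th base of the template, reading the terminal bases in order of increasing fragment length recovers the template itself.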
Step 1 amplifies the DNA template. The DNA template is inserted into the plasmid vector; then the plasmid vector is inserted into the host cells for cloning. By growing the host cells, we obtain many copies of the same DNA template.
Step 2 generates all possible prefixes of the DNA template. Two techniques have been proposed for this step: (1) the Maxam-Gilbert technique [194] and (2) the chain termination methodology (Sanger method) [259, 260]. The Maxam-Gilbert technique relies on cleaving nucleotides with chemicals. Four different chemicals are used and generate all sequences ending with A, C, G a...

Table of contents

  1. Cover
  2. Half Title
  3. Title Page
  4. Copyright Page
  5. Table of Contents
  6. Preface
  7. 1 Introduction
  8. 2 NGS file formats
  9. 3 Related algorithms and data structures
  10. 4 NGS read mapping
  11. 5 Genome assembly
  12. 6 Single nucleotide variation (SNV) calling
  13. 7 Structural variation calling
  14. 8 RNA-seq
  15. 9 Peak calling methods
  16. 10 Data compression techniques used in NGS files
  17. References
  18. Index