eBook - ePub

Algorithms for Next-Generation Sequencing

Name: Algorithms for Next-Generation Sequencing
ISBN: 9781498752985

Wing-Kin Sung

350 pages
English

About this book

Advances in sequencing technology have allowed scientists to study the human genome in greater depth and on a larger scale than ever before – as many as hundreds of millions of short reads in the course of a few days. But what are the best ways to deal with this flood of data?

Algorithms for Next-Generation Sequencing is an invaluable tool for students and researchers in bioinformatics and computational biology and for biologists seeking to process and manage the data generated by next-generation sequencing. It can also serve as a textbook or a self-study resource. In addition to offering an in-depth description of the algorithms for processing sequencing data, it presents useful case studies describing the applications of this technology.

Chapter 1 Introduction

DNA stands for deoxyribonucleic acid. It was first discovered in 1869 by Friedrich Miescher [58]. However, it was not until 1944 that Avery, MacLeod and McCarty [12] demonstrated that DNA, not protein, is the major carrier of genetic information. In 1953, James Watson and Francis Crick discovered the basic structure of DNA, which is a double helix [310]. After that, research on DNA intensified.

DNA sequencing sprang to life in 1972, when Frederick Sanger (at the University of Cambridge, England) began to work on the genome sequence using a variation of the recombinant DNA method. The full DNA sequence of a viral genome (bacteriophage ϕX174) was completed by Sanger in 1977 [259, 260]. Based on the power of sequencing, Sanger established genomics,¹ which is the study of the entirety of an organism’s hereditary information, encoded in DNA (or RNA for certain viruses). Note that it is different from molecular biology or genetics, whose primary focus is to investigate the roles and functions of single genes.

Over the last few decades, DNA sequencing has improved rapidly. We can now sequence the whole human genome within a day and compare the genomes of multiple individuals. This book is devoted to understanding the bioinformatics issues related to DNA sequencing. In this introduction, we briefly review DNA, RNA and protein. Then, we describe various sequencing technologies. Lastly, we describe the applications of sequencing technologies.

1.1 DNA, RNA, protein and cells

Deoxyribonucleic acid (DNA) is used as the genetic material (with the exception that certain viruses use RNA as the genetic material). The basic building block of DNA is the DNA nucleotide. There are 4 types of DNA nucleotides: adenine (A), guanine (G), cytosine (C) and thymine (T). The DNA nucleotides can be chained together to form a strand of DNA. Each strand of DNA is asymmetric: it begins at the 5′ end and ends at the 3′ end.

**FIGURE 1.1:** The double-stranded DNA. The two strands show a complementary base pairing.

When two opposing DNA strands satisfy the Watson-Crick rule, they can be interwoven by hydrogen bonds to form a double-stranded DNA. The Watson-Crick rule (or complementary base pairing rule) requires that the two nucleotides in opposing strands be a complementary base pair, that is, they must be an (A, T) pair or a (C, G) pair. (Note that A = T and C ≡ G are bound with the help of two and three hydrogen bonds, respectively.) Figure 1.1 gives an example of a double-stranded DNA. One strand is ACGTAGCT while the other strand is its reverse complement, i.e., AGCTACGT.
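The reverse complement in the example above can be checked with a few lines of Python. This is a minimal sketch: the names `COMPLEMENT` and `reverse_complement` are illustrative and do not come from the book, and the input is assumed to contain only the uppercase letters A, C, G and T.

```python
# Watson-Crick complement of each DNA base.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the opposing strand, written in its own 5' to 3' direction."""
    # Reverse the strand first, then replace every base by its complement.
    return "".join(COMPLEMENT[base] for base in reversed(strand))


print(reverse_complement("ACGTAGCT"))  # AGCTACGT
```

Applying the function twice gives back the original strand, which reflects the symmetry of the pairing rule.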

Double-stranded DNA is located in the nucleus (and mitochondria) of every cell. A cell can contain multiple pieces of double-stranded DNA, each of which is called a chromosome. As a whole, the collection of chromosomes is called a genome; the human genome consists of 23 pairs of chromosomes, and its total length is roughly 3 billion base pairs.

The genome provides the instructions for the cell to perform daily life functions. Through the process of transcription, the machine RNA polymerase transcribes genes (the basic functional units) in our genome into transcripts (or RNA molecules). This process is known as gene expression. The complete set of transcripts in a cell is denoted as its transcriptome.

Each transcript is a chain of 4 different ribonucleic acid (RNA) nucleotides: adenine (A), guanine (G), cytosine (C) and uracil (U). The main difference between the DNA nucleotide and the RNA nucleotide is that the RNA nucleotide has an extra OH group. This extra OH group enables the RNA nucleotide to form more hydrogen bonds. Transcripts are usually single stranded instead of double stranded.

There are two types of transcripts: non-coding RNA (ncRNA) and messenger RNA (mRNA). ncRNAs are transcripts that are not translated into proteins. They can be classified into transfer RNAs (tRNAs), ribosomal RNAs (rRNAs), short ncRNAs (of length < 30 bp, including miRNA, siRNA and piRNA) and long ncRNAs (of length > 200 bp; examples include Xist and HOTAIR).

mRNA is the intermediate between DNA and protein. Each mRNA consists of three parts: a 5′ untranslated region (5′ UTR), a coding region and a 3′ untranslated region (3′ UTR). The length of the coding region is a multiple of 3. It is a sequence of triplets of nucleotides called codons. Each codon corresponds to an amino acid.
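To make the codon structure concrete, the following sketch splits a coding region into codons and maps each codon to its amino acid. It assumes the standard genetic code, which this chapter does not list, written as one-letter amino acid codes with `*` for a stop codon. The names `CODON_TABLE`, `codons` and `translate` and the short input sequence are illustrative only.

```python
BASES = "UCAG"
# Standard genetic code (an assumption, not taken from the text), with codons
# ordered UUU, UUC, UUA, UUG, UCU, ... and '*' marking a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codons(coding_region: str) -> list[str]:
    """Split a coding region into consecutive triplets."""
    if len(coding_region) % 3 != 0:
        raise ValueError("coding region length must be a multiple of 3")
    return [coding_region[i:i + 3] for i in range(0, len(coding_region), 3)]


def translate(coding_region: str) -> str:
    """Map codons to amino acids, stopping at the first stop codon."""
    protein = []
    for codon in codons(coding_region):
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(codons("AUGGCCUAA"))     # ['AUG', 'GCC', 'UAA']
print(translate("AUGGCCUAA"))  # MA
```

The length check mirrors the statement that a coding region is a whole number of codons. A real pipeline would also need to locate the coding region between the two UTRs, which this sketch leaves out.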

Through translation, the molecular machine called the ribosome translates each mRNA into a protein, which is the sequence of amino acids corresponding to the sequence of codons in the mRNA. Proteins fold into complex 3D structures. Each protein is a biological nanomachine that performs a specialized function. For example, enzymes are proteins that work as catalysts to promote chemical reactions for generating energy or digesting food. Other proteins, called transcription factors, interact with the genome to turn transcription on or off. Through the interaction among DNA, RNA and protein, our genome dictates which cells should grow, when cells should die, how cells should be structured, and how various body parts are formed.

All cells in our body develop from a single cell through cell division. When a cell divides, the double helix genome is separated into single-stranded DNA molecules. An enzyme called DNA polymerase uses each single-stranded DNA molecule as the template to replicate the genome into two identical double helices. By this replication process, all cells within the same individual will have the same genome. However, due to errors in copying, some variations (called mutations) might happen in some cells. Those variations or mutations may cause diseases such as cancer.

Different individuals have similar genomes, but they also have genome variations that contribute to different phenotypes. For example, the colors of our hair and eyes are controlled by differences in our genomes. By studying and comparing the genomes of different individuals, researchers develop an understanding of the factors that cause different phenotypes and diseases. Such knowledge ultimately helps us gain insights into the mystery of life and contributes to improving human health.

1.2 Sequencing technologies

DNA sequencing is a process that determines the order of the nucleotide bases. It translates the DNA of a specific organism into a format that is decipherable by researchers and scientists. DNA sequencing has allowed scientists to better understand genes and their roles within our body. Such knowledge has become indispensable for understanding biological processes, as well as in application fields such as diagnostic or forensic research. The advent of DNA sequencing has significantly accelerated biological research and discovery.

To facilitate genomic studies, we need to sequence the genomes of different species or different individuals. A number of sequencing technologies have been developed over the last few decades. Roughly speaking, the development of sequencing technologies consists of three phases:

• First-generation sequencing: Sequencing based on chemical degradation and gel electrophoresis.

• Second-generation sequencing: Sequencing many DNA fragments in parallel. It offers higher yield and lower cost, but produces shorter reads.

• Third-generation sequencing: Sequencing a single DNA molecule without the need to halt between read steps.

In this section, we will discuss the three phases in detail.

1.3 First-generation sequencing

Sanger and Coulson proposed first-generation sequencing in 1975 [259, 260]. It enables us to sequence a DNA template 500 to 1000 nucleotides long within a few hours. The detailed steps are as follows (see Figure 1.3).

1. Amplify the DNA template by cloning.

2. Generate all possible prefixes of the DNA template.

3. Separate the prefixes by electrophoresis.

4. Read out the sequence with fluorescent tags.
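The logic behind steps 2 to 4 can be illustrated with a toy simulation. It is only a sketch of the idea, not a model of the chemistry: it assumes that electrophoresis orders fragments by length and that the fluorescent tag reveals the last base of each fragment. The function names and the short template are hypothetical.

```python
import random


def all_prefixes(template: str) -> list[str]:
    """Step 2: every prefix of the template, from length 1 to the full length."""
    return [template[:n] for n in range(1, len(template) + 1)]


def read_by_length(fragments: list[str]) -> str:
    """Steps 3 and 4: order fragments by length, then read each last base."""
    ordered = sorted(fragments, key=len)
    return "".join(fragment[-1] for fragment in ordered)


template = "ACGTAGCT"
fragments = all_prefixes(template)
random.shuffle(fragments)  # the fragments start out as an unordered mixture
print(read_by_length(fragments))  # ACGTAGCT
```

Because there is exactly one prefix of each length, sorting by length recovers the original order of the bases no matter how the fragments were mixed.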

Step 1 amplifies the DNA template. The DNA template is inserted into the plasmid vector; then the plasmid vector is inserted into the host cells for cloning. By growing the host cells, we obtain many copies of the same DNA template.

Step 2 generates all possible prefixes of the DNA template. Two techniques have been proposed for this step: (1) the Maxam-Gilbert technique [194] and (2) the chain termination methodology (Sanger method) [259, 260]. The Maxam-Gilbert technique relies on the chemical cleavage of nucleotides. Four different chemicals are used and generate all sequences ending with A, C, G a...

Cover
Half Title
Title Page
Copyright Page
Table of Contents
Preface
1 Introduction
2 NGS file formats
3 Related algorithms and data structures
4 NGS read mapping
5 Genome assembly
6 Single nucleotide variation (SNV) calling
7 Structural variation calling
8 RNA-seq
9 Peak calling methods
10 Data compression techniques used in NGS files
References
Index
