Bioinformatics Algorithms
eBook - ePub

Bioinformatics Algorithms

Design and Implementation in Python

Miguel Rocha,Pedro G. Ferreira

  1. 400 pages
  2. English

About This Book

Bioinformatics Algorithms: Design and Implementation in Python provides a comprehensive treatment of many of the most important bioinformatics problems, putting forward the best algorithms and showing how to implement them. The book focuses on algorithms and their implementation in Python, which is quickly becoming the most popular programming language in the bioinformatics field. Readers will find the tools they need to improve their knowledge and skills in algorithm development and implementation, and will also find prototypes of bioinformatics applications that demonstrate the main principles underlying real-world applications.

  • Presents an ideal text for bioinformatics students with little to no knowledge of computer programming
  • Based on over 12 years of pedagogical materials used by the authors in their own classrooms
  • Features a companion website with downloadable code, runnable examples (such as Jupyter Notebooks), and exercises relating to the book

Chapter 1

Introduction

Abstract

The field of Bioinformatics has been developing rapidly over the last few years, driven by the exponential growth of data generated in the biological and biomedical fields and by the need for more sophisticated tools to process these data and extract useful knowledge from them. This book aims to provide readers with programming skills that will allow them to design bioinformatics solutions while understanding the underlying biological questions. This chapter serves as a prelude to the book: it briefly analyzes the evolution of Bioinformatics and some of its current challenges, defines the term, and provides an overview of the book's contents and their logical organization.

Keywords

Bioinformatics; Computational Biology; Next-generation sequencing; Precision medicine; Metabolic engineering

1.1 Prelude

In recent decades, important advances have been achieved in the biological and biomedical fields, boosted by major progress in experimental technologies. The best known, and arguably most relevant, example comes from the impressive evolution of sequencing technologies in the last 40 years, boosted by the large investment in the Human Genome Project, mainly in the 1990s [92,150].
Additionally, other high-throughput technologies for measuring gene expression, or protein and compound concentrations in cells, have led to a real revolution in biological and medical research. All these techniques are currently able to generate massive amounts of so-called omics data, which can be used to foster scientific research in the life sciences and promote the development of novel technologies in health care, biotechnology and related areas.
Merely as two examples of the impact of these novel technologies and produced data, we can pinpoint the impressive development in areas such as personalized (or precision) medicine and metabolic engineering efforts within industrial biotechnology.
Precision medicine addresses the growing trend of tailoring treatments to the characteristics of individual patients (or groups of patients). This has been made increasingly possible by the availability of genomic, epigenomic, gene expression, and other types of data about specific patients, which make it possible to determine distinct risk profiles for certain diseases, or to study differentiated effects of treatments correlated with patterns in genomic, epigenomic or gene expression data. These data make it possible to design specific courses of action based on patients' profiles, enabling more accurate diagnoses and specific treatment plans. This field is expected to grow significantly in the coming years, as confirmed by projects such as the 100,000 Genomes Project, launched by the UK Prime Minister David Cameron in 2012 (https://www.genomicsengland.co.uk/the-100000-genomes-project/), or the Precision Medicine Initiative, announced in January 2015 by President Barack Obama, which started in February 2016.
Cancer research is an area that has benefited greatly from the recent advances in molecular assays. Projects such as the Genomic Data Commons (https://gdc.cancer.gov) or the International Cancer Genome Consortium (ICGC, http://icgc.org/) are generating comprehensive and multi-dimensional maps of the genomic alterations in cancer cells from hundreds of individuals in dozens of tumor types, with a visible scientific, clinical, and societal impact.
Other current large-scale efforts boosted by the use of high-throughput technologies and led by international consortia are generating data at an unprecedented scale and changing our view of human molecular biology. Of note are projects such as the 1000 Genomes Project (www.internationalgenome.org/), which provides a catalog of human genetic variation across worldwide populations; the Encyclopedia of DNA Elements (ENCODE, https://www.encodeproject.org/), which has built a map of functional elements in the human genome; the Epigenomics Roadmap (http://www.roadmapepigenomics.org/), which is characterizing the epigenomic landscapes of primary human tissues and cells; and the Genotype-Tissue Expression project (GTEx, https://www.gtexportal.org/), which is providing gene expression and quantitative trait loci from more than 50 human tissues.
Metabolic engineering, on the other hand, is concerned with improving specific microbes used in industrial biotechnological processes to produce important compounds such as biofuels, plastics, pharmaceuticals, foods, food ingredients and other added-value compounds. Strategies used to improve host microbes include blocking competing pathways through gene deletion or inactivation, overexpressing relevant genes, introducing heterologous genes, and enzyme engineering.
In both cases, the impact of data availability has been tremendous, opening new avenues for scientific advance and technological development. However, this has also raised significant challenges in the management and analysis of such complex and large volumes of data. Biological research has become, in many respects, very data-oriented, and this has been intricately connected to the ability to handle these huge amounts of data and generate novel knowledge from them, or, as Florian Markowetz recently put it, “All biology is computational biology” [108]. Therefore, the value of the sophisticated computational tools that have been developed to address these data processing and analysis tasks is undeniable.
This book is about Bioinformatics, the field that aims to handle these biological data using computers, seeking to extract novel knowledge from raw data. In the next section, we will discuss further what Bioinformatics is, and the different tasks and scientific disciplines that are involved in the field. To close the chapter, we will give an overview of the content of the remainder of the book, to help the reader navigate it.

1.2 What is Bioinformatics

Bioinformatics is a multi-disciplinary field at the intersection of Biology, Computer Science, and Statistics. Naturally, its development has followed the technological advances and research trends in Biology and Information Technologies. Thus, although it is still a young field, it is evolving fast and its scope has been successively redefined. For instance, the National Institutes of Health (NIH) defines Bioinformatics in a broad way, as the “research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral, or health data” [79]. According to this definition, the tasks involved include data acquisition, storage, archival, analysis, and visualization.
Some authors have a more focused definition, which relates Bioinformatics mainly to the study of macromolecules at the cellular level, and emphasize its capability of handling large-scale data [105]. Indeed, since its appearance, the main tasks of Bioinformatics have been related to handling data at a cellular level, and this will also be the focus of this book.
In the same seminal NIH document, the related field of Computational Biology is defined as the “development and application of data-analytical and theoretical methods, mathematical modeling, and computational simulation techniques to the study of biological, behavioral, and social systems”. Thus, although the two terms are deeply related, and are used interchangeably by some authors, the first (Bioinformatics) relates to a more technologically oriented view, while the second is more concerned with the study of natural systems and their modeling. This does not prevent a large overlap between the two fields.
Bioinformatics tackles a large number of research problems. For instance, the journal Bioinformatics (https://academic.oup.com/bioinformatics) publishes research on application areas that include genome analysis, phylogenetics, genetics and population analysis, gene expression, structural biology, text mining, image analysis, and ontologies and databases.
The National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/bioinformatics.html) divides Bioinformatics into three main areas:
  • developing new algorithms and statistics to assess relationships within large data sets;
  • analyzing and interpreting different types of data (e.g. nucleotide and amino acid sequences, protein domains, and protein structures);
  • developing and implementing tools that enable efficient access and management of different types of information.
This book will focus mainly on the first of these areas, covering the main algorithms that have been proposed to address Bioinformatics tasks. The emphasis will be put on algorithms for sequence processing and analysis, considering both nucleotide and amino acid sequences.
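To give a flavor of what basic processing of nucleotide sequences can look like in Python, the following is a minimal sketch and not code taken from the book. The function names (is_dna, transcribe, reverse_complement, gc_content) are hypothetical, chosen here only for illustration, and the sketch assumes sequences are plain strings over the alphabet A, C, G, T.

```python
# Illustrative sketch only: function names are hypothetical, not the book's.
# Assumes DNA sequences are plain Python strings over the alphabet A, C, G, T.

def is_dna(seq):
    """Return True if seq is non-empty and contains only A, C, G or T."""
    return len(seq) > 0 and set(seq.upper()) <= set("ACGT")


def transcribe(dna):
    """Return the RNA corresponding to a DNA coding strand (T replaced by U)."""
    return dna.upper().replace("T", "U")


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence."""
    complement = str.maketrans("ACGT", "TGCA")
    return dna.upper().translate(complement)[::-1]


def gc_content(dna):
    """Return the fraction of G and C nucleotides in a non-empty sequence."""
    seq = dna.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    seq = "ATGC"
    if is_dna(seq):
        print(transcribe(seq))
        print(reverse_complement(seq))
        print(gc_content(seq))
```

Representing sequences as strings keeps the sketch short; a fuller implementation might wrap them in a class that also records the sequence type (DNA, RNA or protein).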

1.3 Book's Organization

This book is organized into four logical parts encompassing the major themes addressed in this text, each containing chapters dealing with specific topics.
In the first part, which includes this chapter, we introduce the field of Bioinformatics, providing relevant concepts and definitions. Since this is an interdisciplinary field, we will need to address some fundamental aspects regarding algorithms and the Python programming language (Chapter 2), and cover some biological background needed to understand the algorithms put forward in the following parts of the book (Chapter 3).
The second part of this book addresses a number of problems related to sequence analysis, introducing algorithms and proposing illustrative Python functions and programs to solve them. The Bioinformatics tasks addressed will cover topics related to basic sequence processing and analysis, such as the ones involved in transcription and translation (Chapter 4), algorithms for finding patterns in sequences (Chapter 5), pairwise and multiple sequence alignment algorithms (Chapters 6 and 8), searching homologous sequences in databases (Chapter 7), algorithms for phylogenetic analysis from sequences (Chapter 9), biological motif discovery with deterministic and stochastic algorithms (Chapters 10 and 11), and finally Hidden Markov Models and their applications in Bioinformatics (Chapter 12).
The third part of the book will focus on more advanced algorithms, based on graphs as data structures, which make it possible to handle large-scale sequence analysis tasks, such as the ones typically involved in processing and analyzing next-generation sequencing (NGS) data. This part starts with an introduction to graph data structures and algorithms (Chapter 13), addresses the construction and exploration of biological networks using graphs (Chapter 14)...
