Introduction
The diversity of life, from simple organisms such as bacteria to the largest animals, and the diversity of individuals within a species, are governed by a biomolecule inside living cells called deoxyribonucleic acid (DNA). The DNA molecule is built from only four basic monomeric units known as DNA nucleotides, each composed of a phosphate group, a sugar, and one of four types of nucleobases, or simply bases (adenine, cytosine, guanine, and thymine). In bioinformatics, those four units are given the letters A, C, G, and T, respectively. The DNA molecules in a living cell are represented as sequences of those four nucleotides, and together they form the genome. Viruses usually have small genomes; Bacteriophage spp. have a median total length of 8,689 bases (8.689 kb). The smallest non-viral genome is that of a bacterium known as Carsonella ruddii, which has a genome of 164,376 bases (164.376 kb). The total length of the human genome is 3,272,090,000 bases (3,272.09 Mb). Segments of DNA known as genes control the different aspects of an organism's life by instructing cells to synthesize proteins, which do most of the work in cells and are required for the structure, function, and regulation of the body's tissues and organs. The instructions in a gene are transcribed into ribonucleic acid (RNA), which is then translated into a specific protein. This two-step process (transcription and translation), by which the information in a gene flows into a protein, is known as the central dogma of molecular biology. The information in DNA is also transmitted from one generation to the next: offspring inherit characteristics through the DNA passed on by their parents. DNA also changes slowly over time, and such changes, or mutations, contribute to the diversity of life by producing traits that may help organisms adapt to changes in their environment.
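The idea of DNA as a sequence of four letters, and of information flowing from DNA to RNA to protein, can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not a reference implementation: the function names and the toy gene are hypothetical, the codon table is deliberately partial (the standard genetic code has 64 codons), and transcription is modeled as reading the coding strand and replacing T with U, which ignores the template strand and RNA processing.

```python
# Illustrative sketch only: the names and the toy sequence are hypothetical.

VALID_BASES = set("ACGT")

# Partial RNA codon table (codon -> one-letter amino acid code).
# Only the codons used by the toy gene below are listed; "*" marks a stop codon.
CODON_TABLE = {
    "AUG": "M",  # methionine (start codon)
    "GCC": "A",  # alanine
    "UGG": "W",  # tryptophan
    "AAA": "K",  # lysine
    "UAA": "*",  # stop
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T replaced by U)."""
    seq = coding_dna.upper()
    if not set(seq) <= VALID_BASES:
        raise ValueError("sequence contains characters other than A, C, G, T")
    return seq.replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    usable_length = len(mrna) - len(mrna) % 3  # ignore a trailing partial codon
    for i in range(0, usable_length, 3):
        codon = mrna[i:i + 3]
        amino_acid = CODON_TABLE[codon]  # KeyError for codons outside the partial table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def bases_to_kb(n_bases: int) -> float:
    """Convert a length in bases to kilobases (1 kb = 1,000 bases)."""
    return n_bases / 1_000


def bases_to_mb(n_bases: int) -> float:
    """Convert a length in bases to megabases (1 Mb = 1,000,000 bases)."""
    return n_bases / 1_000_000


toy_gene = "ATGGCCTGGAAATAA"   # hypothetical 15-base coding sequence
mrna = transcribe(toy_gene)    # transcription: DNA -> RNA
protein = translate(mrna)      # translation: RNA -> protein
print(mrna, protein)
print(bases_to_kb(8689), bases_to_mb(3_272_090_000))
```

The two conversion helpers reproduce the unit arithmetic used for the genome sizes quoted above: 8,689 bases is 8.689 kb, and 3,272,090,000 bases is 3,272.09 Mb.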
Advances in molecular biology and biotechnology have made it possible to capture the information carried by DNA, RNA, and proteins. Researchers and institutions now increasingly deposit sequences and other biological information, from diverse species and from individuals within those species, in bioinformatics databases, where they are available for retrieval and analysis for research purposes. This genomic information has revolutionized biology and made modern biologists dependent on bioinformatics, which uses computer science to store, organize, search, manipulate, and retrieve genomic information. Institutions such as the National Institutes of Health (NIH), the European Molecular Biology Laboratory (EMBL), and Japan's National Institute of Genetics (NIG) have contributed greatly to the progress made in bioinformatics. Together, those three institutions formed the International Nucleotide Sequence Database Collaboration (INSDC) [1], a joint effort to collect and disseminate databases containing DNA, RNA, and protein sequences. The INSDC includes GenBank (USA), the European Nucleotide Archive (UK), and the DNA Data Bank of Japan (Japan). These three partners capture, preserve, share, and exchange a comprehensive collection of nucleotide sequences and associated information on a daily basis. INSDC policy allows public access to the global archives of nucleotide data generated in publicly funded experiments. Submission of genomic data is further driven by the fact that it is a prerequisite for publication in scholarly journals. The database records are publicly available for scientists all over the world to access, analyze, draw conclusions from, and use in publishing their findings.
Before going deeper, it is important to discuss some basics of genomics that will help readers understand bioinformatics. Bioinformatics is built on data that represent the flow of genomic information from DNA to RNA to proteins. Therefore, understanding the composition of these three kinds of biomolecules, gene structure, gene transcription and expression, mutation, and the techniques used to obtain such genomic data is fundamental to understanding biological databases and other bioinformatics applications.
Genetic Information and Its Transmission
In the traditional Linnaean system of classification, living organisms are classified on the basis of cellular organization and methods of nutrition into five kingdoms: Monera (bacteria), Protista (protozoans and algae), Fungi (fungi), Plantae (plants), and Animalia (animals). Modern taxonomic classification extends the Linnaean system to take genomic characteristics into account. Nowadays, biologists recognize only two vastly different cell types, prokaryotic and eukaryotic, based on the absence or presence of a membrane-bound nucleus containing the genetic material of the cell. Therefore, a living organism is either a prokaryote or a eukaryote [2, 3]. Prokaryotes are unicellular organisms that do not have a true nucleus or membrane-bound organelles (Figure 1.1a). Prokaryotes include bacteria, which are the most abundant organisms, and archaea, which are inhabitants of the most ext...