1.1 Why informatics is important for biologists
This is really all about data. In particular, it’s about working with so much data that learning to program computers to perform calculations for us will save a lot of time, and will probably enable analyses that would otherwise be impossible. In biological research, the amount of data available to researchers has increased so much over recent years that this has been described as a ‘data explosion’[1].
Much of this biological data is freely available for any researcher to access and use in their own work. Therefore, any biological scientist who learns the skills needed to obtain, preprocess and analyze publicly available datasets is giving themselves an advantage when it comes to making the most of their own opportunities.
One consequence of this increase in biological data is that many of the recent paradigms of molecular biology come from computational analysis of large collections of data. In terms of developing an intuition for what is shown when results from computational analysis are presented in a paper, there is no substitute for first-hand experience of using a method for data analysis in your own research (of course, a theoretical understanding of the method in question is also important!). In reality, it is becoming increasingly difficult to understand the current state of the art in biological research without some experience and understanding of computational biology.
In 2014, the UK’s MRC and BBSRC (Medical Research Council and Biotechnology & Biological Sciences Research Council) produced a report of ‘skills vulnerabilities’, which reflected important research capabilities lacking in the UK. Both in 2014 and in a 2017 update¹, computational methods for biological research were identified as key weaknesses. In fact, the following specific points were highlighted:
• Data analytics, especially bioinformatics, appear to be particularly vulnerable.
• Informatics skills are applicable to many areas of both the biosciences and the medical sciences.
• Maths, statistics and computational biology skills are lacking particularly at the postgraduate and postdoctoral levels, with many respondents reporting difficulties in recruiting adequately skilled researchers at these levels; shortages are not just restricted to the UK.
So there is a recognized international shortage of bioinformatics skills, and these skills are increasingly fundamental across all areas of biological research. You were probably already aware of this, given that you’re reading this book, but it hopefully serves as a motivating reminder that learning the bioinformatics skills taught here will be worth the effort you put in!
1.2 How to use this book
This book was developed over a decade of my experience training biologists to empower their own research through making better use of computers. I think there are three key aspects of this training, which are in essence the intended learning outcomes of this book:
1. theoretical understanding of how a set of computational analysis steps produces a result that yields biological insight
2. ability to plan a set of analysis steps that, when carried out on a given dataset, will yield biological insight
3. practical experience of enacting those plans on real datasets to produce novel, valuable research results
Reading the chapters of this book should help with the first two of these. But the only way to gain the skills to carry out data analysis that gives research results is to do it. There is simply no substitute for practical experience. Furthermore, the more experience you get carrying out data analysis, the more instinctively you will be able to plan analyses for your own research and to think of the best datasets to work with. For this reason, this book is designed to give all the practical guidance someone needs to be able to carry out a set of analysis procedures. We will cover the procedures that are particularly useful for harnessing different types of biological data.
Because many data analysis methods are not implemented in tools with convenient graphical user interfaces (GUIs), there is no avoiding a bit of coding. While at first this will almost certainly be frustrating to those new to a command line interface, with time and practice you will find that the automation you can implement empowers you to achieve all sorts of things that would otherwise be impossible (or at least impractical). To help in this process, all required computer code is provided, in the form of what are effectively individual commands given to the computer. Each line² of code is followed by a detailed description of every part of every command.
The first chapters of this book introduce R and the Unix command shell, which will be indispensable tools for data analysis. This will involve learning some of the building blocks for programming computers to perform many tasks in one go, without requiring continued instruction from a human. Many of the methods we use are theoretically simple enough to calculate by hand with a small set of observations, but the beauty of using command-line tools is that you can program them to perform huge numbers of repetitive tasks very quickly and automatically. One should also not underestimate the importance and power of ‘data wrangling’, a term that acknowledges that the format in which you obtain data is rarely exactly the format you need in order to perform the analyses you want.
The fourth chapter explains the mathematical theory behind the analysis methods that are employed throughout this book. To understand the theory, we’ll make use of the R environment to look at a few practical examples. Generally, my philosophy is that a solid understanding of a few very versatile methods is the best strategy to enable a great variety of applications with as little effort as possible. A recurring theme of my research supervision is that the simpler your approach to demonstrating a finding, the better (as long as it’s appropriate): it will be understandable to more people, and therefore have greater impact, and will be less likely to be misinterpreted.
Chapters 5 to 7 use real research examples to build up your practical experience of obtaining and analyzing biological datasets, utilizing the statistical analysis methods described in Chapter 4. The examples use already-processed datasets, so that the focus is on the analysis rather than worrying about formats. The complexity of the tasks and the datasets involved builds through these chapters, so that by the end of Chapter 7 we are systematically evaluating patterns of variation of hundreds of features from multiple platforms used to characterize different aspects of the same samples.
And finally, the bulk of this book by volume guides you through the specifics of working with different types of biological datasets. I have included those I think are the most frequently encountered across molecular biology research, but this is certainly influenced by my own background in cancer research. The choice of data types to cover also takes into account how accessible the data are to obtain, preprocess and analyze, so that we get the most out of the least effort.
A word of warning: it is easy to feel isolated in research, and that can be problematic when you find yourself, still new to bioinformatics, as the expert for your research group or team. There is an excellent blog post from Mick Watson³ on problems facing ‘lonely bioinformaticians’. Most importantly, don’t be afraid of looking to others for help.
You can do this! Stick with it, and you should find that you’re able to make more use of the data you generate and the vast accumulation of molecular biology data that is already in the public domain.