PART I
Preliminaries
CHAPTER 1
Introduction and Book Overview
The first volume of the present two-volume book introduced the essential principles of support vector machines (SVMs) and machine learning in general. SVMs are a powerful modern machine learning methodology that has found great success in a variety of applications, including biomedicine. The emphasis of the first volume was to make SVM principles, which are often inaccessible to biomedical researchers due to being technically quite demanding, easy to grasp even for an audience that normally lacks substantial computational and mathematical training.
The first volume presented essential SVM principles, algorithms and protocols, but did not elaborate on all the necessary details of how the formal methods can be applied in practical settings. The first volume also did not present empirical comparisons of SVMs with other state-of-the-art methods which could reasonably be considered as alternatives in modern biomedical research. These two areas are the focus of the present, second, volume.
It is our intent that together the two volumes will provide sufficient theoretical and practical depth and guidance to data analysts and modelers so that they can bring the power of SVMs to bear successfully on their data analysis and modeling needs.
The present volume will also be useful to researchers who are quantitatively sophisticated and do not need the “gentle” introduction of Volume 1, but can still benefit from guidance about effective ways to translate theoretical SVM methods into demanding, real-life data analytic practice.
Organization of the Second Volume
The second volume provides a summary of the main methods used in this book (Chapter 2), including essential SVM theory that was covered in depth in the first volume. This is to provide a refresher of the core concepts needed to understand the material in the present volume and also to make the second volume sufficiently self-contained, so that it can be read or taught independently of the first volume to appropriate audiences.
The remaining material comprises case studies and benchmarks.
- Case studies aim to give the reader a wide enough range of application areas and a deep enough account of practical details on how to translate the theoretical methods of the first volume into successful applied modeling of academic and industry relevance.
- Benchmarks are systematic comparisons of SVM-based methods to other state-of-the-art methods that can reasonably be considered as alternatives for the same types of analyses that SVMs are designed for.
The case studies and benchmarks are organized into four parts corresponding to genomic data, text data, clinical data, and broad (data-independent) categories.
We remain firm in our commitment, stated in the first volume, not to give readers the false impression that SVMs are a “one solution fits all problems” data analysis and modeling paradigm. In these benchmarks we therefore examine both the strengths and the limitations of SVMs, and the results uncover several important examples of each for the benefit of the practicing analyst.
The SVM applications and benchmarking literature is vast and fairly rich. Instead of drawing from this general literature, we elected to present here case studies and benchmarks in which we were directly involved. This is to ensure that we have a degree of familiarity with these applications and benchmarks that goes well beyond what can be obtained from reading published reports of work done by others. Such reports, by necessity, provide a limited view of what it takes to create successful applications, including choice of options/parameters, design of the analysis approach, and numerous other details that often do not make it into the peer-reviewed literature but are essential for the success of an applied project. We hope that our extensive collective experience with SVM applications and comparative testing gives the reader a sufficiently wide view of what SVMs can accomplish in practice without constraining the reader’s imagination about all possible opportunities which are, literally, endless.
The format of the second volume is no longer that of “programmed text” employed in the first volume. This format was needed in the first volume to cover technically complex material in manageable chunks for the benefit of technically unsophisticated readers. Since the programmed text format is no longer needed here, it is replaced by the more appropriate traditional exposition format.
As with most modern, high-performance machine learning and pattern recognition technologies, the methods and processes presented in Volumes 1 and 2 are, to the best of our knowledge, unconstrained for academic use. However, many of the presented technologies and applications are protected by copyrights, patents, and pending patents, so commercial applications require licenses from the owners of the intellectual property. The number of patents covering SVMs is very large and constantly expanding, and it is outside the scope of our work to identify which SVM methods presented in this book (and outside it) are owned by whom and for which application domain. We leave it to readers interested in commercial applications to work with qualified technical and legal consultants to make sure that commercial applications of SVM methods are properly licensed. For the applications specifically presented in Volume 2, licensing inquiries can be made to: Discovery Holdings LLC, http://www.discoveryholdings.net.
Individual chapters in Volume 2 discuss essential information about the goals of the projects; the options/parameters that were available and how the data analysis and modeling plan was formulated; the broader context of the projects; practical decisions that proved useful; and the main lessons learned.
In the remainder of this chapter, we provide a synopsis of all case studies and benchmarks presented in the present volume.
Chapter 3 (“Application and Comparison of SVMs and Other Methods for Multicategory Microarray-Based Cancer Classification”) shows how SVMs can be used to build diagnostic classifiers for 41 cancer types and 12 normal tissue types using microarray gene expression data. This is a very important application area that is already creating the foundations for next-generation diagnostics and the personalized medicine of the future. Alongside the application details, the chapter shows that SVMs outperform major classification methods; that relatively simple gene selection methods can improve classification accuracy; that some particular SVM multiclass methods are preferable to others; and that ensembling does not improve on the classification accuracy of the best non-ensemble models. The findings and ideas of the study have been used to create a robust automodeller, GEMS (http://www.gems-system.org), which has been tested very successfully with rigorous standards of validation (independent, prospective data validation) and was shown to match or exceed the predictive performance of human experts in all published models associated with the employed datasets.
Chapter 4 (“Comparison of SVMs and Random Forests for Microarray-Based Cancer Classification”) performs a similar but more extended comparison of SVMs with the very popular method of Random Forests (RFs). RFs are used extensively in bioinformatics and have been popular primarily because they combine the intuitive and proven ideas of decision tree induction and bagging. The comparison involves 22 datasets with diagnostic and prognostic response variables. SVMs outperform RFs both when no gene selection is performed and when several popular gene selection methods are used.
Chapter 5 (“Comparison of SVMs and Kernel Ridge Regr...