Informatics, Implementation, and Genomic Cohorts
12. Example of a Scalable and Adaptable Approach for NGS Analyses Leveraging High-Performance Computing
Cody Ashby,∗ Michael A. Bauer,∗ Erich A. Peterson,∗ and Donald J. Johann, Jr.∗,†
∗Winthrop P. Rockefeller Cancer Institute, University of Arkansas for Medical Sciences, Little Rock, AR
†[email protected]
Contents
I. Introduction
II. Background
III. New System: Technical Approach
   a. System design
   b. Molecular profiling database (MPDB)
   c. MPDB admin
   d. Docker
   e. WES pipeline
   f. Installation and setup procedure
   g. Cloud computing
   h. Regression test sets
IV. Significance
   a. Deployment on Google Compute Engine
   b. Internal testing and capabilities
V. Future Work
VI. Conclusions
Competing Interests
Authors’ Contributions
References
I. Introduction
Next-generation sequencing (NGS) is poised to advance the understanding of health and disease in an unprecedented manner. In oncology, these advances span the gamut from basic science to guideline-based clinical management of many cancers. National guidelines (e.g., NCCN version 2.2018 for non-small cell lung cancer) strongly advise broader molecular profiling (i.e., precision medicine) to identify driver mutations and ensure that patients receive the most appropriate treatment. Under specific conditions, the Centers for Medicare and Medicaid Services (CMS) has proposed that the evidence is sufficient to cover NGS as a diagnostic laboratory test when ordered by a physician and performed in a CLIA-certified lab. NGS-based tumor profiling is soon to become the standard of care for patients with advanced cancer, and liquid biopsy assays, which profile cell-free DNA, are an application of deep sequencing. Along with enabling a revolution in molecular diagnostics and in drug discovery and development, NGS is facilitating the discovery of novel classes of biomolecules (e.g., new types of non-coding RNAs).
However, there exists a meta-problem in life sciences research: big data from NGS is outpacing the capabilities of information technology. Keeping pace with state-of-the-art informatics needs can lead to an ever-shortening cycle of purchasing new high-performance computing (HPC) systems, which may cause fiscal strain, especially for organizations new to the field. To address this, an information architecture and a suite of tools for the deployment and retrieval of study data were developed. The NGS pipeline software was engineered from open-source best practices using an adaptable methodology along with a Docker container approach, so that the same code is executed regardless of the HPC facility to which a study is deployed.
The present chapter describes a comprehensive methodology for NGS analysis that leverages in-house HPC, the scalability of a commercial cloud provider, and the computational resources of an affiliated research university. NGS analyses have been performed at three HPC facilities using the same pipeline code, and software configuration management tasks have been greatly reduced. Reliability has also improved, and the approach scales effectively. Finally, drawing on broader university-wide and commercial resources promotes fiscal responsibility.
II. Background
NGS is rapidly changing the manner in which biomedical research is performed and clinical medicine is practiced. This is especially true for medical oncology. Former President Obama’s Precision Medicine Initiative (PMI) calls for a near-term focus on cancers and a longer-term aim to generate knowledge applicable to a whole range of health and disease.1 The recently announced Cancer Moonshot furthers the PMI by galvanizing research efforts against cancer; it is managed by a prominent national task force and led by former Vice President Biden.2
Why now? These goals are within reach due to: (i) advances in genomics over the past 10 years, (ii) increasing use of electronic health records, (iii) technical advances in health devices residing in smartphones and the fact that a majority of US adults own one, (iv) advances in data science, especially concerning big data, and (v) the changing role of patient partnerships, especially crowdsourcing and citizen science, where people want to participate with feedback. The National Academy of Sciences is recommending a new taxonomic classification of disease, especially cancer, which requires definition by many levels of molecular information (genome, epigenome, microbiome, etc.) integrated with clinical data into a knowledge network that learns.3 Since the cost of a whole-exome sequencing (WES) study is ~$1,000 (the price of a CT scan), NGS assays are expected to proliferate and be reimbursed, so there is an urgent need for innovative and cost-effective computational approaches.
NGS informatics related to complex pipeline deployment and management of data has produced many advances. For instance, Galaxy4 is an open-source, web-based platform that facilitates the analysis of NGS experiments. It is a powerful and flexible approach designed for use by non-programmers. Galaxy allows users to run pipelines without in-depth knowledge of any single tool, which is a significant innovation. However, it is sometimes hard to see the relationship between the GUI and the command line parameter options used inside a pipeline. Some users may also need access to more programmatic or custom options from within the pipeline, which would require programming abilities. In addition, GUI options do not gracefully handle batch NGS analyses with multiple stages and inputs. Furthermore, some Galaxy pipeline processing steps can be hard to integrate with non-Galaxy-based tools, and integrated data cleaning (e.g., ETL) along with database archiving of pipeline-based study results is absent.
Omics Pipe5 is an open-source Python package that can be installed on a local machine, a compute cluster, or a cloud computing platform such as Amazon Web Services (AWS). Its primary purpose is collecting community-supported “best practices” pipelines in an attempt to increase the overall reproducibility of results, and it claims to be an easily extensible approach to pipeline deployment and management. However, there are some configuration requirements. If the AWS distribution is not used, all third-party tools must be installed before using a given pipeline. Other important features not addressed include the management of related experiments tied to pipeline deployment, as well as annotating, archiving, and presenting the results returned by the pipeline in a useful manner.
openBIS is a powerful tool that has proven itself useful in addressing problems of big data management.6 It provides generic approaches toward the management of high-throughput biological data. However, upon careful study of the requirements associated with this particular project, it was decided that a more specialized approach was needed. In-house developers with significant expertise in database design created a custom schema, which has allowed more in-depth data handling of multiple molecular profiling modalities and the establishment of interfaces for the interaction with third-party tools, such as IGV. The development of third-party tool interfaces and linkages may be more challenging with externally developed software.
Other innovative approaches include Taverna,7 which is a workflow management system used in a variety of scientific fields including bioinformatics. It is a desktop authoring system which allows a user to integrate a wide variety of software components into a computable solution. These features make it very useful for non-programmers. Other rapid prototyping approaches that may be applied to NGS pipeline construction include Bpipe8 and Snakemake.9 However, the developer is required to specify the pipeline components and any specific interfacing steps that may be required.
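To make concrete what it means for a developer to specify pipeline components and how they interface, the following is a minimal sketch in plain Python. It is not Snakemake or Bpipe syntax; it only uses the standard library's graphlib module to order stages by their declared dependencies. The stage names are hypothetical and are not taken from the chapter.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names (assumptions for illustration only).
# Each stage maps to the set of stages whose output it needs.
STAGES = {
    "align": set(),
    "mark_duplicates": {"align"},
    "call_variants": {"mark_duplicates"},
    "coverage_report": {"mark_duplicates"},
    "annotate": {"call_variants"},
}


def execution_order(stages: dict[str, set[str]]) -> list[str]:
    """Return one valid run order for the declared stages.

    graphlib raises CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(stages).static_order())


if __name__ == "__main__":
    for step, name in enumerate(execution_order(STAGES), start=1):
        print(step, name)
```

Workflow tools add much more on top of this idea (file tracking, re-runs, cluster submission), but the developer's core task is the same: declare each component and what it depends on.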
III. New System: Technical Approach
There is a wide variety of analysis tools and pipelines with different deployment modalities and parameters,3,10–12 and a custom solution addressing all requirements and constraints is illustrated here. The main challenges were to satisfy two requirements. The first was the ability to utilize (i) in-house, (ii) local affiliated university, or (iii) commercially provided HPC resources. The second was to minimize software configuration management while still having robust, platform-independent pipelines. This chapter illustrates a novel methodology for processing NGS experimental data that uses an adaptable approach for the computational pipeline architecture. This methodology, which is built on an information architecture centered on the Molecular Profiling Database (MPDB, discussed later), is independent of any particular HPC system. Open-source tools and best practices from the Broad Institute and The McDonnell Genome Institute at Washington University were used in the pipeline construction. Pipeline design goals were to be (i) fast but flexible, (ii) flexible but correct, and (iii) able to generate reports that are easily understood by clinicians and scientists.
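The two requirements above can be illustrated with a short sketch. It is not the authors' implementation: the image tag, the run_wes entry point and its options, the site names, and the paths are all hypothetical. The only real command-line elements are the standard docker run options --rm, --cpus, and -v. The point is that the site-specific part of the command is limited to host-side settings, while the image and pipeline arguments stay identical everywhere.

```python
from dataclasses import dataclass

# Hypothetical image tag and entry point; neither comes from the chapter.
PIPELINE_IMAGE = "example/wes-pipeline:1.0"
PIPELINE_ENTRY = "run_wes"


@dataclass(frozen=True)
class Site:
    """Settings that differ between HPC facilities."""
    name: str
    data_dir: str  # host directory where study data is staged
    cpus: int      # CPU budget granted to the container


def container_args(study_id: str) -> list[str]:
    """Image plus pipeline arguments: identical at every site."""
    return [PIPELINE_IMAGE, PIPELINE_ENTRY,
            "--study", study_id, "--data-root", "/data"]


def build_command(site: Site, study_id: str) -> list[str]:
    """Compose a `docker run` argument list for one site."""
    host_options = [
        "docker", "run", "--rm",
        "--cpus", str(site.cpus),
        "-v", f"{site.data_dir}:/data",  # mount site storage at a fixed path
    ]
    return host_options + container_args(study_id)


# Three illustrative sites mirroring the in-house, university, and
# commercial options; names and paths are assumptions.
SITES = [
    Site("in_house", "/hpc/studies", 16),
    Site("university", "/scratch/ngs", 32),
    Site("cloud", "/mnt/disks/ngs", 64),
]

if __name__ == "__main__":
    for site in SITES:
        print(site.name, " ".join(build_command(site, "STUDY-001")))
```

Because the container always sees its data at /data, the pipeline code inside the image never needs to know which facility it is running on.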
This overall approach greatly reduces configuration management since there is only one version of the pipeline software that runs on different HPC systems. It also improves reliability, aids scalability, and promotes fiscal r...