MAY 13, 2015 10:30 AM PDT

Population Scale Human Genome Analysis on the Cloud

  • Peter White, PhD, James Hirmas

    Peter White, PhD: Co-Founder and Chief Scientific Advisor, GenomeNext LLC; Assistant Professor of Pediatrics, Nationwide Children's Hospital
    James Hirmas: Co-Founder and CEO, GenomeNext LLC


Advanced sequencing technologies have made population scale whole genome sequencing a possibility. However, current strategies for analyzing these data rely on parallelization approaches that have limited scalability, lack reproducibility and are complex to implement, requiring substantial investment in specialized IT solutions. To overcome these challenges, our goal was to develop a platform that fully automates all the components needed to perform both single sample and large-scale genomic data analysis. We developed a highly accurate and deterministic analysis solution, named Churchill, which fully automates the complex and computationally intensive steps of alignment, post-alignment processing and genotyping. Our parallelization strategy divides each analysis step across multiple compute instances, allowing whole genome analysis to be completed in under 90 minutes. In addition to rapid single sample analysis, Churchill optimizes utilization of available compute resources and scales in a near linear fashion. Using Amazon Web Services (AWS) cloud computing resources, we developed a platform that enables population scale genome analysis. To demonstrate this, we analyzed the 1000 Genomes Project dataset of 2,504 whole genome and exome sequenced individuals. Starting from raw FASTQ input data, we fully automated the analysis process, ultimately performing multi-sample variant calling and generating population allele frequencies in seven days. Our approach demonstrates the feasibility of generating population allele frequencies specific to a given unified analysis approach, which is critical for accurately filtering datasets to discover rare pathogenic variants.
Moreover, through use of on demand cloud computing resources, our method represents a solution for the genomics computational bottleneck and will keep pace with the magnitude of data generated by population scale sequencing.

Learning Objectives:

  1. Understanding the steps required to analyze human genome sequencing data, for both single sample analysis and large scale genomic studies
  2. Optimizing compute resources and leveraging cloud computing to resolve the bioinformatics bottleneck
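
As a rough illustration of the final step named in the abstract, generating population allele frequencies from multi-sample genotype calls, the sketch below counts alleles at a single site. It is not Churchill's implementation, which the abstract does not describe. The function name `allele_frequencies` and the sample genotypes are hypothetical, and the genotype strings are assumed to follow the VCF `GT` convention (`0` for the reference allele, `1` and up for alternates, `.` for a missing call).

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {allele_index: frequency} for one variant site.

    genotypes: iterable of VCF-style GT strings such as "0/1", "1|1" or "./.".
    Missing alleles (".") are excluded from the denominator.
    """
    counts = Counter()
    for gt in genotypes:
        # Treat phased ("|") and unphased ("/") separators the same way.
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":
                counts[int(allele)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}  # no called alleles at this site
    return {allele: n / total for allele, n in counts.items()}


# Hypothetical genotypes for four individuals at one site.
samples = ["0/0", "0/1", "1|1", "./."]
freqs = allele_frequencies(samples)
print(freqs)
```

Because every sample passes through the same calling pipeline before this count, the resulting frequencies are specific to that pipeline, which is the property the abstract highlights for filtering rare variants.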
