SEP 11, 2019 1:30 PM PDT

Development of Machine Learning Algorithms to Data Mine Meta-Omic Characterizations of Complex Microbiota

Sponsored by: PacBio
C.E. Credits: P.A.C.E. CE Florida CE
Speaker
  • Professor of Microbiology & Immunology, Professor of Otolaryngology-Head and Neck Surgery, Drexel University College of Medicine
    Biography
      Dr. Ehrlich is Professor of Microbiology and Immunology, and of Otolaryngology-Head and Neck Surgery, at Drexel University College of Medicine (DUCoM) in Philadelphia, PA, USA. He also directs the Center for Genomic Sciences (CGS), the Center for Advanced Microbial Processing (CAMP), and the Center for Surgical Infections and Biofilms within the Institute for Molecular Medicine and Infectious Disease at DUCoM. In addition, he directs the Core Genomics Facility for Drexel University as a whole. CGS scientists use a broad array of comparative genomic techniques and bioinformatic tools, many developed in-house, to identify and characterize both virulence genes within pathogens and susceptibility genes to pathogens within their hosts. Dr. Ehrlich is also one of the founders of the field of Clinical Molecular Diagnostics (MDx), having been involved in the original application of PCR for the detection of human retroviruses in 1985 [1]. He founded the MDx Division at UPMC and used these experiences to author the first textbook/lab manual for infectious disease (ID) MDx [2]. Together with a team of like-minded pioneers, he was one of the founders of the Association for Molecular Pathology and served as the first co-chair of the ID section. Dr. Ehrlich counts among his major contributions to science the mapping and cloning of several major human disease genes [3,4] and the rewriting of much of our understanding of chronic bacterial pathogenesis [5,6]. The latter began with his promulgation of the biofilm paradigm to explain many facets of chronic mucosal microbial infections [7-9]. Working with Chris Post, he began his explorations into chronic middle-ear disease in children in the early 1990s; he has since repeatedly generalized these findings, such that it is now widely accepted that the vast majority of all chronic microbial infections are biofilm-associated [10,11].
He also advanced the Distributed Genome Hypothesis (DGH) [12,13] to explain the enormous clinical variability among strains of a bacterial species, which together with the biofilm paradigm forms the basis for his rubric of Bacterial Plurality [6]. His work in human genetics, combined with the laboratory resources necessary to test the DGH, has led to his playing a role in the development of several waves of genomic technology over the last quarter century, including microsatellite mapping, microarrays, and next-generation sequencing. More recently he has developed the concept of bacterial population-level virulence factors and has, for the first time within the field of bacterial genomics, used statistical genetics and machine learning approaches to identify unannotated distributed genes that are associated with virulence. These computational methodologies provide a non-biased, top-down approach to prioritize the annotation of hypothetical genes [14]. Coincident with the recent relocation of his research enterprise to DUCoM, he founded CAMP, which functions as a collaborative multi-discipline facility for exploitation of a suite of technological advances, many developed within the CGS, that permit the identification, cloning, heterologous expression, and biochemical verification of commercially important biosynthetic and biodegradative pathways from what he refers to as the "Genomic Dark Matter". This approach came out of his successful collaborative studies with Dr. David Sherman at the University of Michigan, in which they used multiple omics technologies (and developed the term meta-omics) to isolate and characterize all of the genes for a novel biosynthetic pathway for an important anti-cancer drug from an unculturable endosymbiotic bacterium of a tunicate [15].
Over the past several years Dr. Ehrlich has overseen the development of a novel ultra-high-fidelity microbiome assay that provides quantitative, species-specific analyses of microbial consortia using whole-gene 16S amplification and sequencing on the Pacific Biosciences third-generation long-read sequencing platform [16]. When combined with a state-of-the-art bioinformatics pipeline that takes advantage of novel pathway- algorithms and a custom database, developed in-house, this system provides unprecedented accuracy. In collaboration with Dr. Curtis Harris at the NCI, Dr. Ehrlich and his team applied this high-fidelity microbiome assay to identify bacterial species-specific changes to the lung microbiome associated with a specific TP-53 mutation, providing the first microbial biomarker for cancer [17]. Dr. Ehrlich's lifelong interest in emergent MDx and "omic" technologies led to his recent appointment as Director of the Meta-Omics Core Facility at the Sidney Kimmel Cancer Center, a consortium NCI-designated Cancer Center involving Thomas Jefferson University and Drexel University. Dr. Ehrlich's latest paradigm-changing hypothesis is that Alzheimer's disease results from a combination of chronic bacterial infections of the brain (primarily originating from the periodontium) and the brain's anti-microbial and inflammatory responses to these infections. Dr. Ehrlich was elected a fellow of the American Association for the Advancement of Science in 2014 and has won numerous awards for his research and teaching.

    Abstract

    Background:  The vast majority of all genes are contained within the genomes of the prokaryotes, including the eubacteria and the archaea.  These largely single-cellular domains of life thus contain most of the metabolic machinery housed within the earth’s biosphere.  The gene systems that encode this machinery include entire pathways for the biosynthesis and catabolism of literally millions of natural products.  A small subset of species within the eubacteria are also associated with disease as pathogens, and these organisms produce a specialized set of secondary metabolic products termed virulence factors.  The vast majority of all prokaryotic genes are either unannotated or under-annotated with respect to the functions of their encoded proteins.  

    Rationale:  To develop computational means to identify the specific genes, and the metabolic pathways that they encode, that underlie traits of interest for the manipulation of prokaryotic physiology to improve human life and health.

    Specific Aim of the current research: To develop generic unbiased computational means to identify unannotated bacterial genes associated with pathogenesis, virulence and tissue tropism.

    Results:  Our initial tools for the identification of novel bacterial virulence genes were adopted from the statistical genetic approaches used in eukaryotic gene mapping.  Following the statistical identification of our first set of candidate unannotated virulence genes from the human obligate pathogen Haemophilus influenzae, we demonstrated, using a combination of in vitro experiments and in vivo animal models, that the identified genes’ cognate proteins were actual virulence factors.  Follow-on studies of one of these novel proteins, Msf1, provided mechanistic details regarding its mode of action.  Subsequently, we developed random forest and neural network-based machine-learning approaches for a more thorough search of H. influenzae’s virulence/tropism genes.  Through multiple rounds of parameter tuning we developed a highly reliable random forest program that provided greater than 85% specificity in determining the actual disease (out of five diseases) from which a given bacterial strain was isolated.  Examining which genes the random forest classifier selects provides a rich source of novel unannotated genes from within the microbial genomic dark matter, allowing a focused approach to characterizing much new biology relating to pathogenesis.  It is interesting to note that four of the top five genes used by the classifier have no annotation whatsoever.  Using a second neural net approach, in this case for protein annotation, we have been able to assign, with a high degree of confidence, at least one GO (gene ontology) term for 14% of the 13,692 hypothetical proteins encoded by the Moraxella catarrhalis pan (supra) genome.
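
    The idea of classifying strains by disease of origin and then inspecting which genes the classifier relies on can be illustrated with a small sketch. This is not the authors' program: it is a toy random forest written with only the Python standard library, trained on a synthetic gene presence/absence matrix. The function names, the two labels "disease_A" and "disease_B", the six-gene toy data, and the impurity-based importance score are all assumptions made for illustration.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    """Most common label; used as the prediction at a leaf."""
    return Counter(labels).most_common(1)[0][0]


def build_tree(rows, labels, rng, n_try, importance, depth=0, max_depth=5):
    """Grow one tree on binary features (1 = gene present, 0 = absent)."""
    if depth >= max_depth or len(set(labels)) == 1:
        return majority(labels)
    best_gain, best_gene = 0.0, None
    for g in rng.sample(range(len(rows[0])), n_try):  # random gene subset
        absent = [y for r, y in zip(rows, labels) if not r[g]]
        present = [y for r, y in zip(rows, labels) if r[g]]
        if not absent or not present:
            continue
        child = (len(absent) * gini(absent) + len(present) * gini(present)) / len(labels)
        gain = gini(labels) - child
        if gain > best_gain:
            best_gain, best_gene = gain, g
    if best_gene is None:
        return majority(labels)
    # Credit the chosen gene with the impurity decrease, weighted by node size.
    importance[best_gene] += best_gain * len(labels)
    split = {0: ([], []), 1: ([], [])}
    for r, y in zip(rows, labels):
        side = split[1 if r[best_gene] else 0]
        side[0].append(r)
        side[1].append(y)
    return (best_gene,
            build_tree(*split[0], rng, n_try, importance, depth + 1, max_depth),
            build_tree(*split[1], rng, n_try, importance, depth + 1, max_depth))


def fit_forest(rows, labels, n_trees=50, seed=0):
    """Train n_trees trees on bootstrap samples; return (trees, importance)."""
    rng = random.Random(seed)
    n_genes = len(rows[0])
    n_try = max(1, int(n_genes ** 0.5))
    importance = [0.0] * n_genes
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        trees.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                                rng, n_try, importance))
    return trees, importance


def predict(trees, row):
    """Majority vote over all trees for one strain's gene profile."""
    votes = []
    for tree in trees:
        while isinstance(tree, tuple):
            gene, absent_branch, present_branch = tree
            tree = present_branch if row[gene] else absent_branch
        votes.append(tree)
    return majority(votes)


def make_toy_strains(n_strains=60, n_genes=6, seed=1):
    """Synthetic strains: gene 0 tracks the label, other genes are noise."""
    rng = random.Random(seed)
    rows, labels = [], []
    for i in range(n_strains):
        has_gene0 = i % 2
        rows.append([has_gene0] + [rng.randint(0, 1) for _ in range(n_genes - 1)])
        labels.append("disease_A" if has_gene0 else "disease_B")
    return rows, labels


rows, labels = make_toy_strains()
trees, importance = fit_forest(rows, labels)
ranked = sorted(range(len(importance)), key=lambda g: -importance[g])
print("genes ranked by importance:", ranked)
print("predicted:", predict(trees, [1, 0, 0, 0, 0, 0]))
```

    In this toy setting the gene that tracks the label should rise to the top of the importance ranking, which mirrors the step of mining the classifier's gene selection for candidate unannotated genes. Real analyses would use many more genes, five disease classes, held-out strains for estimating specificity, and an established library rather than this hand-rolled forest.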

    Discussion:  Through the combination of multiple machine learning algorithms we have developed the beginnings of a pipeline for the in silico identification and characterization of novel unannotated genes.

    Conclusions:  The development of high-throughput whole genome sequencing together with the creation of a suite of unbiased methods for the identification and characterization of unannotated prokaryotic genes that are associated with specific measurable traits will provide a universal method for targeted gene characterization leading to the discovery of novel biology underlying any metabolic process of interest.

    Learning Objectives:

    1. Understand that, even in this day of massively high-throughput whole genome sequencing, the vast majority of prokaryotic genes and gene systems are completely unannotated, meaning that we have no idea what the genes that we sequence encode.
    2. Understand that, through the application of machine-learning (artificial intelligence) approaches, we are beginning the development of a computational pipeline that can be used to: (a) identify the genes involved in a particular process; and (b) annotate the identified genes as to likely molecular function.
