SEP 11, 2019 1:30 PM PDT

Development of Machine Learning Algorithms to Data Mine Meta-Omic Characterizations of Complex Microbiota

Sponsored by: PacBio

Abstract

Background: The vast majority of all genes are contained within the genomes of the prokaryotes, which comprise the eubacteria and the archaea. These largely single-celled domains of life thus contain most of the metabolic machinery housed within Earth’s biosphere. The gene systems that encode this machinery include entire pathways for the biosynthesis and catabolism of millions of natural products. A small subset of species within the eubacteria are also associated with disease as pathogens, and these organisms produce a specialized set of secondary metabolic products termed virulence factors. Most prokaryotic genes are either unannotated or under-annotated with respect to the functions of their encoded proteins.

Rationale:  To develop computational means to identify the specific genes, and the metabolic pathways that they encode, that underlie traits of interest for the manipulation of prokaryotic physiology to improve human life and health.

Specific Aim of the current research: To develop generic unbiased computational means to identify unannotated bacterial genes associated with pathogenesis, virulence and tissue tropism.

Results: Our initial tools for the identification of novel bacterial virulence genes were adapted from the statistical genetic approaches used in eukaryotic gene mapping. After statistically identifying our first set of candidate unannotated virulence genes from the human obligate pathogen Haemophilus influenzae, we demonstrated, using a combination of in vitro and in vivo animal model experiments, that the identified genes’ cognate proteins were actual virulence factors. Follow-on studies of one of these novel proteins, Msf1, provided mechanistic details regarding its mode of action. Subsequently, we developed random forest and neural network-based machine-learning approaches for a more thorough search for H. influenzae’s virulence/tropism genes. Through multiple rounds of parameter tuning we developed a highly reliable random forest program that identified, with greater than 85% specificity, which of five diseases a given bacterial strain was isolated from. Examining the genes selected by the random forest classifier reveals a rich source of novel unannotated genes from within the microbial genomic dark matter, which will allow a focused approach to characterizing much new biology relating to pathogenesis. Notably, four of the top five genes used by the classifier have no annotation whatsoever. Using a second approach, a neural net for protein annotation, we have been able to assign, with a high degree of confidence, at least one GO (gene ontology) term to 14% of the 13,692 hypothetical proteins encoded by the Moraxella catarrhalis pan-genome (supragenome).
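As a rough illustration of what a specificity figure means for a five-way disease classifier, the sketch below computes one-vs-rest specificity for each class from a confusion matrix. The abstract does not state how specificity was calculated, so the one-vs-rest definition is an assumption here, and the disease labels and counts are invented for the example; they are not data from the study.

```python
def per_class_specificity(confusion, labels):
    """Return one-vs-rest specificity for each class.

    confusion[i][j] is the number of strains whose true class is
    labels[i] and whose predicted class is labels[j].
    Specificity for a class = TN / (TN + FP), treating that class as
    "positive" and all other classes together as "negative".
    """
    total = sum(sum(row) for row in confusion)
    result = {}
    for k, label in enumerate(labels):
        true_pos = confusion[k][k]
        false_neg = sum(confusion[k]) - true_pos                # row k, off-diagonal
        false_pos = sum(row[k] for row in confusion) - true_pos  # column k, off-diagonal
        true_neg = total - true_pos - false_neg - false_pos
        denom = true_neg + false_pos
        result[label] = true_neg / denom if denom else float("nan")
    return result


# Hypothetical labels and counts (20 strains per disease), for illustration only.
labels = ["disease_A", "disease_B", "disease_C", "disease_D", "disease_E"]
confusion = [
    [18, 1, 0, 1, 0],
    [2, 16, 1, 0, 1],
    [0, 1, 17, 1, 1],
    [1, 0, 2, 15, 2],
    [0, 1, 1, 1, 17],
]

specificity = per_class_specificity(confusion, labels)
for label in labels:
    print(f"{label}: specificity = {specificity[label]:.3f}")
```

Under this definition each disease gets its own specificity value, so a single summary figure such as "greater than 85%" could be a per-class minimum or an average; the abstract does not say which.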

Discussion: By combining multiple machine-learning algorithms, we have developed the beginnings of a pipeline for the in silico identification and characterization of novel unannotated genes.

Conclusions:  The development of high-throughput whole genome sequencing together with the creation of a suite of unbiased methods for the identification and characterization of unannotated prokaryotic genes that are associated with specific measurable traits will provide a universal method for targeted gene characterization leading to the discovery of novel biology underlying any metabolic process of interest.

Learning Objectives:

1. Understand that, even in this era of massively high-throughput whole-genome sequencing, the vast majority of prokaryotic genes and gene systems are completely unannotated, meaning that we have no idea what the genes we sequence encode.
2. Understand how machine-learning (artificial intelligence) approaches are being used to begin developing a computational pipeline that can (a) identify the genes involved in a particular process and (b) annotate the identified genes as to likely molecular function.

 

