Background: The vast majority of all genes are contained within the genomes of the prokaryotes, including the eubacteria and the archaea. These largely single-celled domains of life thus contain most of the metabolic machinery housed within the earth’s biosphere. The gene systems that encode this machinery include entire pathways for the biosynthesis and catabolism of millions of natural products. A small subset of species within the eubacteria is also associated with disease as pathogens, and these organisms produce a specialized set of secondary metabolic products termed virulence factors. Most prokaryotic genes are either unannotated or under-annotated with respect to the functions of their encoded proteins.
Rationale: To develop computational means to identify the specific genes, and the metabolic pathways that they encode, that underlie traits of interest for the manipulation of prokaryotic physiology to improve human life and health.
Specific Aim of the current research: To develop generic unbiased computational means to identify unannotated bacterial genes associated with pathogenesis, virulence and tissue tropism.
Results: Our initial tools for the identification of novel bacterial virulence genes were adapted from the statistical genetic approaches used in eukaryotic gene mapping. Following the statistical identification of our first set of candidate unannotated virulence genes from the human obligate pathogen Haemophilus influenzae, we demonstrated, using a combination of in vitro and in vivo animal model experiments, that the identified genes’ cognate proteins were actual virulence factors. Follow-on studies of one of these novel proteins, Msf1, provided mechanistic details regarding its mode of action. Subsequently, we developed random forest and neural network-based machine-learning approaches for a more thorough search of H. influenzae’s virulence/tropism genes. Through multiple rounds of parameter tuning, we developed a highly reliable random forest program that provided greater than 85% specificity in determining the actual disease (out of five diseases) from which a given bacterial strain was isolated. The genes selected by the random forest classifier are a rich source of novel unannotated genes from within the microbial genomic dark matter, and they allow a focused approach to characterizing new biology relating to pathogenesis. Notably, four of the top five genes used by the classifier have no annotation whatsoever. Using a second approach, a neural network for protein annotation, we have been able to assign, with a high degree of confidence, at least one GO (gene ontology) term to 14% of the 13,692 hypothetical proteins encoded by the Moraxella catarrhalis pan-genome (supragenome).
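As an illustration of what a specificity figure can mean when strains are assigned to one of five diseases, the following minimal Python sketch computes one-vs-rest specificity for each class. It rests on assumptions: the abstract does not state how specificity was calculated, the labels `disease_A` to `disease_E` are placeholders, and the ten strain assignments are toy data invented for the example, not study results.

```python
from collections import Counter


def one_vs_rest_specificity(true_labels, predicted_labels):
    """Return {label: specificity}, treating each label as 'positive' in turn.

    Specificity for a label = true negatives / all strains whose true
    source is some other label.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    labels = sorted(set(true_labels) | set(predicted_labels))
    # Count each (true, predicted) pair once; this is a sparse confusion matrix.
    pairs = Counter(zip(true_labels, predicted_labels))

    result = {}
    for label in labels:
        # Strains that truly came from a different disease.
        negatives = sum(n for (t, _), n in pairs.items() if t != label)
        # Of those, the ones wrongly assigned to this disease.
        false_pos = sum(
            n for (t, p), n in pairs.items() if t != label and p == label
        )
        result[label] = (
            (negatives - false_pos) / negatives if negatives else float("nan")
        )
    return result


# Toy data (hypothetical): two strains per placeholder disease.
true_sources = [
    "disease_A", "disease_A", "disease_B", "disease_B", "disease_C",
    "disease_C", "disease_D", "disease_D", "disease_E", "disease_E",
]
predicted_sources = [
    "disease_A", "disease_B", "disease_B", "disease_B", "disease_C",
    "disease_A", "disease_D", "disease_D", "disease_E", "disease_D",
]

specificity = one_vs_rest_specificity(true_sources, predicted_sources)
for disease, value in specificity.items():
    print(f"{disease}: {value:.3f}")
```

With five classes, each disease gets its own specificity, so a single summary figure would have to be an average or a minimum across classes; which of these the reported value reflects is not specified in the text.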
Discussion: By combining multiple machine-learning algorithms, we have developed the beginnings of a pipeline for the in silico identification and characterization of novel unannotated genes.
Conclusions: The development of high-throughput whole genome sequencing together with the creation of a suite of unbiased methods for the identification and characterization of unannotated prokaryotic genes that are associated with specific measurable traits will provide a universal method for targeted gene characterization leading to the discovery of novel biology underlying any metabolic process of interest.
1. Understand that, even in this day of massively high-throughput whole genome sequencing, the vast majority of prokaryotic genes and gene systems are completely unannotated, meaning that we have no idea what the genes that we sequence encode.
2. Through the application of machine-learning (artificial intelligence) approaches, we are beginning the development of a computational pipeline that can be used to: (a) identify the genes involved in a particular process; and (b) then annotate the identified genes as to likely molecular function.