Background: Haemophilus influenzae causes multiple human diseases at multiple sites in the body. The genetic mechanisms underlying these diseases remain elusive, particularly in species that occupy diverse ecological niches within the human body. Our lab and others have sequenced the whole genomes of over 1,600 strains of non-typeable Haemophilus influenzae (NTHi). These strains were isolated from multiple body sites of human patients with various disease states, as well as from colonized individuals without disease symptoms.
Methods: 1,618 genomes were assembled using sequencer-appropriate assembly software. Automatic gene annotation was performed using Prokka, and pan-genome gene cluster analysis was performed with Roary. A gene presence/absence matrix of 4,207 gene clusters was used as the set of gene ‘features’ to predict, for each isolate, A) the body site of isolation and B) the disease state of the patient. Additionally, ‘core’ genes (genes present in all NTHi strains) were converted to numeric vectors and used as a separate feature set. Three class-prediction algorithms were explored across the two feature sets.
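As an illustration of the presence/absence feature set described above, the following is a minimal sketch, not the pipeline used in this study. It assumes a Roary-style CSV with a `Gene` column and one column per isolate, in which an empty cell means the gene cluster is absent from that isolate. The function name `presence_absence_matrix`, the isolate names, and the toy table are hypothetical.

```python
import csv
import io


def presence_absence_matrix(csv_text, isolates):
    """Return (gene_names, {isolate: [0/1, ...]}) from a Roary-style table.

    Assumption: each row is one gene cluster, and a non-empty cell in an
    isolate's column means the cluster is present in that isolate.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    genes = []
    features = {name: [] for name in isolates}
    for row in reader:
        genes.append(row["Gene"])
        for name in isolates:
            cell = (row.get(name) or "").strip()
            features[name].append(1 if cell else 0)
    return genes, features


# Hypothetical toy table: three gene clusters across three isolates.
toy_csv = """Gene,Annotation,isolate_A,isolate_B,isolate_C
group_1,hypothetical protein,A_0001,B_0001,C_0001
group_2,adhesin,A_0002,,C_0002
group_3,transposase,,B_0003,
"""

genes, X = presence_absence_matrix(
    toy_csv, ["isolate_A", "isolate_B", "isolate_C"]
)
for name, vector in X.items():
    print(name, vector)
```

Each isolate's binary vector can then be passed to a classifier, together with its body site or disease state label.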
Results: Imbalance in the number of isolates per class proved challenging for the machine learning (ML) algorithms. All algorithms performed significantly better than the ‘No Information Rate’ (the accuracy obtained by always predicting the most common class) in predicting both body site and disease state, though in all cases predicting fewer, more balanced classes was associated with higher accuracy.
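To make the baseline comparison concrete, the sketch below computes the accuracy of always predicting the majority class and compares a classifier's accuracy against it. The abstract does not state which significance test was used; the one-sided exact binomial test shown here is one common choice and is an assumption. The labels and predictions are hypothetical toy values, not study data.

```python
from collections import Counter
from math import comb


def no_information_rate(labels):
    """Accuracy of always predicting the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def p_value_above_baseline(n_correct, n_total, baseline):
    """One-sided exact binomial test (an assumed choice of test).

    Probability of at least n_correct successes in n_total trials when the
    true success rate equals the baseline.
    """
    return sum(
        comb(n_total, k) * baseline**k * (1 - baseline) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )


# Hypothetical, imbalanced toy labels: 6 ear, 3 lung, 1 blood isolate.
y_true = ["ear"] * 6 + ["lung"] * 3 + ["blood"]
# Hypothetical predictions: the single blood isolate is misclassified.
y_pred = ["ear"] * 6 + ["lung"] * 3 + ["ear"]

baseline = no_information_rate(y_true)
acc = accuracy(y_true, y_pred)
n_correct = sum(t == p for t, p in zip(y_true, y_pred))
p = p_value_above_baseline(n_correct, len(y_true), baseline)
print(f"baseline={baseline:.2f} accuracy={acc:.2f} p={p:.4f}")
```

With strongly imbalanced classes the baseline is already high, which is one reason that predicting fewer, more balanced classes makes a gain over it easier to demonstrate.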
Conclusion: Both gene presence/absence and core gene genetic composition in NTHi strains can be used to predict ecological niche and disease state of origin. Future work is warranted, specifically increasing the number of genomes in classes with low representation and exploring additional methods and feature selection techniques.
Learning Objectives:
1. Define horizontal gene transfer in naturally competent bacterial species
2. Identify the importance of gene possession to phenotype
3. Explain current approaches that use gene sets to predict clinical provenance