Applications of Machine Learning to Predict Clinical Provenance of Haemophilus influenzae

C.E. Credits: P.A.C.E. CE Florida CE
Speaker
  • Head of Bioinformatics, Drexel University College of Medicine
    Biography

      I started my undergraduate degree in computer science before transitioning into biology at St. Lawrence University, where I focused on environmental ecology, using satellite imagery analysis to estimate biodiversity by relating field data to satellite images, and obtained my BS degree. After a few years working as a microbiologist, I was formally trained in computational biology and bioinformatics at Carnegie Mellon University for an MS degree, and then earned a PhD at Drexel University. For the last 10 years I have been working as the Head of Bioinformatics in Dr. Garth Ehrlich’s laboratory. I am responsible for developing new data processing pipelines and the subsequent data analyses for cutting-edge DNA sequencing platforms, including the PacBio RSII/Sequel. These implementations consist of both currently published techniques and novel analyses designed to answer questions specific to our research (mainly on chronic bacterial pathogens). Our work combines high-throughput genomics, novel microbiome techniques, and machine learning to tease out biological meaning from these enormous datasets.


    Abstract

    Background: Haemophilus influenzae is the causative agent of multiple human disease conditions at multiple sites in the human body. The underlying genetic mechanisms are elusive, particularly in species with diverse ecological niches in the human body. Our lab and others have sequenced the whole genomes of over 1,600 strains of Non-typeable Haemophilus influenzae (NTHi). These strains were isolated from human patients with various disease states (as well as from carriers colonized without disease symptoms) and from multiple locations within and on the subjects’ bodies.

    Methods: 1,618 genomes were assembled using sequencer-appropriate assembly software. Automatic gene annotation was performed using Prokka, and pan-genome gene cluster analysis was performed with Roary. A gene presence/absence matrix of 4,207 gene clusters was used as the set of gene ‘features’ to predict, for each isolate, A) the body site of isolation and B) the disease state of the patient. Additionally, ‘core’ genes (genes present in all NTHi strains) were converted to numeric vectors and used as a separate feature set. Three class prediction algorithms were explored on each of the two feature sets.
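
    The feature construction described in the Methods can be illustrated with a minimal sketch. It is not the authors' pipeline: the isolate names, gene cluster names, and body-site labels below are invented toy data, and the 1-nearest-neighbour classifier on Hamming distance is only a stand-in, since the abstract does not name the three algorithms that were used.

```python
from typing import Dict, List, Set, Tuple


def presence_absence_matrix(
    isolates: Dict[str, Set[str]],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (isolate names, gene cluster names, 0/1 matrix).

    Rows are isolates, columns are gene clusters; 1 means the isolate
    carries at least one gene from that cluster.
    """
    names = sorted(isolates)
    clusters = sorted(set().union(*isolates.values()))
    matrix = [[1 if c in isolates[n] else 0 for c in clusters] for n in names]
    return names, clusters, matrix


def core_clusters(clusters: List[str], matrix: List[List[int]]) -> List[str]:
    """Clusters present in every isolate (columns that are all ones)."""
    return [c for j, c in enumerate(clusters) if all(row[j] == 1 for row in matrix)]


def hamming(a: List[int], b: List[int]) -> int:
    """Number of gene clusters on which two isolates differ."""
    return sum(x != y for x, y in zip(a, b))


def predict_1nn(
    train_rows: List[List[int]], train_labels: List[str], query: List[int]
) -> str:
    """Label of the closest training isolate (stand-in classifier)."""
    best = min(range(len(train_rows)), key=lambda i: hamming(train_rows[i], query))
    return train_labels[best]


# Hypothetical toy data: four isolates and their gene clusters.
toy_isolates = {
    "iso1": {"geneA", "geneB", "geneC"},
    "iso2": {"geneA", "geneB"},
    "iso3": {"geneA", "geneD", "geneE"},
    "iso4": {"geneA", "geneD"},
}
toy_sites = {"iso1": "ear", "iso2": "ear", "iso3": "lung", "iso4": "lung"}

names, clusters, matrix = presence_absence_matrix(toy_isolates)
core = core_clusters(clusters, matrix)

# Hold out iso4 and predict its body site from the other three isolates.
train_rows = matrix[:3]
train_labels = [toy_sites[n] for n in names[:3]]
predicted_site = predict_1nn(train_rows, train_labels, matrix[3])
```

    In this toy set only geneA is shared by all isolates, so it is the sole ‘core’ cluster; the remaining columns are the accessory features that distinguish the isolates.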

    Results: Class imbalance within the dataset proved challenging for the machine learning (ML) algorithms' predictions. All algorithms performed significantly better than the ‘No Information Criteria’ baseline in predicting either body site or disease state, though in all cases predicting fewer and more balanced classes was correlated with higher accuracy.
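
    To make the baseline comparison concrete, the sketch below assumes that ‘No Information Criteria’ refers to the accuracy obtained by always predicting the most common class (often called the no-information rate). The labels and predictions are invented toy values, and the one-sided exact binomial test is one common way to ask whether accuracy exceeds that baseline, not necessarily the test used in this study.

```python
from collections import Counter
from math import comb
from typing import Dict, List


def no_information_rate(labels: List[str]) -> float:
    """Accuracy of always guessing the most frequent class."""
    return max(Counter(labels).values()) / len(labels)


def accuracy(true: List[str], pred: List[str]) -> float:
    """Fraction of isolates whose class was predicted correctly."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def recall_by_class(true: List[str], pred: List[str]) -> Dict[str, float]:
    """Per-class recall, which exposes weak performance on rare classes."""
    totals = Counter(true)
    hits = Counter(t for t, p in zip(true, pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


def p_value_above_baseline(n_correct: int, n_total: int, baseline: float) -> float:
    """One-sided exact binomial test: P(X >= n_correct) if true accuracy == baseline."""
    return sum(
        comb(n_total, k) * baseline**k * (1 - baseline) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )


# Hypothetical, imbalanced toy data: 8 "ear" isolates and 2 "lung" isolates.
true_sites = ["ear"] * 8 + ["lung"] * 2
pred_sites = ["ear"] * 8 + ["lung", "ear"]

nir = no_information_rate(true_sites)
acc = accuracy(true_sites, pred_sites)
recalls = recall_by_class(true_sites, pred_sites)
n_correct = sum(t == p for t, p in zip(true_sites, pred_sites))
p_value = p_value_above_baseline(n_correct, len(true_sites), nir)
```

    With imbalanced classes the baseline is already high, so overall accuracy alone can look strong while recall on the rare class stays low; reporting both is one way to see the effect described above.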

    Conclusion: Both gene presence/absence and core gene genetic composition information in NTHi strains can successfully be used to predict both the ecological niche and the disease state of origin. Future work is warranted, specifically increasing the number of genomes in classes with low representation and exploring additional methods and feature selection techniques.

    Learning Objectives:

    1. Define horizontal gene transfer in naturally competent bacterial species

    2. Identify the importance of gene possession to phenotype

    3. Explain current approaches that use gene sets to predict clinical provenance

