Transcriptome Assembly: Computational Challenges of Next-Generation Sequence Data

Presented at: Genetics and Genomics Virtual Event Series 2015

Speaker

Steven L Salzberg, PhD

Director, Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine; Professor, Departments of Biomedical Engineering, Computer Science, and Biostatistics

Abstract

Next-generation sequencing (NGS) technology allows us to peer inside the cell in exquisite detail, revealing new insights into biology, evolution, and disease that would have been impossible to discover just a few years ago. The enormous volumes of data produced by NGS experiments present many computational challenges that we are working to address. One of the most widely used sequencing methods is RNA-seq, which captures the genes being transcribed in a cell and uses sequencing to measure their levels of expression. In recent years, my lab has developed multiple systems for RNA-seq analysis, including the widely used Bowtie, TopHat, and Cufflinks programs for alignment and assembly of transcripts from RNA-seq data. In this presentation, I will describe two new systems, each of which represents a major step forward: (1) the HISAT system for spliced alignment of NGS reads, a successor to TopHat; and (2) the StringTie program for assembly and quantitation of RNA-seq data, a successor to Cufflinks. This talk describes joint work with Daehwan Kim and Mihaela Pertea.

Learning Objectives:

1. Explain the overall process used to turn RNA sequence data into a summary of genes and their expression levels.
2. Describe why it is difficult to align a short RNA or DNA sequence to the human genome.
3. Appreciate the computational challenge of assembling a complete, correct set of transcripts from a large next-generation sequencing experiment.
