MAY 02, 2019 11:51 AM PDT

An Explosion of Genetic Data Brings Errors That Grow

WRITTEN BY: Carmen Leitch

A team of researchers at Washington State University (WSU) wanted to know the minimum set of proteins that gram-negative microbes called Proteobacteria need to survive. The team compiled a dataset of 2,300 bacterial genomes containing sequences for nearly nine million proteins, and grouped similar sequences together. As they began to assess those sequences, they started finding errors in the publicly available genomic data that scientists, themselves included, rely on. Their study, which will probably have major implications for future research, has been reported in Frontiers in Microbiology.

A digitally colorized scanning electron microscope (SEM) image of red-colored Salmonella sp. bacteria in the process of invading a mustard-colored, ruffled immune cell. Salmonella are a type of Proteobacteria / Credit: National Institute of Allergy and Infectious Diseases (NIAID)

"Just in the last two years, researchers have sequenced more than twice the number of bacterial genomes as they did in the twenty years before that," said Shira Broschat, a professor in the School of Electrical Engineering and Computer Science at WSU.

A tool for genetic sequencing was pioneered in 1977 by Frederick Sanger and colleagues, commercialized, and brought into the research lab, where it had a huge impact. It took over ten years and $2.7 billion to sequence the human genome that way. Recent years have seen the advent of a new tool, next-generation sequencing, which is able to deliver a human genome sequence in under an hour for $1,500. This new technology quickly created a staggering amount of new genetic data.

Unfortunately, the WSU researchers have found that this massive amount of data brings with it a type of error that has become a serious problem, noted lead author Svetlana Lockwood, a graduate student in computer science. "A single annotation error can propagate rapidly because scientists build on previous annotation when they sequence new genomes," she explained. Gene annotations describe where the protein-coding regions of genes are located in a genome; they can help scientists ascertain gene function, for example.

"We found that for each of the proteins, there were mistakes in annotation of their genes, which resulted in truncated or missing sequences," Broschat added. 
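To make the idea of a truncated sequence concrete, here is a minimal sketch in Python. It is not the WSU team's method, which the article does not describe; it simply assumes that proteins have already been grouped by similarity and flags any member that is much shorter than the group's typical length. The function name, the sequence IDs, the toy sequences, and the 80 percent threshold are all hypothetical choices made for illustration.

```python
from statistics import median


def flag_possible_truncations(cluster, min_fraction=0.8):
    """Return IDs of sequences much shorter than the cluster's median length.

    `cluster` maps a (hypothetical) sequence ID to its amino-acid string.
    `min_fraction` is an arbitrary illustrative cutoff, not a published value.
    """
    typical_length = median(len(seq) for seq in cluster.values())
    return sorted(
        seq_id
        for seq_id, seq in cluster.items()
        if len(seq) < min_fraction * typical_length
    )


# Toy cluster of similar proteins; the last entry is missing its first residues.
cluster = {
    "genomeA_prot1": "MKTAYIAKQRQISFVKSHFSRQ",
    "genomeB_prot1": "MKTAYIAKQRQISFVKSHFSRQ",
    "genomeC_prot1": "MKTAYIAKQRQISFVKSHFSKQ",
    "genomeD_prot1": "AKQRQISFVKSHFSRQ",
}

print(flag_possible_truncations(cluster))  # ['genomeD_prot1']
```

A length check like this can only point at suspicious entries. A short sequence may be a genuine biological variant, so a flagged annotation would still need to be checked against the underlying genome.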

Other studies have identified annotation errors, but in this work, the WSU team listed and explained the different types of errors they found in currently available genetic data.

"With the scale of misannotation we found, researchers have to reevaluate the reliability of publicly available genome data for use in big data applications," Broschat noted.

These errors have both human and technological causes, said Kelly Brayton, a professor in the Department of Veterinary Microbiology and Pathology. DNA sequencing tools are good, but they aren’t perfect, so they sometimes make the wrong determination when identifying nucleotide bases. An incomplete understanding of proteins can also create problems.

This team still had to use computational tools to perform their analysis; it’s simply too much data for people to comb through by hand. They used information contained in the databases of the National Center for Biotechnology Information. Their efforts continue, as Brayton and Broschat seek to develop a tool that can find annotation errors in datasets.

Sources: AAAS/EurekAlert! via WSU, Frontiers in Microbiology

About the Author
  • Experienced research scientist and technical expert with authorships on over 30 peer-reviewed publications, traveler to over 70 countries, published photographer and internationally-exhibited painter, volunteer trained in disaster-response, CPR and DV counseling.