SEP 22, 2018 9:15 PM PDT

Tackling Speech and Object Recognition

WRITTEN BY: Nouran Amin

Image via Tech Hive

Computer scientists at MIT have opened new doors in speech and image recognition by creating a model that can identify objects within an image based solely on a spoken description of the image, known as an audio caption. Unlike current speech-recognition technology, the new model does not need manual transcriptions and annotations of the examples it is trained on. Instead, the system learns words directly from recorded speech clips and objects directly from raw images, and then associates them with one another. Even though the system currently recognizes only several hundred different words, the researchers are hopeful that their combined speech-object recognition technique could one day save hours of manual labor.

"We wanted to do speech recognition in a way that's more natural, leveraging additional signals and information that humans have the benefit of using, but that machine learning algorithms don't typically have access to. We got the idea of training a model in a manner similar to walking a child through the world and narrating what you're seeing," says David Harwath, a researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Spoken Language Systems Group.

Additionally, one useful application of the new system is learning translations between languages without the need for a bilingual annotator. "There's potential there for a Babel Fish-type of mechanism," explains Harwath, referring to the fictitious living earpiece in "The Hitchhiker's Guide to the Galaxy".

Image via Electronic Design

In their research paper, the scientists modified the model to associate specific words with specific patches of pixels. They trained the model on a database of images and captions, presented as both correct and incorrect pairings. The challenge is that, during training, the model has no access to alignment information between the speech and the image. "The biggest contribution of the paper," Harwath explains, "is demonstrating that these cross-modal [audio and visual] alignments can be inferred automatically by simply teaching the network which images and captions belong together and which pairs don't."
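
To make the idea of learning from matched and mismatched pairs more concrete, here is a minimal sketch in Python. It is not the researchers' implementation. It assumes, for illustration only, that an image is represented as a list of patch embeddings, that a spoken caption is a list of audio-frame embeddings, and that training uses a margin-based ranking loss, which is a common general technique for this kind of problem. The function names, the toy vectors, and the margin value are all hypothetical.

```python
# Illustrative sketch only: the names, toy embeddings, and margin are assumptions,
# not details taken from the MIT paper.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def similarity(patches, frames):
    """Score an image/caption pair.

    For every audio frame, find the image patch it matches best,
    then average those best-match scores over all frames.
    """
    return sum(max(dot(p, f) for p in patches) for f in frames) / len(frames)

def ranking_loss(image, caption, wrong_image, wrong_caption, margin=1.0):
    """Penalize the model when a mismatched pair scores within `margin`
    of the matched pair. No word-to-patch alignment labels are used."""
    matched = similarity(image, caption)
    loss_caption = max(0.0, margin + similarity(image, wrong_caption) - matched)
    loss_image = max(0.0, margin + similarity(wrong_image, caption) - matched)
    return loss_caption + loss_image

# Toy 2-D embeddings: "dog" content points along the first axis,
# "cat" content along the second.
dog_image = [[1.0, 0.0], [0.9, 0.1]]
dog_caption = [[1.0, 0.0], [0.8, 0.0]]
cat_image = [[0.0, 1.0], [0.1, 0.9]]
cat_caption = [[0.0, 1.0], [0.0, 0.8]]

# Correct pairing treated as the match: small loss.
good = ranking_loss(dog_image, dog_caption, cat_image, cat_caption)
# Wrong pairing treated as the match: large loss.
bad = ranking_loss(dog_image, cat_caption, cat_image, dog_caption)
print(good, bad)
```

In this sketch the only supervision is which image and caption belong together. Any link between a particular word and a particular patch would have to emerge from the best-match step inside the similarity score, which is the kind of automatically inferred alignment Harwath describes.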

Source: MIT News

About the Author
  • Nouran is a scientist, educator, and life-long learner with a passion for making science more communicable. When not busy in the lab isolating blood macrophages, she enjoys writing on various STEM topics.