SEP 22, 2018 09:15 PM PDT

Tackling Speech and Object Recognition

WRITTEN BY: Nouran Amin

Image via Tech Hive

Computer scientists at MIT have opened new doors in speech and image recognition systems by creating a model system that can identify objects within an image based solely on the spoken descriptions of the image; an audio caption. Despite current speech-recognition technology, the new model will not need manual transcriptions and annotations of the examples it is trained on. The system instead adapts to words directly from recorded speech clips and objects placed raw images, and then associates them with one another. Even though the new systems currently recognizes only several different hundred words, researchers are hopeful in the future that their combined speech-object recognition technique can be useful instead of hours of manual labor

"We wanted to do speech recognition in a way that's more natural, leveraging additional signals and information that humans have the benefit of using, but that machine learning algorithms don't typically have access to. We got the idea of training a model in a manner similar to walking a child through the world and narrating what you're seeing," says David Harwath, a researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Spoken Language Systems Group.

Additionally, on useful application of the new system is replacing a bilingual annotator by learning the translations between languages, without the need of a bilingual annotator. "There's potential there for a Babel Fish-type of mechanism," explains Harwath by referring to the fictitious living earpiece in the "Hitchhiker's Guide to the Galaxy".

Image via Electronic Design

In their research paper, the scientists altered the model system to combine specific words with patches of pixels. They trained the model on a database system giving correct and incorrect images with captions. However, there exists a challenge during the training where the model doesn’t have access to alignment information between the speech and the image. "The biggest contribution of the paper," Harwath explains, "is demonstrating that these cross-modal [audio and visual] alignments can be inferred automatically by simply teaching the network which images and captions belong together and which pairs don't."

Source: MIT news

About the Author
  • Nouran enjoys writing on various topics including science & medicine, global health, and conservation biology. She hopes through her writing she can make science more engaging and communicable to the general public.
You May Also Like
NOV 06, 2018
Space & Astronomy
NOV 06, 2018
Here's Where NASA Will Land its Martian InSight Mission Later This Month
In May, an Atlas V rocket blasted off from California’s Vandenberg Air Force Base and lofted NASA’s InSight mission into space so that it could...
NOV 11, 2018
Space & Astronomy
NOV 11, 2018
Rocket Lab Successfully Sends Electron Rocket on its First Commercial Flight
SpaceX’s Falcon 9 rocket quickly became of the most prominent means of commercial and private satellite launches, but sending such a massive rocket t...
NOV 12, 2018
Plants & Animals
NOV 12, 2018
Steaks Aren't the Only Things We Get From Cows
It’s no secret that cows are routinely slaughtered for beef, but what if we told you that only about 60% of the cow gets harvested for food? Fret not...
DEC 03, 2018
Neuroscience
DEC 03, 2018
Stentrode: No Surgery Focused Brain stimulation
Brain stentrode, a brain interfacing electrode that is implanted within a blood vessel...
DEC 10, 2018
Space & Astronomy
DEC 10, 2018
Learn More About How NASA Built the Parker Solar Probe
When NASA built the Solar Parker Probe to embark on its mission to study the Sun, they knew it’d need to be built with bleeding-edge technology to ma...
DEC 18, 2018
Technology
DEC 18, 2018
Battery-free Implantable Device for Weight Loss
Obesity is considered a rising epidemic affecting more than 700 million adults and children. Measures to control obesity have ranged from risky weight-loss...
Loading Comments...