SEP 22, 2018 9:15 PM PDT

Tackling Speech and Object Recognition

WRITTEN BY: Nouran Amin

Image via Tech Hive

Computer scientists at MIT have opened new doors in speech and image recognition by creating a model that can identify objects within an image based solely on a spoken description of the image, known as an audio caption. Unlike current speech-recognition technology, the new model does not need manual transcriptions and annotations of the examples it is trained on. Instead, the system learns words directly from recorded speech clips and objects directly from raw images, and then associates them with one another. Although the system currently recognizes only several hundred different words, the researchers hope that their combined speech-object recognition technique could eventually save hours of manual labor.

"We wanted to do speech recognition in a way that's more natural, leveraging additional signals and information that humans have the benefit of using, but that machine learning algorithms don't typically have access to. We got the idea of training a model in a manner similar to walking a child through the world and narrating what you're seeing," says David Harwath, a researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Spoken Language Systems Group.

Additionally, one useful application of the new system is learning translations between languages without the need for a bilingual annotator. "There's potential there for a Babel Fish-type of mechanism," explains Harwath, referring to the fictitious living earpiece in "The Hitchhiker's Guide to the Galaxy".

Image via Electronic Design

In their research paper, the scientists modified the model to associate specific words with specific patches of pixels. They trained the model on a database of images and captions, presenting it with both correct and incorrect pairings. One challenge is that, during training, the model doesn't have access to any alignment information between the speech and the image. "The biggest contribution of the paper," Harwath explains, "is demonstrating that these cross-modal [audio and visual] alignments can be inferred automatically by simply teaching the network which images and captions belong together and which pairs don't."
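
To make the idea of learning from correct and incorrect pairs more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the researchers' actual code: the article does not describe the scoring function or the loss, so the dot-product similarity, the margin-based loss, and every name below (`pair_score`, `margin_ranking_loss`, `best_patch_per_frame`) are hypothetical. Image patches and audio frames are assumed to already be represented as small lists of numbers.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Similarity between one image-patch vector and one audio-frame vector."""
    return sum(x * y for x, y in zip(a, b))


def pair_score(patches: List[Vector], frames: List[Vector]) -> float:
    """Score an (image, caption) pair: for each audio frame take its
    best-matching image patch, then average over all frames.
    (Assumed scoring rule, chosen for illustration only.)"""
    return sum(max(dot(p, f) for p in patches) for f in frames) / len(frames)


def margin_ranking_loss(patches: List[Vector], frames: List[Vector],
                        wrong_patches: List[Vector], wrong_frames: List[Vector],
                        margin: float = 1.0) -> float:
    """Penalize the model unless the matched pair outscores both
    mismatched pairs by at least `margin` (an assumed hyperparameter)."""
    matched = pair_score(patches, frames)
    wrong_image = max(0.0, margin + pair_score(wrong_patches, frames) - matched)
    wrong_caption = max(0.0, margin + pair_score(patches, wrong_frames) - matched)
    return wrong_image + wrong_caption


def best_patch_per_frame(patches: List[Vector], frames: List[Vector]) -> List[int]:
    """Read off an alignment: the index of the most similar patch for each frame."""
    return [max(range(len(patches)), key=lambda i: dot(patches[i], f))
            for f in frames]


# Toy data: two image patches and two audio frames per example.
patches_a = [[1.0, 0.0], [0.0, 1.0]]
frames_a = [[2.0, 0.0], [0.0, 2.0]]
patches_b = [[-1.0, 0.0], [0.0, -1.0]]
frames_b = [[-2.0, 0.0], [0.0, -2.0]]

loss = margin_ranking_loss(patches_a, frames_a, patches_b, frames_b)
alignment = best_patch_per_frame(patches_a, frames_a)
```

Nothing in this sketch tells the model which word goes with which patch. The only training signal is whether an image and a caption belong together, and the word-to-region alignment is read off afterwards from the similarities, which is the kind of automatic inference Harwath describes.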

Source: MIT News

About the Author
  • Nouran earned her BS and MS in Biology at IUPUI and currently shares her love of science by teaching. She also enjoys writing on various topics, including science and medicine, global health, and conservation biology. Through her writing, she hopes to make science more engaging and accessible to the general public.