SEP 22, 2018 9:15 PM PDT

Tackling Speech and Object Recognition

WRITTEN BY: Nouran Amin

Image via Tech Hive

Computer scientists at MIT have opened new doors in speech and image recognition by creating a model that can identify objects within an image based solely on a spoken description of the image, known as an audio caption. Unlike current speech-recognition technology, the new model does not need manual transcriptions and annotations of the examples it is trained on. Instead, the system learns words directly from recorded speech clips and objects directly from raw images, and then associates them with one another. Although the system currently recognizes only several hundred different words, the researchers hope that their combined speech-object recognition technique could eventually save hours of manual labor.

"We wanted to do speech recognition in a way that's more natural, leveraging additional signals and information that humans have the benefit of using, but that machine learning algorithms don't typically have access to. We got the idea of training a model in a manner similar to walking a child through the world and narrating what you're seeing," says David Harwath, a researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Spoken Language Systems Group.

Additionally, one useful application of the new system is learning translations between languages without the need for a bilingual annotator. "There's potential there for a Babel Fish-type of mechanism," explains Harwath, referring to the fictitious living earpiece in "The Hitchhiker's Guide to the Galaxy".

Image via Electronic Design

In their research paper, the scientists modified the model to associate specific words with specific patches of pixels. They trained it on a database of images paired with correct and incorrect captions. One challenge is that, during training, the model does not have access to any alignment information between the speech and the image. "The biggest contribution of the paper," Harwath explains, "is demonstrating that these cross-modal [audio and visual] alignments can be inferred automatically by simply teaching the network which images and captions belong together and which pairs don't."
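
For readers curious how a network can be taught "which images and captions belong together and which pairs don't," the short Python sketch below shows one common way to express that idea: a margin ranking loss that rewards matched pairs for scoring higher than mismatched ones. It is an illustration only and not necessarily the objective used in the MIT paper. The function names, the dot-product similarity, the margin value, and the toy embedding vectors are all assumptions made for this example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def similarity(image_vec: Vector, caption_vec: Vector) -> float:
    """Dot product between an image embedding and a caption embedding (assumed score)."""
    return sum(a * b for a, b in zip(image_vec, caption_vec))


def ranking_loss(images: List[Vector], captions: List[Vector], margin: float = 1.0) -> float:
    """Margin ranking loss for a batch in which images[i] belongs with captions[i].

    Every other caption or image in the batch is treated as a mismatched pair.
    """
    total = 0.0
    n = len(images)
    for i in range(n):
        matched = similarity(images[i], captions[i])
        for j in range(n):
            if j == i:
                continue
            # Penalize a wrong caption that scores too close to the right one.
            total += max(0.0, margin + similarity(images[i], captions[j]) - matched)
            # Penalize a wrong image that scores too close to the right one.
            total += max(0.0, margin + similarity(images[j], captions[i]) - matched)
    return total


# Toy, made-up embeddings: two images and their two spoken captions.
images = [[2.0, 0.0], [0.0, 2.0]]
good_captions = [[2.0, 0.0], [0.0, 2.0]]     # caption i matches image i
swapped_captions = [[0.0, 2.0], [2.0, 0.0]]  # captions deliberately swapped

aligned_loss = ranking_loss(images, good_captions)
swapped_loss = ranking_loss(images, swapped_captions)
print(aligned_loss, swapped_loss)
```

In this toy setup the correctly paired batch incurs no penalty, while the swapped batch is penalized heavily. Minimizing a loss of this kind pushes a model toward embeddings in which matching images and captions score highest, without anyone labeling which word goes with which region.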

Source: MIT News

About the Author
  • Nouran earned her BS and MS in Biology at IUPUI and currently shares her love of science by teaching. She also enjoys writing on various topics, including science and medicine, global health, and conservation biology. She hopes her writing can make science more engaging and accessible to the general public.