
Multimodal Machine Learning: A Survey and Taxonomy


A great summary of recent progress in multimodal affect recognition was published by D'Mello et al. [52]. Their meta-analysis revealed that a majority of recent work on multimodal affect recognition shows an improvement when using more than one modality, but that this improvement is reduced when recognizing naturally occurring emotions.


Transcription of Multimodal Machine Learning: A Survey and Taxonomy

Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency

Abstract: Our experience of the world is multimodal: we see objects, hear sounds, feel texture, smell odors, and taste flavors. A modality refers to the way in which something happens or is experienced, and a research problem is characterized as multimodal when it includes multiple such modalities.

In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.

Index Terms: Multimodal, machine learning, introductory, survey

1 INTRODUCTION

The world surrounding us involves multiple modalities: we see objects, hear sounds, feel texture, smell odors, and so on.

In general terms, a modality refers to the way in which something happens or is experienced. Most people associate the word modality with the sensory modalities, which represent our primary channels of communication and sensation, such as vision or touch. A research problem or dataset is therefore characterized as multimodal when it includes multiple such modalities. In this paper we focus primarily, but not exclusively, on three modalities: natural language, which can be both written or spoken; visual signals, which are often represented with images or videos; and vocal signals, which encode sounds and para-verbal information such as prosody and vocal expressions.

In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret and reason about multimodal messages. Multimodal machine learning aims to build models that can process and relate information from multiple modalities.

From early research on audio-visual speech recognition to the recent explosion of interest in language and vision models, multimodal machine learning is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential.

The research field of multimodal machine learning brings some unique challenges for computational researchers given the heterogeneity of the data. Learning from multimodal sources offers the possibility of capturing correspondences between modalities and gaining an in-depth understanding of natural phenomena. In this paper we identify and explore five core technical challenges (and related sub-challenges) surrounding multimodal machine learning.

T. Baltrušaitis is with Microsoft Corporation, Cambridge, UK.

C. Ahuja and L.-P. Morency are with the Language Technologies Institute at Carnegie Mellon University, Pittsburgh, Pennsylvania. Manuscript received May 18, 2017.

These challenges are central to the multimodal setting and need to be tackled in order to progress the field. Our taxonomy goes beyond the typical early and late fusion split (the split is illustrated in a short sketch after Table 1) and consists of the five following challenges:

1) Representation: A first fundamental challenge is learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy of multiple modalities. The heterogeneity of multimodal data makes it challenging to construct such representations. For example, language is often symbolic while audio and visual modalities will be represented as signals.

2) Translation: A second challenge addresses how to translate (map) data from one modality to another. Not only is the data heterogeneous, but the relationship between modalities is often open-ended or subjective. For example, there exist a number of correct ways to describe an image, and one perfect translation may not exist.

3) Alignment: A third challenge is to identify the direct relations between (sub)elements from two or more different modalities. For example, we may want to align the steps in a recipe to a video showing the dish being made. To tackle this challenge we need to measure similarity between different modalities and deal with possible long-range dependencies and ambiguities.

4) Fusion: A fourth challenge is to join information from two or more modalities to perform a prediction. For example, for audio-visual speech recognition, the visual description of the lip motion is fused with the speech signal to predict spoken words. The information coming from different modalities may have varying predictive power and noise topology, with possibly missing data in at least one of the modalities.

5) Co-learning: A fifth challenge is to transfer knowledge between modalities, their representations, and their predictive models. This is exemplified by algorithms of co-training, conceptual grounding, and zero-shot learning. Co-learning explores how knowledge learned from one modality can help a computational model trained on a different modality. This challenge is particularly relevant when one of the modalities has limited resources (e.g., annotated data).

For each of these five challenges, we define taxonomic classes and sub-classes to help structure the recent work in this emerging research field of multimodal machine learning.

Table 1: A summary of applications enabled by multimodal machine learning. For each application area we identify the core technical challenges that need to be addressed in order to tackle it.

Speech recognition: audio-visual speech recognition
Event detection: action classification; multimedia event detection
Emotion and affect: recognition; synthesis
Media description: image description; video description; visual question-answering; media summarization
Multimedia retrieval: cross-modal retrieval; cross-modal hashing
Multimedia generation: (visual) speech and sound synthesis; image and scene generation
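The early versus late fusion split that this taxonomy generalizes is easiest to see in code. Below is a minimal sketch, assuming PyTorch and hypothetical feature dimensions and module names (a 40-dimensional audio vector, a 128-dimensional visual vector, a binary prediction); it illustrates the two classic designs rather than any specific system from this survey.

import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Feature-level fusion: concatenate modality features, learn jointly."""
    def __init__(self, audio_dim=40, visual_dim=128, n_classes=2):
        super().__init__()
        self.joint = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, audio, visual):
        # Fusion happens before any prediction is made.
        return self.joint(torch.cat([audio, visual], dim=-1))

class LateFusion(nn.Module):
    """Decision-level fusion: one predictor per modality, combine outputs."""
    def __init__(self, audio_dim=40, visual_dim=128, n_classes=2):
        super().__init__()
        self.audio_head = nn.Linear(audio_dim, n_classes)
        self.visual_head = nn.Linear(visual_dim, n_classes)

    def forward(self, audio, visual):
        # Averaging per-modality logits is one simple decision-level rule.
        return 0.5 * (self.audio_head(audio) + self.visual_head(visual))

audio, visual = torch.randn(8, 40), torch.randn(8, 128)
print(EarlyFusion()(audio, visual).shape)  # torch.Size([8, 2])
print(LateFusion()(audio, visual).shape)   # torch.Size([8, 2])

The fusion challenge above is precisely about the space of designs between these two extremes, for example fusing intermediate learned representations rather than raw features or final decisions.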

We start with a discussion of the main applications of multimodal machine learning (Section 2), followed by a discussion of the recent developments on all five core technical challenges facing multimodal machine learning: representation (Section 3), translation (Section 4), alignment (Section 5), fusion (Section 6), and co-learning (Section 7). We conclude with a discussion in Section 8.

2 APPLICATIONS: A HISTORICAL PERSPECTIVE

Multimodal machine learning enables a wide range of applications, from audio-visual speech recognition to image captioning. In this section we present a brief history of multimodal applications, from its beginnings in audio-visual speech recognition to the recently renewed interest in language and vision applications.

One of the earliest examples of multimodal research is audio-visual speech recognition (AVSR) [251].

It was motivated by the McGurk effect [143], an interaction between hearing and vision during speech perception. When human subjects heard the syllable /ba-ba/ while watching the lips of a person saying /ga-ga/, they perceived a third sound: /da-da/. These results motivated many researchers from the speech community to extend their approaches with visual information. Given the prominence of hidden Markov models (HMMs) in the speech community at the time [99], it is not surprising that many of the early models for AVSR were based on various HMM extensions [25], [26]. While research into AVSR is not as common these days, it has seen renewed interest from the deep learning community [157].

While the original vision of AVSR was to improve speech recognition performance (e.g., word error rate) in all contexts, the experimental results showed that the main advantage of visual information was when the speech signal was noisy (i.e., low signal-to-noise ratio) [78], [157], [251].
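The finding that visual information helps most at low signal-to-noise ratios is often operationalized in multistream models by weighting each modality's evidence by an estimate of its reliability. The sketch below is a simplified illustration of that idea, not the formulation used in the cited AVSR systems; the logistic mapping from SNR to stream weight and the function name are assumptions made for exposition.

import numpy as np

def fuse_stream_logliks(audio_loglik, visual_loglik, snr_db):
    """Combine per-class log-likelihoods from an audio stream and a visual
    stream, down-weighting the audio stream as its estimated SNR drops."""
    # Map SNR to an audio weight in (0, 1); centering the logistic at 10 dB
    # is an arbitrary illustrative choice.
    lam = 1.0 / (1.0 + np.exp(-(snr_db - 10.0) / 5.0))
    return lam * audio_loglik + (1.0 - lam) * visual_loglik

audio_ll = np.log(np.array([0.7, 0.2, 0.1]))   # audio favors class 0
visual_ll = np.log(np.array([0.4, 0.5, 0.1]))  # lip motion favors class 1

# With clean audio (30 dB) the fused scores track the audio stream; with
# noisy audio (0 dB) the visual stream dominates, mirroring the AVSR
# results discussed above.
print(fuse_stream_logliks(audio_ll, visual_ll, snr_db=30.0))
print(fuse_stream_logliks(audio_ll, visual_ll, snr_db=0.0))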