Transcription of Multimodal Deep Learning
Multimodal Deep Learning

Jiquan Ngiam (1), Aditya Khosla (1), Mingyu Kim (1), Juhan Nam (1), Honglak Lee (2), Andrew Y. Ng (1)
1. Computer Science Department, Stanford University, Stanford, CA 94305, USA.
2. Computer Science and Engineering Division, University of Michigan, Ann Arbor, MI 48109, USA.

Abstract

Deep networks have been successfully applied to unsupervised feature learning for single modalities (e.g., text, images or audio). In this work, we propose a novel application of deep networks to learn features over multiple modalities.

[...] information on the place of articulation and muscle movements (Summerfield, 1992), which can often help to disambiguate between speech with similar acoustics (e.g., the unvoiced consonants /p/ and /k/). Multimodal learning involves relating information from multiple sources. For example, images and 3-d [...]
To train a multimodal model, a direct approach is to train an RBM over the concatenated audio and video data (Figure 2c). While this approach jointly models the distribution of the audio and video data, it is limited as a shallow model. In particular, since the correlations between the audio and video data are highly
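
As a rough illustration of the direct approach described above (a single RBM trained over concatenated audio and video inputs), the following minimal sketch may help. It is not the authors' implementation: the binary units, one-step contrastive divergence (CD-1), the hyperparameters, the random toy data, and the function names train_concat_rbm and joint_features are all assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_concat_rbm(audio, video, n_hidden=16, lr=0.05, epochs=20, seed=0):
    """Train a binary RBM with CD-1 on concatenated [audio, video] rows.

    Hypothetical helper for illustration; hyperparameters are arbitrary.
    """
    rng = np.random.default_rng(seed)
    # The "direct approach": one visible layer holding both modalities.
    v_data = np.concatenate([audio, video], axis=1)
    n_samples, n_visible = v_data.shape

    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_vis = np.zeros(n_visible)
    b_hid = np.zeros(n_hidden)

    for _ in range(epochs):
        # Positive phase: hidden activations driven by the data.
        h_prob = sigmoid(v_data @ W + b_hid)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)

        # Negative phase: one Gibbs step (reconstruct, then re-infer).
        v_recon = sigmoid(h_sample @ W.T + b_vis)
        h_recon = sigmoid(v_recon @ W + b_hid)

        # CD-1 update: data statistics minus reconstruction statistics.
        W += lr * (v_data.T @ h_prob - v_recon.T @ h_recon) / n_samples
        b_vis += lr * (v_data - v_recon).mean(axis=0)
        b_hid += lr * (h_prob - h_recon).mean(axis=0)

    return W, b_vis, b_hid


def joint_features(audio, video, W, b_hid):
    """Hidden-unit probabilities used as a shared audio-video representation."""
    return sigmoid(np.concatenate([audio, video], axis=1) @ W + b_hid)


# Toy binary inputs standing in for audio and video features (assumed shapes).
toy_rng = np.random.default_rng(1)
audio = (toy_rng.random((64, 8)) < 0.5).astype(float)
video = (toy_rng.random((64, 4)) < 0.5).astype(float)

W, b_vis, b_hid = train_concat_rbm(audio, video)
feats = joint_features(audio, video, W, b_hid)
```

Because there is only one hidden layer, every hidden unit connects directly to both the audio and the video inputs, which is the sense in which this model is shallow.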