Transcription of Multimodal Deep Learning
Multimodal Deep Learning

Jiquan Ngiam (1), Aditya Khosla (1), Mingyu Kim (1), Juhan Nam (1), Honglak Lee (2), Andrew Y. Ng (1)

(1) Computer Science Department, Stanford University, Stanford, CA 94305, USA.
(2) Computer Science and Engineering Division, University of Michigan, Ann Arbor, MI 48109, USA.

Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).

Abstract

Deep networks have been successfully applied to unsupervised feature learning for single modalities (e.g., text, images or audio). In this work, we propose a novel application of deep networks to learn features over multiple modalities. We present a series of tasks for multimodal learning and show how to train deep networks that learn features to address these tasks. In particular, we demonstrate cross modality feature learning, where better features for one modality (e.g., video) can be learned if multiple modalities (e.g., audio and video) are present at feature learning time. Furthermore, we show how to learn a shared representation between modalities and evaluate it on a unique task, where the classifier is trained with audio-only data but tested with video-only data and vice-versa. Our models are validated on the CUAVE and AVLetters datasets on audio-visual speech classification, demonstrating best published visual speech classification on AVLetters and effective shared representation learning.

1. Introduction

In speech recognition, humans are known to integrate audio-visual information in order to understand speech. This was first exemplified in the McGurk effect (McGurk & MacDonald, 1976) where a visual /ga/ with a voiced /ba/ is perceived as /da/ by most subjects. In particular, the visual modality provides information on the place of articulation and muscle movements (Summerfield, 1992) which can often help to disambiguate between speech with similar acoustics (e.g., the unvoiced consonants /p/ and /k/).

Multimodal learning involves relating information from multiple sources. For example, images and 3-D depth scans are correlated at first-order as depth discontinuities often manifest as strong edges in images. Conversely, audio and visual data for speech recognition have correlations at a "mid-level", as phonemes and visemes (lip pose and motions); it can be difficult to relate raw pixels to audio waveforms or spectrograms.

In this paper, we are interested in modeling "mid-level" relationships, thus we choose to use audio-visual speech classification to validate our methods. In particular, we focus on learning representations for speech audio which are coupled with videos of the lips.

We will consider the learning settings shown in Figure 1. The overall task can be divided into three phases: feature learning, supervised training, and testing. A simple linear classifier is used for supervised training and testing to examine different feature learning models with multimodal data. In particular, we consider three learning settings: multimodal fusion, cross modality learning, and shared representation learning.

In the multimodal fusion setting, data from all modalities is available at all phases; this represents the typical setting considered in most prior work in audio-visual speech recognition (Potamianos et al., 2004). In cross modality learning, data from multiple modalities is available only during feature learning; during the supervised training and testing phases, only data from a single modality is provided. For this setting, the aim is to learn better single modality representations given unlabeled data from multiple modalities. Last, we consider a shared representation learning setting, which is unique in that different modalities are presented for supervised training and testing. This setting allows us to evaluate if the feature representations can capture correlations across different modalities. Specifically, studying this setting allows us to assess whether the learned representations are modality-invariant.

                                  Feature     Supervised
                                  Learning    Training     Testing
  Classic Deep Learning           Audio       Audio        Audio
                                  Video       Video        Video
  Multimodal Fusion               A+V         A+V          A+V
  Cross Modality Learning         A+V         Video        Video
                                  A+V         Audio        Audio
  Shared Representation Learning  A+V         Audio        Video
                                  A+V         Video        Audio

Figure 1: Multimodal learning settings, where A+V refers to Audio and Video.

In the following sections, we first describe the building blocks of our model. We then present different multimodal learning models leading to a deep network that is able to perform the various multimodal learning tasks. Finally, we report experimental results and conclude.

2. Background

Recent work on deep learning (Hinton & Salakhutdinov, 2006; Salakhutdinov & Hinton, 2009) has examined how deep sigmoidal networks can be trained to produce useful representations for handwritten digits and text. The key idea is to use greedy layer-wise training with Restricted Boltzmann Machines (RBMs) followed by fine-tuning. We use an extension of RBMs with sparsity (Lee et al., 2007), which have been shown to learn meaningful features for digits and natural images. In the next section, we review the sparse RBM, which is used as a layer-wise building block for our models.

2.1. Sparse restricted Boltzmann machines

The RBM is an undirected graphical model with hidden variables (h) and visible variables (v) (Figure 2a). There are symmetric connections between the hidden and visible variables (W_{i,j}), but no connections within hidden variables or visible variables. The model defines a probability distribution over h, v (Equation 1). This particular configuration makes it easy to compute the conditional probability distributions when v or h is fixed (Equation 2).

$$-\log P(v, h) \propto E(v, h) = \frac{1}{2\sigma^2} v^T v - \frac{1}{\sigma^2}\left(c^T v + b^T h + h^T W v\right) \qquad (1)$$

$$p(h_j \mid v) = \mathrm{sigmoid}\left(\frac{1}{\sigma^2}\left(b_j + w_j^T v\right)\right) \qquad (2)$$

This formulation models the visible variables as real-valued units and the hidden variables as binary units (see Footnote 1). As it is intractable to compute the gradient of the log-likelihood term, we learn the parameters of the model (w_{i,j}, b_j, c_i) using contrastive divergence (Hinton, 2002).

To regularize the model for sparsity (Lee et al., 2007), we encourage each hidden unit to have a pre-determined expected activation using a regularization penalty of the form

$$\lambda \sum_j \left(\rho - \frac{1}{m} \sum_{k=1}^{m} E[h_j \mid v_k]\right)^2,$$

where {v_1, ..., v_m} is the training set and \rho determines the sparsity of the hidden unit activations.

Footnote 1: We use Gaussian visible units for the RBM that is connected to the input data. When training the deeper layers, we use binary visible units.

3. Learning architectures

In this section, we describe our models for the task of audio-visual bimodal feature learning, where the audio and visual input to the model are contiguous audio (spectrogram) and video frames. To motivate our deep autoencoder (Hinton & Salakhutdinov, 2006) model, we first describe several simple models and their drawbacks.

One of the most straightforward approaches to feature learning is to train an RBM model separately for audio and video (Figure 2a,b). After learning the RBM, the posteriors of the hidden variables given the visible variables (Equation 2) can then be used as a new representation for the data. We use this model as a baseline to compare the results of our multimodal models, as well as for pre-training the deep networks.

To train a multimodal model, a direct approach is to train an RBM over the concatenated audio and video data (Figure 2c). While this approach jointly models the distribution of the audio and video data, it is limited as a shallow model. In particular, since the correlations between the audio and video data are highly non-linear, it is hard for an RBM to learn these correlations and form multimodal representations. In practice, we found that learning a shallow bimodal RBM results in hidden units that have strong connections to variables from an individual modality but few units that connect across the modalities.

[Figure 2 panels: (a) Audio RBM, (b) Video RBM, (c) Shallow Bimodal RBM, (d) Bimodal DBN. Diagram labels: Audio Input, Video Input, Hidden Units, Shared Representation, Deep Hidden Layer.]

Figure 2: RBM Pretraining Models. We train RBMs for (a) audio and (b) video separately as a baseline. The shallow model (c) is limited and we find that this model is unable to capture correlations across the modalities. The bimodal deep belief network (DBN) model (d) is trained in a greedy layer-wise fashion by first training models (a) & (b). We later unroll the deep model (d) to train the deep autoencoder models presented in Figure 3.
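
To make the sparse RBM of Equations 1 and 2 and the shallow bimodal RBM of Figure 2c more concrete, the following is a minimal NumPy sketch, not the authors' implementation. Several choices are assumptions made only for illustration: the class name `SparseGaussianRBM`, the hyperparameter values (`sigma`, `rho`, `lam`, `lr`), the use of a single contrastive divergence step (CD-1) with a noise-free reconstruction, applying the sparsity penalty to the hidden biases only, and the random arrays standing in for real audio and video features.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SparseGaussianRBM:
    """Gaussian-visible, binary-hidden RBM (hypothetical sketch of Eq. 1 and 2)."""

    def __init__(self, n_visible, n_hidden, sigma=1.0, rho=0.1, lam=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_hidden, n_visible))
        self.b = np.zeros(n_hidden)    # hidden biases b_j
        self.c = np.zeros(n_visible)   # visible biases c_i
        self.sigma, self.rho, self.lam = sigma, rho, lam

    def energy(self, v, h):
        # Equation 1: E = v'v / (2 sigma^2) - (c'v + b'h + h'Wv) / sigma^2
        s2 = self.sigma ** 2
        return v @ v / (2 * s2) - (self.c @ v + self.b @ h + h @ self.W @ v) / s2

    def hidden_posterior(self, V):
        # Equation 2, evaluated for every row of V at once
        return sigmoid((self.b + V @ self.W.T) / self.sigma ** 2)

    def visible_mean(self, H):
        # Mean of the Gaussian p(v | h) implied by Equation 1: c + W'h
        return self.c + H @ self.W

    def sparsity_penalty(self, V):
        # lam * sum_j (rho - mean_k E[h_j | v_k])^2
        q = self.hidden_posterior(V).mean(axis=0)
        return self.lam * np.sum((self.rho - q) ** 2)

    def cd1_step(self, V, lr=0.01):
        m, s2 = V.shape[0], self.sigma ** 2
        ph0 = self.hidden_posterior(V)                       # positive phase
        h0 = (self.rng.random(ph0.shape) < ph0).astype(float)
        v1 = self.visible_mean(h0)                           # noise-free reconstruction
        ph1 = self.hidden_posterior(v1)                      # negative phase
        # CD-1 approximation to the log-likelihood gradient
        self.W += lr * (ph0.T @ V - ph1.T @ v1) / (m * s2)
        self.b += lr * (ph0 - ph1).mean(axis=0) / s2
        self.c += lr * (V - v1).mean(axis=0) / s2
        # Assumption: descend the sparsity penalty through the hidden biases only
        q = ph0.mean(axis=0)
        self.b += lr * 2 * self.lam * (self.rho - q) * (ph0 * (1 - ph0)).mean(axis=0) / s2

# Synthetic stand-ins for audio (spectrogram) and video (lip region) features
rng = np.random.default_rng(1)
audio = rng.standard_normal((200, 8))
video = rng.standard_normal((200, 6))
X = np.hstack([audio, video])          # concatenated input, as in Figure 2c

rbm = SparseGaussianRBM(n_visible=X.shape[1], n_hidden=10)
for epoch in range(50):
    rbm.cd1_step(X)

features = rbm.hidden_posterior(X)     # posteriors used as the new representation
print(features.shape)
```

Training the same class on `audio` or `video` alone would correspond to the single-modality baselines of Figure 2a and 2b. Inspecting the audio and video column blocks of `rbm.W` is one way to check whether individual hidden units connect to both modalities.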