Transcription of 3D Convolutional Neural Networks for Human Action …
1 3D Convolutional Neural Networks for Human Action recognition Shuiwang Ji Arizona State University, Tempe, AZ 85287, USA. Wei Xu Ming Yang Kai Yu NEC Laboratories America, Inc., Cupertino, CA 95014, USA. Abstract cluttered backgrounds, occlusions, and viewpoint vari- ations, etc. Therefore, most of the existing approaches We consider the fully automated recognition (Efros et al., 2003; Schu ldt et al., 2004; Dolla r et al., of actions in uncontrolled environment. Most 2005; Laptev & Pe rez, 2007; Jhuang et al., 2007). existing work relies on domain knowledge to make certain assumptions ( , small scale and view- construct complex handcrafted features from point changes) about the circumstances under which inputs. In addition, the environments are the video was taken. However, such assumptions sel- usually assumed to be controlled.
2 Convolu- dom hold in real-world environment. In addition, most tional Neural Networks (CNNs) are a type of of these approaches follow the conventional paradigm deep models that can act directly on the raw of pattern recognition , which consists of two steps in inputs, thus automating the process of fea- which the first step computes complex handcrafted fea- ture construction. However, such models are tures from raw video frames and the second step learns currently limited to handle 2D inputs. In this classifiers based on the obtained features. In real-world paper, we develop a novel 3D CNN model for scenarios, it is rarely known which features are impor- Action recognition . This model extracts fea- tant for the task at hand, since the choice of feature is tures from both spatial and temporal dimen- highly problem-dependent.
3 Especially for Human ac- sions by performing 3D convolutions, thereby tion recognition , different Action classes may appear capturing the motion information encoded dramatically different in terms of their appearances in multiple adjacent frames. The developed and motion patterns. model generates multiple channels of infor- mation from the input frames, and the final Deep learning models (Fukushima, 1980; LeCun et al., feature representation is obtained by com- 1998; Hinton & Salakhutdinov, 2006; Hinton et al., bining information from all channels. We 2006; Bengio, 2009) are a class of machines that can apply the developed model to recognize hu- learn a hierarchy of features by building high-level man actions in real-world environment, and features from low-level ones, thereby automating the it achieves superior performance without re- process of feature construction.
4 Such learning ma- lying on handcrafted features. chines can be trained using either supervised or un- supervised approaches, and the resulting systems have been shown to yield competitive performance in visual 1. Introduction object recognition (LeCun et al., 1998; Hinton et al., 2006; Ranzato et al., 2007; Lee et al., 2009a), natu- Recognizing Human actions in real-world environment ral language processing (Collobert & Weston, 2008), finds applications in a variety of domains including in- and audio classification (Lee et al., 2009b) tasks. The telligent video surveillance, customer attributes, and Convolutional Neural Networks (CNNs) (LeCun et al., shopping behavior analysis. However, accurate recog- 1998) are a type of deep models in which trainable nition of actions is a highly challenging task due to filters and local neighborhood pooling operations are Appearing in Proceedings of the 27 th International Confer- applied alternatingly on the raw input images, result- ence on Machine Learning, Haifa, Israel, 2010.
5 Copyright ing in a hierarchy of increasingly complex features. 2010 by the author(s)/owner(s). It has been shown that, when trained with appropri- 3D Convolutional Neural Networks for Human Action recognition ate regularization (Ahmed et al., 2008; Yu et al., 2008; handcrafted features, demonstrating that the 3D CNN. Mobahi et al., 2009), CNNs can achieve superior per- model is more effective for real-world environments formance on visual object recognition tasks without such as those captured in TRECVID data. The exper- relying on handcrafted features. In addition, CNNs iments also show that the 3D CNN model significantly have been shown to be relatively insensitive to certain outperforms the frame-based 2D CNN for most tasks. variations on the inputs (LeCun et al., 2004). We also observe that the performance differences be- tween 3D CNN and other methods tend to be larger As a class of attractive deep models for automated fea- when the number of positive training samples is small.
6 Ture construction, CNNs have been primarily applied on 2D images. In this paper, we consider the use of CNNs for Human Action recognition in videos. A sim- 2. 3D Convolutional Neural Networks ple approach in this direction is to treat video frames In 2D CNNs, 2D convolution is performed at the con- as still images and apply CNNs to recognize actions volutional layers to extract features from local neigh- at the individual frame level. Indeed, this approach borhood on feature maps in the previous layer. Then has been used to analyze the videos of developing an additive bias is applied and the result is passed embryos (Ning et al., 2005). However, such approach through a sigmoid function. Formally, the value of does not consider the motion information encoded in unit at position (x, y) in the jth feature map in the multiple contiguous frames.
7 To effectively incorporate xy ith layer, denoted as vij , is given by the motion information in video analysis, we propose to perform 3D convolution in the Convolutional layers i 1 Q. ! X PX i 1. of CNNs so that discriminative features along both (x+p)(y+q). xy X pq vij = tanh bij + wijm v(i 1)m , spatial and temporal dimensions are captured. We m p=0 q=0. show that by applying multiple distinct Convolutional (1). operations at the same location on the input, multi- where tanh( ) is the hyperbolic tangent function, bij ple types of features can be extracted. Based on the is the bias for this feature map, m indexes over the proposed 3D convolution, a variety of 3D CNN archi- set of feature maps in the (i 1)th layer connected tectures can be devised to analyze video data. We to the current feature map, wijk pq is the value at the develop a 3D CNN architecture that generates multi- position (p, q) of the kernel connected to the kth fea- ple channels of information from adjacent video frames ture map, and Pi and Qi are the height and width and performs convolution and subsampling separately of the kernel, respectively.
8 In the subsampling lay- in each channel. The final feature representation is ers, the resolution of the feature maps is reduced by obtained by combining information from all channels. pooling over local neighborhood on the feature maps An additional advantage of the CNN-based models is in the previous layer, thereby increasing invariance to that the recognition phase is very efficient due to their distortions on the inputs. A CNN architecture can be feed-forward nature. constructed by stacking multiple layers of convolution We evaluated the developed 3D CNN model on the and subsampling in an alternating fashion. The pa- TREC Video Retrieval Evaluation (TRECVID) data1 , rameters of CNN, such as the bias bij and the kernel pq which consist of surveillance video data recorded in weight wijk , are usually trained using either super- London Gatwick Airport.
9 We constructed a multi- vised or unsupervised approaches (LeCun et al., 1998;. module event detection system, which includes 3D Ranzato et al., 2007). CNN as a module, and participated in three tasks of the TRECVID 2009 Evaluation for Surveillance Event 3D Convolution Detection. Our system achieved the best performance In 2D CNNs, convolutions are applied on the 2D fea- on all three participated tasks. To provide indepen- ture maps to compute features from the spatial dimen- dent evaluation of the 3D CNN model, we report its sions only. When applied to video analysis problems, performance on the TRECVID 2008 development set it is desirable to capture the motion information en- in this paper. We also present results on the KTH coded in multiple contiguous frames. To this end, we data as published performance for this data is avail- propose to perform 3D convolutions in the convolution able.
10 Our experiments show that the developed 3D stages of CNNs to compute features from both spa- CNN model outperforms other baseline methods on tial and temporal dimensions. The 3D convolution is the TRECVID data, and it achieves competitive per- achieved by convolving a 3D kernel to the cube formed formance on the KTH data without depending on by stacking multiple contiguous frames together. By 1. this construction, the feature maps in the convolution layer is connected to multiple contiguous frames in the 3D Convolutional Neural Networks for Human Action recognition (a) 2D convolution t em por al t em por al Figure 2. Extraction of multiple features from contiguous frames. Multiple 3D convolutions can be applied to con- tiguous frames to extract multiple features. As in Figure 1, the sets of connections are color-coded so that the shared weights are in the same color.