

Large-scale Video Classification with Convolutional Neural Networks

Andrej Karpathy (1,2), George Toderici (1), Sanketh Shetty (1), Thomas Leung (1), Rahul Sukthankar (1), Li Fei-Fei (2)

(1) Google Research  (2) Computer Science Department, Stanford University




Abstract

Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggest a multiresolution, foveated architecture as a promising way of speeding up the training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (… to …), but only a surprisingly modest improvement compared to single-frame models (… to …). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (… up from …).

1. Introduction

Images and videos have become ubiquitous on the internet, which has encouraged the development of algorithms that can analyze their semantic content for various applications, including search and summarization. Recently, Convolutional Neural Networks (CNNs) [15] have been demonstrated as an effective class of models for understanding image content, giving state-of-the-art results on image recognition, segmentation, detection and retrieval [11, 3, 2, 20, 9, 18]. The key enabling factors behind these results were techniques for scaling up the networks to tens of millions of parameters, together with massive labeled datasets that can support the learning process. Under these conditions, CNNs have been shown to learn powerful and interpretable image features [28]. Encouraged by these positive results in the domain of images, we study the performance of CNNs in large-scale video classification, where the networks have access not only to the appearance information present in single, static images, but also to its complex temporal evolution. There are several challenges to extending and applying CNNs in this setting.

From a practical standpoint, there are currently no video classification benchmarks that match the scale and variety of existing image datasets, because videos are significantly more difficult to collect, annotate and store. To obtain the amount of data needed to train our CNN architectures, we collected a new Sports-1M dataset, which consists of 1 million YouTube videos belonging to a taxonomy of 487 classes of sports. We make Sports-1M available to the research community to support future work in this area.

From a modeling perspective, we are interested in answering the following questions: what temporal connectivity pattern in a CNN architecture is best at taking advantage of the local motion information present in the video? How does the additional motion information influence the predictions of a CNN, and how much does it improve performance overall? We examine these questions empirically by evaluating multiple CNN architectures that each take a different approach to combining information across the time domain.

From a computational perspective, CNNs require extensive training time to effectively optimize the millions of parameters that parametrize the model. This difficulty is further compounded when extending the connectivity of the architecture in time, because the network must process not just one image but several frames of video at a time. To mitigate this issue, we show that an effective approach to speeding up the runtime performance of CNNs is to modify the architecture to contain two separate streams of processing: a context stream that learns features on low-resolution frames, and a high-resolution fovea stream that operates only on the middle portion of the frame. We observe a 2-4x increase in runtime performance of the network due to the reduced dimensionality of the input, while retaining the classification accuracy.

Finally, a natural question that arises is whether features learned on the Sports-1M dataset are generic enough to generalize to a different, smaller dataset. We investigate the transfer learning problem empirically, achieving significantly better performance (…, up from …) on UCF-101 by re-purposing low-level features learned on the Sports-1M dataset than by training the entire network on UCF-101 alone. Furthermore, since only some classes in UCF-101 are related to sports, we can quantify the relative improvements of the transfer learning in both settings.

Our contributions can be summarized as follows:

- We provide an extensive experimental evaluation of multiple approaches for extending CNNs into video classification on a large-scale dataset of 1 million videos with 487 categories (which we release as the Sports-1M dataset) and report significant gains in performance over strong feature-based baselines.

- We highlight an architecture that processes input at two spatial resolutions - a low-resolution context stream and a high-resolution fovea stream - as a promising way of improving the runtime performance of CNNs at no cost in accuracy.

- We apply our networks to the UCF-101 dataset and report a significant improvement over feature-based state-of-the-art results and over baselines established by training networks on UCF-101 alone.

2. Related Work

The standard approach to video classification [26, 16, 21, 17] involves three major stages. First, local visual features that describe a region of the video are extracted, either densely [25] or at a sparse set of interest points [12, 8]. Next, the features are combined into a fixed-size video-level description; one popular approach is to quantize all features using a learned k-means dictionary and accumulate the visual words over the duration of the video into histograms of varying spatio-temporal positions and extents [13]. Lastly, a classifier (such as an SVM) is trained on the resulting bag-of-words representation to distinguish among the visual classes of interest.

CNNs instead shift the required engineering from feature design and accumulation strategies to the design of the network connectivity structure and hyperparameter choices. Due to computational constraints, CNNs had until recently been applied only to relatively small-scale image recognition problems (on datasets such as MNIST, CIFAR-10/100, NORB, and Caltech-101/256), but improvements in GPU hardware have enabled CNNs to scale to networks of millions of parameters, which has in turn led to significant improvements in image classification [11], object detection [20, 9], scene labeling [3], indoor segmentation [4] and house number digit classification [19]. Additionally, features learned by large networks trained on ImageNet [7] have been shown to yield state-of-the-art performance across many standard image recognition datasets when classified with an SVM, even with no fine-tuning [18].

Compared to image data domains, there is relatively little work on applying CNNs to video classification. Since all successful applications of CNNs in image domains share the availability of a large training set, we speculate that this is partly attributable to the lack of large-scale video classification benchmarks. In particular, commonly used datasets (KTH, Weizmann, UCF Sports, IXMAS, Hollywood 2, UCF-50) contain only up to a few thousand clips and a few dozen classes. Even the largest available datasets, such as CCV (9,317 videos and 20 classes) and the recently introduced UCF-101 [22] (13,320 videos and 101 classes), are still dwarfed by available image datasets in the number of instances and their variety [7]. Despite these limitations, some extensions of CNNs into the video domain have been explored. [1] and [10] extend an image CNN to video domains by treating space and time as equivalent dimensions of the input and performing convolutions in both time and space; we consider these extensions to be only one of the possible generalizations in this work. Unsupervised learning schemes for training spatio-temporal features have also been developed, based on Convolutional Gated Restricted Boltzmann Machines [23] and Independent Subspace Analysis [14]. In contrast, our models are trained end to end, fully supervised.

3. Models

Unlike images, which can be cropped and rescaled to a fixed size, videos vary widely in temporal extent and can…
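The multiresolution design described above - a low-resolution context stream over the whole frame plus a high-resolution fovea stream over the frame's center - can be sketched as input preprocessing. This is a minimal illustration under assumptions, not the paper's implementation: the 178x178 frame size and 89x89 per-stream size are illustrative, and naive strided subsampling stands in for proper low-pass downsampling.

```python
import numpy as np

def two_stream_inputs(frame: np.ndarray, size: int = 89):
    """Split one frame into a low-resolution context view and a
    high-resolution center (fovea) crop of equal dimensionality."""
    h, w, _ = frame.shape
    # Context stream: subsample the whole frame by taking every other
    # pixel (a crude stand-in for proper resampling).
    context = frame[::2, ::2][:size, :size]
    # Fovea stream: crop the middle of the frame at full resolution.
    top, left = (h - size) // 2, (w - size) // 2
    fovea = frame[top:top + size, left:left + size]
    return context, fovea

frame = np.zeros((178, 178, 3), dtype=np.uint8)  # illustrative frame size
context, fovea = two_stream_inputs(frame)
assert context.shape == fovea.shape == (89, 89, 3)
```

Because both streams see an input with a quarter of the original pixels, the downstream convolutional layers do proportionally less work, which is the source of the runtime savings the paper reports.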

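The extensions of [1] and [10] mentioned in Related Work treat time as a dimension equivalent to space, so a filter is convolved over a space-time volume rather than a single image. A naive single-channel, single-filter sketch (all sizes illustrative; real systems use optimized framework kernels):

```python
import numpy as np

def conv3d_valid(clip: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive 'valid' convolution over time, height and width,
    treating time as just another input dimension."""
    T, H, W = clip.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # Correlate the kernel with one space-time window.
                out[i, j, k] = (clip[i:i+t, j:j+h, k:k+w] * kernel).sum()
    return out

clip = np.ones((10, 8, 8))        # 10 grayscale frames of 8x8 pixels
kernel = np.ones((3, 3, 3)) / 27  # averaging filter over a 3x3x3 volume
out = conv3d_valid(clip, kernel)
assert out.shape == (8, 6, 6)
```

A filter that spans several frames can respond to local motion patterns, which is exactly the capability a single-frame 2D convolution lacks.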

Related search queries
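The middle stage of the standard pipeline described in Related Work - quantizing local features against a learned k-means dictionary and accumulating visual words into a histogram - can be sketched as follows. The codebook size, descriptor dimensionality, and random data are illustrative assumptions; a real pipeline would extract descriptors from video and learn the codebook with k-means.

```python
import numpy as np

def bow_histogram(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Assign each local descriptor to its nearest visual word and
    accumulate the words into a normalized bag-of-words histogram."""
    # Squared Euclidean distance from every feature to every codeword.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)                 # nearest-codeword assignment
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                  # normalize over the video

rng = np.random.default_rng(0)
codebook = rng.normal(size=(50, 16))    # e.g. k-means centers, k = 50
features = rng.normal(size=(1000, 16))  # e.g. local descriptors of a video
h = bow_histogram(features, codebook)
assert h.shape == (50,) and abs(h.sum() - 1.0) < 1e-9
```

The resulting fixed-size vector is what the third stage's classifier (e.g. an SVM) consumes, regardless of the video's duration.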