
Building High-level Features Using Large Scale Unsupervised Learning

Quoc V. Le (quocle@cs.stanford.edu)
Marc'Aurelio Ranzato (ranzato@google.com)


Abstract

We consider the problem of building high-level, class-specific feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images? To answer this, we train a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (the model has 1 billion connections, the dataset has 10 million 200x200 pixel images downloaded from the Internet). We train this network using model parallelism and asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days.

Contrary to what appears to be a widely-held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not. Control experiments show that this feature detector is robust not only to translation but also to scaling and out-of-plane rotation. We also find that the same network is sensitive to other high-level concepts such as cat faces and human bodies. Starting with these learned features, we trained our network to obtain 15.8% accuracy in recognizing 22,000 object categories from ImageNet, a leap of 70% relative improvement over the previous state-of-the-art.

(Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).)

1. Introduction

The focus of this work is to build high-level, class-specific feature detectors from unlabeled images. For instance, we would like to understand if it is possible to build a face detector from only unlabeled images. This approach is inspired by the neuroscientific conjecture that there exist highly class-specific neurons in the human brain, generally and informally known as "grandmother neurons." The extent of class-specificity of neurons in the brain is an area of active investigation, but current experimental evidence suggests the possibility that some neurons in the temporal cortex are highly selective for object categories such as faces or hands (Desimone et al., 1984), and perhaps even specific people (Quiroga et al., 2005).

Contemporary computer vision methodology typically emphasizes the role of labeled data to obtain these class-specific feature detectors. For example, to build a face detector, one needs a large collection of images labeled as containing faces, often with a bounding box around the face. The need for large labeled sets poses a significant challenge for problems where labeled data are rare. Although approaches that make use of inexpensive unlabeled data are often preferred, they have not been shown to work well for building high-level features.

This work investigates the feasibility of building high-level features from only unlabeled data. A positive answer to this question will give rise to two significant results.

Practically, this provides an inexpensive way to develop features from unlabeled data. But perhaps more importantly, it answers an intriguing question as to whether the specificity of the "grandmother neuron" could possibly be learned from unlabeled data. Informally, this would suggest that it is at least in principle possible that a baby learns to group faces into one class because it has seen many of them and not because it is guided by supervision or rewards.

Unsupervised feature learning and deep learning have emerged as methodologies in machine learning for building features from unlabeled data. Using unlabeled data in the wild to learn features is the key idea behind the self-taught learning framework (Raina et al., 2007).

Successful feature learning algorithms and their applications can be found in recent literature using a variety of approaches such as RBMs (Hinton et al., 2006), autoencoders (Hinton & Salakhutdinov, 2006; Bengio et al., 2007), sparse coding (Lee et al., 2007) and K-means (Coates et al., 2011). So far, most of these algorithms have only succeeded in learning low-level features such as edge or blob detectors. Going beyond such simple features and capturing complex invariances is the topic of this work.

Recent studies observe that it is quite time-intensive to train deep learning algorithms to yield state-of-the-art results (Ciresan et al., 2010). We conjecture that the long training time is partially responsible for the lack of high-level features reported in the literature. For instance, researchers typically reduce the sizes of datasets and models in order to train networks in a practical amount of time, and these reductions undermine the learning of high-level features.

We address this problem by scaling up the core components involved in training deep networks: the dataset, the model, and the computational resources.

First, we use a large dataset generated by sampling random frames from random YouTube videos. Our input data are 200x200 images, much larger than the typical 32x32 images used in deep learning and unsupervised feature learning (Krizhevsky, 2009; Ciresan et al., 2010; Le et al., 2010; Coates et al., 2011). Our model, a deep autoencoder with pooling and local contrast normalization, is scaled to these large images by using a large computer cluster. To support parallelism on this cluster, we use the idea of local receptive fields, e.g., (Raina et al., 2009; Le et al., 2010; 2011b). This idea reduces communication costs between machines and thus allows model parallelism (parameters are distributed across machines).
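The model-parallel idea can be made concrete with a minimal sketch of a locally connected layer. This is an illustration under simplifying assumptions (grayscale input, non-overlapping receptive fields, a single feature map, no pooling or contrast normalization), not the paper's actual implementation; the function and variable names are invented for the example.

    import numpy as np

    def local_layer_forward(image, weights, field=18, stride=18):
        """One locally connected layer: each output unit sees only a single
        field x field patch of the image, and filters are NOT shared across
        locations (unlike a convolution)."""
        n_rows, n_cols = weights.shape[:2]
        out = np.empty((n_rows, n_cols))
        for i in range(n_rows):
            for j in range(n_cols):
                patch = image[i*stride:i*stride+field, j*stride:j*stride+field]
                out[i, j] = np.sum(weights[i, j] * patch)
        return out

    # Example: a 198x198 image covered by an 11x11 grid of 18x18 fields.
    img = np.random.rand(198, 198)
    W = np.random.randn(11, 11, 18, 18) * 0.01
    h = local_layer_forward(img, W)   # h.shape == (11, 11)

Because each unit depends only on its local patch, the grid of filters can be partitioned across machines, and a machine holding one block of filters needs only the matching block of pixels; this locality is what keeps cross-machine communication low.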

Asynchronous SGD is employed to support data parallelism. The model was trained in a distributed fashion on a cluster with 1,000 machines (16,000 cores) for three days.

Our experimental results using classification and visualization confirm that it is indeed possible to build high-level features from unlabeled data. In particular, using a hold-out test set consisting of faces and distractors, we discover a feature that is highly selective for faces. This is different from the work of (Lee et al., 2009) who trained their model on images from one class. The result is also validated by visualization via numerical optimization. Control experiments show that the learned detector is not only invariant to translation but also to out-of-plane rotation and scaling. Similar experiments reveal the network also learns the concepts of cat faces and human bodies. The learned representations are also discriminative: using the learned features, we obtain significant leaps in object recognition with ImageNet.
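The data-parallel side can be sketched similarly. Below is a toy version of asynchronous SGD in which several workers update one shared parameter vector without waiting for each other; it is not the paper's distributed implementation, and grad_fn and the data shards are invented for the example.

    import threading
    import numpy as np

    params = np.zeros(10)              # shared model parameters
    lr = 0.01

    def grad_fn(p, x):
        return p - x                   # gradient of 0.5 * ||p - x||^2

    def worker(shard, steps=2000):
        for _ in range(steps):
            x = shard[np.random.randint(len(shard))]
            g = grad_fn(params, x)     # gradient w.r.t. possibly stale params
            params[:] -= lr * g        # update applied with no lock or barrier

    # Four workers, each with its own data shard.
    shards = [np.random.randn(100, 10) + k for k in range(4)]
    threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # params ends up near the mean of all shards despite the unsynchronized,
    # occasionally stale updates -- the tolerance that lets async SGD scale.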

For instance, on ImageNet with 22,000 categories, we achieved 15.8% accuracy, a relative improvement of 70% over the state-of-the-art. Note that a random guess achieves less than 0.005% accuracy for this dataset.

2. Training set construction

Our training dataset is constructed by sampling frames from 10 million YouTube videos. To avoid duplicates, each video contributes only one image to the dataset. Each example is a color image with 200x200 pixels. A subset of training images is shown in Appendix A. To check the proportion of faces in the dataset, we run an OpenCV face detector on 60x60 randomly-sampled patches from the dataset. This experiment shows that patches being detected as faces by the OpenCV face detector account for less than 3% of the 100,000 sampled patches.
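The face-proportion check is straightforward to reproduce in outline. The sketch below uses OpenCV's stock Haar-cascade face detector; sample_patch is a hypothetical stand-in for the paper's patch-sampling pipeline, and the detector parameters are assumptions since the paper does not specify them.

    import cv2

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def is_face(patch_gray):
        # detectMultiScale returns an empty result when no face is found
        faces = cascade.detectMultiScale(patch_gray, scaleFactor=1.1,
                                         minNeighbors=3)
        return len(faces) > 0

    def face_fraction(sample_patch, n=100_000):
        # sample_patch() should return one random 60x60 grayscale patch
        hits = sum(is_face(sample_patch()) for _ in range(n))
        return hits / n   # the text above reports this fraction is below 3%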

3. Algorithm

In this section, we describe the algorithm that we use to learn features from the unlabeled training data.

3.1. Previous work

Our work is inspired by recent successful algorithms in unsupervised feature learning and deep learning (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007; Lee et al., 2007). It is strongly influenced by the work of (Olshausen & Field, 1996) on sparse coding. According to their study, sparse coding can be trained on unlabeled natural images to yield receptive fields akin to V1 simple cells (Hubel & Wiesel, 1959).

One shortcoming of early approaches such as sparse coding (Olshausen & Field, 1996) is that their architectures are shallow and typically capture low-level concepts (e.g., edge Gabor filters) and simple invariances.
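For readers unfamiliar with sparse coding, the classic objective behind (Olshausen & Field, 1996)-style models is to represent an input x as a sparse combination a of dictionary atoms D by minimizing 0.5*||x - D a||^2 + lam*||a||_1. The sketch below implements the standard ISTA inference procedure for that objective; it is generic textbook material, not code from this paper, and the sizes and step parameters are illustrative.

    import numpy as np

    def ista(x, D, lam=0.1, step=0.01, iters=200):
        """Sparse code a minimizing 0.5*||x - D @ a||**2 + lam*||a||_1."""
        a = np.zeros(D.shape[1])
        for _ in range(iters):
            a = a + step * (D.T @ (x - D @ a))  # gradient step on the quadratic
            a = np.sign(a) * np.maximum(np.abs(a) - step * lam, 0.0)  # shrink
        return a

    # Example with a random unit-norm dictionary; the step size must stay
    # below 1 / ||D||^2 for the iteration to converge.
    D = np.random.randn(64, 256)
    D /= np.linalg.norm(D, axis=0)
    a = ista(np.random.randn(64), D)   # most entries of a are exactly zero

Trained on natural image patches, the atoms of D learned under this objective come to resemble localized, oriented (Gabor-like) edge filters, which is the sense in which sparse coding yields V1-like receptive fields.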

