Histograms of Oriented Gradients for Human Detection

Navneet Dalal and Bill Triggs
INRIA Rhône-Alpes, 655 avenue de l'Europe, Montbonnot 38334, France

Abstract

We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

1 Introduction

Detecting humans in images is a challenging task owing to their variable appearance and the wide range of poses that they can adopt.

The first need is a robust feature set that allows the human form to be discriminated cleanly, even in cluttered backgrounds under difficult illumination. We study the issue of feature sets for human detection, showing that locally normalized Histogram of Oriented Gradient (HOG) descriptors provide excellent performance relative to other existing feature sets including wavelets [17, 22]. The proposed descriptors are reminiscent of edge orientation histograms [4, 5], SIFT descriptors [12] and shape contexts [1], but they are computed on a dense grid of uniformly spaced cells and they use overlapping local contrast normalizations for improved performance. We make a detailed study of the effects of various implementation choices on detector performance, taking "pedestrian detection" (the detection of mostly visible people in more or less upright poses) as a test case. For simplicity and speed, we use linear SVM as a baseline classifier throughout the study. The new detectors give essentially perfect results on the MIT pedestrian test set [18, 17], so we have created a more challenging set containing over 1800 pedestrian images with a large range of poses and backgrounds. Ongoing work suggests that our feature set performs equally well for other shape-based object classes.

We briefly discuss previous work on human detection in §2, give an overview of our method in §3, describe our data sets in §4 and give a detailed description and experimental evaluation of each stage of the process in §5–6.

The main conclusions are summarized in §7.

2 Previous Work

There is an extensive literature on object detection, but here we mention just a few relevant papers on human detection [18, 17, 22, 16, 20]. See [6] for a survey. Papageorgiou et al. [18] describe a pedestrian detector based on a polynomial SVM using rectified Haar wavelets as input descriptors, with a parts (subwindow) based variant in [17]. Depoortere et al. give an optimized version of this [2]. Gavrila & Philomen [8] take a more direct approach, extracting edge images and matching them to a set of learned exemplars using chamfer distance. This has been used in a practical real-time pedestrian detection system [7]. Viola et al. [22] build an efficient moving person detector, using AdaBoost to train a chain of progressively more complex region rejection rules based on Haar-like wavelets and space-time differences. Ronfard et al. [19] build an articulated body detector by incorporating SVM based limb classifiers over 1st and 2nd order Gaussian filters in a dynamic programming framework similar to those of Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth [9].

Mikolajczyk et al. [16] use combinations of orientation-position histograms with binary-thresholded gradient magnitudes to build a parts based method containing detectors for faces, heads, and front and side profiles of upper and lower body parts. In contrast, our detector uses a simpler architecture with a single detection window, but appears to give significantly higher performance on pedestrian images.

3 Overview of the Method

This section gives an overview of our feature extraction chain, which is summarized in fig. 1. Implementation details are postponed until §6.

Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) 1063-6919/05 $ 2005 IEEE

[Figure 1: Input image → Normalize gamma & colour → Compute gradients → Weighted vote into spatial & orientation cells → Contrast normalize over overlapping spatial blocks → Collect HOG's over detection window → Linear SVM → Person / non-person classification]

Figure 1. An overview of our feature extraction and object detection chain. The detector window is tiled with a grid of overlapping blocks in which Histogram of Oriented Gradient feature vectors are extracted. The combined vectors are fed to a linear SVM for object/non-object classification. The detection window is scanned across the image at all positions and scales, and conventional non-maximum suppression is run on the output pyramid to detect object instances, but this paper concentrates on the feature extraction process.

The method is based on evaluating well-normalized local histograms of image gradient orientations in a dense grid. Similar features have seen increasing use over the past decade [4, 5, 12, 15]. The basic idea is that local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or edge directions, even without precise knowledge of the corresponding gradient or edge positions. In practice this is implemented by dividing the image window into small spatial regions ("cells"), for each cell accumulating a local 1-D histogram of gradient directions or edge orientations over the pixels of the cell. The combined histogram entries form the representation. For better invariance to illumination, shadowing, etc., it is also useful to contrast-normalize the local responses before using them.
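The cell histogram step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `cell_orientation_histograms`, the 8-pixel cells, the 9 unsigned orientation bins, the simple centred differences and the hard (non-interpolated) magnitude-weighted voting are all choices made here for brevity.

```python
import math


def cell_orientation_histograms(image, cell_size=8, n_bins=9):
    """Accumulate one gradient-orientation histogram per cell.

    image: list of equal-length rows of grey values.
    Returns hist[cell_row][cell_col][bin].
    cell_size, n_bins and the unsigned 0-180 degree range are
    illustrative assumptions, not values fixed by the text above.
    """
    height, width = len(image), len(image[0])
    n_rows, n_cols = height // cell_size, width // cell_size
    bin_width = 180.0 / n_bins
    hist = [[[0.0] * n_bins for _ in range(n_cols)] for _ in range(n_rows)]

    for y in range(n_rows * cell_size):
        for x in range(n_cols * cell_size):
            # Centred differences, clamped at the image border.
            gx = image[y][min(x + 1, width - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, height - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation in [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # Guard against rounding pushing the angle to exactly 180.
            b = min(int(angle // bin_width), n_bins - 1)
            # Each pixel votes for its bin, weighted by gradient magnitude.
            hist[y // cell_size][x // cell_size][b] += magnitude
    return hist
```

With these assumed settings, a 64×128 pixel window yields a grid of 16 × 8 cells (rows × columns), each holding a 9-bin histogram. Contrast normalization is deliberately left out of this sketch.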

This normalization can be done by accumulating a measure of local histogram "energy" over somewhat larger spatial regions ("blocks") and using the results to normalize all of the cells in the block. We will refer to the normalized descriptor blocks as Histogram of Oriented Gradient (HOG) descriptors. Tiling the detection window with a dense (in fact, overlapping) grid of HOG descriptors and using the combined feature vector in a conventional SVM based window classifier gives our human detection chain (see fig. 1).

The use of orientation histograms has many precursors [13, 4, 5], but it only reached maturity when combined with local spatial histogramming and normalization in Lowe's Scale Invariant Feature Transformation (SIFT) approach to wide baseline image matching [12], in which it provides the underlying image patch descriptor for matching scale-invariant keypoints. SIFT-style approaches perform remarkably well in this application [12, 14]. The Shape Context work [1] studied alternative cell and block shapes, albeit initially using only edge pixel counts without the orientation histogramming that makes the representation so effective. The success of these sparse feature based representations has somewhat overshadowed the power and simplicity of HOG's as dense image descriptors.

We hope that our study will help to rectify this. In particular, our informal experiments suggest that even the best current keypoint based approaches are likely to have false positive rates at least 1–2 orders of magnitude higher than our dense grid approach for human detection, mainly because none of the keypoint detectors that we are aware of detect human body structures reliably.

The HOG/SIFT representation has several advantages. It captures edge or gradient structure that is very characteristic of local shape, and it does so in a local representation with an easily controllable degree of invariance to local geometric and photometric transformations: translations or rotations make little difference if they are much smaller than the local spatial or orientation bin size. For human detection, rather coarse spatial sampling, fine orientation sampling and strong local photometric normalization turns out to be the best strategy, presumably because it permits limbs and body segments to change appearance and move from side to side quite a lot provided that they maintain a roughly upright orientation.

4 Data Sets and Methodology

We tested our detector on two different data sets. The first is the well-established MIT pedestrian database [18], containing 509 training and 200 test images of pedestrians in city scenes (plus left-right reflections of these).

It contains only front or back views with a relatively limited range of poses. Our best detectors give essentially perfect results on this data set, so we produced a new and significantly more challenging data set, "INRIA", containing 1805 64×128 images of humans cropped from a varied set of personal photos. Fig. 2 shows some samples. The people are usually standing, but appear in any orientation and against a wide variety of background images including crowds. Many are bystanders taken from the image backgrounds, so there is no particular bias on their pose. The database is available for research purposes.

We selected 1239 of the images as positive training examples, together with their left-right reflections (2478 images in all). A fixed set of 12180 patches sampled randomly from 1218 person-free training photos provided the initial negative set. For each detector and parameter combination a preliminary detector is trained and the 1218 negative training photos are searched exhaustively for false positives ("hard examples").

The method is then re-trained using this augmented set (initial 12180 + hard examples) to produce the final detector. The set of hard examples is subsampled if necessary, so that the descriptors of the final training set fit into [...] Gb of RAM for SVM training. This retraining process significantly improves the performance of each detector (by 5% at 10^-4 False Positives Per Window tested (FPPW) for our default detector), but additional rounds of retraining make little difference so we do not use them.

[Figure 2: sample images from the new data set]

Figure 2. Some sample images from our new human detection database. The subjects are always upright, but with some partial occlusions and a wide range of variations in pose, appearance, clothing, illumination and background.

To quantify detector performance we plot Detection Error Tradeoff (DET) curves on a log-log scale, i.e. miss rate (1 − Recall, or FalseNeg / (TruePos + FalseNeg)) versus FPPW. Lower values are better. DET plots are used extensively in speech and in NIST evaluations. They present the same information as Receiver Operating Characteristics (ROC's) but allow small probabilities to be distinguished more easily. We will often use miss rate at 10^-4 FPPW as a reference point for results. This is arbitrary but no more so than, e.g., Area Under ROC. In a multiscale detector it corresponds to a raw error rate of about [...] false positives per 640×480 image tested. (The full detector has an even lower false positive rate owing to non-maximum suppression.) Our DET curves are usually quite shallow so even very small improvements in miss rate are equivalent to large gains in FPPW at constant miss rate. For example, for our default detector at 10^-4 FPPW, every 1% absolute (9% relative) reduction in miss rate is equivalent to reducing the FPPW at constant miss rate by a factor of [...].

5 Overview of Results

Before presenting our detailed implementation and performance analysis, we compare the overall performance of our final HOG detectors with that of some other existing methods.
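To make the DET quantities defined in the data set and methodology discussion concrete, the following sketch computes (FPPW, miss rate) pairs from per-window classifier scores. It is only an illustration: the function names and the convention that a window is accepted when its score is at least the threshold are assumptions introduced here, not details taken from the paper.

```python
def miss_rate(true_pos, false_neg):
    """Miss rate = 1 - recall = FalseNeg / (TruePos + FalseNeg)."""
    return false_neg / (true_pos + false_neg)


def det_points(pos_scores, neg_scores):
    """Return one (fppw, miss_rate) pair per candidate threshold.

    pos_scores: classifier scores of windows that contain a person.
    neg_scores: classifier scores of person-free windows.
    A window is accepted when score >= threshold (assumed convention).
    """
    points = []
    for threshold in sorted(set(pos_scores) | set(neg_scores)):
        false_neg = sum(1 for s in pos_scores if s < threshold)
        true_pos = len(pos_scores) - false_neg
        false_pos = sum(1 for s in neg_scores if s >= threshold)
        # FPPW: false positives divided by the number of negative windows tested.
        fppw = false_pos / len(neg_scores)
        points.append((fppw, miss_rate(true_pos, false_neg)))
    return points
```

Plotting these pairs on log-log axes gives a DET curve, and reading off the miss rate at a fixed FPPW gives a single reference point of the kind used in the text. As a side calculation, a 1% absolute reduction that equals a 9% relative reduction implies a default miss rate of roughly 0.01 / 0.09 ≈ 11% at that operating point.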
