Single-Image Crowd Counting via Multi-Column …

2. Multi-column CNN for Crowd Counting

2.1. Density map based crowd counting

To estimate the number of people in a given image via Convolutional Neural Networks (CNNs), there are two natural configurations. One is a network whose input is the image and whose output is the estimated head count. The other …

Transcription of Single-Image Crowd Counting via Multi-Column …

Single-Image Crowd Counting via Multi-Column Convolutional Neural Network

Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, Yi Ma
ShanghaiTech University

Abstract

This paper aims to develop a method that can accurately estimate the crowd count from an individual image with arbitrary crowd density and arbitrary perspective. To this end, we have proposed a simple but effective Multi-column Convolutional Neural Network (MCNN) architecture to map the image to its crowd density map. The proposed MCNN allows the input image to be of arbitrary size or resolution. By utilizing filters with receptive fields of different sizes, the features learned by each column CNN are adaptive to variations in people/head size due to perspective effect or image resolution.
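The idea of parallel columns whose filters have different receptive-field sizes can be illustrated with a toy sketch. This is not the trained network described in the paper: fixed mean filters stand in for learned convolutions, and the kernel sizes (3, 5, 7), the fusion weights, and the function names `box_filter` and `multi_column` are all assumptions made for illustration. The sketch only shows that each column sees a different neighbourhood size, that the per-pixel fusion yields a map the same size as the input, and that the input can have any height and width.

```python
def box_filter(image, k):
    """Apply a k x k mean filter with zero padding, keeping the input size."""
    h, w = len(image), len(image[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += image[yy][xx]
            out[y][x] = total / (k * k)
    return out


def multi_column(image, kernel_sizes=(3, 5, 7), fuse_weights=(0.5, 0.3, 0.2)):
    """Run one column per kernel size, then fuse the columns pixel by pixel.

    The kernel sizes and fusion weights are illustrative assumptions,
    not values taken from the paper.
    """
    columns = [box_filter(image, k) for k in kernel_sizes]
    h, w = len(image), len(image[0])
    return [
        [
            sum(wt * col[y][x] for wt, col in zip(fuse_weights, columns))
            for x in range(w)
        ]
        for y in range(h)
    ]
```

Calling `multi_column` on a 9 x 9 image and on a 3 x 5 image returns maps of those same sizes, which mirrors the arbitrary-input-size property claimed in the abstract.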

Furthermore, the true density map is computed accurately based on geometry-adaptive kernels which do not need knowing the perspective map of the input image. Since existing crowd counting datasets do not adequately cover all the challenging situations considered in our work, we have collected and labelled a large new dataset that includes 1198 images with about 330,000 heads annotated. On this challenging new dataset, as well as all existing datasets, we conduct extensive experiments to verify the effectiveness of the proposed model and method.
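One way to picture a geometry-adaptive kernel is the minimal sketch below, which assumes that each annotated head is blurred with a Gaussian whose width is proportional to the mean distance to its nearest neighbouring heads, so that dense regions get narrow kernels and sparse regions get wide ones without any perspective map. The neighbour count `k`, the factor `beta`, the floor `min_sigma`, and the per-kernel renormalisation on the pixel grid are assumptions of this sketch and are not stated in this excerpt.

```python
import math


def adaptive_sigmas(heads, k=3, beta=0.3, min_sigma=0.5):
    """Return one Gaussian width per head: beta * mean distance to its k nearest heads."""
    sigmas = []
    for i, (xi, yi) in enumerate(heads):
        dists = sorted(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(heads)
            if j != i
        )
        nearest = dists[:k]
        # Assumption: a lone head has no neighbours, so fall back to min_sigma.
        sigma = beta * sum(nearest) / len(nearest) if nearest else min_sigma
        sigmas.append(max(sigma, min_sigma))
    return sigmas


def density_map(heads, height, width, k=3, beta=0.3):
    """Sum one unit-mass Gaussian per head, so the map sums to the head count.

    Heads are (x, y) positions assumed to lie inside the image.
    """
    grid = [[0.0] * width for _ in range(height)]
    for (hx, hy), sigma in zip(heads, adaptive_sigmas(heads, k, beta)):
        kernel = [
            [
                math.exp(-((x - hx) ** 2 + (y - hy) ** 2) / (2.0 * sigma ** 2))
                for x in range(width)
            ]
            for y in range(height)
        ]
        # Renormalise on the grid so each head contributes exactly one unit.
        mass = sum(map(sum, kernel))
        for y in range(height):
            for x in range(width):
                grid[y][x] += kernel[y][x] / mass
    return grid
```

Because every kernel is normalised to unit mass, summing the returned map recovers the number of annotated heads, which is what lets a density map double as a count.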

In particular, with the proposed simple MCNN model, our method outperforms all existing methods. In addition, experiments show that our model, once trained on one dataset, can be readily transferred to a new dataset.

1. Introduction

On New Year's Eve of 2015, 35 people were killed in a massive stampede in Shanghai, China. Unfortunately, since then, many more massive stampedes have taken place around the world which have claimed many more victims. Accurately estimating crowds from images or videos has become an increasingly important application of computer vision technology for purposes of crowd control and public safety.

In some scenarios, such as public rallies and sports events, the number or density of participating people is an essential piece of information for future event planning and space design. Good methods of crowd counting can also be extended to other domains, for instance, counting cells or bacteria from microscopic images, animal crowd estimates in wildlife sanctuaries, or estimating the number of vehicles at transportation hubs or traffic jams.

Algorithms have been proposed in the literature for crowd counting. Earlier methods [29] adopt a detection-style framework that scans a detector over two consecutive frames of a video sequence to estimate the number of pedestrians, based on boosting appearance and motion features.

[19, 30, 31] have used a similar detection-based framework for pedestrian counting. In detection-based crowd counting methods, people typically assume a crowd is composed of individual entities which can be detected by some given detectors [13, 34, 18, 10]. The limitation of such detection-based methods is that occlusion among people in a clustered environment or in a very dense crowd significantly affects the performance of the detector, and hence the final estimation accuracy. For counting crowds in videos, people have proposed to cluster trajectories of tracked visual features.

For instance, [24] has used a highly parallelized version of the KLT tracker and agglomerative clustering to estimate the number of moving people. [3] has tracked simple image features and probabilistically grouped them into clusters representing independently moving entities. However, such tracking-based methods do not work for estimating crowds from individual still images.

The most extensively used method for crowd counting is feature-based regression, see [4, 7, 5, 27, 15, 20]. The main steps of this kind of method are: 1) segmenting the foreground; 2) extracting various features from the foreground, such as area of crowd mask [4, 7, 27, 23], edge count [4, 7, 27, 25], or texture features [22, 7]; 3) utilizing a regression function to estimate the crowd count. Linear [23] or piece-wise linear [25] functions are relatively simple models and yield decent performance.
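The three steps of feature-based regression can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any cited work: the foreground is segmented by thresholding the difference from a known background image, the only feature is the foreground pixel area, and the regressor is ordinary least squares on that single feature. All function names and the threshold value are hypothetical.

```python
def segment_foreground(frame, background, threshold=30):
    """Step 1: mark pixels that differ from a known background image."""
    return [
        [1 if abs(p - b) > threshold else 0 for p, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


def foreground_area(mask):
    """Step 2: one simple feature, the number of foreground pixels."""
    return sum(map(sum, mask))


def fit_linear(features, counts):
    """Step 3: ordinary least squares for count = slope * feature + intercept."""
    n = len(features)
    mean_f = sum(features) / n
    mean_c = sum(counts) / n
    var_f = sum((f - mean_f) ** 2 for f in features)
    cov_fc = sum((f - mean_f) * (c - mean_c) for f, c in zip(features, counts))
    slope = cov_fc / var_f
    return slope, mean_c - slope * mean_f


def predict_count(frame, background, slope, intercept):
    """Apply all three steps to a new frame."""
    area = foreground_area(segment_foreground(frame, background))
    return slope * area + intercept
```

A typical use would fit `slope` and `intercept` on the areas and annotated counts of training frames, then call `predict_count` on unseen frames. The sketch also makes the weakness discussed later visible: any error in step 1 propagates directly into the predicted count.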

Other more advanced/effective methods are ridge regression (RR) [7], Gaussian process regression (GPR) [4], and neural network (NN) [22].

There have also been some works focusing on crowd counting from still images. [12] has proposed to leverage multiple sources of information to compute an estimate of the number of individuals present in an extremely dense crowd visible in a single image. In that work, a dataset of fifty crowd images containing 64K annotated humans (UCF_CC_50) is introduced. [2] has followed the work and estimated counts by fusing information from multiple sources, namely, interest points (SIFT), Fourier analysis, wavelet decomposition, GLCM features, and low confidence head detections.

[28] has utilized the features extracted from a pre-trained CNN to train a support vector machine (SVM) that subsequently generates counts for still images. Zhang et al. [33] have proposed a CNN-based method to count crowds in different scenes. They first pretrain a network for certain scenes. When a test image from a new scene is given, they choose similar training data to fine-tune the pretrained network based on the perspective information and similarity in density map. Their method demonstrates good performance on most existing datasets. However, their method requires perspective maps both on training scenes and the test scene.

Unfortunately, in many practical applications of crowd counting, the perspective maps are not readily available, which limits the applicability of such methods.

In this paper, we aim to conduct accurate crowd counting from an arbitrary still image, with an arbitrary camera perspective and crowd density (see Figure 1 for some typical examples). At first sight this seems to be a rather daunting task, since we obviously need to conquer a series of challenges:

1. Foreground segmentation is indispensable in most existing work. However, foreground segmentation is a challenging task all by itself, and inaccurate segmentation will have an irreversible bad effect on the final count. In our task, the viewpoint of an image can be arbitrary. Without information about scene geometry or motion, it is almost impossible to segment the crowd from its background accurately. Hence, we have to estimate the number of people in the crowd without segmenting the foreground first.

2. The density and distribution of crowds vary significantly in our task (or datasets), and typically there are tremendous occlusions for most people in each image. Hence traditional detection-based methods do not work well on such images and situations.

3. As there might be significant variation in the scale of the people in the images, we need to utilize features at different scales all together in order to accurately estimate crowd counts for different images.

