Crowd Counting via Multi-Scale Adversarial Convolutional Neural Networks



Introduction
With the rapid growth of urban populations, public safety has become a focus of attention in video surveillance. For real-time analysis of crowds at events such as public gatherings and sports matches, it is necessary to estimate both the number of people and the density map of the crowd. In recent years, crowd analysis has attracted many researchers: not only can it be applied to urban planning [1], scene understanding [2], and traffic monitoring, but also to counting tasks in other domains, such as counting cells under the microscope [3][4][5][6] and counting vehicles [7][8][9][10][11]. However, due to various complexities, such as complex illumination, pedestrian occlusion in dense scenes, perspective distortion, and the non-uniform distribution of people, crowd counting remains a challenging task in computer vision, and these issues keep estimation accuracy far from optimal.
Some earlier methods treated crowd counting as a detection problem, counting pedestrians by detection and tracking: a detector is trained to find the pedestrians appearing in a crowd image. However, when the crowd is very dense, occlusion between pedestrians becomes severe, which can result in poor detection. Later methods were based on regression with traditional hand-crafted features, achieving better performance than detection by regressing the number of pedestrians directly from the image. However, because these methods use manual features such as HOG [12], it is difficult for them to achieve the best results given the insufficient expression of local features, viewing angles, and large scale variations in crowd images. Inspired by the recent success of convolutional neural networks (CNNs) on multiple computer vision tasks, many CNN-based methods [13][14][15] were developed to address these issues and obtained remarkable success. For instance, [15][16][17] used multi-channel CNN structures to handle scale variation and achieved good results in crowd density estimation, using convolutional kernels of different sizes to deal with the different head sizes in input images. In a crowd density map, each marked point represents the location of a pedestrian, and the crowd count is obtained by integrating over the pixels of the density map. Current CNN-based methods [8,18,19] use multi-path convolutional neural networks with the Euclidean loss as the objective function: each sub-network uses a different convolutional kernel size to extract multi-scale features, local optimization is achieved by minimizing the Euclidean loss, and finally all sub-networks are fine-tuned by joint training.
Building on the multi-column CNN [19], which has been successful in crowd counting, we propose a new framework called the Multi-Scale Adversarial Convolutional Neural Network (MSA-CNN) to address these issues. The multi-column structure extracts high-dimensional features of the crowd image, and a series of fractionally-strided convolutional layers then restores the detail lost to the max-pooling layers, yielding a high-resolution density map. In addition, inspired by the success of Generative Adversarial Networks (GANs) in image interpretation [20], we propose an adversarial training method to reduce the blurring effect and improve the quality of the density map. Figure 1 shows the result of our method on one sample. Our main contributions are summarized as follows: 1. We propose a novel parameter-optimized MSA-CNN to address crowd counting and density estimation. 2. After extracting high-level features of the crowd image, several fractionally-strided convolutional layers restore some of the detail lost to the previous max-pooling, improving the quality of the estimated density map and ultimately the counting accuracy. 3. We conduct extensive experiments on two representative datasets [9,12] and compare the outcomes with existing methods; our method surpasses the current state-of-the-art performance.

Related works
Current crowd density estimation methods are broadly divided into: 1) detection-based methods, 2) regression methods based on hand-crafted features, and 3) CNN-based methods. These are briefly explained as follows. Detection-based methods: Early frameworks treated the crowd as a collection of individually detected pedestrians in order to estimate the count [7,9,12,21,22], and none of these methods apply to a single still image, since early research focused on video surveillance scenarios that can fully exploit motion and appearance information. For instance, [12] trained dynamic detectors on pairs of consecutive frames of a video sequence to capture this information, and a recurrent neural network framework has since been used for head detection in crowd scenes. [23] used GoogLeNet features within a Long Short-Term Memory (LSTM) framework to regress the bounding boxes of heads. [4,5] proposed a trajectory-clustering method based on tracking visual features to perform crowd counting in video surveillance, but this method likewise cannot estimate the number of people in a single static image. Moreover, the performance of detection-and-tracking methods degrades severely when the crowd is very dense and the image is prone to occlusion.

Regression-Based methods:
The most widely used methods for crowd counting are feature-based regression [12][13][14][24], which regress either scalar values (the number of people) or density maps [3,24]. The main steps are: (1) extracting the foreground; (2) extracting various features from the foreground, such as the area of the crowd [3,12,13,16], edge information [3,12,14,25], or texture information [3,6]; and (3) estimating the number of people with a regression function. A linear [1] or piece-wise linear [15] function is a relatively simple model that exhibits good performance. Other more effective methods are Ridge Regression (RR) [3], Gaussian Process Regression (GPR) [13], and Neural Networks (NN) [26]. These methods suit crowd counting on surveillance video because they rely on foreground segmentation, which is a very difficult task and largely limits the performance of the algorithm. There is also work on crowd counting in still images: [8] suggested making use of multi-source information to estimate the number of people in a single image; [27] estimated counts by combining information from multiple sources, such as interest points (SIFT) [28], Fourier analysis, wavelet decomposition, Gray-Level Co-occurrence Matrix (GLCM) features, and low-confidence head detections; and [17] trained a support vector machine (SVM) on features extracted from a pre-trained model and then estimated the number of people in a single still image. Although regression-based methods outperform detection-based ones, they can only extract low-level features, so they are still not the best way to map features to the number of pedestrians.

CNN-Based method:
Recent CNN-based methods are also a kind of regression method; we introduce them separately because, unlike traditional regression methods based on hand-crafted features, they can extract high-dimensional features of crowd images through convolutional operations. [15] proposed a CNN-based method for crowd counting in different scenes, fine-tuning the pre-trained network with foreground information at test time. This method achieves good performance on most existing datasets, but their training and test data require foreground maps, which are not available in practical crowd-counting applications. In [14,19], a multi-column network structure is used to deal with scale change: using a traditional CNN, each column is trained separately, and the three resulting models are then merged and fine-tuned. The fully connected layer uses a 1 × 1 convolution kernel to fuse the feature maps from the different scales and regress a density map. Inspired by the multi-column convolutional neural network (MCNN) [19], Switch-CNN [18] proposed a switching structure that selects the appropriate regressor for a given input patch. These methods achieved good performance and emphasized the accuracy of the predicted count, but they use only max pooling and the ℓ2 loss, thus ignoring the quality of the density map. [29] utilizes a single-column convolutional neural network similar to the VGG-16 structure and emphasizes network depth. The structure produces a strongly scale-adaptive crowd counter for each image and introduces a multi-task loss to improve generalization to crowd scenes with few pedestrians, but it requires a large number of parameters. The recent CP-CNN [30] proposed a contextual pyramid of CNNs to generate high-quality crowd density maps with lower estimation error.
They fused high-dimensional features extracted from contextual information and the multi-column structure in a Fusion-CNN consisting of convolutional and fractionally-strided convolutional layers. However, CP-CNN requires a crowd density level label, which is not available in existing datasets. Both our method and CP-CNN take the quality of the generated density map into account; compared to CP-CNN, our method needs no density-level labels and has relatively few model parameters.

Network architecture
Inspired by the success of the multi-column [19] structure and taking advantage of generative adversarial networks, we propose MSA-CNN for crowd counting. In our method, the generator network learns to map a crowd image to the corresponding density map, as shown in Figure 2. First, a convolutional layer generates 16 feature maps with 11 × 11 filters; a second convolutional layer produces 24 feature maps with 9 × 9 filters, and a third produces 32 feature maps with 7 × 7 filters. The feature maps generated by this shallow network are shared by a three-column CNN. Further, inspired by the multi-column [27] structure for handling scale variation in crowd images, we use an improved multi-column structure to extract high-dimensional features; the difference from previous multi-column structures is that ours is deeper. Specifically, we use a similar multi-column structure whose filter sizes and counts are optimized to reduce the count estimation error. It is noteworthy that the number of columns and the filter sizes must be chosen according to the scale variation present in the dataset, so adapting the network to datasets with different scale variations requires additional experimentation on filter sizes. The details of these parameters are shown in Table 1. Max-pooling layers are employed to down-sample the crowd image while extracting high-dimensional features, but they cause a loss of detail in the feature maps. The feature extraction stage consists of convolutional layers and Rectified Linear Units (ReLU). To improve the quality of the density map, we add fractionally-strided convolutional layers that up-sample the input data so as to restore the lost details and improve the quality of the estimated density map.
The following network structure is used: CR(48,9)-CR(24,7)-TR(32)-CR(20,5)-TR(16)-C(1,1), where C is a convolutional layer, R a ReLU layer, and T a fractionally-strided convolutional layer. Each fractionally-strided convolutional layer increases the input resolution by a stride of 2, which helps to regress a full-resolution density map so that the input and the output have the same resolution. The generator network is first trained by optimizing the ℓ2 loss between the estimated and ground-truth density maps; the L_I loss is then used to optimize the discriminator and fine-tune the generator, and finally the test data are fed to the trained generator to estimate density maps. We adopt the idea of adversarial networks mainly because we need to generate a high-resolution density map. With only a traditional pixel-wise Euclidean loss, the back-propagated gradient depends on the deviation of each individual pixel, which tends to blur the map at edges and outliers of the image [26]. The adversarial loss, in contrast, judges whether a pixel is "real" or "fake" and, by optimizing the loss function, encourages the "fake" pixels to match the "real" pixel distribution; in principle this promotes sharp images and discourages blur [31]. However, using only the adversarial loss as the objective function may cause anomalies in the spatial structure and even introduce outliers in the label space. We therefore follow previous work [20,32,33] and add a conventional loss to improve the solution. The following sub-sections discuss the details of the objective function.
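A minimal PyTorch sketch of the CR(48,9)-CR(24,7)-TR(32)-CR(20,5)-TR(16)-C(1,1) back-end described above. The text specifies only the filter counts and sizes, so the input channel count (96, standing in for the concatenated multi-column features), the "same" padding on each convolution, and the 4 × 4 kernel with padding 1 for the stride-2 transposed convolutions are all assumptions of this sketch:

```python
import torch
import torch.nn as nn

class BackEnd(nn.Module):
    """Back-end of the generator: CR(48,9)-CR(24,7)-TR(32)-CR(20,5)-TR(16)-C(1,1).

    C(k, s) = conv with k filters of size s x s, R = ReLU, T(k) = a
    fractionally-strided (transposed) conv with k filters that doubles the
    spatial resolution (stride 2, as stated in the text).
    """

    def __init__(self, in_channels=96):  # 96 input channels is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 48, 9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(48, 24, 7, padding=3), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(24, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 20, 5, padding=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(20, 16, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 1),  # 1x1 conv regresses the single-channel density map
        )

    def forward(self, x):
        return self.net(x)

# The two stride-2 transposed convolutions give 4x upsampling, matching two
# max-pooling stages assumed in the front end.
x = torch.randn(1, 96, 32, 32)
y = BackEnd()(x)
print(tuple(y.shape))  # (1, 1, 128, 128)
```

With `ConvTranspose2d(kernel=4, stride=2, padding=1)` the output size is exactly twice the input size, which is why that kernel/padding pair was chosen for the sketch.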

Figure 2:
Generator stage: the first part extracts the high-dimensional feature maps and is basically composed of convolution-PReLU (Conv-PReLU) blocks, where "pooling" denotes a max-pooling layer with a factor of 2; the second part is the fractionally-strided convolutional phase, whose basic unit is deconvolution-ReLU (DeConv-ReLU).

Training procedure fragment:
...
6:     update Θ_G1 by stochastic gradient descent
7:   end for
8: end for
9: /* Fine-tuning the first generator network parameters; training for Tc epochs */
10: Initialize the discriminator parameters Θ_D and the fractionally-strided phase Θ_G2 with random Gaussian weights
11: for i = 1 to Tc do
12:   for j = 1 to N do
13:     ℓ_I = argmin L_I
14:     update Θ_D, Θ_G2 and fine-tune Θ_G1
15:   end for
16: end for
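The two-stage schedule above can be sketched at toy scale in PyTorch. The tiny stand-in modules, epoch counts, and learning rates are illustrative assumptions; only the schedule itself follows the text: pre-train the feature-extraction phase Θ_G1 by SGD on the Euclidean loss, then initialize Θ_D and Θ_G2 with random Gaussian weights and optimize the combined loss while fine-tuning Θ_G1:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
g1 = nn.Conv2d(1, 4, 3, padding=1)  # feature-extraction phase (Theta_G1), toy stand-in
x, gt = torch.randn(2, 1, 16, 16), torch.rand(2, 1, 16, 16)
mse, lam = nn.MSELoss(), 1e-3       # lambda = 1e-3 as in the text

# Stage 1: pre-train Theta_G1 alone on the Euclidean loss (through a fixed
# channel-mean readout -- an assumption, so the toy loss is well-defined here).
opt_g1 = torch.optim.SGD(g1.parameters(), lr=1e-2)
for _ in range(3):  # epochs T (assumed small for the sketch)
    opt_g1.zero_grad()
    mse(g1(x).mean(1, keepdim=True), gt).backward()
    opt_g1.step()

# Stage 2: initialize Theta_G2 and Theta_D with random Gaussian weights, then
# alternate discriminator and generator updates for Tc epochs.
g2 = nn.Conv2d(4, 1, 3, padding=1)  # fractionally-strided phase stand-in (Theta_G2)
d = nn.Sequential(nn.Flatten(), nn.Linear(16 * 16, 1), nn.Sigmoid())  # Theta_D
for p in list(g2.parameters()) + list(d.parameters()):
    nn.init.normal_(p, 0.0, 0.02)
bce = nn.BCELoss()
opt_d = torch.optim.SGD(d.parameters(), lr=1e-3)
opt_g = torch.optim.SGD(list(g1.parameters()) + list(g2.parameters()), lr=1e-3)
for _ in range(3):  # epochs Tc (assumed small for the sketch)
    est = g2(g1(x))
    # discriminator step: real density maps -> 1, estimated maps -> 0
    opt_d.zero_grad()
    d_loss = bce(d(gt), torch.ones(2, 1)) + bce(d(est.detach()), torch.zeros(2, 1))
    d_loss.backward()
    opt_d.step()
    # generator step: Euclidean term plus weighted adversarial term,
    # fine-tuning Theta_G1 together with Theta_G2
    opt_g.zero_grad()
    g_loss = mse(est, gt) + lam * bce(d(est), torch.ones(2, 1))
    g_loss.backward()
    opt_g.step()
```

The alternation (update D on detached generator output, then update the generator against the refreshed D) is the standard GAN recipe assumed to underlie steps 11-16 of the fragment.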

Objective function
It has been widely acknowledged that the Euclidean loss has certain disadvantages [34], such as sensitivity to outliers and image blur. Motivated by the success of GANs in image reconstruction and by these observations, we adopt a weighted combination of the Euclidean loss and an adversarial loss as the final loss function, mitigating the issues of pure ℓ2 minimization [20]. The objective function is as follows:
• Euclidean loss:
L_E = (1 / 2N) Σ_{i=1}^{N} ||G_θG(X_i, Θ) − P_i^GT||_2^2
where N is the number of training samples, X_i is the i-th training sample, Θ represents the network parameters, G_θG(X_i, Θ) is the density map estimated by the network, and P_i^GT is the i-th ground-truth density map.
• Adversarial loss:
L_A = −log D_θD(G_θG(I))
where G_θG and D_θD are the outputs of the generator and discriminator networks respectively, L_A is the adversarial loss function, and I is the input crowd image.
• Final objective:
L_I = L_E + λ L_A
where λ is the weight that connects the two functions; we set it to 10^−3.
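A numerical sketch of the combined objective, assuming the standard forms implied by the symbol definitions in the text: a Euclidean term L_E = 1/(2N) Σ ||est − gt||², an adversarial term L_A = −log D(G(I)) averaged over the batch, and L = L_E + λ·L_A with λ = 10⁻³:

```python
import numpy as np

def euclidean_loss(est, gt):
    """L_E = 1/(2N) * sum_i ||est_i - gt_i||_2^2 over N samples (first axis)."""
    n = est.shape[0]
    return np.sum((est - gt) ** 2) / (2 * n)

def adversarial_loss(d_scores):
    """L_A = -mean_i log D(G(X_i)); d_scores are discriminator outputs in (0, 1)."""
    return -np.mean(np.log(d_scores))

def final_loss(est, gt, d_scores, lam=1e-3):
    """Weighted combination L = L_E + lambda * L_A, lambda = 1e-3 per the text."""
    return euclidean_loss(est, gt) + lam * adversarial_loss(d_scores)

# Toy batch of 4 estimated and ground-truth density maps plus D scores.
rng = np.random.default_rng(0)
est = rng.random((4, 64, 64))
gt = rng.random((4, 64, 64))
d = rng.uniform(0.1, 0.9, size=4)
print(final_loss(est, gt, d))
```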

Training and Implementation Details
In the training and testing phases, ground-truth density map data are necessary. The original data provide the crowd image and the corresponding annotated head positions, so we only need to convert the point annotations into a density map. To do so, each head position is blurred with a Gaussian kernel so that the integral contributed by each person is one. To deal with object-size differences and perspective distortion in crowd images, we adopt the geometry-adaptive Gaussian kernel method proposed in [19] to generate the crowd density map. The ground-truth density map D(x) is calculated by convolving a delta function with a Gaussian kernel:
D(x) = Σ_{i=1}^{N} δ(x − x_i) * G_σi(x)
where N is the number of pedestrians in the image and x_i is the position of the i-th head; the parameter settings follow [14,18,19]. The integral of the density map equals the number of people in the crowd image; the process of generating the density map is shown in Figure 3. In the following datasets, the main attributes we use are the number of images (N), the number of channels (C) of each image, its width (W) and height (H), and the head coordinates (x_i). To prevent over-fitting, we augment the training dataset by randomly selecting 100 locations in each original image and cropping patches whose final size is 1/4 of the original image. Horizontal flipping and added noise are then applied to each cropped image, which finally generates 300 patches per image from the original dataset. The learning rate is set to 0.00001 with a momentum of 0.9 to update the network parameters, and we perform end-to-end training with the weighted combination of Euclidean and adversarial losses.
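The density-map generation can be sketched as follows, assuming the geometry-adaptive kernel scheme of [19]: each head point is blurred with a Gaussian whose σ is proportional (factor β, commonly 0.3 in [19]; the k = 3 neighbours and the fallback σ for a lone head are likewise assumptions) to the mean distance to its k nearest annotated neighbours, normalised so each head contributes exactly one unit of mass. The brute-force neighbour search is for clarity only:

```python
import numpy as np

def density_map(points, h, w, k=3, beta=0.3):
    """Ground-truth density map from head coordinates (x=col, y=row)."""
    dmap = np.zeros((h, w), dtype=np.float64)
    pts = np.asarray(points, dtype=np.float64)
    yy, xx = np.mgrid[0:h, 0:w]
    for x, y in pts:
        if len(pts) > 1:
            # geometry-adaptive sigma: beta * mean distance to k nearest heads
            d = np.sort(np.hypot(pts[:, 0] - x, pts[:, 1] - y))[1:k + 1]
            sigma = beta * d.mean()
        else:
            sigma = 4.0  # fallback for a single annotated head (assumed value)
        g = np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2 * sigma ** 2))
        dmap += g / g.sum()  # normalise so each head integrates to exactly 1
    return dmap

heads = [(10, 12), (14, 15), (40, 40)]
dm = density_map(heads, 64, 64)
print(dm.sum())  # integral equals the head count: 3
```

Because each kernel is normalised over the image grid, the integral of the map equals the head count even for heads near the border.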

Experiments
Experiments are conducted on two representative public datasets and compared with existing methods to demonstrate the robustness of our approach. We evaluate performance with the mean absolute error (MAE) and the mean squared error (MSE) used in previous work [12,14,15,[18][19][20]], defined as follows:
MAE = (1/N) Σ_{i=1}^{N} |Y_i − Y′_i|
MSE = sqrt( (1/N) Σ_{i=1}^{N} (Y_i − Y′_i)^2 )
where N is the number of test images, Y_i is the ground-truth number of pedestrians in the i-th image, and Y′_i is the estimated number. MAE measures the average magnitude of the prediction error and better reflects the actual error, while MSE reflects the dispersion of the errors; smaller values for both indicate better predictive ability.
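A minimal sketch of the two metrics, taking MSE in its common crowd-counting form as the root of the mean squared count error:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error over per-image counts."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    """Root of the mean squared count error, as commonly reported as 'MSE'."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy ground-truth vs. estimated counts for three test images.
gt, est = [120, 300, 45], [110, 320, 50]
print(mae(gt, est))
print(mse(gt, est))
```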

Experiment on ShanghaiTech dataset
The ShanghaiTech dataset was created by [19]. It includes 1,198 annotated indoor and streetscape images with a total of 330,165 pedestrians, covering crowds photographed at different angles, and consists of two parts: 482 images in Part_A and 716 images in Part_B. Each part is further divided into a training set and a test set: the training sets of Part_A and Part_B contain 300 and 400 images respectively, and the remaining images are used for testing. The proposed method is compared with five recent state-of-the-art methods on the ShanghaiTech dataset: [15], MCNN [19], Switching-CNN [18], Cascaded-MTL [35], and CP-CNN [30]. Comparative results are shown in Table 2. [15] proposed two learning objectives for crowd counting and density estimation, learning the network by alternately training the two objective functions. [19] used a multi-column CNN to address the multi-scale variation in crowd images and proposed a density-map generation method. [18] proposed a switching CNN classifier that selects the suitable network branch to handle large scale and perspective variation, improving the accuracy of crowd estimation. [35] proposed a multi-task cascaded CNN that utilizes a high-level prior to learn crowd count classification and density map estimation jointly. In [30], the authors extracted global and local context information to generate a high-quality density map with lower estimation error. As Table 2 shows, MSA-CNN compares favorably with these methods.
Figure 4: Density maps estimated by MSA-CNN on the ShanghaiTech Part_B dataset; the first column shows test images, the second the ground-truth density maps, and the third the density maps estimated by our approach (MSA-CNN).

Experiment on UCF_CC_50 dataset
UCF_CC_50 was first introduced by [12]. It is a challenging dataset consisting of 50 crowd images with a total of 63,974 persons; the crowd counts range from 96 to 4,543, so there is a large variation in crowd density across images. Following [12], we use five-fold cross-validation and report the average test performance. The authors of [15] proposed combining multiple sources of information, such as Fourier analysis, head detection, and texture features, to generate density maps and crowd counts. A comparison with six existing methods is shown in Table 3. Our method achieves lower error than the other methods. Figure 6 shows some visualized examples obtained by our method on the UCF_CC_50 dataset.

Comparisons with State-of-the-art
The proposed approach is compared with several state-of-the-art methods on the two benchmarks, with results shown in Tables 2 and 3. Table 2 reports the comparison on the ShanghaiTech dataset: the proposed MSA-CNN obtains a significant improvement over prior methods and achieves the best MAE and MSE on Part_A. This part is closer to realistic monitoring scenes than the others, which indicates that our algorithm performs well and stably in actual scenes. It also shows good results on Part_B, demonstrating that the proposed method is robust and can be applied to scenes with sparse crowds. In Table 3, we compare the performance of MSA-CNN with other methods on the UCF_CC_50 dataset using MAE and MSE as metrics:

Method              MAE    MSE
[12]                419.5  541.6
[15]                467.0  498.5
MCNN [19]           377.6  173.2
Cascaded-MTL [35]   322.8  341.4
Switching-CNN [18]  318.1  439.2
CP-CNN [30]         295.8  320.9
MSA-CNN (ours)      293.9  361.6

MSA-CNN outperforms all other methods in MAE and obtains a competitive MSE score, which indicates the robustness of the predicted counts. Considering practical applications of crowd-counting algorithms, we also perform a simple parameter study. As shown in Table 4, MCNN has the fewest parameters, while CP-CNN has roughly 500 times more than MCNN; in contrast, our algorithm has a relatively small number of parameters.
Figure 6: Density maps estimated by MSA-CNN on the UCF_CC_50 dataset; the first column shows test images, the second the ground-truth density maps, and the third the density maps estimated by our approach (MSA-CNN).

Conclusion
In this paper, a multi-scale adversarial convolutional neural network is designed to estimate the crowd density map and the number of pedestrians in crowd images. An improved multi-column convolutional neural network extracts high-dimensional feature maps, and fractionally-strided convolutional layers recover the detail lost to the preceding max-pooling layers. We further exploit the superior performance of GANs in image reconstruction, thereby improving the resolution of the estimated density map and reducing the crowd estimation error. The model is trained end-to-end by optimizing a weighted combination of Euclidean and adversarial losses, and its number of parameters is low. Extensive experiments on challenging datasets demonstrate significant improvements over existing methods.