BY 4.0 license Open Access Published by De Gruyter Open Access June 19, 2023

Application of SSD network algorithm in panoramic video image vehicle detection system

  • Tao Jiang
From the journal Open Computer Science


Due to the popularity of high-performance cameras and the development of computer video pattern recognition technology, intelligent video monitoring technology is widely used in all aspects of social life. For example, industrial control systems use video monitoring technology for remote and comprehensive monitoring. In agriculture, farm administrators can view the activities of animals in real time through smartphones, and agricultural experts can predict future weather changes according to the growth of crops. In the implementation of intelligent monitoring systems, automatic detection of vehicles in images is an important topic. The construction of China’s Intelligent Transportation System started late, especially in video traffic detection. Although there are many related studies on video traffic detection algorithms, these algorithms usually only analyze and process information from a single sensor. This article describes the application of the single-shot detector (SSD) network algorithm in a panoramic video image vehicle detection system and investigates its effectiveness. The experimental results show that the detection accuracy (Precision) of a single convolutional neural network (CNN) algorithm is only 0.7554, its recall rate is 0.9052, and its comprehensive detection accuracy (FL-score) is 0.8235. The detection accuracy of the SSD network algorithm is 0.8720, its recall rate is 0.9397, and its comprehensive detection accuracy is 0.9046, all of which are higher than those of the single CNN algorithm. Thus, compared with a single CNN algorithm, the proposed SSD network algorithm is more suitable for vehicle detection, and it plays an important role in panoramic video image vehicle detection.

1 Introduction

With the development of intelligent transportation and information technology, intelligent transportation video monitoring technology has become a major research topic, and real-time detection of moving vehicles is the core of this topic. The video vehicle detection algorithm can provide a theoretical basis for the inference and understanding of traffic behavior and traffic incidents. An effective target detection algorithm is of great significance to the operation of the entire intelligent transportation monitoring system. However, in actual monitoring scenarios, many factors affect the accuracy of target detection, such as changing light, vehicle occlusion, bad weather, and swaying leaves. Therefore, it is necessary to further study an algorithm with better overall detection performance in various scenarios.

Through the research and development of a video vehicle detection system, this article effectively obtains various kinds of traffic information, such as traffic flow, traffic density, and vehicle speed on the road. This enables traffic management departments to plan traffic for the whole city in a timely and accurate manner and to implement intelligent business management. Video detection can not only provide general traffic statistics, but also detect illegal behaviors such as speeding, illegal parking, vehicles not following the driving route, and driving in the wrong direction. Combined with license plate recognition technology, it can capture and record illegal vehicles, and these functions have become more and more popular.

The innovations of this article are as follows: (1) This article introduces the theory of the panoramic video image vehicle detection system and the single-shot detector (SSD) network algorithm. An SSD network algorithm is proposed based on the convolutional neural network (CNN), and how the SSD network algorithm plays a role in vehicle detection in panoramic video images is analyzed. (2) The detection effects of the SSD network algorithm and three other detection algorithms are compared and analyzed. The experiments show that, in panoramic video image vehicle detection, the SSD network algorithm has higher detection accuracy and recall than the other detection algorithms, and its false detection rate is also lower. This is conducive to the detection of vehicles in video images, thus helping to reduce the occurrence of traffic accidents.

2 Related work

At present, the requirements of real-time video vehicle monitoring for traffic monitoring are becoming more and more stringent, and research on vehicle monitoring is of great concern. Among them, Hadi et al. presented a new method to detect and track moving vehicles effectively in complex road scenes with shadows and partial obstruction by combining improved background subtraction and innovative adaptive search window methods [1]. Chen et al. found it difficult to detect small objects such as vehicles in satellite images. Many features have been used to improve the performance of target detection, but they are mainly used in simple environments such as roads [2]. Moutakki et al. offered a real-time management and control system for assessing road traffic utilizing a fixed camera [3]. Chen et al. studied the construction of a region covariance descriptor (RCD) for vehicle detection. They proposed a unified method to construct RCD features by using a constant convolution kernel in the form of a two-dimensional mask [4]. Garcia et al. presented a novel sensor fusion method. This method uses data fusion technology to achieve safer roads and focuses on the interaction between vehicle drivers and smart vehicles [5]. These studies build on previous research on vehicle monitoring and provide ideas for the research in this article. However, the accuracy and effectiveness of the methods they use for vehicle monitoring are not high.

Both SSD and CNN algorithms have been applied to detection tasks in vehicle monitoring and in other fields. Okaishi et al. described a system for detecting and measuring vehicle queue length at intersections in real time. To adapt the SSD algorithm to the system and improve the accuracy of the vehicle identification process, they made several changes to it [6]. Hu et al. created the residual encoder–decoder convolution network for low-dose CT imaging by combining auto-encoder, deconvolution network, and shortcut connection [7]. Heng et al. suggested a deep CNN and transfer learning-based approach for retrieving cultivated land information (DTCLE), in which transfer learning mechanisms are used to extract information on cultivated land [8]. Acharya et al. suggested a CNN structure with four max-pooling layers and three fully connected layers. It uses electrocardiogram signal segments with durations of 2 and 5 s to detect coronary artery disease [9]. These papers show that such algorithms improve the detection efficiency and accuracy for the objects to be detected to a certain extent. However, in the study of vehicle monitoring, they are difficult to implement and apply.

3 Vehicle detection algorithm based on SSD network and CNN

3.1 Status of panoramic video image vehicle detection system

With the rapid development of the world economy, the number of cars is increasing, and urban road traffic problems such as traffic jams, frequent traffic accidents, and serious environmental pollution are growing [10]. Therefore, how to further improve the traffic capacity for pedestrians and vehicles has become a major concern. At present, the application of the intelligent transportation system (ITS) in the transportation system is an important means to solve these traffic problems.

ITS is an efficient integrated traffic management system implemented under the increasingly urgent situation of traffic congestion, environmental pollution, energy waste, low traffic efficiency, and so on [11]. This system works in a wide range and direction [12]. The working principle of ITS is shown in Figure 1.

Figure 1: Working principle of ITS.

As shown in Figure 1, as an important part of ITS, traffic video monitoring system is an efficient and real-time video monitoring system based on computer data processing, digital image processing, pattern recognition, and machine learning technology. This intelligent system can not only save manpower, but also help staff deal with emergencies more effectively to a certain extent, so as to improve the monitoring efficiency. A key problem in intelligent video surveillance is how to extract useful information from a large number of video data independently and effectively. Among them, the relatively important information is pedestrian and vehicle information [13].

To sum up, in urban road traffic, the vehicle detection problem contained in video surveillance system has become a research hotspot. People can control traffic by detecting traffic flow parameters, such as speed, traffic flow, and traffic density, so as to reduce traffic congestion, reduce the incidence of traffic accidents, and improve the traffic environment. The core content of vehicle detection is detection algorithm. Therefore, the research on moving vehicle detection algorithm under video image has important economic and social significance [14].

3.2 Vehicle detection algorithm based on CNN

At present, detection algorithms based on video images are involved in many applications related to computer vision. For example, video surveillance, intelligent environments, and video retrieval all need to detect and track moving objects by analyzing video images [15]. In these applications, a moving target detection algorithm with a low false detection rate is only a preprocessing step, but it is quite important [16]. Over the past several decades, scholars have conducted a lot of research on the challenges involved in moving target recognition and proposed many detection techniques, which have great guiding value in this field. Among these existing algorithms, the optical flow method and the frame difference method are the core methods of moving object detection.

The optical flow approach may detect moving targets without prior knowledge of the scene’s information. However, in complicated settings, interference variables such as shadow, occlusion, and noise will alter the optical flow field calculation findings. Furthermore, the optical flow method’s processing step is relatively difficult, so it does not have good real-time performance [17].

The disadvantage of the inter-frame difference method is that it is as vulnerable to noise interference as the optical flow method, and it has difficulty distinguishing foreground motion from a dynamic background. In addition, this method cannot effectively detect moving targets with uniform color or a large outline, and the detection results will contain “holes” and “ghosts” [18].

In the domains of computer vision, pattern classification and identification, and image processing, vehicle detection is one of the research subjects. CNNs have achieved significant progress in vehicle detection and diverse target detection tasks in recent years, putting them at the forefront of target detection technology and promoting its rapid development [19,20]. Figure 2 depicts the whole network structure of a CNN.

Figure 2: CNN complete network structure.

As shown in Figure 2, convolution layers make up a large proportion of the CNN. A convolution kernel is a small matrix of learnable weights that slides over the input image or feature map and computes a weighted sum of each local region, so its parameter matrix can be viewed as a filter with a specific function. A convolution layer can contain any number of kernels, and the features they extract are passed to later layers, which finally classify the image. When the input has multiple channels, the parameter matrix of the convolution kernel is also initialized in the corresponding multi-channel form. The calculation formula is as follows:

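(1) $u_j^l = \sum_{i \in M_j} A_i^{l-1} * k_{ij}^l + b_j^l$,

where $A_i^{l-1}$ represents the ith feature map of layer l − 1, $k_{ij}^l$ and $b_j^l$ are the convolution kernel and bias of layer l, and $M_j$ is the set of input feature maps connected to the jth output. When l = 2, $A_i^{l-1}$ represents the ith channel of the original input image.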
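As an illustration of formula (1), the following is a minimal pure-Python sketch rather than the implementation used in this article. It assumes that every input channel is connected to every output channel, a stride of 1, and no padding, and it computes the sliding weighted sum without flipping the kernel, as deep-learning frameworks usually do. The function name `conv_layer_output` and the example values are hypothetical.

```python
def conv_layer_output(A_prev, kernels, bias):
    """Sketch of u_j = sum_i A_prev[i] * kernels[i][j] + bias[j].

    A_prev:  list of C_in feature maps, each an H x W nested list
    kernels: kernels[i][j] is the k x k kernel linking input map i to output map j
    bias:    list of C_out bias values
    Assumptions: stride 1, no padding, all input maps feed every output map.
    """
    c_in = len(A_prev)
    h, w = len(A_prev[0]), len(A_prev[0][0])
    c_out = len(bias)
    k = len(kernels[0][0])
    out_h, out_w = h - k + 1, w - k + 1
    # Start every output position at the bias of its output channel.
    u = [[[bias[j]] * out_w for _ in range(out_h)] for j in range(c_out)]
    for j in range(c_out):
        for i in range(c_in):
            kern = kernels[i][j]
            for y in range(out_h):
                for x in range(out_w):
                    # Weighted sum over the k x k window anchored at (y, x).
                    u[j][y][x] += sum(
                        A_prev[i][y + dy][x + dx] * kern[dy][dx]
                        for dy in range(k) for dx in range(k)
                    )
    return u


# Hypothetical example: a 3-channel 5 x 5 input of ones, two output channels.
A = [[[1.0] * 5 for _ in range(5)] for _ in range(3)]
K = [[[[1.0] * 3 for _ in range(3)] for _ in range(2)] for _ in range(3)]
b = [0.5, -1.0]
u = conv_layer_output(A, K, b)
```

Each output value here is the sum of 3 × 3 × 3 = 27 ones plus the channel bias, which makes the role of the sum over input maps in formula (1) easy to check by hand.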

The gradient descent method and the loss function are two important aspects of the training process. Gradient descent is an iterative optimization method that updates the network parameters in the direction of the negative gradient of the loss function, so that the loss decreases step by step. The loss function maps the network parameters to a value that measures the size of the model's error on the training set for a given input. The goal of training is to get the CNN’s output as close to the true value as possible, and in this research the loss function is used to quantify the difference between the model’s predicted and actual values. The input to a generic neural network consists of data and labels, where the labels are the real values that correspond to the data [21].
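The update rule behind gradient descent can be sketched on a one-parameter toy problem. This is only an illustration under simple assumptions: the quadratic loss, the learning rate, and the function name `gradient_descent` are hypothetical and are not the training setup of this article.

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly move the parameter against the gradient of the loss."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w


# Toy loss(w) = (w - 3)^2 with gradient 2 * (w - 3); its minimum is at w = 3.
w_star = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

In network training the same rule is applied to all weights at once, with the gradient usually estimated on a small batch of samples (stochastic gradient descent).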

We take Softmax as an example of how to quantify the difference between the model’s predicted and actual values. Assuming that the number of categories is N and that the labels are 1 to N, the Softmax formula is as follows:

(2) $S_j = \frac{e^{a_j}}{\sum_{k=1}^{N} e^{a_k}}$.

Here, the output S is a matrix of size 1 × N, and the corresponding CNN outputs are the probabilities of the first class to the Nth class. The Softmax loss is calculated as follows:

(3) $L = -\sum_{j=1}^{N} b_j \log S_j$,

where $b_j$ indicates whether the real value is the jth class or not. For example, if the number of categories is 5 and the true value of the data is the third category, then $b_3 = 1$ and the others are equal to 0.
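The Softmax probabilities and the corresponding loss can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the experiments. The score values are hypothetical, and subtracting the maximum score before exponentiation is a common numerical-stability step that does not change the result.

```python
import math


def softmax(a):
    """Turn raw scores a_1..a_N into probabilities S_1..S_N."""
    m = max(a)  # subtracting the maximum leaves the ratios unchanged
    exps = [math.exp(x - m) for x in a]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(a, true_class):
    """Loss L = -sum_j b_j * log(S_j) for a one-hot label.

    true_class is 1-based to match labels 1..N, so only the
    term of the true class survives in the sum.
    """
    s = softmax(a)
    return -math.log(s[true_class - 1])


# Hypothetical scores for N = 5 categories; the true label is category 3.
scores = [1.0, 2.0, 3.0, 0.5, 0.1]
probs = softmax(scores)
loss = softmax_loss(scores, true_class=3)
```

The loss approaches 0 as the probability assigned to the true class approaches 1, which is why minimizing it pushes the network output toward the label.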

With the popularity of ITS, the demand for computer vision-based vehicle detection technology is increasing, and the requirements for accuracy and speed are also increasing [22]. Therefore, an SSD detection algorithm is presented in this article.

3.3 Detection ideas for target detection network SSD

SSD CNN is a deep CNN model based on the regression method, which is mainly used to solve the target detection problem [23]. It can perform end-to-end training and optimization with real-time detection speed and high precision, and it has a broad range of applications. At present, SSD is used in pedestrian detection, face detection, object recognition, and other applications.

SSD CNN is a supervised deep-learning model. It uses the label information of the input data to guide the training and learning of the model, which is mainly composed of convolution layer, activation function layer, and pooling layer. It combines target recognition tasks with border regression tasks through a multitask learning strategy. It is integrated into an end-to-end deep CNN, which makes the optimization and training of the whole model very easy. Its structure diagram is shown in Figure 3.

Figure 3: SSD network structure diagram.

As shown in Figure 3, as the network gets deeper, the resolution of the additional feature response maps decreases and the receptive field of a single neuron increases. By combining multiscale feature response maps for joint detection, SSD can use shallow feature vectors for small objects and deep feature vectors for large objects. This ensures that information at different scales is preserved [24].

SSD is an improvement on You Only Look Once (YOLO). By utilizing the limited receptive field of the convolution kernel, the feature extraction area is narrowed from the full image to a local region, that is, the prediction is made using only the features of the area surrounding a location [25]. Moreover, it uses the different resolutions of the feature response maps to extract region features at different scales, which greatly improves the detection accuracy.

It also uses the anchor box mechanism of Faster-RCNN to establish the correspondence between positions and features in the feature map. However, Faster-RCNN only generates candidate boxes of different scales on the last layer of the feature response map, so its region candidate boxes with different aspect ratios do not extract multiscale features. In contrast, SSD generates candidate boxes with different scales and aspect ratios from the feature response maps of various resolutions. The candidate box for SSD is shown in Figure 4.

Figure 4: Candidate box for SSD.

SSD discretizes the output space using region candidate boxes of varying scales and aspect ratios, as shown in Figure 4. It only predicts the likelihood that each region candidate box contains each type of target, as well as the offset of each region candidate box’s location. It scores each region candidate box by confidence, keeps the boxes most likely to contain a target after non-maximum suppression, and adjusts each box’s position and shape based on the offset. The result is the final detection output.
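The non-maximum suppression step mentioned above can be sketched as a greedy procedure. This is a simplified single-class illustration, not the implementation used in this article. The box format `(x1, y1, x2, y2)`, the overlap threshold of 0.5, and the example boxes are assumptions chosen for the sketch.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # A box survives only if it does not overlap any already kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical candidate boxes: the first two cover the same vehicle.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_maximum_suppression(boxes, scores)
```

In this example the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the third box is far away and is kept.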

A special convolution operation called dilated convolution (also known as atrous, hole, or void convolution) has been proposed by some researchers. It expands the receptive field of the network without changing the size of the feature map, thereby reducing the loss of information in the image. It allows each convolution output to contain a larger range of information. In the underlying network of SSD, the convolution layer Conv6 uses a dilated convolution operation. An important parameter in dilated convolution is the dilation parameter. An example of two-dimensional dilated convolution is shown in Figure 5.

Figure 5: Example of two-dimensional dilated (void) convolution. (a) Hole convolution with dilation value of 1. (b) Hole convolution with dilation value of 2.

A convolution layer usually has four main adjustable parameters: the number of convolution kernels, the size of the convolution kernels, the stride (sliding step), and the padding (edge filling) parameter. Padding refers to filling the boundary of the input image or feature map with 0 elements to increase its height and width. Assuming that the input size is w × w, the size of the convolution kernel is k × k, the stride is s, and the padding parameter is p, the size of the output feature map is $w_1 \times w_1$, given by the following formula:

(4) $w_1 = \frac{w - k + 2 \times p}{s} + 1$.

When the dilated (void) convolution illustrated in Figure 5 is used and the value of the dilation parameter is set to d, the size of the output feature map is $w_2 \times w_2$, where the value of $w_2$ is as follows:

(5) $w_2 = \frac{w + 2 \times p - (d \times (k - 1) + 1)}{s} + 1$.
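Formulas (4) and (5) can be checked with a small helper. This is a sketch with a hypothetical function name and illustrative input values. It rounds the division down, which is the usual convention when the stride does not divide evenly and is an assumption here, since the formulas in the text do not state a rounding rule.

```python
def conv_output_size(w, k, p, s, d=1):
    """Output width of a convolution on a w x w input.

    k: kernel size, p: padding, s: stride, d: dilation.
    With d = 1 the effective kernel size is k and formula (5) reduces to formula (4).
    """
    effective_k = d * (k - 1) + 1  # span covered by the dilated kernel
    return (w + 2 * p - effective_k) // s + 1


# Illustrative values only, not settings reported in this article.
same_size = conv_output_size(300, k=3, p=1, s=1)      # ordinary 3 x 3 convolution
dilated = conv_output_size(19, k=3, p=6, s=1, d=6)    # dilated 3 x 3 convolution
strided = conv_output_size(7, k=3, p=0, s=2)
```

The second call shows the property described in the text: with a dilation of 6 the kernel spans 13 positions, yet with suitable padding the feature map keeps its 19 × 19 size while each output sees a much larger input region.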

3.3.1 Activation function layer

In the SSD CNN, the activation function mainly performs a nonlinear transformation on the results obtained by the convolution layer, which enhances the nonlinear ability of the model. If the neural network consisted only of fully connected layers and convolution layers, the weights and biases would only apply linear transformations, and the output of each layer would be a linear combination of the inputs from the previous layer. However deep such a network becomes, a purely linear model cannot handle complex tasks such as language translation and image classification. Nonlinear activation functions commonly used in neural networks include Sigmoid, tanh, and ReLU. Among them, ReLU is the activation function used in SSD CNN. The three activation functions are defined as follows:

Sigmoid activation function:

(6) $f(a) = \frac{1}{1 + e^{-a}}$.

Tanh activation function:

(7) $f(a) = \frac{1 - e^{-2a}}{1 + e^{-2a}}$.

ReLU activation function:

(8) $f(a) = \max(0, a)$.

Diagrams of the three activation functions are shown in Figure 6.

Figure 6: Diagram of three activation functions. (a) Sigmoid activation function. (b) tanh activation function and (c) ReLU activation function.

As shown in Figure 6, the Sigmoid activation function is a monotonically increasing continuous function, and its output value is always between 0 and 1. It is easy to differentiate, but its output is not zero-centered. The Sigmoid curve flattens on both sides, which means that its derivative tends to 0 there. This soft saturation easily causes vanishing gradients, leading to training problems. ReLU does not saturate for positive inputs and is cheap to compute, so it can effectively reduce the time of model training.
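The three activation functions in formulas (6) to (8), together with the Sigmoid derivative that explains the saturation effect, can be written directly in Python. This is a minimal sketch for moderate input values, and the helper name `sigmoid_grad` is hypothetical.

```python
import math


def sigmoid(a):
    """Formula (6): output always lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-a))


def tanh(a):
    """Formula (7): output lies between -1 and 1 and is zero-centered."""
    return (1.0 - math.exp(-2.0 * a)) / (1.0 + math.exp(-2.0 * a))


def relu(a):
    """Formula (8): passes positive inputs unchanged and zeroes the rest."""
    return max(0.0, a)


def sigmoid_grad(a):
    """Derivative of the Sigmoid, f(a) * (1 - f(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)


# The Sigmoid gradient is largest at 0 and almost vanishes for large |a|.
grad_center = sigmoid_grad(0.0)
grad_saturated = sigmoid_grad(10.0)
```

Comparing `grad_center` with `grad_saturated` makes the saturation argument concrete: the gradient at an input of 10 is several thousand times smaller than at 0, whereas the slope of ReLU stays at 1 for every positive input.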

3.3.2 Pooling layer

In the SSD convolution network, the pooling layer usually follows the convolution layer. It reduces the size of the feature map and makes the model less prone to overfitting. There are usually three pooling operations: average pooling, maximum pooling, and random (stochastic) pooling.

For an input feature map of size h × h, a pooling window of size k × k, padding p, and stride s, the size of the output feature map is $h_1 \times h_1$:

(9) $h_1 = \frac{h - k + 2 \times p}{s} + 1$.

The output of layer l is given by the following formula:

(10) $a_j^l = f(\beta^l \times \mathrm{Pool}(a_i^{l-1}) + B^l)$,

where $a_i^{l-1}$ represents the output of layer l − 1, that is, the output of the convolution layer and the input of the pooling layer, and f denotes the activation function, which can be either a Sigmoid function or a tanh function. If Pool is maximum pooling, then we have the following formula:

(11) $\mathrm{Pool}(a_i^{l-1}) = \max(a_i^{l-1})$.

If Pool is average pooling, then we have the following formula:

(12) $\mathrm{Pool}(a_i^{l-1}) = \mathrm{mean}(a_i^{l-1})$.

Random pooling outputs one value from the selected pooling area at random, with a probability proportional to the size of each value. All three pooling methods have local invariance: when there is a slight change in the input data, they still produce the same pooled features. The maximum and average pooling operation diagrams are shown in Figure 7.

Figure 7: Diagram of two pooling modes.

As shown in Figure 7, the essence of pooling is actually sampling. By aggregating data information from adjacent rectangular areas, the obtained statistics can be used to replace the characteristics of this area, which is pooling.
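Maximum and average pooling can be sketched as follows. This is a minimal illustration without padding (p = 0). The function name `pool2d` and the 4 × 4 example feature map are hypothetical and are not taken from Figure 7.

```python
def pool2d(feature_map, k, s, mode="max"):
    """Pool a 2D feature map with a k x k window and stride s (no padding)."""
    if mode not in ("max", "mean"):
        raise ValueError("mode must be 'max' or 'mean'")
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for y in range(0, h - k + 1, s):
        row = []
        for x in range(0, w - k + 1, s):
            # Collect the values of the rectangular area being summarized.
            window = [feature_map[y + dy][x + dx]
                      for dy in range(k) for dx in range(k)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out


# Hypothetical 4 x 4 feature map, pooled with a 2 x 2 window and stride 2.
fm = [[1, 3, 2, 4],
      [5, 6, 7, 8],
      [3, 2, 1, 0],
      [1, 2, 3, 4]]
max_pooled = pool2d(fm, k=2, s=2, mode="max")
mean_pooled = pool2d(fm, k=2, s=2, mode="mean")
```

The output is 2 × 2, in agreement with formula (9) for h = 4, k = 2, p = 0, and s = 2, and each output value is a single statistic that stands in for its 2 × 2 area.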

3.3.3 Loss function

The SSD convolution network uses a multitask learning strategy. It performs both target localization and target recognition tasks and establishes a multitask loss function, which is expressed by the following formula:

(13) $L(a, c, l, g) = \frac{1}{N}\left(L_{conf}(a, c) + \alpha L_{loc}(a, l, g)\right)$,

where $L_{conf}$ is the classification loss function, $L_{loc}$ is the border regression loss function, and α is the weight factor that balances the two. In fact, SSD is trained with a large number of labeled default boxes as samples, as in formula (14):

(14) $L_{conf}(a, c) = -\sum_{i \in Pos}^{N} a_{ij}^{p} \log(c_i^{0})$.

Here $a_{ij}^{p}$ indicates whether the ith default box is matched to the jth real target of category p. Iterative training reduces the regression error of the four offsets between the predicted and real target frames, thereby adjusting the position and size of the predicted target frame.

3.4 Vehicle detection in video image based on SSD network algorithm

During the training phase of a classification task, the input of the network is a single picture and its corresponding label, from which the network learns. Training minimizes the error between the feature vector output by forward propagation of the picture and the label vector, as shown in the following formula:

(15) $\min_{w} \frac{1}{N} \sum_{i=1}^{N} l(b_i, h(a_i, w))$.

Here, $a_i$ represents the input signal, $b_i$ its label, w the network weight, and N the number of samples. One common feature of these input pictures, for example in the ImageNet dataset, is that each picture contains only one target, which occupies the majority of the image, with little background information, because redundant background noise can affect the network's learning of discriminative features. In the detection task, there are multiple targets in the picture. The label of the training set is no longer one category label per picture, but multiple category labels per picture, with each category label corresponding to a location label.

The training phase of the SSD network actually learns the parameters $w_f$ of network feature extraction and the prediction parameters corresponding to each region candidate box. The network training process solves the following optimization problem, as in formula (16):

(16) $\min_{w_f} \frac{1}{N} \sum_{i=1}^{N} l(b_i, h(a_i, \Theta_C, \Theta_L))$,

where $\Theta_L$ represents the location prediction parameter, and $\Theta_C$ represents the category probability prediction parameter.

For vehicle target detection tasks in panoramic video images, Precision, Recall, and FL-score are used to evaluate the detection performance of different methods, which are defined as follows:

(17) $\mathrm{Precision} = \frac{TP}{TP + FP}$,

(18) $\mathrm{Recall} = \frac{TP}{NP}$,

(19) $\text{FL-score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$.

TP refers to the number of real vehicle targets detected, FP refers to the number of background regions wrongly detected as vehicle targets, and NP refers to the total number of real vehicle targets in the image. Precision refers to the detection accuracy, and Recall refers to the recall rate. FL-score is the harmonic mean of Precision and Recall (commonly known as the F1-score); it balances the two and is the main reference index for evaluating detection performance. The higher the Precision and Recall values, the better the detection performance.

4 Video image vehicle monitoring experiment

4.1 Experiment of vehicle detection algorithm based on SD video image in different scenes

To verify the effectiveness of the proposed algorithm for vehicle detection in video images based on secure digital card (SD), three types of scenes are used as experimental datasets in this article. The optical flow, inter-frame difference, and SSD network algorithms are compared on video, and a frame is then selected from the output mask images. Finally, the calculated data and the corresponding polyline diagrams are compared and analyzed.

Scenario 1: It is an ideal environment without too much external disturbance. In this article, these algorithms are simulated on the highway video under Baseline, and the result of the 800th frame image in the output frame sequence image is shown in Figure 8.

Figure 8: Comparison of recall and accuracy of three methods in Scenario 1.

As shown in Figure 8, the SSD network detection algorithm proposed in this article and the other algorithms all achieve good overall performance and evaluation indexes in the ideal scenario. Among them, the recall rate and accuracy of the SSD network algorithm are the highest.

Scenario 2: The moving target has dynamic background interference such as being obscured by the trunk and swaying leaves. In this article, these algorithms are simulated on fall video under Dynamic Background, and the result of the 800th frame image in the output frame sequence image is shown in Figure 9.

Figure 9: Comparison of recall and accuracy of three methods in Scenario 2.

As shown in Figure 9, the evaluation indexes of the other detection algorithms are not as good as those of the SSD network algorithm. This also indicates to some extent that a detection algorithm that works well in all kinds of scenes is not always obtained by combining existing algorithms. The detection performance of the SSD network is much better than that of the other algorithms, and it greatly reduces the interference of the dynamic background on the detection of moving vehicles. The recall rate and accuracy of the proposed SSD network algorithm are the highest, which indicates that the algorithm is relatively robust.

Scenario 3: It has a night highway, low visibility, and other disturbing factors such as car light halo. In this article, these algorithms are simulated on the fluid highway under Night Videos, and the comparison results in the output frame sequence images are shown in Table 1.

Table 1: Comparison of three algorithms in Scenario 3

| Index | Optical flow method | Inter-frame difference method | SSD network algorithm |
| --- | --- | --- | --- |
| Precision | 0.478 | 0.542 | 0.678 |
| Recall | 0.469 | 0.5814 | 0.7587 |
| False detection rate | 0.012 | 0.009 | 0.005 |

As shown in Table 1, the optical flow method and the inter-frame difference method do not work very well in night scenes because the light at night is weak. The invisible areas of moving target vehicles therefore increase, which greatly reduces the recall and accuracy of all the detection algorithms. In addition, interference factors such as headlight halo and road reflections arise during detection, so the false detection rate is much higher than in the other scenes. Although the detection accuracy of the SSD network algorithm is not as high as in the previous scenarios, it is still higher than that of the other two algorithms.

4.2 Comparison of CNN and SSD network algorithms

The Caffe deep learning framework running on the Ubuntu 16.04 operating system is used in this article. The specific settings of Caffe’s network training are as follows. The stochastic gradient descent optimization method is used to reduce the loss function and update the network weights. The total number of training iterations is 600, and the batch size is 20.

The automotive training set is used to train the CNN model and the SSD model, and the detection performance of the two models is compared, as shown in Figure 10.

Figure 10: The detection performance of two models with various training iterations is compared. (a) The first 300 iterations of two models’ detection performance are compared. (b) A comparison of two models’ detection performance over the last 300 iterations.

As shown in Figure 10, the data distribution of the vehicle dataset is analyzed first. The region candidate boxes are then reset so that the output space can be discretized properly. Finally, the ideal matching threshold is investigated for the reset region candidate boxes, which increases the detection accuracy of the SSD network.

The proposed method is compared to the classic single CNN target detection algorithm in order to validate its superiority, as shown in Figure 11.

Figure 11: Two test images. (a) Test diagram A and (b) Test diagram B.

As shown in Figure 11, for CNNs, the bottom features usually represent the edge contour information of the target, while the top features usually represent the abstract semantic information of the target. Combining the low-level and high-level features is very helpful for target detection in such images.

Tables 2 and 3 compare the detection performance of the CNN and SSD network algorithms on the two test images.

Table 2

Detection performance of CNN for two test images

| Test images and indicators | Figure A | Figure B | Precision | Recall | F1-score |
| --- | --- | --- | --- | --- | --- |
| Total number of real targets | 52 | 64 | / | / | / |
| Number of real targets detected | 50 | 55 | 0.7554 | 0.9052 | 0.8235 |
| Number of false alarm targets | 18 | 16 | | | |

Table 3

Detection performance of SSD network algorithm for two test images

| Test images and indicators | Figure A | Figure B | Precision | Recall | F1-score |
| --- | --- | --- | --- | --- | --- |
| Total number of real targets | 52 | 64 | / | / | / |
| Number of real targets detected | 51 | 58 | 0.8720 | 0.9397 | 0.9046 |
| Number of false alarm targets | 6 | 10 | | | |
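
The precision, recall, and combined score in Tables 2 and 3 can be recomputed from the per-image counts. The sketch below assumes the standard definitions (precision = detected real targets / all detections, recall = detected real targets / all real targets, and the combined score as their harmonic mean) with counts pooled over both test images; under these assumptions the results agree with the tabulated values. The function name is illustrative.

```python
def detection_metrics(detected_real, false_alarms, total_real):
    """Return (precision, recall, f1) from pooled detection counts."""
    precision = detected_real / (detected_real + false_alarms)
    recall = detected_real / total_real
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1

total_real = 52 + 64  # real targets in test images A and B

# Single CNN (Table 2): 50 + 55 real targets detected, 18 + 16 false alarms.
cnn = detection_metrics(50 + 55, 18 + 16, total_real)

# SSD network (Table 3): 51 + 58 real targets detected, 6 + 10 false alarms.
ssd = detection_metrics(51 + 58, 6 + 10, total_real)

for name, (p, r, f1) in (("CNN", cnn), ("SSD", ssd)):
    print(f"{name}: precision={p:.4f} recall={r:.4f} F1={f1:.4f}")
```

The gain in precision is the larger of the two effects: SSD cuts the pooled false alarms from 34 to 16 while detecting 4 more real targets.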

As shown in Tables 2 and 3, the single CNN algorithm has the lower detection precision, only 0.7554, and the weaker detection performance overall: it cannot localize the target well, and its detection box is fixed and cannot be adjusted automatically. The main reason is that the single CNN algorithm considers only the intensity information of the target. It does not use the size, structure, and other information of the target, so the image information is underused. SSD, by contrast, uses multiscale convolutional feature maps from both the bottom and the top layers to make joint predictions and thus makes full use of the features.
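
To give a sense of what joint prediction over multiscale feature maps means in practice, the following sketch counts the default boxes contributed by each feature map. The layer sizes and boxes per cell are those of the SSD300 configuration in the original SSD paper and are an assumption here; the article does not state the exact configuration used in its experiments.

```python
# (feature map side length, default boxes per cell), from fine (bottom) to coarse (top).
# Assumed SSD300 layout from the original SSD paper, not taken from this article.
SSD300_LAYERS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

def boxes_per_layer(layers):
    """Number of default boxes predicted on each feature map."""
    return [side * side * per_cell for side, per_cell in layers]

def total_default_boxes(layers):
    """Total number of default boxes predicted jointly across all feature maps."""
    return sum(boxes_per_layer(layers))

per_layer = boxes_per_layer(SSD300_LAYERS)
total = total_default_boxes(SSD300_LAYERS)
```

The fine bottom maps contribute most of the boxes and cover small targets, while the coarse top maps contribute a few large boxes, so each scale of vehicle is handled by the layer best suited to it.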

5 Conclusion

The video vehicle detection system discussed in this article is an important part of the highway traffic system. By analyzing and processing the video image sequence of a road scene, it automatically detects road traffic information and recognizes vehicle characteristics. It also reflects the direction in which China's traffic management is being modernized. The research and development of key technologies for video vehicle detection systems can therefore yield far-reaching social and economic benefits. To address the drawbacks of classic image-based vehicle detection methods, this article presents a CNN technique and an SSD network algorithm. Section 3 discusses the benefits of these two methods. The SSD network method, in particular, builds on the benefits of the CNN technique to detect vehicles in a panoramic video image more precisely. Section 4 compares and analyzes the effects of the algorithms on four panoramic video images in various settings. In every case, the accuracy of the SSD network algorithm is higher than that of the previous approaches, and it produces fewer false alarms. As a result, the SSD network method is the more suitable choice for vehicle detection.

  1. Funding information: This work was supported by Anhui Provincial Department of Education Natural Science Key Project “Vehicle Control Based on Sensor Fusion and Driver Reaction Time in Typical Traffic Scenarios” (KJ2021A1394). This work was supported by Hefei Technology College Natural Science Major Project “Sensor Fusion for Object Recognition and Vehicle Control in Typical Traffic Scenes” (2021KJZD01). This work was supported by 2021 Hefei Technology College School-level Science Project Academic and Technical Leader (2021DTR02).

  2. Conflict of interest: The author declares that there are no conflicts of interest regarding the publication of this article.

  3. Ethical approval: This article does not contain any studies with human participants or animals performed by any of the authors.

  4. Data availability statement: The data that support the findings of this study are available from the corresponding author upon reasonable request.


[1] R. A. Hadi, L. E. George, and M. J. Mohammed, “A computationally economic novel approach for real-time moving multi-vehicle detection and tracking toward efficient traffic surveillance,” Arab. J. Sci. Eng., vol. 42, no. 2, pp. 817–831, 2017.10.1007/s13369-016-2351-8Search in Google Scholar

[2] X. Chen, S. Xiang, C. Liu, and C. Pan, “Vehicle detection in satellite images by hybrid deep convolutional neural networks,” IEEE Geosci. Remote. Sens. Lett., vol. 11, no. 10, pp. 1797–1801, 2017.10.1109/LGRS.2014.2309695Search in Google Scholar

[3] Z. Moutakki, I. M. Ouloul, K. Afdel, and A. Amghar, “Real-time system based on feature extraction for vehicle detection and classification,” Transp. Telecommun. J., vol. 19, no. 2, pp. 93–102, 2018.10.2478/ttj-2018-0008Search in Google Scholar

[4] X. Chen, R. X. Gong, L. L. Xie, S. Xiang, C. L. Liu, and C. H. Pan, “Building regional covariance descriptors for vehicle detection,” IEEE Geosci. Remote. Sens. Lett., vol. 14, no. 4, pp. 524–528, 2017.10.1109/LGRS.2017.2653772Search in Google Scholar

[5] F. Garcia, D. Martin, D. Arturo, and J. M. Armingol, “Sensor fusion methodology for vehicle detection,” IEEE Intell. Transp. Syst. Mag., vol. 9, no. 1, pp. 123–133, 2017.10.1109/MITS.2016.2620398Search in Google Scholar

[6] W. A. Okaishi, A. Zaarane, I. Slimani, I. Atouf, and M. Benrabh, “A vehicular queue length measurement system in real-time based on SSD network,” Transp. Telecommun. J., vol. 22, no. 1, pp. 29–38, 2021.10.2478/ttj-2021-0003Search in Google Scholar

[7] C. Hu, Z. Yi, M. K. Kalra, L. Feng, C. Yang, P. Liao, et al., “Low-Dose CT with a residual encoder-decoder convolutional neural network (RED-CNN),” IEEE Trans. Med. Imaging, vol. 36, no. 99, pp. 2524–2535, 2017.10.1109/TMI.2017.2715284Search in Google Scholar PubMed PubMed Central

[8] L. U. Heng, F. U. Xiao, C. Liu, L. I. Long-Guo, H. E. Yu-Xin, L. I. Nai-Wen, et al., “Cultivated land information extraction in UAV imagery based on deep convolutional neural network and transfer learning,” J. Mt. Sci., vol. 14, no. 4, pp. 731–741, 2017.10.1007/s11629-016-3950-2Search in Google Scholar

[9] U. R. Acharya, H. Fujita, O. S. Lih, M. Adam, J. H. Tan, C. K. Chua, et al., “Automated detection of coronary artery disease using different durations of ECG segments with convolutional neural network,” Knowl. Syst., vol. 132, no. sep.15, pp. 62–71, 2017.10.1016/j.knosys.2017.06.003Search in Google Scholar

[10] Y. Liu, L. Yang, M. Xu, and Z. Wang, “Rate control schemes for panoramic video coding,” J. Vis. Commun. Image Representation, vol. 53, no. MAY, pp. 76–85, 2018.10.1016/j.jvcir.2018.03.001Search in Google Scholar

[11] G. Li, N. Cao, P. Zhu, Y. Zhang, Y. Zhang, L. Li, et al., “Towards smart transportation system: A case study on the rebalancing problem of bike sharing system based on reinforcement learning,” J. Organ. End. User Comput. (JOEUC), vol. 33, no. 3, pp. 35–49, 2021, Search in Google Scholar

[12] J. Sang, P. Guo, Z. Xiang, H. Luo, and X. Chen, “Vehicle detection based on faster-RCNN,” Chongqing Daxue Xuebao/Journal Chongqing Univ., vol. 40, no. 7, pp. 32–36, 2017.Search in Google Scholar

[13] S. Parvin, L. J. Rozario, and M. E. Islam, “Vision-based on-road nighttime vehicle detection and tracking using taillight and headlight features,” J. Comput. Commun., vol. 9, no. 3, pp. 29–53, 2021.10.4236/jcc.2021.93003Search in Google Scholar

[14] J. Lei, Y. Dong, and H. Sui, “Tiny moving vehicle detection in satellite video with constraints of multiple prior information,” Int. J. Remote. Sens., vol. 42, no. 11, pp. 4110–4125, 2021.10.1080/01431161.2021.1887542Search in Google Scholar

[15] H. Wei and N. Kehtarnavaz, “Semi-supervised faster rcnn-based person detection and load classification for far field video surveillance,” Mach. Learn. Knowl. Extr., vol. 1, no. 3, pp. 756–767, 2019.10.3390/make1030044Search in Google Scholar

[16] S. B. Park, H. Y. Lim, and D. S. Kang, “Implementation of rotating invariant multi object detection system applying MI-FL based on SSD algorithm,” J. Korean Inst. Inf. Technol., vol. 17, no. 5, pp. 13–20, 2019.10.14801/jkiit.2019.17.5.13Search in Google Scholar

[17] I. Chattate, M. E. Khaili, and J. Ba Kk Oury, “A new fuzzy-TOPSIS based algorithm for network selection in next-generation heterogeneous networks,” J. Commun., vol. 14, no. 3, pp. 194–201, 2019.10.12720/jcm.14.3.194-201Search in Google Scholar

[18] L. Yang, Z. Qi, Z. Liu, H. Liu, M. Ling, L. Shi, et al., “An embedded implementation of CNN-based hand detection and orientation estimation algorithm,” Mach. Vis. Appl., vol. 30, no. 6, pp. 1071–1082, 2019.10.1007/s00138-019-01038-4Search in Google Scholar

[19] Z. Lv, Y. Li, H. Feng, and H. Lv, “Deep learning for security in digital twins of cooperative intelligent transportation systems,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 9, pp. 16666–16675, 2021.10.1109/TITS.2021.3113779Search in Google Scholar

[20] X. Zeng, Z. Wang, and Y. Hu, Enabling efficient deep convolutional neural network-based sensor fusion for autonomous driving, arXiv preprint arXiv:2202, 2022, p. 11231.10.1145/3489517.3530444Search in Google Scholar

[21] H. A. Haenssle, C. Fink, R. Schneiderbauer, F. Toberer, T. Buhl, A. Blum, et al., “Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists,” Ann. Oncol., vol. 29, no. 8, pp. 1836–1842, 2018.Search in Google Scholar

[22] S. Dabiri and K. Heaslip, “Inferring transportation modes from GPS trajectories using a convolutional neural network,” Transp. Res. Part. C. Emerg. Technol., vol. 86, no. JAN, pp. 360–371, 2018.10.1016/j.trc.2017.11.021Search in Google Scholar

[23] T. Hirasawa, K. Aoyama, and T. Tanimoto, “Application of artificial intelligence using a convolutional neural network for detecting gastric cancer in endoscopic images,” Gastric Cancer, vol. 87, no. Suppl 1, pp. 1–8, 2018.10.1016/j.gie.2018.04.025Search in Google Scholar

[24] F. C. Chen and R. Jahanshahi, “NB-CNN: Deep learning-based crack detection using convolutional neural network and Naïve Bayes data fusion,” IEEE Trans. Ind. Electron., vol. 65, no. 99, pp. 4392–4400, 2018.10.1109/TIE.2017.2764844Search in Google Scholar

[25] Q. Yuan, Y. Wei, X. Meng, H. Shen, and L. Zhang “A multiscale and multidepth convolutional neural network for remote sensing imagery pan-sharpening.” IEEE J. Sel. Top. Appl. Earth Observations Remote. Sens., vol. 11, no. 3, pp. 978–989, 2018.10.1109/JSTARS.2018.2794888Search in Google Scholar

Received: 2022-12-28
Revised: 2023-02-09
Accepted: 2023-03-10
Published Online: 2023-06-19

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
