1 Introduction

In security, detecting potentially dangerous situations as soon as possible is of vital importance. Constant human supervision of the images provided by Closed Circuit Television (CCTV) systems is not feasible. In recent decades, several efforts have been made to create automated video surveillance (AVS) systems that can locate potentially threatening objects or events in a video sequence [11]. With the widespread adoption of modern deep learning techniques, these systems obtain promising results [10, 19].

The development of these detection systems is normally driven by the goal of achieving both a high detection rate and a low false positive rate. Ideally, training data would contain representative instances from all possible application scenarios. In practice, obtaining such a huge amount of data is not feasible in terms of time and resources. This limitation forces data scientists to be cautious about overfitting and poor generalization when training new models [20]. Techniques such as dataset partitioning, L1 and L2 regularization or early stopping are applied to alleviate these problems [14]. However, the misclassification of samples in new scenarios must still be addressed. It is conceivable that a weapon detector could be trained on a dataset containing instances of all possible weapons, providing accurate detections and a small number of false positives; yet when such a detector is put into a surveillance system in a real scenario, the result is generally an unacceptable rate of false alarms [18]. This means that the system will almost certainly be switched off, especially in cases where the incidence of the event of interest is very low. In this context, we propose to add an additional step that models and filters the typical false alarms particular to the new scenario while largely maintaining the ability to detect the objects the detector was trained for.

We focus on a handgun detection problem [10]. When such a detector runs in a new scenario over a period of time, all of its detections, most likely false positives, can be stored and leveraged. Here we propose to use those false positives to train a deep autoencoder that models the typical false alarms of the particular scenario (and this can be done down to the level of individual cameras). The one-class classification approach has been widely used in the literature to detect abnormal and extreme data [6]. Autoencoders have shown the ability to perform this task even where other techniques fail [5].
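As a rough sketch of this collection step (the `detector.predict` interface and the 64 × 64 crop size are our assumptions, the latter matching the autoencoder input used later):

```python
import os
import cv2  # OpenCV, used here only for frame handling

def collect_detections(video_path, detector, out_dir, conf_threshold=0.5):
    """Store every detection crop produced in a new scenario.

    During an initial deployment period almost all detections are
    assumed to be false positives, so the crops saved here can later
    be used to train the scenario-specific autoencoder.
    """
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # `detector.predict` is a stand-in for the trained detector;
        # it is assumed to return (x1, y1, x2, y2, score) tuples.
        for (x1, y1, x2, y2, score) in detector.predict(frame):
            if score >= conf_threshold:
                crop = cv2.resize(frame[y1:y2, x1:x2], (64, 64))
                cv2.imwrite(os.path.join(out_dir, f"det_{saved:06d}.png"), crop)
                saved += 1
    cap.release()
    return saved
```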

In addition, due to the large number of images required to train deep learning models, we propose to use an entirely synthetic dataset for training and testing the autoencoder. It consists of frames captured from a realistic 3D environment that resembles the new scenario in which the detector would be deployed.

The rest of the paper is organized as follows. Section 2 gives the details of the handgun detector used. Section 3 covers the generation of the synthetic dataset of the new detection scenario. Section 4 describes the procedure followed to filter the false positives of the new scenario. Finally, Sect. 5 shows the performance of the proposed method and Sect. 6 summarizes the main conclusions.

2 Handgun Detector

To address the reduction of the false positive rate of a handgun detector we need one as a starting point. Classical machine learning methods based on keypoint matching, feature extraction and classification have been extensively applied to RGB images taken by CCTV cameras [10]. Most of these methods use the sliding window approach, in which the detection problem is solved as a classification problem in every examined window. This approach works not only with traditional methods but also with the newer deep learning classification architectures: convolutional neural networks (CNNs) can be used in the same way as support vector machines (SVMs) or cascade classifiers, without having to represent the image as a set of features before performing classification.
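For illustration, a minimal sketch of this reduction (the `classify` callable stands in for any window classifier, whether CNN, SVM or cascade; window size and stride are arbitrary choices here):

```python
def sliding_window_detect(image, classify, win=64, stride=16, threshold=0.5):
    """Run a binary classifier over every window position.

    `classify` takes a win x win crop and returns a handgun score;
    windows scoring above `threshold` become detections.
    """
    h, w = image.shape[:2]
    detections = []
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            score = classify(image[y:y + win, x:x + win])
            if score >= threshold:
                detections.append((x, y, x + win, y + win, score))
    return detections
```

In practice the window must also be swept over multiple scales and aspect ratios, which is precisely what makes the number of examined regions explode, as discussed next.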

The problem with the sliding window approach is the variability in object locations within the image and the large differences in aspect ratios, so the number of regions to be examined is huge. In [2] this problem is addressed by taking different regions of interest from the image using a selective search that extracts a manageable number of regions, called region proposals. R-CNN, Fast R-CNN and Faster R-CNN are the main detection architectures based on this region proposal approach [13]. In addition, there are other detection networks, such as YOLO (You Only Look Once) [12] or SSD (Single Shot Detector) [8], that predict the bounding boxes and the class probabilities for those boxes while examining the whole image only once.

Due to the unavailability of pretrained handgun detection models, we trained a handgun detector with a dataset provided by the University of Seville, Spain. The dataset contains 871 images with a total of 177 annotated handguns. Images come from two different CCTV-controlled scenarios. The CNN architecture selected was Faster R-CNN (Fig. 1).

Fig. 1. Faster R-CNN architecture used for the handgun detector [13]

3 Synthetic Dataset

Collecting and annotating data to train deep learning networks is a tedious task. It is even more complex for detection and segmentation problems, where the annotator must specify not only the class but also a rectangle around every object of interest in the image, or even a pixel-level contour. The easiest solution would be to use an existing dataset for the problem addressed, but this is not always possible.

To test the hypothesis of this work, we used Unreal Engine 4 [17] to generate synthetic data from a school hall (Fig. 2). Other popular alternatives, such as Unity [16] or Lumberyard/CryEngine [9], could also be used.

The synthetic scenario is similar to a high-school corridor. It is rendered with people walking across it, generating a dataset of 3000 images in which people carry handguns, mobile phones or nothing in their hands. Of these images, 2657 contain handguns and 343 do not contain the object of interest, with a total of 5437 handgun instances. Since we control the dataset generation, it is possible to automatically annotate where each handgun is. To store the annotations we used XML files with the format defined in the Pascal VOC 2012 Challenge [1], as sketched below. Although images containing the object of interest are not strictly necessary to train the autoencoder, they are needed to test the improvement of the detector+autoencoder system, since reducing the false positive rate might have a negative effect on the detection rate.
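A sketch of such an export step (the helper below is illustrative, and only the VOC fields needed here are written):

```python
import xml.etree.ElementTree as ET

def write_voc_annotation(filename, width, height, boxes, out_path):
    """Write a Pascal VOC 2012 style XML file for one frame.

    `boxes` is a list of (xmin, ymin, xmax, ymax) tuples, one per
    handgun, obtained directly from the rendering engine.
    """
    ann = ET.Element("annotation")
    ET.SubElement(ann, "filename").text = filename
    size = ET.SubElement(ann, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"
    for (xmin, ymin, xmax, ymax) in boxes:
        obj = ET.SubElement(ann, "object")
        ET.SubElement(obj, "name").text = "handgun"
        bnd = ET.SubElement(obj, "bndbox")
        ET.SubElement(bnd, "xmin").text = str(xmin)
        ET.SubElement(bnd, "ymin").text = str(ymin)
        ET.SubElement(bnd, "xmax").text = str(xmax)
        ET.SubElement(bnd, "ymax").text = str(ymax)
    ET.ElementTree(ann).write(out_path)
```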

Fig. 2. Synthetic scenario

4 Autoencoder

Autoencoder networks learn an approximation to an input data distribution. In other words, they learn how to compress input data into a short code and then reconstruct that code into something as close as possible to the original input. The structure of these networks consists of an encoder path, which reduces the dimensionality while ignoring signal noise, and a decoder path, which performs the reconstruction. Autoencoders are commonly used for anomaly detection or one-class classification [4].

In our case the autoencoder is trained to model one class: the typical false positives of the particular scenario (see Figs. 3 and 4).

Fig. 3. Autoencoder training phase

Fig. 4. Typical false positives of the handgun detector in this scenario (enlarged)

Once the autoencoder is trained, it is applied to reconstruct images from a test dataset that also contains instances of the object of interest. The reconstruction error is lower for the class used to train the autoencoder, so a threshold can be established to separate the two classes (in this case, false positives from real positives). The trained autoencoder therefore acts as a filter that checks whether each output of the detector is one of its typical false positives (Fig. 5). In this work, two different error measures were applied: the Mean Squared Error (MSE) [15] and the Peak Signal-to-Noise Ratio (PSNR) [7]. Figure 6 illustrates the reconstruction error of an autoencoder in a toy example using the MSE measure. A sketch of the filtering step is given after Fig. 6.

Fig. 5. Detector+Autoencoder system

Fig. 6. Reconstruction error comparison: the more the data differ from the training set, the higher the error. (a) Autoencoder training data from the function \(y = 1 + 0.05x + \sin(x)/x + 0.2r\), where \(r\) is a normally distributed random number. (b) Reconstructed data from \(y = 1 + 0.05x + \sin(x)/x + 0.3r\), with \(MSE = 0.533\). (c) Reconstructed data from \(y = 1 + 0.05x + \sin(x)/x + 0.8r\), with \(MSE = 1.099\).
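A minimal sketch of these measures and the filtering step, assuming detector crops normalized to [0, 1] and a Keras-style autoencoder (the function names are ours; the MSE threshold shown is the one later used in Fig. 11):

```python
import numpy as np

def mse(original, reconstructed):
    """Mean squared error between an image and its reconstruction."""
    return np.mean((original - reconstructed) ** 2)

def psnr(original, reconstructed, max_val=1.0):
    """Peak signal-to-noise ratio in dB. Higher means a better
    reconstruction, so its comparison is inverted w.r.t. MSE."""
    err = mse(original, reconstructed)
    return float("inf") if err == 0 else 10.0 * np.log10(max_val ** 2 / err)

def is_typical_false_positive(crop, autoencoder, mse_thr=0.0044):
    """Flag a detection as a typical FP of the scenario when the
    autoencoder reconstructs it with an error below the threshold."""
    recon = autoencoder.predict(crop[np.newaxis, ...])[0]
    return mse(crop, recon) <= mse_thr
```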

The autoencoder structures used are shown in Fig. 7. In both cases, the input is a \(64 \times 64\) image with 3 channels. Their compressive paths consist of 1 or 6 convolutional and max-pooling layers; likewise, the reconstruction paths have 1 or 6 convolutional and up-sampling layers. A minimal code sketch is given after Fig. 7.

Fig. 7. Autoencoder architectures used
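As a rough illustration, a minimal Keras sketch of the deeper of the two structures (filter counts and activations are our assumptions; the exact configuration is the one in Fig. 7):

```python
from tensorflow.keras import layers, models

def build_autoencoder(depth=6, input_shape=(64, 64, 3), base_filters=16):
    """Convolutional autoencoder: `depth` conv + max-pooling stages
    followed by `depth` conv + up-sampling stages."""
    inp = layers.Input(shape=input_shape)
    x = inp
    for _ in range(depth):  # encoder path: 64 -> 32 -> ... -> 1
        x = layers.Conv2D(base_filters, 3, activation="relu", padding="same")(x)
        x = layers.MaxPooling2D(2, padding="same")(x)
    for _ in range(depth):  # decoder path: 1 -> 2 -> ... -> 64
        x = layers.Conv2D(base_filters, 3, activation="relu", padding="same")(x)
        x = layers.UpSampling2D(2)(x)
    out = layers.Conv2D(3, 3, activation="sigmoid", padding="same")(x)
    return models.Model(inp, out)

# Trained only on false-positive crops, so it learns to reconstruct
# the typical FPs of the scenario and nothing else.
autoencoder = build_autoencoder()
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(fp_crops, fp_crops, epochs=500, batch_size=32)
```

The shallower variant would use a single conv/pool and conv/upsample stage instead of six.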

The dataset used is composed of 2100 images from the synthetic scenario. Of them, 1200 images were used to generate the false positives with which the autoencoder was trained, and the remaining 900 to validate it.

5 Results

The trained Faster R-CNN model obtained an mAP of 79.33%. Its training process took around 2 days to complete 62 epochs on two NVIDIA Quadro M4000 cards, using Keras with a TensorFlow backend and CUDA 8.0 on a PC running Ubuntu 14.04 LTS.

We applied both the handgun detector alone and the detector+autoencoder approach to a test dataset composed of 900 images from the synthetic scenario, containing 808 instances of the object. The histograms of the reconstruction errors of the autoencoders, per error measure, are depicted in Figs. 8 and 9. The best separation of the FPs from the TPs was obtained with the larger autoencoder structure when using MSE and with the smaller one when using PSNR. Although the TPs overlap the FPs in both cases, the first part of each histogram, which contains no TPs, holds 26.5% of all the FPs when MSE is used and 37.9% when PSNR is used. This means that those FPs can be filtered by the autoencoder by selecting a threshold on the reconstruction error. Furthermore, comparing the image reconstructions of the FPs and the TPs shows that the FPs are reconstructed better than the TPs (see Fig. 10).

Fig. 8. Histogram of the reconstruction errors using MSE (y-axis uses logarithmic scale). FP: false positives of the detector, TP: true positives of the detector.

Fig. 9. Histogram of the reconstruction errors using PSNR (y-axis uses logarithmic scale). FP: false positives of the detector, TP: true positives of the detector.

Fig. 10. Reconstruction made by the deeper autoencoder of a TP and an FP of the detector. (a) TP, (d) FP; (b) and (e) are the reconstructions; (c) and (f) are the differences between the original image and the reconstruction.

In addition, the corresponding precision-recall curves were obtained. Considering the possible outputs of the detector, where TP and TN represent the numbers of true positives and true negatives and FP and FN stand for the numbers of false positives and false negatives, the precision (p) and recall (r) values are calculated as shown in Eq. 1 [3].

$$\begin{aligned} p = \frac{TP}{TP+FP} \qquad r = \frac{TP}{TP+FN} \end{aligned}$$
(1)

The experimental results show a reduction in the number of false positives while maintaining the detection capabilities (Figs. 11 and 12). The figures depict the precision-recall curves of both the detector and the detector+autoencoder approaches. These curves were obtained by varying the detector confidence threshold while fixing a threshold for the autoencoder. Compared at the same recall values, they show a maximum increase in precision of 0.015 when the autoencoder is used with MSE (the maximum distance between precisions without displacing the recall values to the left) and a maximum increase in precision of 0.020 when the autoencoder is used with PSNR (threshold = 77.24).
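These curves can be reproduced by sweeping the detector confidence while keeping the autoencoder threshold fixed; a sketch, under the assumption that each detection is recorded as a (confidence, is_true_positive, reconstruction_error) triple (with this test set, `n_positives` would be 808):

```python
def precision_recall_curve(detections, n_positives, ae_threshold=None):
    """Compute (precision, recall) pairs per Eq. 1, sweeping the
    detector confidence threshold. When `ae_threshold` is set,
    detections reconstructed with an error below it are filtered
    out as typical false positives before computing the curve.
    """
    if ae_threshold is not None:
        detections = [d for d in detections if d[2] > ae_threshold]
    curve = []
    for thr in sorted({c for c, _, _ in detections}):
        kept = [is_tp for c, is_tp, _ in detections if c >= thr]
        tp = sum(kept)
        fp = len(kept) - tp
        p = tp / (tp + fp) if tp + fp else 1.0  # p = TP / (TP + FP)
        r = tp / n_positives                    # r = TP / (TP + FN)
        curve.append((p, r))
    return curve
```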

Fig. 11. Precision-recall curves using MSE. Detector in blue, Detector+Autoencoder in green. Best viewed in color. MSE threshold = 0.0044.

Fig. 12. Precision-recall curves using PSNR. Detector in blue, Detector+Autoencoder in green. Best viewed in color. Thresholds used, from top to bottom: 78.5, 78 and 77.24.

It is also worth noting that, although the results with the shallower autoencoder architecture and the PSNR error measure are superior to those obtained with the deeper architecture and the MSE error measure, the reconstructed images are noisier (see Fig. 13).

Fig. 13. Reconstruction with the two autoencoder architectures. (a) Original image of an FP of the detector. (b) Reconstruction with the deeper autoencoder architecture. (c) Reconstruction with the shallower autoencoder architecture.

With respect to computational times, training the deep autoencoders took only 45 min to complete 500 epochs on an NVIDIA GTX 1060 Max-Q card, using Keras with a TensorFlow backend and CUDA 9.0 on a Windows 10 PC.

6 Conclusions

This work focuses on reducing the number of false alarms when a surveillance application runs in a new particular scenario. A synthetic scenario was generated with the game engine Unreal Engine 4, resembling a surveillance camera inside a high-school hall where people are walking by. Images from this scenario were used as input to a pretrained handgun detector. Its false positive detections were then used to train a deep autoencoder that models them and acts as a filter, removing those false positives.

The autoencoder has been shown to reduce the number of FPs by up to 37.9% in combination with the PSNR error measure while maintaining the same capability of detecting handguns. Note that the number of FPs and the detection ability of the detector form a trade-off that should always be considered. The proposed detector+autoencoder approach helps keep a good balance between them, since over the first range of autoencoder thresholds it is possible to filter FPs without affecting detections.

Overall, our approach can be used with generic detectors (e.g. a generic handgun detector) and with different particular scenarios (down to the level of individual cameras which, depending on the point of view, lighting, etc., will produce different false positives). Thus, in practice we would only need the generic detector and one trained autoencoder per camera feed.

As future work, it would be useful to consider other detection architectures to check whether they influence the autoencoder's performance. In addition, other error metrics could be used to set the autoencoder threshold.