1 Introduction

Abnormal behavior analysis in crowded scenes is an important and growing research field. Video cameras, given their ease of installation and low cost, have been widely used for monitoring indoor and outdoor areas such as buildings, parks, and stadiums. Algorithms for pose detection and action recognition for single persons or, in some cases, very low-density groups of people are treated extensively in the pattern recognition community. Nevertheless, many of these algorithms need to segment each person, which is impractical in crowded scenes due to high levels of occlusion.

Abnormal behavior situations are always associated with the scene context: a behavior considered normal in one scene may be considered abnormal in another. These specific conditions increase the difficulty of automatic analysis and require the abnormal behavior to be modeled specifically for each particular scene.

In order to build such models, many algorithms have been proposed. In [7], optical flow is used to compute interaction forces between adjacent pixels, and a bag-of-words approach is used to classify frames as normal or abnormal. In [6], dynamic textures (DT) are used to model the appearance and dynamics of normal behavior, and samples with low probability under the model are labeled as abnormal. In [8], entropy- and energy-based concepts are used as features to model the probability of finding abnormal behavior in the scene. In [11], natural language processing is used as a classification algorithm over features based on viscous fluid field concepts.

Many algorithms employ machine learning techniques for classification. A Support Vector Machine (SVM) is used in [9, 12] to classify histograms of the orientation of optical flow. A multilayer perceptron neural network is used in [14]. k-Nearest Neighbors is used in [1] to classify outlier observed trajectories as abnormal behavior. Finally, Fuzzy C-Means is used in [3, 4] to derive an unsupervised model of the crowd trajectory patterns.

Most recent algorithms employ supervised or unsupervised machine learning techniques. Supervised techniques are appropriate when all possible abnormal situations are well known and a sufficient number of video samples of both normal and abnormal situations is available. In most cases, however, supervised techniques give very limited results, since it is difficult to enumerate all possible abnormal situations and the number of samples with this type of behavior is usually very low. Detecting abnormal behavior with unsupervised algorithms can instead be seen as an outlier detection problem: a model is constructed using only samples of normal behavior, and in the test phase every sample that does not fit the model is labeled as abnormal.

In general, to construct the feature vectors used in many of the algorithms described above, a set of parameters must be defined correctly in order to achieve the performance reported by the authors. Some state-of-the-art methods are based on complex probabilistic models, which leads to high processing times. Although the processing time per frame is reported in only a few papers, it is generally high. For example, the authors of [6] report a test time of 25 s per frame on 160 \(\times \) 240 pixel images, and [13] reports a test time of 5 s per frame on videos with 320 \(\times \) 240 pixel resolution.

The main contribution of this paper is a simple but efficient method for abnormal crowd behavior detection that reduces the processing time per frame, allowing practical use.

The rest of this paper is organized as follows. Section 2 describes the proposed approach. Section 3 presents the experimental results. Section 4 presents the conclusions.

2 Proposed Method

2.1 Pre-processing

The dense optical flow of each input frame is obtained using the algorithm presented in [2]. The optical flow information \(F(x,y) = (v_x(x,y), v_y(x,y))\) can be expressed as a vector of horizontal \(v_x(x,y)\) and vertical \(v_y(x,y)\) velocity components for each image pixel \((x,y)\). The magnitude and direction of each optical flow vector are obtained using Eqs. 1 and 2, respectively.

$$\begin{aligned} m(x,y) = \sqrt{v_x(x,y)^2 + v_y(x,y)^2} \;, \end{aligned}$$
(1)
$$\begin{aligned} \alpha (x,y) = \arctan \left( \frac{v_y(x,y)}{v_x(x,y)} \right) . \end{aligned}$$
(2)

In parallel with the optical flow computation, the background segmentation algorithm presented in [10] is used to obtain a foreground binary mask \(I_f\). The foreground information will be used to reduce the number of optical flow vectors processed as shown in the next section.
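The pre-processing stage can be sketched as follows. This is a minimal illustration of Eqs. 1 and 2 combined with the foreground mask; the function names `flow_polar` and `masked_polar` and the nested-list data layout are our own simplifications, and the dense flow field of [2] and the foreground mask of [10] are assumed to be already computed. `atan2` is used rather than a plain arctangent so the full \(0^{\circ }\)–\(360^{\circ }\) direction range seen later in Fig. 1 is preserved.

```python
import math

def flow_polar(vx, vy):
    """Magnitude (Eq. 1) and direction (Eq. 2) of one flow vector.
    atan2 keeps quadrant information, mapped to [0, 360) degrees."""
    m = math.sqrt(vx * vx + vy * vy)
    alpha = math.degrees(math.atan2(vy, vx)) % 360.0
    return m, alpha

def masked_polar(flow, mask):
    """Convert only the flow vectors whose pixel is foreground (mask != 0).
    `flow[y][x]` is a (vx, vy) pair; `mask` is the binary image I_f."""
    out = []
    for y, row in enumerate(flow):
        for x, (vx, vy) in enumerate(row):
            if mask[y][x] != 0:
                out.append(flow_polar(vx, vy))
    return out
```

Restricting the conversion to foreground pixels is what reduces the number of optical flow vectors processed downstream.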

2.2 Normal Behavior Model

The normal behavior model is obtained from a sequence of N frames containing only normal behavior samples. First, each input frame is divided into smaller regions \(R_i\) of fixed size \(\widehat{w} \times \widehat{h}\). The regions overlap in the x and y directions by \(\widehat{w}/2\) and \(\widehat{h}/2\), respectively. Consequently, a total of T regions are obtained, where T is computed using

$$\begin{aligned} T = \left( \frac{2h}{\widehat{h}} - 1\right) \left( \frac{2w}{\widehat{w}} - 1\right) \,, \end{aligned}$$
(3)

where w and h are the width and the height of the input frame respectively.
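Eq. 3 can be checked with a short sketch. The helpers `region_count` and `region_origins` are our own names; the sketch assumes \(\widehat{w}\) divides w and \(\widehat{h}\) divides h, as in the fixed grids used in the experiments, and steps by half a region size to produce the 50 % overlap described above.

```python
def region_count(w, h, rw, rh):
    """Number of half-overlapping rw x rh regions in a w x h frame (Eq. 3)."""
    return (2 * h // rh - 1) * (2 * w // rw - 1)

def region_origins(w, h, rw, rh):
    """Top-left corner (x0, y0) of every region, stepping by half a region."""
    return [(x, y)
            for y in range(0, h - rh + 1, rh // 2)
            for x in range(0, w - rw + 1, rw // 2)]
```

For instance, a 320 \(\times \) 240 frame with \(\widehat{w} = \widehat{h} = 80\) yields \((2 \cdot 240/80 - 1)(2 \cdot 320/80 - 1) = 5 \cdot 7 = 35\) regions, matching the value of T used in Sect. 3.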

A matrix \(\varvec{\mathcal {X}}_i\), shown in Eq. 4, is used to store the magnitude and direction values of the optical flow vectors of all training frames within region \(R_i\).

$$\begin{aligned} \varvec{\mathcal {X}}_i = \begin{bmatrix} \varvec{\alpha }_i^1 & \varvec{m}_i^1 \\ \varvec{\alpha }_i^2 & \varvec{m}_i^2 \\ \vdots & \vdots \\ \varvec{\alpha }_i^N & \varvec{m}_i^N \end{bmatrix}\,, \end{aligned}$$
(4)

where the column vector \(\varvec{\alpha }_i^j\) contains the direction values and the column vector \(\varvec{m}_i^j\) their corresponding magnitudes within the i-th region of the j-th frame, according to Eqs. 5 and 6, respectively.

$$\begin{aligned} \varvec{\alpha }_i^j = \{\alpha (x,y) \mid I_f(x,y) \ne 0,\; (x,y) \in R_i\} \quad \forall \, j \in [1,N]\,, \end{aligned}$$
(5)
$$\begin{aligned} \varvec{m}_i^j = \{m(x,y) \mid I_f(x,y) \ne 0,\; (x,y) \in R_i\} \quad \forall \, j \in [1,N]\,, \end{aligned}$$
(6)

where \(I_f(x,y)\) is the foreground image obtained as described in the previous section.
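The accumulation of Eqs. 4–6 can be sketched as below. The function name and data layout are our own: `frames` holds, per training frame, a grid of already-converted \((\alpha , m)\) pairs; `masks` holds the corresponding foreground images \(I_f\); and `regions` lists each \(R_i\) as an (x0, y0, w, h) tuple.

```python
def build_region_samples(frames, masks, regions):
    """For each region R_i, collect the (direction, magnitude) pairs of
    every foreground pixel over all N training frames (rows of X_i)."""
    samples = [[] for _ in regions]
    for frame, mask in zip(frames, masks):
        for i, (x0, y0, rw, rh) in enumerate(regions):
            for y in range(y0, y0 + rh):
                for x in range(x0, x0 + rw):
                    if mask[y][x] != 0:  # Eq. 5-6: foreground pixels only
                        samples[i].append(frame[y][x])
    return samples
```

Each `samples[i]` is then the two-column sample matrix \(\varvec{\mathcal {X}}_i\) from which the mixture model of the next paragraphs is estimated; note that overlapping regions deliberately share pixels.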

Assuming that all the magnitude and direction values of a specific point \((x,y)\) throughout the training video can be modeled as a mixture of M 2D Gaussian distributions, the probability of a particular pair \(\varvec{x}=(\alpha (x,y),m(x,y))\) under the Gaussian mixture distribution is given by

$$\begin{aligned} P\left( \varvec{x}|{\varvec{\Theta }_\mathbf{i}}\right) = \sum _{k=1}^M\lambda _k^i\,\mathcal {N}\left( \varvec{x}|\varvec{\mu }_k^i,\varvec{\varSigma }_k^i\right) \,, \end{aligned}$$
(7)

where \({\varvec{\Theta }_\mathbf{i}} = \{\lambda _1^i,\ldots ,\lambda _M^i,\varvec{\mu }_1^i,\ldots ,\varvec{\mu }_M^i, \varvec{\varSigma }_1^i,\ldots ,\varvec{\varSigma }_M^i\}\) are the model parameter vectors, \(\lambda _k^i\) are the mixing coefficients, \(\varvec{\mu }_k^i\) and \(\varvec{\varSigma }_k^i\) are the mean vector and the covariance matrix of \(\varvec{\mathcal {X}}_i\) respectively and \(\mathcal {N}\) is the Multivariate Gaussian distribution given by

$$\begin{aligned} \mathcal {N}\left( \varvec{x}|\varvec{\mu }_k^i,\varvec{\varSigma }_k^i\right) = \frac{1}{(2\pi )^{\frac{n}{2}}\left| \varvec{\varSigma }_k^i \right| ^{\frac{1}{2}}}\exp \left( -\frac{1}{2}\left( \varvec{x} - \varvec{\mu }_k^i\right) ^T \left( \varvec{\varSigma }_k^i\right) ^{-1} \left( \varvec{x} - \varvec{\mu }_k^i\right) \right) . \end{aligned}$$
(8)

The mixture parameter vectors \({\varvec{\Theta }_\mathbf{i}}\) are obtained through the Expectation-Maximization (EM) algorithm.
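Once the EM algorithm has produced \({\varvec{\Theta }_\mathbf{i}}\), scoring a pair \(\varvec{x}\) is a direct evaluation of Eqs. 7 and 8. A minimal sketch for the 2D case (n = 2) follows; the function names are our own, the covariance inverse and determinant are written out by hand for the 2 \(\times \) 2 case, and the EM fitting itself is left to any standard implementation.

```python
import math

def gauss2d(x, mu, cov):
    """2-D multivariate normal density (Eq. 8 with n = 2).
    cov = [[a, b], [b, c]]; determinant and inverse expanded by hand."""
    a, b = cov[0]
    _, c = cov[1]
    det = a * c - b * b
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # quadratic form (x - mu)^T cov^{-1} (x - mu)
    q = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

def gmm_prob(x, weights, means, covs):
    """Eq. 7: mixture probability as the weighted sum of M components."""
    return sum(w * gauss2d(x, mu, cov)
               for w, mu, cov in zip(weights, means, covs))
```

With the identity covariance and zero mean, `gauss2d` reduces to \(1/2\pi \) at the origin, which is a convenient sanity check for the normalization of Eq. 8.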

Figure 1 shows an example region (in red), the formation of the \(\varvec{\mathcal {X}}_i\) matrix, and the contour map of the Gaussian distribution obtained from \(\varvec{\mathcal {X}}_i\), with \(i = 22\). From the distribution it can be observed that, in the example region, the main displacement directions are approximately \(10^{\circ }\), \(180^{\circ }\) and \(360^{\circ }\), and the main magnitude value is approximately 0.8.

Fig. 1. Gaussian distribution contour plots obtained from the region marked in red using all the training videos. (Color figure online)

2.3 Abnormal Detection

In the test phase a binary image \(I_b\) is obtained for each input frame. The \(I_b\) image has the same size as the input frame and its values are computed as shown in Eq. 9.

$$\begin{aligned} I_b(x,y)= {\left\{ \begin{array}{ll} 1, & \text {if } P(\varvec{x}|{\varvec{\Theta }_\mathbf{i}}) < T_h\\ 0, & \text {otherwise} \end{array}\right. } \,, \end{aligned}$$
(9)

where \(\varvec{x} = (\alpha (x,y), m(x,y))\) is the optical flow direction and magnitude at point \((x,y)\), \({\varvec{\Theta }_\mathbf{i}}\) is the parameter vector of the Gaussian mixture model of region \(R_i\), and \(T_h\) is a threshold that specifies the probability limit below which the pixel \((x,y)\) is marked as abnormal (1) rather than normal (0).

In order to improve the algorithm’s performance, a FIFO list of fixed size S is defined and filled with the latest S binary images \(I_b\). A connected component analysis is performed on each new image \(I_{b}\). If any pixel within a blob appears as abnormal in at least W images of the list, where \(W<S\), the whole blob is marked as abnormal. The list size S and the count W are user-controlled parameters that can be used for sensitivity adjustment, since a higher value of W means a longer alarm delay.
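The temporal voting step above can be sketched as follows. This is a simplified, pixel-level version under our own names and a nested-list mask layout: the connected-component (blob-level) voting described in the text is omitted for brevity, so here each pixel is confirmed individually when it was flagged in at least W of the last S masks.

```python
from collections import deque

def temporal_filter(binary_frames, S=3, W=2):
    """Sliding FIFO of the last S binary masks I_b; a pixel is confirmed
    abnormal only if it was flagged in at least W of those S frames."""
    fifo = deque(maxlen=S)  # oldest mask is dropped automatically
    confirmed = []
    for mask in binary_frames:
        fifo.append(mask)
        h, w = len(mask), len(mask[0])
        votes = [[sum(f[y][x] for f in fifo) for x in range(w)]
                 for y in range(h)]
        confirmed.append([[1 if votes[y][x] >= W else 0 for x in range(w)]
                          for y in range(h)])
    return confirmed
```

With \(S = 3\) and \(W = 2\) (the values used in Sect. 3), an isolated single-frame detection is suppressed, while a detection persisting for two consecutive frames raises the alarm.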

3 Results and Comparisons

The proposed algorithm was implemented in Qt/C++ using OpenCV 3.0 on a 2.7 GHz Intel Core i7 PC with 16 GB of RAM. The model parameters were fixed to \(M = 5\) (number of Gaussians in the mixture), \(S = 3\), and \(W = 2\). All frames, regardless of the dataset, were divided into \(T=35\) regions, which means that the \(\widehat{w}\) and \(\widehat{h}\) values depend on the input frame size. The presented method was tested on two publicly available anomaly detection datasets: UMN (Footnote 1) and UCSD (Footnote 2). Figure 2 shows a normal frame for each scenario in the UMN dataset, the abnormality detected by the proposed approach, and the performance comparison against the ground truth.

Fig. 2. Abnormal behavior detection: normal (top) and detected abnormal (bottom) situations in the UMN dataset.

Figure 3 shows six example frames with abnormal behavior detections from the UCSD dataset.

Fig. 3. Example of abnormal behavior detection in the UCSD dataset: UCSDped1 (top) and UCSDped2 (bottom).

The proposed method was compared with similar state-of-the-art algorithms, including Mixture of Dynamic Textures (MDT) [6], Mixture of Optical Flow (MPPCA) [5], Social Force [7], Social Force with MPPCA [5], and the Hierarchical Activity Approach [13]. Figure 4 shows the Receiver Operating Characteristic (ROC) curves for the proposed method and the comparison algorithms, taken from [13]. Table 1 shows the Area Under the ROC Curve (AUC) for the five comparison methods and the proposed one. Finally, Fig. 5 shows the processing time per frame for several state-of-the-art algorithms and the method proposed in this paper.

Fig. 4. Quantitative comparison of abnormal behavior detection on (a) UCSDped1 and (b) UCSDped2 against state-of-the-art algorithms.

Fig. 5. Comparison of processing time per frame with other state-of-the-art algorithms. The times shown are for the test phase on UCSDped1.

Table 1. Area under the ROC curve of the proposed method compared with the other algorithms.

4 Conclusions

This paper presented a new method for abnormal behavior detection based on optical flow and a Gaussian mixture model. The experimental results show that the proposed method achieves better performance, in both detection rate and processing time per frame, than other state-of-the-art algorithms.