
1 Introduction

Unmanned aerial vehicles (UAVs) are among the most important tools for obtaining information, supporting multiple tasks in both civilian and military areas such as routine observation, disaster monitoring, and battlefield detection. Among these applications, UAV object tracking is rather challenging due to various factors like illumination change, occlusion, motion blur, and texture variation [3,4,5,6]. As a result, conventional data association and temporal filters often fail under fast motion and changing object or background appearances.

In the object tracking field, generative tracking methods search for the best matched regions in successive frames as the results. Most recent generative methods, such as [13,14,15], focus on building a good representation of the target. Because a generative method models the target itself and ignores background information, it is prone to drift when the object changes violently or is occluded.

On the other hand, a popular trend in visual tracking research in recent years is the use of discriminative learning methods, such as [7, 8]. Discriminative methods can be more robust than generative methods because they explicitly separate foreground from background information. However, these methods face problems such as a lack of samples and ambiguity in boundary demarcation. In addition, many existing methods are slow and computationally expensive, which limits their practical application [9].

Though many attempts have been made to mitigate these tracking problems for general videos, none of them can be perfectly applied to UAV videos, because the targets are small due to long-distance imaging and both target and background move due to the violent motion of the airframe. Inspired by the visual attention mechanism, our work establishes a more robust feature representation by combining the saliency feature with the ordinary feature on the basis of the kernelized correlation filter (KCF).

The rest of the paper is organized as follows. Section 2 introduces our method, including details on how the saliency feature can be efficiently embedded in the KCF framework and on the relocation mechanism used to detect and correct tracking failures. Section 3 presents the extensive experiments designed to evaluate our method in comparison with other state-of-the-art methods. Finally, conclusions are drawn in Sect. 4.

2 Methodology

2.1 Tracking by Correlation Framework

Discriminative trackers [9] are considered more robust than generative trackers because the former treat visual tracking as a binary classification problem. Using a trained classifier, such trackers can distinguish foreground from background and estimate the target position among a large number of candidates.

In 2010, the correlation filter was introduced into tracking to raise efficiency when facing a large number of samples [1]. Using the circulant structure and ridge regression, Henriques et al. [7] simplified the training and testing process. The purpose of ridge regression training is to find a filter \( \omega \) that minimizes the squared error over the training samples \( x \) and their labels \( y \). The training problem is formulated as

$$ \mathop {\min }\limits_{\omega } \left\| {\phi \omega - y} \right\|^{2} + \lambda \left\| \omega \right\|^{2} $$
(1)

where \( \phi \) denotes the mapping of all circular shifts of the template \( x \). If the mapping \( \phi \) is linear, Eq. 1 can be solved directly using the DFT matrix, and the target's position can be located once the filter \( \omega \) is obtained.

The appealing property that circulant matrices are diagonalized by the discrete Fourier transform, which makes it possible to work solely with diagonal matrices, can then be used to simplify the linear regression, as expressed in Eq. 2.

$$ X = F\,diag(\hat{x}^{ * } )F^{H} $$
(2)

From Eq. 2, the kernelized ridge regression solution is

$$ \alpha = (K + \lambda I)^{ - 1} y $$
(3)

with a Gaussian kernel \( k \), where the matrix \( K_{i,j} = k(P^{i} x,P^{j} x) \) is circulant and \( P \) denotes the cyclic shift (permutation) matrix. Exploiting this structure for the fast evaluation of Eq. 3, the FFT of \( \alpha \) is calculated by:

$$ \hat{\alpha } = \frac{{\hat{y}}}{{\hat{k}^{xx} + \lambda }} $$
(4)

where the notation “^” denotes the discrete Fourier transform, \( k^{xx} \) is the first row of the circulant kernel matrix \( K \), and a test patch \( z \) is evaluated by:

$$ \hat{y} = \hat{k}^{zx} \odot \hat{\alpha } $$
(5)

where \( \odot \) denotes the element-wise product and \( \hat{y} \) is the output response for all testing patches in the frequency domain. Then we have

$$ response = \hbox{max} (ifft(\hat{y})) $$
(6)

The target position is the location with the maximal value in the response map calculated by Eq. 6. Finally, a new model is trained at the new position, and the obtained values of \( \alpha \) and \( x \) are linearly interpolated with those from the previous frame to provide the tracker with some memory.
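
To make Eqs. 4–6 concrete, the following is a minimal single-channel NumPy sketch of the training and detection steps under a Gaussian kernel; it omits the cosine window, multi-channel features, and the linear model interpolation used in a practical KCF tracker.

```python
import numpy as np

def gaussian_correlation(x, z, sigma=0.5):
    # Kernel correlation k^{xz} over all cyclic shifts, computed via the FFT;
    # x and z are 2D patches of the same size.
    cross = np.fft.ifft2(np.fft.fft2(x) * np.conj(np.fft.fft2(z))).real
    dist = (x ** 2).sum() + (z ** 2).sum() - 2.0 * cross
    return np.exp(-np.maximum(dist, 0) / (sigma ** 2 * x.size))

def train(x, y, lam=1e-4):
    # Eq. 4: alpha_hat = y_hat / (k_hat^{xx} + lambda)
    kxx = gaussian_correlation(x, x)
    return np.fft.fft2(y) / (np.fft.fft2(kxx) + lam)

def detect(alpha_hat, x, z):
    # Eqs. 5-6: response map over all cyclic shifts of the test patch z,
    # returning the peak position and peak value.
    kzx = gaussian_correlation(z, x)
    response = np.fft.ifft2(np.fft.fft2(kzx) * alpha_hat).real
    return np.unravel_index(response.argmax(), response.shape), response.max()
```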

Despite the huge success of correlation tracking, these trackers cannot robustly represent the small targets encountered in UAV applications, since the ground targets are usually “weak” and “small” and therefore poorly represented. Hence, finding a more robust feature and solving the critical drift problem in UAV applications remains a challenging problem.

2.2 Tracking by Compounded Features

Due to the weak representation of small targets, we can use a saliency method, which is considered one of the guiding mechanisms of human visual attention, to approach the approximate target position. There have been some works using visual saliency for object tracking. For example, [24] used multiple trackers to modulate the attention distribution and [25] embedded three kinds of attention to help locate the object, but both are time-consuming. In this paper, we propose a simple and efficient feature model based on image saliency, combining the saliency feature and the ordinary feature, which can be represented as:

$$ C_{o} (z) = \gamma C_{f} (z) + (1 - \gamma )C_{s} (z) $$
(7)

in which \( C_{o} (z) \) represents the compounded feature, \( C_{f} (z) \) represents the ordinary feature such as gray or HoG, \( C_{s} (z) \) represents the saliency feature, \( \gamma \) is the weight coefficient, and \( z \) denotes the candidate target. This combination of the two parts provides a more robust feature representation. The two parts are introduced in detail below.
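
As an illustration, Eq. 7 is simply a per-element weighted sum of the two feature maps. A minimal sketch in Python follows; the value of \( \gamma \) used here is illustrative, not the paper's setting.

```python
import numpy as np

def compounded_feature(ordinary_feat, saliency_feat, gamma=0.6):
    # Eq. 7: C_o = gamma * C_f + (1 - gamma) * C_s, applied element-wise to
    # feature maps of the same shape. gamma = 0.6 is only an example value.
    return gamma * np.asarray(ordinary_feat) + (1.0 - gamma) * np.asarray(saliency_feat)
```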

Ordinary Feature Representation.

Low-level features, such as intensity, color, orientation, and motion, are widely used, because fast guidance toward key image content is usually believed to be driven by bottom-up features. For simplicity and speed, we omit all cues except intensity. The color cue does not exist in gray-scale image sequences, and the computational cost of color or motion is rather high; trading a significant amount of computing time for a small performance improvement is not worthwhile, especially since our main task is tracking, where speed is a key performance metric. Therefore, an intensity map \( I(z) \) is used to model the appearance distribution.

$$ I(z) = (r(z) + g(z) + b(z))/3 $$
(8)

in which \( r(z),g(z),b(z) \) represent the three color channels, respectively.
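
A direct implementation of Eq. 8 might look as follows, assuming the candidate patch is given as an (H, W, 3) array.

```python
import numpy as np

def intensity_map(patch_rgb):
    # Eq. 8: per-pixel average of the three color channels.
    patch_rgb = np.asarray(patch_rgb, dtype=np.float64)
    return (patch_rgb[..., 0] + patch_rgb[..., 1] + patch_rgb[..., 2]) / 3.0
```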

Saliency Feature Representation.

By investigating different saliency models, we note that computational models, especially spectral models, fit the real-time requirements of UAV tracking better because of their speed. Therefore, to obtain the saliency feature representation, we adopt the image signature method [22], which provides an approach to the figure-ground separation problem using a binary, holistic image descriptor.

$$ ImageSignature(x) = sign[DCT(x)] $$
(9)
$$ \tilde{x} = IDCT[sign(DCT(x))] $$
(10)
$$ m = g \cdot [\tilde{x} \circ \tilde{x}] $$
(11)

where \( x \) represents the image, \( g \) is a Gaussian kernel, \( \cdot \) denotes the convolution operation, and \( \circ \) denotes the Hadamard product. It has been verified experimentally that this method efficiently suppresses the background and highlights the foreground.
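
The single-channel image signature of Eqs. 9–11 can be sketched as follows, assuming SciPy is available; dct2/idct2 are small helpers defined here, and the Gaussian filtering stands in for the convolution with \( g \).

```python
import numpy as np
from scipy.fftpack import dct, idct
from scipy.ndimage import gaussian_filter

def dct2(x):
    # 2D type-II DCT, applied along both axes.
    return dct(dct(x, axis=0, norm='ortho'), axis=1, norm='ortho')

def idct2(x):
    # 2D inverse DCT.
    return idct(idct(x, axis=0, norm='ortho'), axis=1, norm='ortho')

def signature_saliency(x, sigma=2.5):
    # Eqs. 9-11: keep only the sign of the DCT coefficients, reconstruct, and
    # smooth the Hadamard square to obtain the saliency map m.
    x_tilde = idct2(np.sign(dct2(np.asarray(x, dtype=np.float64))))
    return gaussian_filter(x_tilde * x_tilde, sigma)
```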

Then, by transferring the single-channel definition of the DCT signature to the quaternion DCT (QDCT) signature, we derive a saliency map using the QDCT method [23], describing the possible target candidates in the image region around the target in the last frame.

$$ m_{QDCT} (I_{Q} ) = g \cdot [\tilde{I}_{Q} \circ \tilde{I}_{Q} ] $$
(12)
$$ \tilde{I}_{Q} = IQDCT^{L} \{ sign[QDCT^{L} (I_{Q} )]\} $$
(13)

where \( I_{Q} \) is the quaternion image and \( g \) is a \( 10 \times 10 \) Gaussian kernel with \( \sigma = 2.5 \). Using the QDCT method, color information is fully incorporated in the saliency detection. This mixture of the spectral method and low-level features (only color in our method) improves the performance of the original image signature method.

$$ QDCT^{L} (p,q) = \alpha_{p}^{M} \alpha_{q}^{N} \sum\limits_{m = 0}^{M - 1} {\sum\limits_{n = 0}^{N - 1} {u_{Q} I_{Q} (m,n)\beta_{p,m}^{M} \beta_{q,n}^{N} } } $$
(14)
$$ QDCT^{R} (p,q) = \alpha_{p}^{M} \alpha_{q}^{N} \sum\limits_{m = 0}^{M - 1} {\sum\limits_{n = 0}^{N - 1} {I_{Q} (m,n)\beta_{p,m}^{M} \beta_{q,n}^{N} u_{Q} } } $$
(15)
$$ IQDCT^{L} (m,n) = \sum\limits_{p = 0}^{M - 1} {\sum\limits_{q = 0}^{N - 1} {\alpha_{p}^{M} \alpha_{q}^{N} u_{Q} C_{Q} (p,q)\beta_{p,m}^{M} \beta_{q,n}^{N} } } $$
(16)
$$ IQDCT^{R} (m,n) = \sum\limits_{p = 0}^{M - 1} {\sum\limits_{q = 0}^{N - 1} {\alpha_{p}^{M} \alpha_{q}^{N} C_{Q} (p,q)\beta_{p,m}^{M} \beta_{q,n}^{N} u_{Q} } } $$
(17)

where \( u_{Q} \) is a unit (pure) quaternion with \( u_{Q}^{2} = - 1 \) that serves as the DCT axis, \( I_{Q} \) is the \( M \times N \) quaternion image matrix, and \( C_{Q} (p,q) \) denotes the corresponding QDCT coefficients. In accordance with the definition of the traditional type-II DCT, we define \( \alpha \), \( \beta \), and \( u_{Q} \) as follows:

$$ \alpha_{p}^{M} = \begin{cases} \sqrt{1/M} & \text{for } p = 0 \\ \sqrt{2/M} & \text{for } p \ne 0 \end{cases} $$
(18)
$$ \beta_{p,m}^{M} = \cos [\frac{\pi }{M}(m + \frac{1}{2})p] $$
(19)
$$ u_{Q} = - \sqrt{1/3}\,i - \sqrt{1/3}\,j - \sqrt{1/3}\,k $$
(20)

Figure 1 shows the process of obtaining the saliency feature. In this model, the red, blue, and green channels of the candidate patch are extracted and occupy three channels of the QDCT (the remaining channel is left empty). We then obtain the saliency map of each channel. Through the QDCT and IQDCT transforms, the three channel maps are grouped into the final saliency map, which can be combined with the gray feature.

Fig. 1. Process of obtaining the saliency feature.
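
The full QDCT operates in quaternion algebra (Eqs. 12–20). As a rough, simplified stand-in for the per-channel process in Fig. 1, one could apply the single-channel signature of Eqs. 9–11 to each color channel and average the resulting maps; this is only an illustrative approximation, not the quaternion transform itself, and it reuses the signature_saliency helper sketched above.

```python
import numpy as np

def per_channel_saliency(patch_rgb):
    # Approximate the QDCT saliency map by averaging per-channel signature
    # maps; signature_saliency is the single-channel sketch given earlier.
    patch_rgb = np.asarray(patch_rgb, dtype=np.float64)
    maps = [signature_saliency(patch_rgb[..., c]) for c in range(3)]
    return np.mean(maps, axis=0)
```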

2.3 Tracking by Relocation Mechanism

As shown in the failure detection analysis of the output constraint transfer (OCT) theory [12], the output of the test image can reasonably be assumed to follow a Gaussian distribution; this assumption is transferred into a constraint condition of a Bayesian optimization problem and successfully used to detect tracking failure. Here, we adopt this Gaussian prior [12] to detect failure by checking whether \( |(y_{\hbox{max} }^{t} - \mu^{t} )/\sigma^{t} | < T_{g} \), where \( \mu^{t} \) and \( \sigma^{t} \) are the mean and standard deviation of the response calculated over all previous frames in the tracking procedure, \( y_{\hbox{max} }^{t} \) is the maximum response of the current frame, and, following [12], \( T_{g} \) is set to 0.7.
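
A minimal sketch of this failure test is given below; it simply accumulates the peak responses of previous frames and flags the current frame when its peak is not far enough (in standard-deviation units) from their mean.

```python
import numpy as np

class FailureDetector:
    def __init__(self, t_g=0.7):
        self.t_g = t_g   # threshold T_g from [12]
        self.peaks = []  # maximum responses of previous frames

    def is_failure(self, y_max):
        # Gaussian-prior test: |(y_max - mu) / sigma| < T_g signals a failure.
        failed = False
        if len(self.peaks) >= 2:
            mu = np.mean(self.peaks)
            sigma = np.std(self.peaks) + 1e-12
            failed = abs(y_max - mu) / sigma < self.t_g
        self.peaks.append(y_max)
        return failed
```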

Once a failure is detected and the drift processing mechanism is entered, we draw \( N \) samples around the target position in the last frame to relocate the target by a linear stochastic differential equation:

$$ p_{t}^{(n)} = p_{t - 1}^{(n)} + A\delta_{t}^{(n)} $$
(21)

where \( \delta_{t}^{(n)} \) is a multivariate Gaussian random variable, \( t \) denotes the \( t^{th} \) frame, \( n \) denotes the \( n^{th} \) sample, \( p \) represents the position, and \( A \) is the proportionality factor, which is updated by:

$$ A = \frac{{P(\delta_{t}^{(n)} (x))P(\delta_{t}^{(n)} (y))P(F_{t}^{(n)} )}}{{1 - P(\delta_{t}^{(n)} (x))P(\delta_{t}^{(n)} (y))P(F_{t}^{(n)} )}} $$
(22)

where \( x,y \) represent the two directions of the two-dimensional image, \( P(\delta_{t}^{(n)} ) \) is the Gaussian probability, and \( P(F_{t}^{(n)} ) \) is the likelihood calculated from the Euclidean distance between the features of the candidate targets and the target. Through Eq. 22, we can refine the candidate location and obtain a more precise tracking result.
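
The sampling-based relocation can be sketched as follows. Here feature_of is a hypothetical helper that extracts the compounded feature of the patch at a given position, and the Gaussian/likelihood weighting of Eq. 22 is simplified to selecting the candidate whose feature is closest to the target template, so this is an illustrative sketch rather than the exact update.

```python
import numpy as np

def relocate(prev_pos, target_feat, feature_of, n_samples=100, step=5.0, seed=None):
    # Eq. 21: draw candidate positions around the last position with Gaussian
    # perturbations, then score them by feature similarity (a simplification
    # of the weighting in Eq. 22).
    rng = np.random.default_rng(seed)
    candidates = np.asarray(prev_pos, dtype=np.float64) + step * rng.standard_normal((n_samples, 2))
    distances = [np.linalg.norm(np.asarray(feature_of(pos)) - np.asarray(target_feat)) for pos in candidates]
    return candidates[int(np.argmin(distances))]
```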


3 Experiments

Most visual tracking models have been tested on commonly used benchmarks [10] with datasets such as OTB50 [10] and ALOV300++ [21], which contain abundant natural scenes. However, in the field of UAV object tracking, few datasets have been proposed for testing tracking models. In 2016, a new dataset (UAV123) [11] with sequences captured from an aerial viewpoint was proposed; it contains a total of 123 video sequences and more than 110 K frames, making it the second largest object tracking dataset after ALOV300++. To evaluate the effectiveness and robustness of trackers, we selected 45 challenging sequences from the UAV123 dataset, covering typical UAV tracking problems. Following the benchmark [10], each sequence is manually tagged with 11 attributes representing challenging aspects: illumination variation (IV), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), background clutter (BC), and low resolution (LR). In addition, we compare our tracker with 10 state-of-the-art trackers, covering the correlation filter-based trackers CT [18], CSK [7], KCF [19], and OCT [12], the saliency-based tracker SPC [20], and other representative trackers such as LOT [2], ORIA [17], and DFT [16]. The experiments were run on an Intel i7 2.7 GHz (4 cores) CPU with 8 GB of RAM.

3.1 Subjective Evaluation

Figure 2 shows qualitative results comparing our tracker with the other state-of-the-art trackers on challenging sequences. In the first row, our tracker precisely tracks the wakeboard, while the conventional KCF tracker fails. The well-known OCT tracker can also relocate the target thanks to its Gaussian prior, although its bounding boxes are not as precise as ours. In the uav sequence, the target appearance is weak and small and changes severely, so most trackers drift easily due to poor robustness, whereas our tracker locates the target accurately because of the robust compounded feature and the relocation method. Our proposed tracker also works very well on other sequences, e.g., road and truck. In contrast, every other compared tracker produces false or imprecise results on at least one sequence.

Fig. 2. Illustration of some key frames.

3.2 Objective Evaluation

In Fig. 3, the overall success and precision plots generated by the benchmark toolbox are reported. These plots cover the 11 evaluated trackers. Our tracker and OCT achieve average precision rates of 90.8% and 84.7% at a location error threshold of 20 pixels, while the well-known KCF and CSK trackers achieve 74.1% and 72.4%, respectively. In terms of success rate, ours and OCT achieve 56.9% and 55.8%, respectively. We also compare with SPC, which uses a saliency prior context model; our tracker achieves a significant improvement in precision (18.4% higher) and success rate (10.3% higher). These results confirm that our method performs better than most state-of-the-art trackers.

Fig. 3. Success and precision plots.

Results on the average success rate for 8 typical attributes are summarized in Table 1, where bold fonts indicate the best tracker for each attribute among KCF, OCT, and our tracker. The statistics show that our method is effective and efficient against challenges including in-plane rotation, out-of-plane rotation, scale variation, and fast motion, which occur frequently in UAV videos. Notably, our method also outperforms KCF and OCT on low-resolution videos, with a success rate of 0.55, meaning that it can perform well even though the target is weak and small. Besides, our method maintains a tracking speed of 92 FPS, so it can realize real-time UAV object tracking.

Table 1. Success rates for different attributes

4 Conclusion

In this paper, we propose a compounded feature and a relocation method to enhance KCF for object tracking in UAV applications. We show that incorporating low-level features and the saliency feature can effectively address the tough problems in UAV tracking, i.e., low resolution and fast motion. We also show that the failure detection and relocation mechanism is well suited to the weak and small targets encountered in UAV tracking. In the experimental section, we implement our tracking framework with the compounded feature, i.e., the combination of the QDCT image signature and the gray feature. It is worth noting that the proposed method can be extended to other features and to broader tracking situations. Extensive experiments on UAV videos and comparisons on the benchmark show that the proposed method significantly improves the performance of KCF and achieves better performance than state-of-the-art trackers.