
1 Introduction

Visual object tracking is an important task in computer vision with a variety of applications, such as human-computer interaction, security and surveillance, and traffic control [26, 27]. Although many efforts have been made in the past decades, object tracking is still a challenging task, because tracking scenarios contain various unfavorable factors that no single tracker can perfectly handle. However, we observe that different trackers usually carry different and complementary information. For example, a tracker built on features from the later layers of deep convolutional neural networks (CNNs) captures strong semantic information, while a tracker built on features from the earlier layers of CNNs captures more spatial details [17]. Therefore, exploiting the complementary information among different trackers provides a useful framework for dealing with these challenges [18, 25].

Existing multi-tracker based methods can be divided into two kinds. One kind uses fixed weights [13, 23] or an attention model [7] to fuse different trackers. However, such fusion requires most of the base trackers to work correctly, which is difficult to guarantee in complex tracking scenarios. The other kind uses a specific criterion to select the best tracker, such as entropy minimization [29], the consistency degree among different trackers [18, 25], or the peak value of a response map [1]. However, these selection strategies lack foresight: they only evaluate the performance of the trackers in the current frame, without considering their performance in subsequent frames. Moreover, these manually designed criteria have limited representation capability and cannot take all the impact factors into account. Thus, such methods do not take full advantage of the complementary information among base trackers, and their results are suboptimal.

Recently, the advantage of model-free deep reinforcement learning (DRL) has been demonstrated on a range of challenging decision-making tasks [19]. DRL aims to learn a policy that maximizes the expected reward by interacting with the environment. This characteristic also makes DRL well suited to the task of tracker selection.

In this paper, we formulate the multi-tracker tracking problem as a decision-making task, in which an expert (called the agent in DRL) is trained by DRL to select the best base tracker to process the current frame. Our framework consists of a tracker pool part and an expert part. The tracker pool part comprises a series of base trackers and acts as the environment, while the expert part is a deep network and acts as the agent, as illustrated in Fig. 1. The tracker pool regresses the input image patch to a series of response maps. These response maps are then fed into the expert network to estimate their values. Finally, the response map with the highest value is used to locate the target and to update the base trackers for the next frame. The contributions of our method can be summarized as three-fold:

  1. We formulate the multi-tracker tracking problem as a decision-making task in which the tracker selection is executed by a deep expert. This impels the expert to consider the long-term expected performance of a tracker when making decisions, rather than only its immediate performance, which makes the results more reliable.

  2. We train the deep expert offline by combining supervised learning and deep reinforcement learning to estimate the reliability of the base trackers. Benefiting from CNNs, the expert has strong expressive ability to describe the complex relationship between a response map and the reliability of the corresponding tracker.

  3. The deep expert guides the base trackers to update themselves adaptively during online tracking, so as to capture changes in object appearance and prevent corruption.

The results on OTB-2013 [26], OTB-100 [27] and VOT2017 [15] benchmarks show that the proposed method takes full advantage of the complementary information among base trackers and outperforms the existing multi-tracker based methods.

Fig. 1. The framework of the proposed method. The tracker pool consists of a series of correlation filter based trackers built in different feature spaces. The expert network observes the state offered by the base trackers and outputs an action to select the best tracker.

2 Related Work

In this section, we briefly review the related works from two aspects: DRL based methods and multi-tracker based methods.

2.1 DRL Based Methods

Tracking an object involves many decisions about the selection of tracker components, and such decision-making can be handled by reinforcement learning. Yun et al. [28] focused on the motion model of the tracker and proposed ADNet, in which a deep reinforcement network iteratively generates progressively more precise candidates. Huang et al. [14] focused on the feature extractor and proposed an early-stopping tracker, in which an agent is trained by reinforcement learning to select cheap features for easy frames and expensive deep features for challenging frames. In [6], an agent is trained to choose the best template for the tracker from a template pool. Supancic et al. [24] focused on model update and formulated tracking as a partially observable decision-making process, in which the tracker must decide whether to update, to keep the current model, or to reinitialize.

Different from the above methods, the proposed method in this paper focuses on the post-processor ensemble.

2.2 Multi-tracker Based Methods

Zhang et al. [29] introduced a multi-expert tracking framework (MEEM), in which an expert committee is constituted by a discriminative tracker and its former snapshots, and the best expert is selected according to the minimum entropy criterion to restore the tracking when there is a disagreement among the experts. Bai et al. [1] proposed a heterogeneous cue fusion method, where multiple templates, multiple features and multiple scales are integrated to further improve the discriminative ability of the filters. In contrast, Wang et al. [25] proposed a multi-cue tracker that integrates multiple features at the decision level. Nam et al. [21] proposed to deal with multi-modality in target appearances by modeling multiple CNNs in a tree structure.

However, the performance of the above methods is suboptimal because of their handcrafted fusion strategies. In this paper, we adopt a deep reinforcement network to dynamically select base trackers, which maximizes the contribution of each tracker.

3 Method

3.1 Definition of Expert Network

The goal of the agent in RL is to learn a policy function \(\pi : \mathcal {S} \rightarrow \mathcal {A}\) to select the best tracker, where \(\mathcal {S}\) is the state space and \(\mathcal {A}\) is the action space. In our case, the action to select the base tracker is not output directly by the expert. Instead, we exploit the expert to estimate the value of each base tracker according to the value function \(V: \mathcal {S} \rightarrow \mathcal {V}\) it learns, where \(\mathcal {V}\) is the value space. Then the action to select the best tracker can be defined as

$$\begin{aligned} a = \mathop {\arg \max }\limits _{i} V(\mathbf {s}^{(i)}), \end{aligned}$$
(1)

where \(\mathbf {s}^{(i)} \in \{\mathbf {s}^{(1)}, \mathbf {s}^{(2)},..., \mathbf {s}^{(m)}\}\) is the state of the \(i\)-th base tracker.
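For concreteness, the selection step in (1) can be sketched as follows, assuming the expert network is available as a PyTorch module; the function name and tensor layout are illustrative, not details taken from the paper.

```python
import torch

def select_tracker(expert_net, states):
    """Pick the base tracker whose response map the expert values highest.

    states: (m, C, H, W) tensor, one preprocessed response-map state per
            base tracker (the preprocessing layout is an assumption).
    Returns the index of the selected tracker and its estimated value.
    """
    with torch.no_grad():
        values = expert_net(states).squeeze(-1)   # (m,) value estimates V(s^(i))
    best = int(torch.argmax(values).item())       # action a in Eq. (1)
    return best, values[best].item()
```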

State. The state \(\mathbf {s}\) in our framework is a single response map of a CF-based tracker. Because the target output of a CF-based tracker is a Gaussian distribution, the tracker is more reliable if the output response map is closer to a Gaussian distribution, i.e., it has a high peak and steep slopes, as illustrated in Fig. 2a. The Peak-to-Sidelobe Ratio (PSR) used in previous work for adaptive tracking [4] and fusion tracking [25] is also based on this property. However, the calculation of PSR only involves the peak value, mean value, and variance of the response map.
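As a point of reference, the hand-crafted PSR measure mentioned above can be computed from a response map as in the minimal sketch below; the sidelobe exclusion window is an assumed detail, not taken from [4] or [25].

```python
import numpy as np

def psr(response, exclude=5):
    """Peak-to-Sidelobe Ratio of a 2-D response map (hand-crafted baseline).

    The sidelobe is the map with a small window around the peak removed;
    the window radius `exclude` is an assumption for illustration.
    """
    r0, c0 = np.unravel_index(np.argmax(response), response.shape)
    peak = response[r0, c0]
    mask = np.ones_like(response, dtype=bool)
    mask[max(0, r0 - exclude):r0 + exclude + 1,
         max(0, c0 - exclude):c0 + exclude + 1] = False
    sidelobe = response[mask]
    return (peak - sidelobe.mean()) / (sidelobe.std() + 1e-8)
```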

Fig. 2. Relationship between the response map, the estimated value and the Intersection-over-Union (IoU). (a) A correct sample: a high estimated value and a high IoU occur for a response map with a Gaussian-like distribution. (b) A noisy sample: a high IoU occurs for a response map with an irregular distribution, but the estimated value remains correct.

Value Function. Different from the general reinforcement learning methods which estimate a state-action value function (Q-function) to measure the value of executing a given action in a given state, we only estimate a state value function (Value function) to measure the value of a given state. Value function under a policy \(\pi \) is defined formally as

$$\begin{aligned} V_\pi (\mathbf {s})=\mathbb {E}_\pi \left[ \sum _{k=0}^{\infty }{\gamma ^kr_{t+k+1}}\Biggl |\mathbf {s}_t=\mathbf {s}\right] , \end{aligned}$$
(2)

where r is the reward return from the environment and \(\gamma \in [0, 1]\) is a discounting rate. Because the reward to the agent depends on the action it takes, the value function is defined as the expected discounted return with respect to the policy.

The state space in our framework is enormous, which means we cannot expect to find the optimal value function using finite time and data. The usual way to solve this problem is to find a good approximate solution. The value function in (2) can be approximated by a deep network with parameter \(\theta \), which is called the expert network in this paper.

$$\begin{aligned} V_\pi (\mathbf {s}) \approx V(\mathbf {s}| {{\varvec{\theta }}}). \end{aligned}$$
(3)

The expert network can be solved by the gradient descent methodology, which will be described in Sect. 3.2.

Reward Function. The reward function \(r(\mathbf {s},a,\mathbf {s}')\) represents the consequences of the action a. The goal of the agent is to maximize the cumulative reward it receives. Since each state has a ground-truth and a predicted box, the reward function can be defined frame by frame as

$$\begin{aligned} r(\mathbf {s},a,\mathbf {s}')=\left\{ \begin{aligned} 1,&\quad IoU(\mathbf {b},\mathbf {g})\ge 0.7 \\ 0,&\quad 0.4\le IoU(\mathbf {b},\mathbf {g})<0.7 \\ -1,&\quad IoU(\mathbf {b},\mathbf {g})<0.4 \end{aligned}\right. , \end{aligned}$$
(4)

where \(IoU(\mathbf {b},\mathbf {g}) = area(\mathbf {b}\cap \mathbf {g})/area(\mathbf {b} \cup \mathbf {g})\).
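A direct transcription of the reward in (4) is given below; the bounding boxes are assumed to be in (x, y, w, h) format, which is an illustrative choice rather than a detail specified in the paper.

```python
def reward(pred_box, gt_box):
    """Frame-wise reward of Eq. (4); boxes are (x, y, w, h) tuples (assumed format)."""
    x1 = max(pred_box[0], gt_box[0])
    y1 = max(pred_box[1], gt_box[1])
    x2 = min(pred_box[0] + pred_box[2], gt_box[0] + gt_box[2])
    y2 = min(pred_box[1] + pred_box[3], gt_box[1] + gt_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = pred_box[2] * pred_box[3] + gt_box[2] * gt_box[3] - inter
    iou = inter / union if union > 0 else 0.0
    if iou >= 0.7:
        return 1
    if iou >= 0.4:
        return 0
    return -1
```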

3.2 Offline Training of Expert Network

The expert network is first pre-trained by supervised learning to obtain a reliable initialization. Then DRL is used for the final training.

Pre-training. It is widely known that deep reinforcement learning networks are hard to converge due to the large state and action spaces. The usual way to overcome this is to assign a reliable initialization to the deep network before training it with reinforcement learning [28]. Therefore, before applying reinforcement learning, we use supervised learning to pre-train our expert network offline. The output of our expert network can be approximated by the IoU between the predicted bounding box and the ground truth. Thus, the expert network can be pre-trained by minimizing the following loss,

$$\begin{aligned} L_{pre}({{\varvec{\theta }}}) = \frac{1}{N}\sum _{i=1}^{N}\left( V(\mathbf {s}_i|{{\varvec{\theta }}}) - y_i\right) ^2, \end{aligned}$$
(5)

where N is the number of training samples in a mini-batch and \(y_i\) is the IoU between the bounding box predicted from the input response map and the ground truth.
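Under this mean-squared-error reading of (5), one supervised pre-training step could look like the following PyTorch sketch; the function and tensor names are illustrative.

```python
import torch.nn.functional as F

def pretrain_step(expert_net, optimizer, states, ious):
    """One supervised pre-training step for the loss in Eq. (5).

    states: (N, C, H, W) batch of preprocessed response maps.
    ious:   (N,) tensor of IoU targets for the boxes predicted from those maps.
    """
    values = expert_net(states).squeeze(-1)   # (N,) value estimates
    loss = F.mse_loss(values, ious)           # squared error averaged over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```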

Note that the network trained by (5) can only output a coarse estimate of the state value, not an accurate one, for two reasons. One is that there is some noise in the correspondence between the IoU and the response map. For example, some response maps do not have a Gaussian-like distribution but have a high IoU, as shown in Fig. 2b. The other is that the IoU only reflects the result of the current frame, while the value function considers not only the current return but also the future return. This mismatch makes the IoU fail to provide accurate guidance for network training. These two drawbacks of supervised learning can be well addressed by reinforcement learning.

Training with Deep Reinforcement Learning. After the supervised pre-training, we train our network with DRL. We use a temporal-difference based reinforcement learning method to train our expert network, which uses the current value estimate v instead of the true value \(v_\pi \) to formulate the target as \(r + \gamma v(\mathbf {s}')\). We adopt the strategies in [19], using a separate network to generate the targets and an experience pool to remove data correlation. After randomly sampling N pairs of \((\mathbf {s}_i, a_i, r_i, \mathbf {s}_i')\) samples from the experience pool, the goal of the training process is to minimize the following loss

$$\begin{aligned} L({{\varvec{\theta }}}) = \frac{1}{N}\sum _{i=1}^{N}\left( r_i + \gamma V'(\mathbf {s}_i'|{{\varvec{\theta }}}') - V(\mathbf {s}_i|{{\varvec{\theta }}})\right) ^2, \end{aligned}$$
(6)

where \(V'\) with parameters \(\theta '\) is the target network of V. Note that several temporary states \(\{\mathbf {s}_t^{(1)},\mathbf {s}_t^{(2)},...,\mathbf {s}_t^{(m)}\}\) are generated at frame t (m is the number of base trackers). These states are evaluated by the expert network, and the state with the highest value is selected as the final state \(\mathbf {s}_t\) stored in the experience pool.
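Putting the pieces together, a temporal-difference update with a target network and soft updates, in the spirit of (6) and [19], might be sketched as follows; the function names and the placement of the soft update are assumptions.

```python
import torch
import torch.nn.functional as F

def td_update(expert_net, target_net, optimizer, batch, gamma=0.99, tau=0.001):
    """One temporal-difference update for the loss in Eq. (6).

    batch: (states, rewards, next_states) sampled from the experience pool,
           with shapes (N, C, H, W), (N,), (N, C, H, W).
    """
    states, rewards, next_states = batch
    with torch.no_grad():
        # TD target r + gamma * V'(s') computed with the target network.
        targets = rewards + gamma * target_net(next_states).squeeze(-1)
    values = expert_net(states).squeeze(-1)
    loss = F.mse_loss(values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Soft update of the target network with rate tau (0.001 in Sect. 4.1).
    with torch.no_grad():
        for p, p_t in zip(expert_net.parameters(), target_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
    return loss.item()
```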

3.3 Online Tracking

During online tracking, a search patch \(\mathbf {p}_t\) is first cropped from the current frame \(F_t\) according to the previously estimated bounding box \(\mathbf {b}_{t-1}\) and the pre-processing function \(\mathbf {p}_t=\phi (\mathbf {b}_{t-1},F_t)\). The patch \(\mathbf {p}_t\) is then fed into the correlation filter base trackers and regressed into a series of response maps \(S_t = \{\mathbf {s}_t^{(1)},\mathbf {s}_t^{(2)},...,\mathbf {s}_t^{(m)}\}\). The values of these response maps are estimated by the expert network according to (3). Next, the best tracker is selected by the action defined in (1). Finally, the scale of the bounding box is determined by the scale estimation method introduced in [9]. After processing a frame, all base trackers are updated using the same template of the new target and share the same search area (ROI) in the next frame.
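The per-frame procedure can be summarized in the following sketch; crop_search_patch, preprocess, locate_target, adaptive_learning_rate and the tracker interface are placeholders for components the paper does not spell out, so this is a structural outline rather than the exact implementation.

```python
def track_frame(frame, prev_box, trackers, expert_net):
    """One online tracking step (Sect. 3.3); helper names are illustrative."""
    patch = crop_search_patch(frame, prev_box)            # p_t = phi(b_{t-1}, F_t)
    responses = [trk.respond(patch) for trk in trackers]  # m response maps
    states = preprocess(responses)                        # expert input tensor
    idx, value = select_tracker(expert_net, states)       # action of Eq. (1)
    box = locate_target(responses[idx], prev_box)         # peak position + scale [9]
    lr = adaptive_learning_rate(value)                    # Eq. (7), see Sect. 3.4
    for trk in trackers:                                  # all trackers share the same
        trk.update(patch, box, lr)                        # new template and ROI
    return box
```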

3.4 Adaptive Updating

To avoid corruption of the tracker during online updating, we should carefully check the updated template. In our experiments, the value of the response map, which measures the quality of the tracking result, is a good adaptive update indicator. Thus, the learning rate of the tracker update can be defined as

$$\begin{aligned} C = \left( \frac{\bar{V}(\mathbf {s})-T_{min}}{T_{max}-T_{min}}\right) \cdot \eta ', \end{aligned}$$
(7)

where

$$\begin{aligned} \bar{V}(\mathbf {s})=\left\{ \begin{aligned} T_{max},&V(\mathbf {s})>T_{max} \\ T_{min},&V(\mathbf {s})<T_{min} \\ V(\mathbf {s}),&otherwise \end{aligned} \right. . \end{aligned}$$
(8)

\(\eta '\) is the learning rate of model learning in the standard DCF, and \(T_{max}\) and \(T_{min}\) are the upper and lower update boundaries, respectively.
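In code, the adaptive learning rate of (7)-(8) reduces to clipping the estimated value and rescaling it; the base DCF learning rate used below is an assumed placeholder value, not one specified in the paper.

```python
def adaptive_learning_rate(value, t_min=2.0, t_max=8.0, eta=0.015):
    """Adaptive learning rate of Eqs. (7)-(8).

    value: expert-estimated value of the selected response map.
    t_min, t_max: update boundaries (2.0 and 8.0 in Sect. 4.1).
    eta: standard DCF learning rate eta'; 0.015 is an assumed value.
    """
    clipped = min(max(value, t_min), t_max)             # Eq. (8)
    return (clipped - t_min) / (t_max - t_min) * eta    # Eq. (7)
```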

4 Experiment

4.1 Implementation

Network Architecture. Motivated by the success of lightweight deep networks in object tracking [22, 28], we build our expert network with four convolution layers and three fully-connected layers (the last one is the output layer). The convolution layers have the same structure and initial parameters as the first three layers of the VGG-M network [5]. The three fully-connected layers have 256, 128 and 1 output units, respectively. Except for the output layer, each fully-connected layer is initialized by the Xavier initializer and activated by the ReLU function.
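A rough PyTorch sketch of such a network is given below. It includes only the three VGG-M-style convolution layers named in the text, omits the VGG-M weight loading and Xavier initialization, and assumes the response-map state is resized and replicated to three channels; all of these are simplifications, not the exact architecture.

```python
import torch.nn as nn

class ExpertNet(nn.Module):
    """Simplified expert network: VGG-M-style convolutions + 256/128/1 FC head."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 512, kernel_size=3, stride=1), nn.ReLU(inplace=True),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(inplace=True),   # fc with 256 units
            nn.Linear(256, 128), nn.ReLU(inplace=True),  # fc with 128 units
            nn.Linear(128, 1),                           # output: state value V(s)
        )

    def forward(self, x):
        # x: (N, 3, H, W) response-map states (channel replication is assumed).
        return self.head(self.features(x))
```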

Tracker Pool. Because of the good performance and high tracking speed of DCFs in object tracking, we take them as our base trackers. We implement our tracker pool following the settings in [25]. In order to capture both low-level details and high-level semantic information, the tracker pool uses three types of features, \(\{Low, Middle, High\}\), where the low-level feature is the HOG feature, the middle-level feature is the conv4-4 layer of VGG-19, and the high-level feature is the conv5-4 layer of VGG-19. These three feature types are combined into \(C_3^1+C_3^2+C_3^3=7\) kinds of base trackers, as enumerated below. The hyperparameter settings of the base trackers are consistent with [25].
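The seven feature combinations can be enumerated directly, as in this small sketch:

```python
from itertools import combinations

# Feature levels used by the base trackers (HOG, VGG-19 conv4-4, VGG-19 conv5-4).
FEATURES = ["Low", "Middle", "High"]

# All non-empty subsets: C(3,1) + C(3,2) + C(3,3) = 7 base trackers.
TRACKER_POOL = [combo for k in (1, 2, 3) for combo in combinations(FEATURES, k)]
# [('Low',), ('Middle',), ('High',), ('Low', 'Middle'), ('Low', 'High'),
#  ('Middle', 'High'), ('Low', 'Middle', 'High')]
```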

Hyperparameters. To train our expert network offline, we use TC-128 [16] and UAV123 [20] as the training set, excluding videos that also appear in the test sets. We randomly choose a video clip of fewer than 400 frames for each iteration.

The probability \(\eta \) of using the expert decision is increased linearly from 0.5 to 0.9 over the first 100 thousand iterations and is then fixed. The reward discount rate \(\gamma \) is set to 0.99. The target network is soft-updated with a learning rate of 0.001, while the evaluation network is optimized by Adam with a learning rate of 0.001. The replay buffer size is \(10^5\) and the training batch size is 32. We train the expert network through 10,000 iterations of supervised learning and 250,000 iterations of reinforcement learning. In online tracking, the update boundaries \(T_{max}\) and \(T_{min}\) are set to 8.0 and 2.0, respectively. Our algorithm is implemented in Python and runs on a single Titan X GPU.
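For reference, these training and tracking settings can be collected into a single configuration; the key names below are illustrative only.

```python
CONFIG = {
    "expert_decision_prob": (0.5, 0.9),  # eta, increased linearly over 1e5 iterations
    "gamma": 0.99,                       # reward discount rate
    "target_soft_update": 0.001,         # soft-update rate of the target network
    "adam_lr": 0.001,                    # learning rate of the evaluation network
    "replay_buffer_size": 100_000,
    "batch_size": 32,
    "pretrain_iters": 10_000,            # supervised pre-training iterations
    "rl_iters": 250_000,                 # reinforcement learning iterations
    "t_max": 8.0,                        # update boundaries for Eq. (7)
    "t_min": 2.0,
    "max_clip_length": 400,              # frames per training clip
}
```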

4.2 Ablation Analysis

To verify the contributions of the different components of the proposed method, we implement and evaluate the proposed method with and without adaptive update, denoted by Ours and Ours-NU, respectively. We also compare them with the adaptive update version (MCCT) and the non-adaptive update version (MCCT-NU) of the baseline [25]. The ablation analyses are mainly conducted on the OTB-2013 and OTB-100 datasets, using the area-under-curve (AUC) of the success plot and the distance precision (DP) rate as metrics.

From the results in Table 1, we can observe that our methods significantly improve the performance of \(T_7\), the best individual tracker, which illustrates the effectiveness of our method. When using our expert network only for tracker selection, Ours-NU outperforms \(T_7\) with a gain of \(3.4\%\) in DP and \(2.4\%\) in AUC on OTB-2013, and \(2.4\%\) in DP and \(1.6\%\) in AUC on OTB-100. When using our expert network for both tracker selection and adaptive update, the performance over the best base tracker is improved by \(3.6\%\) in DP and \(3.0\%\) in AUC on OTB-2013, and by \(5.2\%\) in DP and \(3.7\%\) in AUC on OTB-100.

In comparison with the baseline, without adaptive update, Ours-NU outperforms MCCT-NU with a gain of \(1.5\%\) in AUC and \(2.3\%\) in DP on OTB-2013, and \(1.9\%\) in AUC and \(2.3\%\) in DP on OTB-100. This means that our selection method makes better use of the complementary information among base trackers than the handcrafted selection method in MCCT.

Table 1. Ablation analysis of the proposed method on the OTB-2013 [26] and OTB-100 [27] datasets. The DP (@20px) and AUC scores are used for evaluation. The best results are marked in bold.

4.3 Tracker Selection vs. Tracker Fusion

In order to compare the tracker selection and tracker fusion strategies, we implement a tracker fusion strategy in our multi-tracker framework. The fusion weight of the \(i^{th}\) tracker is defined as \(w^{(i)} = V(\mathbf {s}^{(i)}|{{\varvec{\theta }}})/\sum _i V(\mathbf {s}^{(i)}|{{\varvec{\theta }}})\), and the fused response map is obtained as \(R_f = \sum _i w^{(i)} R^{(i)}\) (see the sketch after this paragraph). We evaluate these two versions on the 11 attributes of OTB-2013 [26] in terms of the DP (@20px) and AUC scores. On the four attributes of fast motion (FM), in-plane rotation (IPR), out-of-view (OV) and background clutter (BC), tracker selection is clearly superior to tracker fusion. This is because the target changes dramatically under these attributes, which makes most base trackers drift and degrades the fusion result, as shown in Fig. 3. However, tracker fusion outperforms tracker selection on the three attributes of deformation (DEF), out-of-plane rotation (OPR) and low resolution (LR). Since most base trackers perform well under these attributes, a better performance can be obtained with the fusion strategy, as shown in Fig. 3.
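The fusion baseline compared here is a direct value-weighted sum of the base-tracker response maps; the sketch below illustrates it, with tensor layouts and names chosen for illustration only.

```python
import torch

def fuse_responses(expert_net, states, responses):
    """Value-weighted fusion baseline of Sect. 4.3.

    states:    (m, C, H, W) tensor of preprocessed response-map states.
    responses: list of m raw response maps (H x W numpy arrays).
    """
    with torch.no_grad():
        values = expert_net(states).squeeze(-1).cpu().numpy()  # V(s^(i)|theta)
    weights = values / values.sum()                            # fusion weights w^(i)
    return sum(w * r for w, r in zip(weights, responses))      # fused map R_f
```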

The above results show that the tracker fusion is suitable for simple tracking scenarios, while the tracker selection performs better in complex scenarios.

Fig. 3. The performance of base trackers on different attributes.

4.4 State-of-the-Art Comparisons

Evaluation on OTB Benchmarks. The proposed approach is compared with state-of-the-art methods, including the multi-tracker based methods MCCT [25], SCT4 [8], HDT [23], ACFN [7], MEEM [29] and BranchOut [13], and other outstanding methods ADNet [28], CCOT [12], DSST [9], SRDCFdecon [11], CF2 [17] and SiamFC [3], on both the OTB-2013 and OTB-100 benchmarks by one-pass evaluation (OPE). The results of the top ten trackers are shown in Fig. 4a and b. The comparison shows that our method outperforms the other trackers.

Figure 4a illustrates the precision and success plots on the OTB-2013 benchmark. The AUC score of the success plot of our method is 0.706, which outperforms the two best existing multi-tracker based trackers, BranchOut and MCCT, by \(0.4\%\) and \(0.5\%\), respectively. The DP (@20px) of the precision plot of our method is 0.931, which outperforms MCCT but is slightly worse than BranchOut. This is because the base trackers of BranchOut are trained end-to-end and are therefore stronger than ours.

From the precision and success plots of the OTB-100 benchmark illustrated in Fig. 4b, we can see that our method obtains the best results in both the success and precision plots. Specifically, our method achieves an AUC score of 0.688 and a precision score of 0.919, which outperforms both BranchOut (0.678/0.917) and MCCT (0.684/0.916). The results on the OTB benchmarks demonstrate the advantages of our multi-tracker selection strategy.

Fig. 4. The results on the OTB benchmarks. The numbers in the legend are the precision at 20 pixels for the precision plots and the area-under-curve score for the success plots, respectively.

Evaluation on VOT2017 Benchmark. We also compare the proposed method with MCCT [25], SiamFC [3], CCOT [12], DSST [9], MEEM [29], SRDCF [10], Staple [2] and UCT [30] on VOT2017. The expected average overlap (EAO), accuracy, and number of failures are used to evaluate these trackers, as shown in Table 2. We can see that our method obtains an EAO of 0.293, which outperforms the second best tracker, MCCT, with a gain of \(2.3\%\). Moreover, the average number of failures of our tracker is 18.29, which is the lowest among the compared trackers. Although our method is slightly worse than MCCT in accuracy, it still achieves a comparable score of 0.5. The results on the VOT2017 benchmark also demonstrate the competitive performance of the proposed method.

Table 2. A comparison with the state-of-the-art trackers on the VOT2017 dataset. The first- and second-best scores are highlighted in color. The \(\uparrow \) indicates that higher is better, while the \(\downarrow \) indicates that lower is better.

5 Conclusions

In this work, we have formulated multi-tracker tracking as a decision-making task, where an expert is trained offline by deep reinforcement learning to select the best tracker for the current frame and to guide the trackers to update. The decision-making formulation brings foresight and strong representation ability to our method, which enables it to take full advantage of the complementary information among base trackers. The experimental results demonstrate the superiority of our method.