
1 Introduction

Visual object tracking is an important task in computer vision with a variety of applications, such as human-computer interaction, security and surveillance, and traffic control [26, 27]. Although many efforts have been made in the past decades, object tracking is still a challenging task, because tracking scenarios contain various unfavorable factors that no single tracker can perfectly handle. However, we observe that different trackers usually carry different and complementary information. For example, a tracker built on features from the later layers of deep convolutional neural networks (CNNs) captures strong semantic information, while a tracker built on features from the earlier layers of CNNs captures more spatial details [17]. Therefore, exploiting the complementary information among different trackers provides a useful framework for dealing with these challenges [18, 25].

Existing multi-tracker based methods can be divided into two kinds. One kind uses fixed weights [13, 23] or an attention model [7] to fuse different trackers. However, such fusion requires most of the base trackers to work correctly, which is difficult to guarantee in complex tracking scenarios. The other kind uses a specific criterion to select the best tracker, such as entropy minimization [29], the consistency degree among different trackers [18, 25], or the peak value of a response map [1]. However, these selection strategies lack foresight: they only evaluate the performance of the trackers in the current frame, without considering their performance in subsequent frames. Moreover, these manually designed criteria have limited representation capability and cannot take all the impact factors into account. Thus, such methods do not take full advantage of the complementary information among base trackers, and their results are suboptimal.

Recently, the advantage of model-free deep reinforcement learning (DRL) has been demonstrated on a range of challenging decision-making tasks [19]. DRL aims to learn a policy that maximizes the expected reward by interacting with the environment. This characteristic also makes DRL well suited to the task of tracker selection.

In this paper, we formulate the multi-tracker tracking problem as a decision-making task, in which an expert (called the agent in DRL) is trained by DRL to select the best base tracker to process the current frame. Our framework consists of a tracker pool part and an expert part. The tracker pool part comprises a series of base trackers and acts as the environment, while the expert part is a deep network and acts as the agent, as illustrated in Fig. 1. The tracker pool regresses the input image patch to a series of response maps. These response maps are then fed into the expert network to estimate their values. Finally, the response map with the highest value is used to locate the target and to update the base trackers for the next frame. The contributions of our method can be summarized as three-fold:

  1. We formulate the multi-tracker tracking problem as a decision-making task in which the tracker selection is executed by a deep expert. This impels the expert to consider the long-term expected performance of a tracker when making decisions, rather than only its immediate performance, which makes the results more reliable.

  2. We train the deep expert offline by combining supervised learning and deep reinforcement learning to estimate the reliability of the base trackers. Benefiting from CNNs, the expert has strong expressive ability to describe the complex relationship between a response map and the reliability of the corresponding tracker.

  3. The deep expert guides the base trackers to update themselves adaptively during online tracking, so as to capture changes in object appearance and prevent corruption.

The results on OTB-2013 [26], OTB-100 [27] and VOT2017 [15] benchmarks show that the proposed method takes full advantage of the complementary information among base trackers and outperforms the existing multi-tracker based methods.

Fig. 1. The framework of the proposed method. The tracker pool consists of a series of correlation filter based trackers built in different feature spaces. The expert network observes the state offered by the base trackers and outputs an action to select the best tracker.

2 Related Work

In this section, we briefly review the related works from two aspects: DRL based methods and multi-tracker based methods.

2.1 DRL Based Methods

Tracking an object involves many decisions about the selection of tracker components, and such decision-making can be handled by reinforcement learning. Yun et al. [28] focused on the motion model of the tracker and proposed ADNet, in which a deep reinforcement network iteratively generates progressively more precise candidates. Huang et al. [14] focused on the feature extractor and proposed an early-stopping tracker, in which an agent is trained by reinforcement learning to select cheap features for easy frames and expensive deep features for challenging frames. In [6], an agent is trained to choose the best template for the tracker from a template pool. Supancic et al. [24] focused on model update and formulated tracking as a partially observable decision-making process, in which the tracker must decide whether to update, to keep the current model, or to reinitialize.

Different from the above methods, the proposed method in this paper focuses on the post-processor ensemble.

2.2 Multi-tracker Based Methods

Zhang et al. [29] introduced a multi-expert tracking framework (MEEM), in which an expert committee is constituted by a discriminative tracker and its former snapshots, and the best expert is selected according to the minimum entropy criterion to restore the tracking when there is a disagreement among the experts. Bai et al. [1] proposed a heterogeneous cue fusion method, where multiple templates, multiple features and multiple scales are integrated to further improve the discriminative ability of the filters. In contrast, Wang et al. [25] proposed a multi-cue tracker that integrates multiple features at the decision level. Nam et al. [21] proposed to deal with multi-modality in target appearances by modeling multiple CNNs in a tree structure.

However, the performance of the above methods is suboptimal because of their handcrafted fusion strategies. In this paper, we adopt a deep reinforcement network to dynamically select base trackers, which maximizes the contribution of each tracker.

3 Method

3.1 Definition of Expert Network

The goal of the agent in RL is to learn a policy function \(\pi : \mathcal {S} \rightarrow \mathcal {A}\) to select the best tracker, where \(\mathcal {S}\) is the state space and \(\mathcal {A}\) is the action space. In our case, the action to select the base tracker is not output directly by the expert. Instead, we exploit the expert to estimate the value of each base tracker according to the value function \(V: \mathcal {S} \rightarrow \mathcal {V}\) it learns, where \(\mathcal {V}\) is the value space. Then the action to select the best tracker can be defined as

$$\begin{aligned} a = \mathop {\arg \max }\limits _{i} V(\mathbf {s}^{(i)}), \end{aligned}$$
(1)

where \(\mathbf {s}^{(i)} \in \{\mathbf {s}^{(1)}, \mathbf {s}^{(2)},..., \mathbf {s}^{(m)}\}\) is the state of the \(i\)-th base tracker.
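For concreteness, the selection step in (1) can be sketched as follows, assuming the expert network is available as a PyTorch module; the function name and tensor layout are illustrative, not details taken from the paper.

```python
import torch

def select_tracker(expert_net, states):
    """Pick the base tracker whose response map the expert values highest.

    states: (m, C, H, W) tensor, one preprocessed response-map state per
            base tracker (the preprocessing layout is an assumption).
    Returns the index of the selected tracker and its estimated value.
    """
    with torch.no_grad():
        values = expert_net(states).squeeze(-1)   # (m,) value estimates V(s^(i))
    best = int(torch.argmax(values).item())       # action a in Eq. (1)
    return best, values[best].item()
```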

State. The state \(\mathbf {s}\) in our framework is a single response map of a CF-based tracker. Because the target output of a CF-based tracker is a Gaussian distribution, the tracker is more reliable if the output response map is closer to a Gaussian distribution, i.e., it has a high peak and steep slopes, as illustrated in Fig. 2a. The Peak-to-Sidelobe Ratio (PSR) used in previous work for adaptive tracking [4] and fusion tracking [25] is also based on this property. However, the calculation of PSR only involves the peak value, mean value, and variance of the response map.
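As a point of reference, the hand-crafted PSR measure mentioned above can be computed from a response map as in the minimal sketch below; the sidelobe exclusion window is an assumed detail, not taken from [4] or [25].

```python
import numpy as np

def psr(response, exclude=5):
    """Peak-to-Sidelobe Ratio of a 2-D response map (hand-crafted baseline).

    The sidelobe is the map with a small window around the peak removed;
    the window radius `exclude` is an assumption for illustration.
    """
    r0, c0 = np.unravel_index(np.argmax(response), response.shape)
    peak = response[r0, c0]
    mask = np.ones_like(response, dtype=bool)
    mask[max(0, r0 - exclude):r0 + exclude + 1,
         max(0, c0 - exclude):c0 + exclude + 1] = False
    sidelobe = response[mask]
    return (peak - sidelobe.mean()) / (sidelobe.std() + 1e-8)
```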

Fig. 2. Relationship between the response map, the estimated value and the Intersection-over-Union (IoU). (a) A correct sample: a high estimated value and a high IoU occur for a response map with a Gaussian-like distribution. (b) A noisy sample: a high IoU occurs for a response map with an irregular distribution, but the estimated value remains correct.

Value Function. Different from the general reinforcement learning methods which estimate a state-action value function (Q-function) to measure the value of executing a given action in a given state, we only estimate a state value function (Value function) to measure the value of a given state. Value function under a policy \(\pi \) is defined formally as

$$\begin{aligned} V_\pi (\mathbf {s})=\mathbb {E}_\pi \left[ \sum _{k=0}^{\infty }{\gamma ^kr_{t+k+1}}\Biggl |\mathbf {s}_t=\mathbf {s}\right] , \end{aligned}$$
(2)

where r is the reward return from the environment and \(\gamma \in [0, 1]\) is a discounting rate. Because the reward to the agent depends on the action it takes, the value function is defined as the expected discounted return with respect to the policy.

The state space in our framework is enormous, which means we cannot expect to find the optimal value function using finite time and data. The usual way to solve this problem is to find a good approximate solution. The value function in (2) can be approximated by a deep network with parameter \(\theta \), which is called the expert network in this paper.

$$\begin{aligned} V_\pi (\mathbf {s}) \approx V(\mathbf {s}| {{\varvec{\theta }}}). \end{aligned}$$
(3)

The expert network can be solved by the gradient descent methodology, which will be described in Sect. 3.2.

Reward Function. The reward function \(r(\mathbf {s},a,\mathbf {s}')\) represents the consequences of the action a. The goal of the agent is to maximize the cumulative reward it receives. Since each state has a ground-truth and a predicted box, the reward function can be defined frame by frame as

$$\begin{aligned} r(\mathbf {s},a,\mathbf {s}')=\left\{ \begin{aligned} 1,&\quad IoU(\mathbf {b},\mathbf {g})\ge 0.7 \\ 0,&\quad 0.4\le IoU(\mathbf {b},\mathbf {g})<0.7 \\ -1,&\quad IoU(\mathbf {b},\mathbf {g})<0.4 \end{aligned}\right. , \end{aligned}$$
(4)

where \(IoU(\mathbf {b},\mathbf {g}) = area(\mathbf {b}\cap \mathbf {g})/area(\mathbf {b} \cup \mathbf {g})\).
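A direct transcription of the reward in (4) is given below; the bounding boxes are assumed to be in (x, y, w, h) format, which is an illustrative choice rather than a detail specified in the paper.

```python
def reward(pred_box, gt_box):
    """Frame-wise reward of Eq. (4); boxes are (x, y, w, h) tuples (assumed format)."""
    x1 = max(pred_box[0], gt_box[0])
    y1 = max(pred_box[1], gt_box[1])
    x2 = min(pred_box[0] + pred_box[2], gt_box[0] + gt_box[2])
    y2 = min(pred_box[1] + pred_box[3], gt_box[1] + gt_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = pred_box[2] * pred_box[3] + gt_box[2] * gt_box[3] - inter
    iou = inter / union if union > 0 else 0.0
    if iou >= 0.7:
        return 1
    if iou >= 0.4:
        return 0
    return -1
```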

3.2 Offline Training of Expert Network

The expert network is first pre-trained by supervised learning to obtain a reliable initialization. Then DRL is used for the final training.

Pre-training. It is widely known that deep reinforcement learning networks are hard to converge due to the large state and action spaces. The usual way to overcome this is to assign a reliable initialization to the deep network before training it with reinforcement learning [28]. Therefore, before applying reinforcement learning, we use supervised learning to pre-train our expert network offline. The output of our expert network can be approximated by the IoU between the predicted bounding box and the ground truth. Thus, the expert network can be pre-trained by minimizing the following loss,

$$\begin{aligned} L_{pre}({{\varvec{\theta }}}) = \frac{1}{N}\sum _{i=1}^{N}\left( V(\mathbf {s}_i|{{\varvec{\theta }}}) - y_i\right) ^2, \end{aligned}$$
(5)

where N is the number of training samples in a mini-batch and \(y_i\) is the IoU between the bounding box predicted from the input response map and the ground truth.
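Under this mean-squared-error reading of (5), one supervised pre-training step could look like the following PyTorch sketch; the function and tensor names are illustrative.

```python
import torch.nn.functional as F

def pretrain_step(expert_net, optimizer, states, ious):
    """One supervised pre-training step for the loss in Eq. (5).

    states: (N, C, H, W) batch of preprocessed response maps.
    ious:   (N,) tensor of IoU targets for the boxes predicted from those maps.
    """
    values = expert_net(states).squeeze(-1)   # (N,) value estimates
    loss = F.mse_loss(values, ious)           # squared error averaged over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```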

Note that the network trained by (5) can only output a coarse estimate of the state value, not an accurate one, for two reasons. One is that there is some noise in the correspondence between the IoU and the response map. For example, some response maps do not have a Gaussian-like distribution but have a high IoU, as shown in Fig. 2b. The other is that the IoU only reflects the result of the current frame, while the value function considers not only the current return but also the future return. This mismatch makes the IoU fail to provide accurate guidance for network training. These two drawbacks of supervised learning can be well addressed by reinforcement learning.

Training with Deep Reinforcement Learning. After the supervised pre-training, we train our network with DRL. We use a temporal-difference based reinforcement learning method to train our expert network, which uses the current value estimate v instead of the true value \(v_\pi \) to formulate the target as \(r + \gamma v(\mathbf {s}')\). We adopt the strategies in [19], using a separate network to generate the targets and an experience pool to remove data correlation. After randomly sampling N pairs of \((\mathbf {s}_i, a_i, r_i, \mathbf {s}_i')\) samples from the experience pool, the goal of the training process is to minimize the following loss

$$\begin{aligned} L({{\varvec{\theta }}}) = \frac{1}{N}\sum _{i=1}^{N}\left( r_i + \gamma V'(\mathbf {s}_i'|{{\varvec{\theta }}}') - V(\mathbf {s}_i|{{\varvec{\theta }}})\right) ^2, \end{aligned}$$
(6)

where \(V'\) with parameters \(\theta '\) is the target network of V. Note that several temporary states \(\{\mathbf {s}_t^{(1)},\mathbf {s}_t^{(2)},...,\mathbf {s}_t^{(m)}\}\) are generated at frame t (m is the number of base trackers). These states are evaluated by the expert network, and the state with the highest value is selected as the final state \(\mathbf {s}_t\) stored in the experience pool.
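Putting the pieces together, a temporal-difference update with a target network and soft updates, in the spirit of (6) and [19], might be sketched as follows; the function names and the placement of the soft update are assumptions.

```python
import torch
import torch.nn.functional as F

def td_update(expert_net, target_net, optimizer, batch, gamma=0.99, tau=0.001):
    """One temporal-difference update for the loss in Eq. (6).

    batch: (states, rewards, next_states) sampled from the experience pool,
           with shapes (N, C, H, W), (N,), (N, C, H, W).
    """
    states, rewards, next_states = batch
    with torch.no_grad():
        # TD target r + gamma * V'(s') computed with the target network.
        targets = rewards + gamma * target_net(next_states).squeeze(-1)
    values = expert_net(states).squeeze(-1)
    loss = F.mse_loss(values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Soft update of the target network with rate tau (0.001 in Sect. 4.1).
    with torch.no_grad():
        for p, p_t in zip(expert_net.parameters(), target_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
    return loss.item()
```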

3.3 Online Tracking

During online tracking, a search patch \(\mathbf {p}_t\) is first cropped from the current frame \(F_t\) according to the previously estimated bounding box \(\mathbf {b}_{t-1}\) and the pre-processing function \(\mathbf {p}_t=\phi (\mathbf {b}_{t-1},F_t)\). The patch \(\mathbf {p}_t\) is then fed into the correlation filter base trackers and regressed into a series of response maps \(S_t = \{\mathbf {s}_t^{(1)},\mathbf {s}_t^{(2)},...,\mathbf {s}_t^{(m)}\}\). The values of these response maps are estimated by the expert network according to (3). Next, the best tracker is selected by the action defined in (1). Finally, the scale of the bounding box is determined by the scale estimation method introduced in [9]. After processing a frame, all base trackers are updated using the same template of the new target and share the same search area (ROI) in the next frame.
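The per-frame procedure can be summarized in the following sketch; crop_search_patch, preprocess, locate_target, adaptive_learning_rate and the tracker interface are placeholders for components the paper does not spell out, so this is a structural outline rather than the exact implementation.

```python
def track_frame(frame, prev_box, trackers, expert_net):
    """One online tracking step (Sect. 3.3); helper names are illustrative."""
    patch = crop_search_patch(frame, prev_box)            # p_t = phi(b_{t-1}, F_t)
    responses = [trk.respond(patch) for trk in trackers]  # m response maps
    states = preprocess(responses)                        # expert input tensor
    idx, value = select_tracker(expert_net, states)       # action of Eq. (1)
    box = locate_target(responses[idx], prev_box)         # peak position + scale [9]
    lr = adaptive_learning_rate(value)                    # Eq. (7), see Sect. 3.4
    for trk in trackers:                                  # all trackers share the same
        trk.update(patch, box, lr)                        # new template and ROI
    return box
```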

3.4 Adaptive Updating

To avoid corruption of the tracker during online updating, we should carefully check the updated template. In our experiments, the value of the response map, which measures the quality of the tracking result, is a good adaptive update indicator. Thus, the learning rate of the tracker update can be defined as

$$\begin{aligned} C = \left( \frac{\bar{V}(\mathbf {s})-T_{min}}{T_{max}-T_{min}}\right) \cdot \eta ', \end{aligned}$$
(7)

where

$$\begin{aligned} \bar{V}(\mathbf {s})=\left\{ \begin{aligned} T_{max},&V(\mathbf {s})>T_{max} \\ T_{min},&V(\mathbf {s})<T_{min} \\ V(\mathbf {s}),&otherwise \end{aligned} \right. . \end{aligned}$$
(8)

\(\eta '\) is the learning rate of model learning in the standard DCF, and \(T_{max}\) and \(T_{min}\) are the upper and lower update boundaries, respectively.
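In code, the adaptive learning rate of (7)-(8) reduces to clipping the estimated value and rescaling it; the base DCF learning rate used below is an assumed placeholder value, not one specified in the paper.

```python
def adaptive_learning_rate(value, t_min=2.0, t_max=8.0, eta=0.015):
    """Adaptive learning rate of Eqs. (7)-(8).

    value: expert-estimated value of the selected response map.
    t_min, t_max: update boundaries (2.0 and 8.0 in Sect. 4.1).
    eta: standard DCF learning rate eta'; 0.015 is an assumed value.
    """
    clipped = min(max(value, t_min), t_max)             # Eq. (8)
    return (clipped - t_min) / (t_max - t_min) * eta    # Eq. (7)
```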

4 Experiment

4.1 Implementation

Network Architecture. Motivated by the success of lightweight deep networks in object tracking [22, 28], we build our expert network with four convolution layers and three fully-connected layers (the last one is the output layer). The convolution layers have the same structure and initial parameters as the first three layers of the VGG-M network [5]. The three fully-connected layers have 256, 128 and 1 output units, respectively. Except for the output layer, each fully-connected layer is initialized by the Xavier initializer and activated by the ReLU function.
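A rough PyTorch sketch of such a network is given below. It includes only the three VGG-M-style convolution layers named in the text, omits the VGG-M weight loading and Xavier initialization, and assumes the response-map state is resized and replicated to three channels; all of these are simplifications, not the exact architecture.

```python
import torch.nn as nn

class ExpertNet(nn.Module):
    """Simplified expert network: VGG-M-style convolutions + 256/128/1 FC head."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 512, kernel_size=3, stride=1), nn.ReLU(inplace=True),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(inplace=True),   # fc with 256 units
            nn.Linear(256, 128), nn.ReLU(inplace=True),  # fc with 128 units
            nn.Linear(128, 1),                           # output: state value V(s)
        )

    def forward(self, x):
        # x: (N, 3, H, W) response-map states (channel replication is assumed).
        return self.head(self.features(x))
```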

Tracker Pool. Because of the good performance and high tracking speed of DCFs in object tracking, we take them as our base trackers. We implement our tracker pool following the settings in [25]. In order to capture both low-level details and high-level semantic information, the tracker pool uses three types of features, \(\{Low, Middle, High\}\), where the low-level feature is the HOG feature, the middle-level feature is the conv4-4 layer of VGG-19, and the high-level feature is the conv5-4 layer of VGG-19. These three feature types are combined into \(C_3^1+C_3^2+C_3^3=7\) kinds of base trackers, as enumerated below. The hyperparameter settings of the base trackers are consistent with [25].
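The seven feature combinations can be enumerated directly, as in this small sketch:

```python
from itertools import combinations

# Feature levels used by the base trackers (HOG, VGG-19 conv4-4, VGG-19 conv5-4).
FEATURES = ["Low", "Middle", "High"]

# All non-empty subsets: C(3,1) + C(3,2) + C(3,3) = 7 base trackers.
TRACKER_POOL = [combo for k in (1, 2, 3) for combo in combinations(FEATURES, k)]
# [('Low',), ('Middle',), ('High',), ('Low', 'Middle'), ('Low', 'High'),
#  ('Middle', 'High'), ('Low', 'Middle', 'High')]
```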

Hyperparameters. To train our expert network offline, we use TC-128 [16] and UAV123 [20] as the training set, excluding videos that also appear in the test sets. We randomly choose a video clip of fewer than 400 frames for each iteration.

The probability \(\eta \) of using the expert decision is increased linearly from 0.5 to 0.9 over the first 100 thousand iterations and is then fixed. The reward discount rate \(\gamma \) is set to 0.99. The target network is soft-updated with a learning rate of 0.001, while the evaluation network is optimized by Adam with a learning rate of 0.001. The replay buffer size is \(10^5\) and the training batch size is 32. We train the expert network through 10,000 iterations of supervised learning and 250,000 iterations of reinforcement learning. In online tracking, the update boundaries \(T_{max}\) and \(T_{min}\) are set to 8.0 and 2.0, respectively. Our algorithm is implemented in Python and runs on a single Titan X GPU.
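For reference, these training and tracking settings can be collected into a single configuration; the key names below are illustrative only.

```python
CONFIG = {
    "expert_decision_prob": (0.5, 0.9),  # eta, increased linearly over 1e5 iterations
    "gamma": 0.99,                       # reward discount rate
    "target_soft_update": 0.001,         # soft-update rate of the target network
    "adam_lr": 0.001,                    # learning rate of the evaluation network
    "replay_buffer_size": 100_000,
    "batch_size": 32,
    "pretrain_iters": 10_000,            # supervised pre-training iterations
    "rl_iters": 250_000,                 # reinforcement learning iterations
    "t_max": 8.0,                        # update boundaries for Eq. (7)
    "t_min": 2.0,
    "max_clip_length": 400,              # frames per training clip
}
```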

4.2 Ablation Analysis

To verify the contributions of the different components of the proposed method, we implement and evaluate the proposed method with and without adaptive update, denoted by Ours and Ours-NU, respectively. We also compare them with the adaptive update version (MCCT) and the non-adaptive update version (MCCT-NU) of the baseline [25]. The ablation analyses are mainly conducted on the OTB-2013 and OTB-100 datasets, using the area-under-curve (AUC) of the success plot and the distance precision (DP) rate as metrics.

From the results in Table 1, we can observe that our methods significantly improve the performance of \(T_7\), the best individual tracker, which illustrates the effectiveness of our method. When using our expert network only for tracker selection, Ours-NU outperforms \(T_7\) with a gain of \(3.4\%\) in DP and \(2.4\%\) in AUC on OTB-2013, and \(2.4\%\) in DP and \(1.6\%\) in AUC on OTB-100. When using our expert network for both tracker selection and adaptive update, the performance over the best base tracker is improved by \(3.6\%\) in DP and \(3.0\%\) in AUC on OTB-2013, and by \(5.2\%\) in DP and \(3.7\%\) in AUC on OTB-100.

In comparison with the baseline, without adaptive update, Ours-NU outperforms MCCT-NU with a gain of \(1.5\%\) in AUC and \(2.3\%\) in DP on OTB-2013, and \(1.9\%\) in AUC and \(2.3\%\) in DP on OTB-100. This means that our selection method makes better use of the complementary information among base trackers than the handcrafted selection method in MCCT.

Table 1. Ablation analysis of the proposed method on the OTB-2013 [26] and OTB-100 [27] datasets. The DP (@20px) and AUC scores are used for evaluation. The best results are marked in bold.

4.3 Tracker Selection vs. Tracker Fusion

In order to compare the tracker selection and tracker fusion strategies, we implement a tracker fusion strategy in our multi-tracker framework. The fusion weight of the \(i^{th}\) tracker is defined as \(w^{(i)} = V(\mathbf {s}^{(i)}|{{\varvec{\theta }}})/\sum _i V(\mathbf {s}^{(i)}|{{\varvec{\theta }}})\), and the fused response map is obtained as \(R_f = \sum _i w^{(i)} R^{(i)}\) (see the sketch after this paragraph). We evaluate these two versions on the 11 attributes of OTB-2013 [26] in terms of the DP (@20px) and AUC scores. On the four attributes of fast motion (FM), in-plane rotation (IPR), out-of-view (OV) and background clutter (BC), tracker selection is clearly superior to tracker fusion. This is because the target changes dramatically under these attributes, which makes most base trackers drift and degrades the fusion result, as shown in Fig. 3. However, tracker fusion outperforms tracker selection on the three attributes of deformation (DEF), out-of-plane rotation (OPR) and low resolution (LR). Since most base trackers perform well under these attributes, a better performance can be obtained with the fusion strategy, as shown in Fig. 3.
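The fusion baseline compared here is a direct value-weighted sum of the base-tracker response maps; the sketch below illustrates it, with tensor layouts and names chosen for illustration only.

```python
import torch

def fuse_responses(expert_net, states, responses):
    """Value-weighted fusion baseline of Sect. 4.3.

    states:    (m, C, H, W) tensor of preprocessed response-map states.
    responses: list of m raw response maps (H x W numpy arrays).
    """
    with torch.no_grad():
        values = expert_net(states).squeeze(-1).cpu().numpy()  # V(s^(i)|theta)
    weights = values / values.sum()                            # fusion weights w^(i)
    return sum(w * r for w, r in zip(weights, responses))      # fused map R_f
```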

The above results show that the tracker fusion is suitable for simple tracking scenarios, while the tracker selection performs better in complex scenarios.

Fig. 3. The performance of base trackers on different attributes.

4.4 State-of-the-Art Comparisons

Evaluation on OTB Benchmarks. The proposed approach is compared with state-of-the-art methods, including the multi-tracker based methods MCCT [25], SCT4 [8], HDT [23], ACFN [7], MEEM [29] and BranchOut [13], and other outstanding methods ADNet [28], CCOT [12], DSST [9], SRDCFdecon [11], CF2 [17] and SiamFC [3], on both the OTB-2013 and OTB-100 benchmarks by one-pass evaluation (OPE). The results of the top ten trackers are shown in Fig. 4a and b. The comparison shows that our method outperforms the other trackers.

Figure 4a illustrates the precision and success plots on the OTB-2013 benchmark. The AUC score of the success plot of our method is 0.706, which outperforms the two best existing multi-tracker based trackers, BranchOut and MCCT, by \(0.4\%\) and \(0.5\%\), respectively. The DP (@20px) of the precision plot of our method is 0.931, which outperforms MCCT but is slightly worse than BranchOut. This is because the base trackers of BranchOut are trained end-to-end and are therefore stronger than ours.

From the precision and success plots of the OTB-100 benchmark illustrated in Fig. 4b, we can see that our method obtains the best results in both the success and precision plots. Specifically, our method achieves an AUC score of 0.688 and a precision score of 0.919, which outperforms both BranchOut (0.678/0.917) and MCCT (0.684/0.916). The results on the OTB benchmarks demonstrate the advantages of our multi-tracker selection strategy.

Fig. 4. The results on the OTB benchmarks. The numbers in the legend are the precision at 20 pixels for the precision plots and the area-under-curve score for the success plots, respectively.

Evaluation on VOT2017 Benchmark. We also compare the proposed method with MCCT [25], SiamFC [3], CCOT [12], DSST [9], MEEM [29], SRDCF [10], Staple [2] and UCT [30] on VOT2017. The expected average overlap (EAO), accuracy, and number of failures are used to evaluate these trackers, as shown in Table 2. We can see that our method obtains an EAO of 0.293, which outperforms the second best tracker, MCCT, with a gain of \(2.3\%\). Moreover, the average number of failures of our tracker is 18.29, which is the lowest among the compared trackers. Although our method is slightly worse than MCCT in accuracy, it still achieves a comparable score of 0.5. The results on the VOT2017 benchmark also demonstrate the competitive performance of the proposed method.

Table 2. A comparison with the state-of-the-art trackers on the VOT2017 dataset. The first- and second-best scores are highlighted in color. The \(\uparrow \) indicates that higher is better, while the \(\downarrow \) indicates that lower is better.

5 Conclusions

In this work, we have formulated multi-tracker tracking as a decision-making task, where an expert is trained offline by deep reinforcement learning to select the best tracker for the current frame and to guide the trackers to update. The decision-making formulation brings foresight and strong representation ability to our method, which enables it to take full advantage of the complementary information among base trackers. The experimental results demonstrate the superiority of our method.