
1 Introduction

For an undirected weighted graph \(G=(V, E)\), where V is the set of vertices and E is the set of edges, a clique C is a subset of vertices in V in which each pair of vertices is connected by an edge in E. The Maximum Weight Clique (MWC) problem is to find a clique C which maximizes

$$\begin{aligned} w(C) = \sum _{v_i \in C}w_V(v_i)+\sum _{v_i, v_j \in C}w_E(v_i, v_j), \end{aligned}$$
(1)

where \(w_V:V\rightarrow \mathbb {R}\) and \(w_E:E\rightarrow \mathbb {R}\) are the weight functions for the vertices and edges respectively. Solving the MWC problem efficiently enables a variety of practical applications.
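For concreteness, the clique weight in (1) can be evaluated directly from the weight functions. The following minimal Python sketch (function and variable names are ours, purely for illustration) computes \(w(C)\) for a toy graph:

```python
import itertools

def clique_weight(clique, w_v, w_e):
    """Compute w(C) in Eq. (1): the sum of vertex weights plus the
    sum of edge weights over all pairs of vertices in the clique."""
    vertex_part = sum(w_v[v] for v in clique)
    edge_part = sum(w_e[frozenset(pair)]
                    for pair in itertools.combinations(clique, 2))
    return vertex_part + edge_part

# Toy example: a triangle {0, 1, 2} with unit vertex weights.
w_v = {0: 1.0, 1: 1.0, 2: 1.0}
w_e = {frozenset({0, 1}): 0.5,
       frozenset({0, 2}): 0.2,
       frozenset({1, 2}): 0.8}
print(clique_weight({0, 1, 2}, w_v, w_e))  # 3.0 + 1.5 = 4.5
```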

In this paper, we focus on the task of common object discovery, which aims at discovering objects of the same class in an image collection. Co-localizing objects in unconstrained environments is challenging. For images in real-world applications, such as those in the PASCAL datasets [8, 9], objects of the same class may look very different due to viewpoint, occlusion, deformation, illumination, etc. There can also be considerable diversity within certain object classes, such as human beings, owing to differences in gender, age, costume, hair style, or skin color. Besides, there may be multiple common objects in the given set of images, so the definition of “common” can be ambiguous. In addition, the efficiency of the method is critical in time-sensitive applications such as object co-localization in large collections of images or video streams.

To achieve robust and efficient object co-localization, we formulate the task as a Maximum Weight Clique (MWC) problem: finding a group of objects that are most similar to each other corresponds to finding an MWC in an associated graph. The nodes of the graph correspond to the object candidates generated from the given image collection, while the weight of an edge indicates how similar two candidates are. A set of common objects is then discovered by finding the MWC in this graph, where each node in the MWC is a discovered common object across the images. The main idea of the paper is illustrated in Fig. 1.

Fig. 1. Given a set of object candidates generated from an image collection (left), our goal is to find common objects by searching for the maximum weight clique in the associated graph. Each node in the clique (right) corresponds to a discovered common object.

The main contributions of this work are as follows. (1) We address the task of object co-localization as a well-defined MWC problem in an associated graph, providing a practical and general solution for research and applications related to the MWC problem. (2) We develop a hashing based mechanism to detect the revisiting of a local optimum in the local search based MWC solver [35], which alleviates the cycling issue in the optimization process. (3) The Region Proposal Network (RPN) [25] is applied to efficiently generate object candidates, which are then re-ranked to improve robustness against background noise. (4) A Triplet Network (TN) is learned to obtain feature embeddings of the object candidates, so as to construct a reliable affinity measure between them. (5) The performance is evaluated on the PASCAL VOC 2007 image dataset [8] and the YouTube-Objects video dataset [16], where our method outperforms recent state-of-the-art methods.

2 Related Work

The problem of common object discovery has been investigated extensively in the past few years. Papazoglou et al. [22] view the task as a foreground object mining problem, where optical flow is used to estimate object motion and a Gaussian Mixture Model captures the appearance of the foreground and background. Cho et al. [6] tackle the problem with a part-based region matching method, where a probabilistic Hough transform evaluates the quality of each candidate correspondence. Joulin et al. [15] extend the method in [6] to co-localize objects in video frames, using a Frank-Wolfe algorithm to optimize the proposed quadratic programming problem. Zhang et al. [37] apply a part-based object detector and a motion-aware region detector to generate object candidates; the problem is then formulated as a joint assignment problem and the solution is refined by inferring shape likelihoods afterwards. Kwak et al. [17] also focus on localizing dominant objects in videos, applying an iterative process of detection and tracking. Li et al. [18] devise an entropy-based objective function to learn a common object detector and address the task with a Conditional Random Field (CRF) model. Wei et al. [36] perform Principal Component Analysis (PCA) on the convolutional feature maps of all the images and locate the most correlated regions across the images. Wang et al. [32] use segmentations produced by Fully Convolutional Networks (FCN) as object candidates, and then discover common objects by solving an N-partite graph matching problem.

Many of these methods explicitly or implicitly employ graph based models to interpret the task of object co-localization. Similarly, in this paper, an undirected weighted graph is first constructed over the given set of images, modeling the visual affinities between the object candidates. We find the common objects as the Maximum Weight Clique (MWC) in this graph, where each node in the clique corresponds to a detected common object across the images. The MWC problem is NP-hard, so obtaining a globally optimal solution is difficult. Generally, there are two types of algorithms for the MWC problem: exact methods such as [12, 14] and heuristic methods such as [2, 10, 21, 24, 35]. Most existing works on the MWC problem focus on heuristic approaches due to their efficiency in time and space. In this paper, our optimization algorithm adopts a simple variant of the Tabu Search (TS) heuristic to discover the MWC, and it has several features: (1) it considers the local circumstance of a vertex in each step; (2) it requires only an auxiliary Boolean array for implementation; (3) it needs no extra parameters besides the time limit.

3 Problem Formulation

Given a set of N images \(\mathcal {I}=\{I_{1},I_{2},\ldots ,I_{N}\}\), we generate a set of object candidates from all images \(\mathcal {B}=\{\varvec{b}\ |\ \varvec{b} \in \mathcal {P}(I),\ I \in \mathcal {I}\}\), where \(\mathcal {P}(I)\) is the set of object candidates extracted from image I and \(\varvec{b}\) is the bounding box of an object candidate. Suppose \(n_i\) object candidates are extracted from image \(I_i\), then \(|\mathcal {B}|=\sum _{i=1}^{N}{n_i}\) candidates will be generated from the image collection \(\mathcal {I}\) in total. We denote \(n=|\mathcal {B}|\) in the remainder of the paper. Let \(o(\varvec{b}_i)\) be the score of bounding box \(\varvec{b}_i\) containing the common object, and let \(s(\varvec{b}_i, \varvec{b}_j)\) represent the similarity between the object candidates in \(\varvec{b}_i\) and \(\varvec{b}_j\). The task of object co-localization can then be formulated as finding an optimal subset \(\mathcal {B}^* \subset \mathcal {B}\) such that

$$\begin{aligned} w(\mathcal {B}^*) = \sum _{\varvec{b}_i \in \mathcal {B}^*}{o(\varvec{b}_i)}+\sum _{\varvec{b}_i, \varvec{b}_j \in \mathcal {B^*}, \varvec{b}_i \ne \varvec{b}_j}{s(\varvec{b}_i, \varvec{b}_j)} \end{aligned}$$
(2)

is maximized, with the constraint that at most one object candidate can be selected from each image. For the reason explained in Sect. 4.2, we set \(o(\varvec{b}_i) = 0\) for all \(\varvec{b}_i\), which reduces the problem to a Maximum Edge Weight Clique problem. Nevertheless, the proposed MWC solver can optimize problems with both vertex and edge weights.

Further, we assign a label \(x_i \in \{0,1\}\) to each object candidate \(\varvec{b}_i\), where \(x_i=1\) means that the object candidate \(\varvec{b}_i\) is selected in the subset \(\mathcal {B^*}\). Thus, an indicator vector \(\varvec{x}\in \{0,1\}^{n}\) is used to identify the common objects discovered in \(\mathcal {B}\). Besides, an affinity matrix \(A \in \mathbb {R}^{n \times n}\) is constructed, where

$$\begin{aligned} \begin{aligned}&A_{ii}=o(\varvec{b}_i), \ \forall \varvec{b}_i \in \mathcal {B},\ \ \mathrm {and} \ \ A_{ij}=s(\varvec{b}_i,\ \varvec{b}_j),\ \forall \varvec{b}_i, \varvec{b}_j \in \mathcal {B}. \end{aligned} \end{aligned}$$
(3)

Here we assume the similarity metric \(s(\varvec{b}_i,\ \varvec{b}_j)\) is symmetric and non-negative, namely \(A_{ij}=A_{ji}\ge 0\). In addition, we remove the edge between object candidates \(\varvec{b}_i\) and \(\varvec{b}_j\) if they come from the same image, so that they cannot be simultaneously selected in \(\mathcal {B^*}\). The problem in (2) can then be expressed as finding an optimal indicator vector \(\varvec{x} \in \{0,1\}^{n}\) that maximizes \(\varvec{x}^{T} A \varvec{x}\). The selected nodes in \(\mathcal {B}^*\) correspond to an MWC in the constructed graph and represent the discovered set of common objects. To summarize, the overall objective function of the MWC problem can be written in matrix form as

$$\begin{aligned} \begin{aligned}&\varvec{x}^*=\underset{\varvec{x}}{{\text {argmax}}}\ \ {\varvec{x}^{T} A \varvec{x}},&\text {s.t.}\ \ \varvec{x} \in \{0,1\}^{n}. \end{aligned} \end{aligned}$$
(4)

In this way, the task of object co-localization is formulated as a Maximum Weight Clique (MWC) problem of the form (1).
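To make the formulation concrete, the sketch below (Python, with our own helper names and placeholder similarity values) builds the affinity matrix A of (3) for a toy candidate set, removes the entries between candidates from the same image, and evaluates the objective \(\varvec{x}^{T} A \varvec{x}\) of (4) for a given indicator vector:

```python
import numpy as np

def build_affinity(similarity, image_ids, objectness=None):
    """Affinity matrix of Eq. (3): A_ij = s(b_i, b_j), A_ii = o(b_i).
    Entries between candidates from the same image are set to zero,
    which removes the corresponding edges from the graph."""
    n = len(image_ids)
    A = np.array(similarity, dtype=float)
    np.fill_diagonal(A, 0.0 if objectness is None else objectness)
    for i in range(n):
        for j in range(n):
            if i != j and image_ids[i] == image_ids[j]:
                A[i, j] = 0.0
    return A

def objective(x, A):
    """Objective of Eq. (4): x^T A x for a binary indicator vector x."""
    x = np.asarray(x, dtype=float)
    return float(x @ A @ x)

# Toy example: four candidates, two per image (placeholder similarities).
sim = [[0.0, 0.9, 0.7, 0.1],
       [0.9, 0.0, 0.2, 0.3],
       [0.7, 0.2, 0.0, 0.4],
       [0.1, 0.3, 0.4, 0.0]]
image_ids = [0, 0, 1, 1]
A = build_affinity(sim, image_ids)
print(objective([1, 0, 1, 0], A))  # selects b_0 and b_2
```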

4 Graph Construction

4.1 Object Candidate Generation

The nodes in the associated graph correspond to the object candidates in all the images. We expect these candidates to cover as many foreground objects as possible, while the total number of candidates also determines the search space for the MWC. Therefore, our first priority is a proper method for extracting the object candidates. The Region Proposal Network (RPN) [25] is used in our approach to generate rectangular object candidates from each image. We take the raw RPN proposals from the intermediate stage and apply Non-Maximum Suppression (NMS) [28] to remove redundant boxes. For computational efficiency, we keep only the top-K scoring proposals from each image to construct the associated graph. We consider two different proposal scoring measures. The first one is commonly used and is based on the RPN objectness score of each object candidate. The RPN also generates a vector of class likelihoods for each object candidate, and we propose to re-rank the object candidates according to the entropy of this class distribution. Since entropy is a measure of uncertainty, it serves a similar purpose as the objectness score but tends to be more accurate in this setting. Hence we re-rank the raw RPN proposals according to the entropy and select the top-K boxes with the lowest uncertainty as object candidates in each image.
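As an illustration of the entropy-based re-ranking, a minimal sketch is given below; the array layout and the way class likelihoods are obtained from the RPN are assumptions on our part, not the authors' implementation:

```python
import numpy as np

def rerank_by_entropy(boxes, class_probs, k=20):
    """Keep the K proposals whose class distribution has the lowest
    entropy, i.e. the least uncertain ones.

    boxes:       (n, 4) array of proposal coordinates
    class_probs: (n, c) array of per-proposal class likelihoods
    """
    p = np.clip(class_probs, 1e-12, 1.0)
    p = p / p.sum(axis=1, keepdims=True)          # normalize rows
    entropy = -(p * np.log(p)).sum(axis=1)        # per-proposal entropy
    keep = np.argsort(entropy)[:k]                # lowest uncertainty first
    return boxes[keep], entropy[keep]

# Usage with random placeholder data (20 classes plus background).
boxes = np.random.rand(100, 4)
class_probs = np.random.dirichlet(np.ones(21), size=100)
top_boxes, top_entropy = rerank_by_entropy(boxes, class_probs, k=20)
```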

4.2 Common Objectness Score

For object co-localization, the underlying class of the common object is unknown in advance, so the score \(o(\varvec{b})\) of some object \(\varvec{b}\) being the common one is difficult to estimate. A possible way is to set \(o(\varvec{b})\) to the objectness score of \(\varvec{b}\), but this can be problematic when \(\varvec{b}\) indeed contains an object that is not the common one. Directly using the objectness score may therefore lead to unexpected results, as observed in [31]. Hence, we set the contribution of this score to the objective function (2) to zero, i.e.,

$$\begin{aligned} A_{ii}=o(\varvec{b}_i)=0, \forall \varvec{b}_i \in \mathcal {B}. \end{aligned}$$
(5)

In the case of object co-localization, this means we focus on the MWC problem with edge weights only. However, as shown in Sect. 5, the proposed MWC solver is generic and can be applied to other tasks where both vertex weights and edge weights are present.

4.3 Object Representation and Similarity

The edge weights in the associated graph represent the visual similarity between the selected object candidates. Thus, we need an accurate way to represent the object candidates and evaluate their similarities. In this paper, we employ the Triplet Network framework [13] to learn deep feature embeddings of the object candidates. Suppose a pre-trained Convolutional Neural Network (CNN) is selected to extract the deep features \(f(\varvec{b}; \varvec{w})\) for each object candidate \(\varvec{b} \in \mathcal {B}\), where \(\varvec{w}\) is the set of parameters of the CNN. In this framework, a set of triplets is constructed for fine-tuning the parameters \(\varvec{w}\). Each triplet consists of a reference object \(\varvec{b}_r\), a positive object \(\varvec{b}_p\) and a negative object \(\varvec{b}_n\). Namely, \(\varvec{b}_r\) and \(\varvec{b}_p\) represent a pair of similar objects, while \(\varvec{b}_r\) and \(\varvec{b}_n\) are a pair of dissimilar objects. Two objects are viewed as similar if they belong to the same category and dissimilar otherwise. The hinge loss of a triplet is then defined as

$$\begin{aligned} l(\varvec{b}_r,\varvec{b}_p,\varvec{b}_n) = \max \{0, \lambda +s(\varvec{b}_r, \varvec{b}_n)-s(\varvec{b}_r, \varvec{b}_p)\}, \end{aligned}$$
(6)

where \(\lambda \) is a margin threshold controlling how different \(s(\varvec{b}_r, \varvec{b}_n)\) and \(s(\varvec{b}_r, \varvec{b}_p)\) should be. The goal of the Triplet Network learning is to find a set of optimal parameters \(\varvec{w}\), such that the sum of the hinge loss of all triplets

$$\begin{aligned} L(\mathcal {T}) = \sum _{(\varvec{b}_r,\varvec{b}_p,\varvec{b}_n)\in \mathcal {T}}{l(\varvec{b}_r,\varvec{b}_p,\varvec{b}_n)} \end{aligned}$$
(7)

is minimized over a training set of triplets \(\mathcal {T}\). Namely, in the specified metric space, the learning process pulls similar objects closer to each other, while dissimilar objects are pushed away. In the triplet hinge loss \(l(\varvec{b}_r,\varvec{b}_p,\varvec{b}_n)\), frequently used similarity metrics include the dot-product (the linear kernel) and the Euclidean distance. However, the output ranges of these metrics are not bounded, which may invalidate the margin threshold \(\lambda \) in the loss function, as observed in [5]. More complex metrics can also be used here, such as the polynomial kernel and the Gaussian kernel (the RBF kernel), but these kernels introduce additional parameters that have to be chosen carefully. For simplicity, we define \(s(\varvec{b}_i, \varvec{b}_j)\) as the cosine similarity between two CNN feature vectors \(f(\varvec{b}_i; \varvec{w})\) and \(f(\varvec{b}_j; \varvec{w})\), namely

$$\begin{aligned} A_{ij}=s(\varvec{b}_i, \varvec{b}_j) = \frac{f(\varvec{b}_i; \varvec{w})^T f(\varvec{b}_j; \varvec{w})}{||f(\varvec{b}_i; \varvec{w}) ||||f(\varvec{b}_j; \varvec{w}) ||}, \end{aligned}$$
(8)

since it is neatly bounded and parameter-free. The parameters \(\varvec{w}\) in the overall loss function (7) can be updated via standard Stochastic Gradient Descent (SGD). An intuitive description of our triplet network is given in Fig. 2.
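A minimal NumPy sketch of the triplet hinge loss (6) with the cosine similarity of (8) is shown below; the embedding function is stubbed out with random vectors, and all names are ours rather than those of the original implementation:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity of Eq. (8) between two feature vectors."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def triplet_hinge_loss(f_r, f_p, f_n, margin=0.25):
    """Hinge loss of Eq. (6) for one (reference, positive, negative) triplet."""
    return max(0.0, margin + cosine_similarity(f_r, f_n)
                           - cosine_similarity(f_r, f_p))

def total_loss(triplets, embed, margin=0.25):
    """Sum of hinge losses over a set of triplets, Eq. (7).
    `embed` maps an object candidate to its CNN feature vector."""
    return sum(triplet_hinge_loss(embed(r), embed(p), embed(n), margin)
               for (r, p, n) in triplets)

# Toy usage with random vectors standing in for CNN embeddings.
rng = np.random.default_rng(0)
embed = lambda b: rng.standard_normal(128)
print(total_loss([("b_r", "b_p", "b_n")], embed))
```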

Fig. 2. The architecture of our triplet network. The weights of the CNN backbones are shared across the three branches. The goal is to learn a feature embedding such that similar objects are closer to each other in the metric space while dissimilar objects are pushed away.

5 The MWC Problem Solver

We use a local search based method to solve the MWC problem (4). The local search moves from one clique to another until it reaches the cutoff time, and the best clique found is kept as the solution. The pipeline of our MWC solver is summarized in Algorithm 1.

Algorithm 1. The pipeline of the proposed local search based MWC solver.

Compared to RSL and RRWL in [11], our algorithm starts from a random single-vertex clique, while they start from a random maximal clique. This is particularly useful when the run-time is restricted. Besides, while RSL and RRWL restart when a solution is revisited in the so-called first growing step, our algorithm simply restarts when a local optimum is revisited. In this way, our solver spends less time searching local areas that have already been visited intensively.

5.1 Detecting Revisiting via a Hash Table

In recent methods, the local search typically moves in a deterministic way, i.e., no randomness exists in this process. Thus, a sequence of steps from a previously visited local optimum would simply be repeated and may not improve the best clique found so far. We therefore improve such methods by introducing a cycle elimination based restart strategy, where a hash table is used to approximately detect the revisiting of a local optimum. Given a candidate solution \(\mathcal {B}_c^*\) and a prime number p, we define the hash value of \(\mathcal {B}_c^*\) as

$$\begin{aligned} hash(\mathcal {B}_c^*)=(\sum _{\varvec{b}_i \in \mathcal {B}_c^*}{2^i}) \mod p, \end{aligned}$$
(9)

where \(i\in \{1,2,\ldots ,n\}\) is the index of \(\varvec{b}_i\) in the entire object candidate set \(\mathcal {B}\). If p is large enough, the chance of collision is negligible, and p can be set according to the memory capacity of the machine. In the proposed algorithm, the revisiting of a local optimum is detected by checking whether the respective hash entry has been visited. If the local optimum was not visited before, the local search continues; otherwise, the solver restarts and searches for a better solution elsewhere.
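A minimal sketch of this revisit check (variable and function names are ours; the paper's table size is only noted in a comment) is:

```python
# The paper sets p = 10^9 + 7 (about 1 GB for a Boolean table);
# a smaller prime is used here to keep the sketch lightweight.
P = 1_000_003
visited = bytearray(P)

def clique_hash(clique_indices, p=P):
    """Hash of Eq. (9): (sum of 2^i over the selected candidate indices) mod p."""
    return sum(pow(2, i, p) for i in clique_indices) % p

def is_revisited(clique_indices):
    """Return True if this local optimum was seen before; otherwise mark it."""
    h = clique_hash(clique_indices)
    if visited[h]:
        return True
    visited[h] = 1
    return False

# Example: the first visit is new, the second is detected as a revisit.
print(is_revisited({3, 17, 42}))   # False
print(is_revisited({42, 3, 17}))   # True
```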

5.2 Scoring Functions and Candidate Nodes

Given an undirected weighted graph \(G = (V, E)\), Algorithm 1 describes our approach to finding the MWC. To begin with, we introduce some notation used in the algorithm. In the local search for the MWC, the \(\texttt {add}\) operation adds a new node to the current clique C, the \(\texttt {drop}\) operation drops an existing node from C, and the \(\texttt {swap}\) operation exchanges a node inside C with a node outside C. Each operation returns a new clique as the current solution, chosen to maximize the gain in clique weight. Let w(C) be the weight of a clique C as defined in Eq. (1). For the \(\texttt {add}\) and \(\texttt {drop}\) operations, the gain of adding or dropping a node v is computed as

$$\begin{aligned} \begin{aligned} score(v, C) = {\left\{ \begin{array}{ll} w(C\cup \{v\}) - w(C) &{} \text {if }v \not \in C;\\ w(C\backslash \{v\}) - w(C) &{} \text {if }v \in C. \end{array}\right. } \end{aligned} \end{aligned}$$
(10)

For the \(\texttt {swap}\) operation, the gain in clique weight when swapping two nodes (u, v) is

$$\begin{aligned} score(u, v, C) = w(C\backslash \{u\} \cup \{v\}) - w(C), u \in C, v \not \in C, (u, v) \not \in E. \end{aligned}$$
(11)

We denote the set of neighbors of a vertex v as \(\mathcal {N}(v) = \{u | (u,v) \in E\}\). To ensure that the local search always maintains a clique, we define two operand sets. First, for a clique C, the set of candidate nodes for the \(\texttt {add}\) operation is defined as

$$\begin{aligned} \begin{aligned} S_{add}(C) = {\left\{ \begin{array}{ll} \{v | v \not \in C, v \in \mathcal {N}(u), \forall u \in C\} &{} \text {if }|C|>0; \\ \emptyset , \text { otherwise}. \end{array}\right. } \end{aligned} \end{aligned}$$
(12)

Secondly, the set of candidate node pairs for the \(\texttt {swap}\) operation is defined as

$$\begin{aligned} \begin{aligned} S_{swap}(C) = {\left\{ \begin{array}{ll} \{(u, v) | u \in C, v \not \in C, (u, v) \not \in E, v \in \mathcal {N}(w), \forall w \in C \backslash \{u\}\} \ \text {if }|C|>1; \\ \emptyset , \text { otherwise}. \end{array}\right. } \end{aligned} \end{aligned}$$
(13)

To maximize the gain in clique weight at each step, the \(\texttt {add}\) operation adds the node \(v^*=\mathop {\mathrm {argmax}}\nolimits _v score(v, C), v \in S_{add}(C)\) to the current clique C. The \(\texttt {drop}\) operation drops the node \(v^*=\mathop {\mathrm {argmax}}\nolimits _v score(v, C), v \in C\) from the current clique C. The \(\texttt {swap}\) operation swaps the pair of nodes \((u^*, v^*)=\mathop {\mathrm {argmax}}\nolimits _{(u, v)} score(u, v, C), (u, v) \in S_{swap}(C)\).
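The sketch below (a simplified, self-contained Python version with our own names and data layout) implements the scoring functions (10)-(11) and the candidate sets (12)-(13) on a toy graph with vertex and edge weights:

```python
from itertools import combinations

def clique_weight(C, w_v, w_e):
    """w(C) as in Eq. (1)."""
    return (sum(w_v[v] for v in C) +
            sum(w_e.get(frozenset(p), 0.0) for p in combinations(C, 2)))

def score_add_drop(v, C, w_v, w_e):
    """Eq. (10): gain of adding v (if v not in C) or dropping v (if v in C)."""
    newC = C - {v} if v in C else C | {v}
    return clique_weight(newC, w_v, w_e) - clique_weight(C, w_v, w_e)

def score_swap(u, v, C, w_v, w_e):
    """Eq. (11): gain of swapping u in C with v outside C."""
    return clique_weight((C - {u}) | {v}, w_v, w_e) - clique_weight(C, w_v, w_e)

def S_add(C, adj):
    """Eq. (12): vertices outside C adjacent to every vertex of C."""
    if not C:
        return set()
    return {v for v in adj if v not in C and all(v in adj[u] for u in C)}

def S_swap(C, adj):
    """Eq. (13): pairs (u, v) with u in C, v outside C, (u, v) not an edge,
    and v adjacent to every other vertex of C."""
    if len(C) <= 1:
        return set()
    return {(u, v) for u in C for v in adj
            if v not in C and v not in adj[u]
            and all(v in adj[w] for w in C - {u})}

# Toy graph: vertices 0-3; vertex 3 is not adjacent to vertex 1.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
w_v = {v: 0.0 for v in adj}   # zero vertex weights, as in Sect. 4.2
w_e = {frozenset({0, 1}): 1.0, frozenset({0, 2}): 2.0, frozenset({1, 2}): 0.5,
       frozenset({0, 3}): 3.0, frozenset({2, 3}): 1.5}
C = {0, 1}
print(S_add(C, adj))                      # {2}
print(S_swap(C, adj))                     # {(1, 3)}
print(score_add_drop(2, C, w_v, w_e))     # 2.0 + 0.5 = 2.5
print(score_swap(1, 3, C, w_v, w_e))      # 3.0 - 1.0 = 2.0
```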

5.3 The Strong Configuration Checking Strategy

We apply the Strong Configuration Checking (SCC) strategy [35] to avoid revisiting a solution too early. The SCC strategy works as follows: after a vertex v is dropped or swapped out of a clique C, it can be added or swapped back into C only if one of its neighbors is added to C. Let \( confChange (v)\) be an indicator function of node v, where \( confChange (v) = 1\) means v is allowed to be added or swapped into the candidate solution and \( confChange (v) = 0\) means v is forbidden from being added or swapped into the candidate solution. The SCC strategy then specifies the following rules (a minimal code sketch of this bookkeeping is given after the rules):

  1. Initially, \( confChange (v)\) is set to 1 for each vertex v;
  2. When v is added, \( confChange (u)\) is set to 1 for all \(u \in \mathcal {N}(v)\);
  3. When v is dropped, \( confChange (v)\) is set to 0;
  4. When \((u, v)\in S_{swap}(C)\) are swapped, \( confChange (u)\) is set to 0.
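The following minimal sketch (our own function names, not from the LSCC implementation) maintains \( confChange \) as a dictionary and applies the four rules exactly as stated:

```python
def init_conf_change(vertices):
    """Rule 1: initially every vertex may be added or swapped in."""
    return {v: True for v in vertices}

def on_add(v, conf_change, adj):
    """Rule 2: adding v re-enables all of its neighbors."""
    for u in adj[v]:
        conf_change[u] = True

def on_drop(v, conf_change):
    """Rule 3: a dropped vertex is forbidden until a neighbor is added."""
    conf_change[v] = False

def on_swap(u, conf_change):
    """Rule 4: the vertex u swapped out of C is forbidden."""
    conf_change[u] = False

# Usage: only vertices with conf_change[v] == True are eligible
# candidates for the add/swap operations of Sect. 5.2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
cc = init_conf_change(adj)       # rule 1
on_drop(1, cc)                   # rule 3: vertex 1 is now forbidden
print(cc)                        # {0: True, 1: False, 2: True}
on_add(0, cc, adj)               # rule 2: adding 0 re-enables its neighbors
print(cc)                        # {0: True, 1: True, 2: True}
```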

6 Experiments

To evaluate the performance of our method in comparison to other approaches, experiments are conducted on the PASCAL VOC 2007 image dataset [8] and the YouTube-Objects video dataset [16]. The standard PASCAL criterion, Intersection over Union (IoU), is adopted for evaluation: a predicted bounding box \(\varvec{b}^{p}\) is correct if \(IoU(\varvec{b}^{p},\varvec{b}^{gt})=\frac{area(\varvec{b}^{p}\cap \varvec{b}^{gt})}{area(\varvec{b}^{p}\cup \varvec{b}^{gt})}>0.5\), where \(\varvec{b}^{gt}\) is a ground-truth bounding box annotation. The percentage of images with correct object localization (CorLoc) [18] is used as the evaluation protocol. Our method is denoted LSMWC, for Local Search based MWC solver.
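For reference, the IoU criterion and the CorLoc metric can be computed as in the sketch below (the box format and function names are ours):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def corloc(predictions, ground_truths, threshold=0.5):
    """Fraction of images whose predicted box overlaps any ground-truth
    annotation of that image with IoU above the threshold."""
    correct = sum(any(iou(pred, gt) > threshold for gt in gts)
                  for pred, gts in zip(predictions, ground_truths))
    return correct / len(predictions)

# Example: two images, one localized correctly.
preds = [(10, 10, 50, 50), (0, 0, 20, 20)]
gts = [[(12, 12, 48, 52)], [(100, 100, 140, 140)]]
print(corloc(preds, gts))   # 0.5
```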

6.1 Implementation Details

Our experiments are carried out on a desktop machine with two Intel(R) Core(TM) i7 CPUs (2.80 GHz) and 64 GB memory. A GeForce GTX Titan X GPU is used for training and testing the deep neural networks. The proposed MWC solver is implemented in C/C++. The deep learning frameworks Caffe and MatConvNet are used to build the Region Proposal Network and the Triplet Network. The pipeline of the system is organized in MATLAB, with some utilities written as MEX files, for efficient high-level data management and visualization. The default parameters are used to train the RPN and generate the object candidates. A threshold of 0.5 is used in the NMS process to remove redundant object proposals, and the best \(K=20\) object candidates are selected in each image. We set \(\lambda =0.25\) in the hinge loss (6) of a triplet. The prime number p in the hash function (9) is set to \(10^9+7\), so the hash table consumes around 1 GB of memory. The RPN and Triplet Network in our method are built upon the VGG-f model [30] as well as the VGG-16 model [4]. Compared to the VGG-16 model, the VGG-f model is much simpler in structure and thus more computationally efficient. The VGG-f and VGG-16 models are pre-trained on the ImageNet dataset [29] and fine-tuned on the Microsoft COCO dataset [19]. All parameters are kept fixed across the experiments unless explicitly stated otherwise.

6.2 Experiments on the PASCAL07 Dataset

The PASCAL VOC 2007 dataset [8] is used to evaluate the performance of object co-localization in images. The dataset is split into a training-validation set and a test set, each with about 5,000 images in 20 classes. We follow [15] to construct a collection of images for object co-localization from the training-validation set and denote it as PASCAL07. This does not bias our evaluation, since our RPN and Triplet Network are not trained on this dataset but on the ImageNet and COCO datasets, as stated in Sect. 6.1.

We first compare the co-localization accuracy of different MWC problem solvers on the PASCAL07 dataset in Table 1. The graph instances of these MWC problems are constructed based on the VGG-16 model. Since the PASCAL07 dataset has images from 20 different classes, we construct 20 different graphs, one for each image class. For the experiments on the PASCAL07 dataset, the average number of nodes in the constructed graphs is 6081.67, the average number of edges is \(2.16\times 10^7\), and the average graph density is 0.9962. Different solvers are evaluated on exactly the same MWC problem instances constructed by our co-localization framework. Since randomized processes may exist in different methods, the reported accuracy is averaged over 10 runs with different seeds for the random number generator.

The method in [20] solves the MWC problem in a relaxed continuous domain, using a modified Frank-Wolfe algorithm to attack the problem. Similar to our approach, the solver TBMA [1] solves the MWC problem directly in the discrete domain; however, our solver restarts when a local optimum is revisited, while TBMA restarts when the solution quality has not improved for a specified number of steps. The solver LSCC [35] was originally dedicated to MWC problems where edge weights are absent; that is, its add, swap and drop operations change the weight of a clique considering the related vertex weights only. Here we modify it so that the edge weights are taken into account in these operations. Two versions of the LSCC solver from the original paper are evaluated and serve as baseline results in our experiment. The experiments justify our choice of the MWC problem solver, which improves the accuracy of object co-localization.

Table 1. Co-localization CorLoc (%) of different MWC solvers on the PASCAL07 dataset.

The co-localization accuracy of different object candidate generation and feature embedding methods on the PASCAL07 dataset is compared in Table 2. Different CNN models are used to extract the object candidate features, and the cosine similarity is applied to these deep features. The results show that re-ranking the object proposals in each image according to the entropy of their class distributions leads to significantly better results than ranking them directly by the RPN objectness score. With the Triplet Network embedding, the co-localization performance improves further. The experiments validate that object co-localization benefits from a proper choice of the object candidate generation and feature embedding scheme.

Table 2. Co-localization CorLoc (%) of different strategies on the PASCAL07 dataset.

The co-localization accuracy of different object co-localization methods on the PASCAL07 dataset is reported in Table 3. The results of the compared methods are taken directly from the corresponding literature. Among the methods using deep CNN features as visual descriptors [3, 7, 18, 26, 27, 34, 36], our method demonstrates superior results over recent state-of-the-art approaches. The experiments confirm the effectiveness of the proposed object co-localization framework.

Table 3. Co-localization CorLoc (%) of different methods on the PASCAL07 dataset.
Table 4. Co-localization CorLoc (%) of different methods on the YouTube-Objects dataset.

6.3 Experiments on the YouTube-Objects Dataset

The YouTube-Objects dataset [16] is used for object co-localization in videos. The dataset contains videos collected from YouTube covering 10 object classes, with about 570,000 frames and 1,407 annotations in the first version of the dataset [23]. To our knowledge, it is the largest available video dataset with bounding-box annotations for multiple classes. The individual video frames after decompression are used in our experiments to avoid possible inconsistencies among different video decoders. Following the practice in [15], we only perform object co-localization on video frames with ground-truth annotations, and no additional spatio-temporal information is utilized in our method. The YouTube-Objects dataset comes with the test videos divided into 10 classes according to the dominant object present in them, hence we construct 10 different graphs for this dataset. The co-localization accuracy of different methods on the YouTube-Objects dataset is summarized in Table 4. Among the compared methods, [27, 37] also utilize deep networks for visual representation. The experiments show that the proposed object co-localization framework is also very effective for mining common objects in videos.

7 Conclusion

In this paper, we present a novel framework to address the problem of object co-localization. It provides a practical and general solution for research and applications related to the MWC problem. Deep learning based methods are utilized to localize the candidates of the common objects and to describe their visual characteristics, which makes it possible to better cope with inter-class similarities and intra-class variations. Finally, a cycle elimination based restart strategy is proposed to guide the local search for the MWC, which alleviates the cycling issue in the optimization process. The experimental results on the object co-localization tasks demonstrate that our MWC solver is particularly suitable for graphs with high density, and the proposed method shows significant improvements over several strong baselines.