1 Introduction

Co-saliency detection refers to searching for visually salient objects that repetitively appear in multiple given images. Owing to its superior scalability, co-saliency detection has been applied to various applications involving image content understanding, such as image/video co-segmentation [1,2,3], object co-localization [4], and content-aware compression [5].

The success of co-saliency detection relies on feature representations of co-salient objects that are robust against appearance variations across images. Engineered features, such as color histograms, Gabor-filtered texture features, and SIFT [6], are widely used in conventional co-saliency methods [7,8,9,10]. Deep learning (DL) has recently emerged and demonstrated success in many computer vision applications. DL-based features have been adopted for co-saliency detection, such as those extracted from a pre-trained convolutional neural network (CNN) [11] or from unsupervised semantic feature learning with restricted Boltzmann machines (RBMs) [12]. However, feature extraction and co-saliency detection are treated as separate steps in these approaches [7,8,9,10,11,12], leading to suboptimal performance. In contrast, supervised methods based on metric learning [13] or DL [14] integrate feature learning and co-saliency detection. However, they require additional training data in the form of object masks, often manually drawn or delineated by tools with intensive user interaction. Such heavy annotation cost makes these methods less practical, as pointed out in other applications such as semantic segmentation [15] and saliency detection [16]. Furthermore, their learned models may not perform well on unseen object categories at test time, since the models do not adapt themselves to unseen categories.

Fig. 1.

Motivation of our method. (a) Our method optimizes an objective function defined on a graph where single-image saliency (SIS) detection (red edges) and cross-image co-occurrence (COOC) discovery (blue edges) are considered jointly. (b) The first row displays the images for co-saliency detection. The following three rows show the detected saliency maps by using COOC, SIS, and both of them, respectively. (Color figure online)

In this work, we address the aforementioned issues by proposing an unsupervised CNN-based method for joint adaptive feature learning and co-saliency detection on the given images, hence striking a good compromise between performance and annotation cost. In the proposed method, co-saliency detection is decomposed into two complementary parts, single-image saliency detection and cross-image co-occurrence region discovery. The former detects the salient objects in a single image, which may not repetitively appear across images. The latter discovers regions repetitively appearing across images, which may not be visually salient. To this end, we design two novel losses, the single-image saliency (SIS) loss and the co-occurrence (COOC) loss, to capture these two different but complementary sources of information. The two losses measure the quality of the saliency maps by referring to individual images and to the co-occurrence regions of each image pair, respectively. They are further integrated on a graphical model whose unary and pairwise terms correspond to the proposed SIS and COOC losses, respectively, as illustrated in Fig. 1(a). Through optimizing the proposed losses, our approach generates co-saliency maps of high quality by integrating SIS and COOC cues, as shown in Fig. 1(b).

To the best of our knowledge, our method represents the first unsupervised CNN model for co-saliency detection. Compared with unsupervised methods, including those using engineered features [3, 7,8,9,10] and those using DL-based features [11, 12], our method achieves better performance through joint adaptive feature learning and co-saliency detection based on CNNs. Compared with the supervised methods [13, 17], our method reaches comparable or even slightly better performance and does not suffer from the high annotation cost of labeling object masks as training data. We comprehensively evaluate our method on three benchmarks for co-saliency detection: the MSRC dataset [18], the iCoseg dataset [19], and the Cosal2015 dataset [12]. The results show that our approach remarkably outperforms the state-of-the-art unsupervised methods and even surpasses many supervised DL-based saliency detection methods.

2 Related Work

2.1 Single-Image Saliency Detection

Single-image saliency detection aims to distinguish salient objects from the background by either unsupervised [20,21,22,23,24,25] or supervised [26,27,28,29,30] methods based on color appearance, spatial locations, and various supplementary higher-level priors, including objectness. These approaches handle images with a single salient object well. However, they may fail when the scenes are more complex, for example when multiple salient objects are present with intra-image variations. By exploiting the co-occurrence patterns of common objects appearing in multiple images, co-saliency detection is expected to perform better. However, the appearance variations of common objects across images also make co-saliency detection a more challenging task.

2.2 Co-saliency Detection

Co-saliency detection discovers common and salient objects across multiple images using different strategies. Existing co-saliency detection methods have been developed within bottom-up frameworks based on various robust features, including low-level handcrafted features [3, 7,8,9,10, 17, 31, 32] and high-level DL-based semantic features [11, 12], to capture intra-image visual stimuli as well as inter-image repetitiveness. However, no adopted feature is suitable for all visual variations, and these methods treat feature extraction and co-saliency detection as separate steps, leading to suboptimal performance. Data-driven methods [13, 14, 17] directly learn the patterns of co-salient objects to overcome the limitation of bottom-up methods. For instance, the transfer-learning-based method [17] uses object masks to train a stacked denoising autoencoder (SDAE) that learns intra-image contrast evidence, and propagates this knowledge to capture inter-image coherent foreground representations. Despite the impressive results, the performance might drop dramatically once the transferred knowledge on feature representations is unsatisfactory, since the separation of feature extraction and co-saliency detection potentially impedes the performance. Recently, Wei et al. [14] and Han et al. [13] have proposed unified learning-based methods that learn semantic features and detect co-salient objects jointly. Despite the improved performance, their methods rely on a large number of training object masks, which reduces the generalizability of their approaches to unseen images. In contrast, our method performs adaptive and unified learning on the given images in an unsupervised manner, and hence none of the aforementioned issues exist in our approach.

2.3 Graphical Models with CNNs

Deep learning has demonstrated success in many computer vision applications. For better preserving spatial consistency, graphical models have been integrated with CNNs when structured outputs are required, such as in depth estimation [33], stereo matching [34], semantic segmentation [35,36,37], image denoising, and optical flow estimation [38]. Although showing promise in preserving spatial consistency and modeling pairwise relationships, these methods have three major limitations when extended to co-saliency detection. First, their graphical models are built on single images and hence cannot be directly applied to co-saliency detection with multiple images. Second, the pairwise terms in these graphical models often act as regularization terms to ensure spatial consistency but cannot work on their own. Finally, they require annotated data to train the model. Regarding inter-image graphical models, Hayder et al. [39] and Yuan et al. [40] respectively integrated fully-connected CRFs into CNNs for object proposal co-generation and object co-segmentation, where each node is an object proposal. However, their methods still suffer from the last two limitations. In comparison, our method inherits the merits of graphical models for co-saliency detection without the aforementioned issues.

Fig. 2.

Overview of our approach to co-saliency detection. It optimizes an objective function defined on a graph by learning two collaborative FCN models \(g_s\) and \(g_c\), which respectively generate single-image saliency maps and cross-image co-occurrence maps.

3 Our Approach

We first describe the proposed formulation for co-saliency detection. Next, we propose a couple of enhancements by self-paced learning and fully connected conditional random fields. Finally, the optimization process and the implementation details are provided.

3.1 The Proposed Formulation

Given a set of N images \(\{I_n\}_{n=1}^{N}\), co-saliency detection aims to detect the salient objects of a category commonly present in these images. We accomplish the task by decomposing it into two sub-tasks, single-image saliency detection and cross-image co-occurrence discovery. The former detects the salient regions in a single image, without considering whether the detected regions are commonly present across images. The latter discovers the regions repetitively occurring across images, while disregarding whether the discovered regions stand out visually. Co-saliency detection, finding the salient co-occurrence regions, can then be carried out by performing and integrating the two tasks on a graph whose two types of edges respectively correspond to the two tasks, as shown in Fig. 1(a). The proposed objective function on the graph is defined by

$$\begin{aligned} E({\mathbf {w}}) = \sum _{n = 1}^N \psi _{s} (I_n;{\mathbf {w}}) + \sum _{n=1}^{N}\sum _{m \ne n} \psi _{c} (I_n,I_m;{\mathbf {w}}), \end{aligned}$$
(1)

where the unary term \(\psi _{s}(I_n;{\mathbf {w}})\) focuses on saliency detection for the image \(I_n\), the pairwise term \(\psi _{c} (I_n,I_m;{\mathbf {w}})\) accounts for co-occurrence discovery for the image pair \((I_n,I_m)\), and \({\mathbf {w}}\) is the set of model parameters.

As shown in Fig. 2, we learn two fully convolutional network (FCN) [41] models, \(g_s\) and \(g_c\), to optimize the unary term \(\psi _{s}\) and the pairwise term \(\psi _{c}\) in Eq. (1), respectively. For image \(I_n\), FCN \(g_s\) investigates intra-image clues and generates its saliency map \(S_n^s\). In contrast, FCN \(g_c\) discovers cross-image evidence and produces its co-occurrence map \(S_n^c\), where the repetitively occurring regions are highlighted. The resultant co-saliency map, highlighting the co-occurrence and salient regions, is yielded by \(S_n = g_s(I_n) \otimes g_c(I_n) = S_n^s \otimes S_n^c\), where \(\otimes \) denotes the element-wise multiplication operator.
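To make the map composition concrete, the snippet below gives a minimal sketch in Python/NumPy (our implementation uses MatConvNet, so the function name and the toy maps are purely illustrative): the outputs of \(g_s\) and \(g_c\) are combined by element-wise multiplication.

```python
import numpy as np

def compose_cosaliency_map(S_s, S_c):
    """Combine a single-image saliency map S_n^s and a co-occurrence map
    S_n^c by element-wise multiplication; the result is high only where a
    region is both salient and repetitively occurring."""
    assert S_s.shape == S_c.shape
    return S_s * S_c

# Toy maps standing in for g_s(I_n) and g_c(I_n).
S_s = np.random.rand(64, 64)
S_c = np.random.rand(64, 64)
S_n = compose_cosaliency_map(S_s, S_c)
```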

Let \({\mathbf {w}}_s\) and \({\mathbf {w}}_c\) denote the learnable parameter sets of FCNs \(g_s\) and \(g_c\), respectively. We learn \(g_s\) and \(g_c\) jointly by optimizing \(E({\mathbf {w}}= {\mathbf {w}}_s \cup {\mathbf {w}}_c)\) in Eq. (1). The unary term \(\psi _{s}\) and the pairwise term \(\psi _{c}\) in Eq. (1) are described below.

3.2 Unary Term \(\psi _{s}\)

This term aims to identify the salient regions in a single image. It guides the training of FCN \(g_s\), which produces saliency map \(S_n^s\) for image \(I_n\), i.e., \(S_n^s = g_s(I_n)\). Inspired by Zhang et al. [42], we apply an existing unsupervised method for saliency detection to image \(I_n\), and use its output saliency map \(\tilde{S}_n\) as the desired target for learning FCN \(g_s\). In this work, we adopt MILP [25] to generate \(\tilde{S}_n\). Specifically, the unary term \(\psi _{s}(I_n; {\mathbf {w}}_s)\) applied to image \(I_n\) is defined by

$$\begin{aligned} \psi _{s}(I_n; {\mathbf {w}}_s) = \sum _{i \in I_n} R_n(i) |S_n^{s}(i) - \tilde{S}_n(i)|^2, \end{aligned}$$
(2)

where i is the index of the pixels in \(I_n\), \(S_n^{s}(i)\) and \(\tilde{S}_n(i)\) are respectively the saliency values of maps \(S_n^{s}\) and \(\tilde{S}_n\) at pixel i, and \(R_n(i)\) represents the importance of pixel i. Pixels in map \(\tilde{S}_n\) can be divided into the salient and non-salient groups by using the mean value of \(\tilde{S}_n\) as the threshold. \(R_n(i)\) is introduced here to deal with the potential size imbalance between the two groups. Let \(\delta \) be the ratio of salient pixels over the whole image \(I_n\). \(R_n(i)\) takes the value \(1-\delta \) if pixel i belongs to the salient group, and \(\delta \) otherwise. In this way, the salient and non-salient groups contribute equally in Eq. (2).
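A minimal NumPy sketch of the unary term in Eq. (2) is given below; the reference map \(\tilde{S}_n\) is assumed to come from any unsupervised saliency detector (MILP in our case), and the function name is illustrative.

```python
import numpy as np

def sis_unary_loss(S_s, S_tilde):
    """Eq. (2): size-balanced squared error between the predicted map S_n^s
    and the reference map S_tilde_n from an unsupervised saliency detector."""
    salient = S_tilde > S_tilde.mean()           # split by the mean of S_tilde
    delta = salient.mean()                       # ratio of salient pixels
    R = np.where(salient, 1.0 - delta, delta)    # importance weights R_n(i)
    return np.sum(R * (S_s - S_tilde) ** 2)
```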

3.3 Pairwise Term \(\psi _{c}\)

The pairwise term \(\psi _{c}\) seeks the regions simultaneously appearing across images. It serves as the objective for learning FCN \(g_c\). The sought regions should look similar across images but distinct from the surrounding non-detected regions. Thus, two criteria are jointly considered in the design of \(\psi _{c}\): (1) high cross-image similarity between the detected co-occurrence regions and (2) high intra-image distinctness between the detected co-occurrence regions and the rest of the image. The second criterion is auxiliary but crucial for avoiding trivial solutions.

As shown in Fig. 2, FCN \(g_c\) produces the co-occurrence map \(S_n^c\) for image \(I_n\), i.e., \(S_n^c = g_c(I_n)\). The sigmoid function is used as the activation function in the last layer of \(g_c\). Thus, the value of the co-occurrence map at each pixel i, \(S_n^c(i)\), is between 0 and 1. With \(S_n^c\), image \(I_n\) is decomposed into two masked images,

$$\begin{aligned} I_{n}^{o} = S_n^c \otimes I_n \; \text { and } \; I_{n}^{b} = (1-S_n^c) \otimes I_n, \end{aligned}$$
(3)

where \(\otimes \) denotes element-wise multiplication. The masked image \(I_{n}^{o}\) keeps the detected co-occurrence regions of \(I_n\), while image \(I_{n}^{b}\) contains the rest.

To measure the similarity between images, we employ a feature extractor f to compute the features of a given image. In this work, the extractor f can be a pre-trained CNN model for image classification, e.g., AlexNet [43] or VGG-19 [44], with the softmax function and the last fully connected layer removed. We apply the extractor f to all masked images \(\{I_{n}^{o}, I_{n}^{b}\}_{n=1}^{N}\) and obtain their features \(\{f(I_{n}^{o}) \in {\mathbb {R}}^c, f(I_{n}^{b}) \in {\mathbb {R}}^c \}_{n=1}^{N}\), where c is the feature dimension. With these extracted features, the pairwise term \(\psi _{c} (I_n,I_m;{\mathbf {w}}_c)\) applied to the image pair \(I_n\) and \(I_m\) is defined by

$$\begin{aligned} \psi _{c}(I_n,I_m;{\mathbf {w}}_c) = -\log (p_{nm}), \end{aligned}$$
(4)

where \(p_{nm}\) is the score estimating the quality of the detected co-occurrence regions in \(I_n\) and \(I_m\). The score \(p_{nm}\) is defined below,

$$\begin{aligned} p_{nm}&= \frac{\exp (-d^{+}_{nm})}{\exp (-d^{+}_{nm}) + \exp (-d^{-}_{nm})}, \; \text {where} \end{aligned}$$
(5)
$$\begin{aligned} d^{+}_{nm}&= \frac{1}{c} \Vert f(I_{n}^{o}) - f(I_{m}^{o}) \Vert ^2 \; \text {and} \end{aligned}$$
(6)
$$\begin{aligned} d^{-}_{nm}&= \frac{1}{2c} \Vert f(I_{n}^{o}) - f(I_{n}^{b}) \Vert ^2 + \frac{1}{2c} \Vert f(I_{m}^{o}) - f(I_{m}^{b}) \Vert ^2. \end{aligned}$$
(7)

Equation (6) measures the inter-image distance between the detected co-occurrence regions in images \(I_n\) and \(I_m\) (criterion 1). Equation (7) evaluates the intra-image distance between the detected co-occurrence regions and the rest of the image (criterion 2). By minimizing the pairwise term \(\psi _{c} (I_n,I_m;{\mathbf {w}}_c)\) in Eq. (4) for each image pair \((I_n, I_m)\), the resultant FCN \(g_c\) will produce the co-occurrence maps where the inter-image distances between the detected co-occurrence regions are minimized while the intra-image distances between the detected co-occurrence regions and the rest of the images are maximized. After learning FCNs \(g_s\) and \(g_c\) jointly through the unary and pairwise terms in Eq. (1), the resultant co-saliency map \(S_n\) of a given image \(I_n\) is produced via \(S_n = g_s(I_n) \otimes g_c(I_n)\).
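The sketch below (NumPy) illustrates how the pairwise term in Eqs. (3)–(7) is evaluated for one image pair; `extract_features` is a hypothetical stand-in for the pre-trained CNN f described in Sect. 3.6.

```python
import numpy as np

def cooc_pairwise_loss(I_n, I_m, S_n_c, S_m_c, extract_features):
    """Eqs. (3)-(7): return -log(p_nm) for the image pair (I_n, I_m).

    I_n, I_m         : H x W x 3 images
    S_n_c, S_m_c     : H x W co-occurrence maps with values in [0, 1]
    extract_features : callable mapping an image to a c-dimensional vector
                       (a stand-in for the pre-trained CNN f).
    """
    # Eq. (3): decompose each image into co-occurrence and background parts.
    I_n_o, I_n_b = S_n_c[..., None] * I_n, (1.0 - S_n_c[..., None]) * I_n
    I_m_o, I_m_b = S_m_c[..., None] * I_m, (1.0 - S_m_c[..., None]) * I_m

    f_n_o, f_n_b = extract_features(I_n_o), extract_features(I_n_b)
    f_m_o, f_m_b = extract_features(I_m_o), extract_features(I_m_b)
    c = f_n_o.size

    # Eq. (6): inter-image distance between detected co-occurrence regions.
    d_pos = np.sum((f_n_o - f_m_o) ** 2) / c
    # Eq. (7): intra-image distances between the regions and the rest.
    d_neg = (np.sum((f_n_o - f_n_b) ** 2) +
             np.sum((f_m_o - f_m_b) ** 2)) / (2 * c)

    # Eqs. (4)-(5): quality score p_nm and its negative log.
    p_nm = np.exp(-d_pos) / (np.exp(-d_pos) + np.exp(-d_neg))
    return -np.log(p_nm)
```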

Note that the pairwise term in Eq. (4) is defined by referring to the co-occurrence maps produced by FCN \(g_c\), i.e., \(S_n^c\) and \(S_m^c\). In practice, we found that the performance of co-saliency detection can be further improved if the co-saliency maps \(S_n\) and \(S_m\) are also taken into account in the pairwise term. In our implementation, we extend the pairwise term in Eq. (4) to

$$\begin{aligned} \psi _{c}(I_n,I_m;{\mathbf {w}}_c) = -\lambda _{c}\log (p_{nm}) -\lambda _{\tilde{c}}\log (\tilde{p}_{nm}), \end{aligned}$$
(8)

where \(\tilde{p}_{nm}\) is computed in the same way as \(p_{nm}\) but by referring to the co-saliency maps \(S_n\) and \(S_m\). Constants \(\lambda _{c}\) and \(\lambda _{\tilde{c}}\) are used in Eq. (8) to weight the corresponding terms. In the following, we show that the quality of the co-saliency maps can be further improved via two extensions: map enhancement by self-paced learning and postprocessing by fully connected conditional random fields (DenseCRFs) [45].
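For completeness, the extended term in Eq. (8) simply mixes the two negative log scores; the weights in the sketch below are placeholders, as their actual values are set empirically (Sect. 3.6).

```python
import numpy as np

def extended_pairwise_term(p_nm, p_tilde_nm, lam_c=1.0, lam_ct=1.0):
    """Eq. (8): p_nm refers to the co-occurrence maps, p_tilde_nm to the
    co-saliency maps; lam_c and lam_ct are weighting constants."""
    return -lam_c * np.log(p_nm) - lam_ct * np.log(p_tilde_nm)
```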

3.4 Co-saliency Map Enhancement

Self-paced learning with CNNs is first employed to make the co-saliency maps sharper. Fully connected conditional random fields are then adopted to preserve the boundaries of co-salient objects. The details of these two extensions are given below.

Co-saliency Map Enhancement by Self-paced Learning. The co-saliency maps obtained by optimizing Eq. (1) are sometimes over-smooth, because neither FCN \(g_s\) nor \(g_c\) takes information about object boundaries into account. To address this issue, we oversegment each image \(I_n\) into superpixels \(Q_n = \{q_n^{k}\}_{k=1}^{K}\), where \(q_n^{k}\) is the kth superpixel and K is the number of superpixels. Pixels within a superpixel tend to all belong to either a salient object or the background. This property can be leveraged to propagate information from pixels of high confidence to those of low confidence within the same superpixel. We divide the superpixels into three groups, i.e., \(Q_n = O_n \cup B_n \cup T_n\). The first two groups, \(O_n\) and \(B_n\), contain superpixels that likely belong to the object and the background, respectively. The third group \(T_n\) covers the rest. Given the co-saliency map \(S_n\), the three groups are yielded by

$$\begin{aligned} q_n^{k} \in {\left\{ \begin{array}{ll} O_n, &{} \text {if } \mu _n^k > \mu _n + \sigma _n,\\ B_n, &{} \text {if } \mu _n^k < \mu _n - 0.25 * \sigma _n, \\ T_n, &{} \text {otherwise}, \\ \end{array}\right. } \; \text {for } k=1, 2, ..., K, \end{aligned}$$
(9)

where \(\mu _n^k\) is the mean saliency value of superpixel \(q_n^{k}\), while \(\mu _n\) and \(\sigma _n\) are the mean and the standard deviation of \(\{\mu _n^k\}_{k=1}^K\). In addition, we follow the background seed sampling strategy used in previous work [20, 46] and add superpixels on the image boundary to the set \(B_n\). Superpixels in \(O_n\) and \(B_n\) can be confidently assigned to the salient regions and the background, respectively. Those in \(T_n\) are ambiguous, so they are not taken into account here. With \(O_n\) and \(B_n\) of image \(I_n\), another FCN \(g_e\) for co-saliency map enhancement is trained by optimizing (see the sketch after Eq. (10))

$$\begin{aligned} \psi _{e}(I_n;{\mathbf {w}}_e) = w_o \sum _{q \in O_n} \sum _{i \in q} |S_n^{e}(i) - 1|^2 + w_b \sum _{q \in B_n} \sum _{i \in q} |S_n^{e}(i) - 0|^2, \end{aligned}$$
(10)

where map \(S_n^{e} = g_e(I_n)\) is generated by FCN \(g_e\), and i is the index of the pixels in \(I_n\). Constants \(w_o = \frac{|B_n|}{|O_n|+|B_n|}\) and \(w_b = \frac{|O_n|}{|O_n|+|B_n|}\) are the weights used to balance the contributions of \(O_n\) and \(B_n\), where \(|O_n|\) and \(|B_n|\) are the numbers of pixels in \(O_n\) and \(B_n\), respectively.
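The following NumPy sketch puts the two pieces of this subsection together: the superpixel grouping of Eq. (9) (here with SLIC from scikit-image, while the exact superpixel parameters are not specified in the paper) and the enhancement loss of Eq. (10); all function names and the disjointness precaution are illustrative.

```python
import numpy as np
from skimage.segmentation import slic   # SLIC superpixels, as in Sect. 3.6

def group_superpixels(image, S_n, n_segments=200):
    """Eq. (9): assign each superpixel to the object set O, background set B,
    or the ambiguous set T according to its mean co-saliency value."""
    labels = slic(image, n_segments=n_segments)
    sp_ids = np.unique(labels)
    mu_k = np.array([S_n[labels == k].mean() for k in sp_ids])
    mu, sigma = mu_k.mean(), mu_k.std()
    O = set(sp_ids[mu_k > mu + sigma])
    B = set(sp_ids[mu_k < mu - 0.25 * sigma])
    # Background seed sampling: superpixels touching the image boundary.
    border = np.unique(np.concatenate(
        [labels[0, :], labels[-1, :], labels[:, 0], labels[:, -1]]))
    B |= set(border) - O          # keep the two confident groups disjoint
    T = set(sp_ids) - O - B
    return labels, O, B, T

def enhancement_loss(S_e, labels, O, B):
    """Eq. (10): push confident object pixels toward 1, background toward 0."""
    O_mask = np.isin(labels, list(O))
    B_mask = np.isin(labels, list(B))
    n_o, n_b = O_mask.sum(), B_mask.sum()
    w_o, w_b = n_b / (n_o + n_b), n_o / (n_o + n_b)
    return (w_o * np.sum((S_e[O_mask] - 1.0) ** 2) +
            w_b * np.sum((S_e[B_mask] - 0.0) ** 2))
```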

The term in Eq. (10) enhances the consensus within superpixels of high confidence. If it is turned on, the objective is extended from that in Eq. (1) to

$$\begin{aligned} E({\mathbf {w}}) = \sum _{n = 1}^N \psi _{s} (I_n;{\mathbf {w}}_s) + \lambda _e \psi _e (I_n;{\mathbf {w}}_e) + \sum _{n=1}^{N-1}\sum _{m=n+1}^N \psi _{c} (I_n,I_m;{\mathbf {w}}_c), \end{aligned}$$
(11)

where \(\lambda _e\) is a weight, and \({\mathbf {w}}= {\mathbf {w}}_s \cup {\mathbf {w}}_e \cup {\mathbf {w}}_c\) is the union of the learnable parameter sets of FCNs \(g_s\), \(g_e\), and \(g_c\). After optimizing the objective function in Eq. (11), the co-saliency map \(S_n\) of image \(I_n\) is generated by \(S_n = g_s(I_n) \otimes g_e(I_n) \otimes g_c(I_n) = S^{s}_n \otimes S^{e}_n \otimes S^{c}_n\).

Postprocessing Using DenseCRFs. The co-saliency maps obtained by optimizing the objective in Eq. (11) can be further improved by enforcing spatial coherence and preserving object boundaries. To this end, we follow previous work [28, 47] and adopt DenseCRFs [45] to postprocess the co-saliency map \(S_n\) of a given image \(I_n\). We use the DenseCRFs code implemented by Li and Yu [47] in this work.
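A sketch of the DenseCRF postprocessing step is shown below using the pydensecrf package; note that the actual experiments use the implementation released by Li and Yu [47], and the pairwise parameters here are illustrative rather than the ones used in the paper.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def densecrf_refine(image, S_n, n_iters=5):
    """Refine a co-saliency map S_n with a fully connected CRF.

    image : H x W x 3 uint8 image
    S_n   : H x W co-saliency map with values in [0, 1]
    """
    h, w = S_n.shape
    probs = np.stack([1.0 - S_n, S_n]).astype(np.float32)     # (2, H, W)
    d = dcrf.DenseCRF2D(w, h, 2)
    d.setUnaryEnergy(unary_from_softmax(probs))
    # Pairwise potentials enforce spatial coherence and appearance consistency.
    d.addPairwiseGaussian(sxy=3, compat=3)
    d.addPairwiseBilateral(sxy=60, srgb=5, rgbim=np.ascontiguousarray(image),
                           compat=5)
    Q = np.array(d.inference(n_iters)).reshape(2, h, w)
    return Q[1]   # posterior of the "co-salient" label
```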

3.5 Optimization

To reduce memory consumption and speed up the training, the proposed method is optimized with a two-stage procedure. At the first stage, we learn FCNs \(g_s\) and \(g_c\) separately by using the objective functions in Eqs. (2) and (4), respectively, with all images for 20 epochs. By then, the co-saliency maps \(\{S_n = g_s(I_n) \otimes g_c(I_n)\}_{n=1}^N\) are stable enough, so we divide the superpixels of each image into three groups via Eq. (9). FCN \(g_e\) is then trained with the objective in Eq. (10) on all images for 20 epochs. At the second stage, we turn on all three terms in Eq. (11), where the extended pairwise term in Eq. (8) is adopted. The three FCNs, \(g_s\), \(g_e\), and \(g_c\), are optimized jointly for 20 epochs. Note that at the second stage, we optimize only the parameters in the last two convolutional layers and the skip connections of each FCN model.

The objectives in Eqs. (1) and (11) are defined on a fully-connected graph. It is difficult to directly optimize either objective with all images at the same time due to the limited memory size. Therefore, we adopt the piecewise training scheme [48]. Namely, we consider only the sub-graph yielded by a subset of images at each time. The learning rate is set to \(10^{-6}\) at the first stage and is reduced to \(10^{-8}\) at the second stage. The weight decay and momentum are set to 0.0005 and 0.9, respectively. The objective function in Eq. (11) is differentiable. We choose ADAM [49] as the optimization solver for its rapid convergence. The gradients with respect to the optimization variables can be derived straightforwardly, so we omit their derivation here.
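The second training stage can be summarized by the PyTorch-style sketch below (our implementation is in MatConvNet); `joint_loss.unary` and `joint_loss.pairwise` are hypothetical wrappers around the terms of Eq. (11), and the Adam settings only mirror the learning rate and weight decay stated above.

```python
import itertools
import torch

def train_second_stage(images, g_s, g_c, g_e, joint_loss,
                       subset_size=3, epochs=20, lr=1e-8, weight_decay=5e-4):
    """Piecewise training: at each step, only the sub-graph spanned by a
    small subset of images is considered to fit into memory."""
    params = [p for net in (g_s, g_c, g_e)
              for p in net.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(params, lr=lr, weight_decay=weight_decay)
    for _ in range(epochs):
        for start in range(0, len(images), subset_size):
            batch = images[start:start + subset_size]
            loss = 0.0
            for I_n in batch:                    # unary + enhancement terms
                loss = loss + joint_loss.unary(I_n, g_s, g_e)
            for I_n, I_m in itertools.combinations(batch, 2):  # pairwise terms
                loss = loss + joint_loss.pairwise(I_n, I_m, g_s, g_c, g_e)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```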

3.6 Implementation Details

The proposed method is implemented using MatConvNet [50]. The same network architecture is used in all the experiments. ResNet-50 [51] is adopted as the feature extractor f for the pairwise term, because AlexNet [43] and VGG-16/19 [44] sometimes lead to vanishing gradients in our application. The feature extractor f is pre-trained on ImageNet [52] and fixed during the optimization process. The features extracted by f are the inputs to the last fully connected layer of f. The feature dimension, i.e., c in Eqs. (6) and (7), is set to 2,048. All FCNs, including \(g_s\), \(g_e\), and \(g_c\), are developed based on the VGG-16 [44] setting of FCN [41], and their backbones require models pre-trained on the ImageNet [54] dataset for classification. We replace the softmax activation function in the last layer with the sigmoid function. SLIC [53] is adopted to generate superpixels because of its computational efficiency and the good compactness and regularity of its superpixels. Following previous co-saliency detection methods [11, 12], we determine the values of the hyperparameters empirically and keep them fixed in all the experiments.
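As a concrete illustration of the feature extractor f, the torchvision-based sketch below builds an ImageNet-pretrained ResNet-50, removes its final fully connected layer, and freezes its weights to obtain 2,048-dimensional features (again, our own implementation uses MatConvNet, so this is only an approximation of the setup).

```python
import torch
import torchvision

# ResNet-50 pre-trained on ImageNet, with the final fully connected layer
# removed and all weights frozen, serving as the fixed feature extractor f.
resnet = torchvision.models.resnet50(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
for p in feature_extractor.parameters():
    p.requires_grad = False
feature_extractor.eval()

with torch.no_grad():
    masked_image = torch.rand(1, 3, 224, 224)    # stands in for I_n^o or I_n^b
    features = feature_extractor(masked_image)   # shape (1, 2048, 1, 1)
    features = features.flatten(1)               # c = 2,048 feature vector
```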

Fig. 3.

Comparison with state-of-the-art methods under the same setting in terms of PR curves on three benchmark datasets. The numbers in parentheses are AP values.

4 Experimental Results

In this section, we first describe the datasets and evaluation metrics. Next, we compare our method with a set of state-of-the-art methods. Finally, we investigate the contributions of different components via ablation studies.

4.1 Datasets and Evaluation Metrics

Datasets. We evaluated the proposed approach on three public benchmark datasets: iCoseg [19], MSRC [18], and Cosal2015 [12]. iCoseg consists of 38 groups with 643 images in total, and each group has 4–42 images. The images of iCoseg contain single or multiple similar objects in various poses and sizes against complex backgrounds. MSRC contains 7 groups with 240 images in total, and each group has 30–53 images. Compared to iCoseg, objects in MSRC exhibit greater appearance variation. Cosal2015 is a more recent and more challenging dataset than the other two. It has 50 groups and a total of 2,015 images. Each group contains 26 to 52 images with various poses and sizes, appearance variations, and even more complex backgrounds. Because the images of iCoseg and Cosal2015 are larger than those of MSRC, different batch sizes and resolutions were used: the batch size is 3 and the resolution is \(512 \times 512\) for iCoseg and Cosal2015, while the batch size is 5 and the resolution is \(320 \times 320\) for MSRC.

Evaluation Metrics. To evaluate the performance of co-saliency detection, we consider three metrics: average precision (AP), F-measure (\(F_{\beta }\)), and structure measure (\(S_{\alpha }\)). AP is computed from the area under the Precision-Recall (PR) curve, which is produced by binarizing saliency maps with every integer threshold in the range of [0, 255]. The F-measure is the weighted harmonic mean of the precision and recall values obtained with a self-adaptive threshold \(T = \mu + \sigma \), where \(\mu \) and \(\sigma \) are respectively the mean and standard deviation of the saliency map. With the precision and recall values, the F-measure is computed by \({F_{\beta }} = \frac{(1+\beta ^2) \times precision \times recall}{\beta ^2 \times precision+recall}\), where \(\beta ^2 = 0.3\) to weight precision more than recall, as suggested in previous work [11, 12, 56]. The structure measure (\(S_{\alpha }\)) [57] is adopted to evaluate the spatial structure similarities of saliency maps based on both region-aware structural similarity \(S_r\) and object-aware structural similarity \(S_o\), defined as \(S_{\alpha } = \alpha * S_r + (1-\alpha ) * S_o\), where \(\alpha = 0.5\) following [57].
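As an example of the adaptive-threshold F-measure, the small NumPy routine below binarizes a saliency map with \(T = \mu + \sigma \) and computes \(F_{\beta }\) against a binary ground-truth mask (the function name is illustrative).

```python
import numpy as np

def f_measure(saliency_map, gt_mask, beta2=0.3):
    """F-measure with the self-adaptive threshold T = mean + std."""
    T = saliency_map.mean() + saliency_map.std()
    pred = saliency_map >= T
    tp = np.logical_and(pred, gt_mask).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt_mask.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)
```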

Table 1. The performance of co-saliency detection on three benchmark datasets. SI and CS denote the single-image saliency and co-saliency methods, respectively. US and S indicate the unsupervised and supervised methods, respectively. The numbers in red and green respectively indicate the best and the second best results among the unsupervised co-saliency methods (CS+US), the group to which the proposed method belongs.

4.2 Comparison with State-of-the-Art Methods

To have a thorough comparison with state-of-the-art methods, we divide them into four groups: the unsupervised saliency [20, 22,23,24,25, 42] and co-saliency [3, 7, 8, 11, 12, 31, 32, 55] detection methods as well as the supervised saliency [26, 28,29,30, 47, 58] and co-saliency [13, 17] detection methods. The overall performance statistics are compared in Table 1 and Fig. 3. Please note that all compared supervised single-image saliency detection methods are CNN-based. Among the unsupervised single-image saliency methods, SVFSal [42] is CNN-based. When available, we used the publicly released source code with the default parameters provided by the authors to reproduce the experimental results. For methods whose source code is not released, we either evaluated the metrics on their pre-generated co-saliency maps (SP-MIL [11], CoDW [12], and DIM [17]) or directly copied the numbers reported in their papers (UMLBF [13] and MVSRC [55]).

From Table 1, our method outperforms all methods under the same unsupervised co-saliency detection setting by a significant margin. Most approaches of this category take feature extraction and co-salient object detection as separate steps. Our approach surpasses them by performing these steps simultaneously and adopting CNN models. Compared with the supervised co-saliency methods, UMLBF [13] and DIM [17], our method yields comparable or even slightly better performance without expensive object annotations. Although sharing the unsupervised setting, our method clearly outperforms the group of unsupervised single-image saliency detection methods by taking advantage of the additional information within an image set. It is worth mentioning that our method also outperforms the unsupervised CNN-based single-image saliency method, SVFSal [42], which requires saliency proposal fusion to generate high-quality pseudo ground truth as training data. In general, the supervised CNN-based single-image saliency methods perform the best among the four groups of methods, as they better utilize the object annotations. Even so, our method still outperforms many of the methods in this group by exploiting cross-image referencing and adaptive feature learning. From the PR curves in Fig. 3, the proposed method outperforms the state-of-the-art methods by a large margin.

Fig. 4.

Example saliency maps generated by our method and some state-of-the-art methods. From the top to the bottom, they are the given images, ours, CSSCF [3], CoDW [12], MILP [25], SVFSal [42], UCF [29] and Amulet [30].

Fig. 5.

Ablation studies on three benchmarks. The top row plots the PR curves, while the bottom row shows the performance in \(F_{\beta }\) and \(S_{\alpha }\).

Figure 4 shows example saliency maps produced by our method and some state-of-the-art methods, including unsupervised co-saliency detection methods (CSSCF [3] and CoDW [12]), unsupervised single-image saliency methods (MILP [25] and SVFSal [42]), and supervised CNN-based methods (UCF [29] and Amulet [30]). Without referring to the other images in the given image set, single-image saliency methods may detect visually salient objects that do not repetitively appear in the other images, such as the orange and the apple in the second image of the banana set or the woman in the first image of the babycrib set. Co-saliency detection methods perform better in this regard. The competing co-saliency methods, CSSCF [3] and CoDW [12], do not perform well on images with low figure-ground discrepancies or highly-textured backgrounds, such as the second and third images of the babycrib set or the first and second images of the bird set. The major drawback of their approaches is that feature extraction is treated as a separate step, so they cannot find the most discriminative features across images. Our method addresses this problem by performing adaptive feature learning and co-saliency detection jointly.

Fig. 6.

Example co-saliency maps generated by combinations of different components. From the top to the bottom, they are the given images, \(g_c\), \(g_s\), \(g_c+g_s\), \(g_c+g_s+g_e\) and \(g_c+g_s+g_e+\mathrm{D}\), respectively.

4.3 Ablation Studies

We have performed ablation studies to investigate the contributions of the individual components, \(g_c\), \(g_s\), \(g_e\), and DenseCRFs. Figure 5 reports the results with different metrics, where \(+\)D denotes the results refined by DenseCRFs. For both AP and \(F_{\beta }\), the integration of \(g_c\) and \(g_s\) outperforms either alone. This is not the case for \(S_{\alpha }\), which measures the structure of the detected objects; we explain this below. Both self-paced learning and DenseCRFs further improve the results.

Figure 6 gives example co-saliency maps for the ablation studies. They demonstrate that \(g_c\) and \(g_s\) are complementary to each other. Taking the butterfly set as an example, \(g_s\) highlights both the butterflies and the flowers in the first, third, and fourth images. After integrating the co-occurrence information discovered by \(g_c\), the flowers are mostly suppressed in \(g_c+g_s\). As mentioned above, \(g_c+g_s\) could perform worse in terms of \(S_{\alpha }\). This is because \(g_c\) tends to have less certainty, particularly inside objects or in ambiguous background regions, as illustrated in the second row of Fig. 6. Thus, \(g_c+g_s\) usually generates fuzzier maps than \(g_s\) alone. For example, the cattle have lower saliency values in \(g_c + g_s\) (the fourth row of Fig. 6) than in \(g_s\) (the third row of Fig. 6). By propagating information from regions with high confidence, \(g_e\) improves the certainty of the results of \(g_c+g_s\). Although the gains in AP and \(F_{\beta }\) are smaller, it brings a large improvement in \(S_{\alpha }\), since the objects are further highlighted and the backgrounds are further suppressed, as shown in the fifth row of Fig. 6. Finally, the DenseCRFs enhance spatial coherence and boundary preservation, thus improving both the quantitative and qualitative results.

5 Conclusions

In this paper, we have presented an unsupervised method for co-saliency detection using CNNs. To the best of our knowledge, it is the first to address this problem with an unsupervised CNN. Our method decomposes the problem into two sub-tasks, single-image saliency detection and cross-image co-occurrence region discovery, by modeling two corresponding novel losses: the single-image saliency (SIS) loss and the co-occurrence (COOC) loss. A graphical model is adopted to integrate these two losses, with its unary and pairwise terms corresponding to the SIS and COOC losses, respectively. By optimizing the energy function associated with the graph, the two networks are learned jointly. The quality of the co-saliency maps is further improved by self-paced learning and postprocessing with DenseCRFs. Experiments on three challenging benchmarks show that the proposed method outperforms the state-of-the-art unsupervised methods. In the future, we plan to generalize our method to other applications, such as semantic correspondence [59], image co-localization [14], and object co-segmentation [60], which also require learning from multiple images.