
1 Introduction

Salient object detection, also known as saliency detection, aims to localize and segment the most conspicuous and eye-attracting objects or regions in an image. It usually serves as a pre-processing step to facilitate various subsequent high-level vision tasks, such as image segmentation [1], image captioning [2], and so on. Recently, with the rapid development of deep convolutional neural networks (CNNs), salient object detection has achieved significant improvements over conventional hand-crafted feature based approaches. The emergence of fully convolutional neural networks (FCNs) [3] further pushed it to a new state-of-the-art thanks to their efficiency and end-to-end training. Such architectures also benefit other applications, e.g., semantic segmentation [4] and edge detection [5].

Although profound progress has been made, there still exist two major challenges that hinder real-world applications, e.g., on embedded devices. One is the low resolution of the saliency maps produced by FCN-based saliency models. Due to the repeated stride and pooling operations in CNN architectures, resolution loss is inevitable and difficult to recover, which makes it hard to locate salient objects accurately, especially object boundaries and small objects. The other is the heavy weight and large redundancy of existing deep saliency models. As can be seen in Fig. 1, all the listed deep models are larger than 100 MB, which is too heavy for a pre-processing step used by subsequent high-level tasks, and also not memory efficient for embedded devices.

Fig. 1.

Maximum F-Measure of recent deep CNN-based saliency detection models on ECSSD, including DS [6], ELD [7], \(\hbox {DCL}^{+}\) [8], DHS [44], RFCN [9], NLDF [10], \(\hbox {DSS}^{+}\) [11], MSRNet [12], Amulet [13], UCF [14], and ours (red circle). As can be seen, the proposed model is the only one smaller than 100 MB while achieving comparable performance with state-of-the-art methods. (Color figure online)

Diverse solutions have been explored to improve the resolution of FCN-based predictions. Early works [8, 15, 16] usually combined the FCN with an extra region or superpixel based stream to fuse their respective advantages, at the expense of high time cost. Later, simple yet effective structures were constructed to combine the complementary cues of shallow and deep CNN features, which capture low-level spatial details and high-level semantic information respectively, such as skip connections [12], short connections [11], dense connections [17], and adaptive aggregation [13]. Such multi-level feature fusion schemes also play an important role in semantic segmentation [18, 19], edge detection [20], and skeleton detection [21, 22]. Nevertheless, the existing straightforward fusion schemes are still insufficient for saliency detection in complex real-world scenarios, especially when dealing with multiple salient objects of diverse scales. In addition, some time-consuming post-processing techniques are applied for refinement, e.g., superpixel-based filtering [23] and the fully connected conditional random field (CRF) [8, 11, 24]. However, to the best of our knowledge, no saliency detection network has been explored that considers both a lightweight model and high accuracy.

Fig. 2.

Visual comparison of saliency maps produced by DSS [11] (top row) and our method without (middle row) and with reverse attention (bottom row) in different side-outputs. It can be seen clearly that the resolution of the saliency maps improves gradually from deep to shallow side-outputs, and that our reverse attention based side-output residual learning performs much better than short connections [11].

To this end, we present an accurate yet compact deep salient object detection network that achieves comparable performance with state-of-the-art methods and thus enables real-time applications. In general, more convolutional channels with larger kernel sizes lead to better performance in salient object detection, owing to the larger receptive field and model capacity to capture more semantic information; e.g., there are 512 channels with kernel size \(7\times 7\) in the last side-output of DSS [11]. We take a different route: we introduce residual learning [25] into the architecture of HED [5] and regard salient object detection as a super-resolution reconstruction problem [26]. Given the low-resolution prediction of the FCN, side-output residual features are learned to refine it step by step. Note that this can be achieved using only convolutions with 64 channels and kernel size \(3\times 3\) in each side-output, whose parameters are significantly fewer than those of DSS.

Similar residual learning was also utilized in skeleton detection [21] and image super-resolution [27]. However, the performance is not satisfactory if we directly apply it to salient object detection, due to the challenging nature of the task. Since most existing deep saliency models are fine-tuned from image classification networks, the fine-tuned network will unconsciously focus on the regions with high response values during residual learning, as can be seen in Fig. 5, and thus struggles to capture the residual details, e.g., object boundaries and other undetected object parts. To solve this, we propose reverse attention to guide side-output residual learning in a top-down manner. Specifically, the prediction of a deep layer is upsampled and then reversed to weight its neighboring shallow side-output feature, which quickly guides the network to focus on the undetected regions for residual capture and thus leads to better performance, as seen in Fig. 2.

In summary, the contributions of this paper are as follows: (1) We introduce residual learning into the architecture of HED for salient object detection. With the help of the learned side-output residual features, the resolution of the saliency map can be improved gradually with much fewer parameters than existing deep saliency networks. (2) We further propose reverse attention to guide side-output residual learning. By erasing the current prediction, the network can discover the missing object parts and residual details effectively and quickly, which leads to significant performance improvement. (3) Benefiting from the above two components, our approach consistently achieves comparable performance with state-of-the-art methods, with advantages in terms of simplicity, efficiency (45 FPS), and model size (81 MB).

2 Related Work

Plenty of saliency detection methods have been proposed in the past two decades. Here, we only focus on recent state-of-the-art methods. Almost all of them are FCN-based and try to solve a common problem: how to produce a high-resolution saliency map using FCNs. Kuen et al. [28] applied a recurrent unit in FCNs to iteratively refine each salient region. Hu et al. [23] extended a superpixel-based guided filter into a network layer for boundary refinement. Hou et al. [11] designed short connections for multi-scale feature fusion, while in Amulet [13], multi-level convolutional features were aggregated adaptively. Luo et al. [10] proposed a multi-resolution grid structure to capture both local and global cues; in addition, a new loss function was introduced to penalize errors on the boundaries. Zhang et al. [14] further proposed a novel upsampling method to reduce the artifacts produced by deconvolution. Recently, dilated convolution [23] and dense connections [17] have been further incorporated to obtain high-resolution saliency maps. There are also progressive works addressing this issue in semantic segmentation. In [19], skip connections were proposed to refine object instances, while in [29], they were used to build a Laplacian pyramid reconstruction network for object boundary refinement.

Instead of fusing multi-level convolutional features as the above works do, we learn residual features to refine the low-resolution prediction. The idea of residual learning was first proposed by He et al. [25] for image classification. After that, it was widely applied in various applications. Ke et al. [21] learned side-output residual features for accurate object symmetry detection. Kim et al. [27] built a very deep convolutional network based on residual learning for accurate image super-resolution.

Although it is natural to apply residual learning to salient object detection, the resulting performance is not satisfactory. To solve this, we introduce an attention mechanism, which is inspired by the human perception process. By using top-down information to efficiently guide the bottom-up feedforward process, attention has achieved great success in many tasks. An attention model was designed to weight multi-scale features in [12, 30]. Residual attention modules were stacked to generate deep attention-aware features for image classification in [31]. In the ILSVRC 2017 Image Classification Challenge, Hu et al. [32] won first place by constructing Squeeze-and-Excitation blocks for channel attention. Huang et al. [33] designed an attention mask to highlight the prediction of the reverse object class, which is then subtracted from the original prediction to correct mistakes in confusing areas for semantic segmentation. Inspired by but different from it, we employ reverse attention in a top-down manner to guide side-output residual learning. Benefiting from it, we can learn more accurate residual details, which leads to significant improvement.

Fig. 3.

The overall architecture of the proposed network. Here, only three side-outputs are listed for illustration. “R” denotes the proposed reverse attention block that is illustrated in Fig. 4. As can be seen, the residual error decreases along the stacking orientation with supervision on both the input and output of the residual unit (yellow circle). (Color figure online)

3 Proposed Method

In this section, we first describe the overall architecture of the proposed deep salient object detection network, and then present the details of its main components, namely side-output residual learning and top-down reverse attention.

3.1 Architecture

The proposed network is built upon the HED [5] architecture and chooses VGG-16 [34] as the backbone. We use the layers up to “pool5” and select {conv1_2, conv2_2, conv3_3, conv4_3, conv5_3} as side-outputs, which have strides of {1, 2, 4, 8, 16} pixels with respect to the input image, respectively. We first reduce the dimension of “pool5” to 256 by convolution with kernel size \(1\times 1\), and then add three convolutional layers with \(5\times 5\) kernels to capture global saliency. Since the resolution of the global saliency map is only 1/32 of the input image, we further learn residual features in each side-output to improve its resolution gradually. Specifically, D convolutional layers with \(3\times 3\) kernels and 64 channels are stacked for residual learning. The reverse attention block is embedded before side-output residual learning. The prediction of the shallowest side-output is fed into a sigmoid layer for the final output. The overall architecture is shown in Fig. 3 and the complete configurations are outlined in Table 1.

Table 1. The configurations of the proposed network. \((n, k\times k)\times D\) denotes stacking D convolutional layers with n channels and kernel size \(k\times k\), and a ReLU layer is added for nonlinear transformation.
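To make the layout concrete, the following is a rough, self-contained PyTorch-style sketch of this architecture, for illustration only. It uses a small VGG-like stack instead of the pre-trained VGG-16, and the module names, the 1-channel prediction convolutions, and other details are our assumptions rather than the released Caffe implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def vgg_block(in_c, out_c, n_convs):
    # VGG-style block: n_convs 3x3 conv + ReLU layers (toy stand-in for VGG-16)
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_c if i == 0 else out_c, out_c, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class RASketch(nn.Module):
    """Illustrative sketch of the proposed network (names are hypothetical)."""
    def __init__(self, D=2):
        super().__init__()
        chans = [64, 128, 256, 512, 512]
        convs = [2, 2, 3, 3, 3]
        self.blocks = nn.ModuleList(
            [vgg_block(3 if i == 0 else chans[i - 1], chans[i], convs[i])
             for i in range(5)])
        self.pool = nn.MaxPool2d(2, 2)
        # global saliency branch on pool5: 1x1 dimension reduction + three 5x5 convs,
        # followed by an assumed 1-channel prediction layer
        self.glob = nn.Sequential(
            nn.Conv2d(512, 256, 1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, 1))
        # per side-output: D 3x3/64 convs for residual learning + 1-channel prediction
        def res_branch(in_c):
            layers = [nn.Conv2d(in_c, 64, 3, padding=1), nn.ReLU(inplace=True)]
            for _ in range(D - 1):
                layers += [nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True)]
            layers += [nn.Conv2d(64, 1, 3, padding=1)]
            return nn.Sequential(*layers)
        self.res = nn.ModuleList([res_branch(c) for c in chans])

    def forward(self, x):
        feats = []
        for i, block in enumerate(self.blocks):
            x = block(x) if i == 0 else block(self.pool(x))
            feats.append(x)                        # strides 1, 2, 4, 8, 16
        preds = [self.glob(self.pool(feats[-1]))]  # coarse global saliency, stride 32
        for i in range(4, -1, -1):                 # refine from deep to shallow
            up = F.interpolate(preds[-1], size=feats[i].shape[2:],
                               mode='bilinear', align_corners=False)
            att = 1.0 - torch.sigmoid(up)          # reverse attention, Sect. 3.3
            residual = self.res[i](att * feats[i])
            preds.append(up + residual)            # side-output residual learning, Sect. 3.2
        return [torch.sigmoid(p) for p in preds]   # deep supervision on all outputs
```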

3.2 Side-Output Residual Learning

As we know, deep layers of a network capture high-level semantic information but messy details, while the opposite holds for shallow layers. Based on this observation, multi-level feature fusion is a common choice to capture their complementary cues; however, it may degrade the confident predictions of deep layers when combining them with shallow ones. In this paper, we implement this in a different yet more efficient way by employing residual learning to remedy the errors between the predicted saliency maps and the ground truth. Specifically, the residual feature is learned by applying deep supervision on both the input and output of the designed residual unit, as illustrated in Fig. 3. Formally, given the input saliency map \(S_{i+1}^{up}\) of side-output stage \(i+1\), upsampled by a factor of 2, and the residual feature \(R_{i}\) learned in side-output stage i, the deep supervision can be formulated as:

$$\begin{aligned} \left\{ \begin{aligned} \{S_{i+1}\}^{up\times 2^{i+1}} &\approx G \\ \{S_{i+1}^{up}+R_{i}\}^{up\times 2^{i}} &= \{S_{i}\}^{up\times 2^{i}} \approx G \end{aligned} \right. \end{aligned}$$
(1)

where \(S_{i}\) is the output of the residual unit, G is the ground truth, and \(up\times 2^{i}\) denotes upsampling by a factor of \(2^{i}\), which is implemented by the same bilinear interpolation as in HED [5].

Such a learning objective inherits the following good properties. The residual units establish shortcut connections between the predictions at different scales and the ground truth, which makes it easier to remedy their errors with higher scale adaptability. Generally, the error between the input and output of the residual unit is fairly small given the same supervision, and thus can be learned more easily with fewer parameters and iterations. In the extreme case, the error is approximately zero if the prediction is already close enough to the ground truth. As a result, the constructed network can be very efficient and lightweight.
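A minimal sketch of one refinement step of Eq. (1), assuming a PyTorch tensor layout; the function name and the choice to return both supervised terms are illustrative:

```python
import torch.nn.functional as F

def side_output_refine(S_next, R_i):
    # S_next: prediction of side-output stage i+1          (B, 1, H/2, W/2)
    # R_i:    residual prediction learned at stage i       (B, 1, H, W)
    S_up = F.interpolate(S_next, scale_factor=2,
                         mode='bilinear', align_corners=False)
    S_i = S_up + R_i   # Eq. (1): the residual remedies the error of the coarse input
    return S_up, S_i   # both are upsampled to image size and supervised against G
```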

3.3 Top-Down Reverse Attention

Although it is natural and straightforward to learn residual details for saliency refinement, it is not easy for the network to capture them accurately without extra supervision, which results in unsatisfactory detection. Since most existing saliency detection networks are fine-tuned from image classification networks, which respond only to small and sparse discriminative object parts, they obviously deviate from the requirement of saliency detection, which needs dense and integral regions for pixel-wise prediction. To mitigate this gap, we propose a reverse attention based side-output residual learning approach that expands object regions progressively. Starting with a coarse saliency map generated in the deepest layer, with high semantic confidence but low resolution, our approach guides the whole network to sequentially discover complementary object regions and details by erasing the currently predicted salient regions from the side-output features, where the current prediction is upsampled from the deeper layer. Such a top-down erasing manner eventually refines the coarse, low-resolution prediction into a complete, high-resolution saliency map with these newly explored regions and details; see Fig. 4 for illustration.

Fig. 4.

Illustration of the proposed reverse attention block, whose input and output are highlighted in blue and green respectively. (Color figure online)

Given the side-output feature T and the reverse attention weight A, the output attentive feature is produced by their element-wise multiplication, formulated as:

$$\begin{aligned} F_{z,c}=A_{z} \cdot T_{z,c}, \end{aligned}$$
(2)

where z and c denote the spatial position in the feature map and the index of the feature channel, respectively. The reverse attention weight in side-output stage i is simply generated by subtracting the upsampled prediction of side-output stage \(i+1\) from one, computed as:

$$\begin{aligned} A_{i}=1-\mathrm{Sigmoid}(S_{i+1}^{up}). \end{aligned}$$
(3)
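As a minimal PyTorch-style sketch with illustrative names, Eqs. (2) and (3) together amount to the following:

```python
import torch
import torch.nn.functional as F

def reverse_attention(T, S_next):
    # T:      side-output feature of stage i              (B, C, H, W)
    # S_next: saliency prediction of deeper stage i+1     (B, 1, H/2, W/2)
    S_up = F.interpolate(S_next, size=T.shape[2:],
                         mode='bilinear', align_corners=False)
    A = 1.0 - torch.sigmoid(S_up)  # Eq. (3): high weight where nothing is detected yet
    return A * T                   # Eq. (2): element-wise weighting, broadcast over channels
```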

Figure 5 shows some visual examples of the learned residual features to illustrate the effectiveness of the proposed reverse attention. As can be seen, with the help of reverse attention the proposed network captures the residual details near object boundaries well. Without reverse attention, it learns redundant features inside the object, which are unhelpful for saliency refinement.

Fig. 5.

Visualization of residual features in different side-outputs of the proposed network without (first row) and with reverse attention (second row). From left to right: the saliency map and the last convolutional feature of side-outputs 1 to 4, respectively. After applying our reverse attention, the proposed network captures spatial details near object boundaries well, which is beneficial for saliency refinement, especially in shallow layers. Best viewed in color. (Color figure online)

3.4 Supervision

As shown in Fig. 3, deep supervision is applied to each side-output stage as in [5, 11]. Each side-output produces a loss term \({\mathcal {L}}_{side}\), defined as:

$$\begin{aligned} {\mathcal {L}}_{\mathrm{side}}(I,G,\mathrm{W},\mathrm{w})=\sum _{m=1}^{M}\ell _\mathrm{side}^{(m)}(I,G,\mathrm{W},\mathrm{w}^{(m)}), \end{aligned}$$
(4)

where M is the total number of side-outputs including the global saliency, W denotes the collection of all standard network layer parameters, and I and G refer to the input image and the corresponding ground truth, respectively. Each side-output layer is regarded as a pixel-wise classifier with corresponding weights w, represented by

$$\begin{aligned} \mathrm{w}=(\mathrm{w}^{(1)},\mathrm{w}^{(2)},...,\mathrm{w}^{(M)}). \end{aligned}$$
(5)

Here, \(\ell _{\mathrm{side}}^{(m)}\) represents the image-level class-balanced cross-entropy loss function [5] of the mth side output, which is computed by the following formulation:

$$\begin{aligned} \ell _{\mathrm{side}}^{(m)}(I,G,\mathrm{W},\mathrm{w}^{(m)})= -\sum _{z=1}^{|I|}\Big [G(z)\log Pr(G(z)=1|I(z);\mathrm{W},\mathrm{w}^{(m)}) + (1-G(z))\log Pr(G(z)=0|I(z);\mathrm{W},\mathrm{w}^{(m)})\Big ], \end{aligned}$$
(6)

where \(Pr(G(z)=1|I(z);W,w^{(m)})\) represents the probability of the activation value at location z in the \(m\hbox {th}\) side-output, and z is the spatial coordinate. Different from HED [5] and DSS [11], no fusion layer is included in our approach. The output of the first side-output, after a sigmoid layer, is used as our final prediction in the testing stage.
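For concreteness, the following is a minimal sketch of Eqs. (4) and (6), assuming the side-output probability maps have already been upsampled to the ground-truth resolution and passed through a sigmoid; the HED-style class-balancing weights are omitted for brevity, and all names are illustrative:

```python
import torch

def side_loss(preds, G, eps=1e-6):
    # preds: list of M side-output probability maps Pr(G(z)=1), each (B, 1, H, W)
    # G:     binary ground-truth saliency map, (B, 1, H, W)
    loss = 0.0
    for P in preds:                                   # sum over the M side-outputs, Eq. (4)
        bce = -(G * torch.log(P + eps)
                + (1 - G) * torch.log(1 - P + eps))   # per-pixel cross-entropy, Eq. (6)
        loss = loss + bce.sum()
    return loss
```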

3.5 Difference to Other Networks

Though it shares the same name, the proposed network differs significantly from the reverse attention network [33], which applies reverse attention to weight the prediction that is not associated with a target class, thereby amplifying the reverse-class response in confusing regions and helping the original branch make correct predictions. In our approach, the usage of reverse attention is totally different: it is used to erase the confident prediction of the deep layer, which guides the network to explore the missing object regions and details effectively.

There are also significant differences from other residual learning based architectures, e.g., the side-output residual network (SRN) [21] and the Laplacian reconstruction network (LRN) [29]. In SRN, the residual feature is learned from each side-output of VGG-16 directly, while in this paper it is learned after reverse attention, which is applied to guide residual learning. The main difference from LRN lies in the usage of the weight mask: in LRN it is used to weight the learned side-output features for boundary refinement, whereas we apply it before side-output feature learning as guidance. In addition, the weight mask in LRN is generated from the edge of the deep prediction, which misses some object regions due to its low resolution, while in this paper we use it to focus on all the undetected regions for saliency refinement, which not only refines object boundaries well but also highlights object regions more completely.

4 Experiments

4.1 Experimental Setup

The proposed network is built on top of the implementations of HED [5] and DSS [11], and trained through the publicly available Caffe [35] library. The whole network is trained end-to-end on full-resolution images and optimized by stochastic gradient descent. The hyper-parameters are set as follows: batch size (1), iter_size (10), momentum (0.9), weight decay (5e-4); the learning rate is initialized to 1e-8 and decreased by 10% when the training loss reaches a plateau; the number of training iterations is 10K. All these parameters were fixed during the following experiments. The source code will be released.
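For reference, the solver settings described above can be summarized as follows; this is only an illustrative summary in Python form, not the released Caffe solver file, and the field names are our own:

```python
# Illustrative summary of the SGD solver settings described above (not the released file).
solver = {
    "type": "SGD",
    "batch_size": 1,        # one full-resolution image per iteration
    "iter_size": 10,        # gradients accumulated over 10 iterations
    "momentum": 0.9,
    "weight_decay": 5e-4,
    "base_lr": 1e-8,        # decreased by 10% when the training loss plateaus
    "max_iter": 10000,
}
```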

We comprehensively evaluated our method on six representative datasets, including MSRA-B [36], HKU-IS [37], ECSSD [38], PASCAL-S [39], SOD [40], and DUT-OMRON [41], which contain 5000, 4447, 1000, 850, 300, and 5168 well-annotated images, respectively. Among them, PASCAL-S and DUT-OMRON are more challenging than the others. To guarantee a fair comparison with existing approaches, we use the same training sets as in [8, 10, 11, 42] and test all of the datasets with the same model. Data augmentation is implemented in the same way as [10, 11] to reduce the risk of over-fitting: the training set is enlarged by a factor of 2 through horizontal flipping.

Three standard and widely adopted metrics are used to evaluate the performance: the Precision-Recall (PR) curve, the F-measure, and the Mean Absolute Error (MAE). Pairs of precision and recall values are calculated by comparing the binarized saliency maps with the ground truth to plot the PR curve, where the thresholds range over [0, 255]. The F-measure is adopted to measure overall performance and is defined as the weighted harmonic mean of precision and recall:

$$\begin{aligned} F_{\beta }=(1+\beta ^{2})\frac{\mathrm{Precision}\times \mathrm{Recall}}{{\beta ^{2}}\mathrm{Precision}+\mathrm{Recall}}, \end{aligned}$$
(7)

where \(\beta ^{2}\) is set to 0.3 to emphasize precision over recall, as suggested in [43]. Only the maximum F-measure is reported here, to show the best performance a detector can achieve. Given the normalized saliency map S and ground truth G, the MAE score is calculated as their average per-pixel difference:

$$\begin{aligned} \mathrm{MAE}=\frac{1}{H\times W} \sum _{x=1}^{H}\sum _{y=1}^{W}\left| S(x,y)-G(x,y) \right| , \end{aligned}$$
(8)

where W and H are the width and height of the saliency map, respectively.
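As a reference, the following is a minimal NumPy sketch of how the maximum F-measure (Eq. (7)) and MAE (Eq. (8)) can be computed for a single image; the threshold grid, the \(\beta^{2}=0.3\) default, and the small stabilizing constants follow common convention and are not taken verbatim from the paper:

```python
import numpy as np

def max_f_measure(sal, gt, beta2=0.3, eps=1e-8):
    # sal: saliency map normalized to [0, 1]; gt: binary ground truth; both (H, W)
    best = 0.0
    for t in np.linspace(0, 1, 256):                  # thresholds corresponding to [0, 255]
        pred = sal >= t
        tp = np.logical_and(pred, gt > 0.5).sum()
        precision = tp / (pred.sum() + eps)
        recall = tp / ((gt > 0.5).sum() + eps)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)  # Eq. (7)
        best = max(best, f)
    return best

def mae(sal, gt):
    # Eq. (8): average absolute per-pixel difference
    return np.abs(sal - gt).mean()
```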

4.2 Ablation Studies

In this section, before comparing with the state-of-the-art methods, we first evaluate the influence of a key design option (the depth D) and the effectiveness of the proposed side-output residual learning and reverse attention.

Depth D. We conduct an experiment to see how the depth D affects the performance by varying it from 1 to 3. The results on PASCAL-S and DUT-OMRON are shown in Table 2. The best performance is obtained when D = 2; therefore, we set D to 2 in the following experiments.

Table 2. Performance comparison with different numbers of D.

Side-Output Residual Learning. To investigate the effectiveness of side-output residual learning, we separately evaluate the performance of each side-output prediction, shown in Table 3. We find that the performance is gradually improved by combining more side-output residual features.

Table 3. Performance comparison with different side-output predictions.

Reverse Attention. As illustrated in Fig. 5, the network locates object boundaries well with the help of reverse attention. Here, we perform a detailed comparison using F-measure and MAE scores, reported in Table 4. From the results, we make the following observations: (1) Without reverse attention, our performance is already similar to the state-of-the-art method DSS (without CRF-based post-processing), which indicates the large redundancy of that model. (2) After applying reverse attention, the performance is improved by a large margin; specifically, we obtain an average gain of 1.4% in F-measure and a 0.5% decrease in MAE, which clearly demonstrates its effectiveness.

4.3 Performance Comparison with State-of-the-art

We compare the proposed method with 10 state-of-the-art ones, including 9 recent CNN-based approaches, \(\hbox {DCL}^{+}\) [8], DHS [44], SSD [45], RFCN [9], DLS [23], NLDF [10], DSS and \(\hbox {DSS}^{+}\) [11], Amulet [13], UCF [14], and one top conventional approach, DRFI [42], where the symbol “+” indicates that the network includes CRF-based post-processing. Note that all the saliency maps of the above methods are produced by running the released source codes or are pre-computed by the authors, and ResNet-based methods are not included for fair comparison.

Fig. 6.

Visual comparisons with the existing methods in some challenging cases: complex scenes, low contrast, and multiple (small) salient objects. (Color figure online)

Fig. 7.

Comparison of precision-recall curves on different datasets.

Quantitative Evaluation. The results of the quantitative comparison with state-of-the-art methods are reported in Table 4 and Fig. 7. We can clearly observe that our approach significantly outperforms the competing methods in terms of both F-measure and MAE, especially on the challenging datasets (e.g., DUT-OMRON). For the PR curves, we also achieve comparable performance with the state-of-the-art except at high levels of recall \((\hbox {recall}>0.9)\). Even in comparison to the top method, \(\hbox {DSS}^{+}\), which uses a CRF-based post-processing step to refine the resolution, our approach still attains nearly identical (or better) performance across the board. It should also be pointed out that the existing methods used different training datasets and data augmentation strategies, which makes a strictly fair comparison impossible. Nevertheless, we still perform much better, which clearly shows the superiority of the proposed approach. We also believe that further performance gains can be obtained by using a larger training dataset with more augmented training images, which is beyond the scope of this paper.

Table 4. Quantitative comparison with state-of-the-art methods on six benchmark datasets. Each cell contains (from top to bottom) the max F-measure (higher is better) and the MAE (lower is better). The top two results are highlighted. “RA” denotes the proposed reverse attention, and “MK” is MSRA-10K [46]; the other abbreviations are the initials of the datasets mentioned in the paper. Note that the numbers of images listed here include the augmented ones.
Table 5. Average execution time comparison with other methods on ECSSD.

Qualitative Evaluation. We also show visual results on some representative images to exhibit the superiority of the proposed approach in Fig. 6, covering complex scenes, low contrast between the salient object and the background, and multiple (small) salient objects with diverse characteristics (e.g., size, color). Taking all the cases into account, it can be clearly observed that our approach not only highlights the salient regions correctly with fewer false detections but also produces sharp boundaries and coherent details (e.g., the mouth of the bird in the 4th row of Fig. 6). It is also interesting to note that the proposed method even corrects some false labeling in the ground truth, e.g., the left horn in the 7th row of Fig. 6. Nevertheless, we still obtain unsatisfactory results in some challenging cases; taking the last row of Fig. 6 as an example, segmenting all the salient objects completely is still very difficult for existing methods.

Execution Time. Finally, we investigate the efficiency of our method; all the experiments are conducted on a single NVIDIA TITAN Xp GPU for fair comparison. It takes less than 2 h to train our model; for comparison, DSS needs about 6 h. We also compare the average execution time with five other leading CNN-based methods on ECSSD. As can be seen from Table 5, our approach is much faster than all the competing methods. Therefore, considering both visual quality and efficiency, our approach is currently the best choice for real-time applications.

5 Conclusions

As a low-level pre-processing step, salient object detection has great applicability in various high-level tasks, yet it remains not well solved, mainly due to two aspects: low-resolution output and heavy model weight. In this paper, we presented an accurate yet compact deep network for efficient salient object detection. Instead of directly learning multi-scale saliency features in different side-output stages, we employ residual learning to learn side-output residual features for saliency refinement. Based on this, the resolution of the global saliency map generated by the deepest convolutional layer is improved gradually with very few parameters. We further propose reverse attention to guide such side-output residual learning in a top-down manner. Benefiting from it, our network learns more accurate residual features, which leads to significant performance improvement. Extensive experimental results demonstrate that the proposed approach performs favorably against state-of-the-art methods in both quantitative and qualitative comparisons, which makes it a better choice for real-world applications and gives it great potential for other end-to-end pixel-level prediction tasks. Nevertheless, the global saliency branch and the backbone (VGG-16) network still contain large redundancy, which we will further explore by introducing handcrafted saliency priors and learning from scratch in our future work.