
1 Introduction

Salient object detection aims to detect the most salient, attention-attracting object in a scene [1]. As an important topic in computer vision and neuroscience, it has been widely applied to various fields [2] such as image segmentation [3], retargeting [4, 5], compression [6], object recognition [7] and object class discovery [8, 9]. Over the past decades, many algorithms have been proposed for detecting salient regions in images, most of which first identify the salient subset of an image and then use it as a basis to segment the salient object from the whole image [10]. Generally, these approaches fall into two categories: bottom-up and top-down. Bottom-up approaches use pre-assumed priors such as contrast and texture priors to compute the saliency of an image, while top-down approaches usually use high-level information to guide the detection. In this paper, only bottom-up saliency detection methods are considered.

To the best of our knowledge, bottom-up saliency detection methods can be categorized into patch-based and graph-based methods. Patch-based saliency detection uses pixels and patches as the basic units for saliency calculation, while graph-based saliency detection regards each superpixel segmented from the image as a node of a graph and calculates saliency according to graph theory. Compared with patch-based methods, graph-based methods represent the spatial relationship between superpixels more easily and directly.

Among patch-based saliency detection studies, Chen et al. [11] provided a simple but powerful approach to compute the saliency of each sparsely represented color. Duan et al. [12] presented a saliency detection method based on spatially weighted dissimilarity, which integrates dissimilarities, dimensional space and the central bias to determine the contrast of a patch. Other patch-based methods have also made great contributions to detecting attention-attracting regions. However, their results are sensitive to small regions with strong contrast, especially at the edges of the image. For graph-based approaches, Yang et al. [13] proposed a graph-based manifold ranking algorithm to measure the difference between the four sides and the remaining parts of the image. Qin et al. [14] constructed a background-based map using color and space contrast with clustered boundary seeds to initialize saliency maps. Other graph-based methods also show outstanding performance in boundary suppression. However, in order to enhance local contrast, most graph-based methods only assign edges between superpixels and their adjacent nodes. As a result, some regions can be non-uniform, which significantly decreases the detection accuracy.

Although the methods mentioned above show good results in some scenes, many problems remain to be solved. The salient regions of an image often have great contrast in color, texture or other features. However, some parts of the background can also contain high-contrast regions, while the contrast of some salient parts can be relatively low. This phenomenon causes inconsistency and misdetection in saliency maps and cannot be removed by either graph-based or patch-based algorithms due to the nature of contrast comparison.

According to cognitive psychology, the depth of an object plays an indispensable role in human vision [15]. Recently, with the development of stereo vision, smooth and accurate depth maps can easily be acquired with a Kinect, a ZED camera or other stereo cameras. Because of this, an increasing number of methods [16,17,18], both patch-based and graph-based, introduce depth as an important cue for saliency detection. To take advantage of it, this paper repeatedly uses depth information as a basis to generate, filter and refine saliency maps. Firstly, a multi-feature-fused manifold ranking algorithm is proposed to generate coarse saliency maps. Compared with traditional manifold ranking [13], this algorithm is more robust to varying scenes because it takes depth and texture as references for ranking. Then, a self-adaptive segmentation method based on the depth distribution is used to filter out less salient areas. This step segments possible salient regions from the candidates based on the depth information of each superpixel. Finally, we propose a saliency-depth consistency check to remove non-uniform regions in the background and to enhance non-uniform regions on the salient object. In the first step of the algorithm, multiple features are taken as references for ranking, some of which are unnecessary and can cause erroneous highlighting or suppression. The erroneous saliency of these areas is refined in the third step because it is inconsistent with the corresponding depth values.

2 Proposed Method

Our algorithm consists of three main steps. Firstly, we generate coarse saliency maps through multi-feature-fused manifold ranking [13]. In this process, the color, depth and joint entropy of superpixels are taken as references for ranking. Then, a self-adaptive segmentation method based on the depth distribution is used to further suppress the saliency of the background. Finally, we apply a saliency-depth consistency check to remove small speckles in the background and to enhance non-uniform regions on objects. The main stages of our algorithm are illustrated in Fig. 1, where the result of every stage is shown to demonstrate how the algorithm optimizes the saliency map step by step.

Fig. 1. Illustration of the main stages of the proposed algorithm

2.1 Multi-feature-fused Manifold Ranking

Suppose \( X = \{ x_{1}, \cdots, x_{l}, x_{l+1}, \cdots, x_{n} \} \in R^{m \times n} \) is a data set in which some elements are labelled as queries and the rest are to be ranked according to their similarity to the queries. In our case, \( X \) is the set of superpixels segmented via the simple linear iterative clustering (SLIC) method [19]. \( f:X \to R^{n} \) is a ranking function which assigns a ranking score \( r_{i} \) to each superpixel \( x_{i} \). Then, a graph \( G = (V,E) \) is constructed, where the nodes \( V \) are the superpixels in \( X \). \( W \in R^{n \times n} \) denotes the adjacency matrix whose element \( w_{ij} \) stores the weight of the edge in \( E \) between superpixels \( i \) and \( j \). Finally, \( y = [y_{1}, \cdots, y_{n}]^{T} \) is an indicator vector in which \( y_{i} = 1 \) if \( x_{i} \) is a query and \( y_{i} = 0 \) otherwise.
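For concreteness, the following sketch shows one way the superpixel graph could be built with scikit-image's SLIC implementation; the parameter values and the helper name `build_superpixel_graph` are illustrative choices, not part of the original method.

```python
import numpy as np
from skimage.segmentation import slic

def build_superpixel_graph(rgb, n_segments=200, compactness=20.0):
    """Segment the image with SLIC and collect the adjacency of superpixels.

    Returns the label map and, for every superpixel, the set of neighbouring
    superpixel labels (the nodes V and edges E of the graph G).
    """
    labels = slic(rgb, n_segments=n_segments, compactness=compactness, start_label=0)
    n = labels.max() + 1
    neighbours = [set() for _ in range(n)]
    # Two superpixels are adjacent if their labels touch horizontally or vertically.
    h_pairs = np.stack([labels[:, :-1].ravel(), labels[:, 1:].ravel()], axis=1)
    v_pairs = np.stack([labels[:-1, :].ravel(), labels[1:, :].ravel()], axis=1)
    for a, b in np.vstack([h_pairs, v_pairs]):
        if a != b:
            neighbours[a].add(int(b))
            neighbours[b].add(int(a))
    return labels, neighbours
```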

The optimal ranking \( f^{*} \) is obtained by minimizing the following cost function:

$$ f^{*} = \arg\min_{f} \frac{1}{2}\left(\sum\limits_{i,j = 1}^{n} w_{ij} \left\| \frac{f_{i}}{\sqrt{d_{ii}}} - \frac{f_{j}}{\sqrt{d_{jj}}} \right\|^{2} + \mu \sum\limits_{i = 1}^{n} \left\| f_{i} - y_{i} \right\|^{2}\right) $$
(1)

The first and second terms of the cost function are the smoothness constraint and the fitting constraint, respectively. In this paper, an unnormalized Laplacian matrix is used to construct \( f^{*} \):

$$ f^{*} \, = \,(D - \frac{1}{\mu + 1}W)^{ - 1} y $$
(2)

where \( D = {\text{diag}}\{ d_{11}, \cdots, d_{nn} \} \) is the degree matrix of \( G \) with \( d_{ii} = \sum_{j} w_{ij} \), and \( \mu \) is the parameter which controls the balance between the smoothness and fitting constraints.

The traditional manifold ranking algorithm assigns only the Euclidean distance of color in CIELab space to the adjacency matrix and does not consider other features such as depth. In this paper, we define \( W \) as:

$$ w_{ij} = \left\{ {\begin{array}{*{20}c} {\exp ( - \frac{1}{{2\sigma_{c}^{2} }}c_{ij} - \frac{1}{{2\sigma_{d}^{2} }}d_{ij} - \frac{1}{{2\sigma_{h}^{2} }}h_{ij} )\;} & {j \in N_{i} } \\ 0 & {otherwise} \\ \end{array} } \right. $$
(3)

where \( N_{i} \) is the set of superpixels adjacent to \( i \), and \( \sigma_{c}, \sigma_{d}, \sigma_{h} \) are weighting factors for the corresponding features \( c_{ij}, d_{ij}, h_{ij} \), which are normalized to \( [0,1] \) before fusion. \( c_{ij} = \left\| {c_{i} - c_{j} } \right\| \) is the Euclidean distance of color in CIELab space and \( d_{ij} = \left\| {d_{i} - d_{j} } \right\| \) is the Euclidean distance of depth between two superpixels. \( h_{ij} \) is the joint entropy of two adjacent superpixels, which indicates the information content shared between two image patches: if the joint entropy \( h_{ij} \) is low, one image patch can be predicted from the other, which corresponds to low information [20]. The joint entropy is calculated from the joint probability distribution of the corresponding intensities of the two superpixels:

$$ h_{ij} = -\sum\limits_{a,b} {P_{ij} (a,b)\log P_{ij} (a,b)} $$
(4)

where \( P_{ij} (a,b) \) is the joint probability distribution of the intensities of the two superpixels:

$$ P_{ij} (a,b)\, = \,\frac{{H_{i,j} (a,b)}}{{\sum\limits_{a = 0}^{L - 1} {\sum\limits_{b = 0}^{M - 1} {H_{i,j} (a,b)} } }} $$
(5)

where \( H_{i,j} (a,b) \) is the joint histogram of the two image patches, and \( L \) and \( M \) are the numbers of intensity levels of the two patches. To calculate the joint entropy, the two superpixels need to have the same size; hence, we fit each pair of compared superpixels into their minimum enclosing rectangle before the calculation.
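As an illustration of Eqs. (4)-(5), the sketch below computes the joint entropy of two superpixels. It assumes a grayscale image with intensities in [0, 255], quantizes intensities into a fixed number of bins, and fills pixels of the enclosing rectangle that lie outside each superpixel with zeros; these details are assumptions where the text is silent.

```python
import numpy as np

def joint_entropy(gray, labels, i, j, bins=16):
    """Joint entropy h_ij of superpixels i and j (Eqs. 4-5), a minimal sketch.

    Both superpixels are placed in their common minimum enclosing rectangle
    (pixels outside each superpixel are filled with zeros) so that the two
    patches have the same size before the joint histogram is built.
    """
    mask = (labels == i) | (labels == j)
    ys, xs = np.nonzero(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    window = gray[y0:y1, x0:x1]
    patch_i = np.where(labels[y0:y1, x0:x1] == i, window, 0)
    patch_j = np.where(labels[y0:y1, x0:x1] == j, window, 0)
    # Joint histogram H_ij(a, b) of corresponding intensities, normalized to P_ij (Eq. 5).
    hist, _, _ = np.histogram2d(patch_i.ravel(), patch_j.ravel(),
                                bins=bins, range=[[0, 256], [0, 256]])
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))   # Eq. (4), with the conventional minus sign
```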

Specifically, we use the nodes on the image boundary as background seeds and construct four saliency maps using the nodes on the four sides of the image as queries, respectively. As shown in Fig. 1, the final saliency map of stage one is obtained by multiplying the four maps:

$$ S_{c} (i) = S_{t} (i) \times S_{b} (i) \times S_{l} (i) \times S_{r} (i) $$
(6)

where \( S_{t} (i), S_{b} (i), S_{l} (i) \) and \( S_{r} (i) \) are the saliency maps using the nodes on the top, bottom, left and right sides of the image as queries, respectively. Taking the top side as an example, \( S_{t} (i) \) can be calculated by:

$$ S_{t} (i) = 1 - \overline{f}^{ * } $$
(7)

where \( \overline{f}^{ * } \) denotes the ranking vector normalized to the range \( [0,1] \). The other three maps are calculated in the same way.
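A minimal sketch of stage one is given below: the edge weight of Eq. (3), the closed-form ranking of Eq. (2), and the integration of the four boundary-query maps in Eqs. (6)-(7). The adjacency matrix W is assumed to be built from the features described above, `top`, `bottom`, `left` and `right` are the index sets of boundary superpixels on each side, and the value of mu is illustrative.

```python
import numpy as np

def edge_weight(c_ij, d_ij, h_ij, sigma_c=0.1, sigma_d=0.1, sigma_h=0.1):
    """Edge weight of Eq. (3) for adjacent superpixels; features pre-normalized to [0, 1]."""
    return np.exp(-c_ij / (2 * sigma_c ** 2)
                  - d_ij / (2 * sigma_d ** 2)
                  - h_ij / (2 * sigma_h ** 2))

def ranking_scores(W, queries, mu=0.01):
    """Closed-form manifold ranking f* = (D - W/(mu+1))^{-1} y (Eq. 2), normalized to [0, 1]."""
    D = np.diag(W.sum(axis=1))
    y = np.zeros(W.shape[0])
    y[list(queries)] = 1.0                      # y_i = 1 for the query superpixels
    f = np.linalg.solve(D - W / (mu + 1.0), y)
    return (f - f.min()) / (f.max() - f.min() + 1e-12)

def coarse_saliency(W, top, bottom, left, right, mu=0.01):
    """Integrate the four boundary-query maps by multiplication (Eqs. 6-7)."""
    maps = [1.0 - ranking_scores(W, side, mu) for side in (top, bottom, left, right)]
    return maps[0] * maps[1] * maps[2] * maps[3]
```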

2.2 Saliency Optimization Using Depth Distribution

While most regions of salient objects are highlighted in the coarse saliency map computed by multi-feature-fused manifold ranking, some superpixels belonging to the background are highlighted as well. To solve this problem, we first segment the depth map using an adaptive threshold to roughly select the foreground superpixels. Since most foreground regions have been highlighted in the coarse saliency map, the threshold is set to the mean saliency of the coarse map: saliency values below the mean are set to zero and the others to one. We then use this binary map \( S_{cb} (i) \) to segment the depth map:

$$ D_{s} = S_{cb} \times D $$
(8)

where \( D_{s} \) is the segmented depth map and \( {\text{D}} \) is the original depth map.

In the physical world, salient objects usually have discrete depth values while background objects such as walls, grounds or the sky often have continuous depth values. In other words, the depth values of a salient object tend to aggregate within a small interval. Based on this observation, we use a nonlinear mapping function to refine the depth values and suppress the depth of the background:

$$ D_{s}^{\prime } \, = \,90\arctan (\frac{1}{20}D_{s} \, - \,T_{d} )\, + \,\frac{255}{2} $$
(9)

where \( D_{s}^{\prime } \) is the new depth map after mapping and \( T_{d} \) is the threshold that controls the enhanced interval; it can be obtained by choosing a reasonable distance around the peak of the depth histogram.

Finally, the saliency result of stage two, \( S_{c}^{\prime } \), is obtained by:

$$ S_{c}^{\prime } \, = \,D_{s}^{\prime } \, \times \,S_{c} $$
(10)

According to our observation, the depth value does not by itself reflect how salient an object is. An object with continuous depth, such as a wall or the ground, is usually closer to the camera than the salient object, yet its saliency is not that high. Thus, Eq. (9) guarantees that the use of depth in stage two does not alter the existing saliency of the object; depth is used only as a reference for segmentation, to separate foreground objects from the background.
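The sketch below summarizes stage two (Eqs. (8)-(10)). It assumes that the coarse saliency and the depth are stored as per-superpixel values and that the mapping of Eq. (9) is applied to those values directly; the paper does not state these details, so treat them as assumptions.

```python
import numpy as np

def depth_optimized_saliency(S_c, depth, T_d):
    """Stage two: depth-distribution based optimization (Eqs. 8-10), a sketch.

    S_c and depth are per-superpixel arrays; T_d controls the enhanced depth
    interval and is chosen from the peak of the depth histogram.
    """
    # Self-adaptive threshold: the mean saliency of the coarse map gives a binary mask.
    S_cb = (S_c >= S_c.mean()).astype(float)
    D_s = S_cb * depth                                       # Eq. (8)
    # Nonlinear mapping that enhances depth values in the interval around T_d (Eq. 9).
    D_s_prime = 90.0 * np.arctan(D_s / 20.0 - T_d) + 255.0 / 2.0
    return D_s_prime * S_c                                   # Eq. (10)
```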

2.3 Saliency-Depth Consistency Check

Graph-based saliency detection can be adversely affected when some background regions have relatively high contrast with their surrounding regions. Meanwhile, some parts of the salient region can be assigned low saliency even if their surrounding parts have high saliency. These two weaknesses result in inconsistency of salient regions, as Fig. 2 demonstrates: the saliency of some small regions in the background can be very high, while the saliency of some object regions, especially those embedded in the object center, can be low due to their low contrast to the four sides.

Fig. 2. Procedure of the saliency-depth consistency check

Since this problem cannot be solved by simply adjusting the feature weights in Eq. (3), a saliency-depth consistency check is introduced to further refine the saliency map. The idea is based on the hypothesis that if regions share the same (or similar) depth values, their saliency should be the same (or similar) as well. Accordingly, we first check whether the saliency of a superpixel is consistent with that of its surrounding regions. If not, we use a self-adaptive method to identify which of its surrounding superpixels are consistent with it according to their depth values. Finally, we replace the inconsistent saliency with the mean saliency of its consistent surrounding regions. We iterate over all superpixels until each of them is consistent with its surrounding regions. The procedure of this algorithm is shown in Fig. 2.

In detail, the saliency \( S_{c}^{\prime } (i) \) of superpixel \( i \) is defined to be consistent with its adjacent superpixels if:

$$ \frac{1}{a}\frac{{D_{i} }}{{D_{ij} }}\, \le \,\frac{{S_{c}^{\prime } (i)}}{{S_{c}^{\prime } (j)}}\, \le \,a\frac{{D_{i} }}{{D_{ij} }},\;\;j\, \in \,N_{i} $$
(11)

where \( a > 1 \) is the threshold factor, \( N_{i} \) is the set of superpixels adjacent to \( i \), and \( D_{i} \) and \( D_{ij} \) are the depth values of superpixel \( i \) and its adjacent superpixel \( j \), respectively. In other words, if there exists an adjacent superpixel for which Eq. (11) does not hold, the saliency of superpixel \( i \) is inconsistent and should be refined using its adjacent superpixels. The new saliency of superpixel \( i \) is calculated by:

$$ S_{c}^{\prime \prime } (i)\, = \,\left\{ {\begin{array}{*{20}c} {Mean(S_{c}^{\prime } (j))} & {if\;STD[\frac{{D_{i} }}{{D_{i1} }},\frac{{D_{i} }}{{D_{i2} }} \cdots \frac{{D_{i} }}{{D_{ij} }}]\, \le \,T} \\ {Mean(S_{c}^{\prime } (h))} & {if\;STD[\frac{{D_{i} }}{{D_{i1} }},\frac{{D_{i} }}{{D_{i2} }} \cdots \frac{{D_{i} }}{{D_{ij} }}]\, > \,T} \\ \end{array} } \right. $$
(12)

where \( {\text{Mean}}(S_{c}^{\prime } (j)) \) is the mean saliency of all superpixels adjacent to \( i \), and \( {\text{Mean}}(S_{c}^{\prime } (h)) \) is the mean saliency of the adjacent superpixels whose depths are considered consistent with that of \( i \). The threshold value \( T \) is obtained by minimizing \( \sigma \):

$$ \sigma = STD\left[\frac{D_{i}}{D_{i1}},\frac{D_{i}}{D_{i2}}, \cdots ,\frac{D_{i}}{D_{ih}}\right] + STD\left[\frac{D_{i}}{D_{ih}},\frac{D_{i}}{D_{i(h + 1)}}, \cdots ,\frac{D_{i}}{D_{ij}}\right],\;\;j \in N_{i} $$
$$ \frac{D_{i}}{D_{i1}} > \frac{D_{i}}{D_{i2}} > \cdots > \frac{D_{i}}{D_{ih}} > \frac{D_{i}}{D_{i(h + 1)}} > \cdots > \frac{D_{i}}{D_{ij}} $$
(13)
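A sketch of the consistency check (Eqs. (11)-(13)) follows. How T is derived from the minimization in Eq. (13) and which group of neighbours counts as depth-consistent are not fully specified in the text, so the sketch makes one plausible choice (the group whose depth ratios are closest to 1); the threshold factor a and the iteration limit are illustrative.

```python
import numpy as np

def consistency_check(S, depth, neighbours, a=1.5, max_iter=20):
    """Saliency-depth consistency check (Eqs. 11-13), a minimal sketch.

    S and depth hold per-superpixel values, neighbours[i] lists the superpixels
    adjacent to i, and a > 1 is the threshold factor (the value is illustrative).
    """
    S = S.copy()
    for _ in range(max_iter):
        changed = False
        for i, adj in enumerate(neighbours):
            if not adj:
                continue
            adj = np.array(sorted(adj))
            ratios = depth[i] / (depth[adj] + 1e-12)
            sal_ratios = S[i] / (S[adj] + 1e-12)
            # Eq. (11): i is consistent if every saliency ratio stays inside the depth band.
            if np.all((sal_ratios >= ratios / a) & (sal_ratios <= ratios * a)):
                continue
            if len(adj) < 2:
                S[i] = S[adj].mean()
                changed = True
                continue
            # Eq. (13): sort the depth ratios and split at the position h that minimizes
            # the summed standard deviation of the two groups; T is that minimum.
            order = np.argsort(-ratios)
            T, h = min((np.std(ratios[order[:k]]) + np.std(ratios[order[k:]]), k)
                       for k in range(1, len(order)))
            # Eq. (12): average over all neighbours when the ratios cluster tightly,
            # otherwise only over the group whose depths are closest to that of i.
            if np.std(ratios) <= T:
                S[i] = S[adj].mean()
            else:
                groups = (order[:h], order[h:])
                consistent = min(groups, key=lambda g: abs(ratios[g].mean() - 1.0))
                S[i] = S[adj[consistent]].mean()
            changed = True
        if not changed:
            break
    return S
```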

3 Experimental Results

To demonstrate the effectiveness of the proposed method, we evaluate it on an RGB-D data set [18] captured by a Microsoft Kinect, which contains 1000 RGB-D images selected from 5000 natural images. Compared with other 3D data sets, it has several important characteristics such as high diversity and low bias. Moreover, both the RGB and depth images in this data set cover variable color and depth ranges, which makes the salient object extremely difficult to extract by simple thresholding.

We compared our algorithm with eight state-of-the-art saliency detection algorithms: ASD [16], DMC [18], SF [21], GS [22], MF [13], OP [23], PCA [24] and SWD [25]. These algorithms cover both RGB and RGB-D as well as patch-based and graph-based methods. Their results were all generated with the original open-source code released by the authors.

3.1 Evaluation Measures

For a quantitative performance evaluation, we use the precision-recall curve (PRC) and the F-measure. Precision is the proportion of correctly detected salient pixels among all pixels in the detected regions, while recall is the ratio of correctly detected salient pixels to the number of salient pixels in the ground truth. As in other state-of-the-art work, the precision-recall curves are obtained by segmenting the saliency maps with thresholds from 0 to 255. The F-measure is an overall performance measure defined as the weighted harmonic mean of precision and recall:

$$ F_{\beta } = \frac{(1 + \beta^{2} ) \times Precision \times Recall}{\beta^{2} \times Precision + Recall} $$
(14)

where \( \beta^{2} = 0.3 \) is the weight that emphasizes precision.

To evaluate the pixel-wise difference between the detected saliency map and the ground truth, the mean absolute error (MAE) [21] is also used, which is defined as:

$$ MAE = \frac{1}{W \times H}\sum\limits_{x = 1}^{W} {\sum\limits_{y = 1}^{H} {|s(x,y) - GT(x,y)|} } $$
(15)

where \( W \) and \( H \) are the width and height of the image, \( s(x,y) \) is the saliency map and \( GT(x,y) \) is the ground-truth image.
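The evaluation protocol can be sketched as follows, assuming saliency maps scaled to [0, 255] and binary ground-truth masks. Reporting the maximum F-measure over all thresholds is only one possible summary; the paper does not state which thresholding rule it uses for the F-measure.

```python
import numpy as np

def evaluate(saliency, gt, beta2=0.3):
    """PR curve, F-measure (Eq. 14) and MAE (Eq. 15) for a single saliency map.

    saliency is scaled to [0, 255]; gt is a binary ground-truth mask.
    """
    gt = gt.astype(bool)
    precisions, recalls = [], []
    for t in range(256):                             # thresholds from 0 to 255
        binary = saliency >= t
        tp = np.logical_and(binary, gt).sum()
        precisions.append(tp / max(binary.sum(), 1))
        recalls.append(tp / max(gt.sum(), 1))
    p, r = np.array(precisions), np.array(recalls)
    f = (1 + beta2) * p * r / np.maximum(beta2 * p + r, 1e-12)   # Eq. (14)
    mae = np.abs(saliency / 255.0 - gt.astype(float)).mean()     # Eq. (15)
    return p, r, f.max(), mae
```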

3.2 Quantitative Comparison with Other Methods

We compare our algorithm with the eight state-of-the-art methods. The PR curves, F-measure and MAE of the eight methods as well as ours are shown in Fig. 3. In saliency detection, the higher the precision and recall values, the better the performance of a method. From this perspective, our algorithm performs best among all methods, despite some samples with high recall but relatively low precision. Although the precision-recall curve gives a qualitative view of the algorithms, a quantitative analysis is still needed; this is why we also report the F-measure, shown in Fig. 3(c). In that respect, our results exceed all the others, except that the recall is second only to OP. The MAE shown in Fig. 3(d) indicates that SF, DMC and OP have relatively low MAE compared with the other methods, while our result is still the best among all eight. In detail, our final result (MAE = 0.216276) is around 17% better than the second best, DMC (MAE = 0.260925). This is mainly due to the saliency-depth consistency check, which highlights the shadowed parts of objects and removes speckles in the background. In summary, as shown in Fig. 3, our algorithm is effective in terms of the PR curve on the RGB-D data set, with recall second only to OP, and outperforms all the state-of-the-art methods in both F-measure and MAE.

Fig. 3. Performance comparisons of the proposed algorithm: (a) PRC of the proposed algorithm versus ASD [16], MF [13], OP [23] and PCA [24], (b) PRC of the proposed algorithm versus DMC [18], GS [22], SF [21] and SWD [25], (c) F-measure scores, (d) MAE scores

It is worth noticing that the performance of the other methods measured by the PR curve, F-measure and MAE on this RGB-D data set is poorer than the results presented in their original papers, which indicates that this RGB-D data set is more challenging than data sets such as ASD and MSRA. Thus, from the evaluation on this RGB-D data set, it can be concluded that our algorithm can handle more complex scenes and is more robust to variations in natural environments.

3.3 Visual Comparison with Other Methods

A visual comparison of the results produced by our algorithm and the eight state-of-the-art methods is shown in Fig. 4. It is obvious that 3D saliency detection methods such as DMC suppress the saliency of the background more effectively, whereas graph-based 2D methods such as MF and OP are better at highlighting the salient region and make it easier to segment the salient object from the background. By taking advantage of both kinds of methods, our algorithm highlights the intact boundary of the object as well as suppresses complex interference in the background. Moreover, our algorithm provides a more internally consistent saliency map thanks to the saliency-depth consistency check, which makes further segmentation much easier.

Fig. 4. Visual comparison of results computed by different methods. The proposed algorithm has better inner-consistency than the others

4 Conclusion

In this paper, a novel three-stage saliency detection method using multi-feature-fused manifold ranking and depth optimization is presented. By fusing depth information and joint entropy with color contrast in CIELab space, our algorithm is more robust in varying natural environments. To reduce the saliency of isolated background regions and highlight the salient regions uniformly, a segmentation strategy as well as a saliency-depth consistency check is proposed. Moreover, our algorithm takes advantage of both graph-based and 3D saliency detection methods, so the detected salient objects have intact boundaries as well as internally consistent saliency. Experimental comparisons with eight state-of-the-art approaches on a challenging RGB-D data set show that the proposed method achieves desirable performance and produces more robust and visually favorable results in terms of precision-recall analysis, F-measure and MAE. Despite a few limitations, our algorithm clearly improves the quality of saliency detection in natural scenes and has promising application prospects.