
1 Introduction

Video object segmentation (VOS) [1,2,3,4] is the task of segmenting out primary objects from the background in a video sequence, where a ‘primary’ object refers to the most salient one in the sequence [5, 6]. In this regard, VOS is closely related to video salient object detection (SOD) [7,8,9,10], in which the objective is to detect salient objects in a video. Here, ‘salient’ objects are those that appear frequently in the video and have dominant color and motion features. VOS can be used in many vision applications, including object recognition, action recognition, and video summarization. However, it is challenging to delineate salient objects in videos without any user annotations. Also, various factors, such as background clutter, fast motion, and object occlusion, make VOS even more difficult.

Recent object instance segmentation techniques for still images achieve remarkable performance by employing convolutional neural networks (CNNs) [11,12,13,14]. On the other hand, many VOS techniques [15, 16] and video SOD techniques [9, 10, 17] focus on combining spatial and temporal results. However, the fusion processes often cause temporal inconsistency and may fail to segment out primary objects properly when either the spatial or temporal results are inaccurate. Also, although these techniques can effectively extract objects with dominant color and motion features, they do not consider the appearance frequency of an object in a video sequence. In other words, they may fail to detect primary objects that have less dominant features in each frame but appear frequently in the sequence.

In this work, we propose a novel approach to segment out foreground objects in a video sequence. First, we generate object instances in each frame. Then, we perform instance matching, by selecting one object instance from each frame, in order to construct the most salient object track. This is formulated as finding a clique in a complete k-partite graph [18] of object instances. Note that the clique should contain the instances over frames that correspond to an identical object. Thus, the instances should be similar to one another. However, finding the optimal clique with the maximal similarity weights is NP-hard. We hence develop the sequential clique optimization (SCO) process, which considers both the node energy and the edge energy. By repeating the SCO process, we can extract multiple salient object tracks. Finally, we convert these salient object tracks into VOS results in unsupervised and semi-supervised settings. Experimental results demonstrate that the proposed algorithm significantly outperforms the state-of-the-art VOS and video SOD algorithms on the DAVIS [19] and FBMS [20] datasets.

This work has the following major contributions:

  • We develop the SCO process that determines a suboptimal clique efficiently with time complexity \(O(NT^2)\), where T is the number of frames in a video and N is the number of instances in each frame.

  • The proposed algorithm can extract multiple primary objects effectively, whereas most conventional algorithms assume a single primary object.

  • The proposed algorithm provides remarkable performances on the DAVIS 2016, DAVIS 2017, and FBMS benchmark datasets.

2 Related Work

Video Object Segmentation: VOS attempts to separate foreground objects from the background in a video. Many VOS algorithms extract a single primary object. Papazoglou and Ferrari [1] generate motion boundaries using optical flows, construct a foreground model for the regions within the motion boundaries, and then use it to extract moving objects. Lee et al. [21] extract object proposals with the objectness scores from all frames. The proposals are clustered, and each cluster is ranked according to the objectness score. In [3, 22], object proposals are used to construct a locally connected graph, and the optimal path in the graph is determined to describe a primary object. Koh et al. [23] consider the recurrence property of a primary object to choose proposals. Also, saliency detection techniques are widely employed to estimate initial regions of a primary object [2, 4, 24, 25]. Wang et al. [24] adopt geodesic distances for saliency estimation and design an energy function to enforce the temporal smoothness of a primary object. Jang et al. [4] obtain foreground and background distributions by adopting the boundary prior, and dichotomize each region into the primary object or background class by minimizing a hybrid energy function. Yang et al. [25] use saliency maps to build an appearance model. Faktor and Irani [2] employ saliency maps as the initial distribution of random walk simulation.

Another approach to VOS is motion segmentation [20, 26,27,28,29,30], which clusters point trajectories. Shi and Malik [26] divide a video into motion segments using the normalized cuts. Brox and Malik [27] construct sparse long-term trajectories and cluster them. Ochs and Brox [28] convert sparse motion clusters into dense segmentation via the sparse-to-dense interpolation scheme. They also adopt the spectral clustering based on a higher-order motion model [29]. Fragkiadaki et al. [30] analyze trajectory discontinuities at object boundaries to improve the segmentation accuracy.

Recently, deep learning techniques have been developed for VOS [15, 16, 31,32,33,34,35]. Jain et al. [15] propose an end-to-end learning framework, which combines appearance and motion information to provide pixel-wise segmentation results for salient objects. Tokmakov et al. [16, 31] learn motion patterns with a fully convolutional network by employing synthetic video sequences. Deep learning models are also used in semi-supervised VOS, which requires manually annotated masks at the first frame to segment out target objects in subsequent frames [33,34,35]. Caelles et al. [33] fine-tune a CNN using user annotations to extract a target object. In [34, 35], propagation errors of segmentation masks are recovered by deep learning models.

Salient Object Detection: Early SOD algorithms [36,37,38,39,40,41,42] for still images are based on bottom-up models, which use global or local contrast of image features. Some algorithms [40,41,42] adopt a priori knowledge, such as the boundary prior that boundary regions tend to belong to the background and thus be less salient than center regions. Recently, deep learning techniques have been widely adopted for SOD. Many deep fully convolutional networks are trained in an end-to-end manner to yield pixel-wise saliency maps [43,44,45]. Also, an instance-level segmentation algorithm for salient objects is proposed in [14], which uses both saliency maps and object proposals.

Image SOD has been extended to video SOD [7, 9, 17, 46, 47]. Kim et al. [7] produce a spatiotemporal saliency map via the pixel-wise multiplication of spatial and temporal saliency maps. In the multiplication, adaptive weights can be used to yield more robust results. For instance, Fang et al. [46] fuse spatial and temporal maps using entropy-based uncertainty weights. Also, Yang et al. [17] generate six kinds of saliency maps and superpose them adaptively. Chen et al. [9] combine color and motion saliency maps based on the salient foreground model and the non-salient background model.

Some algorithms [8, 10, 48] exploit spatial and temporal features jointly to detect spatiotemporally salient regions. Wang et al. [48] propose the gradient flow field to merge intra-frame boundaries and inter-frame features. Kim et al. [8] exploit spatial and temporal features in the random walk with restart framework. Wang et al. [10] design two networks for static saliency and dynamic saliency, respectively. They feed the output of the static network into the dynamic network to obtain a saliency map.

3 Proposed Algorithm

We segment primary objects in a sequence of video frames \(\mathcal{I}=\{I_1,\ldots ,I_{T}\}\). The output is the corresponding sequence of pixel-wise maps, which locate the primary objects in the frames. Figure 1 shows an overview of the proposed algorithm. First, we generate object instances in each frame. Second, we construct a complete k-partite graph using the set of object instances. Third, we extract salient object tracks by finding cliques in the graph and convert the tracks into VOS results.

Fig. 1. An overview of the proposed algorithm.

3.1 Generating Object Instances

To detect object instances without manual annotations, we employ the instance-aware semantic segmentation method FCIS [11]. FCIS measures the category-wise detection scores of region proposals, generated by the network in [49], and segments out foreground regions from the proposals. We choose only the proposals whose detection scores are higher than 0.5 and declare the corresponding foreground regions as object instances. For each proposal, we use the maximum of the category-wise scores, since the purpose of the proposed algorithm is to segment out salient objects in videos regardless of their categories. Figure 2(b) shows object instances in a frame. The set of detected object instances in frame \(I_t\) is denoted by \(\mathcal{O}_t=\{o_{t,\theta } \, | \, \theta \in {\mathbb {N}}_{N_t}\}\), where \({\mathbb {N}}_m=\{1,2,\ldots ,m\}\) is the finite index set and \(N_t\) is the number of object instances in frame \(I_t\). The \(\theta \)th object instance \(o_{t,\theta }\) in frame \(I_t\) has two attributes: saliency score \(s_{t,\theta }\) and feature vector \({\mathbf{f}}_{t,\theta }\).
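For illustration, this filtering step can be sketched as follows; the array layout and function name are hypothetical, since FCIS's actual interface differs:

```python
import numpy as np

def select_instances(category_scores, masks, threshold=0.5):
    """Keep proposals whose maximum category-wise score exceeds the threshold.

    category_scores: (P, C) array of detection scores for P region proposals.
    masks: list of P binary foreground masks segmented from the proposals.
    Returns the object instances and their class-agnostic scores.
    """
    best = category_scores.max(axis=1)   # class-agnostic score per proposal
    keep = best > threshold              # declare instances above 0.5
    return [m for m, k in zip(masks, keep) if k], best[keep]
```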

To determine the saliency score \(s_{t,\theta }\), we simply estimate a foreground distribution based on the boundary prior. We over-segment a frame into superpixels using [50], and construct a 4-ring graph of superpixels. We compute edge weights by summing up the average LAB color difference and the average optical flow difference between two superpixels. We then obtain the background distribution using the random walk with restart (RWR) simulation [51], where only superpixels at the frame boundary are assigned nonzero restart probabilities. Finally, we invert the background distribution to yield the foreground distribution, as illustrated in Fig. 2(c). We then compute \(s_{t,\theta }\) by averaging the foreground probabilities within the instance \(o_{t,\theta }\).
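The RWR simulation can be sketched as below, given the superpixel graph as an edge-weight matrix; the restart probability and iteration count are illustrative assumptions, not values specified above:

```python
import numpy as np

def rwr_background(W, boundary_mask, restart=0.2, iters=100):
    """Background distribution via random walk with restart (RWR).

    W: (S, S) symmetric edge-weight matrix of the 4-ring superpixel graph.
    boundary_mask: boolean (S,) marking superpixels on the frame boundary,
        the only nodes with nonzero restart probabilities.
    """
    P = W / W.sum(axis=1, keepdims=True)      # row-stochastic transitions
    q = boundary_mask / boundary_mask.sum()   # restart distribution
    r = q.copy()
    for _ in range(iters):                    # iterate to the steady state
        r = (1 - restart) * P.T @ r + restart * q
    return r

# The foreground distribution inverts the background distribution:
#   fg = bg.max() - bg; fg /= fg.sum()
# s_{t,theta} then averages fg over the region covered by o_{t,theta}.
```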

Also, we construct the feature vector \(\mathbf{f}_{t,\theta }\) using the bag-of-visual-words (BoW) approach [52]. We quantize the LAB colors, extracted from the 40 training sequences in the VSB100 dataset [53], into 300 codewords using the K-means algorithm. We then construct the histogram of the codewords for the pixels within \(o_{t,\theta }\), and normalize it into the feature vector \(\mathbf{f}_{t,\theta }\).
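A sketch of this feature extraction, using scikit-learn's KMeans for the offline codebook (an implementation assumption; any K-means would do):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_lab, n_words=300):
    """Offline: quantize training LAB colors into codewords via K-means."""
    return KMeans(n_clusters=n_words, n_init=1).fit(train_lab)

def bow_feature(codebook, instance_lab, n_words=300):
    """Normalized codeword histogram of an instance's pixels (f_{t,theta})."""
    words = codebook.predict(instance_lab)    # nearest codeword per pixel
    hist = np.bincount(words, minlength=n_words).astype(float)
    return hist / hist.sum()                  # L1-normalize
```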

Fig. 2. Object instance generation in the “Boxing-fisheye” sequence: (a) input frame, (b) object instances, and (c) foreground distribution.

3.2 Finding Salient Object Tracks

Problem: The set of all object instances, \(\mathcal{V} = \mathcal{O}_1\cup \mathcal{O}_2\cup \cdots \cup \mathcal{O}_T\), includes non-salient objects, as well as salient ones. From \(\mathcal{V}\), we extract as many salient objects as possible, while excluding non-salient ones, assuming that a salient object should have dominant features in each frame and appear frequently through the sequence. To this end, we construct the most salient object track, by selecting an object instance in each frame, which corresponds to one identical salient object. Then, after removing all instances in the track from \(\mathcal V\), we repeat the process to extract the next salient track, and so on.

Sequential Clique Optimization: Using the set of object instances \(\mathcal{V}=\mathcal{O}_1\cup \mathcal{O}_2\cup \cdots \cup \mathcal{O}_T\), we construct a complete k-partite graph \(\mathcal{G}=(\mathcal V, \mathcal E)\). Thus, \(\mathcal{V}\) becomes the node set, and each object instance becomes a node in the graph \(\mathcal{G}\). Since \(\mathcal{O}_t \cap \mathcal{O}_\tau = \emptyset \) for \(t \ne \tau \), the subsets \(\mathcal{O}_1, \mathcal{O}_2, \ldots , \mathcal{O}_T\) form a partition of \(\mathcal{V}\). Also, we define the edge set as \(\mathcal{E} = \{(o_{t,i},o_{\tau ,j}) \, | \, t\ne \tau \}\). In other words, every pair of object instances in different frames is connected by an edge in \(\mathcal{E}\), whereas two instances in the same frame are not adjacent in the graph \(\mathcal{G}\). As a result, \(\mathcal{G}\) is complete k-partite [18], where \(k=T\). For example, Fig. 3(a) illustrates the complete k-partite graph for four frames, i.e. \(k=4\). We assign a weight to edge \((o_{t,i},o_{\tau ,j})\) by

$$\begin{aligned} w(o_{t,i},o_{\tau ,j}) = \exp \left( -\frac{d_{\chi ^2}(\mathbf{f}_{t,i},\mathbf{f}_{\tau ,j})}{\sigma ^2}\right) \end{aligned}$$
(1)

where \(\sigma ^2=0.01\) is a scaling parameter and \(d_{\chi ^2}\) denotes the chi-square distance.
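In code, (1) can be evaluated as below, assuming the standard chi-square distance \(d_{\chi ^2}(\mathbf{x},\mathbf{y})=\frac{1}{2}\sum _i (x_i-y_i)^2/(x_i+y_i)\) for normalized histograms:

```python
import numpy as np

def edge_weight(f1, f2, sigma2=0.01):
    """Edge weight in Eq. (1): Gaussian kernel of the chi-square distance."""
    d = 0.5 * np.sum((f1 - f2) ** 2 / (f1 + f2 + 1e-12))  # chi-square distance
    return np.exp(-d / sigma2)                            # w(o_{t,i}, o_{tau,j})
```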

Fig. 3. Illustration of finding salient object tracks over four frames in “Boxing-fisheye”: (a) complete 4-partite graph, (b) 1st salient object track \(\varTheta _1\), and (c) 2nd salient object track \(\varTheta _2\).

To extract the most salient object, we perform the instance matching by selecting one object instance (one node) from each frame (each node subset) \(\mathcal{O}_t\). This process of finding an object track is equivalent to finding a clique in the graph \(\mathcal{G}\). Notice that selecting one node from each frame satisfies the condition of a clique [18]: every pair of nodes within the clique are adjacent. In the clique, which represents the track of an identical object in the video sequence, the features of the member nodes should be similar to one another. Therefore, we determine the clique to maximize the sum of the edge weights. Let \(\varTheta =\{\theta _t\}_{t=1}^T\) denote a clique, which is represented by the sequence of the node indices in the clique. Here, \(\theta _t\in {\mathbb {N}}_{N_t}\) is the index of the selected node from the tth frame \(\mathcal{O}_t\). Then, we define the similarity \(E_\mathrm{similarity}(\varTheta )\) of clique \(\varTheta \) as

$$\begin{aligned} E_\mathrm{similarity}(\varTheta )=\sum _{t=1}^T\sum _{\tau =1 , \tau \ne t}^Tw(o_{t,\theta _t},o_{\tau ,\theta _\tau }), \end{aligned}$$
(2)

which is the sum of all edge weights in \(\varTheta \). We attempt to maximize the similarity, assuming that the features of an identical object do not change drastically over frames.

Also, object instances in a clique, representing a salient object track, should have high saliency scores. We hence define the saliency \(E_\mathrm{saliency}(\varTheta )\) of clique \(\varTheta \) as

$$\begin{aligned} E_\mathrm{saliency}(\varTheta )=\sum _{t=1}^Ts_{t,\theta _t}. \end{aligned}$$
(3)

We attempt to find the maximal weight clique \(\varTheta ^*\) that maximizes the similarity energy:

$$\begin{aligned} \varTheta ^* = \arg \max _\varTheta E_\mathrm{similarity}(\varTheta ) \end{aligned}$$
(4)

subject to the constraint that the saliency \(E_\mathrm{saliency}(\varTheta )\) is also high. However, even the unconstrained problem in (4) is NP-hard [54, 55]. There are \(N_1 \times N_2 \times \cdots \times N_T\) possible cliques, which makes exhaustive search infeasible. Some approximation methods, e.g. multi-greedy heuristics [56], local search [55], and binary integer programming [57], have been developed to obtain suboptimal cliques in complete k-partite graphs. However, these methods are still computationally expensive and do not consider the node energy (i.e. the saliency \(E_\mathrm{saliency}\) in this work). Instead, we develop an efficient optimization technique, called SCO, which finds a clique by considering both the node energy \(E_\mathrm{saliency}\) and the edge energy \(E_\mathrm{similarity}\).

In SCO, we first initialize the clique \(\varTheta ^{(0)}\) to maximize the saliency \(E_\mathrm{saliency}\) in (3). Specifically, the tth element in \(\varTheta ^{(0)}\) is determined by

$$\begin{aligned} \theta _t^{(0)} = \arg \max _{\theta \in {\mathbb {N}}_{N_t}}s_{t,\theta }. \end{aligned}$$
(5)

Then, at iteration \(\kappa \), we update \(\theta _t^{(\kappa )}\), by selecting the node that is the most similar to the nodes in the other frames,

$$\begin{aligned} \theta _t^{(\kappa )} = \arg \max _{\theta \in {\mathbb {N}}_{N_t}}\sum _{\tau =1,\tau \ne t}^Tw(o_{t,\theta },o_{\tau ,\theta _\tau }), \end{aligned}$$
(6)

and then set \(\theta _t\) to be \(\theta _t^{(\kappa )}\) for each t sequentially from 1 to T. We repeat this sequential update of the nodes in all frames, until \(\varTheta ^{(\kappa )}=\{\theta _t^{(\kappa )}\}_{t=1}^T\) is unaltered from \(\varTheta ^{(\kappa -1)}=\{\theta _t^{(\kappa -1)}\}_{t=1}^T\). This process is guaranteed to converge, since \(E_\mathrm{similarity}(\varTheta ^{(\kappa )})\) is a monotonically non-decreasing function of \(\kappa \) and is bounded above (there are only finitely many cliques). To summarize, SCO initializes the clique to maximize the saliency \(E_\mathrm{saliency}\) and then refines it iteratively to achieve a local maximum of \(E_\mathrm{similarity}\). Thus, at the initialization, the clique consists of salient object instances over frames, which may not represent an identical object. However, as the iteration goes on, the clique converges to a salient object track, in which the nodes represent an identical object and thus exhibit high similarity weights in general. Algorithm 1 summarizes the proposed SCO technique. In most cases, fewer than 10 iterations are required for convergence.

Let \(\varTheta _1\) denote the most salient object track, obtained by this SCO process. To extract the next track \(\varTheta _2\), we exclude the nodes in \(\varTheta _1\) from \(\mathcal G\) and perform SCO again. This is repeated to yield the set of tracks, \(\{\varTheta _1,\varTheta _2,\ldots ,\varTheta _M\}\), until no node remains in \(\mathcal G\). In general, if \(p<q\), \(\varTheta _p\) is more salient than \(\varTheta _q\). Thus, the subscript p in \(\varTheta _p\) is the saliency rank of the track. Figures 3(b) and (c) illustrate the first two tracks \(\varTheta _1\) and \(\varTheta _2\).

Algorithm 1. Sequential clique optimization (SCO).
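For concreteness, a minimal Python sketch of Algorithm 1 is given below; it assumes the per-frame saliency scores and a function evaluating the pairwise weights in (1), and is an illustration under those assumptions rather than the authors' implementation:

```python
import numpy as np

def sco(saliency, weight):
    """Sequential clique optimization (Algorithm 1).

    saliency: list of T arrays; saliency[t][i] is the score s_{t,i}.
    weight(t, i, tau, j): edge weight w(o_{t,i}, o_{tau,j}) from Eq. (1).
    Returns the clique as one selected node index per frame.
    """
    T = len(saliency)
    # Initialization (Eq. 5): pick the most salient instance in each frame.
    theta = [int(np.argmax(s)) for s in saliency]
    while True:
        prev = list(theta)
        for t in range(T):  # sequential update (Eq. 6), t = 1, ..., T
            scores = [sum(weight(t, i, tau, theta[tau])
                          for tau in range(T) if tau != t)
                      for i in range(len(saliency[t]))]
            theta[t] = int(np.argmax(scores))
        if theta == prev:   # converged: E_similarity is non-decreasing
            break           # and bounded, so the loop terminates
    return theta
```

In practice, the pairwise weights would be precomputed once, since the same weights are reused across iterations.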

Postprocessing. The track selection is greedy in the sense that, if an object instance is mistakenly included in a track \(\varTheta _p\), it cannot be included in a later track \(\varTheta _q\) even though it indeed belongs to \(\varTheta _q\). To alleviate this problem, we perform postprocessing to maximize the sum of the similarities

$$\begin{aligned} \sum _{m=1}^M E_\mathrm{similarity}(\varTheta _m) \end{aligned}$$
(7)

as follows. At each frame \(I_t\), we match object instances in \(\mathcal{O}_t\) to the tracks in \(\{\varTheta _p\}_{p=1}^{M}\). The matching cost \(C(o_{t,i},\varTheta _p)\) between an instance \(o_{t,i}\) and a track \(\varTheta _p\) is defined as the sum of the feature distances from \(o_{t,i}\) to all object instances in \(\varTheta _p\), except for the instance at the same frame t. After computing the matching costs, we find the optimal matching pairs using the Hungarian algorithm [58], and update the tracks to include the matched instances. This is performed for all frames. As a result, we obtain the set of the refined salient object tracks \(\{\tilde{\varTheta }_1,\tilde{\varTheta }_2,\ldots ,\tilde{\varTheta }_M\}\).
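A sketch of this per-frame re-matching, using SciPy's Hungarian solver (linear_sum_assignment); the data layout below is assumed for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def refine_frame(t, features_t, track_feats, chi2):
    """Re-match the instances of frame t to the tracks (postprocessing).

    t: frame index.
    features_t: feature vectors of the instances detected in frame t.
    track_feats: list of M tracks, each a list of T member feature vectors.
    chi2(f, g): chi-square distance between two feature vectors.
    Returns a dict {track index p: matched instance index i at frame t}.
    """
    # Matching cost C(o_{t,i}, Theta_p): sum of feature distances from
    # instance i to every member of track p, except the member at frame t.
    C = np.array([[sum(chi2(f, g) for k, g in enumerate(track) if k != t)
                   for track in track_feats]
                  for f in features_t])
    rows, cols = linear_sum_assignment(C)   # minimum-cost assignment
    return {int(p): int(i) for i, p in zip(rows, cols)}
```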

Disappearance Detection. Also, we detect disappearing events for each refined salient object track. When an object disappears or is fully occluded at some frames, noisy objects are selected at those frames. Given a refined salient object track \(\tilde{\varTheta }=\{\tilde{\theta }_t\}_{t=1}^T\), we determine whether to discard \(o_{t,\tilde{\theta }_t}\) at frame t from \(\tilde{\varTheta }\). To this end, for each \(\tau \ne t\), we compare the weight \(w(o_{\tau ,\tilde{\theta }_\tau },o_{t,\tilde{\theta }_t})\) against the average weight. Specifically, we count the number of object instances \(o_{\tau ,\tilde{\theta }_\tau }\) for \(\tau \ne t\), which satisfy \( w(o_{\tau ,\tilde{\theta }_\tau },o_{t,\tilde{\theta }_t}) < \frac{1}{T-2}\sum _{k=1,k\ne t,\tau }^Tw(o_{\tau ,\tilde{\theta }_\tau },o_{k,\tilde{\theta }_k}) \). If the number is larger than 0.7T, we declare \(o_{t,\tilde{\theta }_t}\) to be noisy and discard it.
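A sketch of this test, assuming the pairwise weights between the track's members are precomputed in a matrix:

```python
import numpy as np

def is_noisy(t, W, ratio=0.7):
    """Check whether the track member at frame t should be discarded.

    W: (T, T) matrix with W[tau, k] = w(o_{tau,theta_tau}, o_{k,theta_k});
       the diagonal entries are ignored.
    """
    T = W.shape[0]
    count = 0
    for tau in range(T):
        if tau == t:
            continue
        # Average weight from o_tau to the other members (k != t, tau).
        avg = (W[tau].sum() - W[tau, t] - W[tau, tau]) / (T - 2)
        if W[tau, t] < avg:     # weaker than average affinity
            count += 1
    return count > ratio * T    # declare noisy if more than 0.7T members agree
```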

3.3 Segmentation Results

Using the object tracks in \(\{\tilde{\varTheta }_1,\tilde{\varTheta }_2,\ldots ,\tilde{\varTheta }_M\}\), we generate a pixel-wise segmentation map for each frame, which delineates primary objects in the frame. We propose four schemes to yield final segmentation results: Proposed-F, Proposed-O, and Proposed-M for unsupervised VOS and Proposed-S for semi-supervised VOS.

  • Proposed-F: The first track \(\tilde{\varTheta }_1\) extracts the primary object in a video in general. Thus, Proposed-F selects \(\tilde{\varTheta }_1\). However, it may fail to extract spatially connected objects. For example, given a motorbike and its rider, it may detect only one of them. Therefore, Proposed-F additionally picks another salient object track \(\tilde{\varTheta }_p\), only when \(\tilde{\varTheta }_1\) and \(\tilde{\varTheta }_p\) are spatially adjacent in most frames in a video.

  • Proposed-O: The aforementioned Proposed-F is an offline approach, which constructs the global T-partite graph for an entire video. In contrast, Proposed-O is an online approach, which uses the t-partite graph for frames \(1, \ldots , t\) to obtain the segmentation result for the current frame t. In other words, Proposed-O uses the information in the current and past frames only to yield the segmentation result.

  • Proposed-M: To handle multiple primary objects, which are not spatially connected, we choose multiple tracks from \(\{\tilde{\varTheta }_1,\tilde{\varTheta }_2,\ldots ,\tilde{\varTheta }_M\}\). We compute the mean saliency score of the object instances in each track, and discard the tracks whose mean scores are lower than a pre-specified threshold \(\delta \). We fix \(\delta =0.1\) in all experiments (a sketch of this selection follows the list).

  • Proposed-S: Proposed-S is for semi-supervised VOS: it takes the ground-truth segment in the first frame and fixes it in SCO. Proposed-S is based on the online approach, Proposed-O. Moreover, we warp the segmentation result to the next frame using optical flow, and then add the warped segment to the set of object instances.
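As a sketch of the Proposed-M selection referred to above (data layout assumed for illustration):

```python
def select_tracks(tracks, saliency, delta=0.1):
    """Proposed-M: keep tracks whose mean instance saliency exceeds delta.

    tracks: list of tracks, each a list of (t, theta_t) index pairs.
    saliency: saliency[t][theta] gives the score s_{t,theta}.
    """
    def mean_score(track):
        return sum(saliency[t][th] for t, th in track) / len(track)
    return [trk for trk in tracks if mean_score(trk) >= delta]
```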

Finally, we improve the segmentation qualities of object instances in the selected tracks using a two-class Markov random field (MRF) optimizer in [59].

3.4 Complexity Analysis

Let us analyze the computational complexity of the proposed SCO process. For the convenience of analysis, we fix the number of object instances in each frame to N. Note that SCO has two steps: initialization and update. In the initialization, \(N-1\) comparisons are made to find the maximum saliency in each frame, requiring O(NT) comparisons in total. In the update step, \(N(T-2)\) additions and \((N-1)\) comparisons are performed for each frame in one iteration. Thus, the update step demands \(O(KNT^2)\) complexity, where K is the number of iterations and is restricted to be less than 10 in this work.

We repeat the SCO process N times to extract N object tracks. Thus, the complexity is \(O(KN^2T^2)\). Finally, in the postprocessing, the Hungarian matching of \(O(N^3)\) complexity is performed for each frame. Hence the complexity of the postprocessing is \(O(N^3T)\). The overall complexity of the proposed algorithm can be approximated to \(O(KN^2T^2)\), since T is larger than N in general. This complexity is significantly lower than the binary integer program in [57], which requires \(O(2^{T^2N^2})\) complexity in the worst case because of the depth-first node selection [60]. Moreover, the proposed SCO yields better segmentation performance than [57], as will be shown in Sect. 4.

Table 1. Comparison with the conventional VOS algorithms on the DAVIS 2016 dataset. The best results are boldfaced, and the second best ones are underlined.

4 Experimental Results

Given a video sequence, the proposed algorithm can yield a segmentation mask for each frame, which delineates primary objects at the pixel level. Hence, we compare the proposed algorithm with the conventional VOS algorithms in [1, 2, 4, 15, 16, 21, 23, 24, 28, 30, 33, 35, 61, 62] and the conventional SOD algorithms in [8, 10, 24, 48, 63,64,65], which also extract primary or salient objects from each frame at the pixel level. For the comparison, we use the DAVIS dataset [19] and the FBMS dataset [20].

DAVIS Dataset [19]: It has two versions, DAVIS 2016 and DAVIS 2017. DAVIS 2016 is a benchmark to evaluate VOS algorithms. It consists of 50 video sequences, which are divided into training and test sequences. We assess the proposed algorithm using both the training and test sequences. Each sequence contains a single object or spatially connected objects, e.g. a motorbike and its rider, which appear repeatedly in the sequence. Such spatially connected objects are regarded as a single primary object.

DAVIS 2016 was extended to DAVIS 2017, which is for semi-supervised VOS. It is composed of the train-validation, test-develop, and test-challenge subsets, which contain 90, 30, and 30 videos, respectively. We evaluate the proposed algorithm on the train-validation set. Note that DAVIS 2017 is more challenging than DAVIS 2016, since multiple objects, which are not connected to one another, correspond to different targets. The union region of those multiple instances is regarded as the ground-truth for evaluating the unsupervised algorithms, while the individual instance annotations are used for evaluating the semi-supervised algorithms.

FBMS Dataset: The FBMS dataset [20] is for segmenting out moving objects in videos, where multiple objects are labeled as the ground-truth. It consists of 59 video sequences, which are split into 29 training and 30 test sequences. We assess the proposed algorithm using the test sequences.

Table 2. Comparison with the conventional VOS algorithms on the DAVIS 2017 dataset. The best results are boldfaced, and the second best ones are underlined.
Fig. 4. VOS results of Proposed-M on the DAVIS 2017 dataset: (a) the ground-truth of primary objects in the 1st frames, (b) the detection results in the 1st frames, and (c)\(\sim \)(f) the detection results in the subsequent frames. From top to bottom, “Dog-agility,” “Boxing-fisheye,” “Dog-gooses,” “Kid-football,” and “Sheep.” Detected regions are depicted in red.

4.1 Comparison with Video Object Segmentation Techniques

We compare the proposed algorithm with the conventional VOS algorithms, by employing the metrics of the region similarity \(\mathcal J\) and the contour accuracy \(\mathcal F\) [19]. The region similarity \(\mathcal J\) is defined as the intersection over union (IoU) ratio \(\mathcal{J}=\frac{|\mathcal{S}_\mathrm{p}\cap \mathcal{S}_\mathrm{gt}|}{|\mathcal{S}_\mathrm{p}\cup \mathcal{S}_\mathrm{gt}|}\), where \(\mathcal{S}_\mathrm{p}\) and \(\mathcal{S}_\mathrm{gt}\) are an estimated segment and the ground-truth. Also, the contour accuracy \(\mathcal F\) is the F-measure, which is the harmonic mean of contour precision and recall rates. In these metrics, there are two statistics: ‘Mean’ measures the average score and ‘Recall’ denotes the proportion of the frames whose scores are higher than 0.5.

Evaluation on DAVIS 2016 Dataset: Table 1 compares the results on the DAVIS 2016 dataset. The scores of the conventional algorithms are from the DAVIS dataset website [19]. All three versions of the proposed algorithm (Proposed-F, Proposed-O, and Proposed-M) outperform all conventional algorithms. Note that even the online version Proposed-O performs better than the state-of-the-art ARP [23], although ARP is an offline approach. Also, Proposed-F surpasses all conventional algorithms significantly, e.g. by convincing margins of 3.3% and 5.9% over ARP in terms of Mean \(\mathcal J\) and Mean \(\mathcal F\). Especially, Proposed-F yields a very high recall score of the region similarity \(\mathcal J\), which is almost as high as 95%. As compared with Proposed-F, Proposed-M provides lower performances, since it selects non-primary objects, as well as primary ones, in some videos.

Evaluation on DAVIS 2017 Dataset: Table 2 compares the proposed algorithm with the conventional unsupervised algorithms [1, 2, 4, 23] and semi-supervised ones [33, 62] on DAVIS 2017. The train-validation set and validation set are used for evaluating unsupervised and semi-supervised algorithms, respectively. We compute the results of [1, 2, 4, 23] using the source codes, provided by the respective authors, and take the scores of [33, 62] from the DAVIS website [19]. All unsupervised algorithms yield lower scores on DAVIS 2017 than on DAVIS 2016, since DAVIS 2017 is more challenging due to multiple primary objects. Nevertheless, the three versions of the proposed algorithm provide the best results in all metrics. Especially, Proposed-M undergoes the least degradation in the performance. This indicates that the proposed algorithm extracts multiple primary objects more reliably than the conventional ones. Figure 4 presents examples of segmentation results.

Also, notice that Proposed-S outperforms the conventional semi-supervised algorithms [33, 62], even though [33, 62] involve fine-tuning, which incurs high computational complexity.

Table 3. Comparison of the Mean \(\mathcal J\) scores on the FBMS dataset. The best result is boldfaced, and the second best one is underlined.
Table 4. Comparison of the proposed SCO with the conventional matching techniques [3, 57] on the DAVIS 2016 and DAVIS 2017 datasets.

Evaluation on FBMS Dataset: Table 3 compares the Mean \(\mathcal J\) scores on the FBMS dataset. The scores of the conventional algorithms are from [23]. Compared with the state-of-the-art algorithm [23], Proposed-M improves the performance by 8.8%.

Efficacy of SCO: We analyze the efficacy of the proposed SCO. Note that SCO yields multiple salient object tracks, which are used to produce VOS results. We compare SCO with the conventional matching techniques for primary object segmentation [3] and multiple object tracking [57]. More specifically, given object instances from FCIS [11], we obtain multiple object tracks by employing the conventional techniques [3, 57]. Since [3] is designed to yield a single object track, we repeatedly perform [3] to obtain multiple object tracks. On the other hand, [57] solves the binary integer problem to produce multiple suboptimal cliques directly. Table 4 compares the proposed SCO with these conventional techniques. For all three methods, we produce segmentation results from the multiple object tracks using the method of Proposed-M.

We see that the proposed SCO outperforms the conventional methods [3, 57] significantly, even when the postprocessing and MRF (PPMRF) are not applied. Also, we perform oracle experiments for the performance upper bounds: we obtain segmentation results by matching object instances with the ground-truth segments. The proposed algorithm yields scores close to these oracle results. Finally, we use deep features (DF) instead of color-based bag-of-visual-words to compute edge weights in the graph. To this end, we extract feature maps by concatenating outputs of conv1, conv3, and conv5 in VGG-16 [66]. To generate a feature of an object instance, we average the values of pixels within the object for each channel of the feature maps. DF degrades the performance in this application, since deep semantic features undesirably yield high similarity weights between different objects in the same class.

4.2 Comparison with Salient Object Detection Techniques

To assess SOD results, we adopt three performance metrics: precision-recall (PR) curves, F-measure, and mean absolute error (MAE). The precision is the ratio \(\frac{|\mathcal{S}_\mathrm{p}\cap \mathcal{S}_\mathrm{gt}|}{|\mathcal{S}_\mathrm{p}|}\) and the recall is \(\frac{|\mathcal{S}_\mathrm{p}\cap \mathcal{S}_\mathrm{gt}|}{|\mathcal{S}_\mathrm{gt}|}\), where \(\mathcal{S}_\mathrm{p}\) and \(\mathcal{S}_\mathrm{gt}\) are an estimated result and the ground-truth, respectively. F-measure is defined as the weighted harmonic mean of the precision and the recall, i.e. \( \mathrm{F\text {-} measure} = \frac{(1+\beta ^2) \cdot \mathrm{precision} \cdot \mathrm{recall}}{\beta ^2 \cdot \mathrm{precision} + \mathrm{recall}}\), where \(\beta ^2\) is set to 0.3 as in [10]. Also, MAE is defined as the average of pixel-wise absolute differences between \(\mathcal{S}_\mathrm{p}\) and \(\mathcal{S}_\mathrm{gt}\).
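For binary prediction maps, these metrics reduce to the following straightforward computation (a sketch):

```python
import numpy as np

def sod_metrics(pred, gt, beta2=0.3):
    """Precision, recall, F-measure, and MAE for binary maps."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    precision = inter / max(pred.sum(), 1)
    recall = inter / max(gt.sum(), 1)
    f = (1 + beta2) * precision * recall / max(beta2 * precision + recall, 1e-12)
    mae = np.abs(pred.astype(float) - gt.astype(float)).mean()
    return precision, recall, f, mae
```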

Figure 5 compares the proposed algorithm with the conventional SOD algorithms for still images [63, 64] and video sequences [8, 10, 24, 48, 65]. The scores of the conventional algorithms are from [10]. The conventional algorithms use thresholds to binarize continuous saliency maps to compute precision and recall rates. Thus, in Fig. 5, the PR curves are obtained by varying the thresholds from 0 to 255. In contrast, the proposed algorithm provides a single binary map for primary objects, without requiring a threshold. Therefore, the performance of Proposed-F or Proposed-M is given by a single dot for the pair of the average precision and recall. Both Proposed-F and Proposed-M significantly surpass all conventional algorithms on both DAVIS and FBMS datasets.

Fig. 5. Comparison of the precision-recall performances of the proposed algorithm with those of the conventional algorithms: (a) DAVIS 2016 and (b) FBMS datasets.

Table 5. Comparison with the conventional SOD algorithms. The best results are boldfaced.

Table 5 compares the F-measure and MAE performances. The proposed algorithm yields only one F-measure score, corresponding to the dot in Fig. 5. In contrast, each conventional algorithm yields 256 F-measure scores by varying the binarization threshold; its maximum F-measure is reported in Table 5 for a fair comparison. The proposed algorithm outperforms the conventional algorithms significantly. For example, Proposed-M yields about 0.19 and 0.12 higher F-measure scores than the state-of-the-art algorithm [10] on the DAVIS and FBMS datasets, respectively.

Fig. 6. The running times according to the number of (a) object instances and (b) frames.

4.3 Running Time Analysis

We measure the running times of the SCO algorithm for finding cliques in a complete k-partite graph. In this test, we use the “Boxing-fisheye” sequence in the DAVIS 2017 dataset on a computer with a 2.6 GHz CPU. The running time of the proposed SCO algorithm is affected by two factors: (1) the number N of object instances in a frame and (2) the number T of frames in a sequence. Figure 6(a) shows the running times according to N, when T is fixed to 50. Figure 6(b) plots the running times according to T, when N is limited to 10. We see that the proposed algorithm is faster than the binary integer program in [57], which consumes about 1 second when \(N=10\) and \(T=50\).

Table 6 lists the average running times per frame on DAVIS 2016. The proposed algorithm performs FCIS [11] to generate object instances, and also carries out optical flow estimation, saliency estimation, and feature extraction at each frame. Then, it performs SCO for the global optimization. SCO takes 0.304 s for an entire sequence, which is negligible. Finally, the proposed algorithm performs the MRF optimization at each frame. In total, the proposed algorithm takes 3.44 seconds per frame (SPF). It is much faster than the conventional deep-learning-based VOS algorithms [15, 16], which take about 18 SPF and 7 SPF, respectively.

Table 6. Running times in seconds per frame (SPF).

5 Conclusions

We proposed a novel algorithm to segment out primary objects in a video sequence, by solving the problem of finding cliques in a complete k-partite graph. We first generated object instances in each frame. Then, we chose a salient instance from each frame to construct the salient object track. For this purpose, we developed the SCO technique to consider both the saliency and similarity energies. By applying SCO repeatedly, we obtained multiple salient object tracks. Experimental results showed that the proposed algorithm significantly outperforms the state-of-the-art VOS and video SOD algorithms.