
1 Introduction

Layered models are a powerful way to model a video sequence. Such models aim to explain a video by decomposing it into layers, which describe the shapes and appearances of objects, their motion, and a generative means of reconstructing the video. They also relate objects through their occlusion relations and depth ordering, i.e., the ordering of objects in front of each other with respect to the given camera viewpoint. Compared to dense 3D reconstruction from monocular video, which is only valid for rigid scenes, layered approaches provide a computationally efficient intermediate 2D representation of (dynamic) scenes, which is still powerful enough for a variety of computer vision problems, including segmentation, motion estimation (e.g., tracking and optical flow), and shape analysis. Since all of these problems are coupled, layered approaches provide a natural and principled framework to address them jointly. Although such models are general and have been successful across these problems, existing layered approaches are fundamentally limited: they are 2D and only model objects moving according to planar motions. Thus, they cannot cope with 3D motions such as rotation in depth and the associated self-occlusion phenomena. Here, we define self-occlusion as the part of a 3D object surface that is not visible, in the absence of other objects, due to the camera viewpoint. In this paper, we generalize layered models and depth ordering to self-occlusion generated from out-of-plane object motion and non-planar camera viewpoint change.

Specifically, our contributions are as follows. 1. From a modeling perspective, we introduce flattened 3D object representations (see Fig. 1), which are compact 2D representations of the radiance of 3D deforming objects. These representations aggregate, into a single compact 2D representation, parts of the 3D object radiance that are self-occluded (or occluded by other objects) in some frames but visible in others. They generalize layered models to enable modeling of 3D (non-planar) motion and the corresponding self-occlusion phenomena. 2. We derive an optimization algorithm within a variational framework for inferring the flattened representations and segmentation, whose complexity grows linearly (as opposed to combinatorially) with the number of layers. 3. We introduce a new global depth ordering method that treats self-occlusion in addition to occlusion from other objects. The algorithm requires virtually no computation given the flattened representations and segmentation, and it allows the depth ordering to change over time. 4. Finally, we demonstrate the advantage of our approach in recovering layers, depth ordering, and segmentation on benchmark datasets.

Fig. 1.

Example flattened representation of the rotating earth. The video sequence (left) shows the rotating earth. The flattened representation reconstructed by our algorithm is on the right. Notice that the representation compactly captures parts of the earth that are self-occluded in some frames, but visible in others.

1.1 Related Work

The literature on layered models for segmentation, motion estimation, and depth ordering is extensive, and we highlight only some of the advances. Layers relate to video segmentation and motion segmentation (e.g., [1,2,3,4,5,6]) in that layered models provide a segmentation and a principled means of dealing with occlusion phenomena. We are interested in more than just segmentation, namely a generative explanation of the video, which these methods do not provide. Since the problems of segmentation, motion estimation and depth ordering are related, many layered approaches treat them as a joint inference problem in which the layers, motion, and depth ordering are solved together. As this joint inference problem is difficult and requires computationally intensive optimization, early approaches to layers (e.g., [7,8,9,10,11,12,13,14,15]) employed low-dimensional parametric motion models (e.g., translation or affine), which inherently limits them to planar motion.

Later approaches to layers (e.g., [16,17,18,19]) model the motion of layers with fully non-parametric models based on optical flow (e.g., [20,21,22,23,24]), thus enabling 2D articulated motion and deformation. [16] formulates the problem of inferring layered representations as an extension of the classical Mumford and Shah segmentation problem [25,26,27,28], which provides a principled approach to layers. In [16], depth ordering is not formulated, but layers can still be inferred. Optimization based on gradient descent was employed due to the non-convexity of the problem. While our optimization problem is similar to that framework, their optimization method does not allow for self-occlusion. Later advances (e.g., [17, 18]) improved the optimization in the layer and motion inference. However, the depth ordering problem, which is coupled with layer inference, is combinatorial in the number of layers, restricting the number of layers that can be handled. [29, 30] aim to overcome the combinatorial problem by considering localized layers rather than a full global depth ordering; within local regions there are typically few layers and it is feasible to solve the combinatorial problem. Further advances in optimization were achieved in [19], where the expensive joint optimization problem for segmentation, motion estimation and depth ordering is decoupled, resulting in less expensive optimization. There, depth ordering is solved by a convex optimization problem based on occlusion cues. While the aforementioned layered approaches model complex deformation, they are all 2D and cannot cope with self-occlusion phenomena arising from 3D rotation in depth, which are present in realistic scenes. Thus, segmentation can fail when objects undergo non-planar motion. Our work extends layers to model such self-occlusion, and our depth ordering also accounts for these phenomena. While [3, 31] does treat self-occlusion, it only performs video segmentation, not layered inference; we show that our method outperforms it in video segmentation experiments.

A recent approach to layers [30] uses semantic segmentation in images (based on advances in deep learning) to improve optical flow estimation and hence the layered inference. Although our method does not integrate semantic object detectors, as the focus is on addressing self-occlusion, it does not preclude them, and they could be used to enhance our method, for instance in the initialization.

2 Layered Segmentation with Flattened Object Models

In this section, we formulate the inference of the flattened 3D object representations and the segmentation as an optimization problem.

2.1 Energy Formulation

We denote the image sequence by \(\{I_t\}_{t=1}^T\), where \(I_t : \varOmega \rightarrow \mathbb {R}^k\) (\(k=3\) for color channels), \(\varOmega \subset \mathbb {R}^2\) is the domain of the image, and T is the number of images. Suppose that there are N objects (including the “background,” which comprises all of the scene except the objects of interest), and denote by \(R_i \subset \mathbb {R}^2\) the domain (shape) of the flattened 3D object representation of object i. We denote by \(f_i : R_i \rightarrow \mathbb {R}^k\) the radiance function of object i defined on the flattened object domain; \(f_i\) is a compact representation of all the appearances of object i seen in the image sequence. The object appearance in any image can be obtained from the part of \(f_i\) visible in that frame. We define the warps, \(w_{it} : R_i \rightarrow \varOmega \), as the mappings from the flattened representation domain of object i to frame t. These will be diffeomorphisms (smooth and invertible maps) from the un-occluded portion of \(R_i\) to the segmentation of object i at time t; for convenience, they will be extended diffeomorphically to all of \(R_i\). We denote by \(V_{it} : \varOmega \rightarrow [0,1]\) the visibility functions, i.e., the relaxed indicator functions of the pixels in image t that map to the visible portion of object i. Finally, we let \(\tilde{R}_{it} = \{ V_{it} = 1 \} \) be the domain of the projected flattened object i that is visible in frame t. See Fig. 2.

Fig. 2.

Schematic of flattened representations and generation of images.
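To make the generative model in Fig. 2 concrete, the following minimal Python sketch (an illustration under our own assumptions, not the authors' implementation) shows one way to store a flattened representation \((R_i, f_i, w_{it}, V_{it})\) and to reconstruct a frame from it; the `FlattenedObject` class, the dense inverse-warp fields, and `reconstruct_frame` are hypothetical names introduced here.

```python
# A minimal sketch of the generative model: each object i has a flattened domain
# R_i (a binary mask), a radiance f_i defined on that domain, a warp w_it into
# each frame, and a per-frame visibility V_it. A frame is reconstructed by pulling
# f_i back through the inverse warp wherever the object is visible. Warps are
# stored as dense inverse-warp fields (for every image pixel, its coordinates in
# the flattened domain), an implementation choice not specified in the paper.
import numpy as np
from scipy.ndimage import map_coordinates

class FlattenedObject:
    def __init__(self, R_mask, f, inv_warps, visibilities):
        self.R_mask = R_mask              # (H_R, W_R) bool: domain R_i
        self.f = f                        # (H_R, W_R, k): radiance f_i on R_i
        self.inv_warps = inv_warps        # list of (2, H, W): w_it^{-1}(x) per frame
        self.visibilities = visibilities  # list of (H, W) in [0, 1]: V_it per frame

def reconstruct_frame(objects, t, image_shape):
    """Compose the visible, warped radiances of all objects into frame t."""
    H, W, k = image_shape
    frame = np.zeros((H, W, k))
    weight = np.zeros((H, W, 1))
    for obj in objects:
        coords = obj.inv_warps[t]            # flattened-domain coords per pixel
        V = obj.visibilities[t][..., None]   # visibility of this object in frame t
        warped = np.stack(
            [map_coordinates(obj.f[..., c], coords, order=1) for c in range(k)],
            axis=-1)                         # f_i(w_it^{-1}(x)), sampled bilinearly
        frame += V * warped
        weight += V
    return frame / np.maximum(weight, 1e-8)  # each pixel should be claimed by one object
```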

We now define an energy to recover the flattened representation of each object, i.e., \(f_i, R_i\), the warps \(w_{it}\), and the visibility functions. The energy consists of two components: \(E_{app}\), the appearance energy, which is driven by the images, and \(E_{reg}\), which contains regularity terms. The main term of the appearance energy aims to choose the flattened representations such that, when deformed by smooth warps, they reconstruct each of the images \(I_{t}\) as closely as possible. Thus, the appearance energy consists of a term that warps the appearances \(f_i\) into the image domains via the inverse of \(w_{it}\) and compares them, via the squared error, to the image \(I_t\) within the segmentations \(\tilde{R}_{it}\). The first term in the energy to be minimized is thus

$$\begin{aligned} E_{app} = \sum _{t,i} \int _{\tilde{R}_{it} } | I_t(x) - f_i( w_{it}^{-1}(x) ) |^2 \,\mathrm {d}x - \int _{\tilde{R}_{it} } \beta _t(x) \log { p_i(I_t(x)) } \,\mathrm {d}x. \end{aligned}$$
(1)

The second term above groups pixels by similarity to other image intensities, via local histograms (i.e., a collection of histograms that vary with spatial location) \(p_i\) for object i. The spatially varying weight \(\beta _t\) is small when the first term is reliable enough to group the pixel, and large otherwise. This term is needed to cope with noise: if a pixel back-projects to a point in the scene that is visible in only a few frames, the true appearance that can be recovered is unreliable, and hence more weight is placed on grouping the pixel based on similar intensities in the image. The weighting function \(\beta\) will be given in the optimization section, where it is easier to interpret. Other terms could be used in place of the second one, possibly integrating semantic knowledge, but we choose it for its simplicity, as our main objective is the optimization of the first term.

The regularity energy \(E_{reg}\) consists of boundary regularity of the regions defined by the visibility functions and an area penalty on the domains of the flattened object models, and is defined as follows:

$$\begin{aligned} E_{reg} = \alpha \sum _{i,t} \text {Len}( \partial \tilde{R}_{i,t} ) + \gamma \sum _i \text {Area}( R_i ), \end{aligned}$$
(2)

where \(\alpha , \gamma > 0\) are weights, \(\text {Len}( \partial \tilde{R}_{it} )\) is the length of the boundary of \(\tilde{R}_{it}\), which induces spatially regular regions in the images, and \(\text {Area}( R_i )\) is the area of the domain of the object model. The last term, which can be thought of as a measure of compactness of the representation, is needed so that the models are as compact as possible. Note that if this term were not included, a trivial (non-useful) solution to the full optimization problem would be to choose a single object model that is a concatenation of all the images, the warps to be the identity, and the visibility functions to be 1 everywhere, which gives \(E_{app} = 0\).

The goal is to optimize the full energy \(E = E_{app} + E_{reg}\), which is a joint optimization problem in the shapes \(R_i\) and appearances \(f_i\) of the flattened objects, the warps \(w_{it}\), and the visibility functions \(V_{it}\).
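To fix ideas, here is a minimal sketch (our illustration, with hypothetical function names) of evaluating the energy for a given candidate solution: the first term of \(E_{app}\) for one object and frame, and \(E_{reg}\) computed from relaxed indicator functions and domain masks. The local-histogram term of \(E_{app}\) is omitted for brevity, and visibilities and inverse warps are assumed to be given as dense fields.

```python
# Sketch of evaluating E = E_app + E_reg (histogram term omitted) for given
# flattened representations, warps, and visibility functions.
import numpy as np
from scipy.ndimage import map_coordinates

def appearance_term(I_t, f_i, inv_warp_it, V_it):
    """First term of E_app for one (i, t): squared error inside the region {V_it ~ 1}."""
    k = I_t.shape[-1]
    warped_f = np.stack(
        [map_coordinates(f_i[..., c], inv_warp_it, order=1) for c in range(k)],
        axis=-1)                                   # f_i(w_it^{-1}(x))
    return np.sum(V_it * np.sum((I_t - warped_f) ** 2, axis=-1))

def regularity_term(R_masks, V_all, alpha, gamma):
    """E_reg: boundary length of every projected region plus area of every R_i.
    V_all[i][t] is the relaxed indicator V_it; R_masks[i] is the mask of R_i."""
    E = 0.0
    for per_frame in V_all:
        for V in per_frame:
            gy, gx = np.gradient(V)
            E += alpha * np.sum(np.sqrt(gx ** 2 + gy ** 2))   # total variation ~ Len
    for R in R_masks:
        E += gamma * np.sum(R)                                # Area(R_i)
    return E
```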

Occlusion and Self-occlusion: By formulating the energy with flattened object models, we naturally address both occlusion from one object moving in front of another and self-occlusion, without distinguishing between the two. The flattened model \(R_i, f_i\) contains parts of the projected object that are visible in one frame but not in another. The occluded and self-occluded parts of the representation in frame t are the set \(R_i \backslash w_{it}^{-1}( \{ V_{it} = 1 \} )\). Considering only the first term of \(E_{app}\), the occluded part of \(R_i\) consists of the points that map to points x at which the squared error \(|I_t(x)-f_i(w_{it}^{-1}(x))|^2\) is not smallest when compared to the squared errors of other flattened representations that map to the same points x.

For the problem of flattened representation inference, distinguishing occlusion from self-occlusion is not needed. However, we eventually want to go beyond segmentation and obtain a depth ordering of objects, which requires distinguishing the two (see Sect. 3). This separation of occlusion and self-occlusion allows one to see behind objects in images. See Fig. 3, where we visualize the flattened representation minus the self-occlusion, which shows the object(s) without other objects occluding them.

Fig. 3.

Seeing behind occlusion from other objects. From top to bottom: original image, the flattened representation minus the self-occlusion, which removes occlusion due to other objects, and the object segmentation. Video segmentation datasets label the bottom as the segmentation, but the middle seems to be a natural object segmentation. Which should be considered ground truth?

2.2 Optimization Algorithm

Due to non-convexity, our optimization algorithm will be a joint gradient descent in the flattened shapes, appearances, warps, and the visibility functions. We now show the optimization of each of these variables, given that the others are fixed, and then give the full optimization procedure at the end (Fig. 3).

Appearance Optimization: We optimize in \(f_i\) given estimates of the other variables. Notice that \(f_i\) appears only in the first term of \(E_{app}\). We can perform a change of variables in each of the integrals, differentiate the expression in \(f_i(x)\), and solve for the global optimum of \(f_i\), which gives

$$\begin{aligned} f_i(x) = \frac{ \sum _t I_t(w_{it}(x)) V_{it}(w_{it}(x)) J_{it}(x) }{ \sum _t V_{it}(w_{it}(x)) J_{it}(x) }, \quad x\in R_i, \end{aligned}$$
(3)

where \(J_{it}(x) = \det { \nabla w_{it}(x)} \) is the determinant of the Jacobian of the warp. The expression for \(f_i\) has a natural interpretation: the appearance at x is a weighted average of the image values at the visible projections of x, i.e., \(w_{it}(x)\), in the image domain. The weighting accounts for the area distortion of the mappings.
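The closed-form update (3) can be computed directly as a visibility- and Jacobian-weighted average over frames. The sketch below is illustrative only: forward warps are assumed to be given as dense coordinate fields over the flattened domain, the Jacobian determinant is approximated by finite differences, and all names are hypothetical.

```python
# Sketch of the appearance update (3): for each point x in R_i, average the image
# values I_t(w_it(x)) over frames, weighted by visibility and the Jacobian
# determinant of the warp.
import numpy as np
from scipy.ndimage import map_coordinates

def jacobian_det(warp):
    """warp: (2, H_R, W_R) forward map w_it; finite-difference Jacobian determinant."""
    dy_dr, dy_dc = np.gradient(warp[0])
    dx_dr, dx_dc = np.gradient(warp[1])
    return dy_dr * dx_dc - dy_dc * dx_dr

def update_appearance(images, warps, visibilities, R_mask):
    """images: list of (H, W, k); warps: list of (2, H_R, W_R) forward maps w_it;
    visibilities: list of (H, W) functions V_it; R_mask: (H_R, W_R) domain of R_i."""
    k = images[0].shape[-1]
    num = np.zeros(R_mask.shape + (k,))
    den = np.zeros(R_mask.shape)
    for I_t, w_it, V_it in zip(images, warps, visibilities):
        J = jacobian_det(w_it)
        V_at_w = map_coordinates(V_it, w_it, order=1)       # V_it(w_it(x))
        I_at_w = np.stack(
            [map_coordinates(I_t[..., c], w_it, order=1) for c in range(k)],
            axis=-1)                                        # I_t(w_it(x))
        weight = V_at_w * J
        num += weight[..., None] * I_at_w
        den += weight
    f_i = num / np.maximum(den, 1e-8)                       # Eq. (3)
    return np.where(R_mask[..., None], f_i, 0.0)
```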

Shape Optimization: We optimize in the shape of the flattened region \(R_i\) by gradient descent, since the energy is non-convex in \(R_i\). We first consider the terms in \(E_{app}\) and perform a change of variables so that the integrals are over the domains \(R_i\). The resulting expression fits into a region competition problem [32], and we can use the known gradient computation there. One can show that the gradient with respect to the boundary \(\partial R_i\) is given by

$$\begin{aligned} \nabla _{\partial R_i} E = \sum _t \left[ | \tilde{I}_{it} - f_i |^2 - | \tilde{I}_{jt} - \tilde{f}_j |^2 - \beta _t \log \frac{p_i( \tilde{I}_{it} )}{ p_j( \tilde{I}_{jt} )} + \alpha \kappa _i \right] J_{it} \tilde{V}_i N_i + \gamma N_i, \end{aligned}$$
(4)

where \(N_i\) is the unit outward normal to the boundary of \(R_i\), \(\tilde{I}_{it} = I_t\circ w_{it}\), \(\tilde{V}_i = V_{it} \circ w_{it}\), \(\tilde{f}_j = f_j \circ w_{jt}^{-1} \circ w_{it}\), and j, which is a function of x and t, is the layer adjacent to layer i in \(I_{t}\). This optimization is needed so that the size and shape of the flattened representation can adapt to newly discovered self-occlusion. This is a major distinction from [16], which, although it has a similar model to ours, bypasses this optimization and instead optimizes only the segmentation; this is equivalent in the case of no self-occlusion, but not otherwise. Thus, [16] cannot adapt to self-occlusion. See Fig. 4.

Fig. 4.

Layered inference of a Rubik's cube. Two different video sequences (top and bottom rows) of the same Rubik's cube with different camera motion. [Last column]: our flattened 3D representations capture information about the 3D structure (e.g., connectivity between faces of the cube) and motion, and include parts of the object that are self-occluded. [Second-to-last column]: existing 2D layered models (result from a modern implementation of [16]) cannot adapt to 3D motion and self-occlusion.

Visibility Optimization: We optimize in the visibility functions \(V_{it}\), which form the segmentation, given the other variables. Note that the visibility functions can be determined from the corresponding projected regions \(\tilde{R}_{it}\). We thus compute the gradient of the energy with respect to the boundary of the projected regions \(\partial \tilde{R}_{it}\). This is a standard region competition problem. One can show that the gradient is then

$$\begin{aligned} \nabla _{\partial \tilde{R}_{it} } E = \left[ | I_{t} - \hat{f}_i|^2 - | I_{t} - \hat{f}_j |^2 - \beta _t \log \frac{ p_{i}(I_{t}) }{ p_{j}(I_{t}) } + \alpha \kappa _i \right] \tilde{N}_i, \quad x\in \partial \tilde{R}_{it} \end{aligned}$$
(5)

where \(\hat{f}_i = f_i(w_{it}^{-1}(x))\), \(\tilde{N}_i\) is the normal to \(\partial \tilde{R}_{it}\), and j is defined as before: it is the layer adjacent to i in \(I_t\).

Warp Optimization: We optimize in the warps \(w_{it}\) given the other variables. Since the energy is non-convex, we use gradient descent. To obtain smooth, diffeomorphic warps, and robustness to local minima, we use Sobolev gradients [33, 34]. The only term that involves the warp \(w_{it}\) is the first term of \(E_{app}\). One can show that the Sobolev gradient \(G_{it}\) with respect to \(w_{it}\) has a translation component \(\text {avg}({G_{it}}) = \text {avg}({ F_{it} })\) and a deformation component that satisfies:

$$\begin{aligned} {\left\{ \begin{array}{ll} -\varDelta \tilde{G}_{it}(x) = F_{it}(x) &{} x \in w_{it}(R_i) \\ \nabla \tilde{G}_{it}(x)\cdot \tilde{N}_i = | I_t - \hat{f}_i |^2 \tilde{V}_{i} \tilde{N}_i &{} x\in \partial w_{it}(R_i) \end{array}\right. },\quad F_{it} = \nabla \hat{f}_i [ I_t - \hat{f}_i ]^T \tilde{V}_{i} \end{aligned}$$
(6)

where \(\varDelta \) denotes the Laplacian, and \(\nabla \) denotes the spatial gradient. The optimization involves updating the warp \(w_{it}\) iteratively by the translation until convergence, then one update step of \(w_{it}\) by the deformation \(\tilde{G}_{it}\), and the process is iterated until convergence.
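For intuition, the following rough sketch (ours, not the paper's implementation) computes the force field \(F_{it}\), takes its mean as the translation component, and approximates the deformation component of (6) by a few Jacobi iterations for the Poisson equation on the full grid with periodic boundaries, in place of a proper Neumann solve on the warped region; signs, step sizes, and all function names are our own assumptions.

```python
# Rough sketch of one Sobolev-gradient warp update: force field from the data
# term, translation component, and a smoothed deformation component obtained by
# approximately solving -Laplacian(G) = F with Jacobi iterations (simplified
# boundary handling via periodic wrap-around).
import numpy as np

def force_field(I_t, f_hat, V_tilde):
    """F_it ~ grad(f_hat) . (I_t - f_hat) * V_tilde, summed over color channels."""
    resid = (I_t - f_hat) * V_tilde[..., None]         # (H, W, k)
    Fy = np.zeros(I_t.shape[:2])
    Fx = np.zeros(I_t.shape[:2])
    for c in range(I_t.shape[-1]):
        gy, gx = np.gradient(f_hat[..., c])
        Fy += gy * resid[..., c]
        Fx += gx * resid[..., c]
    return np.stack([Fy, Fx])                          # (2, H, W)

def sobolev_warp_step(warp, I_t, f_hat, V_tilde, step=0.5, jacobi_iters=200):
    """One descent step on the warp (stored here as a (2, H, W) coordinate field)."""
    F = force_field(I_t, f_hat, V_tilde)
    translation = F.mean(axis=(1, 2), keepdims=True)   # translation component
    G = np.zeros_like(F)
    for _ in range(jacobi_iters):                      # Jacobi updates for -lap(G) = F
        G = 0.25 * (np.roll(G, 1, 1) + np.roll(G, -1, 1) +
                    np.roll(G, 1, 2) + np.roll(G, -1, 2) + F)
    return warp - step * (translation + G)             # gradient-descent update
```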

Initialization: The innovation in our method is the formulation and optimization of flattened representations and self-occlusion, and we do not focus here on the initialization. We provide a simple scheme that we use in the experiments, unless otherwise stated. From \(\{I_t\}_{t=1}^{T}\), we compute frame-to-frame optical flow using [23] and then, by composing flows, we obtain the displacement \(v_{t,T/2}\) between frames t and T/2. We use these displacements as components in an edge detector [35], which gives the number of regions and a segmentation in frame T/2. We then choose that segmentation as the initial flattened regions. One could use more sophisticated strategies, for instance, semantic object detectors.
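The flow-composition step of this initialization can be sketched as follows; `compute_flow(I_a, I_b)` is a stand-in for the external optical flow method [23] (not part of this paper's code), and the edge-detector step [35] is not shown.

```python
# Hedged sketch of composing frame-to-frame optical flow into displacements
# v_{t, T/2} from every frame t to the middle frame.
import numpy as np
from scipy.ndimage import map_coordinates

def compose_flows_to_middle(images, compute_flow):
    """Return displacements v[t] (shape (2, H, W)) mapping frame t to frame T//2."""
    T = len(images)
    mid = T // 2
    H, W = images[0].shape[:2]
    grid = np.mgrid[0:H, 0:W].astype(float)              # (2, H, W) pixel grid
    v = {mid: np.zeros((2, H, W))}
    for t in range(mid - 1, -1, -1):                      # frames before the middle
        step = compute_flow(images[t], images[t + 1])     # flow t -> t+1, (2, H, W)
        prev = v[t + 1]
        # compose: take the step, then add the accumulated displacement found there
        v[t] = step + np.stack([map_coordinates(prev[d], grid + step, order=1)
                                for d in range(2)])
    for t in range(mid + 1, T):                           # frames after the middle
        step = compute_flow(images[t], images[t - 1])     # flow t -> t-1
        prev = v[t - 1]
        v[t] = step + np.stack([map_coordinates(prev[d], grid + step, order=1)
                                for d in range(2)])
    return v
```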

Overall Optimization Algorithm: The overall optimization is given by Algorithm 1. Rather than evolving boundaries of regions, we evolve relaxed indicator functions of the regions, as described in the Supplementary. We now specify \(\beta _t\) in (1) as \(\beta _t(x) = [\min _{j \sim i, j=i} \sum _{t'} V_{jt'}(w_{jt'} \circ w_{jt}^{-1}(x)) J_{jt'}(w_{jt}^{-1}(x)) ]^{-1}\), where \(j\sim i\) denotes that object j is adjacent to object i at x and \(x\in \partial R_i\). \(\beta _t\) is the unreliability of the first term in \(E_{app}\), defined as follows. We compute, for each j, the number of frames \(t'\) in which the point x corresponds to a point in flattened representation j that is visible in frame \(t'\). To account for distortion effects of the mapping, there is a weighting by \(J_{jt'}\). Since the evolution depends on data from i and from all j adjacent to i, we define the unreliability \(\beta _t(x)\) as the inverse of the reliability of the least reliable representation. Therefore, the more times a point is visible, the more accurate the appearance model will be, the more dependence there is on the first term in \(E_{app}\), and the less dependence on local histograms.

[Algorithm 1: overall optimization procedure]

3 Depth Ordering

In this section, we show how the depth ordering of the objects in the images can be computed from the segmentation and flattened models determined in the previous section. In the first sub-section, we assume that the object surfaces in 3D, their mapping to the imaging plane, and the segmentation in the image are known, and present a (trivial) algorithm to recover the depth ordering. Of course, in our problem, the objects in 3D are not available. Thus, in the next sub-section, we show how the previous algorithm can be used without 3D object surfaces by using the flattened representations and their mappings to the imaging plane as proxies for the 3D surfaces and their mappings to the image.

3.1 Depth Ordering from 3D Object Surfaces

We first introduce notation for the object surfaces and mappings to the plane, and then formalize self-occlusion and occlusion induced from other objects. These concepts will be relevant to our depth ordering algorithm, which we present following these formal concepts.

Notation and Definitions: Let \(O_1,\ldots , O_N \subset \mathbb {R}^3\) denote N object surfaces in the 3D world that are imaged to form the image \(I : \varOmega \rightarrow \mathbb {R}^k\) at a given viewpoint at a given time. With abuse of notation, we let \(V_i\) denote the segmentation of object i in the image I (the points in \(\varOmega \) belonging to object i). Based on the given viewpoint, the camera projection from points on the surface \(O_i\) to the imaging plane will be denoted \(w_{O_iI}\), and \(w_{O_iI}^{-1}\) will denote the inverse of the mapping. We can now provide computational definitions of self-occlusion and of occlusion induced by other objects, relevant to our algorithms. The self-occlusion (formed due to the viewpoint of the camera) is simply the set of points of \(O_i\) (when all other objects are removed from the scene) that are not visible from the viewpoint of the camera. \(w_{O_iI}(O_i)\) will denote the projection of the non-self-occluded points of \(O_i\). The occluded part of object \(O_i\) induced by object \(O_j\) is \(w_{O_iI}^{-1} (w_{O_iI}(O_i)\cap V_j)\). The occlusion of \(O_i\) induced by other objects (denoted by \(O_{i,occ}\)) is the union of the occluded parts of \(O_i\) induced by all other objects, which is given by \({w_{O_iI}}^{-1}(\cup _{j\ne i} (w_{O_iI}(O_i)\cap V_j))\).

Algorithm for Depth Ordering: We now present an algorithm for depth ordering. The algorithm makes the assumption that if any part of object i is occluded by object j, then no part of object j is occluded by object i. This can be formulated as

Assumption 1

For \(i \ne j\), one of \(w_{O_iI}(O_i)\cap V_j\) or \(w_{O_jI}(O_j)\cap V_i\) must be empty.

Under this assumption, we can relate the depth ordering of objects i and j; indeed, \(Depth(i)<Depth(j)\) (object i is in front of object j) in case \(w_{O_jI}(O_j)\cap V_i \ne \emptyset \). This naturally defines a depth ordering of the objects, ranging from 1 to N. Note that the depth ordering is not unique, due to two cases in which both sets in the assumption above are empty. First, if the projections of two objects do not overlap (\(w_{O_iI}(O_i) \cap w_{O_jI}(O_j) = \emptyset \)), then no relation can be established and the ordering can be arbitrary. Second, if the overlapping part of the projections of two objects is fully occluded by another object (\(w_{O_iI}(O_i) \cap w_{O_jI}(O_j) \subseteq V_k\), \(k \ne i, j\)), then the depth relation between i and j cannot be established.

Under the previous assumption, we can derive a simple algorithm for depth ordering. Note that, by definition of depth ordering, for the object i satisfying \(Depth(i) = 1\), we have that \(\cup _{j\ne i} w_{O_iI}(O_i)\cap V_j = \emptyset \), which means that it is not occluded by any other object. Therefore, we can recover the object with depth 1. By removing that object from the scene, we can repeat the same test and identify the object with depth 2. Continuing this way, we can recover the depth ordering of all objects. One can effectively remove an object i from the scene in the image by removing \(V_i\) from the segmentation in image I and then augmenting each remaining \(V_j\) by the (projected) occluded part of object j induced by object i. Therefore, we can recover the depth ordering by Algorithm 2.

[Algorithm 2: depth ordering from 3D object surfaces]
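A mask-based sketch of this greedy procedure (our illustration; the masks, dictionary layout, and function name are hypothetical) is given below: proj[i] stands for the projection \(w_{O_iI}(O_i)\) and seg[i] for the segment \(V_i\).

```python
# Greedy depth ordering under Assumption 1: the front-most remaining object is one
# whose projection does not intersect any other remaining object's segment; it is
# then "removed" by handing the pixels it covered back to the objects it occluded.
import numpy as np

def depth_order(proj, seg):
    """proj, seg: dicts {object id: (H, W) bool mask}. Returns ids front-to-back."""
    seg = {i: s.copy() for i, s in seg.items()}
    order, remaining = [], set(proj)
    while remaining:
        for i in remaining:
            others = [seg[j] for j in remaining if j != i]
            occluded = (np.logical_or.reduce([proj[i] & s for s in others])
                        if others else np.zeros_like(proj[i]))
            if not occluded.any():              # i is not occluded by any remaining object
                front = i
                break
        else:                                   # ambiguous ordering: pick arbitrarily
            front = next(iter(remaining))
        order.append(front)
        remaining.remove(front)
        for j in remaining:                     # uncover what `front` was hiding
            seg[j] |= proj[j] & seg[front]
    return order
```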

3.2 Depth Ordering from Flattened Representations

We now translate the depth ordering algorithm that assumed 3D surfaces in the previous section to the case of depth ordering with flattened representations. We define \(w_{O_iR_i}\) to be the mapping from the surface \(O_i\) to the flattened representation \(R_i\). Ideally, \(w_{O_iR_i}\) is a one-to-one mapping, but in general it will only be onto, since the video sequence from which the flattened representation is constructed may not observe all parts of the object. By defining the mapping from the flattened representation to the image as \(w_{R_iI} := w_{O_iI} \circ w_{O_iR_i}^{-1}\), the definitions of self-occlusion, occlusion induced by other objects, and the visible part of the object naturally extend to the flattened representation. By noting that \(w^{-1}_{O_iR_i}(R_i) \subset O_i\), and under Assumption 1, we obtain the following property.

Statement 1

At least one of \(w_{R_iI}(R_i)\cap V_j\) and \(w_{R_jI}(R_j)\cap V_i\) must be empty.

This translates Assumption 1 to the mappings from flattened representations to the image. The statement allows us to similarly define a depth ordering: \(w_{R_jI}(R_j)\cap V_i \ne \emptyset \) implies \(Depth(i) < Depth(j)\), as before. Therefore, we can apply the same algorithm as in the previous section with \(w_{O_iI}\) replaced by \(w_{R_iI}\).

In theory, the mappings \(w_{R_iI}\) only map the non-self-occluded part of \(R_i\) to the image. However, in practice, \(w_{R_iI}\) is computed from the optical flow computation in Sect. 2.2, which maps the entire flattened region \(R_i\) to the image. The optical flow computation implicitly ignores data from the occluded (self-occluded as well as occluded by other objects) part of the flattened representation through robust norms on the data fidelity, and extends the flow into occluded parts by extrapolating the warp from the visible parts. Near the self-occluding boundary of the object, the mapping \(w_{O_i I}\) maps large surface areas to small ones in the image, so that the determinant of the Jacobian of the warp becomes small. Since the warping \(w_{R_iI}\) from the flattened representation is a composition with \(w_{O_iI}\), near the self-occlusion the map \(w_{R_iI}\) maps large areas to small areas in the image. Since the optical flow algorithm extends the mapping near the self-occlusion into the self-occlusion, the self-occlusion is mapped to a small area (close to zero) in the image. Therefore, in Statement 1, rather than the condition that \(w_{R_jI}(R_j)\cap V_i = \emptyset \) (object j is in front of object i), it is reasonable to assume that \(w_{R_jI}(R_j)\cap V_i\) has small area (representing the area of the mapping of the self-occluded part of object j to \(V_i\)).

We can now extend the algorithm for depth ordering to deal with the case of \(w_{R_iI}\) approximated with optical flow computation, based on the fact that self-occlusions are mapped to a small region in the image. To identify the object on top (depth ordering 1), rather than the condition \( w_{R_iI}(R_i) \backslash V_i = \emptyset \), we compute the object \(i_1\) such that \(\text {Area}(w_{R_iI}(R_i) \backslash V_i)\) is smallest over all i. As in the previous algorithm, we can now remove the object with depth ordering 1, and again find the object \(i_2\) that minimizes \(\text {Area}(w_{R_iI}(R_i) \backslash V_{i_1} \backslash V_{i})\) over all \(i\ne i_1\). We can continue in this way to obtain Algorithm 3. Note that this allows one to compute depth ordering from only a single image, which allows the depth ordering to change with frames in a video sequence.

[Algorithm 3: depth ordering from flattened representations]
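The area-based greedy ordering can be sketched as follows (illustrative only; masks and names are hypothetical): proj[i] stands for \(w_{R_iI}(R_i)\) as a boolean mask and seg[i] for \(V_i\).

```python
# Sketch of Algorithm 3: since self-occluded parts of a flattened representation
# are mapped to near-zero area, the front-most remaining object is taken as the
# one whose projected flattened region sticks out least beyond its own segment
# and the segments of objects already placed in front.
import numpy as np

def depth_order_flattened(proj, seg):
    """Greedy ordering by smallest uncovered projected area; returns ids front-to-back."""
    order = []
    removed = np.zeros_like(next(iter(seg.values())), dtype=bool)
    remaining = set(proj)
    while remaining:
        def uncovered_area(i):
            # Area( w_{R_i I}(R_i) \ V_i \ (segments already removed) )
            return np.count_nonzero(proj[i] & ~seg[i] & ~removed)
        front = min(remaining, key=uncovered_area)
        order.append(front)
        removed |= seg[front]
        remaining.remove(front)
    return order
```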

4 Experiments

In this section, we show the performance of our method on three standard benchmarks, one for layered segmentation and the other two for video segmentation.

MIT Human Annotated Dataset Results: The MIT Human Annotated Dataset [36] has 10 sequences and is used to test layered segmentation approaches and depth ordering. Results are reported visually. Both planar and 3D motion are present in these image sequences. We test our layered framework using the human-labeled ground truth segmentation of the first frame as initialization (the ground truth depth ordering is not used). Figure 5 presents the segmentation and depth ordering results. Our algorithm recovers the layers with high accuracy, and the depth ordering of the layers correctly in most cases.

Fig. 5.

Segmentation and depth ordering in the MIT dataset. Multiple layers are extracted to obtain a multi-label segmentation. Based on the segmentation result and extracted layers, Algorithm 3 is applied to compute the depth ordering. In most cases the depth ordering is inferred correctly. Note that due to the ambiguity of depth ordering, in some cases a ground truth depth ordering does not exist. Layers in the front are indicated by small values of depth.

DAVIS 2016 Dataset: The DAVIS 2016 dataset [37] focuses on video object segmentation. Video segmentation is one output of our method, but our method goes further. The dataset contains 50 sequences ranging from 25 to 100 frames. In each frame, the ground truth segmentation of the moving object versus the background is densely annotated. We run our scheme fully automatically, initialized by the method described in Sect. 2.2.

Coarse-to-Fine Scheme for DAVIS and FBMS-59: The initialization scheme described in Sect. 2.2 often yields results that are noisy over time, perhaps missing segmentations in some frames. To clean up this noise, we first run our algorithm with this initialization on small overlapping batches (of size 15 frames) of the video; this fills in missing segmentations. We then run our algorithm on the whole video with this result as initialization, which integrates coarse information across the whole video. Finally, to obtain finer-scale details, we again run our algorithm on small overlapping batches (of size 7 frames). We iterate the last two steps to obtain our final result; a schematic of this schedule is sketched below. Table 1 shows the result of these stages (labeled initialization, 1st, 2nd, 3rd; the final result is labeled “ours”) on DAVIS.
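The schedule can be summarized by the following schematic (ours): `run_layers` stands in for one run of the layered inference of Sect. 2, `batches` splits the video into overlapping chunks, the overlap amounts are our own choice, and the stitching of overlapping batch results is elided.

```python
# Schematic of the coarse-to-fine schedule: small batches (15 frames), whole
# video, then small batches (7 frames), with the last two steps iterated.
def batches(n_frames, size, overlap):
    step = size - overlap
    return [range(s, min(s + size, n_frames)) for s in range(0, n_frames, step)]

def coarse_to_fine(frames, init, run_layers, n_outer=2):
    # 1) small overlapping batches to fill in missing segmentations
    result = [run_layers([frames[t] for t in b], init)
              for b in batches(len(frames), 15, 5)]
    for _ in range(n_outer):
        # 2) whole video, initialized from the batched result (coarse information)
        result = run_layers(frames, result)
        # 3) small batches again for finer-scale detail
        result = [run_layers([frames[t] for t in b], result)
                  for b in batches(len(frames), 7, 2)]
    return result
```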

Table 1. Evaluation of segmentation results on DAVIS. From left to right: result after our initialization, result after the 1st stage of our coarse-to-fine layered approach (see text for an explanation), result after our 2nd stage, result after our 3rd stage of coarse-to-fine, results of competing methods, and finally our final result after the last stage of our coarse-to-fine scheme.
Fig. 6.

Qualitative comparison on DAVIS. From left to right: (images 1–3): sequences with 3D motion inducing self-occlusion, (image 4): a sequence with object color similarity to the background, and (images 5–8): sequences with occlusion by other objects. Our layered segmentation successfully captures the object in all of these cases. In (1–3), [19], a layered approach, fails due to its lack of 3D motion modeling; in (4), color similarity leads to wrong labeling in both [3, 19] due to their reliance on intensity similarities. In (5–8), [3, 19] fail due to their inability to deal with objects moving behind others. (Color figure online)

Comparison on DAVIS: We compare to a modern implementation of the layered method [16], which is equivalent to our method if the shape evolution of the flattened representation is not performed. We also compare to [19], which is another layered method based on motion. We further include in the comparison the non-layered approach [3], which addresses the problem of self-occlusion in motion segmentation, and [38], which is another motion segmentation approach. A qualitative comparison of the methods can be found in Fig. 6 and a quantitative comparison in Table 1. Quantitatively, our method outperforms all comparable motion-based approaches. Note that the state-of-the-art approaches on this dataset use deep learning and are trained on large datasets (for instance, Pascal); however, they only perform segmentation and do not give a layered interpretation of the video, they are applicable only to binary segmentation, and they cannot be adapted to multiple objects. Our method requires no training data, is low-level, and comes close to the performance of these deep learning approaches. In fact, our method performs best on 15 of the 50 sequences, more than any other single method.

FBMS-59 Dataset: To test our method on inferring more than two layered representations, we evaluate on the FBMS-59 dataset, which is used for benchmarking video segmentation algorithms. The test set of FBMS-59 contains 30 sequences with 69 labeled objects, and the number of frames ranges from 19 to 800. Ground truth is given on selected frames. We compare to [3], a video segmentation method that handles self-occlusion but not layers (discussed in the previous section), to the layered approach [19], and to other motion segmentation approaches. Quantitative results and representative qualitative results are shown in Fig. 7. They show that our method performs best among these methods, with a slight improvement over [3], and with the additional advantage that our method gives a layered representation, which is more powerful than just a segmentation.

Fig. 7.

Results (qualitative and quantitative) and comparison on FBMS-59.

Parameters: Our algorithm has few parameters: \(\gamma \), the weight penalizing the area of the flattened representation, and \(\alpha \), the weight on the spatial regularity of the segmentation. The results are not sensitive to these parameters. The values chosen in the experiments were \(\gamma =0.1\) and \(\alpha = 2\).

Computational Cost: Our algorithm is linear in the number of layers (due to the optical flow computation for each layer). For 2 layers and 480p video with 30 frames, our entire coarse-to-fine scheme runs in about 10 min with a Matlab implementation, on a standard modern processor.

5 Conclusion

We have generalized layered approaches to 3D (non-planar) motions and the corresponding self-occlusion phenomena. This was accomplished with an intermediate 2D representation that aggregates all visible parts of an object in a monocular video sequence into a single compact representation. This allows for representing parts that are self-occluded in one frame but visible in another. Depth ordering was formulated independently of the inference of the flattened representations, and is computationally efficient. Results on benchmark datasets showed the advantage of this approach over other layered works. Further, increased performance was shown in the problem of motion segmentation over existing layered approaches, which do not account for 3D motion.

A limitation of our method is that it depends on the initialization, which remains an open problem, although we provided a simple scheme. More advanced schemes could use semantic segmentation. Another limitation is in our representation, in that it does not account for all 3D motions and all self-occlusion phenomena. For instance, for a walking person, the crossing of the legs cannot be captured with a 2D representation (our method coped with this case on the datasets since the number of frames used was small enough that the legs did not fully cross). A solution would be to infer a 3D representation of the object from the monocular video, but this could be computationally expensive, and it is only valid for rigid scenes. Our method trades off between the complexity of a full 3D representation and its modeling power: although it does not model all 3D situations, it is a clear advance over existing layered approaches, without the complexity of a 3D representation and its limitation to rigid scenes. Another limitation is when Assumption 1 is broken (e.g., a hand grasping an object), in which case our depth ordering would fail, but the layers are still inferred correctly.