1 Introduction

In this work we address the task of instance segmentation, which involves segmenting each individual instance of a semantic class in an image. Many top-down approaches to this problem are built on object detection pipelines [1, 2], in which each detected box is refined to produce a segmentation. These methods do not consider the entire image but rather independent proposals, and as a result cannot handle occlusions between different objects. Moreover, since they start from initial detections, they cannot recover from false detections, motivating an approach that reasons globally.

A key aspect of our approach is to leverage hierarchical segmentation trees [3] to sample potential object instances. To this end, we propose a new bottom-up approach that parses the regions of a hierarchical region tree. At the core of our approach lies a Convolutional Tree-LSTM module that estimates the energies of the regions, taking into account the entire image and tracking temporal relations across regions at different levels of the tree. Unlike MCG [4], which uses hand-engineered features to generate object candidates, we exploit the rich features learnt by Convolutional Neural Networks to sample object instances. Furthermore, MCG relies on a complex pipeline of proposal generation and ranking; the resulting system is very slow, taking more than 9.9 s for candidate generation alone. Ours, on the other hand, is trained end-to-end and takes 0.06 s per image on average at test time.

Our paper is organized as follows. We begin by reviewing related work in Sect. 2. In Sect. 3 we describe the details of our approach. In Sect. 4 we delve into implementation details. We investigate the performance of our method both qualitatively and quantitatively in Sect. 5. Finally, we conclude in Sect. 6.

2 Related Work

Our work is closely related to bottom-up methods exploiting superpixels [5]. Pham et al. [6] proposed a dynamic-programming-based approach to image segmentation that constructs a hierarchical segmentation tree; a unified energy function jointly quantifies geometric goodness-of-fit and an objectness measure, and a top-down traversal through the tree, comparing the energies of the current node and its subtree, yields an optimal tree cut. Kirillov et al. [7] impose a graph structure on the superpixels and formulate instance estimation as a MultiCut problem. One limitation of this method, however, is that it cannot find instances that are formed by disconnected regions in the image. Unlike these methods, by training our model end-to-end we can find such instances, as discussed in Sect. 6.

3 Method

Given an input image \(\mathcal {I}\), our goal is to segment the image into semantically meaningful, non-overlapping regions. Figure 1 depicts an overview of our method. Henceforth, we adopt the following notation. For a given \(\mathcal {I}\), let \(\mathcal {T}\), \(L = \{1, 2, \dots , l_{max}\}\), \(\mathcal {R} = \{r_1, r_2, \dots , r_N\}\), \(\mathcal {F} = \{F_{r_{1}}, F_{r_{2}}, \dots , F_{r_{N}}\}\) and \(\mathcal {C} = \{C_{r_{1}}, C_{r_{2}}, \dots , C_{r_{N}}\}\) denote the hierarchical tree, the set of distinct levels, the set of regions in the tree, the corresponding region features and the children of the regions in the tree, respectively. For each level \(0 < l \le l_{max}\), we denote the set of regions, the corresponding features and the threshold at this level by \(\mathcal {R}_l = \{r^l_1, r^l_2, \dots , r^l_{N_l}\} \subseteq \mathcal {R}\), \(\mathcal {F}_l = \{F_{r^l_{1}}, F_{r^l_{2}}, \dots , F_{r^l_{N_l}}\}\) and \(\alpha _l\), respectively. A tree cut at a level \(l'\) with horizontal cut-threshold \(\lambda _{cut} = \alpha _{l'}\) results in a new set of levels \(L' = \{l \mid l \ge l'\}\).
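
For concreteness, the tree bookkeeping can be sketched as follows (a minimal Python sketch; the Region class and its field names are hypothetical illustrations, not part of our implementation):

    from dataclasses import dataclass, field

    @dataclass
    class Region:
        """A node r of the hierarchical tree T (hypothetical structure)."""
        level: int                                    # level l in {1, ..., l_max}
        children: list = field(default_factory=list)  # C_r; empty for leaf regions

    def horizontal_cut(regions, l_prime):
        # A cut at threshold lambda_cut = alpha_{l'} keeps exactly the levels
        # L' = {l | l >= l'}; retained nodes whose children fall below the
        # cut then act as the new leaves.
        return [r for r in regions if r.level >= l_prime]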

Fig. 1. Overview of our method. We (1) construct a hierarchical region tree using the Ultrametric Contour Map (UCM), (2) estimate the energies of each region in the tree, starting from level 1 at the bottom all the way to the top, and (3) threshold the regions based on the energies.

3.1 Feature Extraction

We first extract features \(\mathbf {F}\) by passing the input image \(\mathcal {I}\) through a series of convolutions. For a given region \(r \in \mathcal {R}\) in the tree, we generate the tightest bounding box \(b_r\) covering the non-linear boundary of r. We then extract a feature map \(F_{r}^{*}\) of fixed spatial dimensions (e.g., \(7 \times 7\)) from \(\mathbf {F}\) corresponding to \(b_r\). Our approach to extracting \(F_{r}^{*}\) is similar to the ROIAlign layer [1]. Additionally, we mask out the features corresponding to the region \(b_r \setminus r\), giving rise to the final feature map \(F_r\).
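
A minimal sketch of this masked extraction, assuming a PyTorch backbone and torchvision's roi_align (the function and variable names are illustrative, and the spatial_scale of 1/8 is our assumption for the backbone stride):

    import torch
    import torch.nn.functional as nnf
    from torchvision.ops import roi_align

    def extract_region_features(F, region_mask, spatial_scale=1 / 8., out_size=7):
        """F: backbone features (1, C, H', W'); region_mask: (H, W) bool mask
        of region r at image resolution. Returns the masked feature map F_r."""
        ys, xs = torch.nonzero(region_mask, as_tuple=True)
        # Tightest box b_r covering the non-linear boundary of r.
        x1, y1, x2, y2 = xs.min(), ys.min(), xs.max(), ys.max()
        box = torch.tensor([[0., float(x1), float(y1), float(x2), float(y2)]])
        # ROIAlign-style crop of F to a fixed out_size x out_size grid.
        F_star = roi_align(F, box, output_size=(out_size, out_size),
                           spatial_scale=spatial_scale, aligned=True)
        # Mask out the features in b_r \ r (nearest-neighbour resized mask).
        m = region_mask[y1:y2 + 1, x1:x2 + 1][None, None].float()
        m = nnf.interpolate(m, size=(out_size, out_size))
        return F_star * m  # F_r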

3.2 Convolutional Tree-LSTM Module

The motivation behind this module is to estimate how the probability distribution over the categories changes when a new region is merged into the region under consideration at subsequent levels. The model implicitly learns the temporal relations that lead to the formation of a given region.

We process the hierarchical tree \(\mathcal {T}\) with the Convolutional Tree-LSTM, starting from the level \(l'\) that corresponds to the initial cut-threshold \(\lambda _{cut} = \alpha _{l'}\), and predict softmax probabilities for each region \(r \in \mathcal {R}_l\) at every level \(l \in L'\) in order. The input to the LSTM at each level l is the feature set \(\mathcal {F}_l\). Equations 1–7 summarize the forward propagation through the LSTM module. For the jth region at level l,

$$\begin{aligned} {\tilde{h^l_j}}&= \sum _{k \in C_{r^l_{j}}}h^l_k, \end{aligned}$$
(1)
$$\begin{aligned} i^l_j&= \sigma (W^i * F_{r^l_{j}} + U^i * {\tilde{h^l_j}} + b^i), \end{aligned}$$
(2)
$$\begin{aligned} f^l_{jk}&= \sigma (W^f * F_{r^l_{j}} + U^f * h^l_k + b^f) \quad \forall k \in C_{r^l_{j}}, \end{aligned}$$
(3)
$$\begin{aligned} o^l_j&= \sigma (W^o * F_{r^l_{j}} + U^o * {\tilde{h^l_j}} + b^o), \end{aligned}$$
(4)
$$\begin{aligned} u^l_j&= \tanh {(W^u * F_{r^l_{j}} + U^u * {\tilde{h^l_j}} + b^u)}, \end{aligned}$$
(5)
$$\begin{aligned} c^l_j&= i^l_j \odot u^l_j + \sum _{k \in C_{r^l_{j}}}f^l_{jk} \odot c^l_k, \end{aligned}$$
(6)
$$\begin{aligned} h^l_j&= o^l_j \odot \tanh {(c^l_j)}, \end{aligned}$$
(7)

where \(*\) and \(\odot \) denote the convolution operation and the Hadamard product, respectively. We repeat the above for each region j and for all \(l \in L'\). For a region j at level l, the child states \(c^l_k, h^l_k\ \forall k \in C_{r^l_{j}}\) are initialized to zeros if the corresponding children are leaves of the tree; for all other regions, \(c^l_k\) and \(h^l_k\) are given by Eqs. 6 and 7, respectively. Figure 2 analyzes how the sequence length and the number of regions considered vary for different horizontal cuts.
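
The recurrence in Eqs. 1–7 amounts to a child-sum Tree-LSTM cell with convolutional gates. Below is a minimal PyTorch sketch (the class and argument names are ours, not from the implementation; the four \(W\) convolutions are fused into one for brevity):

    import torch
    import torch.nn as nn

    class ConvTreeLSTMCell(nn.Module):
        """Child-sum Tree-LSTM with convolutional gates (Eqs. 1-7); a sketch."""
        def __init__(self, feat_ch, hid_ch, k=3):
            super().__init__()
            p = k // 2
            self.W = nn.Conv2d(feat_ch, 4 * hid_ch, k, padding=p)  # acts on F_r
            self.U_iou = nn.Conv2d(hid_ch, 3 * hid_ch, k, padding=p, bias=False)
            self.U_f = nn.Conv2d(hid_ch, hid_ch, k, padding=p, bias=False)

        def forward(self, F_r, child_h, child_c):
            # F_r: (1, feat_ch, 7, 7); child_h, child_c: lists over C_{r_j^l}
            h_tilde = torch.stack(child_h).sum(0)                         # Eq. 1
            w_i, w_f, w_o, w_u = self.W(F_r).chunk(4, dim=1)
            u_i, u_o, u_u = self.U_iou(h_tilde).chunk(3, dim=1)
            i = torch.sigmoid(w_i + u_i)                                  # Eq. 2
            f = [torch.sigmoid(w_f + self.U_f(h_k)) for h_k in child_h]   # Eq. 3
            o = torch.sigmoid(w_o + u_o)                                  # Eq. 4
            u = torch.tanh(w_u + u_u)                                     # Eq. 5
            c = i * u + sum(f_k * c_k for f_k, c_k in zip(f, child_c))    # Eq. 6
            h = o * torch.tanh(c)                                         # Eq. 7
            return h, c

For children that are leaves of the tree, the caller supplies zero tensors for \(h^l_k\) and \(c^l_k\), matching the initialization described above.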

On top of the LSTM module, we apply a series of convolutions and fully connected layers that take \(h^l_j\) as input and predict class probabilities.

Fig. 2. Variation of the number of regions considered and the sequence length for different initial horizontal cut thresholds.

3.3 Objective Formulation

For a given image \(\mathcal {I}\), let \(\mathcal {M} = \{m_1, m_2, \dots , m_M\}\) and \(L^G = \{l_1, l_2, \dots , l_M\}\) be the sets of ground-truth masks and one-hot labels, respectively. For each mask \(m_i\), we construct the positive set \(\mathcal {P}^+_i = \{p^i_1, p^i_2, \dots , p^i_{N_{i}}\}\), which consists of the predicted probabilities of regions from \(\mathcal {R}\) whose IoU with \(m_i\) is greater than \(\lambda _+\). Similarly, we construct \(\mathcal {P^-} = \{p^-_1, p^-_2, \dots , p^-_{N_{-}}\}\), consisting of the probabilities of regions from \(\mathcal {R}\) whose IoU with every \(m_i\) is less than \(\lambda _-\). We then formulate the loss as follows,

$$\begin{aligned} \mathcal {L} = -\frac{1}{M}\sum _{i=1}^{M}\sum _{r=1}^{|\mathcal {P}_{i}^{+}|} l^T_i\log (p_{r}^{i}) - \lambda \sum _{r=1}^{|\mathcal {P^-}|}\sum _{c=1}^{C}I_{c}^{b}\log (p_r^-), \end{aligned}$$
(8)

where \(I^b_c\) is 1 if class c corresponds to the background label b, and T denotes the vector transpose. The hyperparameter \(\lambda \) in Eq. 8 controls the balance between the positive and negative regions.
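
A sketch of Eq. 8 in PyTorch (the tensor layouts and argument names are assumptions for illustration):

    import torch

    def region_loss(pos_probs, pos_labels, neg_probs, bg_class, lam=0.2):
        """Eq. 8 (sketch). pos_probs[i]: (N_i, C) softmax probabilities of
        regions matched to ground-truth mask m_i; pos_labels[i]: class index
        of m_i; neg_probs: (N_-, C) probabilities of the negative regions."""
        M = len(pos_probs)
        # Positive term: cross-entropy of the matched regions against l_i.
        pos = sum(-torch.log(p[:, l]).sum() for p, l in zip(pos_probs, pos_labels))
        # Negative term: the indicator I_c^b selects the background class b.
        neg = -torch.log(neg_probs[:, bg_class]).sum()
        return pos / M + lam * neg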

4 Implementation Details

4.1 Network Architecture

We use the pre-trained COB network, a ResNet50 model, for estimating contours. Features \(\mathbf {F}\) are extracted from the res3 layer of the ResNet50 model and have a spatial resolution of \(28\times 28\). ROIAlign extracts features with a fixed spatial resolution of \(7\times 7\). All the convolutions within the LSTM have a kernel size of \(3\times 3\), stride 1 and zero-padding. On top of the convolutional LSTM, we have two \(3\times 3\) convolutions and two fully connected layers that predict the softmax probabilities.
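
Under the dimensions above, the prediction head could look as follows (the hidden channel count and FC width are assumptions; the paper specifies only the number and kernel size of the layers):

    import torch.nn as nn

    hid_ch, num_classes = 256, 21  # assumed width; 20 VOC classes + background
    head = nn.Sequential(
        nn.Conv2d(hid_ch, hid_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(hid_ch, hid_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Flatten(),                     # 7 x 7 spatial grid from ROIAlign
        nn.Linear(hid_ch * 7 * 7, 1024), nn.ReLU(inplace=True),
        nn.Linear(1024, num_classes),     # softmax applied on the output
    )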

4.2 Training Details

We set the parameters \(\lambda _+\), \(\lambda _-\) and \(\lambda \) to 0.7, 0.3 and 0.2, respectively, in all our experiments. We train the Convolutional LSTM and the subsequent layers from scratch with a batch size of 1 and an initial learning rate of 0.001, which we decay by a factor of 0.1 after every 20 epochs. We experiment with initial cut-thresholds ranging from \(\lambda _{cut} = 0.3\) to \(\lambda _{cut} = 0.9\) in steps of 0.1.
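
This schedule translates directly into PyTorch (the choice of SGD with momentum is our assumption; the paper fixes only the batch size, learning rate and decay):

    import torch
    import torch.nn as nn

    model = nn.Linear(1, 1)  # stand-in for the Tree-LSTM and head sketched above
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    # Decay the learning rate by a factor of 0.1 after every 20 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)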

Table 1. Time taken to process a single image, in seconds.

Fig. 3. Precision-recall curves for all categories on the VOC 2012 val set.

5 Experiments

We use the pretrained COB network, trained on the PASCAL Context dataset, to predict contours. We train our Convolutional Tree-LSTM and the subsequent layers on the PASCAL VOC 2012 dataset. We evaluate our model on the PASCAL VOC 2012 val set, using average precision, the Jaccard Index and the time taken to process an image as evaluation metrics. Table 1 compares the time taken to process a single image by different methods. Figure 3 shows the precision-recall curves for all the classes.

Table 2. Variation of average precision for different tree cut thresholds.

On the VOC 2012 val set, our best-performing model scores 48% mAP. Our model struggles on categories such as bicycle and chair; on categories such as train and plane, however, it achieves markedly higher performance. Table 2 summarizes the average precision for all the categories. We further compare the Jaccard Index with MCG; the results are presented in Table 3, with qualitative results shown in Fig. 4.

Table 3. Comparison of the Jaccard Index for varying numbers of regions considered from the tree (N and std denote the number of regions considered and the standard deviation, respectively).

Fig. 4. Qualitative results on the VOC 2012 val set.

6 Conclusions

We proposed a unique bottom-up approach to instance segmentation that overcomes limitations of the current bottom-up and top-down approaches. Our method produces competitive results with a good trade-off between segmentation accuracy and processing time. We would like to further investigate an end-to-end network that predicts contours in tandem with the estimation of region energies. This would lead to the prediction of semantically accurate contours, resulting in a high-quality hierarchical region tree that further aids the estimation of energies.