1 Introduction

In this work we address the task of instance segmentation, which involves segmenting each individual instance of a semantic class in an image. Many top-down approaches to this problem are built on object detection pipelines [1, 2], in which each detected box is refined to produce a segmentation. These methods do not consider the entire image but rather independent proposals, and as a result cannot handle occlusions between different objects. Moreover, since they start from initial detections, they cannot recover from false detections, motivating an approach that reasons globally.

A key aspect of our approach is to leverage hierarchical segmentation trees [3] to sample potential object instances. To this end, we propose a new bottom-up approach that parses the regions of a hierarchical region tree. At the core of our approach lies a Convolutional Tree-LSTM module that estimates the energies of the regions, taking into account the entire image and tracking temporal relations across regions at different levels of the tree. Unlike MCG [4], which uses hand-engineered features to generate object candidates, we exploit the rich features learnt by Convolutional Neural Networks to sample object instances. Furthermore, MCG relies on a complex pipeline of proposal generation and ranking; the resulting system is very slow, taking more than 9.9 s for candidate generation alone. Ours, on the other hand, is trained end-to-end and takes 0.06 s per image on average at test time.

Our paper is organized as follows. We begin by reviewing related work in Sect. 2. In Sect. 3 we describe the details of our approach. In Sect. 4 we delve into implementation details. We investigate the performance of our method both qualitatively and quantitatively in Sect. 5. Finally, we conclude in Sect. 6.

2 Related Work

Our work is closely related to bottom-up methods exploiting superpixels [5]. Pham et al. [6] proposed a dynamic-programming-based approach to image segmentation that constructs a hierarchical segmentation tree; a unified energy function jointly quantifies geometric goodness-of-fit and an objectness measure, and a top-down traversal through the tree, comparing the energies of the current node and its subtree, yields an optimal tree cut. Kirillov et al. [7] impose a graph structure on the superpixels and formulate instance estimation as a MultiCut problem. One limitation of this method, however, is that it cannot find instances that are formed by disconnected regions in the image. Unlike these methods, by training our model end-to-end we can find such instances, as discussed in Sect. 6.

3 Method

Given an input image \(\mathcal {I}\), our goal is to segment the image into semantically meaningful, non-overlapping regions. Figure 1 depicts an overview of our method. Henceforth, we adopt the following notation. For a given \(\mathcal {I}\), let \(\mathcal {T}\), \(L = \{1, 2, \dots , l_{max}\}\), \(\mathcal {R} = \{r_1, r_2, \dots , r_N\}\), \(\mathcal {F} = \{F_{r_{1}}, F_{r_{2}}, \dots , F_{r_{N}}\}\) and \(\mathcal {C} = \{C_{r_{1}}, C_{r_{2}}, \dots , C_{r_{N}}\}\) denote the hierarchical tree, the set of distinct levels, the set of regions in the tree, the corresponding region features and the children of the regions in the tree, respectively. For each level \(0 < l \le l_{max}\), we denote the set of regions, the corresponding features and the threshold at this level by \(\mathcal {R}_l = \{r^l_1, r^l_2, \dots , r^l_{N_l}\} \subseteq \mathcal {R}\), \(\mathcal {F}_l = \{F_{r^l_{1}}, F_{r^l_{2}}, \dots , F_{r^l_{N_l}}\}\) and \(\alpha _l\), respectively. A tree cut at a level \(l'\) with horizontal cut-threshold \(\lambda _{cut} = \alpha _{l'}\) results in a new set of levels \(L' = \{l \mid l \ge l'\}\).
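
For concreteness, the tree bookkeeping can be sketched as follows (a minimal Python sketch; the Region class and its field names are hypothetical illustrations, not part of our implementation):

    from dataclasses import dataclass, field

    @dataclass
    class Region:
        """A node r of the hierarchical tree T (hypothetical structure)."""
        level: int                                    # level l in {1, ..., l_max}
        children: list = field(default_factory=list)  # C_r; empty for leaf regions

    def horizontal_cut(regions, l_prime):
        # A cut at threshold lambda_cut = alpha_{l'} keeps exactly the levels
        # L' = {l | l >= l'}; retained nodes whose children fall below the
        # cut then act as the new leaves.
        return [r for r in regions if r.level >= l_prime]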

Fig. 1. Overview of our method. We (1) construct a hierarchical region tree using the Ultrametric Contour Map (UCM), (2) estimate the energies of each region in the tree, starting from level 1 at the bottom all the way to the top, and (3) threshold the regions based on the energies.

3.1 Feature Extraction

We first extract features \(\mathbf {F}\) by passing the input image \(\mathcal {I}\) through a series of convolutions. For a given region \(r \in \mathcal {R}\) in the tree, we generate the tightest bounding box \(b_r\) covering the non-linear boundary of r. We then extract a feature map \(F_{r}^{*}\) of fixed spatial dimensions (e.g., \(7 \times 7\)) from \(\mathbf {F}\) corresponding to \(b_r\). Our approach to extracting \(F_{r}^{*}\) is similar to the ROIAlign layer [1]. Additionally, we mask out the features corresponding to the region \(b_r \setminus r\), giving rise to the final feature map \(F_r\).
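
A minimal sketch of this masked extraction, assuming a PyTorch backbone and torchvision's roi_align (the function and variable names are illustrative, and the spatial_scale of 1/8 is our assumption for the backbone stride):

    import torch
    import torch.nn.functional as nnf
    from torchvision.ops import roi_align

    def extract_region_features(F, region_mask, spatial_scale=1 / 8., out_size=7):
        """F: backbone features (1, C, H', W'); region_mask: (H, W) bool mask
        of region r at image resolution. Returns the masked feature map F_r."""
        ys, xs = torch.nonzero(region_mask, as_tuple=True)
        # Tightest box b_r covering the non-linear boundary of r.
        x1, y1, x2, y2 = xs.min(), ys.min(), xs.max(), ys.max()
        box = torch.tensor([[0., float(x1), float(y1), float(x2), float(y2)]])
        # ROIAlign-style crop of F to a fixed out_size x out_size grid.
        F_star = roi_align(F, box, output_size=(out_size, out_size),
                           spatial_scale=spatial_scale, aligned=True)
        # Mask out the features in b_r \ r (nearest-neighbour resized mask).
        m = region_mask[y1:y2 + 1, x1:x2 + 1][None, None].float()
        m = nnf.interpolate(m, size=(out_size, out_size))
        return F_star * m  # F_r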

3.2 Convolutional Tree-LSTM Module

The motivation behind this module is to estimate how the probability distribution over the categories changes when a new region is merged into the region under consideration at subsequent levels. The model implicitly learns the temporal relations that lead to the formation of a given region.

We process the hierarchical tree \(\mathcal {T}\) with the Convolutional Tree-LSTM, starting from the level \(l'\) that corresponds to the initial cut-threshold \(\lambda _{cut} = \alpha _{l'}\), and predict softmax probabilities for each region \(r \in \mathcal {R}_l\) at every level \(l \in L'\) in order. The input to the LSTM at each level l is the feature set \(\mathcal {F}_l\). Equations 1–7 summarize the forward propagation through the LSTM module. For the jth region at level l,

$$\begin{aligned} {\tilde{h^l_j}}&= \sum _{k \in C_{r^l_{j}}}h^l_k, \end{aligned}$$
(1)
$$\begin{aligned} i^l_j&= \sigma (W^i * F_{r^l_{j}} + U^i * {\tilde{h^l_j}} + b^i), \end{aligned}$$
(2)
$$\begin{aligned} f^l_{jk}&= \sigma (W^f * F_{r^l_{j}} + U^f * h^l_k + b^f) \quad \forall k \in C_{r^l_{j}}, \end{aligned}$$
(3)
$$\begin{aligned} o^l_j&= \sigma (W^o * F_{r^l_{j}} + U^o * {\tilde{h^l_j}} + b^o), \end{aligned}$$
(4)
$$\begin{aligned} u^l_j&= \tanh {(W^u * F_{r^l_{j}} + U^u * {\tilde{h^l_j}} + b^u)}, \end{aligned}$$
(5)
$$\begin{aligned} c^l_j&= i^l_j \odot u^l_j + \sum _{k \in C_{r^l_{j}}}f^l_{jk} \odot c^l_k, \end{aligned}$$
(6)
$$\begin{aligned} h^l_j&= o^l_j \odot \tanh {(c^l_j)}, \end{aligned}$$
(7)

where \(*\) and \(\odot \) denote the convolution operation and the Hadamard product, respectively. We repeat the above for each region j and for all \(l \in L'\). For a region j at level l, the child states \(c^l_k, h^l_k\ \forall k \in C_{r^l_{j}}\) are initialized to zeros if the corresponding children are leaves of the tree; for all other regions, \(c^l_k\) and \(h^l_k\) are given by Eqs. 6 and 7, respectively. Figure 2 analyzes how the sequence length and the number of regions considered vary for different horizontal cuts.
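
The recurrence in Eqs. 1–7 amounts to a child-sum Tree-LSTM cell with convolutional gates. Below is a minimal PyTorch sketch (the class and argument names are ours, not from the implementation; the four \(W\) convolutions are fused into one for brevity):

    import torch
    import torch.nn as nn

    class ConvTreeLSTMCell(nn.Module):
        """Child-sum Tree-LSTM with convolutional gates (Eqs. 1-7); a sketch."""
        def __init__(self, feat_ch, hid_ch, k=3):
            super().__init__()
            p = k // 2
            self.W = nn.Conv2d(feat_ch, 4 * hid_ch, k, padding=p)  # acts on F_r
            self.U_iou = nn.Conv2d(hid_ch, 3 * hid_ch, k, padding=p, bias=False)
            self.U_f = nn.Conv2d(hid_ch, hid_ch, k, padding=p, bias=False)

        def forward(self, F_r, child_h, child_c):
            # F_r: (1, feat_ch, 7, 7); child_h, child_c: lists over C_{r_j^l}
            h_tilde = torch.stack(child_h).sum(0)                         # Eq. 1
            w_i, w_f, w_o, w_u = self.W(F_r).chunk(4, dim=1)
            u_i, u_o, u_u = self.U_iou(h_tilde).chunk(3, dim=1)
            i = torch.sigmoid(w_i + u_i)                                  # Eq. 2
            f = [torch.sigmoid(w_f + self.U_f(h_k)) for h_k in child_h]   # Eq. 3
            o = torch.sigmoid(w_o + u_o)                                  # Eq. 4
            u = torch.tanh(w_u + u_u)                                     # Eq. 5
            c = i * u + sum(f_k * c_k for f_k, c_k in zip(f, child_c))    # Eq. 6
            h = o * torch.tanh(c)                                         # Eq. 7
            return h, c

For children that are leaves of the tree, the caller supplies zero tensors for \(h^l_k\) and \(c^l_k\), matching the initialization described above.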

On top of the LSTM module, we apply a series of convolutions and fully connected layers that take \(h^l_j\) as input and predict class probabilities.

Fig. 2. Variation of the number of regions considered and the sequence length for different initial horizontal cut thresholds.

3.3 Objective Formulation

For a given image \(\mathcal {I}\), let \(\mathcal {M} = \{m_1, m_2, \dots , m_M\}\) and \(L^G = \{l_1, l_2, \dots , l_M\}\) be the sets of ground-truth masks and one-hot labels, respectively. For each mask \(m_i\), we construct the positive set \(\mathcal {P}^+_i = \{p^i_1, p^i_2, \dots , p^i_{N_{i}}\}\), which consists of the predicted probabilities of regions from \(\mathcal {R}\) whose IoU with \(m_i\) is greater than \(\lambda _+\). Similarly, we construct \(\mathcal {P^-} = \{p^-_1, p^-_2, \dots , p^-_{N_{-}}\}\), consisting of the probabilities of regions from \(\mathcal {R}\) whose IoU with every \(m_i\) is less than \(\lambda _-\). We then formulate the loss as follows,

$$\begin{aligned} \mathcal {L} = -\frac{1}{M}\sum _{i=1}^{M}\sum _{r=1}^{|\mathcal {P}_{i}^{+}|} l^T_i\log (p_{r}^{i}) - \lambda \sum _{r=1}^{|\mathcal {P^-}|}\sum _{c=1}^{C}I_{c}^{b}\log (p_r^-), \end{aligned}$$
(8)

where \(I^b_c\) is 1 if class c corresponds to the background label b, and T denotes the vector transpose. The hyperparameter \(\lambda \) in Eq. 8 controls the balance between the positive and negative regions.
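
A sketch of Eq. 8 in PyTorch (the tensor layouts and argument names are assumptions for illustration):

    import torch

    def region_loss(pos_probs, pos_labels, neg_probs, bg_class, lam=0.2):
        """Eq. 8 (sketch). pos_probs[i]: (N_i, C) softmax probabilities of
        regions matched to ground-truth mask m_i; pos_labels[i]: class index
        of m_i; neg_probs: (N_-, C) probabilities of the negative regions."""
        M = len(pos_probs)
        # Positive term: cross-entropy of the matched regions against l_i.
        pos = sum(-torch.log(p[:, l]).sum() for p, l in zip(pos_probs, pos_labels))
        # Negative term: the indicator I_c^b selects the background class b.
        neg = -torch.log(neg_probs[:, bg_class]).sum()
        return pos / M + lam * neg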

4 Implementation Details

4.1 Network Architecture

We use the pre-trained COB network, a ResNet50 model, for estimating contours. Features \(\mathbf {F}\) are extracted from the res3 layer of the ResNet50 model and have a spatial resolution of \(28\times 28\). ROIAlign extracts features with a fixed spatial resolution of \(7\times 7\). All the convolutions within the LSTM have a kernel size of \(3\times 3\), stride 1 and zero-padding. On top of the convolutional LSTM, we have two \(3\times 3\) convolutions and two fully connected layers that predict the softmax probabilities.
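
Under the dimensions above, the prediction head could look as follows (the hidden channel count and FC width are assumptions; the paper specifies only the number and kernel size of the layers):

    import torch.nn as nn

    hid_ch, num_classes = 256, 21  # assumed width; 20 VOC classes + background
    head = nn.Sequential(
        nn.Conv2d(hid_ch, hid_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(hid_ch, hid_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Flatten(),                     # 7 x 7 spatial grid from ROIAlign
        nn.Linear(hid_ch * 7 * 7, 1024), nn.ReLU(inplace=True),
        nn.Linear(1024, num_classes),     # softmax applied on the output
    )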

4.2 Training Details

We set the parameters \(\lambda _+\), \(\lambda _-\) and \(\lambda \) to 0.7, 0.3 and 0.2, respectively, in all our experiments. We train the Convolutional LSTM and the subsequent layers from scratch with a batch size of 1 and an initial learning rate of 0.001, which we decay by a factor of 0.1 after every 20 epochs. We experiment with initial cut-thresholds ranging from \(\lambda _{cut} = 0.3\) to \(\lambda _{cut} = 0.9\) in steps of 0.1.
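
This schedule translates directly into PyTorch (the choice of SGD with momentum is our assumption; the paper fixes only the batch size, learning rate and decay):

    import torch
    import torch.nn as nn

    model = nn.Linear(1, 1)  # stand-in for the Tree-LSTM and head sketched above
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    # Decay the learning rate by a factor of 0.1 after every 20 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)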

Table 1. Time taken to process a single image, in seconds.

Fig. 3. Precision-recall curves for all categories on the VOC 2012 val set.

5 Experiments

We use the pretrained COB network, trained on the PASCAL Context dataset, to predict contours. We train our Convolutional Tree-LSTM and the subsequent layers on the PASCAL VOC 2012 dataset. We evaluate our model on the PASCAL VOC 2012 val set, using average precision, the Jaccard Index and the time taken to process an image as evaluation metrics. Table 1 compares the time taken to process a single image by different methods. Figure 3 shows the precision-recall curves for all the classes.

Table 2. Variation of average precision for different tree cut thresholds.

On the VOC 2012 val set, our best-performing model scores 48% mAP. Our model struggles on categories such as bicycle and chair; on categories such as train and plane, however, it achieves markedly higher performance. Table 2 summarizes the average precision for all the categories. We further compare the Jaccard Index with MCG; the results are presented in Table 3, with qualitative results shown in Fig. 4.

Table 3. Comparison of the Jaccard Index for varying numbers of regions considered from the tree (N and std denote the number of regions considered and the standard deviation, respectively).

Fig. 4. Qualitative results on the VOC 2012 val set.

6 Conclusions

We proposed a unique bottom-up approach to instance segmentation that overcomes limitations of the current bottom-up and top-down approaches. Our method produces competitive results with a good trade-off between segmentation accuracy and processing time. We would like to further investigate an end-to-end network that predicts contours in tandem with the estimation of region energies. This would lead to the prediction of semantically accurate contours, resulting in a high-quality hierarchical region tree that further aids the estimation of energies.