1 Introduction

Scene perception and understanding refers to multiple detection tasks regarding the driving scene, such as object detection, free space detection, lane detection, traffic sign detection, and semantic and instance segmentation. Dedicated, high-complexity networks are employed for each of these tasks in order to achieve state of the art results. Recently, multi-task networks have been proposed to jointly predict multiple tasks in place of individual, specialized networks [16]. Using a single multi-task network reduces the computational burden and has the potential to achieve better performance than individual networks if the learned tasks are complementary [10].

In this paper we target an embedded architecture for use in autonomous driving systems – the NVIDIA Jetson TX2. Existing multi-task networks often rely on complex architectures that are unsuitable for embedded platforms due to their time and memory constraints. To achieve real-time inference on embedded platforms, unified models of lower complexity are used. Our proposed architecture is a unified, low-complexity model that jointly performs 4 tasks in real time – object detection, bottom point prediction, semantic segmentation and instance segmentation – achieving 19 FPS on the NVIDIA Jetson TX2 embedded platform. Together, the chosen tasks delineate the free drivable area around the car and identify the surrounding objects.

We leverage the low-complexity object detection and bottom point prediction network from our previous work [3] and add the semantic and instance segmentation tasks to the unified network. The network is built so that the additional tasks do not add significant overhead to the inference time. The unified network reuses features by having a common encoder backbone that subsequently splits into individual task branches. Each branch further refines the features and predicts the branch task. The unified network is shown in Fig. 1.

To maximize the amount of information that we can infer from a scene, we construct a wide-angle dataset that contains images captured with a fisheye camera with \(190^\circ \) FOV. The wide FOV allows the camera to capture more information from the surrounding area of the car. The dataset is discussed in detail in Sect. 4.

Fig. 1. Unified scene understanding network.

In summary, our contributions in this paper are:

  • We build on top of previous work and propose a low-complexity unified architecture that performs scene perception by jointly predicting 4 tasks: 5-class object detection, free space prediction through bottom point detection, semantic segmentation and instance segmentation.

  • We show that a unified network designed to share the computational burden and reuse layers can achieve real-time inference on an embedded platform even when predicting 4 complex tasks.

  • Our low-complexity unified network can achieve 19 FPS on the NVIDIA Jetson TX2 embedded platform.

2 Related Work

2.1 Object Detection

Object detection architectures can be divided into high-complexity, two-stage approaches and low-complexity, single-shot approaches. The two-stage methods first propose class-agnostic bounding boxes, which are fed to the second stage where they are refined and assigned class probabilities. Among the two-stage approaches we mention the state of the art PANet [13], which builds upon the Mask R-CNN [8] architecture by adding a top-down and bottom-up feature pyramid network (FPN) [12]. Two-stage architectures are computationally expensive and cannot be deployed on embedded devices. For platforms with limited resources, single-shot architectures are introduced in order to reduce both memory and compute requirements. SSD [14] is a single-shot architecture that predicts object bounding boxes with their associated classes at multiple scales. Single-shot multi-scale architectures achieve better results than single-shot single-scale architectures by decorrelating the predictions of different object scales.

2.2 Free Space Detection

Free space detection delimits the free drivable area. The task can be formulated in terms of semantic segmentation, where the road class is predicted. It can also be formulated as object bottom point prediction, in which the objects include the curb, the sidewalk and any other obstacles, so that the free space is defined as the area beneath those objects. Bottom point prediction refers to predicting, for each pixel column along the image width, the maximal height at which an obstacle exists [2, 3].

2.3 Semantic and Instance Segmentation

Semantic segmentation predicts a dense, per-pixel class association. Deeplabv3 [4] achieves state of the art segmentation results on the Cityscapes [6] dataset by using an FCN [15] to extract image features that are subsequently passed through an Atrous Spatial Pyramid Pooling (ASPP) module that applies parallel atrous convolutions with different rates in order to exploit the multi-scale contextual information. The image resolution is recovered from the downscaled feature map through bilinear upsampling. In Deeplabv3+ [5] the direct upsampling operation is replaced by a decoder module in which the resolution is gradually recovered. In order to obtain a sharper segmentation, the information from an earlier layer in the encoder is added to the decoder features. In our proposed model, we use a simplified version of the Deeplabv3+ decoder, in order to maintain a fast inference time.

Instance segmentation predicts a per-pixel instance association for each object instance in a scene. This task can be predicted alongside either semantic segmentation, since they are both per-pixel tasks, or object detection, in an attempt to better separate object instances and better fit the enclosing bounding boxes. Mask R-CNN and PANet achieve state of the art results on the COCO dataset for the object detection and instance segmentation tasks. With these architectures, the instance segmentation task is simplified on account of the two-stage approach: for each predicted bounding box, a reduced-resolution mask is predicted per class. Since the masks are computed on the ROIs, the task is simplified because there can only be one instance in each ROI. In contrast, for single-stage methods, where there are no ROIs, the network is required to predict the instance segmentation mask over the entire image. This task is arguably harder since the number of instances in an image is not known beforehand, and a convolutional network is unable to count. To bypass the counting problem, the task is often reformulated in one of two ways. One formulation is to predict per-pixel vectors that point towards the instance center [10]: the network predicts per-pixel 2-D vectors and extracts the final instances in a post-processing step using the OPTICS clustering algorithm [1]. Another formulation transforms the instance segmentation task into a semantic segmentation task in which instances are colored [11]: the segmentation network assigns a constant number of labels – colors – to an unknown number of instances, and the ground truth coloring is assigned at training time and can be viewed as being part of the network loss.

2.4 Multi-task Networks

Task-specific networks are employed to achieve state of the art results when resources are not limited. Multi-task networks reuse features among multiple tasks so as to reduce the computational burden. Additionally, if the learned tasks are similar and the training procedure is adequate, multi-task networks can achieve better results than their independent counterparts [10].

Existing multi-task networks commonly predict a combination of object detection, semantic or instance segmentation and depth prediction. MultiNet [16] is a multi-task network that predicts vehicle object detection, road segmentation and street classification. The multi-task network in [10] predicts semantic segmentation, instance segmentation and depth. We find that existing networks do not provide a complete view of the car’s surroundings and of the available drivable area. To this end, our proposed model predicts 5-class object detection (car, bus, truck, pedestrian and *-cycle), bottom point prediction, semantic segmentation and instance segmentation.

3 Low-Complexity Unified Scene Understanding Network

Our proposed scene understanding model jointly predicts 4 tasks – object detection, bottom point prediction, semantic segmentation and instance segmentation – using a lightweight architecture suitable for embedded devices. We build on our previous work on low-complexity object detection and bottom point prediction and augment our unified network with the semantic and instance segmentation tasks. Multi-task networks share the same architectural design: a common encoder whose features are reused by all tasks, followed by individual encoder and decoder layers for each task. Branching out at later layers in the encoder allows the sub-networks to share computation and reuse features, while branching out at earlier layers allows them to specialize.

To reduce the inference time, we share not only the common encoder between all tasks: for the segmentation tasks (semantic and instance) we also share the task-specific encoder and decoder, leaving only the predictors task specific. This architectural choice provides a significant gain in inference time at a small accuracy cost for the segmentation tasks. The common encoder is MobileNetV1 [9] up to the Conv5 layer. Through experiments, we have found that splitting at the Conv5 layer offers the best trade-off between accuracy and inference time. Each task keeps an independent copy of the remaining encoder layers (Conv5-Conv11), except for the segmentation tasks, which share these layers as well as the decoder. The task predictors are independent. In Fig. 1 we show our proposed unified model with its task-specific predictions. We illustrate the object detection predictions with bounding boxes colored according to the predicted class. We use the same colors to overlay the semantic segmentation predictions on the input image. For the bottom point visualization we color the free space area delimited by the predicted bottom points. For instance segmentation we show the final extracted object instances.
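To make the branching scheme concrete, the following is a minimal structural sketch in TensorFlow 1.x (the framework mentioned in Sect. 5). The stand-in conv_block, the filter counts and the layer names are illustrative placeholders, not the exact MobileNetV1 configuration used in the paper.

```python
import tensorflow as tf

def conv_block(x, filters, stride=1, name=None):
    # Stand-in for a MobileNetV1 depthwise-separable unit (simplified for this sketch).
    return tf.layers.separable_conv2d(x, filters, 3, strides=stride, padding='same',
                                      activation=tf.nn.relu, name=name)

def unified_backbone(image):
    # Shared trunk, standing in for MobileNetV1 Conv1..Conv5 (roughly 1/8 resolution).
    x = conv_block(image, 32, stride=2, name='conv1')
    x = conv_block(x, 64, stride=2, name='conv3')
    shared = conv_block(x, 128, stride=2, name='conv5')

    # Detection and bottom-point branches keep independent copies of the remaining
    # encoder layers (Conv5..Conv11 in the paper), while the two segmentation tasks
    # share a single encoder tail (and, further on, a single decoder).
    det_feat = conv_block(conv_block(shared, 256, 2, 'det_conv6'), 256, 1, 'det_conv11')
    bot_feat = conv_block(conv_block(shared, 256, 2, 'bot_conv6'), 256, 1, 'bot_conv11')
    seg_feat = conv_block(conv_block(shared, 256, 2, 'seg_conv6'), 256, 1, 'seg_conv11')
    return det_feat, bot_feat, seg_feat

# Usage: image = tf.placeholder(tf.float32, [None, 360, 640, 3])
#        det_feat, bot_feat, seg_feat = unified_backbone(image)
```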

3.1 Object Detection

In order to achieve competitive results while maintaining a fast inference time, we adopt SSD as our object detection architecture, shown in isolation in Fig. 2. We predict objects at 6 scales, starting from Conv11 at 1/16 of the input resolution for the first scale, and add the remaining 5 scale layers on top of the encoder.
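As a rough illustration of the multi-scale prediction scheme, the sketch below attaches SSD-style box and class heads to a list of feature maps. The anchor count per location, the kernel sizes and the head layout are assumptions for illustration; the paper uses the standard SSD settings.

```python
import tensorflow as tf

def ssd_heads(feature_maps, num_classes=5, anchors_per_cell=6):
    # feature_maps: list of 6 tensors at decreasing resolutions (Conv11 plus 5 extra layers).
    box_preds, cls_preds = [], []
    for i, fmap in enumerate(feature_maps):
        batch = tf.shape(fmap)[0]
        # 4 box-regression offsets per anchor.
        box = tf.layers.conv2d(fmap, anchors_per_cell * 4, 3, padding='same',
                               name='box_head_%d' % i)
        # num_classes + 1 scores per anchor (including background).
        cls = tf.layers.conv2d(fmap, anchors_per_cell * (num_classes + 1), 3,
                               padding='same', name='cls_head_%d' % i)
        box_preds.append(tf.reshape(box, [batch, -1, 4]))
        cls_preds.append(tf.reshape(cls, [batch, -1, num_classes + 1]))
    # Concatenate predictions over all scales; matching to anchors happens in the loss.
    return tf.concat(box_preds, axis=1), tf.concat(cls_preds, axis=1)
```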

Fig. 2. Object detection network deploying SSD with MobileNetV1.

3.2 Bottom Point Prediction

We perform free space detection by predicting the obstacle bottom point for each column along the image width. The network can be viewed as a standard classification network over height (H) classes, where for each column we predict the height index of the obstacle bottom. The union of the bottom points over all image columns outlines the curb or the boundary of the free drivable area. In Fig. 3 the column-wise bottom point predictions are illustrated. Since the bottom point network predicts a height index for each column, the encoded features need to be decoded back to the required image resolution. The network’s final layer is a softmax classification layer that predicts the bottom height index for each column in the image width. After passing the image through the encoder, we recover the image resolution through the depth-to-space operation, as illustrated in Fig. 4. The depth-to-space operation rearranges the channels of the feature maps to reconstruct the height and width of the image.
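The following sketch illustrates how depth-to-space can recover the spatial resolution and how the per-column classification over H height indices can be expressed; the channel counts and the two-step upscaling schedule are illustrative assumptions rather than the exact configuration of Fig. 4.

```python
import tensorflow as tf

def bottom_point_head(encoder_features):
    # encoder_features: [N, H/16, W/16, C] feature map from the bottom-point branch.
    x = tf.layers.conv2d(encoder_features, 256, 1, padding='same', activation=tf.nn.relu)
    x = tf.nn.depth_to_space(x, block_size=4)   # trade channels for resolution: [N, H/4, W/4, 16]
    x = tf.layers.conv2d(x, 16, 3, padding='same', activation=tf.nn.relu)
    x = tf.nn.depth_to_space(x, block_size=4)   # roughly input resolution: [N, H, W, 1]
    logits = tf.squeeze(x, axis=3)              # [N, H, W] column-wise height scores
    logits = tf.transpose(logits, [0, 2, 1])    # [N, W, H]: H classes per image column
    return tf.nn.softmax(logits)                # per-column distribution over the bottom row index
```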

Fig. 3. Bottom point. Left: column-wise bottom prediction (red dot), Right: free space detection (red line). (Color figure online)

Fig. 4. Bottom point detection network deploying MobileNetV1.

3.3 Semantic and Instance Segmentation

The semantic and instance segmentation sub-networks are shown in Fig. 5. The two sub-networks share the entire encoder as well as the decoder to further reduce inference time. Similar to Deeplabv3+, we decode the features by bilinearly upsampling them from 1/16 to 1/8 of the input resolution. To recover object boundaries, we concatenate features from a lower encoder layer at the same resolution and apply a depthwise separable convolution to refine the result. The features produced by this slim decoder are then passed to the semantic segmentation predictor, and the resulting segmentation mask is bilinearly upsampled to the input image resolution in order to compute the loss. The predictors for semantic and instance segmentation are independent.
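A minimal sketch of the shared slim decoder and the two independent predictors is given below; the filter counts (48 and 128) and the 1x1 projection of the low-level features are assumptions borrowed from common Deeplabv3+ practice, not necessarily the paper's exact values.

```python
import tensorflow as tf

def slim_decoder(encoder_out, low_level_feat):
    # encoder_out: 1/16-resolution features; low_level_feat: 1/8-resolution encoder features.
    size = tf.shape(low_level_feat)[1:3]
    up = tf.image.resize_bilinear(encoder_out, size, align_corners=True)   # 1/16 -> 1/8
    low = tf.layers.conv2d(low_level_feat, 48, 1, padding='same', activation=tf.nn.relu)
    fused = tf.concat([up, low], axis=3)
    # One depthwise separable convolution refines the fused features.
    return tf.layers.separable_conv2d(fused, 128, 3, padding='same', activation=tf.nn.relu)

def segmentation_predictors(decoded, num_classes):
    # Independent 1x1 predictors on top of the shared decoder output.
    semantic_logits = tf.layers.conv2d(decoded, num_classes, 1)   # per-pixel class scores
    instance_vectors = tf.layers.conv2d(decoded, 2, 1)            # per-pixel (dx, dy) displacement
    return semantic_logits, instance_vectors
```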

Fig. 5. Semantic and instance segmentation network deploying the MobileNetV1 encoder and a slim decoder.

Following the vectors-to-center paradigm [10], we formulate the instance prediction task as predicting a per-pixel 2-D displacement vector: for each pixel in an instance we predict a 2-D vector pointing to the center of mass of the instance to which the pixel belongs. For each instance \(I\) with center of mass \(C(x_{center}, y_{center})\), adding the coordinates \((x_{i}, y_{i})\) of a pixel belonging to \(I\) to its predicted displacement vector \((dx_{i}, dy_{i})\) yields \(C\). These vectors can be visualized similarly to an optical flow field, where the color represents the vector orientation and the color opacity represents the vector magnitude. In Fig. 6, instance segmentation ground truths and predictions are illustrated.
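A small NumPy sketch of the center reconstruction described above; the [H, W, 2] layout with vectors stored as (dx, dy) is an assumption for illustration.

```python
import numpy as np

def vectors_to_centers(pred_vectors):
    # pred_vectors: [H, W, 2] per-pixel displacement (dx, dy) towards the instance center.
    h, w, _ = pred_vectors.shape
    ys, xs = np.mgrid[0:h, 0:w]
    centers_x = xs + pred_vectors[..., 0]               # x_i + dx_i
    centers_y = ys + pred_vectors[..., 1]               # y_i + dy_i
    return np.stack([centers_x, centers_y], axis=-1)    # [H, W, 2] predicted centers per pixel
```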

Fig. 6. Instance segmentation task.

On top of the shared decoded features we add a final instance segmentation predictor. The loss is computed on the input image resolution, using the bilinearly upsampled predictions. The instance segmentation predictions need to be post-processed in order to extract the object instances. We follow the same procedure as [10] and apply the OPTICS clustering algorithm. Figure 6 illustrates an example of extracted instances.
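The post-processing step can be sketched as below; the use of scikit-learn's OPTICS, the foreground mask taken from the semantic prediction and the min_samples value are assumptions for illustration rather than the exact settings of [10] or of our pipeline.

```python
import numpy as np
from sklearn.cluster import OPTICS

def extract_instances(pred_centers, foreground_mask, min_samples=20):
    # pred_centers: [H, W, 2] per-pixel predicted instance centers (see vectors_to_centers).
    # foreground_mask: [H, W] boolean mask of pixels belonging to countable object classes.
    ys, xs = np.nonzero(foreground_mask)
    points = pred_centers[ys, xs]                      # predicted centers of foreground pixels
    labels = OPTICS(min_samples=min_samples).fit_predict(points)
    instance_map = np.full(foreground_mask.shape, -1, dtype=np.int32)
    instance_map[ys, xs] = labels                      # -1 marks background / noise pixels
    return instance_map
```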

4 Implementation

AVM Camera System. To maximize the information available in the input, we use an Around View Monitoring (AVM) system to construct our dataset. We use the same dataset as described in our previous work [3]. An AVM system is a 4-camera setup with which a top-down, \(360^\circ\) view of a car’s surroundings can be constructed. The 4 cameras are positioned around the car (front, rear and sides). Each camera is an HD fisheye camera with a \(190^\circ\) FOV. We shall further refer to this dataset as the AVM dataset. The dataset is manually annotated for object detection and bottom point prediction. We use a high-complexity instance segmentation network to extract ground truth for the semantic and instance segmentation tasks: we take a pre-trained Mask R-CNN [8] model and train only its object detection sub-network on the AVM dataset, keeping the instance segmentation sub-network frozen. We find that the predicted masks are of high quality and can be used as ground truth to train a low-complexity segmentation network. The dataset statistics can be found in Table 1.

Network Details. Our unified network receives as input a single front-facing RGB AVM image of size \(640\times 360\). We resize our high-resolution 720p AVM images to \(640\times 360\) in order to reduce inference time. We use the MobileNetV1 encoder with depth multiplier 0.5 as the network backbone. All layers in the shared and individual encoders are initialized with weights pre-trained on the COCO dataset. Unless otherwise specified, we use depthwise separable convolutions with ReLU activations and Batch Normalization. The object detection branch predicts objects at 6 scales using the SSD architecture. We set the minimum bounding box scale to 0.06 and the maximum to 0.95 for our AVM dataset; the minimum value is specific to our dataset, being the smallest object scale it contains. We use the standard SSD settings for the remaining parameters. The network predicts the box localization and the class association between the box and the 5 classes present in the AVM dataset (car, bus, truck, pedestrian and *-cycle). We use the standard SSD losses. For bottom point classification and semantic segmentation we use the cross-entropy loss. The instance segmentation loss is an L1 loss between the predicted displacement vectors and the labels. The segmentation losses are computed at the input resolution, between the ground truth and the bilinearly upsampled predictions. The total loss is the sum of the individual losses, weighted by the individual task weights. The instance segmentation task has a weight of \(\lambda =0.1\), while the others have a weight of \(\lambda =1\); the instance segmentation task is the slowest to learn and must be allowed time to achieve better results.
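The combined objective can be sketched as follows; the masking of the instance L1 loss to pixels that belong to an instance is an assumption, and det_loss stands for the standard SSD localization and classification losses computed elsewhere.

```python
import tensorflow as tf

def unified_loss(det_loss, bottom_logits, bottom_labels,
                 sem_logits, sem_labels, inst_pred, inst_labels, inst_weights):
    # Cross-entropy for the bottom point (per-column H-way) and semantic segmentation tasks.
    bottom_loss = tf.losses.sparse_softmax_cross_entropy(labels=bottom_labels,
                                                         logits=bottom_logits)
    sem_loss = tf.losses.sparse_softmax_cross_entropy(labels=sem_labels, logits=sem_logits)
    # L1 loss on the displacement vectors; inst_weights zeroes out non-instance pixels (assumed).
    inst_loss = tf.losses.absolute_difference(labels=inst_labels, predictions=inst_pred,
                                              weights=inst_weights)
    # Task weights as described in the text: 0.1 for instance segmentation, 1 for the rest.
    return det_loss + bottom_loss + sem_loss + 0.1 * inst_loss
```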

Table 1. Dataset statistics: number of objects per class.
Table 2. Unified architecture ablation experiments.

5 Experiments

Metrics. For the object detection task we use the Pascal VOC mean average precision (mAP) metric at 0.5 intersection over union (IOU). For bottom prediction and instance segmentation we use the mean absolute error (MAE) and for semantic segmentation we use the mean IOU (mIOU).

AVM Experiments. The AVM training dataset consists of front-facing fisheye RGB images. The network is trained for 100 000 iterations with a batch size of 8, using the ADAM optimizer with an initial learning rate of 7e-4. The learning rate is decayed every 15 000 iterations with a 0.5 decay factor (see the schedule sketch below). The network is initialized with COCO weights. We use random cropping, horizontal flipping and color distortion (brightness, contrast, hue, saturation) to augment our data. We have experimented with the complexity of our unified architecture and the trade-off between accuracy and inference time. To this end, we report results in Table 2 for independent decoders for the semantic and instance segmentation branches, for a shared decoder, and for a final slim shared decoder. When the decoders are independent between the two tasks, the specialized encoder features from Conv5 to Conv11 are also independent. The non-slim variant of the decoder follows the Deeplabv3+ architecture. By sharing the decoder between semantic and instance segmentation we gain a significant speed-up at little cost to model performance: the semantic segmentation branch is largely unaffected by the shared decoder, while the instance segmentation accuracy appears to be impacted. We report the instance segmentation MAE in terms of the predicted displacement vectors, without taking into account the OPTICS post-processed instances. The object detection and bottom prediction tasks achieve similar results across the three variants. We choose the third, slim variant as our final unified model, which achieves competitive results for the 4 tasks – object detection, bottom point prediction, semantic segmentation and instance segmentation – at 19 FPS on the NVIDIA Jetson TX2 embedded platform.
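For reference, the training schedule described above maps to a standard staircase decay in TensorFlow 1.x; this is a hedged sketch, and 'loss' refers to the combined objective of Sect. 4.

```python
import tensorflow as tf

# Staircase learning-rate schedule for the AVM training setup: initial rate 7e-4,
# halved every 15 000 iterations, optimized with ADAM.
global_step = tf.train.get_or_create_global_step()
learning_rate = tf.train.exponential_decay(7e-4, global_step,
                                           decay_steps=15000, decay_rate=0.5,
                                           staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate)
# train_op = optimizer.minimize(loss, global_step=global_step)
```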

Table 3. Component inference time.
Fig. 7. Bottom point prediction on the KITTI dataset.

For timing the inference we do not include the OPTICS post-processing overhead. We test the inference time of the models using TensorFlow 1.7 on the NVIDIA Jetson TX2 embedded platform, with JetPack 3.3, CUDA 9.0 and cuDNN 7.0. We perform component ablation tests in Table 3 in order to quantify each component’s impact on the model inference time. The majority of the total inference time is spent in the shared network encoder (18 ms), with each component adding further inference time on top. In our proposed model we prioritized the object detection and bottom tasks and chose to limit the inference time contributed by the semantic and instance segmentation branches. The segmentation branches add only 11 ms of inference time on top of the remaining architecture, providing relevant information at a small overhead.

KITTI Experiments. We perform additional experiments on the public KITTI [7] dataset. We use a padded resolution of \(1242\times 376\) in order to jointly train on all tasks. We train the object detection, semantic and instance segmentation tasks only on the classes evaluated in KITTI. The bottom prediction ground truths are extracted from the road segmentation ground truths by finding, for each column in the image width, the top point of the segmentation contour (sketched below). The network is trained for 40 000 iterations with a batch size of 8, using the ADAM optimizer with an initial learning rate of 1e-3, decayed every 10 000 iterations with a 0.5 decay factor. The network is initialized with Cityscapes weights. We use centered random cropping, horizontal flipping and color distortion to augment our data. We first report the KITTI results using the AVM evaluation metrics described previously. We randomly split the train set into a train and a validation set with an 80/20 split. In Table 4 we report the results on the validation set. The object detection task was trained on approximately 6 000 images and obtains competitive results. The semantic and instance segmentation tasks perform poorly, due in part to the small dataset with many classes and in part to the low-complexity segmentation branches. The bottom task achieves good performance despite the small training dataset; the bottom results can be visualized in Fig. 7. The MAE for the bottom task depends on the image resolution, so the KITTI bottom MAE cannot be directly compared to the AVM bottom MAE. For comparison purposes we present the KITTI test set evaluations for object detection in Table 5. The results are competitive with respect to the network inference time. The low-complexity segmentation network requires sufficient data in order to learn a valuable representation; as such, the results for the semantic and instance segmentation tasks on the KITTI dataset are poor. When the dataset is sufficiently large, as in the AVM case, the tasks learn a useful representation. We do not report the bottom point results due to the incompatibility of the task with the road segmentation evaluation.
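The extraction of bottom point labels from a road segmentation mask, as described above, can be sketched as follows; how columns with no road pixels are handled during training is not specified in the text, so they are simply marked invalid here.

```python
import numpy as np

def bottom_labels_from_road_mask(road_mask):
    # road_mask: [H, W] boolean mask from the KITTI road segmentation ground truth.
    h, w = road_mask.shape
    labels = np.full(w, -1, dtype=np.int32)        # -1: column contains no road pixel
    for col in range(w):
        rows = np.nonzero(road_mask[:, col])[0]
        if rows.size > 0:
            labels[col] = rows.min()               # topmost road pixel = bottom point label
    return labels
```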

Table 4. Unified architecture trained on the KITTI dataset.
Table 5. KITTI object detection test set evaluation.

6 Conclusions

In this paper we have proposed a low-complexity unified architecture that jointly predicts 4 complex tasks in real time – 5-class object detection, bottom point detection, semantic segmentation and instance segmentation – achieving 19 FPS on the NVIDIA Jetson TX2 embedded platform. The flexibility and modularity of our model allow for future improvements to each individual task, as well as for the addition of further tasks.