Abstract
Deep-learning-based image-to-image translation methods aim to learn the joint distribution of two domains and to find transformations between them. Although recent GAN (Generative Adversarial Network) based methods have shown compelling results, they are prone to fail at preserving image-objects and maintaining translation consistency, which reduces their practicality for tasks such as generating large-scale training data for different domains. To address this problem, we propose a structure-aware image-to-image translation network composed of encoders, generators, discriminators and parsing nets for the two domains in a unified framework. The proposed network generates more visually plausible images than competing methods on different image-translation tasks. In addition, we quantitatively evaluate different methods by training Faster R-CNN and YOLO on datasets generated from the image-translation results, and demonstrate significant improvement in detection accuracy when using the proposed image-object-preserving network.
S.-W. Huang and C.-T. Lin—Indicates equal contribution.
The original version of this chapter was revised: The presentation of Figure 1 was updated. The correction to this chapter is available at https://doi.org/10.1007/978-3-030-01240-3_50
Keywords
- Generative adversarial network
- Image-to-image translation
- Semantic segmentation
- Object detection
- Domain adaptation
1 Introduction
Deep learning pipelines have stimulated substantial progress in general object detection, and detectors keep pushing the boundaries on several detection datasets. However, despite being able to efficiently detect objects seen from arbitrary viewing angles, CNN-based detectors are still limited in that they do not function properly when faced with domains significantly different from those in the original training dataset. The most common way to obtain a performance gain is to go through the troublesome data collection and annotation process. Nevertheless, the recent successes of Generative Adversarial Networks (GANs) on image-to-image translation have opened up possibilities for generating large-scale detection training data without the need for object annotation.
Generative adversarial networks [1], which put two networks (i.e., a generator and a discriminator) in competition with each other, have emerged as a powerful framework for learning generative models of random data distributions. While expecting GANs to produce an RGB image and its associated bounding boxes from a random noise vector still sounds like a fantasy, training GANs to translate images from one scenario to another could help skip the tedious data annotation process. Earlier GAN-based image-to-image translation methods, such as Pix2Pix [2], yielded impressive results, but their requirement for pairwise training images largely reduces their practicality for the problem that we aim to solve.
Recently, unpaired image-to-image translation methods have achieved astonishing results on various domain adaptation challenges. Having almost identical architectures, CycleGAN [3], DiscoGAN [4], and DualGAN [5] made unpaired image-to-image translation possible by introducing the cycle consistency constraint. CoGAN [6] is a model which also works on unpaired images, using two shared-weight generators to generate images of two domains from one random noise vector. UNIT [7] is an extension of CoGAN. Aside from having hard weight-sharing constraints similar to those of CoGAN, Liu et al. further implemented the latent space assumption by encouraging two encoders to map images from two domains into the same latent space, which largely increases the translation consistency. These methods all demonstrate compelling visual results on several image-to-image translation tasks. However, what hinders them from providing large-scale detection training data, specifically when faced with translation tasks with a large domain shift, is that these networks often arrive at solutions where the translation results are indistinguishable from the target domain in terms of style but usually contain corrupted image-objects.
In this paper, we propose a structure-aware image-to-image translation network, which allows us to directly benefit object detection by translating existing detection RGB data from its original domain to other scenarios. The contribution of this work is three-fold: (1) we train the encoder networks to extract structure-aware information through the supervision of a segmentation subtask, (2) we experiment with different weight-sharing strategies to ensure the preservation of image-objects during image translation, and (3) our object-preserving network provides a significant performance gain on night-time vehicle detection.
We focus particularly on day-to-night image translation, not only because of the importance of night-time detection, but also because day/night image translation is one of the most difficult domain transformations. However, our method is also capable of handling various domain pairs. We train our network on synthetic datasets (i.e., SYNTHIA [8] and the GTA dataset [9]). Compared to the competing methods, the domain translation results of our network significantly enhance the capability of the object detector on both synthetic (i.e., SYNTHIA, GTA) and real-world (i.e., KITTI [10], ITRI) data. In addition, those who are interested in the ITRI dataset are welcome to email us to request it.
2 Proposed Framework
In unsupervised image-to-image translation, models learn a joint distribution in which the network encodes images from the two domains into a shared feature space. We assume that, for an image to be properly translated to the other domain, the encoded information must contain (1) mutual style information between domains A and B, and (2) structural information of the given input image, as illustrated in Fig. 1. Based on this assumption, we design our network to jointly optimize image translation and semantic segmentation. Through our weight-sharing strategy, the segmentation subtask serves as an auxiliary regularization for image translation.
Let X and Y denote the two image domains, \({\hat{X}}\) and \({\hat{Y}}\) denote the corresponding segmentation masks, and Z represent the encoded feature space. Our network, as depicted in Fig. 1, consists of two encoders \(E_{x}:\mathrm X \rightarrow \mathrm Z\) and \(E_{y}:\mathrm Y \rightarrow \mathrm Z\), two generators \(G_{x}:\mathrm Z \rightarrow \mathrm {\bar{Y}}\) and \(G_{y}: \mathrm Z \rightarrow \mathrm {\bar{X}}\), two segmentation generators (parsing networks) \(P_{x}:\mathrm Z \rightarrow \mathrm {\hat{X}}_{pred}\) and \(P_{y}:\mathrm Z \rightarrow \mathrm {\hat{Y}}_{pred}\), and two discriminators \(D_{x}\) and \(D_{y}\) for the two image domains, respectively. Our network learns image domain translation in both directions and the segmentation subtasks simultaneously. For an input \(x\in X\), \(E_{x}\) first encodes x into the latent space, and the 256-channel feature vector is then processed to produce (1) the translated output \({\bar{y}}\) via \(G_{x}\), and (2) the semantic representation \({\hat{x}}_{pred}\) via \(P_{x}\). The translated output \({\bar{y}}\) is then fed through the inverse encoder-generator pair \(\{E_{y}, G_{y}\}\) to yield the reconstructed image \(x_{rec}\). The detailed architecture of our network is given in Table 1.
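The data flow described above can be sketched in a few lines of Python. This is only an illustration of the order in which the components are applied: the function name `translate_and_parse` and the toy stand-in callables (`identity`, `darken`, `brighten`, `threshold`) are hypothetical and do not correspond to the actual network layers listed in Table 1.

```python
from typing import Callable, Dict, List

Vector = List[float]


def translate_and_parse(
    x: Vector,
    enc_x: Callable[[Vector], Vector],
    gen_x: Callable[[Vector], Vector],
    parse_x: Callable[[Vector], Vector],
    enc_y: Callable[[Vector], Vector],
    gen_y: Callable[[Vector], Vector],
) -> Dict[str, Vector]:
    """Forward pass for one input x taken from domain X."""
    z = enc_x(x)                 # E_x: X -> Z (shared latent space)
    y_bar = gen_x(z)             # G_x: Z -> translated image
    x_seg_pred = parse_x(z)      # P_x: Z -> predicted segmentation
    x_rec = gen_y(enc_y(y_bar))  # inverse pair {E_y, G_y} reconstructs x
    return {"z": z, "y_bar": y_bar, "x_seg_pred": x_seg_pred, "x_rec": x_rec}


# Toy stand-ins (hypothetical): identity encoders, a "darkening" generator,
# its inverse, and a thresholding parser.
def identity(v: Vector) -> Vector:
    return list(v)


def darken(v: Vector) -> Vector:
    return [0.5 * a for a in v]


def brighten(v: Vector) -> Vector:
    return [2.0 * a for a in v]


def threshold(v: Vector) -> Vector:
    return [1.0 if a > 0.5 else 0.0 for a in v]


out = translate_and_parse([0.2, 0.8], identity, darken, threshold, identity, brighten)
```

With these stand-ins the reconstruction `out["x_rec"]` equals the input, which is the behaviour the cycle through \(\{E_{y}, G_{y}\}\) is meant to encourage in the real network.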
2.1 Structure-Aware Encoding and Segmentation Subtask
We actively guide the encoder networks to extract context-aware features by regularizing them via a segmentation subtask, so that the extracted 256-channel feature vector contains not only mutual style information between the X and Y domains, but also the intricate low-level semantic features of the input image that are valuable for preserving image-objects during translation. The segmentation loss is formulated as:
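As a rough illustration only, a segmentation loss of this kind could be computed as a per-pixel multi-class cross-entropy between the parsing network's prediction and the ground-truth mask. The choice of cross-entropy, and the names `pixel_cross_entropy`, `probs` and `labels`, are assumptions made for this sketch rather than details taken from the text.

```python
import math
from typing import List


def pixel_cross_entropy(probs: List[List[float]], labels: List[int]) -> float:
    """Mean cross-entropy over pixels (assumed form of the segmentation loss).

    probs[i][c] is the predicted probability of class c at pixel i;
    labels[i] is the ground-truth class index of pixel i.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, c in zip(probs, labels):
        total += -math.log(max(p[c], eps))
    return total / len(labels)
```

A perfect prediction gives a loss of zero, and a uniform two-class prediction gives log 2 per pixel, so lower values indicate that the encoder features carry enough structure to recover the mask.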
2.2 Weight Sharing for Multi-task Network
Sharing weights between the generator and the parsing network allows the generator to take full advantage of the context-aware feature vector. We hard-share the first 6 residual blocks and soft-share the subsequent two deconvolution blocks of the generators and parsing networks. We experiment with different weight-sharing strategies, as discussed in Sect. 4.2, such as hard-sharing, not sharing the deconvolution blocks, and not sharing the residual blocks, and arrive at the best-performing strategy. We calculate the weight difference between the deconvolution layers of the two networks and model the difference as a loss function through the mean square error with a zero matrix as the target. The mathematical expression for the soft weight-sharing loss function is given by
where \(\omega _{G}\) and \(\omega _{P}\) denote the weight vectors formed by the deconvolution layers of the generator and parsing networks, respectively.
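The verbal description above (mean square error between the weight difference and a zero target) can be sketched as follows. The function name `soft_sharing_loss` and the flat-list representation of \(\omega _{G}\) and \(\omega _{P}\) are assumptions for illustration.

```python
from typing import Sequence


def soft_sharing_loss(w_g: Sequence[float], w_p: Sequence[float]) -> float:
    """Mean squared difference between two flattened weight vectors.

    Equivalent to the MSE between (w_g - w_p) and an all-zero target.
    """
    if len(w_g) != len(w_p):
        raise ValueError("weight vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(w_g, w_p)) / len(w_g)
```

The loss is zero only when the two sets of deconvolution weights coincide, so minimizing it pulls the generator and parsing network toward each other without forcing them to be identical, as hard sharing would.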
2.3 Cycle Consistency
The cycle consistency loss has proven quite effective in preventing the network from generating random images in the target domain. We also enforce the cycle-consistency constraint in the proposed framework to further regularize the ill-posed unsupervised image-to-image translation problem. The loss function is given by
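A minimal sketch of a cycle-consistency term, assuming the L1 reconstruction distance used by CycleGAN [3] and summing both translation directions. The norm and the names `l1_distance` and `cycle_loss` are assumptions for this example.

```python
from typing import Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two flattened images."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def cycle_loss(
    x: Sequence[float],
    x_rec: Sequence[float],
    y: Sequence[float],
    y_rec: Sequence[float],
) -> float:
    """Sum of the reconstruction errors of the X->Y->X and Y->X->Y cycles."""
    return l1_distance(x, x_rec) + l1_distance(y, y_rec)
```

Here `x_rec` stands for the reconstructed image \(x_{rec}\) obtained by translating x to the other domain and back, and `y_rec` is its counterpart for the inverse cycle.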
2.4 Adversarial Learning
Our network contains two Generative Adversarial Networks: \(GAN_{1}\): \(\{E_{x}, G_{x}, D_{x}\}\), and \(GAN_{2}\): \(\{E_{y}, G_{y}, D_{y}\}\). We apply adversarial losses to both GANs, and formulate the objective loss functions as:
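For orientation, the standard GAN objective of [1] can be sketched as below, under the assumption that each GAN here uses the original log-likelihood form. The function names and the representation of discriminator outputs as lists of probabilities are hypothetical.

```python
import math
from typing import Sequence

_EPS = 1e-12  # guards against log(0)


def discriminator_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Negative of E[log D(real)] + E[log(1 - D(fake))], minimized by D."""
    real_term = sum(math.log(max(p, _EPS)) for p in d_real) / len(d_real)
    fake_term = sum(math.log(max(1.0 - p, _EPS)) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)


def generator_loss(d_fake: Sequence[float]) -> float:
    """E[log(1 - D(fake))], minimized by the encoder-generator pair."""
    return sum(math.log(max(1.0 - p, _EPS)) for p in d_fake) / len(d_fake)
```

`d_real` holds the discriminator's outputs on real images of a domain and `d_fake` its outputs on translated images. The same two functions would be applied once per GAN.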
2.5 Network Learning
We jointly solve the learning problems for the image-translation streams \(\{E_{x}, G_{x}\}\) and \(\{E_{y}, G_{y}\}\), the image-parsing streams \(\{E_{x}, P_{x}\}\) and \(\{E_{y}, P_{y}\}\), and the two GAN networks \(GAN_{1}\) and \(GAN_{2}\) when training the proposed network. The integrated objective function is given as follows:
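One common way to integrate several loss terms is a weighted sum, sketched below. The term names (`gan`, `cyc`, `seg`, `soft`) and all weight values are placeholders chosen for illustration, not the settings used in the experiments.

```python
from typing import Dict


def total_objective(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of named loss terms; every term must have a weight."""
    return sum(weights[name] * value for name, value in losses.items())


# Hypothetical values for the four kinds of terms described in Sect. 2.
example_losses = {"gan": 1.0, "cyc": 0.5, "seg": 0.25, "soft": 2.0}
example_weights = {"gan": 1.0, "cyc": 10.0, "seg": 1.0, "soft": 0.5}
total = total_objective(example_losses, example_weights)
```

The relative weights control how strongly the segmentation and weight-sharing regularizers constrain the translation streams compared with the adversarial terms.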
3 Experimental Results
Though many works have been dedicated to providing large-scale vehicle datasets for the research community [11,12,13,14,15], most public datasets were collected in daytime. Considering that CNN-based detectors rely heavily on data augmentation techniques to improve performance, training detectors with both day and night images is necessary to make them more general. Synthetic datasets, such as SYNTHIA or the GTA dataset, provide diverse on-road synthetic sequences as well as segmentation masks in scenarios such as day, night, and snow. As our network requires both segmentation masks and nighttime images, we trained our network with the SYNTHIA and GTA datasets. For evaluation, however, we use real-world data such as KITTI and our ITRI datasets.
The performance of the network was further analyzed by training YOLO [16] and Faster R-CNN (VGG-16-based) [17] detectors with the generated image sets. Aside from revising both detectors to perform 1-class vehicle detection, all hyper-parameters were the same as those used for training on the PASCAL VOC challenge. The IoU threshold for objects to be considered true positives is 0.5, following the standard for common object detection datasets. When converting the segmentation ground truth to its detection counterpart, we exclude from the subsequent AP estimation the bounding boxes whose height is below 40 pixels or that are more than 75% occluded.
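The evaluation rules above can be sketched as follows: an IoU test at a 0.5 threshold, and a filter that drops boxes lower than 40 pixels or more than 75% occluded. The box format `(x1, y1, x2, y2)`, the occlusion value expressed as a fraction in [0, 1], and the function names are assumptions for this example.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def keep_box(box: Box, occlusion: float) -> bool:
    """Keep a ground-truth box unless it is too small or too occluded."""
    height = box[3] - box[1]
    return height >= 40 and occlusion <= 0.75


def is_true_positive(pred: Box, gt: Box, threshold: float = 0.5) -> bool:
    """A detection counts as a true positive when its IoU reaches the threshold."""
    return iou(pred, gt) >= threshold
```

For instance, two 10x10 boxes shifted by half their width overlap with an IoU of 1/3, so such a detection would not count as a true positive under the 0.5 threshold.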
3.1 Synthetic Datasets
We first assess the effectiveness of training detectors with transformed images in both day and night scenarios. We evaluated our network, trained with SYNTHIA, by training detectors with the transformed images it produces. As shown in Table 2, AugGAN outperforms competing methods in both day and night scenarios. AugGAN also surpasses its competitors when trained with the GTA dataset (see Table 3). Visually, the transformation results of AugGAN are clearly better in terms of preserving image-objects and preventing the appearance of artifacts, as shown in Figs. 2 and 3.
3.2 KITTI and ITRI-Night Datasets
Aside from testing on the SYNTHIA and GTA datasets, we also assess the capability of our network on real-world data, such as KITTI, which has been widely used to assess the performance of on-road object detectors in autonomous driving systems. With the previously trained AugGAN, be it trained with the SYNTHIA or the GTA dataset, we transformed the KITTI dataset (7481 images, of which 6686 contain vehicle instances) [18] to its nighttime version and evaluated the translation results via detector training. We trained vehicle detectors with the translated KITTI dataset and tested them on our ITRI-Night testing set (9366 images with 20833 vehicle instances). As the experimental results indicate, real-world data transformed by AugGAN achieves better results both quantitatively and visually, even though AugGAN was trained with synthetic datasets (see Table 4 and Figs. 4 and 5).
3.3 ITRI Daytime and Nighttime Datasets
We collected a real-driving daytime dataset (25104 images with 87374 vehicle instances), captured mostly in the same scenarios as our nighttime dataset (9366 images with 20833 vehicle instances). The experiments in Table 5 show results similar to those on the other datasets. The transformed day-to-night training images prove helpful in vehicle detector training. Training images generated by AugGAN outperform those generated by competing methods because AugGAN preserves image-objects, with some examples shown in Figs. 6 and 7.
3.4 Transformations Other Than Daytime and Nighttime
AugGAN is capable of learning transformations across unpaired synthetic and real domains, and only segmentation supervision in domain A is required. This increases the flexibility of learning cross-domain adaptation for subsequent detector training. As shown in the second row of Fig. 8, our method can learn image translation not only for synthetic-synthetic but also for synthetic-real domain pairs.
4 Model Analysis
4.1 Segmentation Subtask
In our initial experiment on introducing the segmentation subtask, the parsing network was only used in the forward cycle (e.g., only day-to-night). We later discovered that our results improve when the parsing network regularizes both the forward and inverse cycles. As can be seen in Table 6, adding regularization to the inverse cycle leads to better transformation results, which make detectors more accurate. Although single-sided segmentation alone already outperforms previous works, introducing segmentation in both the forward and backward cycles brings a further accuracy improvement for object detection.
4.2 Weight-Sharing Strategy
Our network design is based on the assumption that the extracted semantic segmentation features of individual layers, through proper weight sharing, can serve as auxiliary regularization for image-to-image translation. Finding the proper weight-sharing policy thus became the most important factor in our design. Weight-sharing mechanisms in neural networks can be roughly categorized into soft weight-sharing and hard weight-sharing. Soft weight-sharing [19] was originally proposed for regularization and can be applied to network compression [20]. Recently, hard weight-sharing has proven useful in generating images with similar high-level semantics [6]. The policy that we currently adopt is two-fold: (1) hard-share the encoders and residual blocks of the generator-parsing net pairs, and (2) soft-share the deconvolution layers of the generator-parsing net pairs. We arrived at this setting through extensive trial and error, and during the process we realized that both policies are integral to the optimization of our network. Without hard-sharing the layers in (1), image-objects tend to be distorted; without (2), the network tends to optimize only one of the tasks (see Table 7 and Fig. 9). In short, our network surpasses competing methods because our multi-task network can maintain a realistic transformation style while preserving image-objects with the help of the segmentation subtask.
5 Conclusion and Future Work
In this work, we proposed an image-to-image translation network for generating large-scale training data for vehicle detection algorithms. Our network is especially adept at preserving image-objects, thanks to the extra guidance of the segmentation subtask. Our method, though far from perfect, quantitatively surpasses competing methods in improving vehicle detection accuracy. In the future, we will continue to experiment on different tasks based on this framework.
Change history
07 December 2018
In the originally published version of chapter 44, Figure 1 was of poor quality. It was replaced by a high-quality figure.
References
1. Goodfellow, I., et al.: Generative adversarial nets. In: NIPS (2014)
2. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR (2017)
3. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV (2017)
4. Kim, T., Cha, M., Kim, H., Lee, J., Kim, J.: Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192 (2017)
5. Yi, Z., Zhang, H., Tan, P., Gong, M.: DualGAN: unsupervised dual learning for image-to-image translation. arXiv preprint (2017)
6. Liu, M.Y., Tuzel, O.: Coupled generative adversarial networks. In: NIPS (2016)
7. Liu, M.Y., Breuel, T., Kautz, J.: Unsupervised image-to-image translation networks. In: NIPS (2017)
8. Ros, G., Sellart, L., Materzynska, J., Vazquez, D., Lopez, A.M.: The SYNTHIA dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In: CVPR (2016)
9. Richter, S.R., Vineet, V., Roth, S., Koltun, V.: Playing for data: ground truth from computer games. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 102–118. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_7
10. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: CVPR (2012)
11. Sivaraman, S., Trivedi, M.M.: A general active-learning framework for on-road vehicle recognition and tracking. IEEE Trans. Intell. Transp. Syst. 11(2), 267–276 (2010)
12. Zhou, Y., Liu, L., Shao, L., Mellor, M.: DAVE: a unified framework for fast vehicle detection and annotation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 278–293. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_18
13. Yang, L., Luo, P., Change Loy, C., Tang, X.: A large-scale car dataset for fine-grained categorization and verification. In: CVPR (2015)
14. Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3D object representations for fine-grained categorization. In: ICCV Workshops (2013)
15. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The Pascal visual object classes (VOC) challenge. IJCV 88, 303 (2010)
16. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: CVPR (2016)
17. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS (2015)
18. Xiang, Y., Choi, W., Lin, Y., Savarese, S.: Subcategory-aware convolutional neural networks for object proposals and detection. In: WACV (2017)
19. Nowlan, S.J., Hinton, G.E.: Simplifying neural networks by soft weight-sharing. Neural Comput. 4(4), 473–493 (1992)
20. Ullrich, K., Meeds, E., Welling, M.: Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008 (2017)
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Huang, SW., Lin, CT., Chen, SP., Wu, YY., Hsu, PH., Lai, SH. (2018). AugGAN: Cross Domain Adaptation with GAN-Based Data Augmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds) Computer Vision – ECCV 2018. ECCV 2018. Lecture Notes in Computer Science(), vol 11213. Springer, Cham. https://doi.org/10.1007/978-3-030-01240-3_44
DOI: https://doi.org/10.1007/978-3-030-01240-3_44
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01239-7
Online ISBN: 978-3-030-01240-3
eBook Packages: Computer Science, Computer Science (R0)