Abstract
Generating a realistic person image under pose guidance, especially for local body parts, is challenging for two reasons: (1) the difficulty of modeling long-range relations, and (2) a deficiency in capturing precise local correspondence. We propose a Precise Correspondence Enhanced Generative Adversarial Network (PCE-GAN) to address these problems. PCE-GAN comprises a global branch, which maintains the global consistency of the generated person image, and a local branch, which captures precise local correspondence. More specifically, long-range relations are established via the spatial-channel multi-layer perceptron modules in the transformation blocks of both branches, while precise local correspondence is captured effectively by the local branch's local-pair building and local-guiding modules. Finally, the outputs of the two branches are combined so that each benefits from the other's enhanced correspondences. Experimental results on the Market-1501 dataset show that PCE-GAN quantitatively outperforms previous state-of-the-art methods, improving SSIM and IS scores by \(5.53\%\) and \(7.74\%\), respectively. Qualitative results on both the Market-1501 and DeepFashion datasets are also provided to further validate the effectiveness of our method.
References
Ma L, Jia X, Sun Q, Schiele B, Tuytelaars T, Van Gool L (2017) Pose guided person image generation. In: Advances in neural information processing systems, pp 406–416
Ma L, Sun Q, Georgoulis S, Van Gool L, Schiele B, Fritz M (2018) Disentangled person image generation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 99–108
Siarohin A, Sangineto E, Lathuiliere S, Sebe N (2018) Deformable gans for pose-based human image generation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3408–3416
Zhu Z, Huang T, Shi B, Yu M, Wang B, Bai X (2019) Progressive pose attention transfer for person image generation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2347–2356
AlBahar B, Huang J-B (2019) Guided image-to-image translation with bi-directional feature transformation. In: Proceedings of the IEEE international conference on computer vision, pp 9016–9025
Men Y, Mao Y, Jiang Y, Ma W-Y, Lian Z (2020) Controllable person image synthesis with attribute-decomposed gan. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5084–5093
Lv Z, Li X, Li X, Li F, Lin T, He D, Zuo W (2021) Learning semantic person image generation by region-adaptive normalization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 10806–10815
Siarohin A, Woodford OJ, Ren J, Chai M, Tulyakov S (2021) Motion representations for articulated animation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 13653–13662
Tang H, Xu D, Liu G, Wang W, Sebe N, Yan Y (2019) Cycle in cycle generative adversarial networks for keypoint-guided image generation. In: Proceedings of the ACM international conference on multimedia, pp 2052–2060
Tang H, Bai S, Zhang L, Torr PH, Sebe N (2020) Xinggan for person image generation. In: Proceedings of the European conference on computer vision, pp 717–734
Tang H, Bai S, Torr PH, Sebe N (2020) Bipartite graph reasoning gans for person image generation. In: British machine vision conference
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems, vol 27
Fister I Jr, Perc M, Ljubič K, Kamal SM, Iglesias A, Fister I (2015) Particle swarm optimization for automatic creation of complex graphic characters. Chaos Solit Fract 73:29–35
Kingma DP, Welling M (2013) Auto-encoding variational bayes. In: International conference on learning representations
Mirza M, Osindero S (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784
Han Z, Huang H (2021) Gan based three-stage-training algorithm for multi-view facial expression recognition. Neural Process Lett 53(6):4189–4205
Xiang X, Yu Z, Lv N, Kong X, Saddik AE (2020) Attention-based generative adversarial network for semi-supervised image classification. Neural Process Lett 51(2):1527–1540
Wen J, Shen Y, Yang J (2022) Multi-view gait recognition based on generative adversarial network. Neural Process Lett 1–23
Brock A, Donahue J, Simonyan K (2018) Large scale gan training for high fidelity natural image synthesis. In: International conference on learning representations
Shaham TR, Dekel T, Michaeli T (2019) Singan: learning a generative model from a single natural image. In: Proceedings of the IEEE international conference on computer vision, pp 4570–4580
Karras T, Laine S, Aila T (2019) A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4401–4410
Esser P, Sutter E, Ommer B (2018) A variational u-net for conditional appearance and shape generation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8857–8866
Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612
Zakharov E, Shysheya A, Burkov E, Lempitsky V (2019) Few-shot adversarial learning of realistic neural talking head models. In: Proceedings of the IEEE international conference on computer vision, pp 9459–9468
Kim J, Kim M, Kang H, Lee KH (2019) U-gat-it: unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation. In: International conference on learning representations
Alami Mejjati Y, Richardt C, Tompkin J, Cosker D, Kim KI (2018) Unsupervised attention-guided image-to-image translation. Adv Neural Inf Process Syst 31:3693–3703
Park T, Liu M-Y, Wang T-C, Zhu J-Y (2019) Semantic image synthesis with spatially-adaptive normalization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2337–2346
Ren B, Tang H, Sebe N (2021) Cascaded cross mlp-mixer gans for cross-view image translation. In: British machine vision conference
Balakrishnan G, Zhao A, Dalca AV, Durand F, Guttag J (2018) Synthesizing images of humans in unseen poses. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8340–8348
Lassner C, Pons-Moll G, Gehler PV (2017) A generative model of people in clothing. In: Proceedings of the IEEE international conference on computer vision, pp 853–862
Wang B, Zheng H, Liang X, Chen Y, Lin L, Yang M (2018) Toward characteristic-preserving image-based virtual try-on network. In: Proceedings of the European conference on computer vision, pp 589–604
Neverova N, Alp Guler R, Kokkinos I (2018) Dense pose transfer. In: Proceedings of the European conference on computer vision, pp 123–138
Li Y, Huang C, Loy CC (2019) Dense intrinsic appearance flow for human pose transfer. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3693–3702
Zanfir M, Oneata E, Popa A-I, Zanfir A, Sminchisescu C (2020) Human synthesis and scene compositing. Proc AAAI Conf Artif Intell 34:12749–12756
Zhang J, Li K, Lai Y-K, Yang J (2021) Pise: person image synthesis and editing with decoupled gan. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7982–7990
Cao Z, Simon T, Wei S-E, Sheikh Y (2017) Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7291–7299
Zheng L, Shen L, Tian L, Wang S, Wang J, Tian Q (2015) Scalable person re-identification: a benchmark. In: Proceedings of the IEEE international conference on computer vision
Liu Z, Luo P, Qiu S, Wang X, Tang X (2016) Deepfashion: powering robust clothes recognition and retrieval with rich annotations. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1096–1104
Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X (2016) Improved techniques for training gans. In: Advances in neural information processing systems, vol 29
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: International conference on learning representations
Huang S, Xiong H, Cheng Z-Q, Wang Q, Zhou X, Wen B, Huan J, Dou D (2020) Generating person images with appearance-aware pose stylizer. In: International joint conference on artificial intelligence
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L et al (2019) Pytorch: an imperative style, high-performance deep learning library. In: Advances in neural information processing systems, vol 32
Ren Y, Yu X, Chen J, Li TH, Li G (2020) Deep image spatial transformation for person image generation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7690–7699
Andriluka M, Pishchulin L, Gehler P, Schiele B (2014) Human pose estimation: new benchmark and state of the art analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Appendices
Appendix A Details of the Evaluation Metrics
SSIM and Mask-SSIM The Structural Similarity Index Measure (SSIM) [24] measures the similarity between generated and real images at the pixel level, based on their luminance, contrast, and structure. A larger SSIM value indicates greater similarity.
The SSIM is calculated over various windows of an image. Its value between two windows x and y of common size \(N \times N\) is formulated as follows:
$$ \operatorname{SSIM}(x, y)=\frac{\left(2 \mu_{x} \mu_{y}+c_{1}\right)\left(2 \sigma_{x y}+c_{2}\right)}{\left(\mu_{x}^{2}+\mu_{y}^{2}+c_{1}\right)\left(\sigma_{x}^{2}+\sigma_{y}^{2}+c_{2}\right)} $$
where \(\mu _{x}\), \(\mu _{y}\), \(\sigma _{x}^{2}\), and \(\sigma _{y}^{2}\) denote the average of x, the average of y, the variance of x, and the variance of y, respectively. \(\sigma _{xy}\) denotes the covariance of x and y. \(c_1\) and \(c_2\) are two constants that stabilize the division when the denominator is weak.
Note that Mask-SSIM is the masked version of SSIM; the only difference is that it is computed on images with the background removed.
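For illustration, the window-level SSIM above can be sketched directly in NumPy. This is a minimal sketch, not an optimized implementation; the constants \(c_1\) and \(c_2\) use the common defaults for 8-bit images (\(c_1=(0.01 \cdot 255)^2\), \(c_2=(0.03 \cdot 255)^2\)), which are an assumption here:

```python
import numpy as np

def ssim(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """SSIM between two equally sized grayscale windows x and y."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    # covariance of the two windows
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den
```

An image-level SSIM averages this quantity over (typically Gaussian-weighted) sliding windows; identical windows yield a value of 1.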
IS and Mask-IS The Inception Score (IS) [40] is a popular metric for judging the outputs of a GAN. It simultaneously measures (a) the quality of each generated image and (b) the diversity of the generated images. A higher score is better, indicating that the GAN generates many distinct, recognizable images. The IS is defined as follows:
$$ \mathrm{IS}=\exp \left(\mathbb{E}_{x}\, D_{K L}\left(p(y \mid x) \,\Vert\, p(y)\right)\right) $$
where \(p(y)\) denotes the marginal class distribution over the generated images, \(p(y \mid x)\) denotes the class distribution conditioned on the input x, and \(D_{K L}\) denotes the Kullback-Leibler divergence.
Note that Mask-IS is the masked version of IS, computed in the same way; the only difference is that it is applied to images with the background removed.
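The score above can be sketched from a matrix of classifier softmax outputs. This is a hedged illustration: in practice the probabilities come from a pretrained Inception-v3 network and the score is averaged over several splits of the generated set, neither of which is shown here:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, K) array, row n is the classifier output p(y | x_n)."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal p(y) over the set
    # per-sample KL(p(y|x) || p(y)), summed over classes
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

Perfectly confident and perfectly diverse predictions over K classes give a score of K, while identical uniform predictions give the minimum score of 1.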
PCKh The PCKh is used to quantify the shape consistency of the generated images. Specifically, person shape is represented by 18 pose joints obtained from the human pose estimator [45]; shape consistency is then approximated by pose-joint alignment, evaluated by the PCKh measure. Following the standard protocol, the PCKh score is the percentage of keypoint pairs whose offsets fall below half the size of the head segment:
$$ \mathrm{PCKh}=\frac{\sum_{p} \sum_{i} \mathbb{1}\left(d_{pi} \leq T_{k} \cdot d_{p}^{def}\right)}{\sum_{p} \sum_{i} 1} $$
where i indexes the keypoints, k indexes the threshold \(T_k\), and p denotes the given person. \(d_{pi}\) denotes the Euclidean distance between the i-th keypoint of person p and the ground truth, and \(d_{p}^{def}\) denotes the half size of the head segment mentioned above.
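A sketch of this computation from predicted and ground-truth keypoints (the array shapes and helper name are illustrative, not from the paper):

```python
import numpy as np

def pckh(pred, gt, d_def, t=1.0):
    """pred, gt: (P, J, 2) keypoint arrays for P people and J joints.
    d_def: (P,) half head-segment sizes (the d_p^def of the text).
    A keypoint counts as correct when d_pi <= t * d_p^def, with t
    playing the role of the threshold T_k."""
    d = np.linalg.norm(pred - gt, axis=-1)        # (P, J) Euclidean offsets
    return float((d <= t * d_def[:, None]).mean())
```

Perfect alignment yields 1.0; offsets larger than the per-person reference yield 0.0.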
Appendix B Code of Spatial-Channel MLP
To facilitate the use of our proposed method, PyTorch-style code for the plug-and-play Spatial-Channel MLP module is provided as follows:
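A minimal sketch of such a module, assuming an MLP-Mixer-style design: a spatial (token-mixing) MLP that propagates information across all positions, capturing long-range relations, followed by a channel-mixing MLP, each wrapped in layer normalization and a residual connection. The class names, hidden sizes, and interface below are illustrative assumptions, not the authors' exact listing:

```python
import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    """Two-layer MLP with GELU, applied along the last dimension."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)

class SpatialChannelMLP(nn.Module):
    """Spatial-Channel MLP block over a sequence of N = H*W tokens.

    The spatial MLP mixes information across token positions
    (long-range relations); the channel MLP mixes across feature
    channels. Both use pre-norm and residual connections.
    """
    def __init__(self, num_tokens, dim, spatial_hidden=256, channel_hidden=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.spatial_mlp = MLPBlock(num_tokens, spatial_hidden)
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = MLPBlock(dim, channel_hidden)

    def forward(self, x):
        # x: (B, N, C) tokens, e.g. a flattened CNN feature map
        y = self.norm1(x).transpose(1, 2)                 # (B, C, N)
        x = x + self.spatial_mlp(y).transpose(1, 2)       # mix across positions
        x = x + self.channel_mlp(self.norm2(x))           # mix across channels
        return x
```

To use it on a convolutional feature map of shape (B, C, H, W), flatten it to (B, H*W, C) first, apply the block, and reshape back; the module is shape-preserving, which is what makes it plug-and-play.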
Cite this article
Liu, J., Zhu, Y. Precise Correspondence Enhanced GAN for Person Image Generation. Neural Process Lett 54, 5125–5142 (2022). https://doi.org/10.1007/s11063-022-10853-2