
1 Introduction

Deep learning has contributed significantly to the rapid progress in computer vision owing to its strong data-representation capabilities. However, training deep neural networks requires a huge amount of annotated data, which is often unavailable in realistic scenarios because data annotation is labor-intensive. Meanwhile, with the explosive growth of new categories (e.g. of objects), it may even be impossible to obtain any training data for certain classes. To deal with this, zero-shot learning (ZSL) has recently emerged as an effective solution [17,18,19, 28]. ZSL considers the more challenging case where training (seen) and test (unseen) classes are disjoint, i.e. the data of unseen classes is entirely missing during training.

Fig. 1. The flow chart of the proposed approach. We address the zero-shot learning (ZSL) problem in a supervised way, by generating features for unseen classes via our generative framework. The red dotted box denotes the conventional ZSL task, and the blue dotted box denotes the generalized ZSL (GZSL) task. (Color figure online)

Specific intermediate representations (e.g. semantic attributes [8, 11, 18] and word vectors [10, 27, 37, 48]) have been widely used by ZSL methods to bridge the gap between seen and unseen classes. However, an inherent problem, known as ‘domain shift’ [12], remains challenging for conventional ZSL methods: classifiers trained on seen classes are not suitable for unseen ones due to their different underlying distributions. Consequently, most existing methods exhibit a strong bias towards seen data, and their performance is unsatisfactory in conventional ZSL settings, let alone in the recently proposed and more realistic generalized ZSL (GZSL) settings [7, 42], where both seen and unseen classes are present at test time. It is therefore highly desirable to develop a generalized framework that mitigates the domain shift and provides a universal classifier for both seen and unseen classes. As shown in Fig. 1, in this work we address the above issues from a new perspective, i.e. converting ZSL into supervised learning by hallucinating unseen-class features with deep generative models.

Deep generative models, such as Generative Adversarial Networks (GAN) and Variational Autoencoders (VAE), have been extensively studied in recent years. GAN [13] is appealing for generating realistic images, especially when conditioned on additional information [26, 33]. VAE [15], especially the conditional VAE (CVAE) [38], has great potential to generate data through element-wise similarity metrics. In a similar spirit to our work, Xian et al. [43] proposed a ZSL framework that generates features for unseen classes based on a conditional GAN (CGAN). However, GAN generally concentrates on more abstract and global data structure, whereas in our problem element-wise reconstruction is also essential for hallucinating unseen classes. Thus, we propose a joint framework that takes advantage of both CGAN and CVAE for more delicate data generation. Note that existing works [4, 21] have already shown the effectiveness of this kind of generative model in image synthesis. In contrast, we aim at generating features instead of images for unseen classes, since generated images are typically of insufficient quality to train deep networks for the final classification [43]. We add an additional categorization network to ensure the discriminability of the synthesized features. Different from [43], the categorizer and the generator in our framework compete in a two-player minimax game: the generator tries to generate features that belong to the same classes as the real features, while the categorizer tries to distinguish the generated features from the real ones at the category level. Through this competition, the generated features become well suited for training the final discriminative classifier. Moreover, we propose a perceptual reconstruction loss to preserve class-wise semantics based on the intermediate outputs of the discriminator and the categorizer.

The main contributions of this paper are summarized as follows:

  • We propose a novel generative framework for zero-shot learning, which addresses conventional ZSL problems in a supervised manner. The framework takes advantage of both CGAN and CVAE to generate features conditioned on semantic attributes, with the additional help of a categorization network. As a result, the generated features are not only similar to the real ones but also discriminative for the subsequent classification task.

  • We leverage the intermediate outputs of the networks for perceptual reconstruction, so that the generated features exhibit element-wise as well as semantic similarity to the real features.

  • Extensive experimental results on five standard ZSL benchmarks demonstrate that the proposed method achieves notable improvements over state-of-the-art approaches on not only the conventional ZSL task but also the more challenging GZSL task.

The remainder of the paper is organized as follows. In Sect. 2, we give a brief review of existing ZSL methods and generative models. In Sect. 3, we introduce the proposed joint generative framework, which includes several networks that synthesize high-quality features of unseen classes for the subsequent classification task. Section 4 first introduces the datasets and experimental setup and then presents the experimental results. We finally draw our conclusion in Sect. 5.

2 Related Work

2.1 Zero-Shot Learning

Zero-shot learning is a challenging task because of the lack of training data. Many attempts [1, 8, 17,18,19, 27, 28, 30, 31, 34, 45, 48] have been made to exploit the relationships between seen and unseen classes. Semantic representations, such as semantic attributes [8, 9, 11, 17, 18] and word vectors [10, 27, 37, 48], are employed as the intermediate embedding to bridge the gap between the visual space and class space. Typically, a mapping from the visual space to semantic space is learned and then leveraged for the following classification task.

Recently, several works have learned the inverse mapping from the semantic space to the visual space [5, 23, 24, 43, 46], which has been shown effective for mitigating the domain shift problem. For instance, Zhang et al. [46] proposed an end-to-end architecture to embed the semantic representation into the visual space. Different from the above works, we do not learn the inverse mapping directly but instead generate synthesized features of unseen classes conditioned on class-level semantic attributes. Closest to our work are methods that focus on data generation with generative models. For instance, Bucher et al. [5] generated features via GMMN [22]. Xian et al. [43] proposed a framework combining WGAN [3] and a categorization network to generate features. Our framework differs from these by exploiting two generative models (i.e. CVAE and CGAN) for realistic feature generation. Moreover, we propose both categorization and perceptual losses to generate discriminative features.

In comparison with the conventional ZSL, generalized zero-shot learning (GZSL) is a more realistic and difficult task in which both seen and unseen classes are available at test time [7, 42]. Although conventional ZSL has gained a lot of attention, only a few studies [7, 37] have concentrated on GZSL problems. It is therefore highly desirable to design robust ZSL methods that eliminate the bias towards seen data for more realistic scenarios.

2.2 Deep Generative Models

Deep generative models [13, 15] have shown great potential for data generation, and a variety of them have been proposed [13, 15, 20, 21, 32, 38]. Among these models, the Variational Autoencoder (VAE) [15] and the Generative Adversarial Network (GAN) [13] play indispensable roles. VAE models the relationship directly through element-wise reconstruction, while GAN captures the global relationship indirectly [21]. However, VAE often generates blurry images, as reported in [4], because element-wise distances cannot describe the complex data structure. GAN can capture more abstract information, but its training process is unstable [36].

Due to the above shortcomings, some recent works attempted to combine these two generative models for better data generation, such as VAE/GAN [21], the adversarial autoencoder [25], and CVAE-GAN [4]. Our work is motivated by these approaches; however, we utilize conditional generative models to synthesize features instead of images, as the quality of generated images is too low to achieve satisfactory performance in ZSL problems [43]. Specifically, our model is conditioned on semantic attributes instead of category-level labels, so that a more delicate description can be used for feature generation. Moreover, we add a categorization network to ensure that the generated features are helpful for the subsequent classification task. We also take advantage of the intermediate outputs of the networks for perceptual reconstruction, forming a richer semantic similarity metric for feature generation.

3 Approach

This work aims to synthesize high-quality features for unseen classes by establishing a joint generative model, based on which conventional ZSL can be transformed into supervised learning. Specifically, our proposed model generates semantically expressive features for unseen classes conditioned on the class-level semantic attributes. Subsequently, we train classifiers based on the generated features of unseen classes w.r.t. conventional ZSL settings and on both the generated features of unseen classes and real features of seen classes w.r.t. GZSL settings. As a result, the domain shift between seen and unseen classes will be mitigated significantly as classifiers are learned on both seen and unseen features.

In the following, we will first introduce the problem settings for ZSL and GZSL, and then present our joint generative model in detail. Finally, how to perform zero-shot recognition in a supervised manner is elaborated.

3.1 Problem Settings

In zero-shot learning, the training set S consists of image features, attributes, and class labels of seen classes, i.e.  \(S = \{ ({x_s},{a_s},{y_s})|{x_s} \in {X},{a_s} \in {A},{y_s} \in {Y_s}\} \). \({x_s}\in \mathbb {R}^{d_x}\) denotes the features of seen data, where \({d_x}\) denotes the feature dimension. \({Y_s} = \{ y_s^1,...,{y_s^{C_s}}\}\) represents the labels of \({C_s}\) seen classes. \(a_s\in \mathbb {R}^{d_a}\) denotes the class-level attributes of seen classes, where \({d_a}\) indicates the dimension of semantic attributes. In terms of unseen classes, no features are available during training and we can only employ some class-level information, e.g. semantic attributes in our case. Specifically, the unseen set is denoted by \(U = \{ ({a_u},{y_u})|{a_u} \in {A},{y_u} \in {Y_u}\}\), where \({Y_u} = \{y_u^1,...,y_u^{C_u}\}\) represents the labels of \({C_u}\) unseen classes and \({a_u}\in \mathbb {R}^{d_a}\) denotes the class-level attributes of unseen classes.

It should be noted that the seen and unseen classes are disjoint, namely \({Y_s} \cap {Y_u} = \emptyset \). Given S and U, the conventional zero-shot learning task is to learn a classifier \({f_{ZSL}}:{X} \rightarrow {Y_u}\), and the generalized zero-shot learning aims to learn a universal classifier \({f_{GZSL}}:X \rightarrow {Y_s} \cup {Y_u}\), which is a more challenging task.
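For concreteness, the data layout above can be summarized with a minimal sketch; the container names and the disjointness check below are illustrative conveniences, not part of the original formulation.

```python
# Illustrative sketch of the ZSL data layout (names are assumptions, not from the paper).
from dataclasses import dataclass
import numpy as np

@dataclass
class SeenSet:
    x: np.ndarray   # (N_s, d_x) image features of seen-class samples
    a: np.ndarray   # (N_s, d_a) class-level attributes attached to each sample
    y: np.ndarray   # (N_s,)     labels drawn from Y_s

@dataclass
class UnseenSet:
    a: np.ndarray   # (C_u, d_a) class-level attributes of the unseen classes
    y: np.ndarray   # (C_u,)     labels drawn from Y_u

def check_disjoint(seen: SeenSet, unseen: UnseenSet) -> bool:
    # The zero-shot assumption: Y_s and Y_u must not overlap.
    return len(set(seen.y.tolist()) & set(unseen.y.tolist())) == 0
```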

Fig. 2. The illustration of our generative framework. In particular, our framework consists of four networks: the Encoder E, the Generator G, the Discriminator D, and the Categorizer C. Given real features and the corresponding semantic attributes as input, our framework synthesizes high-quality features after generative learning.

3.2 Joint Generative Model

In this subsection, we will introduce our proposed framework in detail. As shown in Fig. 2, our framework consists of four networks: (1) the encoder network E, (2) the decoder/generator network G, (3) the discriminator network D, and (4) the categorizer network C. As our framework combines CVAE and CGAN, the decoder in CVAE is identical to the generator in CGAN. Unless otherwise specified, we use the generator G to denote this network branch.

The combination of CVAE and CGAN provides well-designed guidance for feature generation. In the following, we will first introduce the network structures of VAE conditioned on semantic attributes and GAN conditioned on semantic attributes and category labels, respectively. An additional categorization network will also be introduced along with the conditional GAN. Subsequently, we will present our perceptual reconstruction loss and the overall objective for training in detail. Finally, we will introduce the procedure for zero-shot recognition at test time.

VAE Conditioned on Semantic Attributes. VAE consists of an encoder network and a generator network. In our architecture, the VAE is conditioned on class-level semantic attributes. In other words, the attributes act as part of the input to both the encoder and the generator, providing class-level semantic information for feature generation.

As for the encoder network E with parameters \({\theta _E}\), we aim to encode the real features \({x_s}\) into a latent representation

$$\begin{aligned} {z_f} \sim {p_E}(z|{x_s},{a_s}), \end{aligned}$$
(1)

where \({x_s} \sim p(x)\) and \({a_s} \sim p(a)\), and p(x) and p(a) denote the prior distributions of real features and semantic attributes, respectively. The encoder learns the inherent structure of the features, while the distribution of the latent representation is pushed towards the prior p(z), which is usually the standard Gaussian from which \(z_{p} \sim \mathcal {N}(0,I)\) is drawn. The generator G with parameters \({\theta _G}\) decodes the latent representation into the feature space to generate synthesized features

$$\begin{aligned} {x_f} \sim {p_G}(x|{z_f},{a_s}). \end{aligned}$$
(2)

The overall loss function of CVAE is a combination of the reconstruction loss and the Kullback-Leibler divergence loss:

$$\begin{aligned} {L_{CVAE}}({\theta _E},{\theta _G}) = {L_{KL}} + {L_{recon}}, \end{aligned}$$
(3)

where

$$\begin{aligned} {L_{KL}}({\theta _E},{\theta _G}) = KL({p_E}({z}|{x_s},{a_s})||p(z)), \end{aligned}$$
(4)
$$\begin{aligned} {L_{recon}}({\theta _E},{\theta _G}) = - E[\log ({p_G}(x|{z_f},{a_s}))]. \end{aligned}$$
(5)

By minimizing Eq. (3), we reduce the reconstruction error and the difference between the distribution of the latent representation and the prior distribution. As a consequence, the encoder is capable of capturing the inherent structure of the data, and the generator can generate features with structures similar to those of the real ones.
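The CVAE branch can be sketched in PyTorch as follows. This is a minimal illustration, assuming the encoder outputs the mean and log-variance of the latent Gaussian and the reconstruction term is realized as an L2 error; neither choice is fixed by the paper.

```python
# Minimal sketch of the conditional VAE branch (Eqs. 1-5); shapes and the
# Gaussian reparameterization are assumptions.
import torch
import torch.nn.functional as F

def cvae_losses(encoder, generator, x_s, a_s):
    # Encoder predicts the mean and log-variance of p_E(z | x_s, a_s).
    mu, logvar = encoder(torch.cat([x_s, a_s], dim=1)).chunk(2, dim=1)
    z_f = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization

    # Generator decodes z_f conditioned on the attributes (Eq. 2).
    x_f = generator(torch.cat([z_f, a_s], dim=1))

    # KL divergence to the standard Gaussian prior p(z) (Eq. 4).
    l_kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    # Element-wise reconstruction loss (Eq. 5), realized here as an L2 error.
    l_recon = F.mse_loss(x_f, x_s)
    return l_kl, l_recon, z_f, x_f
```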

GAN Conditioned on Attributes and Categories. In the conventional generative adversarial network, the generator and the discriminator compete in a two-player minimax game. In our framework, the generator is conditioned on the semantic attributes. In addition, the category-wise information (i.e. labels), exploited by a categorizer, serves as another clue to help the generator produce discriminative features. We define the discriminator with parameters \({\theta _D}\) and the categorizer with parameters \({\theta _C}\). Concretely, the generator tries to minimize the following loss:

$$\begin{aligned} \begin{aligned} {L_G}({\theta _G},{\theta _D},{\theta _C}) =&- E[\log ({p_D}(G({z_p},{a_s})))] - E[\log ({p_D}(G({z_f},{a_s})))] \\&- E[\log ({p_C}({y_s}|{x_p}))] - E[\log ({p_C}({y_s}|{x_f}))], \end{aligned} \end{aligned}$$
(6)

where

$$\begin{aligned} {x_p} = G({z_p},{a_s}) \sim {p_G}(x|{z_p},{a_s}), {x_f} = G({z_f},{a_s}) \sim {p_G}(x|{z_f},{a_s}). \end{aligned}$$

In the meantime, the discriminator tries to minimize

$$\begin{aligned} {L_D}({\theta _G},{\theta _D},{\theta _C}) = - E[\log ({p_D}({x_s}))] - E[\log (1 - {p_D}({x_f}))] - E[\log (1 - {p_D}({x_p}))]. \end{aligned}$$
(7)

Given \({z_p}\) and \({z_f}\) along with the semantic attributes as the input, the generator aims to synthesize features that are similar to the real features and, at the same time, belong to the same classes as the real ones. The discriminator tries to distinguish real features from synthesized ones. After iterative training, the network will generate high-quality features under the guidance of both the semantic attributes and the category-wise information.
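The adversarial terms can be written as in the following sketch, assuming the discriminator outputs a probability in (0, 1) and the categorizer outputs logits over the seen classes plus one ‘fake’ class; these interface choices are ours for illustration, not prescribed by the paper.

```python
# Sketch of the adversarial objectives in Eqs. (6) and (7).
import torch
import torch.nn.functional as F

def generator_loss(D, C, x_p, x_f, y_s, eps=1e-8):
    # The generator wants both kinds of synthesized features to be judged real...
    l_adv = -(torch.log(D(x_p) + eps).mean() + torch.log(D(x_f) + eps).mean())
    # ...and to be categorized into the same classes as the real features.
    l_cat = F.cross_entropy(C(x_p), y_s) + F.cross_entropy(C(x_f), y_s)
    return l_adv + l_cat                                        # Eq. (6)

def discriminator_loss(D, x_s, x_p, x_f, eps=1e-8):
    return -(torch.log(D(x_s) + eps).mean()
             + torch.log(1 - D(x_f) + eps).mean()
             + torch.log(1 - D(x_p) + eps).mean())              # Eq. (7)
```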

As mentioned above, the categorizer helps to promote the discriminability of the generated features, which is similar in spirit to the classification network in [43]. However, we find that this additional regularization alone is not sufficient for the subsequent classification task. To this end, we treat the categorizer as another ‘discriminator’, which plays a minimax game with the generator at the category level. Concretely, the real features \({x_s}\) and the synthesized features \({x_f}\) and \({x_p}\) are fed into the categorizer, which tries to minimize the softmax-based categorization loss:

$$\begin{aligned} {L_C}({\theta _C}) = - E[\log ({p_C}({y_s}|{x_s}))] - E[\log ({p_C}({y_f}|{x_p}))] - E[\log ({p_C}({y_f}|{x_f}))], \end{aligned}$$
(8)

where \({y_f}\) denotes the label of the ‘fake’ class that is disjoint from the seen and unseen classes. In this way, the categorizer not only needs to classify the real features into the right classes but also regards the synthesized features as another ‘fake’ class. Through the competition, the generator is encouraged to generate features from the same classes as the real features.
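A sketch of this categorization loss is given below; it assumes the categorizer outputs \(C_s + 1\) logits, with the extra index reserved for the ‘fake’ class, which is one possible implementation of the label layout rather than the paper's stated one.

```python
# Sketch of the categorization loss in Eq. (8).
import torch
import torch.nn.functional as F

def categorizer_loss(C, x_s, x_p, x_f, y_s, fake_label):
    y_fake = torch.full_like(y_s, fake_label)   # index of the extra 'fake' class
    return (F.cross_entropy(C(x_s), y_s)        # real features -> true seen classes
            + F.cross_entropy(C(x_p), y_fake)   # synthesized features -> 'fake'
            + F.cross_entropy(C(x_f), y_fake))
```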

Perceptual Reconstruction. In addition to the complementary strengths of CVAE and CGAN, we seek a richer similarity metric to achieve more delicate generation results. As mentioned above, element/pixel-wise information and holistic structures can be preserved by VAE and GAN respectively, yet the semantic information they capture may not be sufficient. Thus, we incorporate a perceptual reconstruction loss into our framework. The perceptual loss has been explored in image style transfer and super-resolution [14], and encourages the generated features to be semantically similar to the real ones.

Specifically, we take advantage of the intermediate output of the discriminator and categorizer for perceptual reconstruction:

$$\begin{aligned} {L_{percept}}({\theta _D},{\theta _C}) = \left\| {{f_D}({x_s}) - {f_D}({x_f})} \right\| _2^2 + \left\| {{f_C}({x_s}) - {f_C}({x_f})} \right\| _2^2, \end{aligned}$$
(9)

where \({f_D}\) and \({f_C}\) are the outputs of the last hidden layers of the discriminator and categorizer, respectively.
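A possible realization of this loss is sketched below; it assumes the discriminator and categorizer expose their last hidden activations via a `features` method, which is an implementation convenience rather than something the paper specifies.

```python
# Sketch of the perceptual reconstruction loss in Eq. (9).
import torch.nn.functional as F

def perceptual_loss(D, C, x_s, x_f):
    # L2 distance between the last hidden activations of D and C, respectively.
    return (F.mse_loss(D.features(x_f), D.features(x_s))
            + F.mse_loss(C.features(x_f), C.features(x_s)))
```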

Overall Objective. The ultimate goal of our framework is to minimize the following overall loss function:

$$\begin{aligned} L = {L_{KL}} + {L_{recon}} + {L_D} + {L_C} + {\alpha }{L_G} + {\beta }{L_{percept}}. \end{aligned}$$
(10)

In particular, we alternately optimize each network branch in our framework as follows:

$$\begin{aligned} Encoder({\theta _E}) \leftarrow {L_{KL}} + {L_{recon}} + {\beta }{L_{percept}}; \end{aligned}$$
(11)
$$\begin{aligned} Generator({\theta _G}) \leftarrow {L_{recon}} + {\alpha }{L_G} + {\beta }{L_{percept}}; \end{aligned}$$
(12)
$$\begin{aligned} Discriminator({\theta _D}) \leftarrow {L_D}; \end{aligned}$$
(13)
$$\begin{aligned} Categorizer({\theta _C}) \leftarrow {L_C}. \end{aligned}$$
(14)

\({L_{KL}}\) only appears in Eq. (11) because it is only related to the encoder. Similarly, \({L_C}\) and \({L_D}\) are the objectives of the categorizer and discriminator, respectively. The generator is shared between the CVAE and CGAN, so its loss can be divided into two parts: \({L_{recon}}\) and \({L_{percept}}\) form the loss w.r.t. the CVAE, while \({L_G}\) is the loss w.r.t. the CGAN. All the objectives are complementary to each other, and joint training results in superior performance.
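One possible realization of this alternating scheme is sketched below, reusing the loss helpers sketched earlier; the optimizer handling, detaching strategy, and argument layout are assumptions rather than the authors' released training code.

```python
# Sketch of one alternating training step implementing Eqs. (11)-(14).
import torch

def train_step(E, G, D, C, opt_E, opt_G, opt_D, opt_C,
               x_s, a_s, y_s, fake_label, z_dim=256, alpha=0.01, beta=0.1):
    # Random latent codes drawn from the standard Gaussian prior.
    z_p = torch.randn(x_s.size(0), z_dim, device=x_s.device)

    # Encoder update: L_KL + L_recon + beta * L_percept (Eq. 11).
    l_kl, l_recon, _, x_f = cvae_losses(E, G, x_s, a_s)
    loss_E = l_kl + l_recon + beta * perceptual_loss(D, C, x_s, x_f)
    opt_E.zero_grad(); loss_E.backward(); opt_E.step()

    # Generator update: L_recon + alpha * L_G + beta * L_percept (Eq. 12).
    _, l_recon, _, x_f = cvae_losses(E, G, x_s, a_s)
    x_p = G(torch.cat([z_p, a_s], dim=1))
    loss_G = (l_recon + alpha * generator_loss(D, C, x_p, x_f, y_s)
              + beta * perceptual_loss(D, C, x_s, x_f))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

    # Discriminator update: L_D (Eq. 13).
    loss_D = discriminator_loss(D, x_s, x_p.detach(), x_f.detach())
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Categorizer update: L_C (Eq. 14).
    loss_C = categorizer_loss(C, x_s, x_p.detach(), x_f.detach(), y_s, fake_label)
    opt_C.zero_grad(); loss_C.backward(); opt_C.step()
```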


3.3 Zero-Shot Recognition

After finishing the training process, the synthesized features of unseen classes can be obtained through our generator network. In particular, given an arbitrary latent representation drawn from the Gaussian distribution \({z_t} \sim \mathcal {N}(0,I)\) and the semantic attributes \({a_u}\) of the corresponding unseen class as the input, the generator will output the synthesized features as follows:

$$\begin{aligned} x_{gen}= G({z_t},{a_u}) \sim {p_G}(x|{z_t},{a_u}). \end{aligned}$$
(15)

Based on the generated features, zero-shot recognition can be transformed into a conventional supervised learning problem. As previously mentioned, there exist two settings for zero-shot recognition, i.e. the conventional ZSL and the more challenging GZSL. In the conventional ZSL setting, we train the softmax classifier based on \(x_{gen}\) and then test on the real features of unseen classes, i.e.  \(x_{u}\). In the GZSL setting, the original data of seen classes \(x_{s}\) is divided into two parts, i.e.  \(x_{s}^{tr}\) for training and \(x_{s}^{ts}\) for testing. During training, we employ \(x_{gen}\) together with \(x_{s}^{tr}\) as the training samples to learn the softmax classifier. At test time, we evaluate on \(x_{u}\) and \(x_{s}^{ts}\) to obtain the final recognition accuracy.
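The recognition stage can be sketched as follows; the per-class sample count, the latent dimension, and the use of scikit-learn's multinomial logistic regression as a stand-in for the softmax classifier are assumptions for illustration.

```python
# Sketch of unseen-feature synthesis (Eq. 15) and the supervised classifier.
import torch
from sklearn.linear_model import LogisticRegression

def synthesize_unseen(G, attrs_u, labels_u, n_per_class=300, z_dim=256):
    feats, labels = [], []
    for a_u, y_u in zip(attrs_u, labels_u):
        z_t = torch.randn(n_per_class, z_dim)                  # z_t ~ N(0, I)
        a = a_u.unsqueeze(0).expand(n_per_class, -1)
        feats.append(G(torch.cat([z_t, a], dim=1)).detach())   # x_gen = G(z_t, a_u)
        labels.append(torch.full((n_per_class,), int(y_u)))
    return torch.cat(feats), torch.cat(labels)

# Conventional ZSL usage (assuming a trained generator G and unseen attributes):
#   x_gen, y_gen = synthesize_unseen(G, attrs_u, labels_u)
#   clf = LogisticRegression(max_iter=1000).fit(x_gen.numpy(), y_gen.numpy())
#   preds = clf.predict(x_u)   # evaluate on the real unseen features
```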

4 Experimental Results

In this section, we evaluate the proposed method on five ZSL benchmark datasets. First, we briefly introduce the datasets, the implementation details of our framework, and the evaluation protocols. To show the effectiveness of our framework, we then present experimental results on both the conventional ZSL and GZSL tasks, comparing with several state-of-the-art and baseline methods. Finally, we demonstrate the high quality of the generated features both quantitatively and qualitatively.

4.1 Datasets

Five classic datasets for ZSL are adopted in our experiments, i.e. AWA1 [18], AWA2 [42], CUB [40], SUN [29], and aPY [8]. AWA1 [18] is the original Animals with Attributes dataset, which has 30475 images in 50 classes, and each class is annotated with 85 attributes. However, the images of AWA1 are not publicly available. The AWA2 [42] dataset, containing 37322 images, is a good replacement for AWA1. These two datasets share the same classes and class-level attributes. Caltech-UCSD Birds 200-2011 (CUB) [40] is a fine-grained dataset with 11788 images of birds of 200 different types annotated with 312 attributes. SUN [29] is also a fine-grained dataset that contains 14340 images from 717 types of scenes annotated with 102 attributes. Attribute Pascal and Yahoo (aPY) [8] is a small-scale dataset with 15339 images from 32 classes annotated with 64 attributes. The details of the five datasets are summarized in Table 1.

Table 1. Statistics of the datasets in terms of the number of images, attributes, seen/unseen classes, and the training/test split.

As for image features, we employ the ResNet features provided in [42]. Regarding class embeddings, we use the class-level continuous attributes for all datasets, because continuous attributes achieve better performance than binary ones, as pointed out in [1]. As for data splits, in the early standard splits [18] some of the test classes of each dataset are among the 1K ImageNet classes used to pre-train the ResNet, which leads to biased results. Therefore, we follow the recently proposed splits in [42] to avoid this. The detailed seen/unseen splits are also shown in Table 1.

4.2 Implementation Details and Parameter Settings

In our framework, all the networks are Multi-Layer Perceptrons (MLP) with LeakyReLU activations [44]. The encoder, generator, and discriminator each consist of a single hidden layer with 1000 units, and the categorizer contains a single hidden layer with 1024 units. As each dataset has different attribute annotations, we set the dimension \({d_z}\) of \({z_f}\) and \({z_p}\) according to the number of class-level attributes of each dataset. Specifically, we set \({d_z}=256\) for AWA1, AWA2, SUN, and aPY, and \({d_z}=512\) for CUB, as CUB has many more attributes.
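The four branches can be instantiated as in the sketch below; the output dimensions, the LeakyReLU slope, and the example dataset constants are illustrative assumptions.

```python
# Sketch of the four MLP branches with the hidden sizes stated above.
import torch.nn as nn

def mlp(in_dim, hidden, out_dim):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.LeakyReLU(0.2),
                         nn.Linear(hidden, out_dim))

d_x, d_a, d_z, C_s = 2048, 85, 256, 40   # e.g. AWA1: ResNet features, 85 attributes

E = mlp(d_x + d_a, 1000, 2 * d_z)                    # encoder -> (mu, logvar)
G = mlp(d_z + d_a, 1000, d_x)                        # generator/decoder -> features
D = nn.Sequential(mlp(d_x, 1000, 1), nn.Sigmoid())   # discriminator -> real/fake prob
C = mlp(d_x, 1024, C_s + 1)                          # categorizer -> seen + 'fake'
```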

For network training, we first pre-train the categorizer branch on the seen data for fast convergence. In terms of the hyper-parameters, we empirically set \({\alpha }=0.01\) and \({\beta }=0.1\) across all the datasets. The number of generated features is chosen to trade off computational efficiency against classification accuracy. Specifically, in the conventional ZSL task, we set the number of generated features to eight times the number of ground-truth unseen features on CUB, SUN, and aPY, and to twice that number on AWA1 and AWA2. As for GZSL, we set the number of generated features to eight times the number of ground-truth unseen features on SUN and aPY, and to six times on CUB, AWA1, and AWA2.

4.3 Evaluation Protocol

As mentioned above, in conventional ZSL settings, we aim to classify the unseen features \(x_u\) into the corresponding unseen classes \({Y_u}\). In GZSL settings, the class space is \({Y_s} \cup {Y_u}\) and we need to assign class labels to both unseen features and some of the seen features. Here we follow the unified evaluation protocol in [42].

In the conventional ZSL setting, we compute the top-1 accuracy for each class and then average the per-class accuracies to mitigate the imbalance among classes. The evaluation metric is defined as follows:

$$\begin{aligned} acc=\frac{1}{\left| C \right| }\sum _{c=1}^{\left| C \right| }\frac{n_{cp}}{n_c}, \end{aligned}$$
(16)

where \({\left| C \right| }\) denotes the number of classes, \({n_{c}}\) denotes the number of samples in class c, and \({n_{cp}}\) is the number of correct predictions in class c. Regarding GZSL, we use the harmonic mean, which is computed as follows:

$$\begin{aligned} H=\frac{2su}{s+u}, \end{aligned}$$
(17)

where s and u represent the average per-class top-1 accuracies on seen and unseen classes, respectively. A higher harmonic mean indicates high accuracies on both seen and unseen classes.
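Both metrics are straightforward to compute; a minimal sketch is given below.

```python
# Sketch of the evaluation metrics: per-class top-1 accuracy (Eq. 16) and the
# harmonic mean for GZSL (Eq. 17).
import numpy as np

def per_class_top1(y_true, y_pred):
    classes = np.unique(y_true)
    accs = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(accs))

def harmonic_mean(acc_seen, acc_unseen):
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)
```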

Table 2. Comparison results with the state-of-the-art methods in terms of both ZSL and GZSL settings. T1 = top-1 accuracy, u = top-1 accuracy on unseen data, s = top-1 accuracy on seen data, and H = harmonic mean. We report top-1 accuracies in %.

4.4 Comparison with the State-of-the-Art Methods

Table 2 shows the conventional ZSL results of our framework and the state-of-the-art methods. In this setting, the search space at test time is restricted to the unseen classes. From the table, we observe that our method achieves better zero-shot recognition accuracies than the traditional ZSL methods. The improvement is especially obvious on AWA1 and AWA2, with 6.6% and 11.2% higher accuracies than the second-best traditional ZSL methods, respectively. Compared with generative-model-based ZSL methods, our method performs better on the AWA1 and AWA2 datasets, with 0.6% higher accuracy on both. Concerning the CUB, SUN, and aPY datasets, our method performs competitively with the best ones, i.e. the SE-GZSL [39] and DEVISE [10] methods, respectively. The results clearly demonstrate that our framework is capable of generating useful and expressive features of unseen classes, which are beneficial for ZSL tasks.

In terms of the GZSL task, as illustrated in Table 2, our framework shows superiority over the traditional ZSL methods on all five datasets. For example, significant improvements w.r.t. the harmonic mean are observed, being 172.2%, 123.7%, 122.2%, 32.7%, and 28.2% higher than the second-best ones on aPY, AWA2, AWA1, SUN, and CUB, respectively. It is noteworthy that most traditional ZSL methods achieve high accuracies on seen classes but much worse performance on unseen classes, which indicates that those methods have strong biases towards seen classes. Our model mitigates this bias to a large extent, as shown in Table 2. Compared with ZSL methods based on generative models, our model shows superiority on the AWA1 and AWA2 datasets. Moreover, our model has the highest accuracy on unseen classes on the AWA1, SUN, and aPY datasets, indicating that it is capable of balancing the accuracies between seen and unseen classes. Therefore, our generative framework is very useful and competitive in this realistic and challenging task.

Table 3. Comparison results with the baseline models in terms of both ZSL and GZSL settings. T1 = top-1 accuracy, u = top-1 accuracy on unseen data, s = top-1 accuracy on seen data, and H = harmonic mean. We report top-1 accuracies in %.

4.5 Comparison with the Baseline Models

As our framework contains several networks together with the perceptual reconstruction, we compare the proposed framework with four baseline models, each obtained by omitting one component, in order to verify the importance of each branch. For example, as shown in Table 3, ‘CVAE+CAT’ indicates the framework containing only the CVAE and the categorizer, and ‘Proposed (w/o \(L_{percept}\))’ denotes the whole network without the perceptual reconstruction.

The results w.r.t. the conventional ZSL setting are shown in Table 3. From the results of ‘CVAE+CAT’ and ‘CGAN+CAT’, we can conclude that integrating the CVAE and CGAN is beneficial for the ZSL task, and the improvement brought by the CVAE is more significant. This shows that element-wise reconstruction is essential for our task. The results of ‘CVAE+CGAN’ also demonstrate the necessity of the categorizer branch. For example, the accuracy on aPY is improved by 7.7% by adding the categorizer. Finally, we can see that our framework with perceptual reconstruction outperforms the one without \(L_{percept}\).

As for GZSL, compared with the above baselines, the proposed model achieves higher accuracies on unseen classes and a higher harmonic mean, since it better balances the seen and unseen classes. Overall, ‘CGAN+CAT’ achieves the worst performance, probably because CGAN captures only the holistic data structure, which is not sufficient for feature generation. After combining the CVAE, the performance is enhanced significantly. All the above results in the ZSL and GZSL settings clearly demonstrate the indispensability of each part of our framework.

Fig. 3. Top-1 accuracy with different numbers of generated features on the CUB and SUN datasets.

4.6 Analysis of the Generated Features

In this section, we present further analyses of the synthesized features of unseen classes. Figure 3 shows the classification accuracies for the unseen classes as the number of generated features increases. In general, the accuracy increases when more unseen features are generated. We also observe that satisfactory accuracies can be achieved even when the number of generated features is relatively small, which indicates that our model can generate high-quality features for the classification task. The generated features can thus serve as an excellent replacement for the missing unseen features. Taking some unseen classes of the AWA1 dataset as an example, we can see from Fig. 4 that the generated feature distribution is even more discriminative than the real feature distribution. This further indicates that our generative model can synthesize high-quality features that are beneficial for the classification task.
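A visualization similar to Fig. 4 can be produced with the sketch below; the use of scikit-learn's t-SNE and the plotting details are assumptions about how such a figure might be generated, not the authors' plotting code.

```python
# Sketch of a t-SNE visualization of real vs. generated unseen-class features.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(real_feats, gen_feats, labels_real, labels_gen):
    feats = np.concatenate([real_feats, gen_feats])
    emb = TSNE(n_components=2, perplexity=30).fit_transform(feats)
    n = len(real_feats)
    plt.scatter(emb[:n, 0], emb[:n, 1], c=labels_real, marker='o', label='real')
    plt.scatter(emb[n:, 0], emb[n:, 1], c=labels_gen, marker='x', label='generated')
    plt.legend()
    plt.show()
```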

Fig. 4. t-SNE visualization of the real/generated features of some unseen classes on the AWA1 dataset.

5 Conclusion

In this work, we proposed an effective joint generative framework for feature generation in the context of zero-shot learning. Specifically, our model combined two popular generative models, i.e. VAE and GAN, to capture the element-wise and holistic data structures at the same time. We took advantage of the class-level semantic attributes as the conditional information. An additional categorization network worked as the guidance for generating discriminative features. Importantly, we incorporated the perceptual reconstruction into the framework to preserve semantic similarities. We showed the superiority of the proposed generative framework by conducting experiments on five standard datasets in terms of the conventional ZSL task as well as the more challenging GZSL task. The extensive experimental results indicated that our model could generate high-quality features to mitigate the domain gap in ZSL due to the lack of unseen data.