1 Introduction

In the zero-shot learning task, a classifier is trained on datapoints from seen classes and applied to recognize previously unseen datapoints belonging to unseen classes. The main objective is to leverage knowledge from label embeddings, e.g. attributes, word embeddings or class hierarchy information, to build a universal mapping that can classify unseen datapoints without retraining the system on new unseen classes. Firstly, let us denote by \(\mathbf{X}_{tr}\) the training datapoints from seen classes \(C_s\) and by \(\mathbf{X}_{ts}\) the testing datapoints from unseen classes \(C_u\) such that \(C_s \cap C_u = \emptyset\). The model is trained on \(\mathbf{X}_{tr}\) but needs to assign a label \(l \in C_u\) to each datapoint from \(\mathbf{X}_{ts}\). Recently, researchers have argued that standard zero-shot learning protocols are biased towards good results on unseen classes while neglecting performance on seen classes. To address this issue, the generalized zero-shot learning task was proposed, for which testing datapoints come from both seen and unseen classes, and the classifier needs to cope well with all classes \(C = C_s \cup C_u\).

It has emerged that most zero-shot learning methods achieve low accuracy under this protocol because training datapoints come only from the seen classes. In most cases, the strong imbalance of the data distribution makes the classifier assign datapoints from unseen classes to seen classes.

The use of a Generative Adversarial Network (GAN) to generate auxiliary datapoints for unseen classes [1] enables the classifier to be trained on datapoints from both seen and unseen categories. Inspired by such an extension, we found that using the auxiliary and original training data to learn a classifier, e.g. a Support Vector Machine (SVM), can be further improved by treating the classification of original datapoints separately, that is, by decomposing generalized zero-shot learning into two disjoint classification tasks: one classifier dealing with datapoints from seen classes and another dealing with datapoints from unseen classes.

In this paper, we propose to use the auxiliary data of unseen classes generated by GAN together with the original training data to build a model selection approach for generalized zero-shot learning. We refer to our approach as ModelSel and propose its three variants in Sect. 3. We evaluate ModelSel on four standard datasets and demonstrate state-of-the-art results.

2 Related Work

Zero-shot learning is a form of transfer learning. Specifically, it utilizes the knowledge learned from datapoints of seen classes and their attribute vectors to generalize to and recognize testing datapoints from new classes. The majority of previous zero-shot learning methods use a linear mapping to capture the relation between feature and attribute vectors. Attribute Label Embedding (ALE) [2] uses the attributes as label embedding and presents an objective inspired by the structured WSABIE ranking method, which assigns more importance to the top of the ranking list. Embarrassingly Simple Zero-Shot Learning (ESZSL) [3] uses a linear mapping and a simple empirical objective with several regularization terms that penalize the projection of features from the Euclidean into the attribute space and the projection of attribute vectors back into the Euclidean space. Structured Joint Embedding (SJE) [4] proposes an objective inspired by the structured SVM applied to a linear mapping, while [5] proposes new data splits and evaluation protocols to eliminate the overlap between the classes of ImageNet [6] and zero-shot learning datasets. Zero-shot Kernel Learning (ZSKL) [7] proposes a non-linear kernel method with constraints that make the columns of the projection matrix weakly incoherent. Feature Generating Networks [1] leverages a conditional Wasserstein Generative Adversarial Network (WGAN) to generate auxiliary datapoints for unseen classes from attribute vectors, followed by training a simple Softmax classifier. SoSN [8] and So-HoT [9] use second-order statistics [10] for similarity learning and domain adaptation.

3 Approach

3.1 Notations

Let us denote seen classes as \(C_s\) and unseen classes as \(C_u\). \(\mathbf{X}_{tr}\) denotes the original training datapoints and \(\mathbf{X}_{ge}\) the datapoints generated for unseen classes; each datapoint is a column vector in one of these matrices. \(M_{sel}\) is the selector between seen and unseen classes, \(M_s\) is the model for \(C_s\), \(M_u\) is the model for \(C_u\), and \(M_t\) is a model for \(C_s\cup C_u\). Moreover, \(\varvec{w}_{sel}\), \(b_{sel}\), \(\varvec{W}_s\), \(\mathbf{b}_s\), \(\varvec{W}_u\), \(\mathbf{b}_u\), \(\varvec{W}_t\) and \(\mathbf{b}_t\) are the projection vector/matrices and biases used by our models, as detailed below.

3.2 Model Selection Mechanism

In this paper, we propose a mechanism that leverages several classifiers to perform generalized zero-shot learning. Firstly, we label the original datapoints as 1 and auxiliary datapoints as \(-1\) to train \(M_{sel}\), which is a linear SVM classifier.
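For concreteness, below is a minimal sketch of this selector training step using scikit-learn's LinearSVC; the choice of SVM implementation and all variable names are our assumptions, as the paper does not name a package.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_selector(X_tr, X_ge, C=1.0):
    """Train M_sel: original datapoints labeled +1, auxiliary (GAN) datapoints -1.

    X_tr: original training datapoints, shape (n_tr, d).
    X_ge: generated datapoints for unseen classes, shape (n_ge, d).
    """
    X = np.vstack([X_tr, X_ge])
    y = np.concatenate([np.ones(len(X_tr)), -np.ones(len(X_ge))])
    svm = LinearSVC(C=C).fit(X, y)
    # w_sel and b_sel as used by the selector score in Eq. (4).
    return svm.coef_.ravel(), svm.intercept_[0]
```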

Model \(M_s\) is a classifier trained on datapoints from the seen classes \(C_s\), while model \(M_u\) is trained on the auxiliary datapoints generated by the GAN for the unseen classes \(C_u\). Model \(M_t\) is trained on \(C_s\cup C_u\) simultaneously.

\(M_s\), \(M_u\) and \(M_t\) are trained separately via the SoftmaxLog classifier. While we use a single training process, we distinguish three selection models applied at the testing stage. The output of each classifier can be defined as:

$$\begin{aligned}&\mathbf {g}_s(\mathbf {x}) = \varvec{W}_s^T\mathbf {x}+ \mathbf {b}_s,\end{aligned}$$
(1)
$$\begin{aligned}&\mathbf {g}_u(\mathbf {x}) = \varvec{W}_u^T\mathbf {x}+ \mathbf {b}_u,\end{aligned}$$
(2)
$$\begin{aligned}&\mathbf {g}_t(\mathbf {x}) = \varvec{W}_t^T\mathbf {x}+ \mathbf {b}_t. \end{aligned}$$
(3)
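All three classifiers are plain affine maps, so their test-time scores reduce to one line of NumPy. A sketch follows; the shapes are our assumption, with features treated as column vectors of dimension d:

```python
import numpy as np

# W: (d, num_classes), b: (num_classes,), x: (d,).
def g(W, b, x):
    """Affine classifier score of Eqs. (1)-(3): g(x) = W^T x + b."""
    return W.T @ x + b
```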

ModelSel-2Way. The testing mechanism of ModelSel-2Way works as follows. Each testing datapoint \(\mathbf{x}\in \mathbf{X}_{ts}\) is first fed into \(M_{sel}\), whose role is to decide whether \(\mathbf{x}\) belongs to a seen or an unseen class, based on which we select either the \(M_s\) or the \(M_u\) model for the final classification:

$$\begin{aligned} s(\mathbf {x}) = \varvec{w}_{sel}^T\mathbf {x}+ b_{sel}. \end{aligned}$$
(4)

Then, the final prediction for \(\mathbf {x}\) becomes:

$$\begin{aligned} \mathbf{f}(\mathbf{x}, s(\mathbf{x})) = {\left\{ \begin{array}{ll} \mathbf{g}_s(\mathbf{x}), & \text{if } s(\mathbf{x})\ge 0, \\ \mathbf{g}_u(\mathbf{x}), & \text{otherwise.} \end{array}\right.} \end{aligned}$$
(5)
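Combining Eqs. (4) and (5), the test-time rule of ModelSel-2Way is a hard switch between the two classifiers. A sketch reusing g from above; the label bookkeeping (mapping argmax indices back to class labels) is our assumption:

```python
import numpy as np

def predict_2way(x, w_sel, b_sel, W_s, b_s, W_u, b_u, seen_labels, unseen_labels):
    """Hard model selection of Eqs. (4)-(5)."""
    s = w_sel @ x + b_sel                       # selector score, Eq. (4)
    if s >= 0:                                  # x judged to come from a seen class
        return seen_labels[int(np.argmax(g(W_s, b_s, x)))]
    return unseen_labels[int(np.argmax(g(W_u, b_u, x)))]
```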
Fig. 1. Our ModelSel-2Way approach.

Fig. 2. Our ModelSel-2Way-SA approach.

ModelSel-2Way-SA. We also propose to use the Sigmoid function to turn the output of \(M_{sel}\) into soft assignment scores that weight the outputs of \(M_s\) and \(M_u\). We call this method ModelSel-2Way-SA. The intuition behind this model is that \(M_{sel}\) suffers from quantization errors close to the classification boundary; we therefore model the assignment uncertainty of \(M_{sel}\) to reduce these errors. The probabilities that \(\mathbf{x}\) belongs to the seen classes \(C_s\) or the unseen classes \(C_u\) are denoted \(p_s(\mathbf{x})\) and \(p_u(\mathbf{x}) = 1 - p_s(\mathbf{x})\), respectively, where \(p_s(\mathbf{x})\) is given as (Figs. 1 and 2):

$$\begin{aligned} p_s(\mathbf {x}) = \frac{1}{1 + e^{-\sigma s(\mathbf {x})}}, \end{aligned}$$
(6)

where \(\sigma\) is a parameter controlling the slope of the Sigmoid function. Then, the output of ModelSel-2Way-SA is given as:

$$\begin{aligned} \mathbf {f}(\mathbf {x}) = p_s(\mathbf {x})\cdot \mathbf {g}_s(\mathbf {x}) + p_u(\mathbf {x})\cdot \mathbf {g}_u(\mathbf {x}). \end{aligned}$$
(7)
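Since \(\mathbf{g}_s\) and \(\mathbf{g}_u\) score disjoint label sets, one natural reading of Eq. (7) is to weight the two score vectors and decide over the concatenated label space \([C_s; C_u]\). A sketch under that assumption, reusing g from above:

```python
import numpy as np

def predict_2way_sa(x, w_sel, b_sel, W_s, b_s, W_u, b_u, sigma=1.0):
    """Soft model selection of Eqs. (6)-(7)."""
    s = w_sel @ x + b_sel
    p_s = 1.0 / (1.0 + np.exp(-sigma * s))      # Eq. (6)
    # Weighted scores over the concatenated label space [C_s; C_u], Eq. (7).
    f = np.concatenate([p_s * g(W_s, b_s, x), (1.0 - p_s) * g(W_u, b_u, x)])
    return int(np.argmax(f))                    # index into [C_s; C_u]
```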

ModelSel-3Way. For ModelSel-3Way, we additionally use the classifier \(M_t\), trained on both original and auxiliary datapoints so that it can classify data from both seen and unseen classes. While its performance is worse than that of \(M_s\) and \(M_u\) within their respective domains, we leverage the output of \(M_t\) as a mask to correct some incorrect predictions of \(M_s\) and \(M_u\). The output of our ModelSel-3Way model, shown in Fig. 3, is defined as follows:

$$\begin{aligned} \mathbf{f}(\mathbf{x}, s(\mathbf{x})) = \max\left( {\left\{ \begin{array}{ll} c\cdot \mathbf{g}_t(\mathbf{x}) + \mathbf{g}_s(\mathbf{x}) - o_s, & \text{if } s(\mathbf{x})\ge 0, \\ c\cdot \mathbf{g}_t(\mathbf{x}) + \mathbf{g}_u(\mathbf{x}) - o_u, & \text{if } s(\mathbf{x})<0, \end{array}\right.}\;\; \mathbf{g}_t(\mathbf{x}) \right), \end{aligned}$$
(8)

where \(c\) adjusts the importance of \(M_t\), and \(o_s\) and \(o_u\) are offsets for \(M_s\) and \(M_u\), respectively; the two cases correspond to the gray and black regions in Fig. 4, while the \(\mathbf{g}_t(\mathbf{x})\) term covers the white region. Intuitively, close to the classification boundary, the predictions of \(\mathbf{g}_s(\mathbf{x})\) and \(\mathbf{g}_u(\mathbf{x})\) are replaced by \(\mathbf{g}_t(\mathbf{x})\) in this model.
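A sketch of Eq. (8), again reusing g. We assume the seen classes occupy the first \(|C_s|\) coordinates of \(C\) and embed the per-domain scores into the full label space before taking the element-wise maximum with \(\mathbf{g}_t(\mathbf{x})\):

```python
import numpy as np

def predict_3way(x, w_sel, b_sel, W_s, b_s, W_u, b_u, W_t, b_t, c, o_s, o_u):
    """3-way model selection of Eq. (8)."""
    s = w_sel @ x + b_sel
    gt = g(W_t, b_t, x)                         # scores over C = C_s ∪ C_u
    n_s = W_s.shape[1]
    # Embed the selected domain classifier into the full label space; outside
    # its domain only the g_t "mask" scores survive the maximum below.
    domain = np.full_like(gt, -np.inf)
    if s >= 0:
        domain[:n_s] = c * gt[:n_s] + g(W_s, b_s, x) - o_s
    else:
        domain[n_s:] = c * gt[n_s:] + g(W_u, b_u, x) - o_u
    f = np.maximum(domain, gt)                  # element-wise max, Eq. (8)
    return int(np.argmax(f))
```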

Figure 4 illustrates the selection of classifiers in our ModelSel-3Way approach. We define \(N\) as the total number of testing datapoints, and \(N_s\) and \(N_u\) as the numbers of testing datapoints assigned to the seen and unseen classes \(C_s\) and \(C_u\), respectively. The distribution map has the same size as \(\mathbf{g}_t(\mathbf{X}_{ts})\in \mathbb{R}^{C\times N}\); the light gray color highlights successful predictions from \(\mathbf{g}_s(\mathbf{X}_{ts}) \in \mathbb{R}^{C_s \times N_s}\), while the black color highlights successful predictions from \(\mathbf{g}_u(\mathbf{X}_{ts}) \in \mathbb{R}^{C_u \times N_u}\).

Fig. 3. Our ModelSel-3Way approach.

Fig. 4. The selection of classifiers in our ModelSel-3Way.

4 Experiments

Below we detail the datasets used in our experiments, describe the evaluation protocols and present experimental results that demonstrate the usefulness of our approach.

4.1 Setup

Datasets. We evaluate the proposed models on four datasets. Attribute Pascal and Yahoo (APY) contains 15339 images, 64 attributes and 32 classes; the 20 classes from Pascal VOC are used for training and the 12 classes collected from Yahoo! are used for testing. Animals with Attributes (AWA1) contains 30475 images from 50 classes, each annotated with 85 attributes; the zero-shot learning split of AWA1 uses 40 classes for training and 10 classes for testing. Animals with Attributes 2 (AWA2), proposed by [5], is an updated, open-source version of AWA1 with the same numbers of classes and attributes and the same train/test split as AWA1. Flower102 (FLO) [11] contains 8189 images from 102 classes.

An evaluation paper [5] proposes novel zero-shot learning splits to eliminate the overlap between the classes in zero-shot learning datasets and ImageNet [6], and evaluates the most popular zero-shot learning methods. In this paper, we follow the new splits to ensure a fair comparison with other state-of-the-art methods.

Parameters. We perform mean subtraction and standard deviation normalization on both original and auxiliary datapoints when training \(M_{sel}\), to alleviate the imbalance between the two distributions. For \(M_s\) and \(M_u\), we simply use the original data provided by [5] without any preprocessing. Our models use classifiers with the SoftmaxLog objective. We use the Adam solver with mini-batches of size 60; the parameters of Adam are set to \(\beta_1 = 0.9\) and \(\beta_2 = 0.99\). We run the solver for 50 epochs with the learning rate set to \(10^{-4}\). The parameters used by ModelSel-2Way and ModelSel-3Way are chosen via cross-validation.
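The paper does not name a framework; a minimal PyTorch sketch of this training setup (a linear SoftmaxLog classifier optimized with Adam, hyper-parameters as stated above) could look as follows:

```python
import torch

def train_softmaxlog(X, y, num_classes, epochs=50, lr=1e-4, batch=60):
    """Train a linear classifier with the SoftmaxLog (cross-entropy) objective.

    X: float tensor of shape (n, d); y: long tensor of class indices, shape (n,).
    """
    model = torch.nn.Linear(X.shape[1], num_classes)   # computes W^T x + b
    opt = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.99))
    loss_fn = torch.nn.CrossEntropyLoss()              # softmax + log-loss
    for _ in range(epochs):
        perm = torch.randperm(len(X))                  # reshuffle each epoch
        for i in range(0, len(X), batch):
            idx = perm[i:i + batch]
            opt.zero_grad()
            loss_fn(model(X[idx]), y[idx]).backward()
            opt.step()
    return model
```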

Protocols. All models are trained at once, as the training process is the same for each model. For testing, we follow the generalized zero-shot learning protocol of [5]. There are two testing splits, for seen and unseen classes, respectively. We evaluate on both testing splits and collect two per-class mean top-1 accuracies, \(Acc_S\) and \(Acc_U\), as suggested by [5]. We report the harmonic mean of the two results as the final score:

$$\begin{aligned} H = 2\frac{Acc_S\cdot Acc_U}{Acc_S + Acc_U}. \end{aligned}$$
(9)
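For reference, the final score is a one-liner; the example values below are ours:

```python
def harmonic_mean(acc_s, acc_u):
    """Harmonic mean of the seen/unseen per-class top-1 accuracies, Eq. (9)."""
    return 2.0 * acc_s * acc_u / (acc_s + acc_u)

# e.g. harmonic_mean(0.6, 0.4) -> 0.48: the score penalizes imbalance
# between seen and unseen accuracy.
```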

4.2 Evaluations

Figure 5 shows how the classification accuracy varies w.r.t. \(\sigma\) in ModelSel-2Way-SA. It can be seen that the soft assignment scores, obtained by passing SVM scores through the Sigmoid function, help improve the performance of our model.

Fig. 5. The influence of \(\sigma\) on the classification accuracy.

Table 1 shows that our models obtain state-of-the-art results on the AWA1, AWA2, FLO and APY datasets. Compared to f-CLSWGAN, our ModelSel-3Way achieves a \(2.8\%\) higher accuracy on AWA1, \(3.6\%\) on AWA2 and \(0.8\%\) on FLO. The biggest improvement for ModelSel-2Way-SA is observed on APY, where the accuracy increases from \(20.5\%\) for ZSKL [7] to \(42.3\%\). These evaluations illustrate that our models combine predictions on seen and auxiliary datapoints better than current state-of-the-art approaches.

Table 1. Evaluations on generalized zero-shot learning

5 Conclusions

In this paper, we have presented three model selection approaches which introduce a novel way of leveraging generated datapoints in the generalized zero-shot learning task. Different from [1], our models use original and generated datapoints to train a selector function which distinguishes between the classifiers for seen and unseen classes. Our ModelSel variants achieve state-of-the-art results on four publicly available datasets.