
1 Introduction

Metric learning aims to learn a mapping under which distances in the embedding space reflect semantic similarity. A good metric produces embeddings where samples from the same class have small distances and samples from different classes have large distances. Recent supervised metric learning methods uncover the potential of deep convolutional neural networks as the nonlinear mapping function by designing sampling algorithms [8, 10, 18, 21, 24, 34, 37, 40] or modifying loss functions [3, 7, 18, 23, 33, 36, 37]. These methods usually share the same motivation of maximizing inter-class distance and minimizing intra-class distance. Behind this motivation lies a basic assumption that every sample from the same class shares the same embedding feature. However, is this assumption really accurate?

In this paper, we provide a negative answer. Indeed, there are intra-class variances, such as pose, view point, and illumination, and a robust model should be able to handle them. However, given a limited training set, deep models easily overfit if we force them to be indiscriminating to these intra-class variances. For example, in image classification, if the illumination of the object region varies too much among samples, the model would probably be trained to ignore these important parts and instead classify the training samples by their backgrounds. This leads to poor generalization ability. Therefore, this assumption is ideal but not practical.

Fig. 1. (Best viewed when zoomed in.) Our insight: in the central latent space, the distribution of intra-class variance is independent of class. This is the visualization of the central latent space of features learned with the N-pair loss [23] using Barnes-Hut t-SNE [30] on the Cars196 test set. The color of the bounding box for each image represents the class label. Here we construct the central latent space by subtracting samples’ class centers from their features. We assumed and verified that similar changes to the original images, such as the same pose change or the same view-point change, affect their features in a similar way.

Our insight is that the distribution of intra-class variance is actually independent of class. It is obvious that for each class, the possible intra-class variances come from exactly the same set. As presented in Fig. 1, a similar pose change in samples from different classes leads to a cluster in the central latent space. Furthermore, if we know the distribution of intra-class variance, then we can generate potentially hard samples from easy samples by adding intra-class variance to them. Therefore we can confidently propose our modified assumption: embedding features of samples from the same class consist of two parts; one represents the intra-class invariance, and the other represents the intra-class variance, which obeys an identical distribution across classes.

In this paper, we propose a deep variational metric learning (DVML) framework following this assumption. Utilizing variational inference, we can force the conditional distribution of intra-class variance, given a certain image sample, to be an isotropic multivariate Gaussian. Moreover, we can utilize most current metric learning algorithms to train the intra-class invariance. To be specific, the training procedure of DVML is simultaneously constrained by the following four loss functions: (1) the KL divergence between the learned distribution and an isotropic multivariate Gaussian; (2) the reconstruction loss between the original images and the images generated by the decoder; (3) the metric learning loss on the learned intra-class invariance; (4) the metric learning loss on the combination of sampled intra-class variance and learned intra-class invariance. The first two losses ensure that the intra-class variance shares the same distribution across classes and does contain sample-specific information of each image sample. The third ensures that the intra-class invariance represents a good class center for each class, and the fourth ensures a robust boundary among classes.

To the best of our knowledge, this is the first work that utilizes variational inference to disentangle intra-class variance and leverages the distribution to generate discriminative samples to improve robustness. It is noticeable that our framework is also applicable to hard negative mining methods. Additionally, experimental results on three benchmark datasets, including CUB-200-2011, Cars196, and Stanford Online Products, show that DVML significantly boosts the performance of existing deep metric learning algorithms.

2 Related Work

Great progress has been made in metric learning [6, 9, 10, 15, 35, 39, 41] recently. Conventional metric learning algorithms learn a linear Mahalanobis distance to measure the similarities of samples [1, 5, 19, 20, 36]. Some previous works [32, 38] also tried to formulate metric learning as a variational inference problem, while focusing on the distribution of pairwise distances. There are also attempts at combining latent variables and metric learning [26], although in that work the latent variables are the features of patches cropped from images.

Recently, metric learning with deep neural networks has been extensively explored. There are mainly two directions: sampling methods and loss functions. With sampling methods [8, 10, 18, 21, 24, 34, 37, 40], we aim to mine samples which improve robustness. For example, Wu et al. [37] proposed a distance weighted sampling method. With loss functions [3, 7, 18, 23, 33, 36, 37], we aim to fully use the data in a mini-batch to learn a discriminative boundary among classes. For example, the N-pair loss [23] takes advantage of the whole training batch. There are also works on synthesized negative samples. In [16], a proxy is generated for each class, which represents a tight upper bound of the class. However, this is different from our generating hard samples simultaneously in the training procedure, which better uncovers the potential of easy negative samples. Inspired by the central limit theorem and recent works [2, 12, 14, 29], we began to consider the invariance shared among classes.

In [4], intra-class variance is also modeled with an isotropic Gaussian, but under the assumption that each class shares the same prior probability and covariance, which aims to tackle the imbalance among samples in long-tailed datasets. The core distinction is that we disentangle intra-class variance and class centers. In [4], only the conditional probability of belonging to a certain class given the input image is learned. Instead, DVML is the combination of a discriminative model and a generative model, where the former outputs class centers and the latter fits the intra-class variance. Our DVML is able to boost current metric learning methods by disentangling intra-class variance and class centers, and by generating potentially hard positive samples.

Fig. 2. (Best viewed when zoomed in.) Our proposed DVML framework. Taking the output of a backbone feature extractor as input, the following layers consist of two parts. The upper part models intra-class variance and only works in the training procedure. The third fully-connected layer following the feature extractor learns the intra-class invariant features \(\mathbf {z}_I\), namely the class centers, which are also the output of our model. The generator takes as inputs the class centers \(\mathbf {z}_I\) and the features sampled from the learned distribution \(\mathcal {N}(\mathbf {z}_V;\varvec{\mu }^{(i)},\varvec{\sigma }^{2(i)}\mathbf {I})\), and then outputs their element-wise sum as synthesized discriminative samples. To reduce computation cost, we reconstruct the 1024-dimensional features output by the backbone feature extractor instead of the whole images.

3 Proposed Approach

In conventional metric learning methods, intra-class variance and class centers are entangled, which imposes two limitations on the further improvement of metric learning algorithms:

  • Given a limited dataset with a large range of variance within classes, current metric learning methods easily overfit and lose discriminative power on unseen classes;

  • Without disentangling intra-class variance and class centers, current methods learn a metric by exploring the boundary among classes, which means numerous easy negative samples contribute little to the training procedure.

We explore a way around these two limitations with the proposed deep variational metric learning (DVML) framework. In this section, we first review current deep metric learning methods and introduce the variational inference for the intra-class variance distribution. After explaining discriminative sample generation, we give the whole picture of deep variational metric learning. Finally, we describe the implementation details.

3.1 Preliminaries

Most recent deep metric learning algorithms optimize an appropriate objective function L to obtain the parameters of a deep neural network F:

$$\begin{aligned} \phi _F = \text {arg}\min _{\phi _F}L_{\text {m}}(\phi _F; \mathbf {X}, F), \end{aligned}$$
(1)

Here \(\mathbf {X}\) represents the whole training set. In the training procedure, we usually construct mini-batches of training data, \(\mathbf {X}_b\). Based on different ways of constructing mini-batches, various types of objective functions are designed. There are mainly three types of methods to construct mini-batches: pair-based, triplet-based, and batch-based.

In pair-based mini-batch construction, a mini-batch consists of pairs of positive and negative samples, \(\mathbf {x}_p\) and \(\mathbf {x}_n\). In triplet-based mini-batch construction, a mini-batch consists of triplets. In a triplet, there are three samples, the negative \(\mathbf {x}_n\), the positive \(\mathbf {x}_p\), and the anchor \(\mathbf {x}_a\). The positive and the anchor have the same class label, and the negative is from other classes. In batch-based mini-batch construction, we know each sample’s class information. Many hard negative mining algorithms are also batch-based, for they usually have to leverage class information to mine hard pairs or triplets within a mini-batch.

With these mini-batch construction methods, most current objective functions aim to push negative samples away from positive ones. We utilize the following losses as our baseline methods.

Triplet-based, Triplet [18, 36]:

$$\begin{aligned} L_{\text {m}} = \sum _{i=1}^{N} \max {(\alpha + D(\mathbf {z}_{(a)}^{(i)}, \mathbf {z}_{(p)}^{(i)})^2 - D(\mathbf {z}_{(a)}^{(i)}, \mathbf {z}_{(n)}^{(i)})^2,0)}, \end{aligned}$$
(2)

where \(\mathbf {z}_{(p)}^{(i)}=F(\mathbf {x}_{(p)}^{(i)})\), \(\mathbf {z}_{(a)}^{(i)}=F(\mathbf {x}_{(a)}^{(i)})\), and \(\mathbf {z}_{(n)}^{(i)}=F(\mathbf {x}_{(n)}^{(i)})\). Here \(\mathbf {x}_{(p)}^{(i)}\), \(\mathbf {x}_{(n)}^{(i)}\), and \(\mathbf {x}_{(a)}^{(i)}\) denote the positive, the negative, and the anchor samples. N is the number of triplets. \(D(\cdot ,\cdot )\) is the distance between features embedded from image samples.
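For concreteness, the following is a minimal sketch of the triplet loss in Eq. (2), written in PyTorch rather than the authors' original Chainer code; the function name and the default margin value are our own choices.

```python
import torch

def triplet_loss(z_a, z_p, z_n, alpha=1.0):
    """Eq. (2): z_a, z_p, z_n are (N, d) embeddings of anchors, positives, negatives."""
    d_ap = (z_a - z_p).pow(2).sum(dim=1)   # squared distance D(z_a, z_p)^2
    d_an = (z_a - z_n).pow(2).sum(dim=1)   # squared distance D(z_a, z_n)^2
    return torch.clamp(alpha + d_ap - d_an, min=0).sum()
```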

Batch-based, N-pair [23]:

$$\begin{aligned} L_{\text {m}} = \frac{1}{N}\sum _{i=1}^N\log {(1+\sum _{j\ne i}\exp {(\mathbf {z}^{(i)T}\mathbf {z}^{(j)}_+-\mathbf {z}^{(i)T}\mathbf {z}^{(i)}_+)})}, \end{aligned}$$
(3)

where \(\mathbf {z}^{(i)}=F(\mathbf {x}^{(i)})\), and the batch consists of \(\mathbf {x}\) and \(\mathbf {x}_+\). Here \(\mathbf {x}^{(i)}\) and \(\mathbf {x}^{(j)}_+\) are from the same class only when \(i=j\).
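A minimal sketch of the N-pair loss in Eq. (3), again in PyTorch and not the authors' implementation; here `z` is assumed to hold one anchor per class and `z_pos` the corresponding positives.

```python
import torch

def n_pair_loss(z, z_pos):
    """Eq. (3): z, z_pos are (N, d); row i of z_pos is the positive of row i of z."""
    logits = z @ z_pos.t()                            # (N, N) entries z_i^T z_j+
    diag = logits.diag().unsqueeze(1)                 # z_i^T z_i+
    mask = ~torch.eye(z.size(0), dtype=torch.bool)    # exclude the j = i term
    inner = torch.exp(logits - diag)[mask].view(z.size(0), -1).sum(dim=1)
    return torch.log1p(inner).mean()
```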

Batch-based, Triplet\(_2\) with Distance Weighted Sampling [37]:

$$\begin{aligned} L_{\text {m}} = \sum _{i=1}^{N} \max {(\alpha + D(\mathbf {z}_{(a)}^{(i)}, \mathbf {z}_{(p)}^{(i)}) - D(\mathbf {z}_{(a)}^{(i)}, \mathbf {z}_{(n)}^{(i)}),0)}, \end{aligned}$$
(4)

where the original image samples of \(\mathbf {z}_{(a)}^{(i)}\), \(\mathbf {z}_{(p)}^{(i)}\), and \(\mathbf {z}_{(n)}^{(i)}\) are determined by distance weighted sampling over a mini-batch. According to [37], this loss performs better than the aforementioned triplet loss.
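A simplified sketch of distance weighted sampling [37]: negatives are drawn with probability proportional to the clipped inverse of the density of pairwise distances on the unit hypersphere. The cut-off and clip values, and the assumption that every anchor has at least one negative in the batch, are ours; this is illustrative rather than the reference implementation.

```python
import torch

def distance_weighted_negatives(z, labels, cutoff=0.5, clip=100.0):
    """z: (B, d) L2-normalized embeddings; labels: (B,). Returns one negative index per anchor."""
    B, d = z.shape
    dist = torch.cdist(z, z).clamp(min=cutoff, max=1.99)       # avoid degenerate distances
    # log q(dist) = (d-2) log dist + (d-3)/2 * log(1 - dist^2 / 4) on the unit hypersphere
    log_q = (d - 2.0) * dist.log() + 0.5 * (d - 3.0) * torch.log(1.0 - 0.25 * dist.pow(2))
    weights = torch.exp(-log_q).clamp(max=clip)                 # clipped inverse-density weights
    weights[labels.unsqueeze(1) == labels.unsqueeze(0)] = 0.0   # never sample same-class entries
    return torch.multinomial(weights, 1).squeeze(1)
```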

3.2 Variational Inference for Intra-Class Variance

With our insight that the distribution of intra-class variance is inter-class invariant, we can disentangle intra-class variance and intra-class invariance. Therefore our model can more explicitly learn appropriate class centers and is naturally robust to a large range of intra-class variance.

As it is hard to directly represent intra-class variance without extra annotations of pose, view point, illumination, etc., we resort to the setting of generative models. It is natural to expect that the original image can be reconstructed from the sum of good intra-class variance and the class center.

To be specific, given \(\mathbf {X}={({\mathbf {x}^{(1)}, \cdots , \mathbf {x}^{(N)}})}\) as the dataset, consisting of N i.i.d. images from M classes, we assume the data are generated by a random process involving an unobserved continuous random variable \(\mathbf {z}=\mathbf {z}_V+\mathbf {z}_{I_k}\), which is actually the embedding feature of a given sample. The process consists of three steps: (1) a value \(\mathbf {z}^{(i)}_{V}\), the intra-class variance of sample i from class k, is generated from some prior distribution \(\mathbf {p}_\theta ^*({\mathbf {z}})\); (2) \(\mathbf {z}^{(i)}_{I_k}\), the intra-class invariance of sample i from class k, is added, so that \(\mathbf {z}^{(i)}\) equals the sum of \(\mathbf {z}^{(i)}_{V}\) and \(\mathbf {z}^{(i)}_{I_k}\); (3) an image \(\mathbf {x}^{(i)}\) is generated from some conditional distribution \(\mathbf {p}_\theta ^*(\mathbf {x|z})\).

Here, we assume that the prior \(\mathbf {p}_\theta ^*({\mathbf {z}})\) and the likelihood \(\mathbf {p}_\theta ^*(\mathbf {x|z})\) come from parametric families of distributions \(\mathbf {p}_\theta ({\mathbf {z}})\) and \(\mathbf {p}_\theta (\mathbf {x|z})\). Since we simultaneously learn the intra-class variance and intra-class invariance, for sample i from class k its \(\mathbf {z}^{(i)}_{I_k}\) is deterministic. Therefore, all the distributions over \(\mathbf {z}\) can be taken as distributions over \(\mathbf {z}_{V}\). Using a Monte Carlo estimator similar to the VAE [12], we can obtain the approximated loss for modeling the intra-class variance,

$$\begin{aligned} L(\mathbf {\theta },\phi ;\mathbf {x}^{(i)}){\approx }-D_{KL}(q_{\phi }(\mathbf {z}_V|\mathbf {x}^{(i)})||p_\theta (\mathbf {z}_V))+\frac{1}{L}\sum _{l=1}^{L}\log p_{\mathbf {\theta }}(\mathbf {x}^{(i)}|\mathbf {z}_V^{(i,l)}). \end{aligned}$$
(5)

Here we let the prior distribution of \(\mathbf {z}_{V}\) be the centered isotropic multivariate Gaussian \(\mathbf {p}_\theta ({\mathbf {z}_V})=\mathcal {N}(\mathbf {z}_V;\mathbf {0},\mathbf {I})\). For the approximate posterior, we let it be a multivariate Gaussian with a diagonal covariance:

$$\begin{aligned} \log q_{\phi }(\mathbf {z}_V|\mathbf {x}^{(i)})=\log \mathcal {N}(\mathbf {z}_V;\varvec{\mu }^{(i)},\varvec{\sigma }^{2(i)}\mathbf {I}). \end{aligned}$$
(6)

We use the outputs of fully-connected layers to approximate the mean and standard deviation of the posterior, \(\varvec{\mu }^{(i)}\) and \(\varvec{\sigma }^{(i)}\). With the reparameterization trick, we can finally obtain the first two terms of our objective:

$$\begin{aligned} \mathcal {L}(\mathbf {\theta },\phi ;\mathbf {X}_{b})\approx&\; \frac{1}{2B}\sum _{i=1}^{B}\sum _{j=1}^{J}\left( 1+\log (({\sigma }_j^{(i)})^2)-({\mu }_j^{(i)})^2-({\sigma }_j^{(i)})^2\right) \nonumber \\&+\frac{1}{TB}\sum _{i=1}^{B}\sum _{t=1}^{T}\log p_{\mathbf {\theta }}(\mathbf {x}^{(i)}|\mathbf {z}^{(i,t)}) \nonumber \\ \triangleq&\; L_1+L_2, \end{aligned}$$
(7)

where T is the number of generating iterations and B is the batch size of the mini-batch. \(L_1\) enforces the distribution of intra-class variance to be an isotropic centered Gaussian, and \(L_2\) ensures that the intra-class variance preserves sample-specific information. Derivation details are in the supplementary material.

Furthermore, for simplicity, in the training procedure we utilize the L2 distance instead of the original maximum likelihood estimation to handle the decoding term \(p_{\mathbf {\theta }}(\mathbf {x}^{(i)}|\mathbf {z}^{(i,t)})\), which gives us the simplified terms:

$$\begin{aligned} L_1=\frac{1}{2B}\sum _{i=1}^{B}\sum _{j=1}^{J}(1+\log (({\sigma }_j^{(i)})^2)-({\mu }_j^{(i)})^2-({\sigma }_j^{(i)})^2), \end{aligned}$$
(8)
$$\begin{aligned} L_2=\frac{1}{TB}\sum _{i=1}^{B}\sum _{t=1}^{T}||\mathbf {x}^{(i)}-\hat{\mathbf {x}}^{(i,t)}||_2. \end{aligned}$$
(9)

\(\mathbf {x}^{(i)}\) represents the original image samples, and \(\hat{\mathbf {x}}^{(i,t)}\) is the sample synthesized from the sum of the intra-class invariance features and the intra-class variance features sampled from the distribution \(\mathcal {N}(\mathbf {z}_V;\varvec{\mu }^{(i)},\varvec{\sigma }^{2(i)}\mathbf {I})\).
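The following is a minimal PyTorch-style sketch of the reparameterization trick and the two terms of Eqs. (8)-(9); it is not the authors' Chainer code. We write the KL term as a penalty to be minimized, i.e. the negative of Eq. (8), which is the usual convention when the overall objective of Eq. (12) is minimized.

```python
import torch

def sample_z_v(mu, log_var, T=20):
    """Reparameterization: draw T samples z_V ~ N(mu, sigma^2 I); returns (T, B, J)."""
    std = torch.exp(0.5 * log_var)
    eps = torch.randn(T, *mu.shape)
    return mu.unsqueeze(0) + std.unsqueeze(0) * eps

def kl_loss(mu, log_var):
    """KL(q(z_V|x) || N(0, I)) averaged over the batch; the negative of Eq. (8)."""
    return -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(dim=1).mean()

def recon_loss(x, x_hat):
    """Eq. (9): mean L2 distance between features x (B, D) and reconstructions x_hat (T, B, D)."""
    return (x.unsqueeze(0) - x_hat).norm(dim=2).mean()
```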

3.3 Discriminative Sample Generation

As we addressed previously, most current metric learning algorithms cannot uncover the full potential of easy samples. However, with the learned distribution of intra-class variance, we can generate potentially hard samples from easy negative samples by adding to their embedding features a bias term sampled from the distribution of intra-class variance.

Since we have learned an approximate conditional distribution of the intra-class variance, \(\mathcal {N}(\mathbf {z}_V;\varvec{\mu }^{(i)},\varvec{\sigma }^{2(i)}\mathbf {I})\), an idea naturally arises: we can also draw samples from this distribution to construct synthesized embedding features, and take them as inputs to the metric learning loss functions:

$$\begin{aligned} L_3=L_{\text {m}}(\hat{\mathbf {z}}), \end{aligned}$$
(10)

where \(\hat{\mathbf {z}}=\mathbf {z}_{I_k}+\hat{\mathbf {z}}_{V}\), and \(\hat{\mathbf {z}}_{V}\) denotes the sampled intra-class variance features. By \(\mathbf {z}_{I_k}\), we want to stress that different classes have different intra-class invariance, yet we do not explicitly compute class centers over classes.
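A minimal sketch of Eq. (10), reusing the samplers above: synthesized embeddings are the element-wise sum of the class-center features \(\mathbf {z}_I\) and sampled intra-class variance. Here `metric_loss` is assumed to be any batch-based loss taking (embeddings, labels); this interface is ours, not the authors' code.

```python
import torch

def generated_sample_loss(z_i, mu, log_var, labels, metric_loss, T=20):
    """Eq. (10): average metric loss over T batches of synthesized embeddings."""
    losses = []
    for _ in range(T):
        z_v = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # sampled intra-class variance
        z_hat = z_i + z_v                                           # synthesized embedding
        losses.append(metric_loss(z_hat, labels))
    return torch.stack(losses).mean()
```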

Recalling that the distribution of intra-class variance is independent of class, we can confidently conclude that these synthesized embedding features contain a larger range of intra-class variance than the original samples, and that training with them yields a more robust model. Here is a simple example: in class A, the original samples only contain view-point changes, and in class B, the original samples only contain illumination changes. Different from current models trained with only the original samples, our model is also robust to an unseen class which contains both view-point changes and illumination changes.

Table 1. Comparisons of clustering and retrieval performance (%) on the Cars196 dataset

3.4 Deep Variational Metric Learning

Finally, we have the whole picture of our proposed DVML framework. Besides the aforementioned three loss terms, our final objective also contains a constraint on the intra-class invariance:

$$\begin{aligned} L_4=L_{\text {m}}(\mathbf {z}_{I}), \end{aligned}$$
(11)

where \(\mathbf {z}_{I}\) denotes the intra-class invariance features, which are also the output of our model at test time. This term enforces the intra-class invariance, namely the class centers, to be discriminative. It is noticeable that we do not explicitly calculate class centers here; we call them class centers because we have disentangled this part from the intra-class variance. Therefore, our method is applicable to most current deep metric learning algorithms.

The final objective function is:

$$\begin{aligned} L = \lambda _1L_1+\lambda _2L_2+\lambda _3L_3+\lambda _4L_4. \end{aligned}$$
(12)

By simply applying sampling methods to both the original features and the synthesized features, or by replacing \(L_\text {m}\) with custom loss functions, we can combine our method with most current metric learning approaches, as sketched below.
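For illustration, a minimal sketch of the combined objective in Eq. (12), stitching together the helper functions sketched earlier in this section; the default weights follow the text, but the function names and the `decoder` interface are ours.

```python
def dvml_objective(x, z_i, mu, log_var, decoder, labels, metric_loss,
                   lambdas=(1.0, 1.0, 0.1, 1.0), T=20):
    """Eq. (12): weighted sum of L1..L4, using the sketches defined above."""
    z_v = sample_z_v(mu, log_var, T)            # (T, B, J) reparameterized samples
    x_hat = decoder(z_i.unsqueeze(0) + z_v)     # reconstruct backbone features
    l1 = kl_loss(mu, log_var)                   # distribution constraint
    l2 = recon_loss(x, x_hat)                   # sample-specific information
    l3 = generated_sample_loss(z_i, mu, log_var, labels, metric_loss, T)
    l4 = metric_loss(z_i, labels)               # discriminative class centers
    return sum(w * l for w, l in zip(lambdas, (l1, l2, l3, l4)))
```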

Here we want to highlight our contributions. First, to the best of our knowledge, this is the first work to disentangle intra-class variance and intra-class invariance, which makes it possible to explicitly learn appropriate class centers by simultaneously minimizing \(L_4\). Second, different from previous hard negative mining methods, which ignore numerous easy negative samples, we use the learned distribution of intra-class variance to generate discriminative samples that cover the possible intra-class variance over the whole training set. Note that our discriminative sample generation is entirely different from conventional data augmentation: we generate latent variables, namely the embedding features, during the training procedure, and more importantly, our synthesized samples carry the variance of the whole training set.

Table 2. Comparisons of clustering and retrieval performance (%) on the Stanford Online Products dataset

3.5 Implementation Details

We implement all the compared baseline methods and our methods in Chainer [28], with GoogLeNet [27] pre-trained on ILSVRC2012 [17] as the backbone for a fair comparison. Following standard data pre-processing, we first resize the images to \(256\times 256\), and then perform random cropping and horizontal mirroring for data augmentation. We add three parallel fully-connected layers after the average pooling layer of GoogLeNet, with the same output dimension, which is the required embedding size. Two of them are used to approximate \(\varvec{{\mu }}\) and \(\log {\varvec{\sigma }^2}\); the other outputs the intra-class invariance, namely the class centers. For the reconstruction part, due to the high cost of image reconstruction, we use the output features of GoogLeNet’s last average pooling layer as the reconstruction target. We use two fully-connected layers with output dimensions 512 and 1024, respectively, as the decoder network, with tanh as the activation function. We randomly initialize all the added fully-connected layers.
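The head and decoder described above can be sketched as follows, assuming a 1024-d backbone feature and a 512-d embedding. This is a PyTorch sketch for illustration (the paper uses Chainer), and the placement of the tanh activations in the decoder is our assumption.

```python
import torch.nn as nn

class DVMLHead(nn.Module):
    """Three parallel FC heads on top of the backbone feature, plus a feature decoder."""
    def __init__(self, feat_dim=1024, embed_dim=512):
        super().__init__()
        self.fc_mu = nn.Linear(feat_dim, embed_dim)        # mean of q(z_V | x)
        self.fc_log_var = nn.Linear(feat_dim, embed_dim)   # log variance of q(z_V | x)
        self.fc_center = nn.Linear(feat_dim, embed_dim)    # intra-class invariance z_I (test-time output)
        self.decoder = nn.Sequential(                      # reconstructs backbone features, not images
            nn.Linear(embed_dim, 512), nn.Tanh(),
            nn.Linear(512, feat_dim), nn.Tanh(),
        )

    def forward(self, feat):
        return self.fc_center(feat), self.fc_mu(feat), self.fc_log_var(feat)
```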

There are two phases in the training procedure. In the first phase, we cut off the back-propagation of gradients from the decoder network for the stability of the embedding part, and empirically set \(\lambda _1=1, \lambda _2=1, \lambda _3=0.1, \text {and}\ \lambda _4=1\). In the second phase, we release this constraint and empirically set \(\lambda _1=0.8, \lambda _2=1, \lambda _3=0.2,\, \text {and}\, \lambda _4=0.8\). As the experimental study in [25] showed that the embedding size does not largely affect performance, we follow [33] and fix the embedding size to 512 in all the experiments. We set the batch size to 128 for the pair-based and batch-based inputs and 120 for the triplet input. For the iterations of discriminative sample generation, we set \(T=20\) throughout the experiments. To optimize the objective, we use Adam [11] as the optimizer and set the learning rate to 0.0001.
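A compact configuration sketch of the two-phase schedule above; the `detach_decoder_input` flag is our reading of cutting off the decoder gradients in phase one, and the placeholder model only makes the snippet self-contained.

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 512)  # placeholder for the backbone plus DVML head
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

PHASES = [
    # phase 1: decoder gradients cut off, (lambda_1..4) = (1, 1, 0.1, 1)
    dict(lambdas=(1.0, 1.0, 0.1, 1.0), detach_decoder_input=True),
    # phase 2: gradients released, (lambda_1..4) = (0.8, 1, 0.2, 0.8)
    dict(lambdas=(0.8, 1.0, 0.2, 0.8), detach_decoder_input=False),
]
```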

Table 3. Comparisons of clustering and retrieval performance (%) on the CUB-200-2011 dataset
Table 4. Average p-value and relative deviation (%) on three datasets

4 Experiments

To demonstrate the effectiveness of our DVML, we conduct experiments on three widely-used datasets for both retrieval and clustering tasks.

4.1 Settings

We follow [24, 25, 33] to split the training and testing set in a zero-shot manner for all the datasets.

  • The CUB-200-2011 dataset [31] contains 11,788 images from 200 bird species. We take the first 100 classes with 5,864 images for training, and the rest 100 classes with 5,924 images for testing.

  • The Cars196 dataset [13] contains 16,185 images of 196 car types. We take the first 98 classes with 8,054 images for training, and the rest 98 classes with 8,131 images for testing.

  • The Stanford Online Products dataset [25] contains 120,053 images of 22,634 products. We take the first 11,318 classes with 59,551 images for training, and the rest 11,316 classes with 60,502 images for testing.

Fig. 3. (Best viewed when zoomed in.) Visualization of the proposed DVML+N-pair with Barnes-Hut t-SNE [30] on the CUB-200-2011 test set. The color of the bounding box for each image represents the label.

Fig. 4. (Best viewed when zoomed in.) Visualization of the proposed DVML+N-pair with Barnes-Hut t-SNE [30] on the Cars196 test set. The color of the bounding box for each image represents the label.

Fig. 5. (Best viewed when zoomed in.) Visualization of the proposed DVML+N-pair with Barnes-Hut t-SNE [30] on the Stanford Online Products test set. The color of the bounding box for each image represents the label.

In the retrieval task, we calculate the percentage of test samples that have at least one sample from the same class among their R nearest neighbors. In the clustering task, we report the NMI [25] and \(\text {F}_1\) [25] scores. For NMI, the input is a set of clusters \(\Omega =\{\omega _1, \cdots , \omega _K \}\) and the ground-truth classes \(\mathbb {C}=\{c_1, \cdots , c_K\}\), where \(\omega _i\) is the set of samples assigned to the ith cluster and \(c_j\) is the set of samples with ground-truth label j. NMI is the ratio of the mutual information to the mean entropy of the clusters and the ground truth: \(\text {NMI}(\Omega ,\mathbb {C})=\frac{2I(\Omega ;\mathbb {C})}{H(\Omega )+H(\mathbb {C})}\). The \(\text {F}_1\) score is defined as the harmonic mean of precision and recall: \(\text {F}_1=\frac{2PR}{P+R}.\)
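The following is a minimal sketch of how Recall@R and NMI can be computed with NumPy and scikit-learn; this is a common evaluation recipe under the protocol above, not the authors' evaluation code, and the F1 score is omitted for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def recall_at_r(emb, labels, r):
    """Fraction of queries with at least one same-class sample among their R nearest neighbors."""
    d = np.linalg.norm(emb[:, None] - emb[None, :], axis=2)
    np.fill_diagonal(d, np.inf)                       # exclude the query itself
    nn_idx = np.argsort(d, axis=1)[:, :r]
    return np.mean([(labels[idx] == labels[i]).any() for i, idx in enumerate(nn_idx)])

def nmi(emb, labels):
    """Cluster the embeddings into as many clusters as classes and compare with the labels."""
    pred = KMeans(n_clusters=len(np.unique(labels))).fit_predict(emb)
    return normalized_mutual_info_score(labels, pred)
```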

4.2 Compared Methods

We apply our deep variational metric learning framework to the three aforementioned baseline methods: Triplet loss [36], N-pair loss [23], and Triplet\(_2\) loss with Distance Weighted Sampling [37]. We compare the performance of these baselines before and after applying our DVML framework to demonstrate its effectiveness. We also compare DVML with other widely used or state-of-the-art methods from two categories: sampling algorithms and modified loss functions. For loss functions, we compare with the widely used Contrastive loss [7], the Lifted-Structure loss [25], and the state-of-the-art Angular loss [33]. For sampling methods, we take the state-of-the-art HDC [40] and the upper-bound-generating method Proxy-NCA [16]; for these two, we report the most relevant results from their original papers. We re-implement all the other compared methods. In the re-implementation, we observe some differences from the results reported in the original papers, but this does not affect the fairness of the comparison.

4.3 Quantitative Results

Tables 1, 2, and 3 present the experimental results of our DVML and all compared methods on the Cars196, Stanford Online Products, and CUB-200-2011 datasets, respectively.

From the comparison with the baseline methods, we notice that our proposed DVML significantly improves their performance. It is notable that DVML significantly improves the N-pair loss, which has already gained success on the Cars196 and Stanford Online Products datasets; this further confirms that the limitation we stressed before does exist. On CUB-200-2011, the improvement from DVML is less significant than on the other two datasets. We suppose that this is due to the different nature of the datasets: in Cars196 and Stanford Online Products, the difficulty lies in a large range of intra-class variance, while in CUB-200-2011, the difficulty lies in localizing discriminative fine-grained regions.

In the comparison with other methods, we observe that on the Cars196 dataset, our DVML+Triplet\(_2\)+DWS achieves better performance than the previous state-of-the-art. On the other two datasets, our DVML also achieves comparable performance. It is noticeable that we take both sampling algorithms and modified loss functions as baselines and compete with the state-of-the-art in both categories, which further shows the effectiveness of our DVML.

To further verify our assumption, we apply the Kolmogorov-Smirnov test [22] to the central features learned with the Triplet loss [18, 36] and measure the p-values. The results in Table 4 show that the central features of the classes in every dataset probably obey an isotropic Gaussian, and that their distributions are probably similar because the deviation of the p-values across classes is small. We also measure the average relative deviation of features with and without DVML, and the result suggests that our DVML does remove intra-class variance from the output features and helps to explicitly learn class centers.
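A minimal sketch of this check using SciPy: for each class, center and standardize the features, then run a per-dimension Kolmogorov-Smirnov test against a standard normal and average the p-values. The exact test setup used in the paper is not detailed here, so this is illustrative only.

```python
import numpy as np
from scipy.stats import kstest

def class_gaussianity_pvalues(features, labels):
    """features: (N, d) array; labels: (N,). Returns the mean KS p-value per class."""
    pvals = []
    for c in np.unique(labels):
        f = features[labels == c]
        centered = f - f.mean(axis=0)                   # central features of class c
        z = centered / (centered.std(axis=0) + 1e-8)    # standardize each dimension
        p = [kstest(z[:, j], 'norm').pvalue for j in range(z.shape[1])]
        pvals.append(np.mean(p))
    return np.array(pvals)
```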

4.4 Qualitative Results

Using the well-known visualization method t-SNE [30], we first visualize the central latent space of the features learned with the N-pair loss [23] to illustrate our insight in Fig. 1.

Figures 3, 4, and 5 show the visualization of DVML+N-pair on the CUB-200-2011, Cars196, and Stanford Online Products datasets. The figures are best viewed when zoomed in. The color of the bounding box for each image represents its class label. Following [25, 33], we enlarge certain regions to highlight the discriminability of the learned features. The visualization explicitly shows that our proposed DVML learns a good metric which well preserves the distance relationships among classes, given a large range of intra-class variance.

5 Conclusion

In this paper, we have presented a novel and widely applicable framework: deep variational metric learning (DVML). We assume and illustrate that the distribution of intra-class variance is invariant among classes. To the best of our knowledge, this is the first work to disentangle intra-class variance via variational inference, and the first to leverage the distribution of intra-class variance to generate discriminative samples. We stress that with our DVML, current metric learning algorithms can be significantly improved. There are many directions for future work, including generating images conditioned on given classes and utilizing those generated images to further improve the robustness of metric learning models.