
1 Introduction

A common approach to extracting features for content-based image retrieval (CBIR) is to train a multi-class deep model on a large, fully supervised training set and then use the activations of various layers of the network to encode database images (which need not be drawn from the classes used to train the network). Early attempts at retrieval were based on cross-entropy loss. Triplet loss has also been used to train networks for image retrieval [4]. However, optimizing triplet loss is challenging because the level of relative similarity or dissimilarity in each training triplet determines how fast the network learns.

In this paper we study the use of center loss [14, 16] for image retrieval. Center loss reduces the distance of each data point to its class center. It is easier to train than triplet loss, and its performance does not depend on how the training data points (triplets) are selected. Combining it with a softmax loss prevents the embeddings from collapsing.

Our experiments show that for training datasets with a large number of classes but few images per class, center loss gives a significant improvement in retrieval.


2 Related Works

Some of the classical papers in image retrieval include [3, 7, 8, 13]. Most of the recent work is based on training CNN models [2, 5, 12]. Both [11] and [17] review these techniques.

[2] achieved large performance improvements by training the network on datasets related to the query. [9] showed that intermediate layers capture local patterns of objects, and that using them performs better for image retrieval than using the final layer output. Similarly, [15] uses the regional maximum activations of convolutions (R-MAC) for the same purpose. R-MAC uses a CNN to obtain a local descriptor of the image, which is then max-pooled over different regions of a rigid grid, normalized, whitened and sum-aggregated to give a compact output vector. [4] uses a similar process, but with region proposals instead of the rigid grid to define the aggregation regions.

Center loss was first used for face recognition by [16]. They update the centers per mini-batch based on the gradient of the center loss, and combine center loss with softmax loss for stability. [14] used a similar idea for few-shot learning, applying softmax over the center distances. Instead of updating the centers, they recalculate them per mini-batch from the image classes in the support set of that mini-batch, using episodic learning.

3 Our Algorithm

Our technique combines center loss with cross-entropy loss on a ResNet18 [6] based network, as shown in Fig. 1. Suppose there are K classes and that the \(k^{th}\) class has \(N_{k}\) images. Let \(f^{1}_{y_{i}}(x_{i})\) be the output of the pre-final layer obtained by passing the \(i^{th}\) image (\(x_{i}\)) with label \(y_{i}\) through the network. Similarly, let \(f^{2}_{y_{i}}(x_{i})\) be the output of the final FC layer, and let B be the number of images per batch.

Fig. 1. (a) Our algorithm (ResNet18 image from [1]). (b) The center loss computation block.

First, the training images are passed through a network pre-trained on ImageNet, giving the feature descriptor \(f^{1}_{y_{i}}(x_{i})\). Then the center \(c_{k}\) of the \(k^{th}\) class is computed as follows:

$$\begin{aligned} c_{k}=\frac{1}{N_{k}} \sum _{y_{i}=k}{f^{1}_{y_{i}}(x_{i})} \end{aligned}$$
(1)

Next, the squared Euclidean distance \(d_{ik}\) between the feature descriptor of each image and each class center \(c_{k}\) is calculated as follows:

$$\begin{aligned} d_{ik}=||{f^{1}_{y_{i}}(x_{i})-c_{k}}||^{2}_{2} \end{aligned}$$
(2)
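As an illustration of Eqs. 1 and 2, the following is a minimal pure-Python sketch, not the implementation used in the experiments. The function names `class_centers` and `squared_distances` and the toy 2-D features are assumptions introduced here; in practice the descriptors would be the 512-dimensional pre-final layer outputs.

```python
from collections import defaultdict


def class_centers(features, labels):
    """Eq. 1: mean feature vector of each class (hypothetical helper)."""
    sums = {}
    counts = defaultdict(int)
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
        sums[y] = [s + v for s, v in zip(sums[y], f)]
        counts[y] += 1
    return {k: [s / counts[k] for s in sums[k]] for k in sums}


def squared_distances(features, centers):
    """Eq. 2: D[i][k] is the squared Euclidean distance from feature i to center k."""
    classes = sorted(centers)  # column order of D follows sorted class ids
    return [
        [sum((v - c) ** 2 for v, c in zip(f, centers[k])) for k in classes]
        for f in features
    ]


# Toy data: two classes with two 2-D descriptors each (illustrative values only).
features = [[0.0, 0.0], [2.0, 0.0], [4.0, 4.0], [6.0, 4.0]]
labels = [0, 0, 1, 1]
centers = class_centers(features, labels)
D = squared_distances(features, centers)
```

Each row of `D` holds the distances from one image to every class center, so the entry in the column of the image's own class is expected to be the smallest once training has pulled the classes apart.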

Let D be the matrix with elements \(d_{ik}\). Each \(d_{ik}\) is inverted to give \(\frac{1}{D}\), so that it plays the role of the scores array that is the input to the loss layer in a standard cross-entropy model: a small distance to a center becomes a large score for that class. Let each row of \(\frac{1}{D}\) be denoted \(\frac{1}{d_{i}}\) and let the label corresponding to each row be \(y_{i}\). Finally, the values of \(\frac{1}{D}\) are passed into a cross-entropy loss function, which yields the center loss \(L_{c}\). This is combined with a standard cross-entropy loss \(L_{s}\) applied to the final fully connected layer, whose output size equals the number of classes. The total loss L can be expressed as:

$$\begin{aligned} L=L_{s}+L_{c} = -\sum _{i=1}^{B}{\log \frac{e^{W^{T}_{y_{i}}f^{2}_{y_{i}}(x_{i})+b_{y_{i}}}}{\sum _{j=1}^{K}e^{W^{T}_{j}f^{2}_{y_{i}}(x_{i})+b_{j}} } } -\sum _{i=1}^{B}{\log \frac{e^{W^{T}_{y_{i}}\frac{1}{d_{i}}+b_{y_{i}}}}{\sum _{j=1}^{K}e^{W^{T}_{j}\frac{1}{d_{i}}+b_{j}} } } \end{aligned}$$
(3)
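The loss described above can be sketched in pure Python as follows. This is a hedged illustration under stated assumptions rather than the code used in the experiments: the inverse distances are fed directly to the softmax as scores, following the textual description (the \(W\) and \(b\) terms written in the second sum of Eq. 3 are omitted), and a small constant `eps` is added before inversion to avoid division by zero. The function names, `eps` and the toy values are all introduced here for illustration.

```python
import math


def cross_entropy(scores, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[target]


def center_loss(D, labels, eps=1e-6):
    """L_c: cross-entropy on inverse squared distances, summed over the batch.

    eps is an assumption of this sketch; it guards against a zero distance.
    """
    total = 0.0
    for d_row, y in zip(D, labels):
        scores = [1.0 / (d + eps) for d in d_row]
        total += cross_entropy(scores, y)
    return total


def total_loss(logits, D, labels):
    """L = L_s + L_c, with L_s the cross-entropy on the final-layer outputs."""
    softmax_loss = sum(cross_entropy(row, y) for row, y in zip(logits, labels))
    return softmax_loss + center_loss(D, labels)


# Toy batch of two images and two classes (illustrative values only).
# Labels are column indices into each row of D and of logits.
D = [[1.0, 41.0], [41.0, 1.0]]
logits = [[2.0, 0.5], [0.1, 1.5]]
labels = [0, 1]
loss = total_loss(logits, D, labels)
```

Because the score of a class grows as the distance to its center shrinks, the center term is smaller when each image is closest to the center of its own class than when the labels are swapped.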

This is similar to the loss in [16], except that we replace the squared Euclidean center loss with a cross-entropy function applied to this distance, as in [14]. Unlike [14], we use the inverse of the distance rather than its negative. Applying the cross-entropy function to the squared Euclidean distance helps to remove the instability of the center loss. At the end of each epoch we use Eq. 1 to recompute the centers globally over the entire dataset. In contrast, [16] uses an update formula to update the centers and [14] recomputes them, but both do so only at the mini-batch level, not globally.

4 Dataset

Google Landmark [10] has 14951 classes with about 1 million images in the original training set. We split this into a training set consisting of the first 8951 classes and a query set containing the remaining 6000 classes, so the training and query partitions have no classes in common. Since each query class must have at least 2 images (one as the query and the other for the retrieval/index set), classes containing only one image are not used. We take a maximum of 10 images per class. The final split has 8951 training classes with 72244 images, 5943 query classes with 1 query image per class, and an index set of 42709 images from these 5943 classes.
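The construction of the query and index sets can be sketched as below. The text does not say which images are kept when a class has more than 10, or which image of a class serves as the query, so this sketch assumes the first 10 images are kept and the first of them is the query. The function name `build_query_and_index` and the toy data are hypothetical.

```python
def build_query_and_index(images_by_class, max_per_class=10):
    """Return (queries, index) as lists of (class_id, image_id) pairs.

    Assumption of this sketch: keep the first max_per_class images of a
    class and use the first kept image as its query.
    """
    queries, index = [], []
    for class_id in sorted(images_by_class):
        images = images_by_class[class_id][:max_per_class]
        if len(images) < 2:
            continue  # a class needs one query image and at least one index image
        queries.append((class_id, images[0]))
        index.extend((class_id, img) for img in images[1:])
    return queries, index


# Toy example: class "b" has a single image and is dropped,
# class "c" has 12 images and is capped at 10.
toy = {
    "a": ["a1", "a2", "a3"],
    "b": ["b1"],
    "c": ["c%d" % i for i in range(12)],
}
queries, index = build_query_and_index(toy)
```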

5 Results

We use ResNet models in PyTorch, pre-trained on ImageNet, as initialization. The size of the final layer is modified to match the number of classes in our training set, and the layer is initialized using Xavier uniform initialization. The output size of the pre-final layer is model dependent (512 for ResNet18) and determines the size of the feature descriptor for an image. For all networks, we used Adam optimization for training with a weight decay of 2e-4. The initial learning rate was set to 0.001 and a stepwise scheduler with a drop rate of 0.92 per epoch was used. We ran the experiments with a batch size of 224.
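As a small worked example of the schedule, the sketch below assumes that "drop rate of 0.92 per epoch" means the learning rate is multiplied by 0.92 after every epoch; the function name `learning_rate` is introduced here for illustration.

```python
def learning_rate(epoch, initial_lr=0.001, decay=0.92):
    """Learning rate at a given epoch, assuming a multiplicative decay
    of 0.92 applied once per epoch (an interpretation of the text)."""
    return initial_lr * decay ** epoch


# Learning rates for the first five epochs under this assumption.
schedule = [learning_rate(e) for e in range(5)]
```

Under this reading the learning rate falls to 0.00092 after the first epoch and keeps shrinking geometrically.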

Mean average precision (mAP) was used as the evaluation criterion. For the Google Landmark dataset, given a query image, all other images from the same class are correct retrieval results and images from other classes are incorrect retrieval results.
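The evaluation criterion can be illustrated with a short sketch. It assumes that each query produces a full ranking of the index set, so that average precision is normalized by the number of same-class images found in the ranking. The function names and the toy rankings are assumptions made for illustration.

```python
def average_precision(ranked_labels, query_label):
    """AP for one query, given the class labels of the ranked index images."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:  # same class as the query: correct result
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """rankings: list of (query_label, ranked_labels) pairs, one per query."""
    aps = [average_precision(ranked, q) for q, ranked in rankings]
    return sum(aps) / len(aps)


# Toy example with two queries (illustrative labels only).
rankings = [
    ("a", ["a", "b", "a"]),  # correct results at ranks 1 and 3: AP = (1 + 2/3) / 2
    ("a", ["b", "a"]),       # correct result at rank 2: AP = 1/2
]
score = mean_average_precision(rankings)
```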

Table 1. Comparative study of mAP scores for different losses using different models. The model fine-tuned using both cross-entropy loss and center loss performs better than the one using only cross-entropy loss.

Fig. 2. t-SNE scatter plots for 10 random classes with 500 images from each class: (a) model pre-trained on ImageNet, (b) model fine-tuned with cross-entropy loss only, (c) model fine-tuned with cross-entropy loss and center loss. Center + cross-entropy loss clusters better than cross-entropy loss alone, and both cluster better than the model that is only pre-trained on ImageNet. Specifically, in (b) classes 0, 1 and 6 are split into 2 groups with other classes in between; this is not observed in (c).

Table 1 shows that when the training dataset has few (\(\le 10\)) images per class, center loss leads to an improvement. To understand the performance of the center-loss-based network, we conducted a t-SNE analysis of all three models in Table 1, as shown in Fig. 2.

One main difference from previous work is that we train on a very different data distribution, with a huge number of classes and few images per class. Unfortunately, there are no previous results from models trained on a data distribution similar to the partial Google Landmarks set that we could use for comparison.

6 Conclusion

We explored the effect of center loss training on image retrieval applications. A combination of center loss and cross-entropy loss performs better than using either cross-entropy loss or center loss alone. In addition, computing the center loss by applying cross-entropy to the center distances, instead of using the squared Euclidean distance directly, stabilizes the center loss network. Any of the earlier techniques, including VLAD encoding of intermediate layers and R-MAC, can be used on top of this network for better results. A center-loss-based network is most useful when the training dataset has a large number of classes with few images per class. In the future, we plan to apply the model to other applications such as clustering and few-shot learning.