1 Introduction

Image retrieval has long been a fundamental research topic in computer vision. Given a query image, it aims to find the relevant images in a database. Precision and efficiency are two key aspects of a retrieval system, and these two aspects drive image retrieval research in two directions.

One direction is to design or learn a more effective image representation for higher retrieval precision [2,3,4,5, 11, 18, 28, 30, 31, 40]. A good image representation keeps a large distance between irrelevant images in the feature space and a small distance between relevant images. Traditional image retrieval systems generate the image representation by aggregating hand-crafted local features such as SIFT, and the research focuses on designing a more effective aggregation method [18, 30, 31]. With the progress of deep learning, convolutional neural networks provide an effective image representation [2, 3, 28, 38, 41], which is trained with semantic information and is robust to low-level image transformations.

On the other hand, to achieve satisfactory efficiency in image retrieval, especially when dealing with a large-scale dataset, a compact image representation is necessary. Generally speaking, there are two types of schemes to obtain a compact image representation: hashing and quantization. Hashing maps real-valued vectors into binary codes, which enables faster distance computation and a lower memory cost. One of the most widely used hashing methods is locality-sensitive hashing (LSH) [7]. Nevertheless, LSH is data-independent; it ignores the data distribution and is sub-optimal for a specific dataset. To further improve the performance, some hashing methods [10, 32, 36] learn the projection from the data, which caters better to a specific dataset and achieves higher retrieval precision. Traditional hashing methods are based on off-the-shelf visual features; they optimize the feature extraction and the Hamming embedding independently. More recently, inspired by the progress of deep learning, several deep hashing methods [22, 23, 25, 37] have been proposed, which conduct the feature learning and compression simultaneously through a unified network.

Nevertheless, hashing methods can only produce a few distinct distances, which limits their capability of describing the distance between data points. In parallel to hashing, another widely used data compression method in image retrieval is product quantization. It represents each feature vector by a Cartesian product of several codewords. Thanks to the asymmetric distance calculation mechanism, it enables a more accurate distance calculation than hashing methods using the same code length. Product quantization (PQ) [17] and its optimized versions such as OPQ [9], CKmeans [29], AQ [1] and CQ [12, 43] are originally designed for an unsupervised scenario where no labeled data are provided. SQ [35] extends product quantization to the supervised scenario. However, SQ is based on hand-crafted features or CNN features from a pretrained model, and therefore it might not be optimal for a specific dataset.

To simultaneously optimize the feature learning and the quantization, Cao et al. [6] propose a deep quantization network (DQN) which can be trained in an end-to-end manner. It optimizes a weighted sum of a similarity-preserving loss and a product quantization loss, and it iteratively updates the codewords and the other parameters of the neural network. In each iteration, however, the codewords are updated directly by k-means, so the label information is ignored. Recently, Klein et al. [20] propose deep product quantization (DPQ). They learn a cascade of two fully-connected layers followed by a softmax layer to determine a soft codeword assignment. Different from the original product quantization, the codeword assignment is no longer determined by the distance between the original feature and the codewords. Nevertheless, the additional parameters introduced by the cascade of fully-connected layers make DPQ more prone to over-fitting.

In this paper, we also attempt to incorporate product quantization in a neural network and train it in an end-to-end manner. We propose a soft product quantization layer which is differentiable, and the original product quantization is a special case of the proposed soft product quantization when \(\alpha \rightarrow +\infty \). Different from DPQ, we no longer need fully-connected layers to obtain the codeword assignment; instead, in our method, the codeword assignment is determined by the similarity between the original feature and the codewords. Therefore, we significantly reduce the number of parameters to be trained, making our PQN more immune to over-fitting than DPQ. Meanwhile, inspired by the success of the triplet loss in metric learning and the triumph of the asymmetric similarity measurement in feature compression, we propose a novel asymmetric triplet loss to directly optimize the asymmetric similarity measurement in an end-to-end manner. In summary, the contribution of our work is three-fold:

  • We introduce a novel soft product quantization layer, which is a generalized version of the original product quantization. It is differentiable and thus brings an end-to-end training of the product quantization network.

  • We propose a novel asymmetric triplet loss, which directly optimizes the asymmetric distance based on product quantization. It enables a more effective training of the convolutional neural network.

  • Due to its simplicity, effectiveness and efficiency, we provide the image retrieval community with a strong baseline. More sophisticated image retrieval methods can be further built upon the proposed framework.

2 Related Work

Hashing [7, 10, 13,14,15, 22, 23, 25, 32, 36, 37, 39] aims to map a feature vector into a short code consisting of a sequence of bits, which enables fast distance computation as well as a light memory cost. Traditional hashing methods such as locality-sensitive hashing (LSH) [7], spectral hashing (SH) [36] and iterative quantization (ITQ) [10] first obtain real-valued image features and then compress the features into binary codes. They conduct the representation learning and the feature compression separately, and the mutual influence between them is ignored. Recently, motivated by the success of deep learning, several works [22, 23, 25, 37] propose deep hashing methods that incorporate hashing as a layer into a deep neural network. The end-to-end training mechanism of deep hashing simultaneously optimizes the representation learning and the feature compression, achieving better performance than traditional hashing methods.

Since hashing methods can only produce a few distinct distances, they have limited capability of describing the distance between data points. In parallel, another scheme termed product quantization (PQ) [17] decomposes the space into a Cartesian product of subspaces and quantizes each subspace individually. Several follow-up works [1, 9, 43] further optimize product quantization by reducing the distortion errors and achieve higher retrieval precision. Note that product quantization and its optimized versions such as OPQ [9], AQ [1] and CQ [43] are originally designed for an unsupervised scenario where no label information is provided.

Wang et al. [35] propose supervised quantization (SQ) by exploiting the label information. Nevertheless, SQ conducts feature extraction and quantization separately, so the interaction between these two steps is ignored. To simultaneously learn the image representation and the product quantization, the deep quantization network (DQN) [6] adds a fully-connected bottleneck layer to the convolutional network. It optimizes a combined loss consisting of a similarity-preserving loss and a product quantization loss. Nevertheless, the codebook in DQN is trained through k-means clustering, so the supervised information is ignored. Recently, deep product quantization (DPQ) [20] has been proposed, where the codebook as well as the other parameters are learned in an end-to-end manner. Different from the original product quantization, which determines the codeword assignment according to the distance between the original feature and the codewords, DPQ determines the codeword assignment through fully-connected layers whose parameters are learned from data. Nevertheless, the additional parameters in the cascade of fully-connected layers make the network more prone to over-fitting.

Our work is also an attempt to incorporate product quantization in a neural network. We propose a soft product quantization layer and build our product quantization network (PQN), which can be trained in an end-to-end manner. Different from DPQ, our PQN determines the codeword assignment according to the similarity between the feature to be coded and the codewords, which can be seen as a soft extension of the original product quantization. Unlike DPQ, we do not need additional fully-connected layers to determine the codeword assignment, and the only parameters in our soft product quantization layer are the codewords. Therefore, the number of parameters in our quantization layer is considerably smaller than that of DPQ, which mitigates over-fitting. As shown in the experiments, our PQN consistently outperforms DPQ by a large margin (Fig. 1).

3 Product Quantization Network

Fig. 1. The overview of the proposed product quantization network, where CNN represents the convolutional neural network and SPQ represents the proposed soft product quantization layer. The asymmetric triplet loss takes as input a triplet consisting of the CNN feature of an anchor image (I), the SPQ feature of a positive sample (\(I_{+}\)) and the SPQ feature of a negative sample (\(I_{-}\)).

3.1 From Hard Quantization to Soft Quantization

Let us denote by \(\mathbf {x} \in \mathbb {R}^d\) the feature of an image I. We divide the feature \(\mathbf {x}\) into M subvectors \( [\mathbf {x}_1,\cdots ,\mathbf {x}_m,\cdots ,\mathbf {x}_M]\) in the feature space, where \(\mathbf {x}_m \in \mathbb {R}^{d/M}\) is a subvector. Product quantization then approximates \(\mathbf {x}\) by

$$\begin{aligned} \mathbf {q} = [{q}_{1}(\mathbf {x}_1),\cdots ,{q}_{m}(\mathbf {x}_m),\cdots ,{q}_{M}(\mathbf {x}_M)], \end{aligned}$$
(1)

where \({q}_{m}(\cdot ) \) is the quantizer for \(\mathbf {x}_m\) defined as

$$\begin{aligned} {q}_{m}(\mathbf {x}_m) = \sum _{k} \mathbbm {1} (k = k^*) \mathbf {c}_{mk}, \end{aligned}$$
(2)

where \(k^* = \underset{k}{\mathrm {argmin}}~ \Vert \mathbf {c}_{mk} - \mathbf {x}_m \Vert _2\), \(\mathbbm {1}(\cdot )\) is the indicator function and \(\mathbf {c}_{mk}\) is the k-th codeword of the m-th codebook. The hard assignment is not differentiable and thus cannot be incorporated into a neural network. This limitation motivates us to replace the hard assignment \(\mathbbm {1} (k = k^*)\) by the soft assignment \({e^{-\alpha \Vert \mathbf {x}_m - \mathbf {c}_{mk} \Vert _2^2}}/{\sum _{k'} e^{-\alpha \Vert \mathbf {x}_m - \mathbf {c}_{mk'} \Vert _2^2} } \) and obtain

$$\begin{aligned} \mathbf {s} = [{s}_1(\mathbf {x}_1),\cdots ,{s}_m(\mathbf {x}_m),\cdots ,{s}_M(\mathbf {x}_M)], \end{aligned}$$
(3)

where \({s}_m(\cdot )\) is the soft quantizer for m-th subvector defined as

$$\begin{aligned} {s}_m(\mathbf {x}_m) =\sum _{k} \frac{e^{-\alpha \Vert \mathbf {x}_m - \mathbf {c}_{mk} \Vert _2^2} \mathbf {c}_{mk}}{\sum _{k'} e^{-\alpha \Vert \mathbf {x}_m - \mathbf {c}_{mk'} \Vert _2^2} }. \end{aligned}$$
(4)

It is not difficult to observe that

$$\begin{aligned} \mathbbm {1} (k = k^*) = \lim _{\alpha \rightarrow +\infty } \frac{e^{-\alpha \Vert \mathbf {x}_m - \mathbf {c}_{mk} \Vert _2^2}}{\sum _{k'} e^{-\alpha \Vert \mathbf {x}_m - \mathbf {c}_{mk'} \Vert _2^2} } \end{aligned}$$
(5)

Therefore, when \(\alpha \rightarrow +\infty \), the soft quantizer \( {s}_m(\mathbf {x}_m)\) becomes equivalent to the hard quantizer \(q_m(\mathbf {x}_m)\). Since the soft quantization operation is differentiable, it can be readily incorporated into a network as a layer.
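
To make the relation between the hard quantizer in Eq. (2) and the soft quantizer in Eq. (4) concrete, the following minimal NumPy sketch compares the two on a random subvector; the array shapes and the values of \(\alpha\) are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d_sub = 8, 16                       # K codewords per codebook, subvector dimension
x_m = rng.standard_normal(d_sub)       # one subvector x_m
C_m = rng.standard_normal((K, d_sub))  # the m-th codebook {c_mk}

# Hard quantizer q_m(x_m), Eq. (2): pick the nearest codeword.
dist2 = ((x_m - C_m) ** 2).sum(axis=1)        # ||x_m - c_mk||_2^2 for all k
q_m = C_m[dist2.argmin()]

# Soft quantizer s_m(x_m), Eq. (4): softmax-weighted combination of codewords.
def soft_quantize(x, C, alpha):
    d2 = ((x - C) ** 2).sum(axis=1)
    w = np.exp(-alpha * (d2 - d2.min()))      # subtracting the min does not change the ratio
    w /= w.sum()
    return w @ C

# As alpha grows, the soft quantizer approaches the hard quantizer (Eq. (5)).
for alpha in [1.0, 10.0, 100.0]:
    print(alpha, np.linalg.norm(soft_quantize(x_m, C_m, alpha) - q_m))
```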

3.2 Soft Product Quantization Layer

Before we conduct soft product quantization in the network, we first pre-process the original feature \(\mathbf {x} = [\mathbf {x}_1,\cdots ,\mathbf {x}_m,\cdots ,\mathbf {x}_M]\) through intra-normalization and conduct \(\ell _2\)-normalization on codewords \(\{\mathbf {c}_{mk}\}_{m=1,k=1}^{M,K}\):

$$\begin{aligned} {\mathbf {x}}_m \leftarrow \mathbf {x}_{m}/{\Vert \mathbf {x}_m \Vert _2} \end{aligned}$$
(6)
$$\begin{aligned} \mathbf {c}_{mk} \leftarrow \mathbf {c}_{mk}/\Vert \mathbf {c}_{mk}\Vert _2 \end{aligned}$$
(7)

The pre-processing step is motivated by two reasons: (1) intra-normalization and \(\ell _2\)-normalization can balance the contribution of each sub-vector and each codeword; (2) it simplifies the gradient computation.

Forward Pass. After intra-normalization on original features and \(\ell _2\)-normalization on the codewords, we can obtain \(\Vert \mathbf {x}_m - \mathbf {c}_{mk} \Vert _2^2 = 2 - 2\langle \mathbf {x}_m, \mathbf {c}_{mk}\rangle \), where \(\langle \cdot , \cdot \rangle \) denotes the inner product between two vectors. Based on the above property, we can rewrite Eq. (4) into:

$$\begin{aligned} {s}_m(\mathbf {x}_m) =\sum _{k} \frac{e^{2\alpha \langle \mathbf {x}_m , \mathbf {c}_{mk} \rangle } \mathbf {c}_{mk}}{\sum _{k'} e^{ 2 \alpha \langle \mathbf {x}_m , \mathbf {c}_{mk'} \rangle } }. \end{aligned}$$
(8)
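
As a concrete illustration of the forward pass in Eq. (8), the sketch below implements the soft product quantization layer as a PyTorch module, including the intra-normalization of Eq. (6) and the codeword normalization of Eq. (7); the module name, tensor shapes and the default \(\alpha\) are our own choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftProductQuantization(nn.Module):
    """Soft product quantization layer (Eq. (8)); the codewords are its only parameters."""
    def __init__(self, M, K, d, alpha=20.0):
        super().__init__()
        assert d % M == 0
        self.M, self.K, self.d_sub, self.alpha = M, K, d // M, alpha
        # Codebooks {c_mk}, shape (M, K, d/M); in practice initialized by k-means (Sect. 3.3).
        self.codebooks = nn.Parameter(torch.randn(M, K, d // M))

    def forward(self, x):                        # x: (batch, d)
        b = x.size(0)
        x = x.view(b, self.M, self.d_sub)
        x = F.normalize(x, dim=-1)               # intra-normalization, Eq. (6)
        c = F.normalize(self.codebooks, dim=-1)  # codeword l2-normalization, Eq. (7)
        # Inner products <x_m, c_mk> for every subvector and codeword: (batch, M, K)
        logits = 2.0 * self.alpha * torch.einsum('bmd,mkd->bmk', x, c)
        a = F.softmax(logits, dim=-1)            # soft assignments a_mk, Eq. (9)
        s = torch.einsum('bmk,mkd->bmd', a, c)   # s_m(x_m) = sum_k a_mk c_mk, Eq. (10)
        return s.reshape(b, -1)                  # concatenated soft-quantized feature
```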

Backward Pass. To elaborate the backward pass of the soft quantization layer, we introduce an intermediate variable \({a}_{mk}\) defined as

$$\begin{aligned} {a}_{mk} = \frac{e^{2\alpha \langle \mathbf {x}_m , \mathbf {c}_{mk} \rangle } }{\sum _{k'} e^{ 2 \alpha \langle \mathbf {x}_m , \mathbf {c}_{mk'} \rangle } }. \end{aligned}$$
(9)

With this definition, Eq. (8) can be rewritten as

$$\begin{aligned} {s}_m(\mathbf {x}_m) = \sum _{k} {a}_{mk} \mathbf {c}_{mk}. \end{aligned}$$
(10)

Through the chain rule, we obtain the derivative of the loss with respect to \(\mathbf {c}_{mk}\) by

$$\begin{aligned} \frac{\partial L}{\partial \mathbf {c}_{mk}} = {a}_{mk} \frac{\partial L}{\partial s_m(\mathbf {x}_m)} + \sum _{k'}\frac{\partial a_{mk'}}{\partial \mathbf {c}_{mk} } (\frac{\partial s_m(\mathbf {x}_{m}) }{\partial {a}_{mk'}})^{\top } \frac{\partial L}{\partial s_m(\mathbf {x}_{m})}, \end{aligned}$$
(11)

where

$$\begin{aligned} \frac{\partial s_m(\mathbf {x}_{m}) }{\partial {a}_{mk'}} = \mathbf {c}_{mk'}, \end{aligned}$$
(12)

and

$$\begin{aligned} \frac{\partial a_{mk'}}{\partial \mathbf {c}_{mk} } = {\left\{ \begin{array}{ll} 2\alpha \, a_{mk}(1 - a_{mk})\, \mathbf {x}_m, &{} k' = k,\\ -2\alpha \, a_{mk'} a_{mk}\, \mathbf {x}_m, &{} k' \ne k. \end{array}\right. } \end{aligned}$$
(13)

By plugging Eqs. (12) and (13) into Eq. (11), we can obtain \(\frac{\partial L}{\partial \mathbf {c}_{mk}}\).
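
In practice, when the layer is implemented with differentiable tensor operations as in the sketch above, automatic differentiation reproduces the gradients in Eqs. (11)–(13) without hand-coding them. Below is a minimal sketch of verifying this with torch.autograd.gradcheck (in double precision, as the check requires); the shapes and \(\alpha\) are again illustrative.

```python
import torch
import torch.nn.functional as F

def soft_quantize(x_m, c_m, alpha=5.0):
    # x_m: (d_sub,), c_m: (K, d_sub); both assumed already l2-normalized.
    a = F.softmax(2.0 * alpha * (c_m @ x_m), dim=0)   # a_mk, Eq. (9)
    return a @ c_m                                    # s_m(x_m), Eq. (10)

K, d_sub = 4, 6
x_m = F.normalize(torch.randn(d_sub, dtype=torch.double), dim=0)
c_m = F.normalize(torch.randn(K, d_sub, dtype=torch.double), dim=1).requires_grad_(True)

# Compares autograd's gradient w.r.t. the codewords against finite differences.
assert torch.autograd.gradcheck(lambda c: soft_quantize(x_m, c), (c_m,))
```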

3.3 Initialization

We initialize the parameters of the convolutional layers by fine-tuning a standard convolutional neural network without quantization, e.g., AlexNet, on the specific dataset. Note that we add an intra-normalization layer when fine-tuning the network to make it compatible with our deep product quantization network. After the initialization of the convolutional layers, we extract the features from the fine-tuned network and conduct k-means followed by \(\ell _2\)-normalization to obtain the initialized codewords \(\{\mathbf {c}_{mk} \}_{k=1,m=1}^{K,M}\) in the soft product quantization layer.
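
A sketch of the codebook initialization described above, assuming the features have already been extracted from the fine-tuned network; we use scikit-learn's KMeans purely for illustration, and the function and variable names are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_codebooks(features, M, K):
    """features: (N, d) array extracted from the fine-tuned network."""
    N, d = features.shape
    subvectors = features.reshape(N, M, d // M)
    codebooks = np.empty((M, K, d // M))
    for m in range(M):
        # k-means on the m-th subvectors, then l2-normalize the centroids (Eq. (7)).
        km = KMeans(n_clusters=K, n_init=10).fit(subvectors[:, m, :])
        centers = km.cluster_centers_
        codebooks[m] = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    return codebooks
```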

3.4 Asymmetric Triplet Loss

We propose a novel asymmetric triplet loss to optimize the parameters of the network. We define \((I, I_+, I_-)\) as a training triplet, where \(I_+\) and \(I_-\) represent a relevant image and an irrelevant image with respect to the anchor image I, respectively. We denote by \(\mathbf {x}_I\) the feature of I before soft product quantization and by \(\mathbf {s}_{I_+}\) and \(\mathbf {s}_{I_-}\) the features of \(I_+\) and \(I_-\) after soft product quantization. We define the asymmetric similarity between I and \(I_+\) as \(\langle \mathbf {x}_I, \mathbf {s}_{I_+} \rangle \), where \(\langle \cdot , \cdot \rangle \) denotes the inner-product operation. The proposed asymmetric triplet loss is defined as

$$\begin{aligned} l = \langle \mathbf {x}_I, \mathbf {s}_{I_-} \rangle - \langle \mathbf {x}_I, \mathbf {s}_{I_+} \rangle . \end{aligned}$$
(14)

Intuitively, it aims to increase the asymmetric similarity between pairs of relevant images and decrease that between pairs of irrelevant images. It is a natural extension of the original triplet loss to the asymmetric distance setting. The difference is that a training triplet used in the original triplet loss consists of three features of the same type, whereas a training triplet used in the proposed asymmetric triplet loss consists of one feature without quantization and two features after quantization. In fact, our experiments show that a better performance is achieved by passing the above loss through a sigmoid function, and the revised loss function is defined in Eq. (15). The improvement might be attributed to the fact that the sigmoid function normalizes the original loss so that the training is not biased by a few samples causing a huge loss.

$$\begin{aligned} l = \frac{1}{ 1 + e^{\langle \mathbf {x}_I, \mathbf {s}_{I_+} \rangle - \langle \mathbf {x}_I, \mathbf {s}_{I_-} \rangle } }. \end{aligned}$$
(15)
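
A minimal PyTorch sketch of the sigmoid-smoothed asymmetric triplet loss in Eq. (15), assuming the anchor features come from the embedding layer (before SPQ) and the positive/negative features come from the soft product quantization layer; the function name and batch convention are ours.

```python
import torch

def asymmetric_triplet_loss(x_anchor, s_pos, s_neg):
    """Eq. (15): anchors are un-quantized, positives/negatives are soft-quantized.
    x_anchor, s_pos, s_neg: (batch, d) tensors."""
    sim_pos = (x_anchor * s_pos).sum(dim=1)   # <x_I, s_{I+}>
    sim_neg = (x_anchor * s_neg).sum(dim=1)   # <x_I, s_{I-}>
    return torch.sigmoid(sim_neg - sim_pos).mean()
```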

3.5 Encode and Retrieval

After training the proposed product quantization network, the reference images in the database are encoded by hard product quantization. We define the layer before the soft product quantization layer as the embedding layer. Given a reference image I in the database, we obtain its output from the embedding layer \(\mathbf {x} = [ \mathbf {x}_1,\cdots ,\mathbf {x}_m,\cdots ,\mathbf {x}_M]\) and further obtain its product quantization code \( \mathbf {b} = [b_1,\cdots , b_m,\cdots ,b_M]\), where \(b_m\) is computed by

$$\begin{aligned} b_m = \underset{k}{\mathrm {argmax}} \langle \mathbf {x}_m, \mathbf {c}_{mk} \rangle , \end{aligned}$$
(16)

where \(\{ \mathbf {c}_{mk}\}_{m=1,k=1}^{M,K}\) are codewords learned from our product quantization network. In the retrieval phase, we obtain the query feature from the embedding layer \(\mathbf {q} = [\mathbf {q}_1,\cdots ,\mathbf {q}_m,\cdots ,\mathbf {q}_M]\). The relevance between the query image and a reference image represented by its product quantization code \(\mathbf {b} = [b_1,\cdots ,b_m,\cdots ,b_M]\) is computed by the asymmetric similarity \(s(\mathbf {q}, \mathbf {b})\) defined as

$$\begin{aligned} s(\mathbf {q}, \mathbf {b}) = \sum _{m=1}^{M} \langle \mathbf {q}_m, \mathbf {c}_{mb_{m}} \rangle . \end{aligned}$$
(17)

Since \(\langle \mathbf {q}_m, \mathbf {c}_{mb_{m}} \rangle \) is computed only once for all the reference images in the database, obtaining \( s(\mathbf {q}, \mathbf {b})\) only requires summing up the pre-computed similarity scores in the look-up table, which considerably speeds up the image retrieval process. Meanwhile, storing the product quantization code \(\mathbf {b}\) only requires \(M\log _2 K\) bits, which considerably reduces the memory cost.
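
The following NumPy sketch illustrates the encoding in Eq. (16) and the look-up-table-based asymmetric similarity in Eq. (17); the function names and shapes are illustrative, and the codebooks are assumed to be the \(\ell_2\)-normalized ones learned by the network.

```python
import numpy as np

def encode(x, codebooks):
    """x: (M, d_sub) embedding-layer subvectors; codebooks: (M, K, d_sub).
    Returns the PQ code b with b_m = argmax_k <x_m, c_mk> (Eq. (16))."""
    return np.einsum('md,mkd->mk', x, codebooks).argmax(axis=1)   # shape (M,)

def asymmetric_similarity(q, codes, codebooks):
    """q: (M, d_sub) query subvectors; codes: (N, M) PQ codes of N database images.
    Returns s(q, b) for every database image (Eq. (17))."""
    lut = np.einsum('md,mkd->mk', q, codebooks)     # look-up table of <q_m, c_mk>
    M = codes.shape[1]
    return lut[np.arange(M), codes].sum(axis=1)     # sum the pre-computed scores
```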

3.6 Relation to Existing Methods

DQN [6] is the first attempt to incorporate product quantization in a neural network. It alternately optimizes the codewords and the other parameters of the network. It is worth noting that when updating the codewords, it only minimizes the quantization error through k-means. Therefore, when learning the codewords, the supervision information is ignored and the solution might be sub-optimal.

SUBIC [16] integrates a one-hot block encoding layer in the deep neural network. It represents each image by a product of one-hot blocks, following the spirit of product quantization. Nevertheless, the sparsity of this representation limits its capability, so it does not perform as well as ours.

DPQ [20] is another attempt to incorporate product quantization into a neural network. It determines the codeword assignment through a cascade of two fully-connected layers. In contrast, our method determines the codeword assignment according to the similarity between the original feature and the codewords. Note that the additional parameters of these two fully-connected layers in DPQ not only increase the computational complexity of training the neural network but also make it more prone to over-fitting. Our experiments show that the proposed PQN considerably outperforms DPQ.

4 Experiments

We evaluate the performance of our PQN on two public benchmark datasets, CIFAR-10 and NUS-WIDE. CIFAR-10 [21] is a dataset containing 60,000 color images in 10 classes, and each class has 6,000 images of size \(32\times 32\). Different from CIFAR-10, NUS-WIDE [8] is a dataset for evaluating multi-label classification, in which one sample is assigned to one or multiple labels. We follow the settings in [6, 22] and use the subset of 195,834 images that are associated with the 21 most frequent concepts, where each concept has at least 5,000 images. We resize all images to \(256\times 256\).

On the CIFAR-10 dataset, the performance reported by different baselines is based on different base convolutional neural networks, making it unfair to directly compare the reported retrieval accuracies. To make a fair comparison, we evaluate our method on two types of convolutional neural networks. The first is 3CNet, which is also used by SUBIC [16] and DPQ [20]. 3CNet is proposed in [25]; it consists of \(L = 3\) convolutional layers with 32, 32 and 64 filters of size \(5 \times 5\), respectively, followed by a fully-connected layer with \(d = 500\) nodes. The second convolutional neural network we choose is AlexNet. It is worth noting that the baselines we compare with may use different models. For example, DQN [6] adopts AlexNet whereas other works [23, 34] adopt the VGG-F model. These two models are similar in architecture: both CNN-F and AlexNet consist of five convolutional layers and two fully-connected layers. As shown in [19], CNN-F generally performs better than AlexNet in image retrieval; therefore, the better performance of our AlexNet-based method over existing state-of-the-art methods based on CNN-F is not owing to a better base network. In other words, our method achieves better performance even with an inferior base network. On the NUS-WIDE dataset, we also adopt AlexNet as our base model. On both datasets, we report the performance of the proposed method in terms of mAP, a standard metric for evaluating retrieval algorithms.

Fig. 2. The influence of parameters on the CIFAR-10 dataset using 3CNet.

4.1 CIFAR-10 Using 3CNet

Following the experimental setup in SUBIC [16] and DPQ [20], training is conducted on the 50K-image training set. The test set is split into 9K database images and 1K query images (100 per class).

Influence of M and K. In this section, we evaluate the influence of the number of subvectors M and the number of codewords per sub-codebook K on the retrieval precision of the proposed PQN. We vary M among \(\{1,2,4,8\}\) and vary \(B = \log _2 K\) among \(\{3,6,9,12\}\). As shown in Fig. 2(a), the proposed method achieves the best performance when \(M = 4\); by default, we set \(M=4\) on the CIFAR-10 dataset. Note that when \(M=1\) and \(M=2\), the performance of the proposed PQN increases as B increases. This is expected since a larger B partitions the feature space into finer cells. Nevertheless, when \(M = 4\), the performance drops when B increases from 9 to 12. Meanwhile, when \(M = 8\), there is also a performance drop when B increases from 6 to 9. The worse performance might be caused by over-fitting when both M and B are large.

Influence of \(\varvec{\alpha }\). \(\alpha \) controls the softness of the soft product quantization layer. We evaluate the performance of our method as \(\alpha \) varies. We test the influence of \(\alpha \) when \(M = 4\) and K varies among \(\{2^3,2^6, 2^9, 2^{12}\}\). As shown in Fig. 2(b), the performance of the proposed PQN is relatively stable when \(\alpha \) increases from 1 to 80. Note that when \(\alpha = 1\), the performance is slightly worse than that when \(\alpha = 5\). The worse performance is due to the fact that a small \(\alpha \) makes the quantization too soft, so that the soft quantization in the training phase differs too much from the hard quantization in the testing phase. Meanwhile, we also observe a performance drop when \(\alpha \) increases from 20 to 40. This drop might be caused by the fact that a huge \(\alpha \) tends to push the input of the softmax function into the saturation region and leads to gradient vanishing.

Comparison with Unsupervised PQ/LSQ. We compare with unsupervised PQ and LSQ [27] based on fine-tuned features trained with the triplet loss. As shown in Table 1, ours considerably outperforms both TL+PQ and TL+LSQ. Meanwhile, we also show the performance of the original features trained with the triplet loss without quantization (TL+Full) in Table 1. Our performance is even better than that of the features without quantization, which is owing to the regularization imposed by quantization, suppressing over-fitting.

Table 1. Comparisons with PQ and LSQ.

Effectiveness of Asymmetric Triplet Loss. To show the effectiveness of the proposed asymmetric triplet loss, we compare with two alternatives, the cross-entropy loss (CEL) and the triplet loss (TL). To make a fair comparison, we only change the loss function and keep the other parts of the network unchanged. As shown in Fig. 3, the proposed asymmetric triplet loss consistently outperforms the cross-entropy loss and the triplet loss when L varies among \(\{12,24,36,48\}\). For instance, when \(L=36\), our ATL achieves a 0.787 mAP whereas TL only achieves a 0.778 mAP and CEL only achieves a 0.768 mAP.

Comparison with State-of-the-art Methods. We compare our method with two state-of-the-art methods, SUBIC and DPQ, which adopt the same 3CNet and the same experimental settings. We vary the bit length L among \(\{12,24,36,48\}\), setting \(M=4\) and \(K = 8, 64, 512, 4096\), respectively. Since SUBIC adopts the cross-entropy loss, it would be unfair to compare it only against ours using the asymmetric triplet loss. Therefore, we report the performance of our PQN based on the cross-entropy loss (CEL) as well as the proposed asymmetric triplet loss (ATL). As shown in Table 2, our method based on both CEL and ATL significantly outperforms the existing state-of-the-art methods including SUBIC and DPQ. For instance, when \(L = 24\), ours achieves a 0.771 mAP based on the cross-entropy loss and a 0.782 mAP using the proposed asymmetric triplet loss, whereas SUBIC only achieves a 0.672 mAP and DPQ only achieves a 0.692 mAP.

Fig. 3. Comparisons of our asymmetric triplet loss (ATL) with the cross-entropy loss (CEL) and the triplet loss (TL).

Table 2. mAP comparisons with state-of-the-art methods using 3CNet.

4.2 CIFAR-10 Using AlexNet

Following the experimental settings in [23, 34], we randomly sample 1000 images per class (10000 images in total) as the test query set, and the remaining 50000 images are used as the training set as well as the reference images in the database. We set \(M=4\) and vary K among \(\{2^4, 2^6, 2^9, 2^{12}\}\), so that the code length L varies among \(\{16, 24, 36, 48\}\).

Comparisons with State-of-the-art Methods. As shown in Table 3, ours consistently outperforms the existing state-of-the-art methods, especially when the bit length is small. For instance, when the bit length is 16, our method based on the asymmetric triplet loss (Ours+ATL) achieves a 0.947 mAP whereas the second best method, DSDH, only achieves a 0.935 mAP.

Table 3. mAP comparisons with existing state-of-the-art methods using AlexNet base model on the CIFAR10 dataset.

Extremely Short Code Evaluation. As shown in Table 3, the mAP achieved by our method does not drop when the bit length decreases from 48 to 16. In contrast, the performance of all other methods in Table 3 degrades as the bit length decreases. To fully exploit the potential of the proposed product quantization network on the CIFAR-10 dataset, we evaluate it with an extremely small code length L. We vary M among 1, 2 and 4, and vary the code length (bit number) L within \(\{4,6,8,10,12\}\). As shown in Fig. 4, when the code length is extremely small, e.g., \(L = 4\), PQN with \(M=1\) significantly outperforms PQN with \(M=2,4\). Meanwhile, when \(M=1\), there is no significant performance drop when L decreases from 12 to 4. Note that when \(M=1\), the proposed PQN achieves a 0.945 mAP using only 4 bits per code. It considerably outperforms the existing state-of-the-art method DSDH [23], which only achieves a 0.935 mAP using 16 bits.

Fig. 4. Evaluation on extremely short codes.

Table 4. mAP comparisons with existing state-of-the-art methods using AlexNet base model on the NUS-WIDE dataset. The mAP is based on top 5000 nearest neighbors.

4.3 NUS-WIDE

Following the experimental setup in [23, 34], we randomly sample 100 images per class (2100 images in total) as the test query set and use the remaining images as database images. 500 database images per label are randomly sampled as training images. The mAP is calculated based on the top 5000 returned neighbors. Due to the multi-label setting, the cross-entropy loss used in SUBIC [16] and the softmax loss in DPQ [20] are no longer feasible, which explains why neither SUBIC [16] nor DPQ [20] conducts experiments on the NUS-WIDE dataset. Inspired by the success of the label embedding proposed in [23], we adopt a combined loss, which is a weighted sum of our asymmetric triplet loss and a mean square loss, defined as

$$\begin{aligned} l = \frac{1}{ 1 + e^{\langle \mathbf {x}_I, \mathbf {s}_{I_+} \rangle - \langle \mathbf {x}_I, \mathbf {s}_{I_-} \rangle } } + \beta \Vert \mathbf {W} \mathbf {s}_{I} - \mathbf {y}_{I} \Vert _2^2, \end{aligned}$$
(18)

where \(\mathbf {W}\) is the parameter of an additional FC layer after the soft product quantization layer and \(\mathbf {y}_{I}\) is the label of the sample I. We set \(\beta = 10\) by default.
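
A minimal PyTorch sketch of the combined loss in Eq. (18), reusing the asymmetric triplet term from Eq. (15); the linear layer, variable names and the multi-hot label convention are our own assumptions.

```python
import torch
import torch.nn as nn

class CombinedLoss(nn.Module):
    """Eq. (18): asymmetric triplet loss plus a label-embedding regression term."""
    def __init__(self, code_dim, num_labels, beta=10.0):
        super().__init__()
        self.fc = nn.Linear(code_dim, num_labels, bias=False)  # the additional FC layer W
        self.beta = beta

    def forward(self, x_anchor, s_anchor, s_pos, s_neg, y):
        # Asymmetric triplet term, Eq. (15).
        sim_pos = (x_anchor * s_pos).sum(dim=1)
        sim_neg = (x_anchor * s_neg).sum(dim=1)
        atl = torch.sigmoid(sim_neg - sim_pos).mean()
        # Label-embedding term ||W s_I - y_I||_2^2 with multi-hot labels y: (batch, num_labels).
        reg = ((self.fc(s_anchor) - y) ** 2).sum(dim=1).mean()
        return atl + self.beta * reg
```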

Comparisons with Existing State-of-the-art Methods. We compare our method with two types of baselines. The first type extracts features from a CNN and then converts the extracted features into binary codes. We directly copy the results reported in [23], which conducts experiments on several traditional hashing methods such as SH [32], ITQ [10], KSH [26] and SDH [33]. The baselines of the second type are deep hashing/quantization methods, where the binary codes are learned in an end-to-end manner; these include DQN [6], DPSH [24], DTSH [34] and DSDH [23]. As shown in Table 4, the proposed PQN consistently outperforms both types of baselines when the code length L varies among \(\{12,24,36,48\}\). The advantage of our PQN over other methods is more obvious when the code length L is short. For instance, when \(L=12\), our PQN achieves a 0.795 mAP whereas the second best method, FASTH+CNN [23], only achieves a 0.779 mAP.

5 Conclusion

In this paper, by incorporating product quantization in a neural network, we propose the product quantization network (PQN) to learn a discriminative and compact image representation in an end-to-end manner. Meanwhile, we propose a novel asymmetric triplet loss, which directly optimizes image retrieval based on the asymmetric distance and trains our network more effectively. Systematic experiments conducted on benchmark datasets demonstrate the state-of-the-art performance of the proposed PQN.