
1 Introduction

A main challenge for “learning to hash” methods lies in the discrete nature of the problem. Most approaches are formulated as non-linear mixed integer programming problems which are computationally intractable. Common optimization remedies include discarding the binary constraints and solving for continuous embeddings [1,2,3,4,5]. At test time the embeddings are typically thresholded to obtain the desired binary codes. However, even the relaxed problem is highly non-convex requiring nontrivial optimization procedures (e.g., [6]), and the thresholded embeddings are prone to large quantization errors, necessitating additional measures (e.g., [7]).

One prominent alternative to the relaxation approach is two-stage hashing, which decomposes the optimization problem into two stages: (i) binary code inference and (ii) hash function learning. Binary codes are first inferred for the training set and then used as target vectors in the hash function learning stage. Such methods closely adhere to the discrete nature of the problem, as the binary codes are directly incorporated into the optimization procedure. In two-stage hashing, most of the attention is drawn to the more challenging binary code inference step. Typically, this task is itself decomposed into a stage-wise problem in which binary codes are learned in an iterative fashion. While theoretical guarantees for the underlying iterative scheme are usually provided, the overall quality of the binary codes is often overlooked. It is desirable to also determine the quality of the constructed binary codes.

In this paper, our first contribution is to provide an analysis of the quality of the learned binary codes in two-stage hashing. We focus on the frequently considered matrix fitting formulation (e.g., [6, 8,9,10,11]), in which a “neighborhood structure” is defined through an affinity matrix and the task is to generate binary codes so as to preserve the affinity values. We first demonstrate that ordinary Hamming distances are unable to fully preserve the neighborhood. Then, with a weighted Hamming metric, we prove that a residual learning scheme can construct binary codes that preserve any neighborhood with arbitrary accuracy under mild assumptions. Our analysis reveals that distance scaling, as well as fixing the dimensionality of the Hamming space, both often employed in hashing studies [6, 12,13,14], are unnecessary.

On the other hand, one common inconvenience in two-stage hashing methods is that steps (i) and (ii) are often interleaved, so as to enable bit correction during training [11, 15, 16]. Bit correction has been shown to improve retrieval performance, especially when the hash mapping consists of simple functions such as linear hyperplanes and decision stumps [9]. In contrast, we show that such an interleaved process is unnecessary with high-capacity hash functions such as Convolutional Neural Networks (CNNs).

A further benefit of removing interleaving is that the affinity matrix can be constructed directly according to the definition of the neighborhood structure, instead of the pairwise similarities between training instances. For example, when preserving semantic similarity, the neighborhood is generally defined through class label agreement. Defining the affinity with respect to labels rather than instances yields a much smaller optimization problem for the inference task (i), and provides robustness for the subsequent hash function learning (ii). In contrast, instance-based inference schemes result in larger optimization problems, often necessitating subsampling to reduce the scale.

With these insights in mind, we implement our novel two-stage hashing method with standard CNN architectures, and conduct experiments on multiple image retrieval datasets. The affinity matrix in our formulation may or may not be derived from class labels, and can constitute binary or multi-level affinities. In fact, we consider a variety of experiments that include multi-class (\(\mathsf {CIFAR}\)-\(\mathsf {10}\) [17], \(\mathsf {ImageNet100}\) [18]), multi-label (\(\mathsf {NUSWIDE}\) [19]) and unlabeled (\(\mathsf {22K}\) \(\mathsf {LabelMe}\) [20]) datasets. We achieve new state-of-the-art performance for all of these datasets. In summary, our contributions are:

  1.

    We provide a technical analysis on the quality of the inferred binary codes demonstrating that under mild assumptions we can fit any neighborhood with arbitrary accuracy. Our analysis is relevant to the formulations used in many two-stage hashing methods (e.g., [8, 9, 11, 21, 22]).

  2.

    We demonstrate that with high-capacity hash functions such as CNNs, the bit correction task is expendable. As a result, binary code inference can be performed on items that directly define the neighborhood, yielding more robust target vectors and improving the retrieval performance. We achieve state-of-the-art performance in four standard image retrieval benchmarks.

2 Related Work

We only review hashing studies most relevant to our problem. For a general survey, please refer to [23].

The two-stage strategy for hashing was pioneered by Lin et al. [21], in which the authors reduced the binary code inference task to a series of binary quadratic programming (BQP) problems. The target codes are optimized in an iterative fashion, and traditional machine learning classifiers such as Support Vector Machines (SVMs) and linear hyperplanes that fit the target vectors are employed as the hash functions. In [9], the authors proposed a graph-cut algorithm to solve the BQP problem and employed boosted decision trees as the hash functions. The graph-cut algorithm has been shown to yield a solution well bounded with respect to the optimal value [24]. The authors also demonstrated that, with shallow models, an interleaved process of binary code inference and hash function learning allowed bit correction and improved the retrieval performance. Differently, Xia et al. [8] proposed using a coordinate descent algorithm with Newton’s method to solve the BQP problem and utilized CNNs as the hash mapping. Do et al. [11] solved the BQP problem using semidefinite relaxation and Lagrangian approaches. They also investigated the quality of the relaxed solution and proved that it is within a factor of the global minimum. Zhuang et al. [22] demonstrated that the same BQP approach can be extended to solve a triplet-based loss function. Other work reminiscent of these two-stage methods includes hashing techniques that employ alternating optimization to minimize the original optimization problem [10, 15, 16, 25].

While error bounds and convergence properties of the underlying iterative scheme are usually provided, none of the aforementioned studies provide a technical guarantee on the overall quality of the constructed binary codes. In this study we provide such an analysis. Our technical analysis has connections to low-rank matrix learning [26,27,28,29], in that we construct binary codes via a gradient descent, or matrix pursuit, methodology. Differently, we constrain ourselves to binary rank-one matrices, which are required for Hamming distance computations. Also, while not all two-stage hashing studies follow an interleaved process (e.g., [21, 30, 31]), to the best of our knowledge, all construct the affinity matrix using training instances. This warrants an in-depth look at the necessity of such a process when high-capacity hash functions are employed.

Our hashing formulation follows the matrix fitting formulation that is almost exclusively used in two-stage methods. This formulation was originally proposed in [6] and has been widely adopted in subsequent hashing studies (e.g., [8, 9, 11, 21, 32]). While the major contribution of this paper lies in establishing convergence properties of the binary code inference task, our formulation also has subtle but key differences from [6] and other two-stage methods. Specifically, we allow weighted Hamming distances with optimally learned weights given the inferred binary codes. We perform inference directly on items that define the neighborhood, enabling more robust target vector construction, as will be shown. In retrieval experiments, we compare against recent hashing studies, including [4, 14, 16, 33,34,35,36,37,38,39], and achieve state-of-the-art performance.

3 Formulation

In this section, we first discuss the two stages of our hashing formulation: binary code inference and hash mapping learning. An analysis on affinity matrix construction comes next. All proofs are provided in the supplementary material.

3.1 Binary Code Inference

In this section, we explain our inference step (i). We are given a metric space \((\mathcal {X}, d)\) where \(\mathcal {X}=\{\mathbf {x}_1,\cdots , \mathbf {x}_n\}\) denotes a set of items and \(d:\mathcal {X} \times \mathcal {X} \rightarrow \mathbb {R}_{\ge 0}\) is a metric. Note that \(\mathbf {x}\) can correspond to instances, labels, multi-labels or any item that is involved in defining the neighborhood. Given the assumption that the neighborhood is defined through metric d, we learn the hash mapping \(\varPhi :\mathcal {X}\rightarrow \mathbb {H}^b\) by optimizing the neighborhood preservation fit:

$$\begin{aligned} \min _{\varPhi }\sum _{i,j} [\gamma d(\mathbf {x}_i, \mathbf {x}_j) -d_h(\varPhi (\mathbf {x}_i), \varPhi (\mathbf {x}_j))]^2, \end{aligned}$$
(1)

where \(d_h\) is the Hamming distance and \(\gamma \) is a suitably selected scaling parameter. In order to scale distances to the range of \(d_h\), we set \(\gamma ={b}/{d_{\max }}\), where \(d_{\max } = \max _{\mathbf {x}, \mathbf {y} \in \mathcal {X}} d(\mathbf {x}, \mathbf {y})\) is known. Solving Eq. 1 entails discrete loss minimization, which in general is a non-linear mixed-integer programming problem. Instead, two-stage methods decompose the solution into two steps, the first involving a binary integer program to find a set of binary codes, or auxiliary variables, \(\{\mathbf {u}_i \in \mathbb {H}^b\}_{i=1}^n\) that minimize Eq. 1. This program can be formulated as:

$$\begin{aligned} \min _{\mathbf {u}} \sum _{i,j } [ \gamma d(\mathbf {x}_i, \mathbf {x}_j) - d_h(\mathbf {u}_i, \mathbf {u}_j)]^2 = \min _{\mathbf {u}}\frac{1}{4} \sum _{i,j } [\mathbf {u}_i^\top \mathbf {u}_j - s(\mathbf {x}_i, \mathbf {x}_j)]^2, \end{aligned}$$
(2)

where \( s(\mathbf {x}_i, \mathbf {x}_j) = b- 2 \gamma d(\mathbf {x}_i, \mathbf {x}_j), \forall i,j \in \mathcal {X}\). While the LHS of Eq. 2 is a distance equivalence problem, the RHS is an affinity matching task. Such affinity based preservation objectives have also been considered previously [1, 6, 14, 33].
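To make the change of variables concrete, the target affinities \(s(\mathbf {x}_i, \mathbf {x}_j)\) can be computed directly from a pairwise distance matrix; the snippet below is a minimal NumPy sketch (the function name and toy data are illustrative, not from the paper):

```python
import numpy as np

def affinity_from_distances(D, b):
    """Build the target affinity matrix R of Eq. 2 from a pairwise distance
    matrix D (n x n), assuming d_max = D.max() is finite.

    R_ij = s(x_i, x_j) = b - 2 * gamma * d(x_i, x_j) with gamma = b / d_max,
    so a zero distance maps to affinity b and the largest distance to -b."""
    gamma = b / D.max()
    return b - 2.0 * gamma * D

# Toy usage: Euclidean distances between random points.
X = np.random.randn(100, 16)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
R = affinity_from_distances(D, b=32)   # symmetric, values in [-b, b]
```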

In our formulation, we consider weighted Hamming distances by weighting each bit in \(\mathbf {u}\). The weighted Hamming distance has been used in past studies to provide more granular similarities compared to its unweighted counterpart (e.g., [40,41,42,43,44]). While this hashing scheme still enjoys low memory footprint and fast distance computations, weighting the individual bits enables us to construct binary codes that better preserve affinity values, as will be shown later.

We reformulate Eq. 2 by defining the weight vector \(\varvec{\alpha } = [\alpha _1, \cdots , \alpha _b]^\top \):

$$\begin{aligned} \frac{1}{4} \sum _{i,j } [(\varvec{\alpha } \odot \mathbf {u}_i)^\top \mathbf {u}_j - s(\mathbf {x}_i, \mathbf {x}_j)]^2 \propto \frac{1}{2} \Vert \mathcal {U} - \mathcal {R}\Vert _F^2 = f(\mathcal {U}), \end{aligned}$$
(3)

where \(\odot \) denotes the Hadamard product, \(\mathcal {U}_{ij} = (\varvec{\alpha } \odot \mathbf {u}_i)^\top \mathbf {u}_j , \mathcal {R}_{ij} = s(\mathbf {x}_i, \mathbf {x}_j), \forall i,j \in \mathcal {X}\) and \(\Vert \,{\cdot }\,\Vert _F\) denotes the Frobenius norm. We note that the affinity matrix \(\mathcal {R}\) is real and symmetric as per its construction from metric d.
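As a small illustration (the helper name is ours), the weighted affinity matrix \(\mathcal {U}\) of Eq. 3 can be formed directly from the code matrix and the bit weights:

```python
import numpy as np

def weighted_affinity(V, alpha):
    """U_ij = (alpha ⊙ u_i)^T u_j for codes V in {-1,+1}^{n x b} and bit
    weights alpha in R^b; equivalently V diag(alpha) V^T."""
    return (V * alpha) @ V.T

# With alpha = np.ones(b) this reduces to V @ V.T, i.e., the unweighted inner
# products underlying the ordinary Hamming distance d_h = (b - u_i^T u_j) / 2.
```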

Let \(\mathbf {V} = [\mathbf {u}_1, \cdots , \mathbf {u}_n]^\top \in \mathbb {H}^{n \times b}\) denote the binary code matrix, then \(\mathcal {U}\) can be written as the weighted sum of b rank-one matrices \(\sum _{k=1}^b \alpha _k \mathbf {v}_k\mathbf {v}_k^\top \) where \(\mathbf {v}_k \in \{-1, 1\}^n\) is the k-th column in \(\mathbf {V}\). Given this fact, our binary inference problem can be reformulated as:

$$\begin{aligned} \min ~f(\mathcal {U}), ~~\text {s.t.}~~\mathcal {U} = \sum _{k=1}^b \alpha _k \mathbf {v}_k\mathbf {v}_k^\top , ~ \mathbf {v}_k \in \{-1,+1\}^n. \end{aligned}$$
(4)

The additive property of \(\mathcal {U}\) is attractive, since it suggests that the problem could be solved by a stepwise algorithm that adds the \(\mathbf {v}_k\)’s one by one. In particular, we will apply the projected gradient descent algorithm to solve Eq. 4. Starting with an initial value, \(\mathcal {U}_0 = \mathbf {0}\), an update step can be formulated as:

$$\begin{aligned} \mathcal {U}_t \leftarrow \mathcal {U}_{t-1} + \alpha _t \mathbf {v}_t\mathbf {v}_t^\top , \end{aligned}$$
(5)

where

$$\begin{aligned} \mathbf {v}_t = \mathop {\mathrm {arg\,max}}\limits _{{\mathbf {v} \in \{-1,+1\}^n}} \langle \mathbf {v}\mathbf {v}^\top , -\nabla f(\mathcal {U}_{t-1})\rangle \end{aligned}$$
(6)

finds the projection of the negative gradient direction \(-\nabla f(\mathcal {U}_{t-1})\) in the subspace spanned by rank-one binary matrices, and \(\alpha _t\) is a step size. This projection is important for maintaining the additive property in Eq. 4.

Since \(\langle \mathbf {v}\mathbf {v}^\top , \nabla f \rangle = \mathbf {v}^\top \nabla f\mathbf {v}\), Eq. 6 is a BQP problem which in general is NP-hard. Here, we take a spectral relaxation approach which is also used in past methods (e.g., [6, 21, 33]). A closed-form solution to Eq. 6 exists if the binary vector \(\mathbf {v}\) is relaxed to continuous values. Specifically, if \(Q = -\nabla f(\mathcal {U})\), the following relaxation yields the Rayleigh Quotient [45]:

$$\begin{aligned} \max _{\mathbf {v}^\top \mathbf {v} = n} \mathbf {v}^\top Q\mathbf {v} = n\lambda _{\max }(Q), \end{aligned}$$
(7)

where \(\lambda _{\max }\) denotes the largest eigenvalue, and the optimal solution, \(\mathbf {v}^*\), is the corresponding eigenvector. The binarized value of \(\mathbf {v}^*\), \(\text {sgn}(\mathbf {v}^*)\), is an approximate solution for Eq. 6. This solution can optionally be used as an initial point for BQP solvers that further maximize Eq. 6 (e.g., [46,47,48]). Note that the main technical results to be given are independent of the particular BQP solver.
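Under these definitions, a minimal NumPy sketch of the relaxed update direction (Eqs. 6–7) might look as follows; the helper name is ours and the zero-entry handling is a practical detail, not from the paper:

```python
import numpy as np

def spectral_direction(Q):
    """Approximate arg max_{v in {-1,+1}^n} v^T Q v via spectral relaxation:
    take the eigenvector of the symmetric residual Q associated with its
    largest eigenvalue and binarize it with the sign function (Eq. 7)."""
    eigvals, eigvecs = np.linalg.eigh(Q)   # eigenvalues in ascending order
    v = np.sign(eigvecs[:, -1])            # leading eigenvector, binarized
    v[v == 0] = 1                          # avoid zero entries after sign()
    return v
```

An off-the-shelf BQP solver could further refine this output, as noted above.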

The negative gradient \(-\nabla f(\mathcal {U}_{t-1}) = \mathcal {R} - \sum _{k=1}^{t-1} \alpha _k \mathbf {v}_k\mathbf {v}_k^\top \), also a symmetric matrix, can be considered as the residual at iteration \(t-1\). At each iteration, we find the rank-one matrix most correlated with this residual and move our solution in that direction. If the step size \(\alpha _t\) is set to 1 for all t, then \(\mathcal {U}\) can be decomposed as the product of the binary code matrices \(\mathbf {V}\mathbf {V}^\top \), yielding ordinary Hamming distances. However, with constant step sizes, the following property states that there exist certain affinity matrices \(\mathcal {R}\) that no such \(\mathcal {U}\) can fit exactly.

Property 1

Let \(Q_{t}\) be the residual \(-\nabla f(\mathcal {U}_{t})\) at iteration t. There exists a \(\mathcal {R}\) such that \(\forall t, \Vert Q_{t}\Vert _F > 0\).

Such a result motivates us to relax the constraint on the step size parameter \(\alpha _t\). If \(\alpha \) is relaxed to any real value, then what we have essentially is a weighted Hamming distance, and we demonstrate that one can monotonically decrease the norm of the residual in this case. We now provide our main theorem:

Theorem 2

If \(\alpha _t \in \mathbb {R}\), then the gradient descent algorithm in Eqs. 5–6 satisfies

$$\begin{aligned} \Vert Q_t\Vert _F \le \eta ^{t-1}\Vert Q_{t-1}\Vert _F,~~\forall t \end{aligned}$$
(8)

where \(\eta \in [0, 1]\).

Theorem 2 states that the norm of the residual is only monotonically non-increasing. However, it may not strictly decrease, since the solution \(\mathbf {v}_t\) of Eq. 6 can actually be orthogonal to the gradient, i.e., \(\mathbf {v}_t^\top Q_{t-1}\mathbf {v}_t\) might be zero. If we ensure non-orthogonal directions are selected at each iteration, then the residual strictly decreases, as the following corollary states.

Corollary 3

If \(\mathbf {v}_t^\top Q_{t-1} \mathbf {v}_t \ne 0, ~\forall t\) then the residual norm \(\Vert Q_t\Vert _F\) strictly decreases.

Although the directions \(\mathbf {v}_t\mathbf {v}_t^\top \) are greedily selected with step sizes \(\alpha _t\), one can refine step sizes of all past directions at each iteration. This generally leads to much faster convergence. More formally, we can refine the step size parameters by solving the following regression problem:

$$\begin{aligned} \varvec{\alpha }^* = {\mathop {\mathrm {arg\,min}}\limits _{{\alpha _1, \cdots , \alpha _t}}} \frac{1}{2}\Vert \sum _{k=1}^t \alpha _k \mathbf {v}_k\mathbf {v}_k^\top - \mathcal {R}\Vert _F^2. \end{aligned}$$
(9)

Fortunately, Eq. 9 is an ordinary least squares problem admitting a closed-form solution. Let \(\mathbf {\widehat{v}}_k = \text {vec}(\mathbf {v}_k \mathbf {v}_k^\top )\) and \(\mathbf {\widehat{r}} = \text {vec}(\mathcal {R})\) where vec\((\cdot )\) denotes the vectorization operator. Given \(\mathbf {\widehat{V}}_t = [\mathbf {\widehat{v}}_1, \cdots , \mathbf {\widehat{v}}_t]\), the minimizer of Eq. 9 is

$$\begin{aligned} \varvec{\alpha }_t^* = (\mathbf {\widehat{V}}_t^\top \mathbf {\widehat{V}}_t)^{-1}\mathbf {\widehat{V}}_t^\top \mathbf {\widehat{r}}, \end{aligned}$$
(10)

where \(\varvec{\alpha }_t^* = [{\alpha _1^*, \cdots , \alpha _t^*}]^\top \). The solution requires \( \mathcal {O}(t^3) + \mathcal {O}(t^2 n^2) +\mathcal {O}(tn^2)\) operations with \(n=|\mathcal {X}|\). If \(n > \sqrt{t}\), the time complexity is dominated by the \(\mathcal {O}(t^2 n^2)\) term. Note that in practice, typical values for t, the number of bits, are small (\({<}100\)) and can be considered a constant factor.
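A direct sketch of this closed-form refinement (Eq. 10) is given below, solved via least squares rather than an explicit matrix inverse for numerical stability; the helper name is ours:

```python
import numpy as np

def refine_alphas(Vs, R):
    """Closed-form refinement of all step sizes (Eq. 10).

    Vs : list of t binary direction vectors v_k in {-1,+1}^n.
    R  : n x n target affinity matrix.
    Returns alpha* minimizing ||sum_k alpha_k v_k v_k^T - R||_F, where the
    k-th column of Vhat is vec(v_k v_k^T)."""
    Vhat = np.stack([np.outer(v, v).ravel() for v in Vs], axis=1)  # n^2 x t
    alpha, *_ = np.linalg.lstsq(Vhat, R.ravel(), rcond=None)
    return alpha
```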

We now provide a property indicating that this refinement of the step-sizes does not break the monotonicity as defined in Theorem 2 and Corollary 3.

Property 4

Let \(Q_t\) be the residual matrix at iteration t and \(\alpha _t\) set according to Theorem 2. Let \(\widehat{Q}_t\) be the residual after refining the step-sizes \(\varvec{\alpha _t} = [\alpha _1, \cdots , \alpha _t]^\top \) using Eq. 10. Then \(\Vert \widehat{Q}_t\Vert _F \le \Vert Q_t\Vert _F\).

After learning \(\mathcal {U} = \sum _{k=1}^{b} \alpha _k \mathbf {v}_k\mathbf {v}_k^\top = (\mathbf {A} \odot \mathbf {V})\mathbf {V}^\top \), where \(A_{k,\cdot } = [\alpha _1, \cdots , \alpha _b], \forall k\), we obtain our binary code matrix \(\mathbf {V} = [\mathbf {u}_1, \cdots , \mathbf {u}_n]^\top \) that contains the target codes for each element of \(\mathcal {X} = \{\mathbf {x}_1,\cdots , \mathbf {x}_n\}\). This concludes our inference step (i). We summarize our inference scheme in Algorithm 1 (Binary code inference).

Remarks

We consider two different binary inference schemes: constant, where the binary codes are constructed with constant step sizes, yielding ordinary Hamming distances; and regress, where each bit is weighted, yielding the weighted Hamming distance. For regress, since \((\varvec{\alpha } \odot \mathbf {u}_i)^\top \mathbf {u}_j = b(1-2{d(\mathbf {x}_i, \mathbf {x}_j)}/{d_{\max }}) \) in Eq. 3, we can embed the constant b into the weight vector \(\varvec{\alpha }\). As a result, in contrast to hashing methods where the Hamming space dimensionality b must be specified (e.g., to set margin and scaling parameters [6, 10, 14]), our method only requires \(d_{\max }\) to be bounded. The ordinary Hamming distance, i.e., constant, on the other hand, requires scaling with b beforehand. The approximate solution of Eq. 7 can be improved by using off-the-shelf BQP solvers. In Algorithm 1 (Binary code inference), we refer to such solvers as the subroutine Improve\((\cdot )\). In this paper, we use a simple heuristic [46], which merely requires a positive objective value for Eq. 6.

We now proceed with step (ii): hash mapping learning.

Algorithm 1. Binary code inference.
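For concreteness, the listing below sketches the full inference loop as we read it from the description above; it is not the authors' reference implementation, it reuses the spectral_direction and refine_alphas helpers sketched earlier, and it omits the optional Improve\((\cdot)\) refinement:

```python
import numpy as np

def binary_matrix_pursuit(R, b, regress=True):
    """Sketch of the binary code inference loop.

    R       : n x n symmetric affinity matrix.
    b       : number of bits (pursuit iterations).
    regress : True  -> refit all step sizes at every iteration
                       (weighted Hamming distances, the 'regress' variant);
              False -> constant step sizes alpha_t = 1 (the 'constant' variant).
    Returns the code matrix V in {-1,+1}^{n x b} and the bit weights alpha."""
    V = []
    alphas = np.ones(0)
    U = np.zeros_like(R, dtype=float)
    for t in range(b):
        Q = R - U                          # residual = negative gradient (Eq. 5)
        v = spectral_direction(Q)          # approximate solution of Eq. 6
        V.append(v)
        if regress:
            alphas = refine_alphas(V, R)   # closed-form refit of all steps (Eq. 10)
        else:
            alphas = np.ones(len(V))
        U = sum(a * np.outer(u, u) for a, u in zip(alphas, V))
    return np.stack(V, axis=1), alphas     # rows of V are the target codes u_i
```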

3.2 Hash Mapping Learning

Recall that we inferred target codes \(\mathbf {u} \in \mathbb {H}^b\) for each item \(\mathbf {x} \in \mathcal {X}\), where \(\mathbf {x}\) may correspond to data instances, classes, multi-labels etc., depending on the neighborhood definition. For example, when the dataset is unsupervised and the neighborhood is defined merely through data instances, then \(\mathcal {X}\) may correspond to the feature space with \(d(\mathbf {x}_i, \mathbf {x}_j)\) being the Euclidean distance. For multi-class datasets, \(\mathcal {X}\) and \(d(\mathbf {x}_i, \mathbf {x}_j)\) may represent the set of classes and the distance values between pairs of classes, respectively. For multi-label datasets, \(\mathcal {X}\) may correspond to the set of possible label combinations. Our binary inference scheme constructs target codes for items that directly define the neighborhood. In the experiments section, we cover various scenarios.

If \(\mathcal {X}\) does not represent the feature space, then after the binary inference step (i), the target codes are assigned to data instances in a one-to-many fashion, depending on the relationship between the target code and the data instance. For the sake of clarity, we assume \(\mathcal {X}\) is the feature space in this section.

We employ a collection of hash functions to learn the mapping, where a function \(f:\mathcal {X} \rightarrow \{-1, 1\}\) accounts for the generation of a bit in the binary code. Many types of hash functions are considered in the literature. For simplicity, we consider the thresholded scoring function:

$$\begin{aligned} f(\mathbf {x})\triangleq \text {sgn} (\psi (\mathbf {x})), \end{aligned}$$
(11)

where \(\psi \) can be either a shallow model such as a linear function, or a deep neural network. In experiments, we consider both types of embeddings. \(\varPhi (\mathbf {x}) = [f_1 (\mathbf {x}),\cdots ,f_b (\mathbf {x})]^\top \) then becomes a vector-valued function to be learned.

Recall that we inferred target codes \(\mathbf {u} \in \mathbb {H}^b\) for each element \(\mathbf {x} \in \mathcal {X}\). Having the target codes at our disposal, we now would like to find \(\varPhi \) such that the Hamming distances between \(\varPhi (\mathbf {x})\) and the corresponding target codes \(\mathbf {u}\) are minimized. Hence, the objective can be formulated as:

$$\begin{aligned} \sum _{i=1}^n d_h(\varPhi (\mathbf {x}_i),\mathbf {u}_i). \end{aligned}$$
(12)

The Hamming distance is defined as \(d_h( \varPhi (\mathbf {x}_i), \mathbf {u}_i) = \sum _t [\![f_t(\mathbf {x}_i) \ne u_{it}]\!]\) where both \(d_h\) and the functions \(f_t\) are non-differentiable. Fortunately, we can relax \(f_t\) by dropping the sgn function in Eq. 11 and derive an upper bound on the Hamming loss. Note that \(d_h( \varPhi (\mathbf {x}_i), \mathbf {u}_i) = \sum _t [\![f_t(\mathbf {x}_i) \ne u_{it}]\!] \le \sum _t l(-u_{it} \psi _t(\mathbf {x}_i))\) with a suitably selected convex margin-based function l. Thus, by substituting this surrogate function into Eq. 12, we can directly minimize this upper bound using stochastic gradient descent. We use the hinge loss as the upper bound l.
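To make step (ii) concrete, below is a minimal sketch for the case where \(\psi \) is a linear embedding, using a hinge penalty on the margin \(u_{it}\psi _t(\mathbf {x}_i)\) and plain (sub)gradient descent; in the deep setting, \(\psi \) would be a CNN trained with SGD on the same surrogate. Function names and hyperparameters are illustrative:

```python
import numpy as np

def learn_linear_hash(X, U, lr=1e-3, epochs=50):
    """Minimal sketch of step (ii) for a linear embedding psi(x) = W^T x.
    Minimizes the hinge surrogate of Eq. 12,
        sum_i sum_t max(0, 1 - u_{it} * psi_t(x_i)),
    by (sub)gradient descent.  X : n x d features, U : n x b codes in {-1,+1}."""
    n, d = X.shape
    b = U.shape[1]
    W = 1e-3 * np.random.randn(d, b)
    for _ in range(epochs):
        margins = U * (X @ W)                  # n x b, u_{it} * psi_t(x_i)
        active = (margins < 1).astype(float)   # hinge subgradient indicator
        grad = -(X.T @ (active * U)) / n       # gradient of the averaged hinge loss
        W -= lr * grad
    return W

def hash_codes(X, W):
    """Phi(x) = sgn(psi(x)), Eq. 11."""
    return np.sign(X @ W)
```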

As in other two-stage hashing methods, at the heart of our formulation are the target vectors, which are inferred so as to fit the affinity matrix \(\mathcal {R}\). Next, we take a closer look at how to construct this affinity matrix.

3.3 Affinity Matrix Construction

The affinity matrix can be defined through pairwise similarities of items that directly define the neighborhood, which may not correspond to training instances. Despite this flexibility of the formulation, previous related hashing studies generally consider using training instances.

For certain neighborhoods, constructing the affinity matrix with training instances might yield suboptimal binary codes. To illustrate this case, consider Fig. 1, where we compare two sets of binary codes inferred from two different affinity matrices in a series of experiments. The neighborhood definition in these experiments is a standard one, typically found in nearly all hashing work. Specifically, we assume 10 classes and define the class affinity matrix as shown in Fig. 1(a). We also consider a hypothetical set of 1000 instances, each assigned to one of these 10 classes, and construct the affinity matrix as shown in Fig. 1(b), which we simply refer to as the instance affinity matrix. Similarities between instances are based on their class IDs and are deduced from the class affinity matrix. We infer binary codes of varying lengths so as to reconstruct the class and instance affinity matrices. As explained in Sect. 3.2, instances are assigned the binary code of their respective classes for class-based inference. The experiments are repeated 5 times and average results are reported.

Fig. 1. In a series of experiments, we compare two sets of binary codes constructed with two different affinity matrices: class based (a) and instance based (b). (c)–(e) contrast the binary codes with respect to residual norm, \(\mathsf {mAP}\) and inference time, with separate curves for codes inferred from the class and instance affinity matrices. We also learn hash functions of varying complexities to fit the inferred binary codes and plot the fraction of non-matched bits to the total number of bits (f).

We first highlight the residual matrix \(Q_t\) norm in Fig. 1(c). Note that the residual norm of the class-based inference converges to zero within fewer iterations: 40-bit codes are able to reconstruct the class affinity matrix with minimal discrepancy. On the other hand, lengthier codes are required to fully reconstruct the instance affinity matrix. We also provide the retrieval performance for the two sets of binary codes, with Mean Average Precision (\(\mathsf {mAP}\)) as the evaluation criterion. For this experiment, 100 instances are sampled from the instance set as queries, while the rest constitute the retrieval set. As demonstrated in Fig. 1(d), there is a dramatic difference in \(\mathsf {mAP}\) values, especially with compact codes; the difference can be as large as 0.40. This type of sub-optimality for binary codes inferred through the instance affinity matrix has also been observed previously (e.g., [22]). Lastly, Fig. 1(e) gives the training time for the two inference schemes. While the inference time depends on the particular BQP solver, the number of decision variables nevertheless scales quadratically with the number of items in \(\mathcal {X}\), as seen by the dramatic difference in training time between the two inference schemes, especially with lengthier codes. Depending on the instance matrix size, the difference can easily scale up, requiring subsampling to reduce the size of the optimization task.

Given these evident disadvantages, why is the affinity matrix typically constructed from instances? The primary reason is that in most two-stage hashing methods the inference and hash function learning steps are interleaved for on-the-fly bit correction purposes. This requires the affinity matrix to correspond to pairwise instance similarities, as the inferred bits will immediately be used for training the hash functions. However, given recent advances in deep learning, high-capacity predictors are becoming available, nullifying the need for bit correction. Consequently, one can opt to solve a smaller and more robust optimization problem defined on items that directly define the neighborhood.

To illustrate this point we learn hash functions of varying complexities to fit the set of binary codes, and plot the fraction of non-matched bits to the total number of bits during hash function learning. We use the training set of \(\mathsf {CIFAR}\)-\(\mathsf {10}\) and train the hash functions to fit the inferred 32-bit binary codes (total: 32 \(\times \) 50,000 bits). We consider single-layer neural networks on \(\mathsf {GIST}\) [49] and \(\mathsf {fc7}\) features of a \(\mathsf {VGG}\)-\(\mathsf {F}\) network [50] pretrained on ImageNet [19], in addition to fine-tuning all the \(\mathsf {VGG}\)-\(\mathsf {F}\) layers. Figure 1(f) gives the results. Notice that as the capacity of the hash function increases, the ratio of non-matched bits decreases significantly. While this ratio is above 0.25 with a single-layer neural net on \(\mathsf {GIST}\), the single-layer neural net trained on \(\mathsf {fc7}\) features yields just above 10% unmatched bits. When we fine-tune all layers of a \(\mathsf {VGG}\)-\(\mathsf {F}\) network this percentage drops well below 10%. We expect that with more complex architectures the ratio will diminish even further.

We incorporate these insights into our formulation and conduct retrieval experiments against competing methods in the next section, where we achieve new state-of-the-art performances.

4 Experiments

We conduct experiments on widely used image retrieval benchmarks: \(\mathsf {CIFAR}\)-\(\mathsf {10}\) [17], \(\mathsf {NUSWIDE}\) [18], \(\mathsf {22K}\) \(\mathsf {LabelMe}\) [20] and \(\mathsf {ImageNet100}\) [19].

\(\mathsf {CIFAR}\)-\(\mathsf {10}\) is a dataset for image classification and retrieval, containing 60K images from 10 different categories. We follow the setup of [2, 14, 22, 38]. This setup corresponds to two distinct partitions of the dataset. In the first case (cifar-1), we sample 500 images per category, resulting in 5,000 training examples to learn the hash mapping. The test set contains 100 images per category (1000 in total). The remaining images are then used to populate the hash table. In the second case (cifar-2), we sample 1000 images per category to construct the test set (10,000 in total). The remaining items are both used to learn the hash mapping and populate the hash table. Two images are considered neighbors if they belong to the same class.

\(\mathsf {NUSWIDE}\) is a dataset containing 269K images. Each image can be associated with multiple labels, corresponding to 81 ground-truth concepts. Following the setup in [2, 14, 22, 38], we only consider images annotated with the 21 most frequent labels. In total, this corresponds to 195,834 images. The experimental setup also has two distinct partitionings: nus-1 and nus-2. For both cases, a test set is constructed by randomly sampling 100 images per label (2,100 images in total). To learn the hash mapping, 500 images per label are randomly sampled in nus-1 (10,500 in total). The remaining images are then used to populate the hash table. In the second case, nus-2, all the images excluding the test set are used in learning the hash mapping and populating the hash table. Two images are considered neighbors if they share at least one label. We also specify a richer neighborhood by allowing multi-level affinities. In this scenario, two images have an affinity value equal to the number of labels they share.

\(\mathsf {22K}\) \(\mathsf {LabelMe}\) consists of 22K images, each represented with a 512-dimensional \(\mathsf {GIST}\) descriptor. Following [3, 12], we randomly partition the dataset into two: a training and a test set consisting of 20K and 2K instances, respectively. A 5K subset of the training set is used in learning the hash mapping. As this dataset is unsupervised, we use the \(l_2\) distance in determining the neighborhood. Similar to \(\mathsf {NUSWIDE}\), we allow multi-level affinities for this dataset. We consider four distance percentiles deduced from the training set and assign multi-level affinity values between the instances.

\(\mathsf {ImageNet100}\) is a subset of ImageNet [19] containing 130K images from 100 classes. We follow [4] and sample 100 images per class for training. All images in the selected classes from the ILSVRC 2012 validation set are used as the test set. Two images are considered neighbors if they belong to the same class.

Experiments without using multi-level affinities in defining the neighborhood are evaluated using a variant of Mean Average Precision (\(\mathsf {mAP}\)), depending on the protocol we follow. We collectively group these as binary affinity experiments. Multi-level affinity experiments are evaluated using Normalized Discounted Cumulative Gain (\(\mathsf {NDCG}\)), a metric standard in information retrieval for measuring ranking quality with multi-level similarities. In both experiments, Hamming distances are used to retrieve and rank data instances.
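For reference, a minimal sketch of the \(\mathsf {NDCG}\) computation in its standard exponential-gain, logarithmic-discount form; the exact truncation and gain definition used in our experiments may differ:

```python
import numpy as np

def ndcg(ranked_affinities):
    """NDCG for a single query: `ranked_affinities` holds the multi-level
    affinity of every database item to the query, in the order induced by
    Hamming distance; the ideal ranking sorts the same values in decreasing
    order of affinity."""
    rel = np.asarray(ranked_affinities, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = np.sum((2.0 ** rel - 1.0) * discounts)
    idcg = np.sum((2.0 ** np.sort(rel)[::-1] - 1.0) * discounts)
    return dcg / idcg if idcg > 0 else 0.0
```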

We term our method \(\mathsf {HBMP}\) (Hashing with Binary Matrix Pursuit), and compare it against state-of-the-art hashing methods. These methods include: Spectral Hashing (SH) [33], Iterative Quantization (ITQ) [34], Supervised Hashing with Kernels (SHK) [6], Fast Hashing with Decision Trees (FastHash) [9], Structured Hashing (StructHash) [37], Supervised Discrete Hashing (SDH) [16], Efficient Training of Very Deep Neural Networks (VDSH) [36], Deep Supervised Hashing with Pairwise Labels (DPSH) [38], Deep Supervised Hashing with Triplet Labels (DTSH) [14] and Mutual Information Hashing (MIHash) [39, 51]. These competing methods have been shown to outperform earlier and other works such as [1, 2, 8, 12, 13, 41, 52].

For the \(\mathsf {CIFAR}\)-\(\mathsf {10}\) and \(\mathsf {NUSWIDE}\) experiments, we fine-tune the \(\mathsf {VGG}\)-\(\mathsf {F}\) architecture. For \(\mathsf {ImageNet100}\) experiments, we fine-tune the \(\mathsf {AlexNet}\) architecture. Both deep learning models are pretrained on the ImageNet dataset. For non-deep methods, we use the output of the penultimate layer of both architectures as input features. For the \(\mathsf {22K}\) \(\mathsf {LabelMe}\) benchmark, we learn shallow models on top of the \(\mathsf {GIST}\) descriptor. For deep learning based hashing methods, this corresponds to using a single fully connected neural network layer.

4.1 Results

We provide results for experiments with binary similarities with \(\mathsf {mAP}\) as the evaluation criterion, and then for multi-level similarities with \(\mathsf {NDCG}\). In \(\mathsf {CIFAR}\)-\(\mathsf {10}\), the set \(\mathcal {X}\) upon which binary inference is performed represents the 10 classes. For \(\mathsf {NUSWIDE}\), as the neighborhood is defined using the multi-labels, it is intuitive for \(\mathcal {X}\) to represent label combinations. In our case, we consider the unique label combinations in the training set, resulting in \(|\mathcal {X}|=4850\) items for binary inference. For the \(\mathsf {22K}\) \(\mathsf {LabelMe}\) dataset, the items directly correspond to training instances. We provide results for the regress binary inference scheme, denoted simply as \(\mathsf {HBMP}\). A comparison between constant and regress is given in the supplementary material.

Table 1. Binary affinity experiments on \(\mathsf {CIFAR}\)-\(\mathsf {10}\) and \(\mathsf {NUSWIDE}\) datasets with cifar-1 and nus-1 partitionings. The underlying deep learning architecture is \(\mathsf {VGG}\)-\(\mathsf {F}\). \(\mathsf {HBMP}\) outperforms competing methods on \(\mathsf {CIFAR}\)-\(\mathsf {10}\), and shows improvements, especially with lengthier codes on \(\mathsf {NUSWIDE}\).
Table 2. Binary affinity experiments on \(\mathsf {CIFAR}\)-\(\mathsf {10}\) and \(\mathsf {NUSWIDE}\) datasets with cifar-2 and nus-2 partitionings (with \(\mathsf {VGG}\)-\(\mathsf {F}\) architecture). \(\mathsf {HBMP}\) achieves new state-of-the-art performances, significantly improving over competing methods.

Binary Affinity Experiments. Table 1 gives results for the cifar-1 and nus-1 experimental settings, in which \(\mathsf {mAP}\) and \(\mathsf {mAP@5K}\) values are reported for the \(\mathsf {CIFAR}\)-\(\mathsf {10}\) and \(\mathsf {NUSWIDE}\) datasets, respectively. Deep-learning based hashing methods such as DPSH, DTSH and MIHash outperform most non-deep hashing solutions. This is not surprising, as feature representations are learned jointly with the hash mapping in these methods. Certain two-stage methods, e.g., FastHash, remain competitive with, and even surpass, top deep-learning methods including DTSH and MIHash for various hash code lengths, especially on \(\mathsf {NUSWIDE}\). Our two-stage method, \(\mathsf {HBMP}\), outperforms all competing methods in the majority of cases, including MIHash, DTSH and DPSH, with very large improvement margins. Specifically, for \(\mathsf {CIFAR}\)-\(\mathsf {10}\), the best competing method is MIHash, a recent study that learns the hash mapping using a mutual information formulation. The improvement over MIHash is over \(\mathbf {6\%}\) for certain hash code lengths, e.g., for 12 bits \(\mathbf {0.799}\) vs. 0.738 \(\mathsf {mAP}\). Our method significantly improves over SHK as well, which also proposes a matrix fitting formulation but learns its hash mapping in an interleaved manner. This validates defining the binary code inference over items that directly define the neighborhood, i.e., classes for \(\mathsf {CIFAR}\)-\(\mathsf {10}\).

For the \(\mathsf {NUSWIDE}\) dataset, the binary inference is done over the set of label combinations in the training data. \(\mathsf {HBMP}\) either matches or outperforms the state-of-the-art hashing methods. A relevant recent two-stage hashing method is [22], in which the same settings (cifar-1 and nus-1) are used but with fine-tuning a \(\mathsf {VGG-16}\) architecture. Their \(\mathsf {CIFAR}\)-\(\mathsf {10}\) and \(\mathsf {NUSWIDE}\) results reach at most 0.80 \(\mathsf {mAP}\) and 0.75 \(\mathsf {mAP@5K}\), respectively, across all hash code lengths. \(\mathsf {HBMP}\), on the other hand, achieves these performance values with the less powerful \(\mathsf {VGG}\)-\(\mathsf {F}\) architecture.

To further emphasize the merits of \(\mathsf {HBMP}\), we consider the experimental settings cifar-2 and nus-2 and compare against recent deep-learning hashing methods. In this setting, we again fine-tune the \(\mathsf {VGG}\)-\(\mathsf {F}\) architecture pretrained on ImageNet. Table 2 gives the results. Notice that our method significantly outperforms all techniques, and yields new state-of-the-art results for \(\mathsf {CIFAR}\)-\(\mathsf {10}\) and \(\mathsf {NUSWIDE}\).

Retrieval results for \(\mathsf {ImageNet100}\) are given in Table 3. In these experiments, we only compare against MIHash, the overall best competing method in the previous experiments, and HashNet [4], another recent deep-learning based hashing study. As demonstrated, \(\mathsf {HBMP}\) establishes the new state-of-the-art in image retrieval for this benchmark. \(\mathsf {HBMP}\) outperforms both methods significantly; e.g., with 64 bits, we demonstrate a 4–6% improvement. This further validates the quality of the binary codes produced with \(\mathsf {HBMP}\).

Table 3. \(\mathsf {mAP@1K}\) values on \(\mathsf {ImageNet100}\) using \(\mathsf {AlexNet}\). \(\mathsf {HBMP}\) outperforms the two state-of-the-art formulations using mutual information [39] and continuation methods [4].

Multilevel Affinity Experiments. In these experiments, we allow multi-level similarities between items of set \(\mathcal {X}\) and use \(\mathsf {NDCG}\) as the evaluation criterion. For \(\mathsf {NUSWIDE}\), we consider the number of shared labels as affinity values. For the \(\mathsf {22K}\) \(\mathsf {LabelMe}\) dataset, we use distance percentiles \(\{2\%, 5\%, 10\%, 20\%\}\) deduced from the training set to assign inversely proportional affinity values between the training instances. This emphasizes multi-level rankings among neighbors in the original feature space. In \(\mathsf {22K}\) \(\mathsf {LabelMe}\), we use a single fully connected layer as the hash mapping for the deep-learning based methods.

Table 4 gives the results. For \(\mathsf {NUSWIDE}\), \(\mathsf {HBMP}\) outperforms all state-of-the-art methods, including MIHash. In \(\mathsf {22K}\) \(\mathsf {LabelMe}\), \(\mathsf {HBMP}\) either achieves state-of-the-art performance or is a close second. An interesting observation is that, when the feature learning aspect is removed due to the use of precomputed \(\mathsf {GIST}\) features, non-deep methods such as FastHash and StructHash outperform the deep-learning hashing methods DPSH and DTSH. While FastHash and StructHash benefit from non-linear hash functions such as boosted decision trees, this also indicates that the strength of DPSH and DTSH may come primarily from feature learning. On the other hand, both \(\mathsf {HBMP}\) and MIHash show top performance with a single fully connected layer as the hash mapping, indicating that they produce binary codes that more accurately reflect the neighborhood. For \(\mathsf {22K}\) \(\mathsf {LabelMe}\), the set \(\mathcal {X}\) in \(\mathsf {HBMP}\) corresponds to training instances, as in the other methods. This suggests that the performance improvement of \(\mathsf {HBMP}\) is not merely due to the binary inference being performed on items that directly define the neighborhood, but also due to our formulation, which learns a Hamming metric with optimally selected bit weights.

Table 4. Multi-level affinity experiments on \(\mathsf {NUSWIDE}\) and \(\mathsf {22K}\) \(\mathsf {LabelMe}\) using \(\mathsf {VGG}\)-\(\mathsf {F}\) and \(\mathsf {GIST}\), respectively. The partitioning used for \(\mathsf {NUSWIDE}\) is nus-1. The evaluation criterion is Normalized Discounted Cumulative Gain (\(\mathsf {NDCG})\). \(\mathsf {HBMP}\) improves over the state-of-the-art in majority of the cases.

5 Conclusion

We have proposed improvements to a commonly used formulation in two-stage hashing methods. We first provided a theoretical result on the quality of the binary codes, showing that, under mild assumptions, we can construct binary codes that fit the neighborhood with arbitrary accuracy. Secondly, we analyzed the sub-optimality of binary codes constructed so as to fit an affinity matrix that is not defined on items directly related to the neighborhood. Incorporating our findings, we proposed a novel two-stage hashing method that significantly outperforms previous hashing studies on multiple benchmarks.