
1 Introduction

Collaborative filtering (CF) is one of the most widely used recommendation techniques [14, 47]. Given a user, CF recommends items by aggregating the preferences of similar users. Among CF approaches, methods based on nearest neighbours (NN) are widely used thanks to their simplicity, efficiency and ability to produce accurate and personalized recommendations [13, 35, 44]. Although deep learning (DL) methods [16, 19, 43] have attracted much attention in the recommendation community over the past few years, a very recent study [12] shows that NN-based CF remains a strong baseline and outperforms many DL methods. For NN-based methods, the user similarity measure plays a central role: it serves as the criterion for selecting the group of similar users whose ratings form the basis of the recommendations, and it weights users so that more similar users have a greater impact on the recommendations. Beyond CF, user similarity is also important for applications such as link prediction [4] and community detection [34].

Related Work. Traditional similarity measures, such as cosine similarity (COS) [9], Pearson's correlation coefficient (PCC) [9] and their variants [18, 29, 38, 39], have been widely used in CF [13, 44]. However, such measures consider only co-rated items and ignore ratings on other items; since ratings are sparse and co-rated items are rare in many real-world datasets [35, 40, 44], they may capture users' preferences only coarsely. Other similarity measures, such as Jaccard [22], MSD [39], JMSD [8], URP [27], NHSM [27], PIP [5] and BS [14], do not utilize all of the rating information [6]. For example, Jaccard uses only the number of rated items and omits the rating values, while URP uses only the mean and the variance of the ratings. Critically, all of these measures give a similarity of zero when there are no co-rated items, which harms recommendation performance. Recently, BCF [35] and HUSM [44] were proposed to alleviate the co-rating issue by modeling user similarity as a weighted sum of item similarities, where the weights are obtained heuristically. As the weights are not derived in a principled manner, these measures do not satisfy properties such as the triangle inequality and zero self-distance, which are important for a high-quality similarity measure.

The Earth Mover’s Distance (EMD) is a distance metric on probability distributions that originates from optimal transportation theory [25, 37]. EMD has been applied in many areas, such as computer vision [7], natural language processing [17, 23] and signal processing [41]. EMD has also been applied to CF [48], but there it serves as a regularizer that forces the latent variable to fit a Gaussian prior during auto-encoder training, rather than as a user similarity measure.

Our Solution. We propose the Preference Mover’s Distance (PMD), which considers all ratings made by each user and can evaluate user similarity even in the absence of co-rated items. Like BCF and HUSM, PMD uses item similarity as side information and assumes that two users have similar tastes if they hold similar opinions on similar items. The key difference is that PMD formulates the distance between a pair of users as an optimal transportation problem [26, 36], so that the weights for item similarities are derived in a principled manner. In fact, PMD can be viewed as a special case of EMD [33, 37, 45] and is therefore a metric satisfying properties such as the triangle inequality and zero self-distance. We also make PMD practical for large datasets by employing the Sinkhorn algorithm [10] to speed up distance computation and HNSW [30] to accelerate the search for similar users. Experimental results show that PMD yields better recommendation accuracy than state-of-the-art similarity measures, especially on sparse datasets.

2 Preference Mover’s Distance

Problem Definition. Let \(\mathcal {U}\) be a set of m users and \(\mathcal {I}\) a set of n items. The user-item interaction matrix is denoted by \( \mathbf {R} \in \mathbb {R}^{m\times n}\), where \(\mathbf {R}(u,i) \ge 0\) is the rating user u gives to item i. \(\mathbf {R}\) is partially observed and usually highly sparse. For a user \(u \in \mathcal {U}\), her rated items are denoted by \(\mathcal {I}_u \subset \mathcal {I}\). Item distances are described by a matrix \(\mathbf {D}\), where \(\mathbf {D}(i,j)\ge 0\) is the distance between items i and j. Item similarities can be derived from the ratings on items [35, 44] or from content information [46], such as item tags, comments, etc. In this paper, we assume \(\mathbf {D}\) is given. We are interested in computing the distance between any pair (u, v) of users in \(\mathcal {U}\) given \(\mathbf {R}\) and \(\mathbf {D}\). User similarity can be derived directly from the user distance, as the two are negatively correlated.

PMD. Let \(\varSigma _k=\{\mathbf {p}\in [0,1]^k \;|\;{\mathbf {p}^{\top }\mathbbm {1}}=1\}\) denote the \((k-1)\)-dimensional simplex, where \(\mathbbm {1}\) is the all-ones column vector. We model a user’s preferences as a probability distribution \(\mathbf {p}_u\in \varSigma _{|\mathcal {I}_u |}\) over \(\mathcal {I}_u\), where \(\mathbf {p}_u(i)\) indicates how much user u likes item i. In practice, the ground-truth \(\mathbf {p}_u\) cannot be observed, so we estimate it by normalizing user u’s ratings on \(\mathcal {I}_u\), i.e., \(\mathbf {p}_u(i) \approx \frac{\mathbf {R}(u,i)}{\sum _{j \in \mathcal {I}_u}\mathbf {R}(u,j)}\) for \(i \in \mathcal {I}_u\). We model the distance between users u and v, denoted by \(d(\mathbf {p}_u,\mathbf {p}_v)\), as a weighted average of the distances between their rated items, i.e.,

$$\begin{aligned} \sum _{i \in \mathcal {I}_u}\sum _{j \in \mathcal {I}_v}\mathbf {W}_{u,v}(i,j)\mathbf {D}(i,j), \end{aligned}$$
(1)

where \(\mathbf {W}_{u,v}(i,j)\ge 0\) is the weight for an item pair (i, j) and we introduce the constraint \(\sum _{i \in \mathcal {I}_u}\sum _{j \in \mathcal {I}_v}\mathbf {W}_{u,v}(i,j)\!=\!1\) to control the scaling. \(\sum _{j \in \mathcal {I}_v}\mathbf {W}_{u,v}(i,j)\) is the aggregate weight received by item i for user u; it should be large when \(\mathbf {p}_u(i)\) is large, so that \(d(\mathbf {p}_u,\mathbf {p}_v)\) focuses on the items that user u likes. Similarly, \(\sum _{i \in \mathcal {I}_u}\mathbf {W}_{u,v}(i,j)\) should be large when \(\mathbf {p}_v(j)\) is large. Thus, we constrain the marginal distributions of \(\mathbf {W}_{u,v}\) to follow \(\mathbf {p}_u\) and \(\mathbf {p}_v\), i.e., \(\mathbf {W}_{u,v} \in U(\mathbf {p}_u,\mathbf {p}_v) \), where

$$\begin{aligned} U(\mathbf {p}_u,\mathbf {p}_v) :=&\left\{ \mathbf {W}_{u,v}\in [0,1]^{|\mathcal {I}_u|\times |\mathcal {I}_v|} \;|\; \mathbf {W}_{u,v}\mathbbm {1}=\mathbf {p}_u, \mathbf {W}_{u,v}^{\top } \mathbbm {1} =\mathbf {p}_v \right\} . \end{aligned}$$
(2)

However, \(U(\mathbf {p}_u,\mathbf {p}_v)\) contains many different configurations of \(\mathbf {W}_{u,v}\), which means that the user distance is indeterminate. Therefore, we define the user distance as the smallest among all possibilities:

$$\begin{aligned} d(\mathbf {p}_u,\mathbf {p}_v) :=\min _{\mathbf {W}_{u,v}\in U(\mathbf {p}_u,\mathbf {p}_v)} \sum _{i \in \mathcal {I}_u}\sum _{j \in \mathcal {I}_v}\mathbf {W}_{u,v}(i,j)\mathbf {D}(i,j). \end{aligned}$$
(3)

Equation (3) is a special case of the earth mover’s distance (EMD) [11] with moment parameter \(p=1\) on a discrete probability space. Moreover, PMD is a metric as long as \(\mathbf {D}\) is a metric [37]. We call \(d(\mathbf {p}_u,\mathbf {p}_v)\) the preference mover’s distance (PMD) to highlight its connection to EMD. Being a metric gives the user distance several desirable properties. For example, the triangle inequality implies that if user A and user B are both similar to a third user C, then A and B are also similar to each other. Moreover, if \(\mathbf {D}(i,i)=0\), a user is most similar to herself among all users. In contrast, it is unclear whether BCF and HUSM enjoy these properties, as they determine the weights heuristically.
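
To make Eq. (3) concrete, the following is a minimal sketch of solving it exactly as a linear program over the flattened transport plan, using NumPy and SciPy. The function names are illustrative, and the sketch assumes \(\mathbf {D}\) has already been restricted to the rows and columns corresponding to \(\mathcal {I}_u\) and \(\mathcal {I}_v\); it is not the implementation used in the paper.

```python
import numpy as np
from scipy.optimize import linprog

def preference_distribution(ratings):
    """Estimate p_u by normalizing a user's ratings, as described above."""
    r = np.asarray(ratings, dtype=float)
    return r / r.sum()

def pmd_exact(p_u, p_v, D):
    """Exact PMD (Eq. 3): min <W, D>  s.t.  W 1 = p_u,  W^T 1 = p_v,  W >= 0."""
    a, b = D.shape
    c = D.reshape(-1)                        # objective coefficients, W flattened row-major
    A_rows = np.zeros((a, a * b))            # row marginals: sum_j W(i, j) = p_u(i)
    for i in range(a):
        A_rows[i, i * b:(i + 1) * b] = 1.0
    A_cols = np.zeros((b, a * b))            # column marginals: sum_i W(i, j) = p_v(j)
    for j in range(b):
        A_cols[j, j::b] = 1.0
    res = linprog(c, A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([p_u, p_v]),
                  bounds=(0, None), method="highs")
    return res.fun                           # d(p_u, p_v)
```

Since the marginal constraints already force the entries of \(\mathbf {W}_{u,v}\) to sum to one, the scaling constraint stated below Eq. (1) is satisfied automatically.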

Fig. 1. An example of PMD. (a) shows the preference distributions of \(u_0\), \(u_1\) and \(u_2\) as histograms, where the arrows depict the optimal transportation plan (i.e., \(\mathbf {W}_{u,v}\)) between the preference distributions. (b) is the distance matrix for the 5 movies, in which movies of the same genre have a smaller distance, i.e., are more similar.

Illustration. Intuitively, \(d(\mathbf {p}_u,\mathbf {p}_v)\) can be viewed as the minimum cost of transforming the ratings of user u into the ratings of user v, as shown in Fig. 1. \(\mathbf {p}_u\) and \(\mathbf {p}_v\) define two distributions of mass, while \(\mathbf {D}(i,j)\) models the cost of moving one unit of mass from \(\mathbf {p}_u(i)\) to \(\mathbf {p}_v(j)\). Therefore, PMD can model the similarity between u and v even if they have no co-rated items. If two users like similar items, \(\mathbf {W}_{u,v}(i,j)\) takes large values for item pairs with small \(\mathbf {D}(i,j)\), which results in a small distance. This is the case for \(u_0\) and \(u_1\) in Fig. 1, as both like science fiction movies. In contrast, if two users like dissimilar items, \(\mathbf {W}_{u,v}(i,j)\) is large for item pairs with large \(\mathbf {D}(i,j)\), which produces a large distance. In Fig. 1, \(u_0\) likes science fiction movies while \(u_2\) likes romantic movies, and thus \(d(\mathbf {p}_{u_0},\mathbf {p}_{u_2})\) is large. Even though \(u_0\) has no co-rated movies with \(u_1\) or \(u_2\), PMD still gives \(d(\mathbf {p}_{u_0},\mathbf {p}_{u_1})<d(\mathbf {p}_{u_0},\mathbf {p}_{u_2})\), correctly indicating that \(u_0\) is more similar to \(u_1\) than to \(u_2\).

Computation Speedup. Solving the optimization problem in Eq. (3) exactly has a time complexity of \(O(q^3\log q)\) [36], where \(q=|\mathcal {I}_u \cup \mathcal {I}_v|\). To reduce this cost, we use the Sinkhorn algorithm [10], which produces a high-quality approximate solution with a complexity of \(O(q^2)\). To speed up the search for similar users in large datasets, we employ HNSW [30], a state-of-the-art algorithm for similarity search. HNSW builds a multi-layer k-nearest-neighbour (KNN) graph over the dataset and returns high-quality nearest neighbours for a query with \(O(\log N)\) distance computations, where N is the number of users. With these two techniques, retrieving the top 100 neighbours of a user takes only 0.02 s on average and achieves a recall of 99.2% on the Epinions dataset in our experiments. The experiments were conducted on a machine with two 2.0 GHz Intel(R) Xeon(R) E5-2620 CPUs (12 physical cores in total), 48 GB RAM, a 450 GB SATA disk (6 Gb/s, 10k rpm, 64 MB cache), and 64-bit CentOS release 7.2.
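
For illustration, a minimal sketch of the Sinkhorn-style entropic approximation of Eq. (3) is given below; the regularization strength `reg` and the fixed number of iterations are illustrative hyper-parameters rather than the settings used in the paper, and in practice an off-the-shelf implementation (e.g., from the POT library) could be used instead.

```python
import numpy as np

def pmd_sinkhorn(p_u, p_v, D, reg=0.05, n_iters=200):
    """Entropic-regularized approximation of Eq. (3) via Sinkhorn iterations.

    Alternately rescales the rows and columns of the Gibbs kernel
    K = exp(-D / reg) so that the marginals of the transport plan match
    p_u and p_v; each iteration costs O(q^2).
    """
    K = np.exp(-D / reg)
    u = np.ones_like(p_u)
    for _ in range(n_iters):
        v = p_v / (K.T @ u)                  # fit column marginals
        u = p_u / (K @ v)                    # fit row marginals
    W = u[:, None] * K * v[None, :]          # approximate transport plan
    return float(np.sum(W * D))              # approximate d(p_u, p_v)
```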

Positive/Negative Feedback. We can split the user ratings into positive ratings \( \mathbf {R}^{p}\), e.g., 3, 4 and 5 on a 1–5 scale, which indicate that the user likes the item, and negative ratings \(\mathbf {R}^{n}\), e.g., 1 and 2, which indicate that the user dislikes the item. Based on \( \mathbf {R}^p\) and \( \mathbf {R}^n\), we define a positive preference \(\mathbf {p}^p_u\) and a negative preference \(\mathbf {p}^n_u\), i.e., \(\mathbf {p}^p_u(i) = \frac{\mathbf {R}^p(u,i)}{\sum _{j \in \mathbf {R}^p}\mathbf {R}^p(u,j)}\) and \(\mathbf {p}^n_u(i) = \frac{\frac{1}{\mathbf {R}^n(u,i)}}{\sum _{j \in \mathbf {R}^n}\frac{1}{\mathbf {R}^n(u,j)}}\), so that lower negative ratings receive larger mass. Using Eq. (3), we can then define more fine-grained user distances, e.g., \(d(\mathbf {p}^p_u, \mathbf {p}^p_v)\), \(d(\mathbf {p}^n_u, \mathbf {p}^n_v)\), \(d(\mathbf {p}^n_u, \mathbf {p}^p_v)\) and \(d(\mathbf {p}^p_u, \mathbf {p}^n_v)\). A small \(d(\mathbf {p}^n_u, \mathbf {p}^n_v)\) indicates that the two users dislike similar items, which can be used to avoid making bad recommendations that may drive users away. A small \(d(\mathbf {p}^p_u, \mathbf {p}^n_v)\) or \(d(\mathbf {p}^n_u, \mathbf {p}^p_v)\) means that the interests of the two users complement each other, which may be useful for friend recommendation in social networks. We may also construct a composite PMD (CPMD) such as:

$$\begin{aligned} \tilde{d}(\mathbf {p}_u,\mathbf {p}_v):=\mu d(\mathbf {p}^p_u,\mathbf {p}^p_v)+(1-\mu ) d(\mathbf {p}^n_u,\mathbf {p}^n_v), \end{aligned}$$
(4)

where \(\mu \in [0,1]\) is a tuning parameter that weights the relative importance of the positive-preference and negative-preference distances.
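
To illustrate the split, the following is a minimal sketch assuming a 1–5 rating scale with ratings of 3 and above treated as positive, as in the example above; the helper names and the default \(\mu =0.6\) (the best value found in Sect. 3) are illustrative.

```python
def split_preferences(ratings, pos_threshold=3):
    """Split one user's ratings into positive/negative preference distributions.

    `ratings` maps item id -> score on a 1-5 scale. Negative preferences use
    inverse ratings (1 / R^n), so an item rated 1 gets more mass than one rated 2.
    """
    pos = {i: float(r) for i, r in ratings.items() if r >= pos_threshold}
    neg = {i: 1.0 / r for i, r in ratings.items() if r < pos_threshold}
    z_pos, z_neg = sum(pos.values()), sum(neg.values())
    p_pos = {i: w / z_pos for i, w in pos.items()}
    p_neg = {i: w / z_neg for i, w in neg.items()}
    return p_pos, p_neg

def cpmd(d_pos, d_neg, mu=0.6):
    """Composite PMD (Eq. 4) from precomputed positive/negative PMD values."""
    return mu * d_pos + (1 - mu) * d_neg
```

Setting \(\mu =1\) recovers PMD computed on positive preferences only, while \(\mu =0\) uses only the negative preferences.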

3 Experiments

We evaluate PMD by comparing NN-based recommendation performance under various user similarity measures. Two well-known datasets, MovieLens-1M [2] and Epinions [1], are used; their statistics are reported in Table 1. The rating user u gives to item i is predicted as a weighted sum over u's top-K neighbours in the training set, i.e., \(\hat{\mathbf {R}}(u,i)=\bar{u}+\sum _{v\in \mathcal {N}_u}\frac{s(u,v)\times (\mathbf {R}(v,i)-\bar{v})}{\sum _{v\in \mathcal {N}_u}s(u,v)}\) [13], where \(\bar{u}\) is the average of the ratings given by user u, \(\mathcal {N}_u\) contains the top-K neighbours of u, and s(u, v) is the similarity between users u and v. We convert PMD into a similarity measure using \(s(u,v)=2-d(\mathbf {p}_u,\mathbf {p}_v)\) and split all ratings into train/validation/test sets with an 8:1:1 ratio. Hyper-parameters are tuned on the validation set for all methods. The mean absolute error (MAE) and the root mean square error (RMSE) [15, 31, 32] of the predicted ratings on the test set are used to evaluate recommendation performance.
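
As a sketch of this prediction rule, the snippet below assumes the observed ratings are stored in a dictionary keyed by (user, item), with precomputed user means, a top-K neighbour list (e.g., retrieved with HNSW) and cached similarities; these data structures are illustrative rather than the actual experimental code.

```python
def pmd_similarity(d_uv):
    """Convert a PMD value into a similarity, s(u, v) = 2 - d(p_u, p_v)."""
    return 2.0 - d_uv

def predict_rating(u, i, R, user_means, neighbours, sims):
    """Mean-centered NN prediction:
    R_hat(u, i) = u_bar + sum_v s(u,v) * (R(v,i) - v_bar) / sum_v s(u,v).

    R:          dict mapping (user, item) -> observed rating
    user_means: dict mapping user -> average of that user's ratings
    neighbours: the top-K users most similar to u
    sims:       dict mapping (u, v) -> s(u, v)
    """
    num, den = 0.0, 0.0
    for v in neighbours:
        if (v, i) in R:                      # use only neighbours who rated item i
            s = sims[(u, v)]
            num += s * (R[(v, i)] - user_means[v])
            den += s
    if den == 0.0:                           # no neighbour rated item i
        return user_means[u]
    return user_means[u] + num / den
```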

Table 1. Data statistics.
Table 2. CPMD under different K and \(\mu \).

Item Similarity. Both MovieLens and Epinions come with side information for computing item similarities. For MovieLens, we compute movie similarity using Tag-genome [3, 42]. For Epinions, we compute item similarity by applying Doc2Vec [24] to the review comments. Since both Tag-genome and Doc2Vec measure item similarity with cosine, we convert item similarity into a distance using \(\mathbf {D}(i,j)=\arccos (s(i,j))\), which is a metric on the item space. For a fair comparison, the same item similarity matrix is used for PMD, BCF and HUSM.
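
A minimal sketch of this conversion, assuming items are given as rows of an embedding matrix (e.g., Tag-genome or Doc2Vec vectors); the clipping merely guards against floating-point round-off before taking the arc-cosine.

```python
import numpy as np

def item_distance_matrix(item_vectors):
    """D(i, j) = arccos(cos_sim(i, j)): the angular distance, a metric on items."""
    X = item_vectors / np.linalg.norm(item_vectors, axis=1, keepdims=True)
    S = X @ X.T                              # pairwise cosine similarities
    return np.arccos(np.clip(S, -1.0, 1.0))
```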

Comparison Methods. COS, PCC and MSD are three classical user similarity measures. Jaccard, JMSD, NHSM, BCF and HUSM are five state-of-the-art measures. NMF [28], SVD [21] and SVD++ [20] are latent factor models for CF.

Table 3. Comparison with other user similarity measures.
Table 4. Comparison with latent factor models.

We report the performance of the similarity measures in Table 3, where PMD is based on Eq. (3) and CPMD on Eq. (4). The results show that PMD and CPMD consistently outperform the other similarity measures, and the improvement is more significant on the Epinions dataset, which is much sparser. We believe our methods perform well on sparse datasets mainly because they utilize all rating information and derive the item weights via optimal transportation, which works well when there are few or no co-rated items. This is favorable as ratings are sparse in many real-world datasets [40]. CPMD achieves better performance than PMD, which suggests that it is beneficial to distinguish positive from negative feedback.

We also compare our methods with the latent factor models in Table 4. On the sparse Epinions dataset, both PMD and CPMD outperform the latent factor models. We report the performance of CPMD-based NN CF under different configurations of K and \(\mu\) in Table 2. CPMD performs best when \(\mu\) is around 0.6 on both datasets, possibly because positive ratings represent a user's taste better than negative ratings. In contrast, the optimal value of K is dataset-dependent.

4 Conclusions

We proposed PMD, a novel user distance measure based on optimal transportation, which addresses the limitation of existing methods in dealing with datasets with few co-rated items. PMD also has the favorable properties of a metric. Experimental results show that PMD leads to better recommendation accuracy for NN-based CF than the state-of-the-art user similarity measures, especially when the ratings are highly sparse.