
1 Introduction

Making data publicly available creates unexpected privacy risks. Recent examples include AOL’s release of users’ search keywords [30], which led to the identification of users and their profiles [1]. Data released by Netflix was de-anonymized by leveraging IMDB and the dates of user ratings [28], showing that the release of data cannot be analyzed in isolation. The privacy risks of combining different public records have enabled several de-anonymization attacks [36]. Recent studies of anonymized mobility data showed that mobility traces can be de-anonymized by leveraging a few observations [19]. One source of consumer information is their spending patterns. To date, however, it has been unclear to what extent consumer prices leak information about the respective purchase.

Consumer purchase histories are typically recorded by store chains with loyalty programs and are used to compute consumer spending profiles [6]. Banks, payment card issuers, and point-of-sale system providers collect this data at different levels of granularity. In a number of scenarios, it might be desirable to share this data within different departments of a company, across companies, or with the public [7]. Before disclosure, the data is sanitized so that it does not leak sensitive data, such as personally identifiable information, and so that it (partially or fully) hides location information. In new digital currency systems such as Bitcoin [33] and Ripple [10], transaction values are stored on a public ledger. Irrespective of whether transaction values are made available so that a system can fulfill its functions or are disclosed for research purposes, it is important to understand the privacy implications of such disclosures.

In this paper we focus on quantifying the location disclosure resulting from the release of prices from consumers’ purchase histories. Intuitively, the price distribution for a product differs from country to country, which allows us to identify possible purchase locations. We focus on consumer products that are generally inexpensive (\(\le 25\) USD) and frequently bought. More precisely, based on global prices (leveraging the Numbeo dataset [9]), we show that given access to a few consumer prices (and even without the product categories, the precise times of purchase, or the currency), an adversary can determine the country in which the purchase occurred. Similarly, given the country, the city can be determined, and within a city (leveraging the Chicago dataset [11]), the local store can be identified. We further demonstrate that it is possible to distinguish purchases among store chains (leveraging the Kaggle dataset [7]).

Fig. 1. Framework overview for quantifying location privacy leakage from consumer price datasets.

We present a generic framework (cf. Fig. 1) that allows the modeling and quantitative evaluation of location leakage from consumer price datasets. In our framework we model the adversarial knowledge, composed of a public dataset of consumer prices and location-specific information. We assume that the adversary has access to the individual product prices of a purchase (similar to the Kaggle dataset) and a coarse-grained value of the purchase time. In order to make the framework more flexible, our model supports different prior knowledge scenarios, e.g., the adversary additionally has access to the merchant category (e.g., knowledge that the product was bought in a market or a restaurant) or the product category (e.g., apples). Furthermore, we model the adversarial attack by detailing the corresponding probability functions. In particular, we point out how the adversary leverages multiple product prices in order to increase the probability of identifying the correct location.

Within our framework, we quantify the location privacy of consumer purchases in relation to different dimensions. For example, we measure how well the adversary estimates the location probability of the purchases with the \(F_1\)-score [35], capturing the test’s accuracy. Furthermore, we use mutual information [18] to quantify the absolute location privacy loss of consumers, based on the considered price dataset. In addition, we capture the relative privacy loss by measuring the reduction in entropy. The proposed metrics are independent of the choice of adversarial strategy and therefore allow us to quantitatively measure the privacy loss induced from any price dataset known to the adversary.

We apply our framework to three real-world datasets: (i) the Numbeo dataset [9] contains, after outlier filtering, crowd-sourced real-world consumer prices from 112 countries and 23 US cities for 23 distinct product categories; (ii) the Chicago dataset [11] contains 24 million prices for 28 product categories, capturing an average of 6304 products sold in Dominick’s stores within the Chicago metropolitan area; finally, (iii) the Kaggle dataset [7] contains 350 million purchases from 311,541 consumers across 134 store chains.

Our evaluation shows that in order to infer the country based on a vector of purchases, an adversary often needs to observe fewer than 30 prices. Similarly, after having identified the country of the purchases and given roughly 30 prices, we show that the city can be reliably predicted among 23 major cities within the United States. Finally, when the adversary has narrowed down the coarse location, such as the Chicago metropolitan area, we show that, based on a regional price dataset and given a vector of purchases, an adversary can distinguish with high confidence among local stores using 100 purchases. For comparison, a weaker adversary with access only to coarse-grained time, i.e., the day of the purchase, and price information requires 50 purchases to identify the country. Furthermore, to establish the practical utility of our methodology, we evaluate it on a dataset of purchase records (Kaggle [7]) and show that an adversary requires approximately 250 purchases to distinguish with high confidence among 134 store chains.

The main contributions of this paper are as follows:

  • We propose a generic quantitative framework for evaluating attacks against the location privacy of consumer purchases. We validate our framework on three independent price datasets of real-world consumer prices and show that location information can be extracted reliably.

  • We introduce three privacy metrics to capture the performance of the adversary in the attack as well as the extent to which location privacy of consumers is reduced when the adversary has access to a specific dataset of purchases.

To the best of our knowledge, this is the first work to infer the location of a purchase based on the price values of consumer purchases. The remainder of this paper is organized as follows. In Sect. 2, we model purchase histories and describe the adversarial model. In Sect. 3, we present the datasets used in our evaluation, which we report in Sect. 4. We survey the related work in Sect. 5 and conclude the paper in Sect. 6.

2 Model

In this section we introduce our system and adversarial model. We present the privacy metrics that quantify the probability of location disclosure based on the assumption that the adversary has access to a part of a consumer’s purchase history.

2.1 System Model

A consumer interacts with merchants and performs purchases of one or more products. This interaction leaves a trace of purchase activity as a sequence of purchase events. We model each of the consumer’s purchase events together with its contextual information as e: {consumer u, value v, product p, product category c, location l, time t}, where v is the price value spent on product p of product category c at location l and time t. In our model, one purchase event is limited to one product, similar to the data contained in the Kaggle dataset. In addition, the price value is given in a global currency, which is usually different from the local currency of the purchase (e.g., the original price is in SEK but is recorded in USD). The trace of purchases performed by the target consumer U, given as a series of purchase events, is denoted by \(S_U = \{e_1, e_2, \ldots, e_n\}\). We define the following functions to represent the adversarial knowledge:

  • Location Probability: It describes the prior probability of a purchase event taking place in a specific location, e.g., \(P(\text {USA})\) is the prior probability with which a random purchase event e has \(e.l =\) USA. We define \(\mathbb {L}\) as the set of all considered locations.

  • Category Probability: Given location l, \(P(c \mid l)\) describes the conditional probability that a purchase event belongs to a certain product category, e.g., \(P(\text {Milk} \mid \text {USA})\) is the conditional probability with which a random event e from the USA has \(e.c = \) Milk. This conditional probability models the product category preferences in a location. We define \(\mathbb {C}\) as the set of all considered product categories.

  • Value Probability: Given location l and product category c, \(P(v\mid l, c)\) describes the conditional probability that a purchase event has a given price value. It models the price distributions for different product categories in different locations, e.g., \(P(1.5 \mid \text {USA}, \text {Milk})\) is the conditional probability with which milk can be bought in the USA for 1.5 units of a global currency.

The adversary can now model the spending behavior and identify likely candidate locations. Specifically, the adversary computes the posterior probability that a single price value v for a product category c originated from a location l. The computation involves the prior and the conditional probabilities described above and the application of Bayes’ theorem:

$$\begin{aligned} P(l\mid c, v) = \frac{P(l) \cdot P(c, v\mid l)}{P(c, v)} \end{aligned}$$
(1)

In order to infer the location without knowing the product category, the adversary computes the probability that a price value v originates from location l:

$$\begin{aligned} P(l\mid v) = \frac{P(l) \cdot P(v\mid l)}{P(v)} \end{aligned}$$
(2)
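To make the computation concrete, the following minimal sketch evaluates Eqs. 1 and 2, assuming the priors and conditional probabilities are available as plain Python dictionaries. The dictionary layout and all function names are illustrative assumptions, not the authors’ implementation.

```python
# Sketch of Eqs. 1 and 2 (assumed data layout, not the authors' code):
#   prior_location[l]                        -> P(l)
#   category_given_location[l][c]            -> P(c | l)
#   value_given_location_category[l][(c, v)] -> P(v | l, c)

def posterior_location_given_category_value(prior_location, category_given_location,
                                             value_given_location_category, c, v):
    """P(l | c, v) via Bayes' theorem (Eq. 1)."""
    joint = {l: prior_location[l]
                * category_given_location[l].get(c, 0.0)
                * value_given_location_category[l].get((c, v), 0.0)
             for l in prior_location}
    evidence = sum(joint.values())                 # P(c, v)
    return {l: p / evidence for l, p in joint.items()} if evidence else joint


def posterior_location_given_value(prior_location, category_given_location,
                                   value_given_location_category, v):
    """P(l | v) via Eq. 2, marginalising over the unknown product category."""
    joint = {}
    for l in prior_location:
        # P(v | l) = sum_c P(c | l) * P(v | l, c)
        p_v_given_l = sum(p_c * value_given_location_category[l].get((c, v), 0.0)
                          for c, p_c in category_given_location[l].items())
        joint[l] = prior_location[l] * p_v_given_l
    evidence = sum(joint.values())                 # P(v)
    return {l: p / evidence for l, p in joint.items()} if evidence else joint
```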

2.2 Adversarial Model

The adversary’s goal is to identify the location of the events in \(S_U\). In this section we present two different adversaries: (1) an adversary with complete knowledge and (2) an adversary with only public knowledge.

Adversary with Complete Knowledge. The ideal adversary represents a strong adversary with complete access to global purchase events. In particular, the adversary has access to the following prior knowledge:

  • Global Purchase History: The complete series of purchase events in the history of global purchases [Footnote 1], denoted by \(\mathcal {H}_G\). The adversary computes the posterior probability of a location based on \(\mathcal {H}_G\).

  • History for Target Consumer: The adversary might have access to prior information about the target consumer’s purchase history, denoted by \(\mathcal {H}_U\). This could help the adversary to optimize the model for the target consumer [Footnote 2].

Based on this knowledge, the ideal adversary computes the probabilities in Eqs. 1 and 2 [Footnote 3].

Adversary with Public Knowledge. Our second adversarial model is a more realistic one, where the adversary only makes use of public information.

  • Population: Given the population at each location, the adversary estimates the location probability P(l).

  • Product Basket: A product basket indicates which products an average consumer purchases during a year, both in terms of quantity and monetary amount. We leverage the product basket in order to estimate the probability of a product category given the location (\(P(c \mid l)\)) [Footnote 4].

  • Price Dataset: For each location and product category combination, a price value distribution D is available, e.g., the Numbeo or the Chicago dataset. The adversary can use the distribution to estimate \(P(v\mid l, c)\). We define D(lcv) as the number of occurrences of price value v for product category c in location l and D(lc) as the number of price values for product category c and location l.

    Since D might be imperfect, the adversary can have incomplete or incorrect knowledge about the price value probabilities (e.g., unknown or rounded product prices). In this case the adversary should perform additive smoothing, which assigns a small probability \(\alpha \) to each event [26]. Conversely, if the adversary has or assumes complete knowledge of the price value probabilities, additive smoothing is not required.

The adversary with public knowledge computes the following probabilities:

$$\begin{aligned} P(l)&= \frac{\text {Population}(l)}{\sum \limits _{l' \in \mathbb {L}}\text {Population}(l')}\end{aligned}$$
(3)
$$\begin{aligned} P(c\mid l)&= \frac{\text {Basket}(l,c)}{\sum \limits _{c' \in \mathbb {C}}\text {Basket}(l,c')} \end{aligned}$$
(4)
$$\begin{aligned} P(v \mid l, c)&= \frac{D(l,c,v)+\alpha }{D(l,c)+\alpha \cdot |S_U|} \end{aligned}$$
(5)
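As an illustration, the estimates of Eqs. 3, 4 and 5 could be computed as in the following sketch, assuming the public data is given as simple dictionaries (population counts per location, basket amounts per (location, category) pair, and price occurrence counts D). Names and data layout are assumptions made for this example.

```python
def location_prior(population):
    """P(l) from population counts (Eq. 3)."""
    total = sum(population.values())
    return {l: n / total for l, n in population.items()}


def category_probabilities(basket, locations, categories):
    """P(c | l) from the product basket (Eq. 4)."""
    probs = {}
    for l in locations:
        total = sum(basket.get((l, c), 0.0) for c in categories)
        probs[l] = {c: (basket.get((l, c), 0.0) / total if total else 0.0)
                    for c in categories}
    return probs


def value_probability(D, l, c, v, n_events, alpha=0.01):
    """P(v | l, c) with additive smoothing (Eq. 5).

    D maps (location, category, value) to occurrence counts; n_events is |S_U|.
    An adversary with complete knowledge would set alpha = 0."""
    count_v = D.get((l, c, v), 0)
    count_lc = sum(n for (l2, c2, _), n in D.items() if l2 == l and c2 == c)
    return (count_v + alpha) / (count_lc + alpha * n_events)
```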

In order to compute the probabilities defined earlier in Eqs. 1 and 2, the adversary requires access to either \(P(l\mid c, v)\) or \(P(l\mid v)\). Next, we describe how the adversary computes these probabilities and we define the adversary’s knowledge.

2.3 Knowledge Scenarios

As mentioned, the adversary’s objective is to identify the location of the events in \(S_U\). The adversary is given a finite set of events \(S_U\) on which the attack is executed; the adversary is not allowed to choose or request new purchase events e. We consider an adversary with public knowledge and distinguish among three distinct adversarial knowledge scenarios, each consisting of a subset of the public knowledge. Depending on the knowledge scenario, the adversary might not have access to all information from a purchase event e. Therefore, we define a family of functions \(V_\text{scenario}(e) = V(e)\) that filter, depending on the given scenario, the purchase event information accessible to the adversary.

Price: This scenario corresponds to an adversary that has access to multiple purchase events e, but for each event knows only the corresponding price value e.v and a notion of the purchase time e.t. The adversary is not aware of the product e.p or the product category e.c. The precision of the purchase time depends on further specifications of the scenario. More formally, \(V_\text{price}(e) = \{e.v, e.t\}\). Given the public knowledge modeled by Eqs. 3, 4 and 5, the adversary computes the posterior probability \(P(l \mid v)\) that a price value v originates from location l. The intermediate steps for computing \(P(v \mid l)\) and P(v) are detailed in Appendix A in Eqs. 10 and 12.

Price_Merchant: Similar to the former knowledge scenario, the adversary here has access to \(S_U\), a series of multiple purchase events. In this scenario, however, the adversary knows the price value e.v of the event as well as which merchant category m sold the product. Formally, for each purchase event e, \(V_\text{price\_merchant}(e) = \{e.v, e.t, m\}\), where \(V_\text{price\_merchant}\) requires a function \(M(e)=m\). We consider three merchant categories: restaurant, market and local transportation. The \(V_\text{price\_merchant}(e)\) function estimates the merchant category m from the product category e.c of the respective event [Footnote 5]. Analogously, using Eq. 1, the adversary computes the probability of a location, based on the merchant and the price value:

$$\begin{aligned} P(l\mid m, v) = \frac{P(l) \cdot P(m, v\mid l)}{P(m, v)} \end{aligned}$$
(6)

where \(P(m, v\mid l)\) is computed as follows:

$$\begin{aligned} P(m, v\mid l)&= \sum _{c \in M^{-1}(m)} P(c, v \mid l) \end{aligned}$$
(7)
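A short sketch of this computation is given below: the merchant category m groups several product categories (\(M^{-1}(m)\)), so \(P(m, v \mid l)\) is obtained by summing \(P(c, v \mid l)\) over those categories (Eq. 7) before applying Bayes’ theorem (Eq. 6). The mapping merchant_to_categories and the dictionary layout are illustrative assumptions.

```python
def posterior_location_given_merchant_value(prior_location, category_given_location,
                                            value_given_location_category,
                                            merchant_to_categories, m, v):
    """P(l | m, v) via Eqs. 6 and 7 (sketch with assumed data layout)."""
    joint = {}
    for l in prior_location:
        # Eq. 7: P(m, v | l) = sum over product categories sold by merchant m
        p_mv_given_l = sum(category_given_location[l].get(c, 0.0)
                           * value_given_location_category[l].get((c, v), 0.0)
                           for c in merchant_to_categories[m])
        joint[l] = prior_location[l] * p_mv_given_l   # numerator of Eq. 6
    evidence = sum(joint.values())                    # P(m, v)
    return {l: p / evidence for l, p in joint.items()} if evidence else joint
```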

Price_Product-Category: This scenario corresponds to the most knowledgeable adversary with public knowledge. Similarly to the former scenarios, the adversary receives multiple purchase events \(S_U\). In addition, the adversary has access to the product category e.c as well as the price value e.v. Note that e.c implicitly assumes knowledge of the merchant; more formally, \(V_\text{price\_product-category}(e) = \{e.v, e.t, e.c\}\).

Given the public knowledge described in Sect. 2.2, the adversary computes the probability \(P(l\mid c, v)\) of a purchase event with product category c and price value v originating in location l. The intermediate steps for computing \(P(c,v\mid l)\) and \(P(c,v)\) are detailed in the appendix in Eqs. 11 and 13.

In the following section we provide an intuitive perspective on the probabilities \(P(l \mid v)\) and \(P(l \mid c,v)\).

2.4 Conditional Probability Intuition

\(P(l \mid v)\) is the probability of a location, given a price value in a purchase event. An example plot based on our evaluation can be found in Fig. 2. We have chosen the purchase event e with a price value of \(e.v = 1\) Euro and estimated the location of the price. The figure shows that the most likely location for 1 Euro is France, closely followed by Germany, Italy and Spain. The plot also shows \(P(l\mid c,v)\) for a purchase event with \(e.v = 1\) Euro and product category milk. The most likely country is again France, followed by Germany and Italy. Surprisingly, China ranks \(5^{th}\). This can be explained by the fact that (i) some prices from China in the dataset were erroneously reported in Euros and (ii) the location probability P(l) influences the overall outcome, and, since China’s population is considerable, there is an increased probability of purchases occurring there. Overall we observed that the probability distribution changes when the product category is known, i.e., France is more likely to have a 1 Euro price for milk than a 1 Euro price in general.

Fig. 2. Probability distribution of \(P(l \mid v)\) and \(P(l \mid c,v)\), given 1 Euro and milk.

2.5 Multiple Purchase Events

Up to this point, the analysis has been based on a single purchase event. To naturally combine multiple purchase events, we assume that the purchase events are conditionally independent, given the location l. Therefore, the probability of a location l, given a set of purchase events \(S_U\), is calculated as follows:

$$\begin{aligned} \begin{aligned} P(l\mid S_U)&= P(l\mid V(e_1), V(e_2), \dots , V(e_n)) \\ \\&= \frac{P(l) \cdot \prod \limits _{e \in S_U} P(V(e)\mid l)}{P(V(e_1), \ldots , V(e_n))} \end{aligned} \end{aligned}$$
(8)
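In practice, the product in Eq. 8 over many purchase events quickly underflows, so an implementation would typically accumulate log-likelihoods. The sketch below illustrates this, with per_event_likelihood standing in for \(P(V(e) \mid l)\) of whichever knowledge scenario applies; both the function and its name are assumptions for this example.

```python
import math

def posterior_location_given_events(prior_location, per_event_likelihood, events):
    """P(l | S_U) via Eq. 8, computed in log space for numerical stability."""
    log_scores = {l: math.log(p) for l, p in prior_location.items() if p > 0}
    for e in events:
        for l in list(log_scores):
            lik = per_event_likelihood(l, e)   # P(V(e) | l)
            if lik > 0:
                log_scores[l] += math.log(lik)
            else:
                # zero likelihood rules out l (additive smoothing, Eq. 5, avoids this)
                del log_scores[l]
    if not log_scores:
        return {}
    # normalise: subtract the maximum log-score before exponentiating
    m = max(log_scores.values())
    unnorm = {l: math.exp(s - m) for l, s in log_scores.items()}
    z = sum(unnorm.values())
    return {l: u / z for l, u in unnorm.items()}
```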

The intermediate steps for computing \(P(l \mid S_U)\) can be found in the appendix in Eq. 18. We experimentally verified the conditional independence of V(e) given l for the three knowledge scenarios, and therefore Eq. 8 applies equally to the different adversarial knowledge scenarios. Note that we effectively weaken the adversary by considering the products of different purchases independent of each other.

2.6 Privacy Metrics

We introduce three privacy metrics in order to capture the privacy of consumers revealing their purchase histories across different dimensions: We (i) measure the performance of the adversary in identifying the true location with the \(F_1\)-score. Then, (ii) using the notion of mutual information [18], we quantify the absolute privacy loss of the consumer due to the adversary’s knowledge of a price dataset. Finally, (iii) we use the relative reduced entropy as a relative privacy metric [Footnote 6].

\(F_1\)-score: The objective of the adversary is to assign the purchase events to the correct location. In the worst case, the adversary is forced to randomly guess among all possible locations. If the adversary, however, can estimate location probabilities more accurately, location privacy is reduced. Our problem corresponds to a multi-class classification problem and we therefore quantify the adversarial performance by averaging the \(F_1\)-score [35] of each individual class. The \(F_1\)-score corresponds to the harmonic mean of recall and precision, measuring the test’s accuracy.
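As a small illustration, and assuming scikit-learn is available, the macro-averaged \(F_1\)-score over the predicted locations (the argmax of \(P(l \mid S_U)\) per test sample) can be computed as follows; the labels shown are purely illustrative.

```python
from sklearn.metrics import f1_score

true_locations      = ["USA", "France", "India", "USA"]     # ground-truth locations
predicted_locations = ["USA", "Germany", "India", "USA"]    # argmax of P(l | S_U)

# Macro average: the F1-score (harmonic mean of precision and recall)
# is computed per location class and then averaged over all classes.
print(f1_score(true_locations, predicted_locations, average="macro"))
```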

Mutual Information: A purchase event dataset enables the adversary to infer the distribution of prices among locations. Therefore, we want to measure how much privacy consumers lose when their purchase events are revealed and when the adversary has access to a dataset of purchase events. We quantify this privacy objective by measuring the absolute reduced location entropy given the purchase events. To this end, we use the mutual information [18], denoted by \(I(l, V(e))\), which measures how much the entropy of the locations is reduced given the purchase events (cf. Eq. 9).

$$\begin{aligned} I(l, V(e)) = \sum _{l \in \mathbb {L},e \in S_U}P(l, V(e)) \cdot \log _2\frac{P(l,V(e))}{P(l)P(V(e))} \end{aligned}$$
(9)

Relative Reduced Entropy: Recall that the mutual information quantifies what we call the absolute privacy loss. In fact, there is an inherent randomness in the price distribution among locations. It is important to capture to what extent the original uncertainty about the locations can be reduced when a dataset of purchase events is given. The relative reduced entropy therefore captures the relative privacy, as the complement of the fraction of the conditional entropy over the location entropy. Given \(H(l) = I(l, V(e)) + H(l \mid V(e))\), we compute the relative reduced entropy as \(1-\frac{H(l \mid V(e))}{H(l)}\) over all purchase events.
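The two information-theoretic metrics can be computed directly from the joint distribution \(P(l, V(e))\), as in the sketch below; the dictionary representation of the joint distribution is an assumption for this example, and both quantities are measured in bits.

```python
import math

def mutual_information(joint):
    """I(l, V(e)) per Eq. 9; `joint` maps (location, observation) to P(l, V(e))."""
    p_l, p_o = {}, {}
    for (l, o), p in joint.items():
        p_l[l] = p_l.get(l, 0.0) + p
        p_o[o] = p_o.get(o, 0.0) + p
    return sum(p * math.log2(p / (p_l[l] * p_o[o]))
               for (l, o), p in joint.items() if p > 0)


def relative_reduced_entropy(joint):
    """1 - H(l | V(e)) / H(l), which equals I(l, V(e)) / H(l)."""
    p_l = {}
    for (l, _), p in joint.items():
        p_l[l] = p_l.get(l, 0.0) + p
    h_l = -sum(p * math.log2(p) for p in p_l.values() if p > 0)
    return mutual_information(joint) / h_l if h_l > 0 else 0.0
```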

The proposed evaluation metrics are independent of a particular adversarial strategy. As a result, the output of the privacy leakage quantification depends only upon the employed dataset of purchase events. In the next section we present the datasets utilized for our experimental evaluation.

3 Datasets

Only a few datasets accurately capture worldwide product price information. For individual products (e.g., a Big Mac [5] or Starbucks coffee [8]), the average price values per country are available. Because a product often appears multiple times with different price values in the same country or city, the average is not a good estimator for elaborate studies. In the following, we describe the three independent price datasets considered in our work.

The first dataset, Numbeo [9], is a crowd-sourced dataset containing worldwide price values per product category, city and country. It is, to our knowledge, the most complete dataset of worldwide harvested prices available. We restricted our analysis to 23 frequently bought product categories, and split the Numbeo dataset into two separate datasets: (i) two years of data as the Numbeo dataset and (ii) five months of data as the Numbeo test dataset (cf. Table 3). Numbeo performs sanity checks on the crowdsourced inputs, and we additionally filtered extreme outliers [3] [Footnote 7] from the data to account for possible mistakes in crowdsourced data. We identified 112 countries, with a total of 328,720 price values. Note that the provided data mostly contains prices from the US (18 %) and India (14 %).

The second dataset, referred to as the Chicago dataset [11], covers 84 stores in the Chicago metropolitan area over a period of five years. The data is sourced on a weekly basis from Dominick’s supermarket stores. We sample 85 weeks with the most data, each containing on average 283,181 prices, spanning 28 product categories for an average of 6304 different products.

The third dataset originates from Kaggle [7], a Machine Learning competition platform. The dataset contains 350 million purchase events from 311,539 consumers across 134 store chains. The data is anonymized, but contains the individual product price, product category, date of purchase and purchase amount. Most purchase events cost less than 25 USD. The country of the dataset is not disclosed, but purchase prices are given in USD and purchase amounts are described in the imperial system.

In order to estimate the location probability, an adversary requires knowledge of the population in each location. At the country granularity, we use the data available from the World Bank [12] for the year 2013, while at the US city granularity we use the data from the US Census Bureau [37].

As described in Sect. 2.2, we increase the knowledge of the adversary with the product basket. A product basket details which and how many products an average person purchases, both in terms of quantity and monetary amount. We leverage a national product basket [4] from 2010 containing over 300 product categories in order to infer the ratio in which different products are bought over the year.

4 Experimental Evaluation

In this section we evaluate the adversarial models designed in Sect. 2.2. We start by presenting the assumptions and choices made for the evaluation.

4.1 Experimental Considerations

With respect to the value probability \(P(v \mid l,c)\), we assume that the frequency of price values in the Numbeo dataset reflects the frequency of real-world purchase events with the corresponding price values. This is a natural assumption and is further motivated by the fact that, e.g., Numbeo contributors likely entered the most popular price values for the considered product categories. Because our datasets contain a limited number of products and product categories, our analysis is naturally confined to the available products. Note that, if the adversary knows the product categories of the purchases, e.g., milk, other categories such as apples can be ignored, which allows precise predictions with knowledge of only a few products. In order to compute the product category probability, \(P(c \mid l)\), we only consider one national product basket and apply it to every country. Note that we do not use the product basket as an indicator of how much money is spent on average by a person, but rather as an indicator of the ratio in which products are bought.

Sampling Price Values: Given a location l, we generate synthetic consumer purchase events by sampling price values from the respective dataset. For the three datasets we consider adversaries with complete knowledge of the price values. In addition, we instantiate an adversary with incomplete knowledge with the Numbeo test dataset. Given the product basket of the location l, we compute the probability of a product category being sampled (cf. Eq. 4). Thus, we sample each product category with the product category probability \(P(c \mid l)\). For each location we repeat the sampling of the price values \(n=1000\) times and average the result.
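A minimal sketch of this sampling procedure is shown below, assuming the basket-derived category probabilities \(P(c \mid l)\) and the per-(location, category) lists of observed prices are available; the data layout and names are illustrative.

```python
import random

def sample_purchase_events(location, category_probs, price_lists, n_events, seed=None):
    """Draw synthetic purchase events for `location`: first a product category
    according to P(c | l) (Eq. 4), then a price value observed for that
    (location, category) pair in the price dataset."""
    rng = random.Random(seed)
    categories = list(category_probs[location])
    weights = [category_probs[location][c] for c in categories]
    events = []
    for _ in range(n_events):
        c = rng.choices(categories, weights=weights, k=1)[0]
        v = rng.choice(price_lists[(location, c)])
        events.append({"value": v, "category": c, "location": location})
    return events
```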

Additive Smoothing Parameter: In the case of an adversary with incomplete knowledge, we make use of additive smoothing to avoid zero probabilities when aggregating the probabilities of multiple purchase events for locations (see Sect. 2.2). We choose a smoothing parameter \(\alpha = 0.01\) which provides us with the best results on our data (cf. appendix Fig. 6).

In the following, we evaluate up to three knowledge scenarios (cf. Sect. 2.3) for four location granularities: (i) across 112 countries worldwide; (ii) across 23 cities within the United States; (iii) across 84 stores within the Chicago metropolitan area; (iv) we distinguish among 134 store chains in a country.

4.2 Country Granularity

The adversary has to distinguish 112 candidate countries for each purchase event. We quantify the privacy given the three privacy metrics defined in Sect. 2.6. In particular, we performed our study in two settings. First, (i) we assumed that the adversary does not have complete knowledge. This means that the adversary receives purchase events from the Numbeo test dataset and estimates their location based on the Numbeo dataset. In the second case, (ii) the adversary assumes complete knowledge of the price values, and therefore the sampled prices are included in the price dataset that constitutes the adversarial knowledge.

Figure 3 shows the \(F_1\)-score for the first case based on the number of purchase events accessible to the adversary. Given one purchase event, the price, price_merchant and price_product-category knowledge scenarios achieve an average of 0.38, 0.41 and 0.49 respectively. The high \(F_1\)-score after one purchase event shows that even a single event allows a decent prediction. We observe that the adversary is more likely to identify the correct location when it knows the product category of the purchase event. In comparison, if the adversary has access to 10 purchase events, the respective \(F_1\)-scores are 0.80, 0.85 and 0.90. In other words, 10 purchase events significantly improve the ability of the adversary to identify the location of the purchase events. The reported values are averaged over \(n=1000\) iterations.

Fig. 3. \(F_1\)-score for identifying the country given purchase events sampled from the Numbeo test dataset, corresponding to incomplete knowledge. We are not overfitting, as we successfully classify new prices based on previously known prices.

Fig. 4. \(F_1\)-score for identifying the country given purchase events sampled from the Numbeo dataset, corresponding to complete knowledge. Averaging does not hide poorly performing countries (cf. appendix).

Figure 4 corresponds to the second case, where the adversary assumes complete knowledge of the price values. We observe that the adversary can distinguish more accurately between the possible locations. The \(F_1\)-scores are averaged over all considered countries. For each considered country in the price knowledge scenario, we verify that averaging does not hide poorly performing countries (cf. Fig. 7 in the appendix).

Table 1 presents the results of the mutual information and the relative reduced entropy for each knowledge scenario. We observe that the price_product-category knowledge scenario reduces the entropy more significantly than the other knowledge scenarios. Naturally, this is because the price_product-category knowledge scenario provides the adversary with more information than the price knowledge scenario, thus effectively reducing uncertainty when identifying the location.

Table 1. Mutual information and relative reduced entropy for the three knowledge scenarios when estimating the country, city, store or chain of purchase events. The abbreviations P., PM. and PPC. stand for the Price, Price Merchant and Price Product-Category knowledge scenarios, respectively.

4.3 US City Granularity

In this section we analyze an adversary that aims to distinguish among the purchase events of 23 US cities. As before, we quantify the privacy based on the three privacy metrics defined in Sect. 2.6. We sample and test purchase events on the Numbeo dataset only, since our test dataset does not contain sufficiently many purchase events per considered US city.

Figure 10 illustrates the \(F_1\)-score depending on the number of purchase events. We observe that after 10 purchase events, the \(F_1\)-score is greater than 0.7. Therefore, our methodology also provides accurate estimations at the city granularity. Table 1 reports the mutual information and relative reduced entropy when estimating the US city. We observe that the relative reduced entropies at the country and city granularities match across the knowledge scenarios. This exemplifies the usefulness of the relative reduced entropy for highlighting similarities across different price datasets.

4.4 Chicago Metropolitan Granularity

In this section, we analyze an adversary that aims to distinguish among the purchase events of 84 Dominick’s stores within the Chicago metropolitan area. We sample the price values from the Chicago dataset, and assume an adversary with complete knowledge; we therefore do not apply additive smoothing. We consider the location prior probability P(l) to be uniform, because we do not have reliable store popularity information for the Chicago area.

In Fig. 11 we observe that, given 100 purchase events, the adversary can identify a local store with high confidence. We expected a weaker result, since all stores are operated by the same chain, implying relatively similar price structures. We ran our attack on each of the 85 weeks with the most data, averaged the results and report the standard deviation, shown as the blue area of Fig. 11.

Table 1 shows that the Chicago price dataset reveals less information about the considered locations than the Numbeo dataset. This observation holds for both knowledge scenarios, and is consistent with the result that more price points are required to localize purchase events within the Chicago area.

4.5 Store Chain Granularity

The large-scale Kaggle dataset does not provide precise location information for purchase events, but allows the adversary to distinguish among 134 store chains. Knowing the store chain of purchase events effectively reduces the possible locations of the purchases. Note that the prices in the Kaggle dataset are distributed over a year and the adversary therefore does not know the precise time of the purchase events.

We uniformly sample purchase events of different consumers and perform our attack on the Kaggle dataset. Figure 5 reveals that, given approximately 250 price values, we achieve an \(F_1\)-score of over 0.95 for the origin of the purchase events. Note that the price_product-category knowledge scenario is particularly strong due to the large number of product categories. This is reflected by its particularly high mutual information (cf. Table 1).

Fig. 5. \(F_1\)-score for identifying the store chain. The purchase events are sampled from the Kaggle dataset.

Given these results, we conclude that our framework and methodology apply to a wide variety of price datasets and allow us to quantitatively compare their respective privacy leakage. In the following, we extract further insights from our data to strengthen the attack.

4.6 Most Revealing Product Category

In this section we investigate which of the 23 considered product categories from the Numbeo dataset leak the most information. This is a useful insight, since an adversary would pick purchase events of such product categories in order to increase the probability of correctly identifying their location. Therefore, with the mutual information we measure the extent to which the location entropy is reduced, given the purchase events of a particular product category. Contrary to the previous analysis, we evaluate the mutual information per product category based on the price_product-category knowledge scenario defined in Sect. 2.3. More specifically, we compute the mutual information using only purchase events of a particular product category.

The results of the evaluation can be found in Fig. 13. According to this metric, the most revealing product categories are milk, a one-way ticket for local transportation, and a loaf of white bread. On the contrary, the product categories that disclose less information about a location are oranges, chicken breasts and rice.

4.7 Required Time Precision

Previously, we assumed that knowledge of the exact currency conversion rates is required to compare non-localized purchase events. Exact currency conversion rates, however, require precise knowledge of the purchase event times. In this section, we show that our attack does not require the exact currency conversion rates, but also works if the adversary knows only the date or even the week of the purchase, i.e., has an uncertainty of 24 h or 7 days with respect to the conversion rates. We therefore relax the requirements on the time precision.

Due to the conversion rate differences, the adversarial estimation of \(P(v \mid l,c)\) is inaccurate. To compensate for the conversion rate differences, the adversary can use a price tolerance. We study two options for the tolerance: a static tolerance and a dynamic tolerance. For the static tolerance, the adversary estimates \(P(v \mid l,c)\) in the presence of uncertainty by considering price values in the interval \([v-tol_s,v+tol_s]\) where the static tolerance \(tol_s\) is a small amount in global currency (e.g., 0.02 USD). The dynamic tolerance value \(tol_d\) is a percentage-wise estimate of uncertainty (e.g., 2 %). To estimate \(P(v \mid l,c)\) the adversary considers price values from the interval \([v\cdot (1-tol_d),v\cdot (1+tol_d)]\).
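The following sketch shows how the tolerance window could enter the estimate of \(P(v \mid l,c)\), counting all observed prices that fall inside the static or dynamic interval; the function name, the simplified smoothing, and the data layout are assumptions for this example.

```python
def value_probability_with_tolerance(prices, v, tol=0.02, dynamic=True, alpha=0.01):
    """Estimate P(v | l, c) from `prices`, the values observed for one
    (location, category) pair, counting values inside the tolerance window."""
    if dynamic:
        lo, hi = v * (1 - tol), v * (1 + tol)   # dynamic tolerance, e.g. 2 %
    else:
        lo, hi = v - tol, v + tol               # static tolerance, e.g. 0.02 USD
    matches = sum(1 for p in prices if lo <= p <= hi)
    return (matches + alpha) / (len(prices) + alpha)
```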

We evaluated the attack to infer the country of purchase events with imprecise purchase times and compensated the time error with different tolerance values. To simulate imprecise purchase times, we converted the adversarial knowledge using conversion rates of 30 different days from the year 2014 and then converted the non-localized purchase events \(S_U\) using the previous days’ conversion rates. As before, we computed the \(F_1\)-score to evaluate the quality of the estimated \(P(l \mid S_U)\).

For static and dynamic tolerance values, we found that the attack is still accurate, i.e., it reaches an \(F_1\)-score above 95 % with fewer than 50 purchase events. A higher tolerance value has two opposing effects: (i) it compensates for differences in currency conversion rates and increases the number of correctly considered price values; (ii) a higher tolerance, however, also increases the number of incorrectly considered price values which fall into the larger intervals. The tolerance value therefore presents a trade-off between the true-positive and true-negative rates. Our experimental results reflect this trade-off both for static and dynamic tolerance values (cf. Appendix B). Based on our experimental results we propose a dynamic tolerance of 2 % for a 24 h time imprecision.

We also evaluated an uncertainty of one week in the currency conversion rates. We used real-world currency conversion rates that were seven days apart from each other. Figure 14 shows the result of this experiment for the different knowledge scenarios and a dynamic tolerance value of 2 % on the Numbeo dataset. We conclude that our attack does not require precise purchase event times.

5 Related Work

Location Privacy. Blumberg et al. [16] provide a non-technical discussion of location privacy, its issues and implications. Gruteser and Grunwald [23] initiated major research on anonymization approaches to location privacy. Further, Narayanan et al. [29] investigate location privacy from a theoretical standpoint and present a variety of cryptographic protocols motivated by and optimized for practical constraints, focusing on proximity testing. Shokri et al. [34] propose a formal framework for quantifying location privacy in the case where users expose their location sporadically. They model various location-privacy-preserving mechanisms, such as location obfuscation and fake location injections. This work is orthogonal to ours, since in our setting the consumers are not willingly revealing their locations. Voulodimos et al. [38] address the issue of privacy protection in context-aware services through the use of entropy as a means of measuring the capability of locating a user’s whereabouts and identifying personal selections. Narayanan and Shmatikov [28] propose statistical de-anonymization attacks against high-dimensional micro-data. We do not rely on their methods, since we are not aiming to de-anonymize the consumers. De Montjoye et al. [39] show that consumers can be uniquely identified within credit card records with only a few spatiotemporal triples containing location, time and price value. Contrary to their work, we focus on the price values and we localize rather than identify consumers.

Payment systems. The privacy implications of public transaction prices have been widely ignored. One prominent example is Bitcoin [17, 33], where transactions are exchanged between peers by means of pseudonyms. The actual transaction prices are archived and publicly available. The literature features many different methods for analyzing the privacy implications of Bitcoin, e.g., by means of appropriate heuristics [13], tainting [22], or other techniques [21, 32]. Reid and Harrigan [31] analyze the flow of Bitcoin transactions in a small part of the Bitcoin log, and show that external information, such as publicly-announced addresses, can be used to link identities and organizations to some transactions. In [27] the authors propose Zerocoin, a cryptographic extension to Bitcoin that augments the protocol to allow for fully anonymous currency transactions using a distributed e-cash scheme. To the best of our knowledge, only two contributions [14, 15] have aimed to hide the transaction prices in Bitcoin.

Price rigidity. Herrmann and Moeser [24] perform a quantitative analysis of price variability and conclude that prices are often rigid for several weeks. Pricing strategies for identical brands, however, vary significantly among retailers. Their observations match the studies of the Big Mac index [5] (the Economist), the Starbucks coffee index [8] (the Wall Street Journal) and the Ikea Billy Bookshelf index [2] (Bloomberg), which show that prices of identical products from a single brand vary across locations. Dutta et al. [20] find that retail prices respond promptly to direct cost changes as well as to upstream manufacturers’ costs. Hosken and Reiffen [25] find that each product has a price mode, i.e., a price at which the product stays most of the time. Note that Hosken’s non-public dataset contains nearly as many price observations as our Numbeo dataset.

6 Conclusion

Having a systematic methodology to reason quantitatively about the privacy leakage from datasets containing price-relevant information is a necessary step toward avoiding such leakage. While further tests with more datasets are needed to claim in general that price values alone can reveal the location of a purchase, our empirical results provide evidence that with relatively few purchase events it is possible to identify a consumer’s location. In this paper, we have raised the following two questions: How much location information is leaked by consumer purchase datasets? How can it be quantified under the considered adversarial model and knowledge? In our proposed framework, we have modeled several adversaries and quantified the privacy leakage along different dimensions. We make extensive use of Bayesian inference in our framework to model the different attack strategies. Our framework can be easily applied to any price dataset of consumer purchases and allows one to compare the privacy leakage of different datasets. We applied our methodology to three real-world datasets and achieved comparable results. The results presented in this paper strongly motivate the need for careful consideration when sharing price datasets and should be taken into account when designing public ledger cryptocurrencies.