
1 Introduction

Existing research on data privacy protection approaches the problem from the following perspectives: anonymity-based methods, encryption-based methods, noise-based methods, and differential privacy-based methods. There are many reliable encryption technologies, such as DES [7], 3DES, Blowfish [23], RC5 [21], IDEA, and RSA. The advantage of encryption technologies is their security; however, encrypted data loses its analyzability. Anonymity-based methods preserve the data's analyzability, and the main anonymity technologies are k-anonymity [24], L-diversity [3], and T-closeness [18]. However, anonymity-based methods have a fatal weakness: anonymized data may be subject to de-anonymization attacks. For data organizers, security and privacy problems arise in data collection and publishing. Among data privacy attacks, the differential attack is one in which the attacker infers private data from statistical information over two nearly identical datasets. For example, an attacker can infer a person's specific purchases through differential attacks via different queries. To learn whether a person bought an object, the attacker can issue two queries: one obtains the count of persons who bought the object, and the other obtains the same count over the dataset that excludes the person, identified by quasi-identifiers such as timestamp, gender, region, and age. From the two query results, the attacker can infer whether the person bought the object.

To defend against differential attacks, many differential privacy algorithms can be used, such as the matrix mechanism [17], the DAWA algorithm [16], MWEM [13], and RAPPOR [10]. Differential privacy technology has been applied in many fields [5, 6, 11, 20, 26]. Differential privacy was first defined by Dwork et al. [8, 9]; it protects individual data by injecting noise into the results according to a privacy budget. A number of \(\varepsilon \)-differential privacy algorithms have been proposed [2, 13, 15,16,17], and some of them are workload-aware and data-dependent [2, 13, 16, 17]. In terms of how results are perturbed, \(\varepsilon \)-differential privacy adopts three mechanisms: the Laplace Mechanism [8], the Exponential Mechanism [19], and Randomized Response [25]. Randomized response is an effective way to protect the privacy of a frequency vector, and it has been used to protect privacy when collecting sensitive data since the 1960s. RAPPOR [10] is an \(\varepsilon \)-differential privacy technology, based on randomized response, that Google has deployed in its browser. MWEM [13] is a classical \(\varepsilon \)-differential privacy algorithm based on a combination of the Exponential Mechanism with the Multiplicative Weights update rule: it selects and poses queries using the Exponential and Laplace Mechanisms and improves the approximation using the Multiplicative Weights update rule. DAWA [16] is a data-dependent and workload-aware two-stage mechanism for answering range queries under \(\varepsilon \)-differential privacy; it adds noise according to the input data and the given workload. In 2016, Michael Hay et al. proposed DPBENCH [14], a framework for standardized evaluation of privacy algorithms. In 2018, Dan Zhang et al. [27] proposed \(\epsilon \)ktelo, a programming framework and system for implementing the existing algorithms; for the task of answering linear counting queries, \(\epsilon \)ktelo allows both privacy novices and experts to easily design algorithms. APEx [12] is a novel differential privacy system that allows data analysts to pose adaptively chosen sequences of queries along with required accuracy bounds.

Most of these algorithms depend on the data distribution. In particular, when the data items are sparse, i.e., a large number of items are empty, these algorithms can effectively reduce the introduced error; the same conclusion is reached in [14, 16]. However, these algorithms are not suitable for all data situations: when the data items are dense and the data has a complex distribution, they perform poorly, as also shown in [14, 16]. Current \(\varepsilon \)-differential privacy algorithms incur large computation error for linear computations over dense data domains. Inspired by partitioning of the data domain, we propose a novel \(\varepsilon \)-differential privacy algorithm that uses Laplace-based noise and an optimized workload division to decrease the computation error in complex data situations. We make the following contributions:

(1) We propose a novel \(\varepsilon \)-differential privacy algorithm for complex data situations. We use Laplace-based noise to perturb the query results; this workload-aware perturbation reduces the error of linear computation queries over dense data items.

(2) We propose an optimized workload division method. We divide the queries recursively to reduce the added noise. This division effectively reduces the computation error when there exists a hot spot, i.e., some part of the domain is frequently queried in the workload.

(3) We conduct extensive experiments on six real-world datasets and compare with existing differential privacy algorithms (MWEM, DAWA, and Identity). The evaluation results show that the proposed algorithm can effectively reduce the computation error and achieves comparatively better efficiency.

2 Approach Overview

We propose an \(\varepsilon \)-differential privacy algorithm for linear computation queries. The algorithm aims to reduce the error of the results in the case where the sensitivity of the workload is high, some dom(\(\mathbb {B}\)) items are queried frequently (due to hot topics or statistical attack queries), and the frequency vector x has a complex distribution. The algorithm adopts the Laplace Mechanism to perturb the query results. To reduce the random noise added, we propose a novel perspective: the added noise may be reduced by dividing the queries into several clusters and adding Laplace-based noise to each cluster separately. Furthermore, based on this division, we propose a simple and effective recursive division of the query workload and the privacy budget. The method recursively divides the query workload and privacy budget into two parts whenever the expected noise after dividing is less than that before dividing. To sum up, the algorithm addresses three problems: (1) current \(\varepsilon \)-differential privacy algorithms can reduce the error only to a limited extent, while costing considerable computation resources; (2) when the sensitivity of a query workload is large, current algorithms cannot reduce the noise appreciably, as shown in [16]; (3) when the data distribution is dense, current algorithms do not fit well and introduce large error into the answers to the query workload.

The method we propose satisfies \(\varepsilon \)-differential privacy rigorously. \(\varepsilon \)-differential privacy is the privacy protection mechanism proposed by Dwork in 2006, and it gives a rigorous guarantee of privacy protection. We define \(\varepsilon \)-differential privacy formally below.

Definition 1

(\(\varepsilon \)-differential privacy). An algorithm M is an \(\varepsilon \)-differential privacy algorithm if, for any neighboring databases I and \(I'\ (\left| I-I'\right| \le 1)\) and any subset of outputs S, the following formula is satisfied:

$$Pr[M(I) \in S ] \le \exp \left( {\varepsilon }\right) \times Pr[M(I') \in S ]$$

The \(\varepsilon \)-differential privacy algorithm protects private data by perturbing the answer so that attackers cannot distinguish the results over the neighboring databases I and \(I'\). The parameter \(\varepsilon \) is the privacy budget, and it determines the privacy-preserving capacity: the lower the privacy budget, the more effectively the algorithm protects privacy. For a random algorithm M, if the output distributions over the two adjacent datasets I and \(I'\) are close to each other, it is difficult to infer whether a particular data item exists from M(I) and \(M(I')\). \(\varepsilon \)-differential privacy has the following three primary properties.

Property 1

If the random algorithm \(M_1\) satisfies \(\varepsilon \)-differential privacy and \(f:R\rightarrow R'\) is an arbitrary deterministic function, then \(f(M_1(X))\) still satisfies \(\varepsilon \)-differential privacy.

Property 2

Let each random algorithm \(M_i\) satisfy \(\varepsilon _{i}\)-differential privacy, and define the random function M as the sequential composition of \(M_1,\ldots ,M_k\). Then M satisfies \(\sum _{i=1}^{k}\varepsilon _i\)-differential privacy.

Property 3

Let the data set X be composed of k disjoint data sets \(\{X_1,...,X_i,...,X_k\}\), and let each \(M_i(X_i)\) satisfy \(\varepsilon _i\)-differential privacy. Then \(M(X) = \{M_1(X_1),...,M_k(X_k)\}\) satisfies \(\mathop {max}\limits _{i}\varepsilon _i\)-differential privacy.

Our algorithm reduces the error of the results when answering linear computation queries under \(\varepsilon \)-differential privacy; we now define the linear computation query. Consider a database instance I with relational schema attributes \(\mathbb {A} = \{A_1, A_2,...,A_l\}\). Each attribute in \(\mathbb {A}\) can be discrete or continuous; continuous data can be treated as discrete over its data domain as well. A workload is a set of queries over the attributes \(\mathbb {B} =\{B_1,B_2,...,B_k\}, \mathbb {B} \subseteq \mathbb {A}\). For example, if the workload consists of three-dimensional range queries over attributes \(A_1\), \(A_2\), and \(A_3\), then \(\mathbb {B}=\{A_1,A_2,A_3\}\). We then define a frequency vector x over dom(\(\mathbb {B}\)). For example, if dom\((\mathbb {B})=\{(1,1,1),(1,1,2),...\}\), then for each dom\((\mathbb {B})_i\), \(x_i\) is the frequency of tuples with values dom\((\mathbb {B})_i\). A linear computation query computes a linear combination of the frequencies in x, as illustrated by the following SQL query; Definition 2 gives the formal definition.

$$ Select \ count (*) \ from \ R \ Where \ dom(\mathbb {B}) = dom(\mathbb {B})_i \ or \ ... \ or \ dom(\mathbb {B}) = dom(\mathbb {B})_k $$
Fig. 1. A sample of linear computation query workload, frequency vector, and answer to the workload.

Definition 2

(Linear computation query). A linear computation query is a length-n vector \(q=[q_1,q_2,...,q_n]\), where each \(q_i \in \{0,1\}\). The answer to a linear query q on x is the vector product \(q\cdot x =q_1x_1+q_2x_2+\cdots +q_nx_n\).

A linear computation query is called a range count query, a linear count query, or a point count query when q corresponds to a range, an arbitrary length-n vector, or a single position in x, respectively. For example, with \(q = [1,0,1,0]\) and \(x = [5,2,7,1]\), the answer is \(q\cdot x = 5+7 = 12\).

In the data collection setting, the data organizer can compute the frequency vector x and answer linear computation queries over it. The workload W consists of a set of linear computation queries. If W is an \(m\times n\) matrix, it represents m length-n linear computation queries, and the query results can be computed as the matrix product \(W\cdot x\). The linear computation query is one of the most important and most common queries in data mining and data analysis: it helps the analyst understand the distribution of the data, make intelligent decisions, and perform data prediction. Figure 1 shows a workload W, a frequency vector x, and the answer to W over x.
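To make the matrix-product view concrete, the following minimal sketch (Python with NumPy; the workload and counts are toy values, not those of Fig. 1) evaluates a small workload against a frequency vector:

```python
import numpy as np

# Toy workload W: 3 linear computation queries over a length-4 frequency
# vector. Each row is a query q with entries in {0, 1} (Definition 2).
W = np.array([[1, 1, 0, 0],
              [0, 1, 1, 1],
              [1, 0, 0, 1]])

# Toy frequency vector x: x[i] is the count of tuples with value dom(B)_i.
x = np.array([5, 2, 7, 1])

# The answer to the workload is the matrix product W . x.
print(W @ x)  # [ 7 10  6]
```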

3 Laplace-Based Perturbation

Our algorithm adopts the Laplace Mechanism to add noise, and we adapt the Laplace Mechanism [8] to fit the query workload and data distribution. To ensure that our algorithm satisfies \(\varepsilon \)-differential privacy, the algorithm adds random noise that rigorously conforms to the Laplace distribution.

3.1 Laplace Mechanism

The Laplace Mechanism [8] was proposed by Dwork; its key idea is to add noise randomly generated from the Laplace distribution to the query results. The probability density function of the Laplace distribution is:

$$\begin{aligned} Lap\left( x;a,b\right) =\frac{1}{2b}\exp {\left( -\frac{\left| x-a \right| }{b}\right) } \end{aligned}$$
(2)

The variance of a random variable that follows the Laplace distribution is \(\sigma ^2=2b^2\). To make an algorithm satisfy \(\varepsilon \)-differential privacy, we add random noise drawn from the Laplace distribution with location \(a=0\), and we denote this random variable as Lap(b) in the following sections. For different queries or query workloads, the Laplace Mechanism adds different noise according to the sensitivity of the query or workload.
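As a quick numerical check of these facts (a sketch using NumPy's Laplace sampler; the scale b = 2 is arbitrary), samples from Lap(b) should have variance close to \(2b^2\) and mean absolute value close to b:

```python
import numpy as np

rng = np.random.default_rng(0)
b = 2.0  # scale parameter of Lap(b)

# Draw samples from the Laplace distribution with location 0 and scale b.
samples = rng.laplace(loc=0.0, scale=b, size=1_000_000)

print(samples.var())           # close to 2 * b**2 = 8.0
print(np.abs(samples).mean())  # close to b = 2.0
```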

Definition 3

(Sensitivity). Given a query q and neighboring frequency vectors x and \(x'\), the sensitivity of the query q is:

$$\varDelta _q=\max {\left| \left| q\left( x\right) -q\left( x'\right) \right| \right| _1}\ \ \left( \left| \left| x-x'\right| \right| _1\le 1\right) $$

It can be seen that the sensitivity of a query is the maximum change in the answer to the query over neighboring frequency vectors. When the sensitivity of a query is high, the private data has a high probability of being attacked: the presence or absence of certain data can greatly change the result of the query, which makes the sensitive data easier to infer. For a query workload W, we use an \(m\times n\) matrix to represent W, as shown in Fig. 2. Following the sensitivity of a single query, the sensitivity of the query workload W can be defined as follows:

$$\varDelta _W=\max \parallel Wx^t- Wx'^t\parallel _1 = \mathop {max}\limits _{j}\sum _{i=1}^{m}\left| W_{ij}\right| \ \ (\parallel x - x'\parallel _1\le 1)$$
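In code, \(\varDelta _W\) is simply the largest column-wise \(L_1\) norm of W; a minimal sketch over the same toy workload as before:

```python
import numpy as np

W = np.array([[1, 1, 0, 0],
              [0, 1, 1, 1],
              [1, 0, 0, 1]])

# Adding or removing one tuple changes one entry of x by 1, so the answer
# vector W.x changes by at most the largest column L1 norm of W.
delta_W = np.abs(W).sum(axis=0).max()
print(delta_W)  # 2
```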

Given the definitions of the Laplace distribution and sensitivity, we can formally define the Laplace Mechanism as follows.

Definition 4

(Laplace Mechanism). Given a workload W and a frequency vector x, \(M_L(x,W,\varepsilon )\) satisfies \(\varepsilon \)-differential privacy if it is defined as follows:

$$M_L(x,W,\varepsilon )=W\cdot x^t+(Y_1,\ldots ,Y_k)$$

Each random variable \(Y_i\) is generated by \(Lap\left( \varDelta _W \cdot \frac{1}{\varepsilon }\right) \). The proof is presented as follows, where the databases I and \(I'\) differ in at most one record and \(p_I(s)\) is the probability that the output of the mechanism over database I is s.

$$\begin{aligned} \begin{aligned} \frac{p_I\left( s\right) }{p_{I'}\left( s\right) }&=\prod _{i=1}^{k}\left( \frac{\exp {\left( -\frac{\varepsilon \left| {q\left( I\right) }_i-s_i\right| }{\varDelta _q}\right) }}{\exp {\left( -\frac{\varepsilon \left| {q\left( I'\right) }_i-s_i\right| }{\varDelta _q}\right) }}\right) \\&=\prod _{i=1}^{k}\exp {\left( \frac{\varepsilon \left( \left| {q\left( I'\right) }_i-s_i\right| -\left| {q\left( I\right) }_i-s_i\right| \right) }{\varDelta _q}\right) }\\&\le \prod _{i=1}^{k}\exp {\left( \frac{\varepsilon \left| {q\left( I'\right) }_i-{q\left( I\right) }_i\right| }{\varDelta _q}\right) }\\&\le \exp {\left( \varepsilon \right) } \end{aligned} \end{aligned}$$
(3)
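A possible implementation of Definition 4 is sketched below (the helper name and interface are ours, not the paper's reference code): it calibrates the Laplace scale to \(\varDelta _W/\varepsilon \) and perturbs each query answer.

```python
import numpy as np

def laplace_mechanism(x, W, epsilon, rng=None):
    """Answer workload W over frequency vector x under epsilon-differential
    privacy by adding Lap(delta_W / epsilon) noise to every query answer."""
    rng = rng or np.random.default_rng()
    delta_W = np.abs(W).sum(axis=0).max()        # workload sensitivity
    noise = rng.laplace(scale=delta_W / epsilon, size=W.shape[0])
    return W @ x + noise

# Example usage: answers = laplace_mechanism(x, W, epsilon=1.0)
```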

3.2 Workload-Aware Noise

To reduce the noise, we divide the queries in the workload into several sub-workloads. Meanwhile, the privacy budget \(\varepsilon \) is divided into the same number of privacy budgets. After dividing, each sub-workload receives noise calibrated to its divided privacy budget. Formally, for the workload W, we divide it into \(\{W_1, W_2,..., W_m\}\); the privacy budget is divided correspondingly into \(\varepsilon =\{\varepsilon _1,\varepsilon _2,...,\varepsilon _m\}\), and each sub-workload \(W_i\) receives random noise from the distribution \(Lap(\varDelta _{W_i}\cdot 1/\varepsilon _i)\). That is, the answer to the workload is \(S = [W_1, W_2,...,W_m]\cdot x^t+ [Lap (\varDelta _{W_1}\cdot 1/\varepsilon _1 ),\ Lap (\varDelta _{W_2}\cdot 1/\varepsilon _2 ),... \ Lap ( \varDelta _{W_m}\cdot 1/\varepsilon _m)]\). That this satisfies \(\varepsilon \)-differential privacy follows from Property 2, and we can also prove it directly as follows. For the neighboring databases I and \(I'\), i.e., \(\parallel I-I' \parallel _1 \le 1\), let \(p_I(s)\) denote the probability density of the answer over I, where \(s\in R^k\):

$$\begin{aligned} \begin{aligned} \frac{p_I\left( s\right) }{p_{I'}\left( s\right) }&= \prod _{i=1}^{k}\left( \frac{\exp {\left( -\frac{\varepsilon _i\left| {q\left( I\right) }_i-s_i\right| }{\varDelta _{q_i}}\right) }}{\exp {\left( -\frac{\varepsilon _i\left| {q\left( I'\right) }_i-s_i\right| }{\varDelta _{q_i}}\right) }}\right) \\&=\prod _{j=1}^{m}\prod _{i\in W_j}\left( \frac{\exp {\left( -\frac{\varepsilon _j\left| {q\left( I\right) }_i-s_i\right| }{\varDelta _{W_j}}\right) }}{\exp {\left( -\frac{\varepsilon _j\left| {q\left( I'\right) }_i-s_i\right| }{\varDelta _{W_j}}\right) }}\right) \\&\le \prod _{j=1}^{m}\exp {\left( \varepsilon _j\right) }\\&=\exp {\left( \varepsilon \right) } \end{aligned} \end{aligned}$$
(4)

We now discuss how the error changes when dividing the workload into \(W=\{W_1, W_2,..., W_m\}\) and the privacy budget into \(\varepsilon =\{\varepsilon _1,\varepsilon _2,...,\varepsilon _m\}\). We calculate the average \(L_1\) error of the answer to the workload, where k denotes the total number of queries in W. The expected \(L_1\) error of the answer before and after dividing the workload is, respectively:

$$ E\left( \mid Lap(\varDelta _W\cdot 1/\varepsilon )\mid \right) = \varDelta _{W}\cdot \frac{1}{\varepsilon }$$
$$ E \left( \frac{1}{k}\sum _{i=1}^{m}\mid Lap(\varDelta _{W_i}\cdot 1/\varepsilon _i)\mid \cdot \mid W_i\mid \right) = \frac{1}{k}\sum _{i=1}^{m} \varDelta _{W_i}\cdot \frac{1}{\varepsilon _i} \cdot \mid W_i \mid $$
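The divided mechanism and the expected-error formulas above can be sketched as follows (the function names and the choice of division are illustrative assumptions):

```python
import numpy as np

def divided_laplace(x, sub_workloads, sub_budgets, rng=None):
    """Answer each sub-workload W_i with Lap(delta_{W_i} / eps_i) noise.
    By sequential composition the total budget spent is sum(eps_i)."""
    rng = rng or np.random.default_rng()
    answers = []
    for W_i, eps_i in zip(sub_workloads, sub_budgets):
        delta_i = np.abs(W_i).sum(axis=0).max()
        noise = rng.laplace(scale=delta_i / eps_i, size=W_i.shape[0])
        answers.append(W_i @ x + noise)
    return np.concatenate(answers)

def expected_l1_error(sub_workloads, sub_budgets):
    """Average expected |noise| per query, (1/k) * sum_i (delta_i/eps_i)*|W_i|,
    using the fact that E|Lap(b)| = b."""
    k = sum(W_i.shape[0] for W_i in sub_workloads)
    return sum(np.abs(W_i).sum(axis=0).max() / eps_i * W_i.shape[0]
               for W_i, eps_i in zip(sub_workloads, sub_budgets)) / k
```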

4 Optimized Workload Division

Take the workload in Fig. 2 as an example of an original workload and its division. When the workload W adopts 1-differential privacy via the Laplace Mechanism, the expected \(L_1\) error is \(\varDelta _W/\varepsilon = 4\); after dividing the workload into \(W_1\) and \(W_2\) and the privacy budget into \(\varepsilon _1=0.58\) and \(\varepsilon _2=0.42\), the \(L_1\) error becomes 3.4275.

Fig. 2. A sample of dividing a workload.

Building on the division of workload and privacy budget, we propose a specific division for the situation where the data is relatively large-scale and the data distribution is complex. We use the mean square error of the frequency vector x to discriminate the complexity of the data distribution, and in the query workload there exist data domains queried with high frequency. To reduce the added Laplace-based noise, we divide the privacy budget into two equal parts iteratively, while the workload is divided according to its sensitivity; the process is described in Algorithm 1.

Algorithm 1. Optimized workload division.

The dividing in the algorithm continues until the recursion finishes. We now discuss the rationality of the division. Dichotomy is used because, when the query workload and privacy budget are divided into two parts \(W_1\), \(W_2\), \(\varepsilon _1\), \(\varepsilon _2\), with \( E\left( L_1\right) =\frac{\varDelta _{W_1}}{\varepsilon _1}\cdot \left| W_1\right| +\frac{\varDelta _{W_2}}{\varepsilon _2}\cdot \left| W_2\right| \), the point \(\varepsilon _1=\varepsilon _2\) is the minimum extreme point of the function. As described in Algorithm 1, we first mark the data domains queried with high frequency as high-frequency items. For workload W with division \(W_1\) and \(W_2\), and supposing that the sensitivity of W is the sum of the sensitivities of \(W_1\) and \(W_2\), the total expected noise under \(\varepsilon \)-differential privacy is

$$\begin{aligned} E(noise(\varepsilon _1,\varepsilon _2,W_1,W_2))=\frac{\varDelta _{W_1}}{\varepsilon _1}*\left| W_1\right| +\frac{\varDelta _W-\varDelta _{W_1}}{\varepsilon _2}*(\left| W\right| -\left| W_1\right| ) \end{aligned}$$
(6)

We can infer that \((\varepsilon /2,\varepsilon /2,W_1,W_2)\) is a point of minimum, so we adopt \(\varepsilon _1=\varepsilon _2=\varepsilon /2\). To obtain \(min\ E(noise(\varepsilon _1,\varepsilon _2,W_1,W_2))\), we set \(\varDelta _{W_1} = \left| W_1\right| \), and we can compute that when \(\left| W_1\right| = \left( \left| W\right| + \varDelta _W\right) /4\), \(E(noise(\varepsilon _1,\varepsilon _2,W_1,W_2))\) attains its minimal value. The parameter g is introduced to regulate the result because, for a workload W and its divisions \(W_1\) and \(W_2\), \(\varDelta _{W_1}+\varDelta _{W_2}\) may exceed \(\varDelta _W\), which is not in accordance with our assumption; g can be estimated from the specific workload.
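Since Algorithm 1's pseudocode is not reproduced here, the following is only a hedged sketch of the recursive division under our reading of the text: queries touching the most frequently queried column are split off, the budget is halved, and recursion continues while the (g-regulated) expected noise decreases. The split heuristic is our assumption, not necessarily the paper's exact rule.

```python
import numpy as np

def expected_error(W, eps):
    """Total expected L1 noise when answering W as one block: (delta_W/eps)*|W|."""
    return np.abs(W).sum(axis=0).max() / eps * W.shape[0]

def divide(W, eps, g=1.0):
    """Recursively split workload W and budget eps in half while splitting
    lowers the expected noise; returns a list of (sub-workload, budget) pairs."""
    hot = np.abs(W).sum(axis=0).argmax()         # most frequently queried column
    mask = W[:, hot] != 0
    W1, W2 = W[mask], W[~mask]
    if W1.shape[0] == 0 or W2.shape[0] == 0:     # nothing to split off
        return [(W, eps)]
    before = expected_error(W, eps)
    after = g * (expected_error(W1, eps / 2) + expected_error(W2, eps / 2))
    if after >= before:                          # splitting does not help
        return [(W, eps)]
    return divide(W1, eps / 2, g) + divide(W2, eps / 2, g)
```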

5 Experimental Evaluation

We now evaluate the performance of our approach on multiple datasets and workloads and compare our algorithm with state-of-the-art differential privacy algorithms. The main metric is average error, which we evaluate on different datasets and workloads (Fig. 3).

In the following evaluation, we test our algorithm with the metric of average \(L_1\) error per query result of the given workload. The workloads we use are generated randomly, and the data sets come from real public databases. To make the results more convincing, we run 5 trials for each evaluation. Furthermore, we test the time efficiency of our algorithm. The synthetic workloads are also used by all the comparison algorithms. In the experiments, we let the privacy budget vary over {10.0, 5.0, 1.0, 0.5, 0.1}. In the following, we describe the datasets, workloads, and parameters of the experiments. As described above, even when there are multiple attributes in a dataset, we can still use a one-dimensional frequency vector x to represent it; in the experiments, we use one-dimensional data sets. We use six real data sets. Adult comes from American statistical data [4]; its frequency vector x is built on the attribute "capital loss", as also used in the experiments of [13]. Adult is sparse, and many frequency counts in x are zero. Income is from IPUMS American community survey data from 2001–2011, and its frequency vector x counts personal income [22]; Income is also used in DAWA [16]. Patent is a citation network among a subset of US patents [16]. Fidaeta is from the census of fatal occupational injuries in United States labor statistics [1], and both Nw and Nw2 are from a survey of compensation in United States labor statistics [1]; they are the frequency vectors obtained by setting the unit as 1 and 2, respectively, on a continuous attribute. We take the length of x as 4096 for five of the datasets. An overview of the datasets is given in Table 1.

Fig. 3. Average error on the workload where the frequency count item is queried with probability p = 0.9.

Table 1. Overview of the datasets in the experiments.
Fig. 4. Average error on the workload where the frequency count item is queried with probability p = 0.6.

For the query workload, we conduct experiments on eight synthetic query workloads W. For the frequently queried item in the frequency vector x, we set the probability of its being queried to \(p=\{0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9\}\). Each workload has 2000 queries; each query \(q \in W\) randomly selects a cluster center c, and the positions it queries in x are randomly generated from the normal distribution with c as the center and 10 as the variance. Furthermore, we compare three algorithms with ours. The Identity [8] algorithm adopts the Laplace Mechanism: Laplace noise is added directly to the answer results for perturbation. MWEM [13] achieves differential privacy by obtaining an estimate of x through the Laplace Mechanism and the Exponential Mechanism. The DAWA [16] algorithm adopts a partitioning method to achieve differential privacy for range count and linear count workloads.
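A hedged sketch of how such a synthetic workload could be generated (the number of marked positions per query, k, and the clipping at the domain boundary are our assumptions; the paper's exact construction may differ):

```python
import numpy as np

def make_workload(n=4096, m=2000, p=0.9, hot=1234, k=20, rng=None):
    """Generate m random {0,1}-queries over a length-n domain. Each query
    marks k positions drawn from N(c, 10) around a random center c and
    additionally queries the hot item with probability p."""
    rng = rng or np.random.default_rng()
    W = np.zeros((m, n), dtype=int)
    for q in W:                                   # q is a writable row view
        c = rng.integers(n)                       # random cluster center
        idx = rng.normal(c, np.sqrt(10), size=k).round().astype(int)
        q[np.clip(idx, 0, n - 1)] = 1             # queried positions
        if rng.random() < p:
            q[hot] = 1                            # frequently queried item
    return W
```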

Fig. 5. Average error on the workload where the frequency count item is queried with probability p = 0.2.

Fig. 6. The reduction in average error when comparing our method with Identity.

Among the experimental datasets, Adult is a "sparse" data set: its distribution is relatively even, as shown in Table 1, and zeros account for more than 97% of the frequency vector x. The other experimental data sets are "complex" data sets with a large scale and complex data distributions. Figures 3, 4, and 5 show the \(L_1\) average error for the parameter p set to 0.9, 0.6, and 0.2. It can be seen that MWEM [13] and DAWA [16] add more noise than the Identity [8] algorithm, while our algorithm always adds less noise than Identity. The figures show that the MWEM [13] and DAWA [16] algorithms are dataset-aware: facing different datasets, both algorithms perform differently over the same workload. MWEM is the most erratic; when the data sets are simple or approximately evenly distributed, it can add less noise than the other algorithms, but not for complex data. In Fig. 6, we show the reduction in \(L_1\) average error compared with Identity [8]. In the experiments, we compare performance for different values of the parameter p, which represents the query frequency of a certain \(dom(\mathbb {B})_i\) in x. Figure 5 shows that, in the experimental settings, when p = 0.2 our algorithm reduces the \(L_1\) error by more than 40% compared with Identity.

6 Conclusions

\(\varepsilon \)-differential privacy is an effective privacy-preserving technology for linear computation. It enables data organizers to provide a secure third-party interface for statistical queries. In this paper, we propose a novel \(\varepsilon \)-differential privacy algorithm that uses Laplace-based noise and an optimized workload division to decrease the computation error of linear computations over complex data distributions. The evaluation results show that our approach can reduce the computation error by nearly 40% compared with the state-of-the-art differential privacy algorithms MWEM, DAWA, and Identity. As future work, we plan to extend our approach by further optimizing the proposed workload division to reduce the introduced error.