
Conditional Masking to Numerical Data

  • Original Article
  • Journal of Statistical Theory and Practice

Abstract

Protecting the privacy of datasets has become critically important. Many real-life datasets, such as income and medical records, must be secured before they are released publicly. However, security comes at the cost of losing some useful statistical information about the dataset. Data obfuscation addresses this problem by masking a dataset in such a way that the utility of the data is maximized while the risk of disclosing sensitive information is minimized. Two popular approaches to data obfuscation for numerical data are (i) data swapping and (ii) adding noise to the data. While the former masks well but sacrifices all correlation information, the latter yields estimates of most standard statistics, such as the mean, variance, quantiles and correlation, but fails to give an unbiased estimate of the distribution curve of the original data. In this paper, we propose a mixed obfuscation method combining the two approaches and show how it yields an unbiased estimate of the distribution curve while also providing reliable estimates of other well-known statistics such as moments and correlation.
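For concreteness, the masking mechanism analysed in the Appendix replaces each record either by a value swapped from another record (with probability p) or by the original value plus Gaussian noise (with probability 1 − p). The following Python sketch is a minimal illustration under the assumptions implied by the proofs: independent Bernoulli(p) masking indicators, a swap partner drawn uniformly from the other n − 1 records, and N(0, σ²) noise. The function name and defaults are ours, not part of the paper.

```python
import numpy as np

def conditional_mask(x, p, sigma, rng=None):
    """Mask a numeric vector x: with probability p replace x[i] by a value
    swapped from another record; otherwise add N(0, sigma^2) noise."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    n = len(x)
    swap = rng.random(n) < p                  # B_i = 1 -> swap, B_i = 0 -> noise
    z = x + rng.normal(0.0, sigma, size=n)    # noise branch by default
    for i in np.flatnonzero(swap):
        j = rng.integers(n - 1)               # uniform over the other n - 1 records
        z[i] = x[j if j < i else j + 1]
    return z

# Example: mask a skewed, income-like sample
rng = np.random.default_rng(0)
x = rng.lognormal(mean=10.0, sigma=0.5, size=1000)
z = conditional_mask(x, p=0.7, sigma=x.std(), rng=rng)
```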



Author information

Corresponding author

Correspondence to Debolina Ghatak.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Proof of Theorem 2.1

Proof

To find the raw moments of X, denoted by \(\mu _{(X,k)}\), in terms of the moments of Z, we first need to check that the moments of Z exist whenever those of X do. The kth absolute raw moment of \(Z_i\) is given by

$$\begin{aligned} E[|Z_i|^k] &= E[|Z_i|^k \,|\,B_i=1]\cdot P[B_i=1] + E[|Z_i|^k \,|\,B_i=0]\cdot P[B_i=0]\\ &= p\cdot \frac{1}{n-1}\sum _{j=1,j \ne i}^{n}E[|X_{j}|^k \,|\,j\;{\text{is chosen}}] + (1-p)\cdot E[|X_i+Y_i|^k] \\ &= p\cdot \frac{1}{n-1}\sum _{j=1,j \ne i}^{n}{E[|X_j|^k]} + (1-p)\cdot E[|X_i+Y_i|^k] \\ &\le p\cdot {E[|X_1|^k]} + (1-p)\cdot \left( E[|X_1|^k]^{\frac{1}{k}}+E[|Y_1|^k]^{\frac{1}{k}}\right) ^k \\ &\quad [{\text{by Minkowski's inequality and the fact that the }}X_i{\text{'s and }}Y_i{\text{'s are i.i.d.}}] \\ &< \infty \end{aligned}$$

since \(E|X_j|^k <\infty \), and \(E|Y_j|^k =\frac{\sigma ^k 2^\frac{k}{2}\Gamma (\frac{k+1}{2})}{\sqrt{\pi }} < \infty \) \(\forall k \in {\mathbb {N}}\).

Thus, the moments of Z exist whenever those of X do. We now express the moments of X in terms of the moments of Z.

$$\begin{aligned} \mu _{(Z,k)} &= E[Z_i^k] \\ &= E[Z_i^k \,|\,B_i=1]\cdot P[B_i=1] + E[Z_i^k \,|\,B_i=0]\cdot P[B_i=0]\\ &= p\cdot \frac{1}{n-1}\sum _{j=1,j \ne i}^{n}{E[X_j^k]} + (1-p)\cdot E[(X_i+Y_i)^k] \\ &= E[X_i^k]+ (1-p)\left\{ k\cdot E[X_i^{k-1}]\cdot E[Y_i] + \cdots + E[Y_i^k]\right\} \\ &= \mu _{(X,k)} + (1-p)\cdot \sum _{\substack{j=2 \\ j\;{\text{even}}}}^{2[\frac{k}{2}]}{\binom{k}{j}\mu _{(X,k-j)}\mu _{(Y,j)}} \end{aligned}$$

The last equality follows since the odd-order moments of \(Y_i\) are zero.

$$\begin{aligned} \therefore \;\mu _{(X,k)}=\mu _{(Z,k)} -(1-p)\cdot \sum _{j=1}^{[\frac{k}{2}]}{\binom{k}{2j} {\mu _{(X,k-2j)}\mu _{(Y,2j)}}}. \end{aligned}$$

Now, \(E[Z]=p\cdot E[X] + (1-p) \cdot E[X+Y] =E[X]\). \(\therefore E[\frac{1}{n}\sum _{j=1}^n{Z_j}]=E[Z]=E[X]\).

\(\therefore \frac{1}{n}\sum _{j=1}^n{Z_j}\) is an unbiased estimator of E[X].

Define, in general,

$$\begin{aligned} {\hat{\mu }}_{(X,k)} = \frac{1}{n}\sum _{j=1}^n{Z_j^k} - (1-p)\cdot \sum _{j=1}^{[\frac{k}{2}]}{\binom{k}{2j} {\hat{\mu }}_{(X,k-2j)}\,\mu _{(Y,2j)}}, \quad {\text{where}}\;\mu _{(Y,2j)} = \sigma ^{2j}\frac{2^j\Gamma \left( j+\frac{1}{2}\right) }{\sqrt{\pi }}<\infty . \end{aligned}$$

If \(E[{\hat{\mu }}_{(X,j)}]=\mu _{(X,j)}\) for all \(j=1,2,\ldots ,k\), then

$$\begin{aligned} E[{\hat{\mu }}_{(X,k+1)}]={\mu }_{(Z,k+1)} - (1-p)\cdot \sum _{j=1}^{[\frac{k+1}{2}]}{\binom{k+1}{2j}\mu _{(X,k+1-2j)}\mu _{(Y,2j)}}={\mu }_{(X,k+1)} \end{aligned}$$

Thus, by induction, \({\hat{\mu }}_{(X,k)}\) is an unbiased estimator for \({\mu }_{(X,k)}\).

Also, note that \({\hat{\mu }}_{(X,k)}=\frac{1}{n}\sum _{j=1}^n{f(Z_j)}\), where f(.) is a polynomial of finite degree. Moreover, \({\text{Var}}(Z_1^k)=E(Z_1^{2k})-\left( E(Z_1^k)\right) ^2 < \infty \) for every \(k \in {\mathbb {N}}\) whenever the moments of X, and hence of Z, exist, so \({\text{Var}}(f(Z_1))<\infty \). Thus, \({\text{Var}}({\hat{\mu }}_{(X,k)})=O(\frac{1}{n})\), and \({\hat{\mu }}_{(X,k)}\) is also a consistent estimator of \(\mu _{(X,k)}\). \(\square \)
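To make the recursion concrete, the following sketch (our illustration, not code from the paper) computes \({\hat{\mu }}_{(X,k)}\) from the sample moments of Z by plugging in the lower-order estimates, using \(\mu _{(Y,2j)}=\sigma ^{2j}\,2^{j}\Gamma (j+\frac{1}{2})/\sqrt{\pi }\) for the even moments of the N(0, σ²) noise.

```python
import numpy as np
from math import comb, gamma, sqrt, pi

def estimate_raw_moments(z, p, sigma, kmax):
    """Estimates of E[X^k], k = 1..kmax, from masked data z,
    following the recursion in the proof of Theorem 2.1."""
    z = np.asarray(z, dtype=float)

    def mu_y(m):
        # E[Y^m] for Y ~ N(0, sigma^2): zero for odd m,
        # sigma^m * 2^(m/2) * Gamma((m+1)/2) / sqrt(pi) for even m
        return 0.0 if m % 2 else sigma**m * 2**(m // 2) * gamma(m // 2 + 0.5) / sqrt(pi)

    mu_x_hat = {0: 1.0}                                   # E[X^0] = 1
    for k in range(1, kmax + 1):
        correction = sum(comb(k, 2 * j) * mu_x_hat[k - 2 * j] * mu_y(2 * j)
                         for j in range(1, k // 2 + 1))
        mu_x_hat[k] = np.mean(z**k) - (1 - p) * correction
    return [mu_x_hat[k] for k in range(1, kmax + 1)]
```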

Now, we state and prove the following lemma, which we will require in the proofs of Theorems 2.2 and 2.3 below.

Lemma 6.1

For \((x_1,x_2,x_3) \in {\mathbb {R}}^3\) and \((\sigma _1,\sigma _2) \in ({\mathbb {R}}^+)^2\),

$$\begin{aligned} \int _{-\infty }^{\infty }{ \phi _{\sigma _1}{(x_1-x_2)}\cdot \phi _{\sigma _2}{(x_2-x_3)} {{\mathrm{d}}}x_2}=\phi _{\sqrt{\sigma _1^2+ \sigma _2^2}}{(x_1-x_3)} \end{aligned}$$

where \(\phi _\sigma (x)\) denotes the normal density with mean 0 and standard deviation \(\sigma \), evaluated at \(x \in {\mathbb {R}}\).

Proof

To prove the lemma, consider

$$\begin{aligned} {\text{L.H.S.}} &= \frac{1}{2\pi \sigma _1\sigma _2}\cdot \int _{-\infty }^{\infty }{e^{-\frac{1}{2}\left[ \left( \frac{x_1-x_2}{\sigma _1}\right) ^2+\left( \frac{x_2-x_3}{\sigma _2}\right) ^2\right] }\,{\mathrm{d}}x_2} \\ &= \frac{1}{2\pi \sigma _1\sigma _2}\cdot \int _{-\infty }^{\infty }{e^{-\frac{1}{2}\left[ x_2^2\left( \frac{1}{\sigma _1^2}+\frac{1}{\sigma _2^2}\right) -2x_2\left( \frac{x_1}{\sigma _1^2}+\frac{x_3}{\sigma _2^2}\right) +\frac{x_1^2}{\sigma _1^2}+\frac{x_3^2}{\sigma _2^2}\right] }\,{\mathrm{d}}x_2}\\ &= \frac{1}{\sqrt{2\pi }\sqrt{\sigma _1^2+\sigma _2^2}}\cdot e^{-\frac{1}{2\left( \frac{1}{\sigma _1^2}+\frac{1}{\sigma _2^2}\right) }\left[ \left( \frac{x_1^2}{\sigma _1^2}+\frac{x_3^2}{\sigma _2^2}\right) \left( \frac{1}{\sigma _1^2}+\frac{1}{\sigma _2^2}\right) -\left( \frac{x_1}{\sigma _1^2}+\frac{x_3}{\sigma _2^2}\right) ^2\right] }\\ &\quad \cdot \int _{-\infty }^{\infty }{\phi _{\frac{\sigma _1\sigma _2}{\sqrt{\sigma _1^2+\sigma _2^2}}}\left( x_2-\frac{\frac{x_1}{\sigma _1^2}+\frac{x_3}{\sigma _2^2}}{\frac{1}{\sigma _1^2}+\frac{1}{\sigma _2^2}}\right) {\mathrm{d}}x_2}\\ &= \frac{1}{\sqrt{2\pi }\sqrt{\sigma _1^2+\sigma _2^2}}\cdot e^{-\frac{(x_1-x_3)^2}{2(\sigma _1^2+\sigma _2^2)}} \\ &= {\text{R.H.S.}} \end{aligned}$$

\(\square \)
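The identity in Lemma 6.1 is easy to check numerically; the snippet below is a sanity check of ours (using SciPy for the density and the quadrature) rather than anything from the paper.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def lhs(x1, x3, s1, s2):
    # \int phi_{s1}(x1 - x2) * phi_{s2}(x2 - x3) dx2, by numerical quadrature
    integrand = lambda x2: norm.pdf(x1 - x2, scale=s1) * norm.pdf(x2 - x3, scale=s2)
    return quad(integrand, -np.inf, np.inf)[0]

def rhs(x1, x3, s1, s2):
    # phi_{sqrt(s1^2 + s2^2)}(x1 - x3)
    return norm.pdf(x1 - x3, scale=np.sqrt(s1**2 + s2**2))

print(lhs(1.2, -0.7, 0.8, 1.5), rhs(1.2, -0.7, 0.8, 1.5))  # the two values agree
```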

1.2 Proof of Theorem 2.2

Proof

Let H(.) be the c.d.f. of Z. Then,

$$\begin{aligned} H(x) &= P[Z_i \le x] \\ &= P[Z_i \le x \,|\,B_i=1]\cdot P[B_i=1] + P[Z_i \le x \,|\,B_i=0]\cdot P[B_i=0] \\ &= p \cdot G(x) + (1-p)\cdot \int _{-\infty }^{\infty }{\phi _\sigma (x-y)G(y)\,{\mathrm{d}}y} \end{aligned}$$

Writing \({\tilde{G}}(x)=p\cdot G(x)\), we obtain the equation

$$\begin{aligned} {\tilde{G}}(x)= H(x) + \lambda \int _{-\infty }^{\infty }{\phi _\sigma (x-y){\tilde{G}}(y){\mathrm{d}}y} \end{aligned}$$
(4)

To find a solution to Eq. (4), note that it is a Fredholm equation of the second kind, and for \(|\lambda |<1\) a solution to such a problem is given by the Liouville–Neumann series,

$$\begin{aligned} {\tilde{G}}(x)=\lim _{n \rightarrow \infty }\sum _{t=0}^n{\lambda ^t u_t(x)},\; {\text{where}}\; u_0(x)=H(x), \end{aligned}$$
(5)

\(u_t(x)=\int _{-\infty }^{\infty } \ldots \int _{-\infty }^{\infty } {\phi _\sigma (x-x_1)\cdot \phi _\sigma (x_2-x_1)\cdots \phi _\sigma (x_{t-1}-x_t)H(x_t){\mathrm{d}}x_1 {\mathrm{d}}x_2 \ldots {\mathrm{d}}x_t}\)

Now, H(x) is unknown, so we use \({\hat{H}}(x)=\frac{1}{n}\sum _{j=1}^n{{\mathbb {I}}_{[Z_j \le x]}}\) instead. Here, \({\mathbb {I}}_A\) is the indicator function for event A.

For \(t \in {\mathbb {N}}\), the integrand in \({\hat{u}}_t(x)\) is integrable since,

$$\begin{aligned} |{\hat{u}}_t(x)| &= \left| \int _{-\infty }^{\infty } \ldots \int _{-\infty }^{\infty } {\phi _\sigma (x-x_1)\cdot \phi _\sigma (x_2-x_1)\cdots \phi _\sigma (x_{t-1}-x_t){\hat{H}}(x_t)\,{\mathrm{d}}x_1\, {\mathrm{d}}x_2 \ldots {\mathrm{d}}x_t}\right| \\ &\le \int _{-\infty }^{\infty } \ldots \int _{-\infty }^{\infty } {\left| \phi _\sigma (x-x_1)\cdot \phi _\sigma (x_2-x_1)\cdots \phi _\sigma (x_{t-1}-x_t){\hat{H}}(x_t)\right| {\mathrm{d}}x_1\, {\mathrm{d}}x_2 \ldots {\mathrm{d}}x_t} \\ &\le \int _{-\infty }^{\infty } \ldots \int _{-\infty }^{\infty } {\phi _\sigma (x-x_1)\cdot \phi _\sigma (x_2-x_1)\cdots \phi _\sigma (x_{t-1}-x_t)\,{\mathrm{d}}x_1\, {\mathrm{d}}x_2 \ldots {\mathrm{d}}x_t} \\ &= \int _{-\infty }^{\infty }{\phi _{\sigma \sqrt{t}}(x-x_t)\,{\mathrm{d}}x_t},\quad {\text{by Lemma 6.1}} \\ &= 1 < \infty ,\quad \forall x \in {\mathbb {R}} \end{aligned}$$

Also,

$$\begin{aligned} {\hat{u}}_t(x) &= \int _{-\infty }^{\infty } \ldots \int _{-\infty }^{\infty } {\phi _\sigma (x-x_1)\cdot \phi _\sigma (x_2-x_1)\cdots \phi _\sigma (x_{t-1}-x_t){\hat{H}}(x_t)\,{\mathrm{d}}x_1\, {\mathrm{d}}x_2 \ldots {\mathrm{d}}x_t} \\ &= \int _{-\infty }^{\infty }{\phi _{\sigma \sqrt{t}}(x-x_t){\hat{H}}(x_t)\,{\mathrm{d}}x_t}, \quad {\text{by Lemma 6.1}} \\ &= \frac{1}{n}\sum _{j=1}^n{\int _{Z_j}^{\infty }{\phi _{\sigma \sqrt{t}}(x-x_t)\,{\mathrm{d}}x_t}}\\ &= \frac{1}{n}\sum _{j=1}^n{\int _{-\infty }^{x-Z_j}{\phi _{\sigma \sqrt{t}}(y)\,{\mathrm{d}}y}},\quad {\text{taking}}\;y=x-x_t \\ &= \frac{1}{n}\sum _{j=1}^n{\Phi _{\sigma \sqrt{t}}(x-Z_j)} \end{aligned}$$

Thus, \(\hat{{\tilde{G}}}(x)= \sum _{t=0}^\infty {\lambda ^t {\hat{u}}_t(x)}= \frac{1}{n}\sum _{j=1}^n {\sum _{t=0}^\infty {\lambda ^t \Phi _{\sigma \sqrt{t}}(x-Z_j)}}\) is an estimator of \({\tilde{G}}(x)\).

$$\begin{aligned} E[\hat{{\tilde{G}}}(x)] &= \sum _{t=0}^\infty {\lambda ^t E[{\hat{u}}_t(x)]} \\ &= \sum _{t=0}^\infty {\lambda ^t E\left[ \int _{-\infty }^{\infty } \ldots \int _{-\infty }^{\infty } {\phi _\sigma (x-x_1)\cdots \phi _\sigma (x_{t-1}-x_t){\hat{H}}(x_t)\,{\mathrm{d}}x_1 \ldots {\mathrm{d}}x_t}\right] } \\ &= \sum _{t=0}^\infty {\lambda ^t E\left[ \int _{-\infty }^{\infty }{\phi _{\sigma \sqrt{t}}(x-x_t){\hat{H}}(x_t)\,{\mathrm{d}}x_t}\right] }\\ &= \sum _{t=0}^\infty {\lambda ^t \int _{-\infty }^{\infty }{\phi _{\sigma \sqrt{t}}(x-x_t)E[{\hat{H}}(x_t)]\,{\mathrm{d}}x_t}},\quad {\text{by Tonelli's theorem}} \\ &= \sum _{t=0}^\infty {\lambda ^t \int _{-\infty }^{\infty }{\phi _{\sigma \sqrt{t}}(x-x_t)H(x_t)\,{\mathrm{d}}x_t}} \\ &= \sum _{t=0}^\infty {\lambda ^t \int _{-\infty }^{\infty } \ldots \int _{-\infty }^{\infty } {\phi _\sigma (x-x_1)\cdots \phi _\sigma (x_{t-1}-x_t)H(x_t)\,{\mathrm{d}}x_1 \ldots {\mathrm{d}}x_t}} \\ &= {\tilde{G}}(x) \end{aligned}$$

Thus, \(\hat{{\tilde{G}}}(x)/p\) is an unbiased estimator for G(x). \(\square \)
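In practice the Liouville–Neumann series can be truncated, since its terms decay geometrically for \(|\lambda |<1\). The sketch below (ours) evaluates the resulting estimator \(\hat{{\tilde{G}}}(x)/p=\frac{1}{np}\sum _{j}\sum _{t}\lambda ^t\Phi _{\sigma \sqrt{t}}(x-Z_j)\), assuming \(\lambda =-(1-p)/p\), the value consistent with Eq. (4) and with \(\sum _{t\ge 0}\lambda ^t=p\) used in the proof of Theorem 2.4; the \(t=0\) term reduces to the empirical c.d.f. of Z.

```python
import numpy as np
from scipy.stats import norm

def cdf_estimate(x, z, p, sigma, t_max=200):
    """Estimate G(x) = P[X <= x] from masked data z via the truncated
    Liouville-Neumann series (non-smooth, unbiased version)."""
    z = np.asarray(z, dtype=float)
    lam = -(1 - p) / p              # assumed form of lambda; |lam| < 1 needs p > 1/2
    total = np.mean(z <= x)         # t = 0 term: empirical c.d.f. of Z at x
    for t in range(1, t_max + 1):
        total += lam**t * np.mean(norm.cdf(x - z, scale=sigma * np.sqrt(t)))
    return total / p
```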

1.3 Proof of Theorem 2.3

Proof

Note that the integral term in Eq. (4) is a convolution of two functions, so the derivative may be taken under the integral sign. Differentiating both sides, we have

$$\begin{aligned} {\tilde{g}}(x)= h(x) + \lambda \int _{-\infty }^{\infty }{\phi _\sigma (x-y){\tilde{g}}(y){\mathrm{d}}y} \end{aligned}$$
(6)

where \({\tilde{g}}(x)=\frac{{\mathrm{d}}}{{\mathrm{d}}x}{\tilde{G}}(x)=p\,g(x)\), and g(x) and h(x) are the densities of X and Z, respectively.

To find a solution to Eq. (6), note that it is again a Fredholm equation of the second kind, and for \(|\lambda |<1\) a solution is given by the Liouville–Neumann series,

$$\begin{aligned} {\tilde{g}}(x) &= \lim _{n \rightarrow \infty }\sum _{t=0}^n{\lambda ^t u_t(x)},\; {\text{where}}\; u_0(x)=h(x),\\ u_t(x) &= \int _{-\infty }^{\infty } \ldots \int _{-\infty }^{\infty } {\phi _\sigma (x-x_1)\cdot \phi _\sigma (x_2-x_1)\cdots \phi _\sigma (x_{t-1}-x_t)h(x_t)\,{\mathrm{d}}x_1\, {\mathrm{d}}x_2 \ldots {\mathrm{d}}x_t} \end{aligned}$$

Now, h(x) is unknown, so we use the kernel density estimate \({\hat{h}}(x)=\frac{1}{nb}\sum _{j=1}^{n}{K\left( \frac{x-Z_j}{b}\right) }\) instead. Here, K(.) is the Gaussian kernel function and b is the bandwidth selected by Silverman's rule of thumb.

For \(t \in {\mathbb {N}}\), the integrand in \({\hat{u}}_t(x)\) is integrable for the same reason as in the proof of Theorem 2.2. Also,

$$\begin{aligned} {\hat{u}}_t(x) &= \int _{-\infty }^{\infty } \ldots \int _{-\infty }^{\infty } {\phi _\sigma (x-x_1)\cdot \phi _\sigma (x_2-x_1)\cdots \phi _\sigma (x_{t-1}-x_t){\hat{h}}(x_t)\,{\mathrm{d}}x_1\, {\mathrm{d}}x_2 \ldots {\mathrm{d}}x_t} \\ &= \int _{-\infty }^{\infty }{\phi _{\sigma \sqrt{t}}(x-x_t){\hat{h}}(x_t)\,{\mathrm{d}}x_t}, \quad {\text{by Lemma 6.1}} \\ &= \frac{1}{n}\sum _{j=1}^n{\int _{-\infty }^{\infty }{\phi _{\sigma \sqrt{t}}(x-x_t)\phi _{b}(x_t-Z_j)\,{\mathrm{d}}x_t}}\\ &= \frac{1}{n}\sum _{j=1}^n{\phi _{\sqrt{t\sigma ^2 + b^2}}(x-Z_j)},\quad {\text{by Lemma 6.1}} \end{aligned}$$

Thus, \(\hat{{\tilde{g}}}(x)= \sum _{t=0}^\infty {\lambda ^t {\hat{u}}_t(x)}= \frac{1}{n}\sum _{j=1}^n {\sum _{t=0}^\infty {\lambda ^t \phi _{\sqrt{t\sigma ^2 + b^2}}(x-Z_j)}}\) is an estimator of \({\tilde{g}}(x),\) and hence,

$$\begin{aligned} {\hat{G}}(x)= \frac{1}{np}\sum _{j=1}^{n} {\sum _{t=0}^\infty {\lambda ^t \Phi _{\sqrt{t\sigma ^2 + b^2}}(x-Z_j)}} \end{aligned}$$

is an estimator for G(x).

Note that this estimator is a convergent series of normal c.d.f.s, each with positive variance and hence smooth, which makes \({\hat{G}}(x)\) a smooth function. \(\square \)
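A corresponding sketch of the smooth estimator (again our code) replaces \(\Phi _{\sigma \sqrt{t}}\) by \(\Phi _{\sqrt{t\sigma ^2+b^2}}\); here we take the common Silverman-type bandwidth \(b=1.06\,{\hat{\sigma }}_Z\,n^{-1/5}\) as one concrete choice, and \(\lambda =-(1-p)/p\) as before.

```python
import numpy as np
from scipy.stats import norm

def smooth_cdf_estimate(x, z, p, sigma, t_max=200):
    """Smooth estimate of G(x) from masked data z (the Theorem 2.3 form)."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    b = 1.06 * z.std(ddof=1) * n ** (-1 / 5)   # Silverman-type rule of thumb (one common variant)
    lam = -(1 - p) / p                          # assumed form of lambda
    t = np.arange(t_max + 1)
    scales = np.sqrt(t * sigma**2 + b**2)
    terms = np.array([np.mean(norm.cdf(x - z, scale=s)) for s in scales])
    return np.sum(lam**t * terms) / p           # (1/p) * sum_t lam^t * mean_j Phi(x - Z_j)
```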

1.4 Proof of Theorem 2.4

Proof

We have \(T_i(x)=\frac{1}{np}\sum _{j=1}^n {\sum _{t=0}^\infty {\lambda ^t \Phi _{\sigma _t}(x-Z_j)}}\), where \(\sigma _t=\sigma \sqrt{t}\) for \(T_1(x)\) and \(\sigma _t=\sqrt{t\sigma ^2 + b^2}\) for \(T_b(x)\). Since \(\Phi _{\sigma _t}(x-Z_j) \le 1\) for all \(t \in {\mathbb {N}}\) and \(\sum _{t=0}^\infty {\lambda ^t}=p\), we have \(T_{ij}:=\frac{1}{p}\sum _{t=0}^\infty {\lambda ^t \Phi _{\sigma _t}(x-Z_j)} \le 1\).

\(T_i(x)=\frac{1}{n}\sum _{j=1}^n{T_{ij}}\) is an average of n i.i.d. random variables, each bounded above by 1. Thus,

$$\begin{aligned} {\text{Var}}(T_i(x)) = \frac{1}{n}\cdot {\text{Var}}(T_{i1}(x)) \le \frac{1}{n}E(T_{i1}^2) = O\left( \frac{1}{n}\right) \end{aligned}$$

Since \(T_1(x)\) is unbiased, the last equation implies convergence in probability of \(T_1(x)\) to its expected value G(x). For the smooth estimator,

$$\begin{aligned} E(T_b(x))=E\left[ \frac{1}{np}\sum _{j=1}^n{\sum _{t=0}^\infty {\lambda ^t\cdot \Phi _{\sqrt{t\sigma ^2+b^2}}(x-Z_j)}}\right] =\frac{1}{p}\sum _{t=0}^\infty {\lambda ^t\int _{-\infty }^{\infty }{\Phi \left( \frac{x-z}{\sqrt{t\sigma ^2 +b^2}}\right) h(z)\,{\mathrm{d}}z}} \end{aligned}$$

Using Eq. (4), the true c.d.f. can be written as

$$\begin{aligned} G(x) &= \frac{1}{p} \sum _{t=0}^\infty {\lambda ^t \int _{-\infty }^{\infty } \ldots \int _{-\infty }^{\infty } {\phi _\sigma (x-x_1)\cdots \phi _\sigma (x_{t-1}-x_t)H(x_t)\,{\mathrm{d}}x_1 \ldots {\mathrm{d}}x_t}} \\ &= \frac{1}{p} \sum _{t=0}^\infty {\lambda ^t \int _{-\infty }^{\infty } {\phi _{\sigma \sqrt{t}}(x-x_t)\cdot H(x_t)\,{\mathrm{d}}x_t}} \\ &= \frac{1}{p}\sum _{t=0}^\infty {\lambda ^t\int _{-\infty }^{\infty }{\Phi _{\sigma \sqrt{t}}(x-x_t)h(x_t)\,{\mathrm{d}}x_t}}\\ &= \frac{1}{p}\sum _{t=0}^\infty {\lambda ^t\int _{-\infty }^{\infty }{\Phi \left( \frac{x-x_t}{\sigma \sqrt{t}}\right) h(x_t)\,{\mathrm{d}}x_t}} \end{aligned}$$

Here we used that \(\int _{-\infty }^{\infty }{\phi _{\sigma \sqrt{t}}(x-x_t)\cdot H(x_t)\,{\mathrm{d}}x_t}=\int _{-\infty }^{\infty }{\Phi _{\sigma \sqrt{t}}(x-x_t)\cdot h(x_t)\,{\mathrm{d}}x_t}\), both being the c.d.f. of \(Z+N(0,t\sigma ^2)\) evaluated at x.

Thus, we have an expression for the bias at point x, denoted by B(x), as given below.

$$\begin{aligned} |B(x)| &= \left| \frac{1}{p}\sum _{t=0}^\infty {\lambda ^t\int _{-\infty }^{\infty }{\left[ \Phi \left( \frac{x-z}{\sqrt{t\sigma ^2 +b^2}}\right) -\Phi \left( \frac{x-z}{\sigma \sqrt{t}}\right) \right] h(z)\,{\mathrm{d}}z}}\right| \\ &\le \frac{1}{p}\underbrace{\left| \int _{-\infty }^{\infty }{{\mathbb {I}}(x-z)h(z)\,{\mathrm{d}}z} -\int _{-\infty }^{\infty }{\Phi \left( \frac{x-z}{b}\right) h(z)\,{\mathrm{d}}z}\right| }_{\textsf {Tr}_1} \\ &\quad + \frac{1}{p} \underbrace{\sum _{t=1}^\infty {|\lambda |^t\int _{-\infty }^{\infty }{\left| \Phi \left( \frac{x-z}{\sqrt{t\sigma ^2 +b^2}}\right) -\Phi \left( \frac{x-z}{\sigma \sqrt{t}}\right) \right| h(z)\,{\mathrm{d}}z}}}_{\textsf {Tr}_2}\\ \textsf {Tr}_1 &= \left| \int _{-\infty }^{\infty }{{\mathbb {I}}(x-z)h(z)\,{\mathrm{d}}z}-\int _{-\infty }^{\infty }{\Phi \left( \frac{x-z}{b}\right) h(z)\,{\mathrm{d}}z}\right| \\ &= \left| H(x)- \int _{-\infty }^{\infty }{\Phi \left( \frac{x-z}{b}\right) h(z)\,{\mathrm{d}}z}\right| \\ &= |H(x)-H^{\star }(x)| \end{aligned}$$

where \(H^{\star }(x)\) is the c.d.f. of \(Z+N(0,b^2)\) and Z is a random variable with p.d.f. h(.); here the \(t=0\) term has been interpreted through \(\lim _{s \downarrow 0}\Phi \left( \frac{x-z}{s}\right) ={\mathbb {I}}(x-z)\). As \(n \rightarrow \infty \), \(b \rightarrow 0\) and \(N(0,b^2) \overset{P}{\rightarrow } 0\). Thus, by Slutsky's theorem, \(Z+N(0,b^2) \Rightarrow Z\) in distribution as \(b \rightarrow 0\), and since H is continuous (Z has a density), \(|H(x)-H^{\star }(x)| \rightarrow 0\) as \(n \rightarrow \infty \).

For \(\textsf {Tr}_2\), since \(\Phi \left( \frac{x-z}{\sqrt{y}}\right) \) is continuously differentiable in y, expanding it by Taylor's theorem around \(y=t\sigma ^2\) gives \(\left| \Phi \left( \frac{x-z}{\sqrt{t\sigma ^2 +b^2}}\right) -\Phi \left( \frac{x-z}{\sigma \sqrt{t}}\right) \right| =\left| (t\sigma ^2+b^2 - t\sigma ^2)\cdot \frac{1}{2}\,\phi \left( \frac{x-z}{\sqrt{y^*}}\right) \frac{x-z}{y^{*3/2}}\right| \), where \(y^*\) is a point between \(t\sigma ^2\) and \(t\sigma ^2 +b^2\). Also, it is easy to show that \(|x\phi (x)| \le 1\) for all \(x \in {\mathbb {R}}\) (as discussed in Note 5.1), which implies \(\left| \phi \left( \frac{x-z}{\sqrt{y^*}}\right) \frac{x-z}{\sqrt{y^*}}\right| \le 1\), so the difference is at most \(\frac{b^2}{t\sigma ^2}\). Thus,

$$\begin{aligned} \textsf {Tr}_2 &= \sum _{t=1}^\infty {|\lambda |^t\int _{-\infty }^{\infty }{\left| \Phi \left( \frac{x-z}{\sqrt{t\sigma ^2 +b^2}}\right) -\Phi \left( \frac{x-z}{\sigma \sqrt{t}}\right) \right| h(z)\,{\mathrm{d}}z}} \\ &\le \sum _{t=1}^\infty {|\lambda |^t \frac{b^2}{t \sigma ^2}} \\ &\le \frac{b^2}{\sigma ^2} \sum _{t=1}^\infty {\frac{|\lambda |^t}{t}} \\ &= \frac{b^2}{\sigma ^2}\{-\log (1-|\lambda |)\},\quad {\text{where}}\;0<|\lambda |<1 \\ &= C\,b^2 \rightarrow 0 \quad {\text{as}}\;b \rightarrow 0,\;{\text{where C is a constant}} \end{aligned}$$

Thus, \(MSE(T_b(x))={\text{Var}}(T_b(x))+B(x)^2 \rightarrow 0\) as \(n \rightarrow \infty \). Hence, \(T_b(x) \overset{{\mathbb {L}}_2}{\longrightarrow } G(x)\) for every x as \(n \rightarrow \infty \), which implies the result. \(\square \)

Note 5.1 For \(|x| \le 1\), since \(\frac{1}{\sqrt{2\pi }}<1\) and \(e^{-x^2/2}\le 1\), we have \(|x\phi (x)|<1\). For \(|x|>1\), since \(|x\phi (x)|=|x|\phi (|x|)\), it can be written as \(\frac{1}{\sqrt{2\pi }}\frac{|x|}{e^{x^2/2}}=\frac{1}{\sqrt{2\pi }}\frac{1}{\frac{1}{|x|}+\frac{|x|}{2} + \delta }\le \frac{1}{\sqrt{2\pi }}\cdot \frac{2}{|x|}\le \frac{2}{\sqrt{2\pi }}<1\), where \(\delta > 0\) collects the higher-order terms of \(e^{x^2/2}/|x|\).

1.5 Proof of Theorem 2.5

Proof

For \(1 \le i \le n\), we have

$$\begin{aligned} &E[(Z_i-E(Z_i))^2(X^{\prime }_i-E(X^{\prime }_i))^2] \\ &\quad = p\, E[(Z_i-E(Z_i))^2(X^{\prime }_i-E(X^{\prime }_i))^2 \,|\,B_i=1] + (1-p)\,E[(Z_i-E(Z_i))^2(X^{\prime }_i-E(X^{\prime }_i))^2 \,|\,B_i=0] \\ &\quad = p\,E\left[ \frac{1}{n-1} \sum _{j=1,j \ne i}^n{(X_j-E(X_j))^2(X^{\prime }_i-E(X^{\prime }_i))^2}\right] +(1-p)\,E\left[ (X_i+Y_i-E(X_i+Y_i))^2(X^{\prime }_i-E(X^{\prime }_i))^2\right] \\ &\qquad [\because E[Y_i]=0\;{\text{and}}\; E[Z_i]=E[X_j]\;\forall \, 1 \le i,j \le n,\;{\text{as shown in the proof of Theorem 2.1}}] \\ &\quad = p\cdot {\text{Var}}(X^{\prime })\cdot {\text{Var}}(X) + (1-p)\cdot \left\{ d(X,X^{\prime })+ 2\cdot 0 +{\text{Var}}(Y)\cdot {\text{Var}}(X^{\prime }) \right\} < \infty \end{aligned}$$

Thus, \((Z_i-E(Z_i))(X^{\prime }_i-E(X^{\prime }_i))\) has finite variance, and hence \({\hat{Cov}}(Z,X^{\prime })=\frac{1}{n}\sum _{j=1}^n{Z_j X^{\prime }_j}-{\bar{Z}}\cdot \bar{X^{\prime }}\) is a consistent estimator of the true covariance between Z and \(X^{\prime }\).

$$\begin{aligned} Cov(Z,X^{\prime }) &= E(ZX^{\prime })-E(Z)\cdot E(X^{\prime }) \\ &= p\, E(X)\cdot E(X^{\prime })+ (1-p)\, E((X+Y)\cdot X^{\prime }) - E(Z)\cdot E(X^{\prime })\\ &= (1-p) \left[ E(X\cdot X^{\prime })- E(Z)\cdot E(X^{\prime })\right] \quad [\because E[Z]=E[X]\;{\text{and}}\; E[YX^{\prime }]=E[Y]E[X^{\prime }]=0] \\ &= (1-p)\cdot Cov(X,X^{\prime }) \end{aligned}$$

Thus, \({\hat{Cov}}(Z,X^{\prime })\) is a consistent estimator of \((1-p)\cdot Cov(X,X^{\prime })\); equivalently, \(\frac{1}{1-p}{\hat{Cov}}(Z,X^{\prime })\) is a consistent estimator of \(Cov(X,X^{\prime })\). Also, as seen in the proof of Theorem 2.1, Var(Z) exists whenever Var(X) does. Thus, \(\sqrt{{\hat{Var}}(Z)} \overset{P}{\longrightarrow } \sqrt{{\text{Var}}(Z)}\), and since \({\text{Var}}(X^{\prime })\) exists, \(\sqrt{{\hat{Var}}(X^{\prime })} \overset{P}{\longrightarrow } \sqrt{{\text{Var}}(X^{\prime })}\). Hence, by the properties of convergence in probability, the result follows. \(\square \)
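The rescaling used in this proof translates directly into an estimator of Cov(X, X′) from the masked variable and an unmasked covariate; a minimal sketch (ours) is:

```python
import numpy as np

def estimate_cov(z, x_prime, p):
    """Consistent estimate of Cov(X, X') from masked data z and the unmasked
    covariate x_prime, using Cov(Z, X') = (1 - p) Cov(X, X')."""
    z = np.asarray(z, dtype=float)
    x_prime = np.asarray(x_prime, dtype=float)
    cov_zx = np.mean(z * x_prime) - z.mean() * x_prime.mean()   # \hat{Cov}(Z, X')
    return cov_zx / (1 - p)
```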

Cite this article

Ghatak, D., Roy, B.K. Conditional Masking to Numerical Data. J Stat Theory Pract 13, 44 (2019). https://doi.org/10.1007/s42519-019-0042-y
