
Conditional Masking to Numerical Data

  • Original Article
  • Journal of Statistical Theory and Practice

Abstract

Protecting the privacy of datasets has become critically important. Many real-life datasets, such as income and medical records, must be secured before they are released publicly. However, security comes at the cost of losing some useful statistical information about the dataset. Data obfuscation addresses this problem by masking a dataset in such a way that the utility of the data is maximized while the risk of disclosing sensitive information is minimized. Two popular approaches to data obfuscation for numerical data are (i) data swapping and (ii) adding noise to the data. While the former masks well but sacrifices all correlation information, the latter yields estimates of most standard statistics, such as the mean, variance, quantiles and correlation, but fails to give an unbiased estimate of the distribution curve of the original data. In this paper, we propose a mixed obfuscation method combining the two approaches and show how it yields an unbiased estimate of the distribution curve while also providing reliable estimates of other well-known statistics such as moments and correlation.
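For concreteness, the masking mechanism analysed in the Appendix replaces each record either by a value swapped from another record (with probability p) or by the original value plus Gaussian noise (with probability 1 − p). The following Python sketch is a minimal illustration under the assumptions implied by the proofs: independent Bernoulli(p) masking indicators, a swap partner drawn uniformly from the other n − 1 records, and N(0, σ²) noise. The function name and defaults are ours, not part of the paper.

```python
import numpy as np

def conditional_mask(x, p, sigma, rng=None):
    """Mask a numeric vector x: with probability p replace x[i] by a value
    swapped from another record; otherwise add N(0, sigma^2) noise."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    n = len(x)
    swap = rng.random(n) < p                  # B_i = 1 -> swap, B_i = 0 -> noise
    z = x + rng.normal(0.0, sigma, size=n)    # noise branch by default
    for i in np.flatnonzero(swap):
        j = rng.integers(n - 1)               # uniform over the other n - 1 records
        z[i] = x[j if j < i else j + 1]
    return z

# Example: mask a skewed, income-like sample
rng = np.random.default_rng(0)
x = rng.lognormal(mean=10.0, sigma=0.5, size=1000)
z = conditional_mask(x, p=0.7, sigma=x.std(), rng=rng)
```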



Author information

Corresponding author

Correspondence to Debolina Ghatak.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Proof of Theorem 2.1

Proof

To find the raw moments of X, denoted by \(\mu _{(X,k)}\), in terms of the moments of Z, we first need to check that the moments of Z exist whenever those of X do. The kth absolute raw moment of \(Z_i\) is given by

$$\begin{aligned} E[|Z_i|^k] &= E[|Z_i|^k \,|\,B_i=1]\cdot P[B_i=1] + E[|Z_i|^k \,|\,B_i=0]\cdot P[B_i=0]\\ &= p\cdot \frac{1}{n-1}\sum _{j=1,j \ne i}^{n}E[|X_{j}|^k \,|\,j\;{\text{is chosen}}] + (1-p)\cdot E[|X_i+Y_i|^k] \\ &= p\cdot \frac{1}{n-1}\sum _{j=1,j \ne i}^{n}{E[|X_j|^k]} + (1-p)\cdot E[|X_i+Y_i|^k] \\ &\le p\cdot {E[|X_1|^k]} + (1-p)\cdot \left( E[|X_1|^k]^{\frac{1}{k}}+E[|Y_1|^k]^{\frac{1}{k}}\right) ^k \\ &\quad [{\text{by Minkowski's inequality and the fact that the }}X_i{\text{'s and }}Y_i{\text{'s are i.i.d.}}] \\ &< \infty \end{aligned}$$

since \(E|X_j|^k <\infty \), and \(E|Y_j|^k =\frac{\sigma ^k 2^\frac{k}{2}\Gamma (\frac{k+1}{2})}{\sqrt{\pi }} < \infty \) \(\forall k \in {\mathbb {N}}\).

Thus, the moments of Z exist whenever those of X do. We now express the moments of X in terms of the moments of Z.

$$\begin{aligned} \mu _{(Z,k)} &= E[Z_i^k] \\ &= E[Z_i^k \,|\,B_i=1]\cdot P[B_i=1] + E[Z_i^k \,|\,B_i=0]\cdot P[B_i=0]\\ &= p\cdot \frac{1}{n-1}\sum _{j=1,j \ne i}^{n}{E[X_j^k]} + (1-p)\cdot E[(X_i+Y_i)^k] \\ &= E[X_i^k]+ (1-p)\left\{ k\cdot E[X_i^{k-1}]\cdot E[Y_i] + \cdots + E[Y_i^k]\right\} \\ &= \mu _{(X,k)} + (1-p)\cdot \sum _{\substack{j=2 \\ j\;{\text{even}}}}^{2[\frac{k}{2}]}{\binom{k}{j}\mu _{(X,k-j)}\mu _{(Y,j)}} \end{aligned}$$

The last equality follows since the odd-order moments of \(Y_i\) are zero.

$$\begin{aligned} \therefore \;\mu _{(X,k)}=\mu _{(Z,k)} -(1-p)\cdot \sum _{j=1}^{[\frac{k}{2}]}{\binom{k}{2j} {\mu _{(X,k-2j)}\mu _{(Y,2j)}}}. \end{aligned}$$

Now, \(E[Z]=p\cdot E[X] + (1-p) \cdot E[X+Y] =E[X]\). \(\therefore E[\frac{1}{n}\sum _{j=1}^n{Z_j}]=E[Z]=E[X]\).

\(\therefore \frac{1}{n}\sum _{j=1}^n{Z_j}\) is an unbiased estimator of E[X].

Define, in general,

$$\begin{aligned} {\hat{\mu }}_{(X,k)} = \frac{1}{n}\sum _{j=1}^n{Z_j^k} - (1-p)\cdot \sum _{j=1}^{[\frac{k}{2}]}{\binom{k}{2j} {\hat{\mu }}_{(X,k-2j)}\,\mu _{(Y,2j)}}, \quad {\text{where}}\;\mu _{(Y,2j)} = \sigma ^{2j}\frac{2^j\Gamma \left( j+\frac{1}{2}\right) }{\sqrt{\pi }}<\infty . \end{aligned}$$

If \(E[{\hat{\mu }}_{(X,j)}]=\mu _{(X,j)}\) for all \(j=1,2,\ldots ,k\), then

$$\begin{aligned} E[{\hat{\mu }}_{(X,k+1)}]={\mu }_{(Z,k+1)} - (1-p)\cdot \sum _{j=1}^{[\frac{k+1}{2}]}{\binom{k+1}{2j}\mu _{(X,k+1-2j)}\mu _{(Y,2j)}}={\mu }_{(X,k+1)} \end{aligned}$$

Thus, by induction, \({\hat{\mu }}_{(X,k)}\) is an unbiased estimator for \({\mu }_{(X,k)}\).

Also, note that \({\hat{\mu }}_{(X,k)}=\frac{1}{n}\sum _{j=1}^n{f(Z_j)}\), where f(.) is a polynomial of finite degree. Moreover, \({\text{Var}}(Z_1^k)=E(Z_1^{2k})-\left( E(Z_1^k)\right) ^2 < \infty \) for every \(k \in {\mathbb {N}}\) whenever the moments of X, and hence of Z, exist, so \({\text{Var}}(f(Z_1))<\infty \). Thus, \({\text{Var}}({\hat{\mu }}_{(X,k)})=O(\frac{1}{n})\), and \({\hat{\mu }}_{(X,k)}\) is also a consistent estimator of \(\mu _{(X,k)}\). \(\square \)
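To make the recursion concrete, the following sketch (our illustration, not code from the paper) computes \({\hat{\mu }}_{(X,k)}\) from the sample moments of Z by plugging in the lower-order estimates, using \(\mu _{(Y,2j)}=\sigma ^{2j}\,2^{j}\Gamma (j+\frac{1}{2})/\sqrt{\pi }\) for the even moments of the N(0, σ²) noise.

```python
import numpy as np
from math import comb, gamma, sqrt, pi

def estimate_raw_moments(z, p, sigma, kmax):
    """Estimates of E[X^k], k = 1..kmax, from masked data z,
    following the recursion in the proof of Theorem 2.1."""
    z = np.asarray(z, dtype=float)

    def mu_y(m):
        # E[Y^m] for Y ~ N(0, sigma^2): zero for odd m,
        # sigma^m * 2^(m/2) * Gamma((m+1)/2) / sqrt(pi) for even m
        return 0.0 if m % 2 else sigma**m * 2**(m // 2) * gamma(m // 2 + 0.5) / sqrt(pi)

    mu_x_hat = {0: 1.0}                                   # E[X^0] = 1
    for k in range(1, kmax + 1):
        correction = sum(comb(k, 2 * j) * mu_x_hat[k - 2 * j] * mu_y(2 * j)
                         for j in range(1, k // 2 + 1))
        mu_x_hat[k] = np.mean(z**k) - (1 - p) * correction
    return [mu_x_hat[k] for k in range(1, kmax + 1)]
```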

Now, we state and prove the following lemma, which we will require in the proofs of Theorems 2.2 and 2.3 below.

Lemma 6.1

For \((x_1,x_2,x_3) \in {\mathbb {R}}^3\) and \((\sigma _1,\sigma _2) \in ({\mathbb {R}}^+)^2\),

$$\begin{aligned} \int _{-\infty }^{\infty }{ \phi _{\sigma _1}{(x_1-x_2)}\cdot \phi _{\sigma _2}{(x_2-x_3)} {{\mathrm{d}}}x_2}=\phi _{\sqrt{\sigma _1^2+ \sigma _2^2}}{(x_1-x_3)} \end{aligned}$$

where \(\phi _\sigma (x)\) denotes the normal density with mean 0 and standard deviation \(\sigma \), evaluated at \(x \in {\mathbb {R}}\).

Proof

To prove the lemma, consider

$$\begin{aligned} {\text{L.H.S.}} &= \frac{1}{2\pi \sigma _1\sigma _2}\cdot \int _{-\infty }^{\infty }{e^{-\frac{1}{2}\left[ \left( \frac{x_1-x_2}{\sigma _1}\right) ^2+\left( \frac{x_2-x_3}{\sigma _2}\right) ^2\right] }\,{\mathrm{d}}x_2} \\ &= \frac{1}{2\pi \sigma _1\sigma _2}\cdot \int _{-\infty }^{\infty }{e^{-\frac{1}{2}\left[ x_2^2\left( \frac{1}{\sigma _1^2}+\frac{1}{\sigma _2^2}\right) -2x_2\left( \frac{x_1}{\sigma _1^2}+\frac{x_3}{\sigma _2^2}\right) +\frac{x_1^2}{\sigma _1^2}+\frac{x_3^2}{\sigma _2^2}\right] }\,{\mathrm{d}}x_2}\\ &= \frac{1}{\sqrt{2\pi }\sqrt{\sigma _1^2+\sigma _2^2}}\cdot e^{-\frac{1}{2\left( \frac{1}{\sigma _1^2}+\frac{1}{\sigma _2^2}\right) }\left[ \left( \frac{x_1^2}{\sigma _1^2}+\frac{x_3^2}{\sigma _2^2}\right) \left( \frac{1}{\sigma _1^2}+\frac{1}{\sigma _2^2}\right) -\left( \frac{x_1}{\sigma _1^2}+\frac{x_3}{\sigma _2^2}\right) ^2\right] }\\ &\quad \cdot \int _{-\infty }^{\infty }{\phi _{\frac{\sigma _1\sigma _2}{\sqrt{\sigma _1^2+\sigma _2^2}}}\left( x_2-\frac{\frac{x_1}{\sigma _1^2}+\frac{x_3}{\sigma _2^2}}{\frac{1}{\sigma _1^2}+\frac{1}{\sigma _2^2}}\right) {\mathrm{d}}x_2}\\ &= \frac{1}{\sqrt{2\pi }\sqrt{\sigma _1^2+\sigma _2^2}}\cdot e^{-\frac{(x_1-x_3)^2}{2(\sigma _1^2+\sigma _2^2)}} \\ &= {\text{R.H.S.}} \end{aligned}$$

\(\square \)
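The identity in Lemma 6.1 is easy to check numerically; the snippet below is a sanity check of ours (using SciPy for the density and the quadrature) rather than anything from the paper.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def lhs(x1, x3, s1, s2):
    # \int phi_{s1}(x1 - x2) * phi_{s2}(x2 - x3) dx2, by numerical quadrature
    integrand = lambda x2: norm.pdf(x1 - x2, scale=s1) * norm.pdf(x2 - x3, scale=s2)
    return quad(integrand, -np.inf, np.inf)[0]

def rhs(x1, x3, s1, s2):
    # phi_{sqrt(s1^2 + s2^2)}(x1 - x3)
    return norm.pdf(x1 - x3, scale=np.sqrt(s1**2 + s2**2))

print(lhs(1.2, -0.7, 0.8, 1.5), rhs(1.2, -0.7, 0.8, 1.5))  # the two values agree
```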

1.2 Proof of Theorem 2.2

Proof

Let H(.) be the c.d.f. of Z. Then,

$$\begin{aligned} H(x) &= P[Z_i \le x] \\ &= P[Z_i \le x \,|\,B_i=1]\cdot P[B_i=1] + P[Z_i \le x \,|\,B_i=0]\cdot P[B_i=0] \\ &= p \cdot G(x) + (1-p)\cdot \int _{-\infty }^{\infty }{\phi _\sigma (x-y)G(y)\,{\mathrm{d}}y} \end{aligned}$$

Writing \({\tilde{G}}(x)=p\cdot G(x)\), we obtain the equation

$$\begin{aligned} {\tilde{G}}(x)= H(x) + \lambda \int _{-\infty }^{\infty }{\phi _\sigma (x-y){\tilde{G}}(y){\mathrm{d}}y} \end{aligned}$$
(4)

To find a solution to Eq. (4), note that it is a Fredholm equation of the second kind, and for \(|\lambda |<1\) a solution to such a problem is given by the Liouville–Neumann series,

$$\begin{aligned} {\tilde{G}}(x)=\lim _{n \rightarrow \infty }\sum _{t=0}^n{\lambda ^t u_t(x)},\; {\text{where}}\; u_0(x)=H(x), \end{aligned}$$
(5)

\(u_t(x)=\int _{-\infty }^{\infty } \ldots \int _{-\infty }^{\infty } {\phi _\sigma (x-x_1)\cdot \phi _\sigma (x_2-x_1)\cdots \phi _\sigma (x_{t-1}-x_t)H(x_t){\mathrm{d}}x_1 {\mathrm{d}}x_2 \ldots {\mathrm{d}}x_t}\)

Now, H(x) is unknown, so we use \({\hat{H}}(x)=\frac{1}{n}\sum _{j=1}^n{{\mathbb {I}}_{[Z_j \le x]}}\) instead. Here, \({\mathbb {I}}_A\) is the indicator function for event A.

For \(t \in {\mathbb {N}}\), the integrand in \({\hat{u}}_t(x)\) is integrable since,

$$\begin{aligned} |{\hat{u}}_t(x)| &= \left| \int _{-\infty }^{\infty } \ldots \int _{-\infty }^{\infty } {\phi _\sigma (x-x_1)\cdot \phi _\sigma (x_2-x_1)\cdots \phi _\sigma (x_{t-1}-x_t){\hat{H}}(x_t)\,{\mathrm{d}}x_1\, {\mathrm{d}}x_2 \ldots {\mathrm{d}}x_t}\right| \\ &\le \int _{-\infty }^{\infty } \ldots \int _{-\infty }^{\infty } {\left| \phi _\sigma (x-x_1)\cdot \phi _\sigma (x_2-x_1)\cdots \phi _\sigma (x_{t-1}-x_t){\hat{H}}(x_t)\right| {\mathrm{d}}x_1\, {\mathrm{d}}x_2 \ldots {\mathrm{d}}x_t} \\ &\le \int _{-\infty }^{\infty } \ldots \int _{-\infty }^{\infty } {\phi _\sigma (x-x_1)\cdot \phi _\sigma (x_2-x_1)\cdots \phi _\sigma (x_{t-1}-x_t)\,{\mathrm{d}}x_1\, {\mathrm{d}}x_2 \ldots {\mathrm{d}}x_t} \\ &= \int _{-\infty }^{\infty }{\phi _{\sigma \sqrt{t}}(x-x_t)\,{\mathrm{d}}x_t},\quad {\text{by Lemma 6.1}} \\ &= 1 < \infty ,\quad \forall x \in {\mathbb {R}} \end{aligned}$$

Also,

$$\begin{aligned} {\hat{u}}_t(x) &= \int _{-\infty }^{\infty } \ldots \int _{-\infty }^{\infty } {\phi _\sigma (x-x_1)\cdot \phi _\sigma (x_2-x_1)\cdots \phi _\sigma (x_{t-1}-x_t){\hat{H}}(x_t)\,{\mathrm{d}}x_1\, {\mathrm{d}}x_2 \ldots {\mathrm{d}}x_t} \\ &= \int _{-\infty }^{\infty }{\phi _{\sigma \sqrt{t}}(x-x_t){\hat{H}}(x_t)\,{\mathrm{d}}x_t}, \quad {\text{by Lemma 6.1}} \\ &= \frac{1}{n}\sum _{j=1}^n{\int _{Z_j}^{\infty }{\phi _{\sigma \sqrt{t}}(x-x_t)\,{\mathrm{d}}x_t}}\\ &= \frac{1}{n}\sum _{j=1}^n{\int _{-\infty }^{x-Z_j}{\phi _{\sigma \sqrt{t}}(y)\,{\mathrm{d}}y}},\quad {\text{taking}}\;y=x-x_t \\ &= \frac{1}{n}\sum _{j=1}^n{\Phi _{\sigma \sqrt{t}}(x-Z_j)} \end{aligned}$$

Thus, \(\hat{{\tilde{G}}}(x)= \sum _{t=0}^\infty {\lambda ^t {\hat{u}}_t(x)}= \frac{1}{n}\sum _{j=1}^n {\sum _{t=0}^\infty {\lambda ^t \Phi _{\sigma \sqrt{t}}(x-Z_j)}}\) is an estimator of \({\tilde{G}}(x)\).

$$\begin{aligned} E[\hat{{\tilde{G}}}(x)] &= \sum _{t=0}^\infty {\lambda ^t E[{\hat{u}}_t(x)]} \\ &= \sum _{t=0}^\infty {\lambda ^t E\left[ \int _{-\infty }^{\infty } \ldots \int _{-\infty }^{\infty } {\phi _\sigma (x-x_1)\cdots \phi _\sigma (x_{t-1}-x_t){\hat{H}}(x_t)\,{\mathrm{d}}x_1 \ldots {\mathrm{d}}x_t}\right] } \\ &= \sum _{t=0}^\infty {\lambda ^t E\left[ \int _{-\infty }^{\infty }{\phi _{\sigma \sqrt{t}}(x-x_t){\hat{H}}(x_t)\,{\mathrm{d}}x_t}\right] }\\ &= \sum _{t=0}^\infty {\lambda ^t \int _{-\infty }^{\infty }{\phi _{\sigma \sqrt{t}}(x-x_t)E[{\hat{H}}(x_t)]\,{\mathrm{d}}x_t}},\quad {\text{by Tonelli's theorem}} \\ &= \sum _{t=0}^\infty {\lambda ^t \int _{-\infty }^{\infty }{\phi _{\sigma \sqrt{t}}(x-x_t)H(x_t)\,{\mathrm{d}}x_t}} \\ &= \sum _{t=0}^\infty {\lambda ^t \int _{-\infty }^{\infty } \ldots \int _{-\infty }^{\infty } {\phi _\sigma (x-x_1)\cdots \phi _\sigma (x_{t-1}-x_t)H(x_t)\,{\mathrm{d}}x_1 \ldots {\mathrm{d}}x_t}} \\ &= {\tilde{G}}(x) \end{aligned}$$

Thus, \(\hat{{\tilde{G}}}(x)/p\) is an unbiased estimator for G(x). \(\square \)
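In practice the Liouville–Neumann series can be truncated, since its terms decay geometrically for \(|\lambda |<1\). The sketch below (ours) evaluates the resulting estimator \(\hat{{\tilde{G}}}(x)/p=\frac{1}{np}\sum _{j}\sum _{t}\lambda ^t\Phi _{\sigma \sqrt{t}}(x-Z_j)\), assuming \(\lambda =-(1-p)/p\), the value consistent with Eq. (4) and with \(\sum _{t\ge 0}\lambda ^t=p\) used in the proof of Theorem 2.4; the \(t=0\) term reduces to the empirical c.d.f. of Z.

```python
import numpy as np
from scipy.stats import norm

def cdf_estimate(x, z, p, sigma, t_max=200):
    """Estimate G(x) = P[X <= x] from masked data z via the truncated
    Liouville-Neumann series (non-smooth, unbiased version)."""
    z = np.asarray(z, dtype=float)
    lam = -(1 - p) / p              # assumed form of lambda; |lam| < 1 needs p > 1/2
    total = np.mean(z <= x)         # t = 0 term: empirical c.d.f. of Z at x
    for t in range(1, t_max + 1):
        total += lam**t * np.mean(norm.cdf(x - z, scale=sigma * np.sqrt(t)))
    return total / p
```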

1.3 Proof of Theorem 2.3

Proof

Note that the integral term in Eq. (4) is a convolution of two functions, so the derivative may be taken under the integral sign. Differentiating both sides, we have

$$\begin{aligned} {\tilde{g}}(x)= h(x) + \lambda \int _{-\infty }^{\infty }{\phi _\sigma (x-y){\tilde{g}}(y){\mathrm{d}}y} \end{aligned}$$
(6)

where \({\tilde{g}}(x)=\frac{{\mathrm{d}}}{{\mathrm{d}}x}{\tilde{G}}(x)=p\,g(x)\), and g(x) and h(x) are the densities of X and Z, respectively.

To find a solution to Eq. (6), note that it is again a Fredholm equation of the second kind, and for \(|\lambda |<1\) a solution is given by the Liouville–Neumann series,

$$\begin{aligned} {\tilde{g}}(x) &= \lim _{n \rightarrow \infty }\sum _{t=0}^n{\lambda ^t u_t(x)},\; {\text{where}}\; u_0(x)=h(x),\\ u_t(x) &= \int _{-\infty }^{\infty } \ldots \int _{-\infty }^{\infty } {\phi _\sigma (x-x_1)\cdot \phi _\sigma (x_2-x_1)\cdots \phi _\sigma (x_{t-1}-x_t)h(x_t)\,{\mathrm{d}}x_1\, {\mathrm{d}}x_2 \ldots {\mathrm{d}}x_t} \end{aligned}$$

Now, h(x) is unknown, so we use the kernel density estimate \({\hat{h}}(x)=\frac{1}{nb}\sum _{j=1}^{n}{K\left( \frac{x-Z_j}{b}\right) }\) instead. Here, K(.) is the Gaussian kernel function and b is the bandwidth selected by Silverman's rule of thumb.

For \(t \in {\mathbb {N}}\), the integrand in \({\hat{u}}_t(x)\) is integrable for the same reason as in the proof of Theorem 2.2. Also,

$$\begin{aligned} {\hat{u}}_t(x) &= \int _{-\infty }^{\infty } \ldots \int _{-\infty }^{\infty } {\phi _\sigma (x-x_1)\cdot \phi _\sigma (x_2-x_1)\cdots \phi _\sigma (x_{t-1}-x_t){\hat{h}}(x_t)\,{\mathrm{d}}x_1\, {\mathrm{d}}x_2 \ldots {\mathrm{d}}x_t} \\ &= \int _{-\infty }^{\infty }{\phi _{\sigma \sqrt{t}}(x-x_t){\hat{h}}(x_t)\,{\mathrm{d}}x_t}, \quad {\text{by Lemma 6.1}} \\ &= \frac{1}{n}\sum _{j=1}^n{\int _{-\infty }^{\infty }{\phi _{\sigma \sqrt{t}}(x-x_t)\phi _{b}(x_t-Z_j)\,{\mathrm{d}}x_t}}\\ &= \frac{1}{n}\sum _{j=1}^n{\phi _{\sqrt{t\sigma ^2 + b^2}}(x-Z_j)},\quad {\text{by Lemma 6.1}} \end{aligned}$$

Thus, \(\hat{{\tilde{g}}}(x)= \sum _{t=0}^\infty {\lambda ^t {\hat{u}}_t(x)}= \frac{1}{n}\sum _{j=1}^n {\sum _{t=0}^\infty {\lambda ^t \phi _{\sqrt{t\sigma ^2 + b^2}}(x-Z_j)}}\) is an estimator of \({\tilde{g}}(x),\) and hence,

$$\begin{aligned} {\hat{G}}(x)= \frac{1}{np}\sum _{j=1}^{n} {\sum _{t=0}^\infty {\lambda ^t \Phi _{\sqrt{t\sigma ^2 + b^2}}(x-Z_j)}} \end{aligned}$$

is an estimator for G(x).

Note that this estimator is a convergent series of normal c.d.f.s, each with positive variance and hence smooth, which makes \({\hat{G}}(x)\) a smooth function. \(\square \)
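A corresponding sketch of the smooth estimator (again our code) replaces \(\Phi _{\sigma \sqrt{t}}\) by \(\Phi _{\sqrt{t\sigma ^2+b^2}}\); here we take the common Silverman-type bandwidth \(b=1.06\,{\hat{\sigma }}_Z\,n^{-1/5}\) as one concrete choice, and \(\lambda =-(1-p)/p\) as before.

```python
import numpy as np
from scipy.stats import norm

def smooth_cdf_estimate(x, z, p, sigma, t_max=200):
    """Smooth estimate of G(x) from masked data z (the Theorem 2.3 form)."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    b = 1.06 * z.std(ddof=1) * n ** (-1 / 5)   # Silverman-type rule of thumb (one common variant)
    lam = -(1 - p) / p                          # assumed form of lambda
    t = np.arange(t_max + 1)
    scales = np.sqrt(t * sigma**2 + b**2)
    terms = np.array([np.mean(norm.cdf(x - z, scale=s)) for s in scales])
    return np.sum(lam**t * terms) / p           # (1/p) * sum_t lam^t * mean_j Phi(x - Z_j)
```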

1.4 Proof of Theorem 2.4

Proof

We have \(T_i(x)=\frac{1}{np}\sum _{j=1}^n {\sum _{t=0}^\infty {\lambda ^t \Phi _{\sigma _t}(x-Z_j)}}\), where \(\sigma _t=\sigma \sqrt{t}\) for \(T_1(x)\) and \(\sigma _t=\sqrt{t\sigma ^2 + b^2}\) for \(T_b(x)\). Since \(\Phi _{\sigma _t}(x-Z_j) \le 1\) for all \(t \in {\mathbb {N}}\) and \(\sum _{t=0}^\infty {\lambda ^t}=p\), we have \(T_{ij}:=\frac{1}{p}\sum _{t=0}^\infty {\lambda ^t \Phi _{\sigma _t}(x-Z_j)} \le 1\).

\(T_i(x)=\frac{1}{n}\sum _{j=1}^n{T_{ij}}\) is an average of n i.i.d. random variables, each bounded above by 1. Thus,

$$\begin{aligned} {\text{Var}}(T_i(x)) = \frac{1}{n}\cdot {\text{Var}}(T_{i1}(x)) \le \frac{1}{n}E(T_{i1}^2) = O\left( \frac{1}{n}\right) \end{aligned}$$

Since \(T_1(x)\) is unbiased, the last equation implies convergence in probability of \(T_1(x)\) to its expected value G(x). For the smooth estimator,

$$\begin{aligned} E(T_b(x))=E\left[ \frac{1}{np}\sum _{j=1}^n{\sum _{t=0}^\infty {\lambda ^t\cdot \Phi _{\sqrt{t\sigma ^2+b^2}}(x-Z_j)}}\right] =\frac{1}{p}\sum _{t=0}^\infty {\lambda ^t\int _{-\infty }^{\infty }{\Phi \left( \frac{x-z}{\sqrt{t\sigma ^2 +b^2}}\right) h(z)\,{\mathrm{d}}z}} \end{aligned}$$

Using Eq. (4), the true c.d.f. can be written as

$$\begin{aligned} G(x) &= \frac{1}{p} \sum _{t=0}^\infty {\lambda ^t \int _{-\infty }^{\infty } \ldots \int _{-\infty }^{\infty } {\phi _\sigma (x-x_1)\cdots \phi _\sigma (x_{t-1}-x_t)H(x_t)\,{\mathrm{d}}x_1 \ldots {\mathrm{d}}x_t}} \\ &= \frac{1}{p} \sum _{t=0}^\infty {\lambda ^t \int _{-\infty }^{\infty } {\phi _{\sigma \sqrt{t}}(x-x_t)\cdot H(x_t)\,{\mathrm{d}}x_t}} \\ &= \frac{1}{p}\sum _{t=0}^\infty {\lambda ^t\int _{-\infty }^{\infty }{\Phi _{\sigma \sqrt{t}}(x-x_t)h(x_t)\,{\mathrm{d}}x_t}}\\ &= \frac{1}{p}\sum _{t=0}^\infty {\lambda ^t\int _{-\infty }^{\infty }{\Phi \left( \frac{x-x_t}{\sigma \sqrt{t}}\right) h(x_t)\,{\mathrm{d}}x_t}} \end{aligned}$$

Here we used that \(\int _{-\infty }^{\infty }{\phi _{\sigma \sqrt{t}}(x-x_t)\cdot H(x_t)\,{\mathrm{d}}x_t}=\int _{-\infty }^{\infty }{\Phi _{\sigma \sqrt{t}}(x-x_t)\cdot h(x_t)\,{\mathrm{d}}x_t}\), both being the c.d.f. of \(Z+N(0,t\sigma ^2)\) evaluated at x.

Thus, we have an expression for the bias at point x, denoted by B(x), as given below.

$$\begin{aligned} |B(x)| &= \left| \frac{1}{p}\sum _{t=0}^\infty {\lambda ^t\int _{-\infty }^{\infty }{\left[ \Phi \left( \frac{x-z}{\sqrt{t\sigma ^2 +b^2}}\right) -\Phi \left( \frac{x-z}{\sigma \sqrt{t}}\right) \right] h(z)\,{\mathrm{d}}z}}\right| \\ &\le \frac{1}{p}\underbrace{\left| \int _{-\infty }^{\infty }{{\mathbb {I}}(x-z)h(z)\,{\mathrm{d}}z} -\int _{-\infty }^{\infty }{\Phi \left( \frac{x-z}{b}\right) h(z)\,{\mathrm{d}}z}\right| }_{\textsf {Tr}_1} \\ &\quad + \frac{1}{p} \underbrace{\sum _{t=1}^\infty {|\lambda |^t\int _{-\infty }^{\infty }{\left| \Phi \left( \frac{x-z}{\sqrt{t\sigma ^2 +b^2}}\right) -\Phi \left( \frac{x-z}{\sigma \sqrt{t}}\right) \right| h(z)\,{\mathrm{d}}z}}}_{\textsf {Tr}_2}\\ \textsf {Tr}_1 &= \left| \int _{-\infty }^{\infty }{{\mathbb {I}}(x-z)h(z)\,{\mathrm{d}}z}-\int _{-\infty }^{\infty }{\Phi \left( \frac{x-z}{b}\right) h(z)\,{\mathrm{d}}z}\right| \\ &= \left| H(x)- \int _{-\infty }^{\infty }{\Phi \left( \frac{x-z}{b}\right) h(z)\,{\mathrm{d}}z}\right| \\ &= |H(x)-H^{\star }(x)| \end{aligned}$$

where \(H^{\star }(x)\) is the c.d.f. of \(Z+N(0,b^2)\) and Z is a random variable with p.d.f. h(.); here the \(t=0\) term has been interpreted through \(\lim _{s \downarrow 0}\Phi \left( \frac{x-z}{s}\right) ={\mathbb {I}}(x-z)\). As \(n \rightarrow \infty \), \(b \rightarrow 0\) and \(N(0,b^2) \overset{P}{\rightarrow } 0\). Thus, by Slutsky's theorem, \(Z+N(0,b^2) \Rightarrow Z\) in distribution as \(b \rightarrow 0\), and since H is continuous (Z has a density), \(|H(x)-H^{\star }(x)| \rightarrow 0\) as \(n \rightarrow \infty \).

For \(\textsf {Tr}_2\), since \(\Phi \left( \frac{x-z}{\sqrt{y}}\right) \) is continuously differentiable in y, expanding it by Taylor's theorem around \(y=t\sigma ^2\) gives \(\left| \Phi \left( \frac{x-z}{\sqrt{t\sigma ^2 +b^2}}\right) -\Phi \left( \frac{x-z}{\sigma \sqrt{t}}\right) \right| =\left| (t\sigma ^2+b^2 - t\sigma ^2)\cdot \frac{1}{2}\,\phi \left( \frac{x-z}{\sqrt{y^*}}\right) \frac{x-z}{y^{*3/2}}\right| \), where \(y^*\) is a point between \(t\sigma ^2\) and \(t\sigma ^2 +b^2\). Also, it is easy to show that \(|x\phi (x)| \le 1\) for all \(x \in {\mathbb {R}}\) (as discussed in Note 5.1), which implies \(\left| \phi \left( \frac{x-z}{\sqrt{y^*}}\right) \frac{x-z}{\sqrt{y^*}}\right| \le 1\), so the difference is at most \(\frac{b^2}{t\sigma ^2}\). Thus,

$$\begin{aligned} \textsf {Tr}_2 &= \sum _{t=1}^\infty {|\lambda |^t\int _{-\infty }^{\infty }{\left| \Phi \left( \frac{x-z}{\sqrt{t\sigma ^2 +b^2}}\right) -\Phi \left( \frac{x-z}{\sigma \sqrt{t}}\right) \right| h(z)\,{\mathrm{d}}z}} \\ &\le \sum _{t=1}^\infty {|\lambda |^t \frac{b^2}{t \sigma ^2}} \\ &\le \frac{b^2}{\sigma ^2} \sum _{t=1}^\infty {\frac{|\lambda |^t}{t}} \\ &= \frac{b^2}{\sigma ^2}\{-\log (1-|\lambda |)\},\quad {\text{where}}\;0<|\lambda |<1 \\ &= C\,b^2 \rightarrow 0 \quad {\text{as}}\;b \rightarrow 0,\;{\text{where C is a constant}} \end{aligned}$$

Thus, \(MSE(T_b(x))={\text{Var}}(T_b(x))+B(x)^2 \rightarrow 0\) as \(n \rightarrow \infty \). Hence, \(T_b(x) \overset{{\mathbb {L}}_2}{\longrightarrow } G(x)\) for every x as \(n \rightarrow \infty \), which implies the result. \(\square \)

Note 5.1 For \(|x| \le 1\), since \(\frac{1}{\sqrt{2\pi }}<1\) and \(e^{-x^2/2}\le 1\), we have \(|x\phi (x)|<1\). For \(|x|>1\), since \(|x\phi (x)|=|x|\phi (|x|)\), it can be written as \(\frac{1}{\sqrt{2\pi }}\frac{|x|}{e^{x^2/2}}=\frac{1}{\sqrt{2\pi }}\frac{1}{\frac{1}{|x|}+\frac{|x|}{2} + \delta }\le \frac{1}{\sqrt{2\pi }}\cdot \frac{2}{|x|}\le \frac{2}{\sqrt{2\pi }}<1\), where \(\delta > 0\) collects the higher-order terms of \(e^{x^2/2}/|x|\).

1.5 Proof of Theorem 2.5

Proof

For \(1 \le i \le n\), we have

$$\begin{aligned} &E[(Z_i-E(Z_i))^2(X^{\prime }_i-E(X^{\prime }_i))^2] \\ &\quad = p\, E[(Z_i-E(Z_i))^2(X^{\prime }_i-E(X^{\prime }_i))^2 \,|\,B_i=1] + (1-p)\,E[(Z_i-E(Z_i))^2(X^{\prime }_i-E(X^{\prime }_i))^2 \,|\,B_i=0] \\ &\quad = p\,E\left[ \frac{1}{n-1} \sum _{j=1,j \ne i}^n{(X_j-E(X_j))^2(X^{\prime }_i-E(X^{\prime }_i))^2}\right] +(1-p)\,E\left[ (X_i+Y_i-E(X_i+Y_i))^2(X^{\prime }_i-E(X^{\prime }_i))^2\right] \\ &\qquad [\because E[Y_i]=0\;{\text{and}}\; E[Z_i]=E[X_j]\;\forall \, 1 \le i,j \le n,\;{\text{as shown in the proof of Theorem 2.1}}] \\ &\quad = p\cdot {\text{Var}}(X^{\prime })\cdot {\text{Var}}(X) + (1-p)\cdot \left\{ d(X,X^{\prime })+ 2\cdot 0 +{\text{Var}}(Y)\cdot {\text{Var}}(X^{\prime }) \right\} < \infty \end{aligned}$$

Thus, \((Z_i-E(Z_i))(X^{\prime }_i-E(X^{\prime }_i))\) has finite variance, and hence \({\hat{Cov}}(Z,X^{\prime })=\frac{1}{n}\sum _{j=1}^n{Z_j X^{\prime }_j}-{\bar{Z}}\cdot \bar{X^{\prime }}\) is a consistent estimator of the true covariance between Z and \(X^{\prime }\).

$$\begin{aligned} Cov(Z,X^{\prime }) &= E(ZX^{\prime })-E(Z)\cdot E(X^{\prime }) \\ &= p\, E(X)\cdot E(X^{\prime })+ (1-p)\, E((X+Y)\cdot X^{\prime }) - E(Z)\cdot E(X^{\prime })\\ &= (1-p) \left[ E(X\cdot X^{\prime })- E(Z)\cdot E(X^{\prime })\right] \quad [\because E[Z]=E[X]\;{\text{and}}\; E[YX^{\prime }]=E[Y]E[X^{\prime }]=0] \\ &= (1-p)\cdot Cov(X,X^{\prime }) \end{aligned}$$

Thus, \({\hat{Cov}}(Z,X^{\prime })\) is a consistent estimator of \((1-p)\cdot Cov(X,X^{\prime })\); equivalently, \(\frac{1}{1-p}{\hat{Cov}}(Z,X^{\prime })\) is a consistent estimator of \(Cov(X,X^{\prime })\). Also, as seen in the proof of Theorem 2.1, Var(Z) exists whenever Var(X) does. Thus, \(\sqrt{{\hat{Var}}(Z)} \overset{P}{\longrightarrow } \sqrt{{\text{Var}}(Z)}\), and since \({\text{Var}}(X^{\prime })\) exists, \(\sqrt{{\hat{Var}}(X^{\prime })} \overset{P}{\longrightarrow } \sqrt{{\text{Var}}(X^{\prime })}\). Hence, by the properties of convergence in probability, the result follows. \(\square \)
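The rescaling used in this proof translates directly into an estimator of Cov(X, X′) from the masked variable and an unmasked covariate; a minimal sketch (ours) is:

```python
import numpy as np

def estimate_cov(z, x_prime, p):
    """Consistent estimate of Cov(X, X') from masked data z and the unmasked
    covariate x_prime, using Cov(Z, X') = (1 - p) Cov(X, X')."""
    z = np.asarray(z, dtype=float)
    x_prime = np.asarray(x_prime, dtype=float)
    cov_zx = np.mean(z * x_prime) - z.mean() * x_prime.mean()   # \hat{Cov}(Z, X')
    return cov_zx / (1 - p)
```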

Cite this article

Ghatak, D., Roy, B.K. Conditional Masking to Numerical Data. J Stat Theory Pract 13, 44 (2019). https://doi.org/10.1007/s42519-019-0042-y
