On the Privacy Guarantees of Synthetic Data: A Reassessment from the Maximum-Knowledge Attacker Perspective

  • Conference paper
Privacy in Statistical Databases (PSD 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11126)

Abstract

Generating synthetic data to disseminate individual information in a privacy-preserving way is often presented as superior to other statistical disclosure control techniques. The reason for this claim is straightforward at first glance: since all disseminated records are synthetic rather than actually observed values, no individual can reasonably claim to face a privacy threat. Thus, if the synthesizer used is good enough, synthetic data will potentially always offer a high level of information with a low disclosure risk. Building on recent advances in the literature on the conceptualization of an intruder, this paper challenges this claim by reassessing the privacy guarantees of synthetic data. Using the concept of a maximum-knowledge intruder, we demonstrate that synthetic data can in fact always be expressed as a re-arrangement of the original data and that, as a result, they may lead to configurations in which disclosure risk is higher than for non-synthetic disclosure control approaches. We illustrate the application of these results with an empirical example.
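
The re-arrangement claim can be illustrated with a small sketch. It is not the authors' implementation: it assumes a single numerical attribute, as many synthetic records as original ones, and no ties, and the function name `reverse_map` and the sample values are hypothetical. Each synthetic value is replaced by the original value holding the same rank, so the result contains exactly the original values in a different order.

```python
import numpy as np


def reverse_map(original, synthetic):
    """Replace each synthetic value by the original value of the same rank.

    Hypothetical helper for illustration only. Assumes one attribute,
    equal numbers of original and synthetic records, and no ties.
    """
    original = np.asarray(original)
    synthetic = np.asarray(synthetic)
    if original.shape != synthetic.shape:
        raise ValueError("original and synthetic must have the same length")
    # 0-based rank of every synthetic value within the synthetic attribute.
    ranks = np.argsort(np.argsort(synthetic, kind="stable"), kind="stable")
    # The value of rank r in the original attribute is sorted(original)[r].
    return np.sort(original)[ranks]


# Made-up values, not taken from the paper.
original = np.array([52.0, 31.0, 47.0, 60.0, 28.0])
synthetic = np.array([49.5, 33.2, 58.1, 44.0, 30.7])

reverse_mapped = reverse_map(original, synthetic)
print(reverse_mapped)  # [52. 31. 60. 47. 28.]
```

In this toy case the reverse-mapped attribute is a permutation of the original one in which the first two records keep their original values, which is the kind of configuration a maximum-knowledge intruder, who knows both data sets, could examine.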

Notes

  1. Using these notations, \( o_{ij} \) is the rank of attribute \( j \) in original record \( i \) and \( s_{lj}^{m} \) is the rank of attribute \( j \) in synthetic record \( l \) of the \( m \)th synthetic data set.

  2. The two other attributes are not shown here due to space constraints, but their reverse-mapped versions can be displayed in exactly the same way.

Acknowledgments and Disclaimer

The following funding sources are gratefully acknowledged by the third author: European Commission (project H2020-700540 “CANVAS”), Government of Catalonia (ICREA Acadèmia Prize) and Spanish Government (projects TIN2014-57364-C2-1-R “SmartGlacis” and TIN2015-70054-REDC). The views in this paper are the authors’ own and do not necessarily reflect the views of UNESCO or any of the funders.

Corresponding author

Correspondence to Nicolas Ruiz.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Ruiz, N., Muralidhar, K., Domingo-Ferrer, J. (2018). On the Privacy Guarantees of Synthetic Data: A Reassessment from the Maximum-Knowledge Attacker Perspective. In: Domingo-Ferrer, J., Montes, F. (eds) Privacy in Statistical Databases. PSD 2018. Lecture Notes in Computer Science, vol 11126. Springer, Cham. https://doi.org/10.1007/978-3-319-99771-1_5

  • DOI: https://doi.org/10.1007/978-3-319-99771-1_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99770-4

  • Online ISBN: 978-3-319-99771-1

  • eBook Packages: Computer Science, Computer Science (R0)
