Abstract
In this age of globalization, organizations need to publish their micro-data owing to legal directives or share it with business associates in order to remain competitive. This puts personal privacy at risk. To mitigate this risk, attributes that clearly identify individuals, such as Name, Social Security Number, and Driving License Number, are generally removed or replaced by random values. But this may not be enough, because such de-identified databases can sometimes be joined with other public databases on attributes such as Gender, Date of Birth, and Zipcode to re-identify individuals who were supposed to remain anonymous. In the literature, such an identity-leaking attribute combination is called a quasi-identifier. It is critical to be able to recognize quasi-identifiers and to apply appropriate protective measures to them in order to mitigate the identity disclosure risk posed by join attacks.
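The join attack described above can be illustrated with a minimal sketch. All records, column names, and the `join_attack` helper below are hypothetical and invented for illustration; they are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical de-identified release: names are removed, but the
# quasi-identifier columns (gender, dob, zipcode) are still present.
released = [
    {"gender": "F", "dob": "1970-03-12", "zipcode": "02139", "diagnosis": "asthma"},
    {"gender": "M", "dob": "1982-11-02", "zipcode": "02139", "diagnosis": "diabetes"},
    {"gender": "F", "dob": "1970-03-12", "zipcode": "94305", "diagnosis": "flu"},
]

# Hypothetical public database (for example, a voter list) that carries names.
public = [
    {"name": "Alice", "gender": "F", "dob": "1970-03-12", "zipcode": "02139"},
    {"name": "Bob", "gender": "M", "dob": "1982-11-02", "zipcode": "02139"},
    {"name": "Carol", "gender": "F", "dob": "1990-01-01", "zipcode": "94305"},
]

QUASI_IDENTIFIER = ("gender", "dob", "zipcode")


def join_attack(released_rows, public_rows, columns):
    """Return (name, diagnosis) pairs for released rows that match exactly
    one public row on the given columns."""
    index = defaultdict(list)
    for row in public_rows:
        index[tuple(row[c] for c in columns)].append(row["name"])

    matches = []
    for row in released_rows:
        names = index.get(tuple(row[c] for c in columns), [])
        if len(names) == 1:  # a unique match re-identifies the individual
            matches.append((names[0], row["diagnosis"]))
    return matches


reidentified = join_attack(released, public, QUASI_IDENTIFIER)
```

In this toy data, joining on all three columns links two released rows to a single name each, whereas joining on `gender` alone leaves the two female records ambiguous. This is the sense in which a combination of individually harmless attributes can leak identity.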
In this paper, we start out by providing the first formal characterization of quasi-identifiers and a practical technique to identify them. We show an interesting connection between whether a set of columns forms a quasi-identifier and the number of distinct values assumed by the combination of the columns. We then use this characterization to come up with a probabilistic notion of anonymity. Again, we show an interesting connection between the number of distinct values taken by a combination of columns and the anonymity it can offer. This allows us to find an ideal amount of generalization or suppression to apply to different columns in order to achieve probabilistic anonymity. We work through many examples and show that our analysis can be used to make a published database conform to privacy rules like HIPAA. In order to achieve probabilistic anonymity, we observe that one needs to solve multiple 1-dimensional k-anonymity problems. We propose many efficient and scalable algorithms for achieving 1-dimensional anonymity. Our algorithms are optimal in the sense that they minimally distort data and retain much of its utility.
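To make the two ideas above concrete, here is a rough sketch under stated assumptions. The abstract does not give the paper's formal definitions or algorithms, so both functions are illustrative stand-ins: `distinct_ratio` uses the fraction of distinct value combinations as an assumed proxy for how identifying a set of columns is, and `generalize_1d` is a simple greedy grouping for a single numeric column, not one of the authors' optimal algorithms.

```python
def distinct_ratio(rows, columns):
    """Fraction of rows that carry a distinct value combination on `columns`.

    Assumed proxy: a ratio near 1.0 means almost every row is unique on
    these columns, so they behave like a quasi-identifier.
    """
    if not rows:
        return 0.0
    combos = {tuple(row[c] for c in columns) for row in rows}
    return len(combos) / len(rows)


def generalize_1d(values, k):
    """Greedy 1-dimensional k-anonymization of one numeric column.

    Sorts the values, cuts them into consecutive groups of k, folds any
    leftover of fewer than k values into the last group, and replaces each
    value by its group's (low, high) range. Equal values may straddle two
    groups, so neighbouring ranges can share an endpoint.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(values) < k:
        raise ValueError("need at least k values to form one group")

    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [None] * len(values)
    n = len(order)
    start = 0
    while start < n:
        end = start + k
        if n - end < k:  # too few values left for another full group
            end = n
        group = order[start:end]
        low, high = values[group[0]], values[group[-1]]
        for i in group:
            result[i] = (low, high)
        start = end
    return result


# Hypothetical usage on a small Age column with k = 2.
ages = [34, 29, 41, 30, 35]
generalized_ages = generalize_1d(ages, 2)
```

Every released range in `generalized_ages` is shared by at least `k` records, which is the 1-dimensional anonymity condition. The greedy cut makes no claim of minimal distortion.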
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lodha, S., Thomas, D. (2008). Probabilistic Anonymity. In: Bonchi, F., Ferrari, E., Malin, B., Saygin, Y. (eds) Privacy, Security, and Trust in KDD. PInKDD 2007. Lecture Notes in Computer Science, vol 4890. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78478-4_4
DOI: https://doi.org/10.1007/978-3-540-78478-4_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78477-7
Online ISBN: 978-3-540-78478-4
eBook Packages: Computer Science (R0)