GPU Accelerated Bayesian Inference for Quasi-Identifier Discovery in High-Dimensional Data

Podlesny, Nikolai J.; Kayem, Anne V. D. M.; Meinel, Christoph

doi:10.1007/978-3-030-75075-6_40

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 226)

Included in the following conference series:

International Conference on Advanced Information Networking and Applications

Abstract

Determining unique attribute combinations as quasi-identifiers is a common starting point for both re-identification attacks and data anonymisation schemes. Efficiently discovering these quasi-identifiers (QIDs) is combinatorially hard: it is an enumeration problem [1,2,3] and W[2]-complete [4,5,6]. Proper privacy guarantees are required to meet the highest ethical standards and privacy legislation such as CCPA or GDPR, yet they must also permit modern data-driven business models based on monetising corporate data pools. In this work, we offer three main contributions. First, we contribute an algorithm that vectorises the QID search. This QID discovery is based on Bayesian inference detection, which usually suffers a state-space explosion for large-scale datasets. By utilising GPU acceleration to execute the vectorised algorithm, we counter the state-space explosion issue raised by Bayesian networks. Second, we show its applicability to anonymising high-dimensional data, which suffers high information loss under standard anonymisation approaches. Third, we offer an empirical model that compares multiple optimisations to discover all QIDs in near real-time, even in large-scale datasets. The latter is particularly useful in digital health settings, where algorithmic execution time can influence life-and-death triage. Finally, we point out that the same approach can foster de-anonymisation attacks on already published datasets. A demonstration is enclosed that re-identifies individuals from Mount Vernon, NY and Southern California in a published Twitter dataset on the 2020 US Presidential Election.
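
To make the search problem in the abstract concrete, the following minimal sketch enumerates minimal unique attribute combinations by brute force. It is not the vectorised, GPU-accelerated Bayesian algorithm the paper contributes; it only illustrates why the candidate space grows combinatorially with the number of attributes. The function name `minimal_qids`, the toy table, and the strict reading of "unique" (a combination counts as a QID only if it distinguishes every row) are assumptions made for illustration.

```python
from itertools import combinations


def minimal_qids(rows, columns, max_size=None):
    """Return minimal column combinations whose projected values are unique per row.

    Assumption: a combination is treated as a QID only when it separates
    *all* rows, which is the strictest reading of "unique attribute combination".
    """
    n = len(rows)
    max_size = max_size or len(columns)
    found = []  # index tuples of QIDs discovered so far, smallest first
    for size in range(1, max_size + 1):
        for combo in combinations(range(len(columns)), size):
            # A superset of a known QID is unique too, but not minimal: skip it.
            if any(set(q) <= set(combo) for q in found):
                continue
            projected = {tuple(row[i] for i in combo) for row in rows}
            if len(projected) == n:
                found.append(combo)
    return [tuple(columns[i] for i in combo) for combo in found]


# Hypothetical toy table; the values are invented for the example.
columns = ["zip", "age", "sex"]
rows = [
    ("10550", 34, "F"),
    ("10550", 51, "F"),
    ("90001", 34, "M"),
    ("90001", 51, "F"),
]
qids = minimal_qids(rows, columns)
print(qids)
```

In this toy table no single attribute is unique, but the pair of `zip` and `age` separates every row. With m attributes the loop may inspect up to 2^m - 1 combinations, which is the blow-up that motivates vectorised and GPU-based search.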

References

1. Nickolls, J., Dally, W.J.: The GPU computing era. IEEE Micro 30(2), 56–69 (2010)
2. Owens, J.D., Houston, M., Luebke, D., Green, S., Stone, J.E., Phillips, J.C.: GPU computing. Proc. IEEE 96(5), 879–899 (2008)
3. Cook, C., Zhao, H., Sato, T., Hiromoto, M., Tan, S.X.D.: GPU-based Ising computing for solving max-cut combinatorial optimization problems. Integration 69, 335–344 (2019)
4. Podlesny, N.J., Kayem, A.V., Meinel, C.: Attribute compartmentation and greedy UCC discovery for high-dimensional data anonymization. In: Proceedings of the Ninth ACM Conference on Data and Application Security and Privacy, pp. 109–119. ACM (2019)
5. Bläsius, T., Friedrich, T., Lischeid, J., Meeks, K., Schirneck, M.: Efficiently enumerating hitting sets of hypergraphs arising in data profiling. In: Algorithm Engineering and Experiments (ALENEX), pp. 130–143 (2019)
6. Bläsius, T., Friedrich, T., Schirneck, M.: The parameterized complexity of dependency detection in relational databases. In: Guo, J., Hermelin, D. (eds.) International Symposium on Parameterized and Exact Computation (IPEC), Leibniz International Proceedings in Informatics (LIPIcs), Dagstuhl, Germany, vol. 63, pp. 6:1–6:13 (2016). Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik
7. Barth-Jones, D.: The 're-identification' of Governor William Weld's medical information: a critical re-examination of health data identification risks and privacy protections, then and now (2012)
8. Price, W.N., Cohen, I.G.: Privacy in the age of medical big data. Nature Med. 25(1), 37–43 (2019)
9. Zhu, L., Jin, H., Zheng, R., Feng, X.: Effective Naive Bayes nearest neighbor based image classification on GPU. J. Supercomput. 68(2), 820–848 (2014)
10. Viegas, F., Gonçalves, M.A., Martins, W., Rocha, L.: Parallel lazy semi-Naive Bayes strategies for effective and efficient document classification. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1071–1080 (2015)
11. Andrade, G., Viegas, F., Ramos, G.S., Almeida, J., Rocha, L., Gonçalves, M., Ferreira, R.: GPU-NB: a fast CUDA-based implementation of Naive Bayes. In: 2013 25th International Symposium on Computer Architecture and High Performance Computing, pp. 168–175. IEEE (2013)
12. Chen, F.C., Jahanshahi, M.R.: NB-CNN: deep learning-based crack detection using convolutional neural network and Naïve Bayes data fusion. IEEE Trans. Ind. Electron. 65(5), 4392–4400 (2017)
13. Gruber, L., et al.: GPU-accelerated Bayesian learning and forecasting in simultaneous graphical dynamic linear models. Bayesian Anal. 11(1), 125–149 (2016)
14. Ng, W.S., Kirchberg, M., Bressan, S., Tan, K.L.: Towards a privacy-aware stream data management system for cloud applications. Int. J. Web Grid Serv. 7(3), 246–267 (2011)
15. Kalidoss, T., Sannasi, G., Lakshmanan, S., Kanagasabai, K., Kannan, A.: Data anonymisation of vertically partitioned data using map reduce techniques on cloud. Int. J. Commun. Netw. Distrib. Syst. 20(4), 519–531 (2018)
16. Solanki, P., Garg, S., Chhinkaniwala, H.: Heuristic-based hybrid privacy-preserving data stream mining approach using SD-perturbation and multi-iterative k-anonymisation. Int. J. Knowl. Eng. Data Min. 5(4), 306–332 (2018)
17. Podlesny, N.J., Kayem, A.V., Meinel, C.: Towards identifying de-anonymisation risks in distributed health data silos. In: International Conference on Database and Expert Systems Applications, pp. 33–43. Springer (2019)
18. Podlesny, N.J., Kayem, A.V., Meinel, C.: Identifying data exposure across high-dimensional health data silos through Bayesian networks optimised by multigrid and manifold. In: IEEE 17th International Conference on Dependable, Autonomic and Secure Computing, DASC 2019. IEEE (2019)
19. Nayahi, J.J.V., Kavitha, V.: Privacy and utility preserving data clustering for data anonymization and distribution on Hadoop. Future Gener. Comput. Syst. 74, 393–408 (2017)
20. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters (2004)
21. Podlesny, N.J.: Synthetic genome data (2021)
22. Sabuncu, I.: USA Nov.2020 election 20 mil. tweets (with sentiment and party name labels) dataset (2020)

Author information

Authors and Affiliations

Hasso-Plattner-Institute, Potsdam, Germany
Nikolai J. Podlesny, Anne V. D. M. Kayem & Christoph Meinel

Corresponding author

Correspondence to Nikolai J. Podlesny.

Editor information

Editors and Affiliations

Department of Information and Communication Engineering, Fukuoka Institute of Technology, Fukuoka, Japan
Leonard Barolli
Department of Computer Science, Ryerson University, Toronto, ON, Canada
Isaac Woungang
Faculty of Business Administration, Rissho University, Tokyo, Japan
Tomoya Enokido

About this paper

Cite this paper

Podlesny, N.J., Kayem, A.V.D.M., Meinel, C. (2021). GPU Accelerated Bayesian Inference for Quasi-Identifier Discovery in High-Dimensional Data. In: Barolli, L., Woungang, I., Enokido, T. (eds) Advanced Information Networking and Applications. AINA 2021. Lecture Notes in Networks and Systems, vol 226. Springer, Cham. https://doi.org/10.1007/978-3-030-75075-6_40

DOI: https://doi.org/10.1007/978-3-030-75075-6_40
Published: 27 April 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-75074-9
Online ISBN: 978-3-030-75075-6
eBook Packages: Intelligent Technologies and Robotics (R0)
