Abstract
Determining unique attribute combinations that act as quasi-identifiers (QIDs) is a common starting point for both re-identification attacks and data anonymisation schemes. Discovering these QIDs efficiently is combinatorially hard: it is an enumeration problem [1,2,3] and W[2]-complete [4,5,6]. Proper privacy guarantees are required to meet high ethical standards and privacy legislation such as CCPA or GDPR, while still enabling modern data-driven business models based on monetising corporate data pools. In this work, we offer three main contributions. First, we contribute an algorithm that vectorises the QID search. The QID discovery is based on Bayesian inference detection, which usually suffers a state-space explosion on large-scale datasets; executing the vectorised algorithm on GPUs counters this explosion. Second, we show its applicability to anonymising high-dimensional data, which suffers high information loss under standard anonymisation approaches. Third, we offer an empirical model that compares multiple optimisations for discovering all QIDs in near real-time, even in large-scale datasets. This is especially useful in digital health settings, where algorithmic execution time can influence life-and-death triage. Finally, we point out that the same approach can foster de-anonymisation attacks on already published datasets. We include a demonstration that re-identifies individuals from Mount Vernon, NY and Southern California in a published Twitter dataset on the 2020 US Presidential Election.
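To make the notion of a QID search concrete, the following is a minimal sketch in Python with NumPy. It is not the Bayesian-inference algorithm of the paper, which the abstract does not specify; it is a plain brute-force uniqueness check. It assumes that a QID is an attribute combination under which at least one record is unique, and the function names, the toy table, and the `max_size` parameter are all hypothetical. The per-combination check is expressed as array operations, which is the kind of step that an array library could in principle run on a GPU.

```python
from itertools import combinations

import numpy as np


def unique_row_mask(table, cols):
    """Return a boolean mask marking rows whose values on `cols` occur exactly once."""
    sub = table[:, list(cols)]
    # Group identical rows; `inverse` maps each row to its group, `counts` holds group sizes.
    _, inverse, counts = np.unique(
        sub, axis=0, return_inverse=True, return_counts=True
    )
    return counts[inverse.ravel()] == 1


def find_minimal_qids(table, max_size):
    """Enumerate minimal column combinations that single out at least one row.

    Assumption: a combination counts as a QID if any row is unique under it.
    """
    found = []
    for size in range(1, max_size + 1):
        for cols in combinations(range(table.shape[1]), size):
            # Skip supersets of combinations already found, so results stay minimal.
            if any(set(q) <= set(cols) for q in found):
                continue
            if unique_row_mask(table, cols).any():
                found.append(cols)
    return found


# Hypothetical integer-encoded table: one row per record, one column per attribute.
table = np.array(
    [
        [0, 0, 0],
        [0, 0, 1],
        [0, 1, 0],
        [0, 1, 1],
        [1, 0, 0],
        [1, 0, 0],
    ]
)

qids = find_minimal_qids(table, max_size=3)
print(qids)
```

In this toy table no single column and no other pair isolates a record, but columns 1 and 2 together do. The number of candidate combinations grows exponentially with the number of attributes, which illustrates why the abstract treats efficient discovery as the core difficulty.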
References
Nickolls, J., Dally, W.J.: The GPU computing era. IEEE Micro 30(2), 56–69 (2010)
Owens, J.D., Houston, M., Luebke, D., Green, S., Stone, J.E., Phillips, J.C.: GPU computing. Proc. IEEE 96(5), 879–899 (2008)
Cook, C., Zhao, H., Sato, T., Hiromoto, M., Tan, S.X.D.: GPU-based Ising computing for solving max-cut combinatorial optimization problems. Integration 69, 335–344 (2019)
Podlesny, N.J., Kayem, A.V., Meinel, C.: Attribute compartmentation and greedy UCC discovery for high-dimensional data anonymization. In: Proceedings of the Ninth ACM Conference on Data and Application Security and Privacy, pp. 109–119. ACM (2019)
Bläsius, T., Friedrich, T., Lischeid, J., Meeks, K., Schirneck, M.: Efficiently enumerating hitting sets of hypergraphs arising in data profiling. In: Algorithm Engineering and Experiments (ALENEX), pp. 130–143 (2019)
Bläsius, T., Friedrich, T., Schirneck, M.: The parameterized complexity of dependency detection in relational databases. In: Guo, J., Hermelin, D. (eds.) International Symposium on Parameterized and Exact Computation (IPEC), Leibniz International Proceedings in Informatics (LIPIcs), Dagstuhl, Germany, vol. 63, pp. 6:1–6:13 (2016). Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik
Barth-Jones, D.: The 're-identification' of Governor William Weld's medical information: a critical re-examination of health data identification risks and privacy protections, then and now (2012)
Price, W.N., Cohen, I.G.: Privacy in the age of medical big data. Nature Med. 25(1), 37–43 (2019)
Zhu, L., Jin, H., Zheng, R., Feng, X.: Effective Naive Bayes nearest neighbor based image classification on GPU. J. Supercomput. 68(2), 820–848 (2014)
Viegas, F., Gonçalves, M.A., Martins, W., Rocha, L.: Parallel lazy semi-Naive Bayes strategies for effective and efficient document classification. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1071–1080 (2015)
Andrade, G., Viegas, F., Ramos, G.S., Almeida, J., Rocha, L., Gonçalves, M., Ferreira, R.: GPU-NB: a fast CUDA-based implementation of Naive Bayes. In: 2013 25th International Symposium on Computer Architecture and High Performance Computing, pp. 168–175. IEEE (2013)
Chen, F.C., Jahanshahi, M.R.: NB-CNN: deep learning-based crack detection using convolutional neural network and Naïve Bayes data fusion. IEEE Trans. Ind. Electron. 65(5), 4392–4400 (2017)
Gruber, L., et al.: GPU-accelerated Bayesian learning and forecasting in simultaneous graphical dynamic linear models. Bayesian Anal. 11(1), 125–149 (2016)
Ng, W.S., Kirchberg, M., Bressan, S., Tan, K.L.: Towards a privacy-aware stream data management system for cloud applications. Int. J. Web Grid Serv. 7(3), 246–267 (2011)
Kalidoss, T., Sannasi, G., Lakshmanan, S., Kanagasabai, K., Kannan, A.: Data anonymisation of vertically partitioned data using map reduce techniques on cloud. Int. J. Commun. Netw. Distrib. Syst. 20(4), 519–531 (2018)
Solanki, P., Garg, S., Chhinkaniwala, H.: Heuristic-based hybrid privacy-preserving data stream mining approach using SD-perturbation and multi-iterative k-anonymisation. Int. J. Knowl. Eng. Data Min. 5(4), 306–332 (2018)
Podlesny, N.J., Kayem, A.V., Meinel, C.: Towards identifying de-anonymisation risks in distributed health data silos. In: International Conference on Database and Expert Systems Applications, pp. 33–43. Springer (2019)
Podlesny, N.J., Kayem, A.V., Meinel, C.: Identifying data exposure across high-dimensional health data silos through Bayesian networks optimised by multigrid and manifold. In: IEEE 17th International Conference on Dependable, Autonomic and Secure Computing, DASC 2019. IEEE (2019)
Nayahi, J.J.V., Kavitha, V.: Privacy and utility preserving data clustering for data anonymization and distribution on hadoop. Future Gener. Comput. Syst. 74, 393–408 (2017)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters (2004)
Podlesny, N.J.: Synthetic genome data (2021)
Sabuncu, I.: USA Nov.2020 election 20 mil. tweets (with sentiment and party name labels) dataset (2020)
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Podlesny, N.J., Kayem, A.V.D.M., Meinel, C. (2021). GPU Accelerated Bayesian Inference for Quasi-Identifier Discovery in High-Dimensional Data. In: Barolli, L., Woungang, I., Enokido, T. (eds) Advanced Information Networking and Applications. AINA 2021. Lecture Notes in Networks and Systems, vol 226. Springer, Cham. https://doi.org/10.1007/978-3-030-75075-6_40
DOI: https://doi.org/10.1007/978-3-030-75075-6_40
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-75074-9
Online ISBN: 978-3-030-75075-6
eBook Packages: Intelligent Technologies and Robotics (R0)