Post-processing Methods for High Quality Privacy-Preserving Record Linkage

  • Conference paper
Data Privacy Management, Cryptocurrencies and Blockchain Technology (DPM 2018, CBT 2018)

Abstract

Privacy-preserving record linkage (PPRL) supports the integration of person-related data from different sources while protecting the privacy of individuals by encoding the sensitive information needed for linkage. The use of encoded data makes it challenging to achieve high linkage quality, in particular for dirty data containing errors or inconsistencies. Moreover, person-related data is often dense, e.g., due to frequent names or addresses, which leads to high similarities even for non-matches. Both effects are hard to handle in common PPRL approaches, which rely on a simple threshold-based classification to decide whether a record pair is considered a match. In particular, dirty or dense data are likely to produce many multi-links, in which a person is wrongly linked to more than one other person. We therefore propose post-processing methods for resolving multi-links and outline three possible approaches. In our evaluation on large synthetic and real datasets, we compare these approaches with each other and show that post-processing is highly beneficial and can significantly increase linkage quality in terms of both precision and F-measure.
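To make the multi-link problem concrete, the following minimal Python sketch classifies record pairs with a plain similarity threshold and then applies one conceivable post-processing rule: keep a link only if both records are each other's best match. This rule is an illustrative assumption and is not necessarily one of the three approaches outlined in the paper. The function names, record identifiers, and similarity values are all hypothetical.

```python
from collections import defaultdict


def threshold_links(similarities, threshold):
    """Threshold-based classification: every pair at or above the threshold is a link."""
    return {pair: sim for pair, sim in similarities.items() if sim >= threshold}


def multi_link_records(links):
    """Return the records (tagged by source) that take part in more than one link."""
    degree = defaultdict(int)
    for a, b in links:
        degree[("A", a)] += 1
        degree[("B", b)] += 1
    return {record for record, count in degree.items() if count > 1}


def mutual_best_matches(links):
    """Illustrative post-processing: keep only links that are the best for both records."""
    best_for_a, best_for_b = {}, {}
    for (a, b), sim in links.items():
        if a not in best_for_a or sim > best_for_a[a][1]:
            best_for_a[a] = (b, sim)
        if b not in best_for_b or sim > best_for_b[b][1]:
            best_for_b[b] = (a, sim)
    return {
        (a, b): sim
        for a, (b, sim) in best_for_a.items()
        if best_for_b[b][0] == a
    }


# Hypothetical similarities between records of source A and source B.
example_similarities = {
    ("a1", "b1"): 0.95,
    ("a1", "b2"): 0.82,  # dense data: a non-match that still scores highly
    ("a2", "b2"): 0.90,
    ("a3", "b3"): 0.60,
}

links = threshold_links(example_similarities, threshold=0.8)
print(sorted(multi_link_records(links)))   # records involved in multi-links
print(sorted(mutual_best_matches(links)))  # one-to-one links after post-processing
```

In this toy input, the threshold alone links `a1` to both `b1` and `b2`, and `b2` to both `a1` and `a2`. The mutual-best-match rule removes the weaker link `("a1", "b2")`, so each record has at most one partner.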

Notes

  1. https://www.destatis.de/DE/Methoden/Zensus_/Zensus.html.

  2. http://www.ncsbe.gov/.

Author information

Corresponding author

Correspondence to Martin Franke.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Franke, M., Sehili, Z., Gladbach, M., Rahm, E. (2018). Post-processing Methods for High Quality Privacy-Preserving Record Linkage. In: Garcia-Alfaro, J., Herrera-Joancomartí, J., Livraga, G., Rios, R. (eds) Data Privacy Management, Cryptocurrencies and Blockchain Technology. DPM 2018, CBT 2018. Lecture Notes in Computer Science, vol 11025. Springer, Cham. https://doi.org/10.1007/978-3-030-00305-0_19

  • DOI: https://doi.org/10.1007/978-3-030-00305-0_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00304-3

  • Online ISBN: 978-3-030-00305-0

  • eBook Packages: Computer Science, Computer Science (R0)
