Overview of Data Linkage Methods for Integrating Separate Health Data Sources

  • Chapter
Data Science for Healthcare

Abstract

Health data sources across healthcare service delivery are notoriously disconnected, which hampers good use of the data. Hospitals, family doctors, pharmacists, and health insurers all keep their own data, even though these data may contain information about the same patients. Industries offering healthcare and wellness services likewise host and maintain their own data repositories about the patients they serve. Lastly, governmental organizations collect register and survey data on public health, healthcare utilization, and health outcomes. Linking health data on individuals, events, and locations at various aggregation levels from different sources can be extremely insightful: more information can be drawn from linked data than from each data source separately. Bringing together patient data for which unique personal identifiers exist, such as social security numbers, is rather straightforward. In many practices, however, such identifiers are simply lacking, so one has to resort to variables that are not necessarily unique to a person, which makes the task of linking data far more challenging. To make things worse, these linking variables come with errors due to misspellings, coding differences, or transcription mistakes. Nevertheless, data linkage needs to be done flawlessly, as connecting the wrong patient records or missing valuable connections between patient records can result in biased analyses of the linked datasets. This chapter provides a state-of-the-art survey of data linkage technology within healthcare. It gives (1) an overview of the various methods in data linkage, including deterministic and probabilistic approaches, and (2) a synthesis of healthcare use cases in which data linkage is essential, together with a discussion of the legal and privacy challenges of using data linkage in healthcare.
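
To make the contrast between deterministic and probabilistic linkage concrete, the following is a minimal sketch and not the chapter's own method. It assumes three hypothetical linking variables (surname, birth date, postcode), made-up m- and u-probabilities, and uses Python's `difflib.SequenceMatcher` as a simple stand-in for a dedicated string comparator. The deterministic rule demands exact agreement on every variable; the probabilistic rule sums Fellegi-Sunter style log-likelihood weights, so a single misspelling does not prevent a link.

```python
import math
from difflib import SequenceMatcher

# Illustrative m- and u-probabilities (assumed values, not estimated from data):
#   m = P(field agrees | records belong to the same person)
#   u = P(field agrees | records belong to different people)
FIELD_PARAMS = {
    "surname": (0.95, 0.01),
    "birth_date": (0.98, 0.003),
    "postcode": (0.90, 0.05),
}


def deterministic_match(a, b, keys=("surname", "birth_date", "postcode")):
    """Link two records only if every linking variable agrees exactly."""
    return all(a[k] == b[k] for k in keys)


def agrees(x, y, threshold=0.85):
    """Approximate agreement that tolerates small spelling differences."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio() >= threshold


def match_weight(a, b):
    """Sum of log2 likelihood ratios over all fields (Fellegi-Sunter style)."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if agrees(a[field], b[field]):
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


# Hypothetical records: the first two describe the same person,
# but the surname is spelled differently in the second source.
hospital = {"surname": "Jansen", "birth_date": "1970-03-14", "postcode": "5611AB"}
pharmacy = {"surname": "Janssen", "birth_date": "1970-03-14", "postcode": "5611AB"}
stranger = {"surname": "de Vries", "birth_date": "1985-11-02", "postcode": "1012JS"}

if __name__ == "__main__":
    print("deterministic:", deterministic_match(hospital, pharmacy))
    print("weight (same person):", round(match_weight(hospital, pharmacy), 2))
    print("weight (different person):", round(match_weight(hospital, stranger), 2))
```

In this sketch the exact-match rule misses the hospital and pharmacy pair because of the spelling variant, while the summed weight for that pair is positive and the weight for the unrelated pair is negative. In practice the weights would be compared against an upper and a lower threshold, with pairs in between sent for manual review, and the m- and u-probabilities would be estimated from the data rather than assumed.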

Corresponding author

Correspondence to Ana Kostadinovska.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Kostadinovska, A., Asim, M., Pletea, D., Pauws, S. (2019). Overview of Data Linkage Methods for Integrating Separate Health Data Sources. In: Consoli, S., Reforgiato Recupero, D., Petković, M. (eds) Data Science for Healthcare. Springer, Cham. https://doi.org/10.1007/978-3-030-05249-2_8

  • DOI: https://doi.org/10.1007/978-3-030-05249-2_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-05248-5

  • Online ISBN: 978-3-030-05249-2

  • eBook Packages: Computer Science, Computer Science (R0)
