Multiple Instance Learning for Group Record Linkage

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2012)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7301)

Abstract

Record linkage is the process of identifying records from different data sources that refer to the same entities. While most research efforts are concerned with linking individual records, new approaches have recently been proposed to link groups of records across databases. Group record linkage aims to determine whether two groups of records in two databases refer to the same entity. One application where group record linkage is of high importance is the linking of census data that contain household information across time. In this paper we propose a novel method for group record linkage based on multiple instance learning. Our method treats group links as bags and individual record links as instances. We extend multiple instance learning from bag classification to instance classification in order to reconstruct bags from candidate instances. The classified bag and instance samples lead to a significant reduction in multiple group links, thereby improving the overall quality of the linked data. We evaluate our method on both synthetic data and real historical census data.
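To make the bag and instance terminology concrete, the following is a minimal Python sketch of the data layout only, not the method proposed in the paper. The class names `RecordLink` and `GroupLink`, the `similarity` field, the threshold, and the max-based bag score (the standard multiple instance assumption that a bag is positive if at least one instance is positive) are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RecordLink:
    """An instance: a candidate link between one record in each database."""
    record_a: str
    record_b: str
    similarity: float  # hypothetical pairwise similarity score in [0, 1]


@dataclass
class GroupLink:
    """A bag: all candidate record links between two groups (e.g. households)."""
    group_a: str
    group_b: str
    instances: List[RecordLink]


def bag_score(bag: GroupLink) -> float:
    """Score a bag by its best instance (standard MIL assumption, assumed here)."""
    return max((link.similarity for link in bag.instances), default=0.0)


def resolve_multiple_links(bags: List[GroupLink],
                           threshold: float = 0.5) -> Dict[str, str]:
    """Keep at most one group link per group in database A.

    Among the bags that share the same group_a, only the highest-scoring
    bag at or above the threshold survives, which removes multiple group links.
    """
    best: Dict[str, Tuple[float, str]] = {}
    for bag in bags:
        score = bag_score(bag)
        if score < threshold:
            continue
        if bag.group_a not in best or score > best[bag.group_a][0]:
            best[bag.group_a] = (score, bag.group_b)
    return {group_a: group_b for group_a, (_, group_b) in best.items()}
```

In this sketch a household that is a candidate match for several households in the other database ends up with a single link, which is the kind of reduction in multiple group links the abstract refers to. The paper itself learns bag and instance classifiers rather than using a fixed threshold.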

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fu, Z., Zhou, J., Christen, P., Boot, M. (2012). Multiple Instance Learning for Group Record Linkage. In: Tan, PN., Chawla, S., Ho, C.K., Bailey, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2012. Lecture Notes in Computer Science, vol 7301. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30217-6_15

  • DOI: https://doi.org/10.1007/978-3-642-30217-6_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-30216-9

  • Online ISBN: 978-3-642-30217-6

  • eBook Packages: Computer Science, Computer Science (R0)
