Entity Matching with String Transformation and Similarity-Based Features

Sakai, Kazunori; Dong, Yuyang; Oyamada, Masafumi; Takeoka, Kunihiro; Okadome, Takeshi

doi:10.1007/978-3-030-93849-9_5

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1457)

Included in the following conference series:

International Workshop on Software Foundations for Data Interoperability

Abstract

Entity matching, the task of determining whether two records refer to the same real-world entity, is central to common data cleaning and data integration problems. Many studies use string similarity as a feature for entity matching, but the discriminative power of similarity is weakened by hard-to-classify pairs: pairs that are different entities but have high similarity, or the same entity but have low similarity. String transformation is an effective way to reconcile differing representations between two datasets from different domains, such as abbreviations, misspellings, and other variant expressions.

In this paper, we propose two powerful features, similarity gain and dissimilarity gain, that enable us to discriminate whether two records refer to the same entity after string transformation. The similarity gain is defined as the maximum increase in similarity among the changes in similarity before and after applying string transformations. The dissimilarity gain is defined as the maximum decrease in similarity. Moreover, the similarity gain and dissimilarity gain can also be used to select valuable samples under a limited labeling budget. We conduct extensive experiments, and our method with the proposed features improves on the best accuracy in most cases.
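The two definitions above can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the similarity measure (`difflib.SequenceMatcher.ratio`), the hypothetical transformations `identity`, `lowercase`, and `expand_abbreviations`, the toy abbreviation table, and the choice to apply each transformation to both strings of a pair.

```python
from difflib import SequenceMatcher
from typing import Callable, Sequence, Tuple

Transform = Callable[[str], str]


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (assumed; not the paper's measure)."""
    return SequenceMatcher(None, a, b).ratio()


def identity(s: str) -> str:
    """Transformation that changes nothing."""
    return s


def lowercase(s: str) -> str:
    """Hypothetical transformation: case folding."""
    return s.lower()


def expand_abbreviations(s: str) -> str:
    """Hypothetical transformation: expand tokens found in a toy table."""
    table = {"corp": "corporation", "intl": "international"}
    return " ".join(table.get(tok, tok) for tok in s.split())


def gains(a: str, b: str, transforms: Sequence[Transform]) -> Tuple[float, float]:
    """Return (similarity gain, dissimilarity gain) for the pair (a, b).

    Similarity gain: largest increase in similarity over all transformations.
    Dissimilarity gain: largest decrease in similarity over all transformations.
    """
    base = similarity(a, b)
    # Assumption: each transformation is applied to both strings of the pair.
    transformed = [similarity(f(a), f(b)) for f in transforms]
    return max(transformed) - base, base - min(transformed)


if __name__ == "__main__":
    F = [identity, lowercase, expand_abbreviations]
    sim_gain, dissim_gain = gains("NEC Corp", "nec corporation", F)
    print(f"similarity gain = {sim_gain:.3f}, dissimilarity gain = {dissim_gain:.3f}")
```

Including `identity` in the set keeps both values non-negative, because the untransformed similarity is always among the candidates. In a feature-based matcher, the two gains would be appended to the ordinary similarity features of each candidate pair.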

Notes

1. F can also include a function \(f_0\) that does not transform anything.

Author information

Authors and Affiliations

Graduate School of Science and Technology, Kwansei Gakuin University, Hyogo, Japan
Kazunori Sakai & Takeshi Okadome
NEC Corporation, Tokyo, Japan
Yuyang Dong, Masafumi Oyamada & Kunihiro Takeoka

Corresponding author

Correspondence to Yuyang Dong.

Editor information

Editors and Affiliations

Eindhoven University of Technology, Eindhoven, The Netherlands
George Fletcher
Tohoku University, Sendai, Japan
Keisuke Nakano
Osaka University, Osaka, Japan
Yuya Sasaki

About this paper

Cite this paper

Sakai, K., Dong, Y., Oyamada, M., Takeoka, K., Okadome, T. (2022). Entity Matching with String Transformation and Similarity-Based Features. In: Fletcher, G., Nakano, K., Sasaki, Y. (eds) Software Foundations for Data Interoperability. SFDI 2021. Communications in Computer and Information Science, vol 1457. Springer, Cham. https://doi.org/10.1007/978-3-030-93849-9_5

DOI: https://doi.org/10.1007/978-3-030-93849-9_5
Published: 19 January 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-93848-2
Online ISBN: 978-3-030-93849-9
eBook Packages: Computer Science, Computer Science (R0)