Abstract
Entity matching, the task of determining whether two records refer to the same real-world entity, is an important step in data cleaning and data integration. Many studies use string similarity as a feature for entity matching, but the discriminative power of similarity suffers on hard-to-classify pairs: pairs that are different entities but have high similarity, or pairs that are the same entity but have low similarity. String transformation is an effective way to reconcile differing representations between two datasets, such as abbreviations, misspellings, and other variant expressions.
In this paper, we propose two powerful features, similarity gain and dissimilarity gain, that enable us to discriminate whether two records refer to the same entity after string transformation. The similarity gain is defined as the maximum increase in similarity among the changes in similarity observed before and after applying string transformations. The dissimilarity gain is defined as the maximum decrease in similarity. Moreover, the similarity gain and dissimilarity gain can also be used to select valuable samples under a limited labeling budget. Extensive experiments are conducted, and our method with the proposed features improves on the best accuracy in most cases.
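The two definitions above can be illustrated with a minimal sketch. It is not the authors' implementation: the token-level Jaccard similarity, the example transformations (`identity`, `lowercase`, `expand_abbreviations`), the `ABBREVIATIONS` table, and the choice to apply each transformation to both strings are all assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple

Transform = Callable[[str], str]


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (assumed stand-in for any string similarity)."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def identity(s: str) -> str:
    """A transformation that changes nothing."""
    return s


def lowercase(s: str) -> str:
    return s.lower()


# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"st": "street", "ave": "avenue"}


def expand_abbreviations(s: str) -> str:
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in s.split())


def gains(
    a: str,
    b: str,
    transforms: Sequence[Transform],
    sim: Callable[[str, str], float] = jaccard,
) -> Tuple[float, float]:
    """Return (similarity gain, dissimilarity gain) for the pair (a, b).

    Assumption: each transformation is applied to both strings, and the
    similarity after transformation is compared with the similarity before.
    """
    base = sim(a, b)
    after = [sim(f(a), f(b)) for f in transforms]
    # Largest increase and largest decrease over all transformations.
    similarity_gain = max((s - base for s in after), default=0.0)
    dissimilarity_gain = max((base - s for s in after), default=0.0)
    return similarity_gain, dissimilarity_gain


if __name__ == "__main__":
    pair = ("5 Main St", "5 main street")
    print(gains(*pair, [identity, lowercase, expand_abbreviations]))
```

Including the identity transformation in the set keeps both values non-negative, because the unchanged pair always contributes a change of zero.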
Notes
1. F can also include a function \(f_0\) that does not transform anything.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Sakai, K., Dong, Y., Oyamada, M., Takeoka, K., Okadome, T. (2022). Entity Matching with String Transformation and Similarity-Based Features. In: Fletcher, G., Nakano, K., Sasaki, Y. (eds) Software Foundations for Data Interoperability. SFDI 2021. Communications in Computer and Information Science, vol 1457. Springer, Cham. https://doi.org/10.1007/978-3-030-93849-9_5
DOI: https://doi.org/10.1007/978-3-030-93849-9_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-93848-2
Online ISBN: 978-3-030-93849-9
eBook Packages: Computer Science, Computer Science (R0)