Abstract
Transformers have achieved great success on many NLP tasks. The self-attention mechanism of the Transformer learns powerful representations by modeling token-level pairwise interactions within the input sequence. In this paper, we propose a novel entity matching framework named GTA. GTA enhances the Transformer for relational data representation by injecting additional hybrid matching knowledge, which is obtained via graph contrastive learning on a specially designed hybrid matching graph that models dual-level matching and multi-granularity interactions. In this way, GTA exploits pre-learned knowledge of both hybrid matching and language modeling, effectively empowering the Transformer to capture the structural features of relational data when performing entity matching. Extensive experiments on open datasets show that GTA effectively enhances the Transformer for relational data representation and outperforms state-of-the-art entity matching frameworks.
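The abstract does not spell out GTA's input pipeline, but Transformer-based entity matchers over relational data typically start by flattening each tuple's attribute/value structure into a token sequence. Below is a minimal sketch of the "[COL] ... [VAL] ..." serialization scheme popularized by Ditto (Li et al., "Deep entity matching with pre-trained language models", cited in the references); the attribute names and records are illustrative, and GTA's actual preprocessing may differ.

```python
def serialize(record: dict) -> str:
    """Flatten one relational tuple into a token sequence a language model can read."""
    return " ".join(f"[COL] {attr} [VAL] {val}" for attr, val in record.items())

def make_pair(left: dict, right: dict) -> str:
    """Join a candidate record pair with a separator, ready for sequence-pair classification."""
    return f"{serialize(left)} [SEP] {serialize(right)}"

left = {"title": "iphone 12 pro 128gb", "brand": "Apple"}
right = {"title": "Apple iPhone 12 Pro (128 GB)", "brand": "apple"}
print(make_pair(left, right))
# [COL] title [VAL] iphone 12 pro 128gb [COL] brand [VAL] Apple [SEP] ...
```

Marking column boundaries explicitly in this way is what lets a pre-trained language model align values attribute-by-attribute rather than treating the pair as undifferentiated text.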
Notes
- 1. RoBERTa has shown that removing the next sentence prediction (NSP) training objective can improve downstream task performance.
References
Abedjan, Z., et al.: Detecting data errors: where are we and what needs to be done? Proc. VLDB Endow. 9(12), 993–1004 (2016)
Brunner, U., Stockinger, K.: Entity matching with transformer architectures-a step forward in data integration. In: International Conference on Extending Database Technology, Copenhagen, 30 March–2 April 2020. OpenProceedings (2020)
Cappuzzo, R., Papotti, P., Thirumuruganathan, S.: Creating embeddings of heterogeneous relational datasets for data integration tasks. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 1335–1349 (2020)
Chen, R., Shen, Y., Zhang, D.: GNEM: a generic one-to-set neural entity matching framework. In: Proceedings of the Web Conference 2021, pp. 1686–1694 (2021)
Christen, P.: Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, Heidelberg (2012)
Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289 (2015)
Dalvi, N., Rastogi, V., Dasgupta, A., Das Sarma, A., Sarlós, T.: Optimal hashing schemes for entity matching. In: Proceedings of the 22nd International Conference on World Wide Web, pp. 295–306 (2013)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019)
Dong, X.L., Srivastava, D.: Big data integration. In: 2013 IEEE 29th International Conference on Data Engineering (ICDE), pp. 1245–1248. IEEE (2013)
Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2020)
Dunn, H.L.: Record linkage. Am. J. Public Health Natl. Health 36(12), 1412–1416 (1946)
Ebraheem, M., Thirumuruganathan, S., Joty, S., Ouzzani, M., Tang, N.: Distributed representations of tuples for entity resolution. Proc. VLDB Endow. 11(11), 1454–1467 (2018)
Elmagarmid, A., Ilyas, I.F., Ouzzani, M., Quiané-Ruiz, J.A., Tang, N., Yin, S.: NADEEF/ER: generic and interactive entity resolution. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 1071–1074 (2014)
Fan, W., Ma, Y., Li, Q., He, Y., Zhao, E., Tang, J., Yin, D.: Graph neural networks for social recommendation. In: The World Wide Web Conference, pp. 417–426 (2019)
Fu, C., Han, X., He, J., Sun, L.: Hierarchical matching network for heterogeneous entity resolution. In: IJCAI, pp. 3665–3671 (2020)
Gokhale, C., et al.: Corleone: hands-off crowdsourcing for entity matching. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 601–612 (2014)
Jin, W., et al.: Graph representation learning: foundations, methods, applications and systems. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 4044–4045 (2021)
Marcus, A., Wu, E., Karger, D., Madden, S., Miller, R.: Human-powered sorts and joins. Proc. VLDB Endow. 5(1) (2011)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (Poster) (2015)
Konda, P., et al.: Magellan: toward building entity matching management systems. Proc. VLDB Endow. 9(12), 1581–1584 (2016)
Li, B., Miao, Y., Wang, Y., Sun, Y., Wang, W.: Improving the efficiency and effectiveness for BERT-based entity resolution. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 13226–13233 (2021)
Li, B., Wang, W., Sun, Y., Zhang, L., Ali, M.A., Wang, Y.: GraphER: token-centric entity resolution with graph convolutional neural networks. In: AAAI, pp. 8172–8179 (2020)
Li, Y., Li, J., Suhara, Y., Doan, A., Tan, W.C.: Deep entity matching with pre-trained language models. Proc. VLDB Endow. 14(1), 50–60 (2020)
Li, Y., Li, J., Suhara, Y., Wang, J., Hirota, W., Tan, W.C.: Deep entity matching: challenges and opportunities. J. Data Inf. Qual. (JDIQ) 13(1), 1–17 (2021)
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
Liu, Y., Pan, S., Jin, M., Zhou, C., Xia, F., Yu, P.S.: Graph self-supervised learning: a survey. arXiv preprint arXiv:2103.00111 (2021)
Mudgal, S., et al.: Deep learning for entity matching: a design space exploration. In: Proceedings of the 2018 International Conference on Management of Data, pp. 19–34 (2018)
van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
Peng, Y., Choi, B., Xu, J.: Graph learning for combinatorial optimization: a survey of state-of-the-art. Data Sci. Eng. 6(2), 119–141 (2021)
Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)
Singh, R., et al.: Synthesizing entity matching rules by examples. Proc. VLDB Endow. 11(2), 189–202 (2017)
Sun, C.C., Shen, D.R.: Mixed hierarchical networks for deep entity matching. J. Comput. Sci. Technol. 36(4), 822–838 (2021)
Sun, C., Shen, D.: Entity resolution with hybrid attention-based networks. In: Jensen, C.S., et al. (eds.) DASFAA 2021. LNCS, vol. 12682, pp. 558–565. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-73197-7_37
Tang, N., et al.: RPT: relational pre-trained transformer is almost all you need towards democratizing data preparation. Proc. VLDB Endow. 14(8), 1254–1261 (2021)
Wang, F., Liu, H.: Understanding the behaviour of contrastive loss. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2495–2504 (2021)
Wang, J., Kraska, T., Franklin, M.J., Feng, J.: CrowdER: crowdsourcing entity resolution. Proc. VLDB Endow. 5(11) (2012)
Wang, J., Li, G., Yu, J.X., Feng, J.: Entity matching: how similar is similar. Proc. VLDB Endow. 4(10), 622–633 (2011)
Zhang, D., Nie, Y., Wu, S., Shen, Y., Tan, K.L.: Multi-context attention for entity matching. In: Proceedings of The Web Conference 2020, pp. 2634–2640 (2020)
Zheng, Y., Zhang, R., Huang, M., Mao, X.: A pre-training based personalized dialogue generation model with persona-sparse data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 9693–9700 (2020)
Acknowledgements
This work is supported by the National Natural Science Foundation of China (62172082, 62072084, 62072086, U1811261).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Dou, W. et al. (2022). Empowering Transformer with Hybrid Matching Knowledge for Entity Matching. In: Bhattacharya, A., et al. Database Systems for Advanced Applications. DASFAA 2022. Lecture Notes in Computer Science, vol 13247. Springer, Cham. https://doi.org/10.1007/978-3-031-00129-1_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-00128-4
Online ISBN: 978-3-031-00129-1