Abstract
Transformers have achieved great success on many NLP tasks. The self-attention mechanism of the Transformer learns powerful representations by modeling token-level pairwise interactions within the input sequence. In this paper, we propose a novel entity matching framework named GTA. GTA enhances the Transformer for relational data representation by injecting additional hybrid matching knowledge, which is obtained via graph contrastive learning on a specially designed hybrid matching graph that models dual-level matching and multi-granularity interactions. In this way, GTA exploits pre-learned knowledge of both hybrid matching and language modeling, effectively empowering the Transformer to capture the structural features of relational data when performing entity matching. Extensive experiments on open datasets show that GTA effectively enhances the Transformer for relational data representation and outperforms state-of-the-art entity matching frameworks.
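The abstract does not spell out GTA's input pipeline, but Transformer-based entity matchers over relational data typically start by flattening each tuple's attribute/value structure into a token sequence. Below is a minimal sketch of the "[COL] ... [VAL] ..." serialization scheme popularized by Ditto (Li et al., "Deep entity matching with pre-trained language models", cited in the references); the attribute names and records are illustrative, and GTA's actual preprocessing may differ.

```python
def serialize(record: dict) -> str:
    """Flatten one relational tuple into a token sequence a language model can read."""
    return " ".join(f"[COL] {attr} [VAL] {val}" for attr, val in record.items())

def make_pair(left: dict, right: dict) -> str:
    """Join a candidate record pair with a separator, ready for sequence-pair classification."""
    return f"{serialize(left)} [SEP] {serialize(right)}"

left = {"title": "iphone 12 pro 128gb", "brand": "Apple"}
right = {"title": "Apple iPhone 12 Pro (128 GB)", "brand": "apple"}
print(make_pair(left, right))
# [COL] title [VAL] iphone 12 pro 128gb [COL] brand [VAL] Apple [SEP] ...
```

Marking column boundaries explicitly in this way is what lets a pre-trained language model align values attribute-by-attribute rather than treating the pair as undifferentiated text.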
Notes
- 1. RoBERTa has shown that removing the next sentence prediction (NSP) training objective can improve downstream task performance.
References
Abedjan, Z., et al.: Detecting data errors: where are we and what needs to be done? Proc. VLDB Endow. 9(12), 993–1004 (2016)
Brunner, U., Stockinger, K.: Entity matching with transformer architectures-a step forward in data integration. In: International Conference on Extending Database Technology, Copenhagen, 30 March–2 April 2020. OpenProceedings (2020)
Cappuzzo, R., Papotti, P., Thirumuruganathan, S.: Creating embeddings of heterogeneous relational datasets for data integration tasks. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 1335–1349 (2020)
Chen, R., Shen, Y., Zhang, D.: GNEM: a generic one-to-set neural entity matching framework. In: Proceedings of the Web Conference 2021, pp. 1686–1694 (2021)
Christen, P.: Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, Heidelberg (2012)
Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289 (2015)
Dalvi, N., Rastogi, V., Dasgupta, A., Das Sarma, A., Sarlós, T.: Optimal hashing schemes for entity matching. In: Proceedings of the 22nd International Conference on World Wide Web, pp. 295–306 (2013)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019)
Dong, X.L., Srivastava, D.: Big data integration. In: 2013 IEEE 29th International Conference on Data Engineering (ICDE), pp. 1245–1248. IEEE (2013)
Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2020)
Dunn, H.L.: Record linkage. Am. J. Public Health Natl. Health 36(12), 1412–1416 (1946)
Ebraheem, M., Thirumuruganathan, S., Joty, S., Ouzzani, M., Tang, N.: Distributed representations of tuples for entity resolution. Proc. VLDB Endow. 11(11), 1454–1467 (2018)
Elmagarmid, A., Ilyas, I.F., Ouzzani, M., Quiané-Ruiz, J.A., Tang, N., Yin, S.: NADEEF/ER: generic and interactive entity resolution. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 1071–1074 (2014)
Fan, W., Ma, Y., Li, Q., He, Y., Zhao, E., Tang, J., Yin, D.: Graph neural networks for social recommendation. In: The World Wide Web Conference, pp. 417–426 (2019)
Fu, C., Han, X., He, J., Sun, L.: Hierarchical matching network for heterogeneous entity resolution. In: IJCAI, pp. 3665–3671 (2020)
Gokhale, C., et al.: Corleone: hands-off crowdsourcing for entity matching. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 601–612 (2014)
Jin, W., et al.: Graph representation learning: foundations, methods, applications and systems. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 4044–4045 (2021)
Marcus, A., Wu, E., Karger, D., Madden, S., Miller, R.: Human-powered sorts and joins. Proc. VLDB Endow. 5(1) (2011)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (Poster) (2015)
Konda, P., et al.: Magellan: toward building entity matching management systems. Proc. VLDB Endow. 9(12), 1581–1584 (2016)
Li, B., Miao, Y., Wang, Y., Sun, Y., Wang, W.: Improving the efficiency and effectiveness for BERT-based entity resolution. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 13226–13233 (2021)
Li, B., Wang, W., Sun, Y., Zhang, L., Ali, M.A., Wang, Y.: GraphER: token-centric entity resolution with graph convolutional neural networks. In: AAAI, pp. 8172–8179 (2020)
Li, Y., Li, J., Suhara, Y., Doan, A., Tan, W.C.: Deep entity matching with pre-trained language models. Proc. VLDB Endow. 14(1), 50–60 (2020)
Li, Y., Li, J., Suhara, Y., Wang, J., Hirota, W., Tan, W.C.: Deep entity matching: challenges and opportunities. J. Data Inf. Qual. (JDIQ) 13(1), 1–17 (2021)
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
Liu, Y., Pan, S., Jin, M., Zhou, C., Xia, F., Yu, P.S.: Graph self-supervised learning: a survey. arXiv preprint arXiv:2103.00111 (2021)
Mudgal, S., et al.: Deep learning for entity matching: a design space exploration. In: Proceedings of the 2018 International Conference on Management of Data, pp. 19–34 (2018)
van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
Peng, Y., Choi, B., Xu, J.: Graph learning for combinatorial optimization: a survey of state-of-the-art. Data Sci. Eng. 6(2), 119–141 (2021)
Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)
Singh, R., et al.: Synthesizing entity matching rules by examples. Proc. VLDB Endow. 11(2), 189–202 (2017)
Sun, C.C., Shen, D.R.: Mixed hierarchical networks for deep entity matching. J. Comput. Sci. Technol. 36(4), 822–838 (2021)
Sun, C., Shen, D.: Entity resolution with hybrid attention-based networks. In: Jensen, C.S., et al. (eds.) DASFAA 2021. LNCS, vol. 12682, pp. 558–565. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-73197-7_37
Tang, N., et al.: RPT: relational pre-trained transformer is almost all you need towards democratizing data preparation. Proc. VLDB Endow. 14(8), 1254–1261 (2021)
Wang, F., Liu, H.: Understanding the behaviour of contrastive loss. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2495–2504 (2021)
Wang, J., Kraska, T., Franklin, M.J., Feng, J.: CrowdER: crowdsourcing entity resolution. Proc. VLDB Endow. 5(11) (2012)
Wang, J., Li, G., Yu, J.X., Feng, J.: Entity matching: how similar is similar. Proc. VLDB Endow. 4(10), 622–633 (2011)
Zhang, D., Nie, Y., Wu, S., Shen, Y., Tan, K.L.: Multi-context attention for entity matching. In: Proceedings of The Web Conference 2020, pp. 2634–2640 (2020)
Zheng, Y., Zhang, R., Huang, M., Mao, X.: A pre-training based personalized dialogue generation model with persona-sparse data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 9693–9700 (2020)
Acknowledgements
This work is supported by the National Natural Science Foundation of China (62172082, 62072084, 62072086, U1811261).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Dou, W. et al. (2022). Empowering Transformer with Hybrid Matching Knowledge for Entity Matching. In: Bhattacharya, A., et al. Database Systems for Advanced Applications. DASFAA 2022. Lecture Notes in Computer Science, vol 13247. Springer, Cham. https://doi.org/10.1007/978-3-031-00129-1_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-00128-4
Online ISBN: 978-3-031-00129-1