Content Data Based Schema Matching

Szymczak, Marcin; Bronselaer, Antoon; Zadrożny, Sławomir; De Tré, Guy

doi:10.1007/978-3-319-30165-5_14

Part of the book series: Studies in Computational Intelligence (SCI, volume 634)

Abstract

We study a novel automatic method for detecting corresponding attributes in schemas based on content data. More specifically, the proposed method for detecting coreferent attributes is based on a statistical and lexical comparison of content data and on coreferent tuples detected across multiple datasets, which increases the likelihood of a correct schema matching. We show that knowledge of even a small number of coreferent tuples is sufficient to establish a correct matching between corresponding attributes of heterogeneous schemas. The behaviour of the technique has been evaluated on several real-life datasets, giving valuable insight into the influence of the different parameters of our approach on the results obtained.
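
The general idea of using known coreferent tuples to align attributes can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names, the toy records, and the use of `difflib.SequenceMatcher` as the lexical similarity measure are all assumptions made for illustration, and the statistical comparison mentioned in the abstract is left out.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Lexical similarity in [0, 1]; a stand-in for the chapter's own measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_attributes(source_rows, target_rows, coreferent_pairs):
    """Map each source attribute to the target attribute whose values are,
    on average, most similar across the known coreferent tuple pairs.

    coreferent_pairs holds (source_index, target_index) tuples that are
    assumed to describe the same real-world entity.
    """
    source_attrs = list(source_rows[0])
    target_attrs = list(target_rows[0])
    matching = {}
    for s_attr in source_attrs:
        scores = {}
        for t_attr in target_attrs:
            total = sum(
                similarity(str(source_rows[i][s_attr]), str(target_rows[j][t_attr]))
                for i, j in coreferent_pairs
            )
            scores[t_attr] = total / len(coreferent_pairs)
        # Keep the best-scoring target attribute for this source attribute.
        matching[s_attr] = max(scores, key=scores.get)
    return matching


# Hypothetical toy datasets with different attribute names and value formats.
source = [
    {"artist": "Miles Davis", "title": "Kind of Blue"},
    {"artist": "John Coltrane", "title": "Giant Steps"},
]
target = [
    {"name": "Giant Steps", "performer": "Coltrane, John"},
    {"name": "Kind of Blue", "performer": "Miles Davis"},
]
pairs = [(0, 1), (1, 0)]  # two known coreferent tuple pairs

result = match_attributes(source, target, pairs)
print(result)
```

In this toy case two coreferent pairs are enough to separate the candidate attributes, which is in the spirit of the abstract's claim about a small number of coreferent tuples, although it is no evidence for it.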

Notes

1. The order of datasets does not matter, i.e., there exists schema matching between corresponding attributes from the source dataset and the target dataset, and vice versa.
2. FreeDB, http://www.freedb.org/.
3. Discogs, http://www.discogs.com/data/.
4. http://hpi.de/naumann/projects/repeatability/datasets/cd-datasets.html.
5. Discogs, http://www.discogs.com/data/.
6. http://www.routeyou.com.
7. Google Places, http://developers.google.com/places/.

Acknowledgments

This contribution is supported by the Foundation for Polish Science under International PhD Projects in Intelligent Computing. The project is financed by the European Union within the Innovative Economy Operational Programme 2007–2013 and the European Regional Development Fund. This work was also partially supported by the National Science Centre (contract no. UMO-2011/01/B/ST6/06908).

Author information

Authors and Affiliations

Systems Research Institute, Polish Academy of Sciences, Newelska 6, 01-447, Warsaw, Poland
Marcin Szymczak & Sławomir Zadrożny
Department of Telecommunications and Information Processing, Ghent University, St-Pietersnieuwstraat 41, 9000, Ghent, Belgium
Marcin Szymczak, Antoon Bronselaer & Guy De Tré

Corresponding author

Correspondence to Marcin Szymczak.

Editor information

Editors and Affiliations

Dept. of Telecommunications and Information Processing, Ghent University, Gent, Belgium
Guy De Tré
Faculty of Maths and Information Science, Warsaw University of Technology, Warszawa, Poland
Przemysław Grzegorzewski
Polish Academy of Sciences, Systems Research Institute, Warszawa, Poland
Janusz Kacprzyk
Polish Academy of Sciences, Systems Research Institute, Warszawa, Poland
Jan W. Owsiński
Polish Academy of Sciences, Institute of Computer Science, Warszawa, Poland
Wojciech Penczek
Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Sławomir Zadrożny

About this chapter

Cite this chapter

Szymczak, M., Bronselaer, A., Zadrożny, S., De Tré, G. (2016). Content Data Based Schema Matching. In: De Tré, G., Grzegorzewski, P., Kacprzyk, J., Owsiński, J., Penczek, W., Zadrożny, S. (eds) Challenging Problems and Solutions in Intelligent Systems. Studies in Computational Intelligence, vol 634. Springer, Cham. https://doi.org/10.1007/978-3-319-30165-5_14

DOI: https://doi.org/10.1007/978-3-319-30165-5_14
Published: 26 March 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30164-8
Online ISBN: 978-3-319-30165-5
eBook Packages: Engineering, Engineering (R0)
