An Improved String Similarity Measure Based on Combining Information-Theoretic and Edit Distance Methods

Nguyen, Thi Thuy Anh; Conrad, Stefan

doi:10.1007/978-3-319-25840-9_15

Part of the book series: Communications in Computer and Information Science (CCIS, volume 553)

Included in the following conference series:

International Joint Conference on Knowledge Discovery, Knowledge Engineering, and Knowledge Management

Abstract

Lexical similarity measures calculate the similarity between strings. Existing lexical methods are usually based on either n-grams or Dice's approach. These measures perform well and can be extended by adjusting a parameter. However, they do not return reasonable results in some situations, for example when the strings are quite similar or when the sets of characters are the same but their positions differ. To deal with this problem, this paper presents an approach that improves a lexical measure by combining information-theoretic and edit distance measures. The proposed method is tested on part of the OAEI 2008 benchmark. The results show that the approach has some notable advantages over other lexical methods. It is also flexible and convenient to implement. Moreover, we identify a range of good parameter values that can be applied in different domains.
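
The limitation described in the abstract can be illustrated with a minimal Python sketch. It is not the measure proposed in the paper: it only contrasts a textbook Dice coefficient over sets of character n-grams with a textbook Levenshtein edit distance. The function names (`char_ngrams`, `dice`, `levenshtein`, `edit_similarity`) and the normalisation by the longer string length are assumptions made for illustration.

```python
def char_ngrams(s, n=2):
    """Return the set of character n-grams of s (hypothetical helper)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient 2|X & Y| / (|X| + |Y|) over sets of character n-grams."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    if not x and not y:
        return 1.0  # convention chosen here: two empty n-gram sets count as identical
    return 2 * len(x & y) / (len(x) + len(y))


def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Map edit distance to [0, 1]; this normalisation is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


# Same character set, different positions: the set-based score cannot tell them apart.
print(dice("listen", "silent", n=1))       # unigram sets are identical
print(edit_similarity("listen", "silent"))  # position-sensitive score is below 1
```

With unigrams (`n=1`), "listen" and "silent" have identical character sets, so the set-based Dice score treats them as the same string, while the position-sensitive edit distance does not. This is the kind of case that motivates combining the two families of measures.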

Notes

1. http://oaei.ontologymatching.org/

Author information

Authors and Affiliations

Institute of Computer Science, Heinrich-Heine-University Düsseldorf, Universitätsstr. 1, 40225, Düsseldorf, Germany
Thi Thuy Anh Nguyen & Stefan Conrad

Corresponding author

Correspondence to Thi Thuy Anh Nguyen.

Editor information

Editors and Affiliations

Instituto de Telecomunicações, Lisboa, Portugal
Ana Fred
Delft University of Technology, Delft, Zuid-Holland, The Netherlands
Jan L. G. Dietz
University of Madeira, Funchal, Portugal
David Aveiro
Henley Business School, University of Reading, Reading, United Kingdom
Kecheng Liu
INSTICC, Setubal, Portugal
Joaquim Filipe

About this paper

Cite this paper

Nguyen, T.T.A., Conrad, S. (2015). An Improved String Similarity Measure Based on Combining Information-Theoretic and Edit Distance Methods. In: Fred, A., Dietz, J., Aveiro, D., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2014. Communications in Computer and Information Science, vol 553. Springer, Cham. https://doi.org/10.1007/978-3-319-25840-9_15

DOI: https://doi.org/10.1007/978-3-319-25840-9_15
Published: 28 October 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25839-3
Online ISBN: 978-3-319-25840-9
eBook Packages: Computer Science, Computer Science (R0)
