An Improved String Similarity Measure Based on Combining Information-Theoretic and Edit Distance Methods

  • Conference paper
Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2014)

Abstract

Lexical similarity measures are used to calculate the similarity between strings. Existing lexical methods are usually based on either n-grams or Dice's approach. These measures perform well and can be tuned by adjusting a parameter. However, they do not return reasonable results in some situations, such as when the strings are quite similar or when they contain the same set of characters in different positions. To deal with this problem, this paper presents an approach that improves a lexical measure by combining information-theoretic and edit distance measures. The proposed method is tested on a subset of the OAEI 2008 benchmark. The results show that our approach has several advantages over other lexical-based methods. It is also flexible and straightforward to implement. Moreover, we identify a range of suitable parameter values that can be applied in different domains.
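To make the limitation described in the abstract concrete, the following minimal sketch compares a set-based Dice coefficient over character n-grams with the Levenshtein edit distance. It is not the measure proposed in the paper, which is not reproduced on this page; the function names `ngrams`, `dice`, and `levenshtein` and the example strings are illustrative assumptions.

```python
def ngrams(s, n):
    """Return the list of contiguous character n-grams of s."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def dice(a, b, n=2):
    """Set-based Dice coefficient over character n-grams (illustrative)."""
    x, y = set(ngrams(a, n)), set(ngrams(b, n))
    if not x and not y:
        return 1.0  # assumption: two strings with no n-grams count as identical
    return 2 * len(x & y) / (len(x) + len(y))


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


# Same set of characters, different positions: with n=1 the Dice score
# cannot tell the strings apart, while the edit distance is non-zero.
print(dice("listen", "silent", n=1))     # 1.0
print(levenshtein("listen", "silent"))   # greater than 0
```

With `n=1` the Dice coefficient sees only which characters occur, so a position-insensitive score of 1.0 is returned for two different strings; an edit distance is sensitive to position, which is the kind of complementary signal the abstract proposes to combine.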

Notes

  1. http://oaei.ontologymatching.org/

Author information

Corresponding author

Correspondence to Thi Thuy Anh Nguyen.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Nguyen, T.T.A., Conrad, S. (2015). An Improved String Similarity Measure Based on Combining Information-Theoretic and Edit Distance Methods. In: Fred, A., Dietz, J., Aveiro, D., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2014. Communications in Computer and Information Science, vol 553. Springer, Cham. https://doi.org/10.1007/978-3-319-25840-9_15

  • DOI: https://doi.org/10.1007/978-3-319-25840-9_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25839-3

  • Online ISBN: 978-3-319-25840-9

  • eBook Packages: Computer Science (R0)
