A Hybrid Approach for XML Similarity

  • Conference paper
SOFSEM 2007: Theory and Practice of Computer Science (SOFSEM 2007)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4362)

Abstract

In the past few years, XML has become established as an effective means of information management and has been widely used for complex data representation. Owing to the rapidly growing use of the XML standard, developing efficient techniques for comparing XML-based documents has become essential in information retrieval (IR) research. Various algorithms for comparing hierarchically structured data, e.g. XML documents, have been proposed in the literature. However, to our knowledge, most of them focus exclusively on comparing documents based on structural features, overlooking the semantics involved. In this paper, we integrate IR semantic similarity assessment into an edit distance algorithm, seeking to improve similarity judgments when comparing XML-based documents. Our approach comprises an original edit distance operation cost model that introduces the semantic relatedness of XML element/attribute labels into traditional edit distance computations. A prototype has been developed to evaluate our model’s performance. Experiments yielded notable results.
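To make the idea of a semantically weighted cost model concrete, here is a minimal sketch in Python. It is not the authors' algorithm: it flattens each document into its preorder sequence of element labels and applies a string-style edit distance, which is far simpler than a tree edit distance. The update cost of replacing one label with another is assumed to be one minus their semantic similarity. The `TOY_SIMILARITY` table, its scores, and all function names are hypothetical stand-ins for a taxonomy-based measure.

```python
import xml.etree.ElementTree as ET

# Hypothetical relatedness scores in [0, 1], invented for illustration.
# A real system might derive them from a lexical taxonomy instead.
TOY_SIMILARITY = {
    frozenset(("professor", "lecturer")): 0.8,
    frozenset(("student", "pupil")): 0.9,
}


def label_similarity(a, b):
    """Return 1.0 for identical labels, a table score if known, else 0.0."""
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((a, b)), 0.0)


def labels_preorder(xml_text):
    """Element labels of an XML document in document (preorder) order."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]


def semantic_edit_distance(xs, ys):
    """Edit distance over label sequences with a semantic update cost.

    Insertions and deletions cost 1. Updating label a to label b costs
    1 - label_similarity(a, b), so related labels are cheap to exchange.
    """
    m, n = len(xs), len(ys)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = float(i)  # delete the first i labels of xs
    for j in range(1, n + 1):
        dist[0][j] = float(j)  # insert the first j labels of ys
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            update = 1.0 - label_similarity(xs[i - 1], ys[j - 1])
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,         # deletion
                dist[i][j - 1] + 1.0,         # insertion
                dist[i - 1][j - 1] + update,  # update (free if labels match)
            )
    return dist[m][n]


doc_a = "<academy><professor/><student/></academy>"
doc_b = "<academy><lecturer/><pupil/></academy>"
distance = semantic_edit_distance(labels_preorder(doc_a), labels_preorder(doc_b))
print(distance)
```

With a purely structural cost model (every update costing 1), these two documents would be at distance 2. Under the assumed similarity scores the distance drops to roughly 0.3, which illustrates how label semantics can adjust a similarity judgment.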

Editor information

Jan van Leeuwen, Giuseppe F. Italiano, Wiebe van der Hoek, Christoph Meinel, Harald Sack, František Plášil

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Tekli, J., Chbeir, R., Yetongnon, K. (2007). A Hybrid Approach for XML Similarity. In: van Leeuwen, J., Italiano, G.F., van der Hoek, W., Meinel, C., Sack, H., Plášil, F. (eds) SOFSEM 2007: Theory and Practice of Computer Science. SOFSEM 2007. Lecture Notes in Computer Science, vol 4362. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69507-3_68

  • DOI: https://doi.org/10.1007/978-3-540-69507-3_68

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-69506-6

  • Online ISBN: 978-3-540-69507-3

  • eBook Packages: Computer Science, Computer Science (R0)
