Mining Frequent Association Tag Sequences for Clustering XML Documents

  • Conference paper
Web Technologies and Applications (APWeb 2012)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7235)

Abstract

Many XML document clustering algorithms need to compute similarity among documents. Because XML is semi-structured, exploiting structural information to compute structural similarity is a crucial issue in XML similarity computation. Some path-based approaches model the structure as a path set and use that path set to compute structural similarity. One defect of these approaches is that they ignore the relationships between paths. In this paper, we propose the concept of Frequent Association Tag Sequences (FATS). Based on this concept, we devise an algorithm named FATSMiner for mining FATS, model the structure of XML documents as a FATS set, and introduce a method for computing structural similarity using FATS. Because FATS imply the ancestor-descendant and sibling relationships among elements, this approach can better represent the structure of XML documents. Our experimental results on real datasets show that this approach is more effective than the other path-based approaches.
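
To make the path-based baseline mentioned in the abstract concrete, the following is a minimal sketch, not the paper's FATSMiner algorithm or its similarity measure. It assumes that the "path set" is the set of root-to-leaf tag paths and that similarity is measured with the Jaccard coefficient; both choices, and the names tag_paths and jaccard, are illustrative assumptions rather than details taken from the paper.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Return the set of root-to-leaf tag paths of an XML document.

    Assumption: each path is a tuple of tag names from the root to a leaf.
    """
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        children = list(node)
        if not children:
            # A node without child elements ends a root-to-leaf path.
            paths.add(path)
        for child in children:
            walk(child, path)

    walk(root, ())
    return paths


def jaccard(a, b):
    """Jaccard coefficient of two sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two documents that share two of three distinct paths.
doc_a = "<article><title/><author><name/></author></article>"
doc_b = "<article><title/><author><name/><email/></author></article>"
sim_ab = jaccard(tag_paths(doc_a), tag_paths(doc_b))

# Two documents with different sibling structure but identical path sets.
doc_c = "<r><a><b/></a><a><c/></a></r>"
doc_d = "<r><a><b/><c/></a></r>"
sim_cd = jaccard(tag_paths(doc_c), tag_paths(doc_d))
```

In this sketch, doc_c and doc_d receive the maximum similarity even though b and c are siblings under one a element only in doc_d. This is the kind of relationship between paths that, according to the abstract, a plain path set ignores and that FATS is intended to capture.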

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhang, L., Li, Z., Chen, Q., Li, X., Li, N., Lou, Y. (2012). Mining Frequent Association Tag Sequences for Clustering XML Documents. In: Sheng, Q.Z., Wang, G., Jensen, C.S., Xu, G. (eds) Web Technologies and Applications. APWeb 2012. Lecture Notes in Computer Science, vol 7235. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29253-8_8

  • DOI: https://doi.org/10.1007/978-3-642-29253-8_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-29252-1

  • Online ISBN: 978-3-642-29253-8

  • eBook Packages: Computer Science, Computer Science (R0)
