Mining Frequent Association Tag Sequences for Clustering XML Documents

Zhang, Lijun; Li, Zhanhuai; Chen, Qun; Li, Xia; Li, Ning; Lou, Ying

doi:10.1007/978-3-642-29253-8_8

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7235)

Included in the following conference series:

Asia-Pacific Web Conference

Abstract

Many XML document clustering algorithms need to compute the similarity between documents. Because XML is semi-structured, exploiting structural information to compute structural similarity is a crucial issue in XML similarity computation. Some path-based approaches model the structure of a document as a set of paths and use that set to compute structural similarity. One defect of these approaches is that they ignore the relationships between paths. In this paper, we propose the concept of Frequent Association Tag Sequences (FATS). Based on this concept, we devise an algorithm named FATSMiner for mining FATS, model the structure of XML documents as a set of FATS, and introduce a method for computing structural similarity using FATS. Because FATS imply the ancestor-descendant and sibling relationships among elements, this approach can better represent the structure of XML documents. Our experimental results on real datasets show that this approach is more effective than the other path-based approaches.
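
The path-set baseline that the abstract criticizes can be illustrated with a minimal sketch. This is not FATSMiner, whose details are not given on this page. It assumes that a document is modeled as its set of root-to-leaf tag paths and that similarity is the Jaccard coefficient of two such sets; the names `tag_paths` and `jaccard` and the two toy documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Return the set of root-to-leaf tag paths of an XML document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        children = list(node)
        if not children:
            # A leaf element closes one root-to-leaf path.
            paths.add(path)
        for child in children:
            walk(child, path)

    walk(root, "")
    return paths


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Assumed toy documents: in doc_a, b and c are siblings under one <a>;
# in doc_b, they sit under two different <a> elements.
doc_a = "<r><a><b/><c/></a></r>"
doc_b = "<r><a><b/></a><a><c/></a></r>"

paths_a = tag_paths(doc_a)
paths_b = tag_paths(doc_b)
similarity = jaccard(paths_a, paths_b)
```

Both toy documents yield the same two paths, so the path-set similarity treats them as identical even though `b` and `c` are siblings only in the first. This is the kind of relationship between paths that a plain path set discards.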

Author information

Authors and Affiliations

School of Computer Science and Technology, Northwestern Polytechnical University, Xi’an, 710072, China
Lijun Zhang, Zhanhuai Li, Qun Chen, Xia Li, Ning Li & Ying Lou

Editor information

Editors and Affiliations

School of Computer Science, The University of Adelaide, Australia
Quan Z. Sheng
College of Information Science and Engineering, Northeastern University, 110819, Shenyang, China
Guoren Wang
Aarhus University, Denmark
Christian S. Jensen
Center for Applied Informatics, Victoria University, PO Box 14428, 8001, VIC, Australia
Guandong Xu

About this paper

Cite this paper

Zhang, L., Li, Z., Chen, Q., Li, X., Li, N., Lou, Y. (2012). Mining Frequent Association Tag Sequences for Clustering XML Documents. In: Sheng, Q.Z., Wang, G., Jensen, C.S., Xu, G. (eds) Web Technologies and Applications. APWeb 2012. Lecture Notes in Computer Science, vol 7235. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29253-8_8

DOI: https://doi.org/10.1007/978-3-642-29253-8_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29252-1
Online ISBN: 978-3-642-29253-8
eBook Packages: Computer Science, Computer Science (R0)
