A Framework for Clustering and Dynamic Maintenance of XML Documents

  • Conference paper
Advanced Data Mining and Applications (ADMA 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10604)

Abstract

Web data clustering has been widely studied in the data mining community. However, dynamic maintenance of web data clusters remains a challenging task. In this paper, we propose a novel framework called XClusterMaint that supports both clustering and maintenance of XML documents. For clustering, we take both structure and content into account and propose an efficient solution that groups documents by a combination of structure and content similarity. For maintenance, we propose an incremental approach that updates the existing clusters dynamically as new XML documents arrive. Since dynamic maintenance of the clusters is computationally expensive, we also propose an improved approach that uses a lazy maintenance scheme to improve the performance of cluster maintenance. Experimental results on real datasets verify the efficiency of the proposed clustering and maintenance model.
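The idea of grouping XML documents by combined structure and content similarity, and of placing a newly arrived document into existing clusters, can be illustrated with a minimal sketch. This is not the XClusterMaint algorithm itself: the structural measure (Jaccard overlap of tag paths), the content measure (cosine similarity of term counts), the weight `alpha`, the `threshold`, and the use of a cluster's first member as its representative are all assumptions chosen for illustration. The lazy maintenance scheme is not modelled here.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def profile(xml_text):
    """Return (set of root-to-node tag paths, Counter of lower-cased terms)."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    terms = Counter()
    for text in root.itertext():
        terms.update(text.lower().split())
    return paths, terms


def jaccard(a, b):
    """Structural similarity (assumed measure): overlap of tag-path sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(a, b):
    """Content similarity (assumed measure): cosine of term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_similarity(x, y, alpha=0.5):
    """Weighted mix of structure and content similarity; alpha is hypothetical."""
    return alpha * jaccard(x[0], y[0]) + (1 - alpha) * cosine(x[1], y[1])


def assign(clusters, doc, threshold=0.6, alpha=0.5):
    """Incrementally place doc: join the most similar cluster, or open a new one.

    Each cluster is a list of profiles; its first member serves as the
    representative (a simplification made for this sketch).
    """
    best, best_sim = None, -1.0
    for i, members in enumerate(clusters):
        sim = combined_similarity(members[0], doc, alpha)
        if sim > best_sim:
            best, best_sim = i, sim
    if best is not None and best_sim >= threshold:
        clusters[best].append(doc)
        return best
    clusters.append([doc])
    return len(clusters) - 1


clusters = []
docs = [
    "<flight><from>Sydney</from><to>Melbourne</to></flight>",
    "<flight><from>Sydney</from><to>Perth</to></flight>",
    "<book><title>Data Mining</title></book>",
]
labels = [assign(clusters, profile(d)) for d in docs]
print(labels)
```

In this toy run the two flight documents share all tag paths and one of two terms, so they fall into the same cluster, while the book document shares neither structure nor content and opens a new one.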

Notes

  1. For simplicity, in this paper, we set \(m=\frac{n}{2}\).

  2. \(r^{2}_c\) is usually a fraction of \(r^{1}_c\), i.e. \(r^{2}_c = \lambda r^{1}_c\) with \(\lambda \in (0,1)\). In the paper, we find that \(\lambda = 0.8\) works fairly well.
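
As a worked instance of note 2, assume an illustrative value \(r^{1}_c = 0.5\) (a hypothetical number, not taken from the paper). The setting \(\lambda = 0.8\) then gives:

\[ r^{2}_c = \lambda r^{1}_c = 0.8 \times 0.5 = 0.4 \]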

Acknowledgements

This work was partially supported by the ARC Discovery Project under Grant No. DP170104747 and the Iraqi Ministry of Higher Education and Scientific Research.

Author information

Correspondence to Ahmed Al-Shammari.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Al-Shammari, A., Liu, C., Naseriparsa, M., Vo, B.Q., Anwar, T., Zhou, R. (2017). A Framework for Clustering and Dynamic Maintenance of XML Documents. In: Cong, G., Peng, WC., Zhang, W., Li, C., Sun, A. (eds) Advanced Data Mining and Applications. ADMA 2017. Lecture Notes in Computer Science, vol 10604. Springer, Cham. https://doi.org/10.1007/978-3-319-69179-4_28

  • DOI: https://doi.org/10.1007/978-3-319-69179-4_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-69178-7

  • Online ISBN: 978-3-319-69179-4

  • eBook Packages: Computer Science, Computer Science (R0)
