Configurable Application Designed for Mining XML Document Collections

Intelligent Control and Innovative Computing

Abstract

In this chapter we present a flexible and configurable application for mining large XML document collections. The work centers on the process of extracting document features related to structure and content. This process generates an attribute frequency matrix which, depending on the clustering algorithm, is transformed and/or used to obtain similarity measures.
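The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature definitions (element paths for structure, lower-cased words for content), the function names, the sample documents, and the choice of cosine similarity are all assumptions made for illustration, and only the Python standard library is used.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def extract_features(xml_text):
    """Count structural features (element paths) and content features (words).

    Both feature definitions are illustrative assumptions.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(node, parent_path):
        path = f"{parent_path}/{node.tag}"
        counts["path:" + path] += 1
        for child in node:
            walk(child, path)

    walk(root, "")
    # Content features: every whitespace-separated word in the text nodes.
    for word in " ".join(root.itertext()).lower().split():
        counts["word:" + word] += 1
    return counts


def frequency_matrix(feature_counts):
    """Build a documents x attributes matrix of raw frequencies."""
    attributes = sorted(set().union(*feature_counts))
    matrix = [[counts[a] for a in attributes] for counts in feature_counts]
    return attributes, matrix


def cosine_similarity(u, v):
    """Cosine of the angle between two frequency vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical three-document collection.
docs = [
    "<article><title>xml mining</title><body>clustering xml documents</body></article>",
    "<article><title>xml clustering</title><body>mining documents</body></article>",
    "<recipe><name>soup</name><step>boil water</step></recipe>",
]

attributes, matrix = frequency_matrix([extract_features(d) for d in docs])
similar = cosine_similarity(matrix[0], matrix[1])
dissimilar = cosine_similarity(matrix[0], matrix[2])
print(f"{len(attributes)} attributes")
print(f"doc0 vs doc1: {similar:.3f}")
print(f"doc0 vs doc2: {dissimilar:.3f}")
```

The two article documents share both structure and vocabulary, so their similarity is high, while the recipe document shares no attribute with them and scores zero. A clustering algorithm could consume either the matrix rows directly or the pairwise similarities, which is the choice the abstract leaves configurable.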

Notes

  1. http://www.uml.org [Accessed November 8, 2011].

  2. This collection contains an image of Wikipedia articles, and was extracted from a 2007 copy.

  3. http://www.postgresql.org/ [Accessed November 8, 2011].

  4. http://www.saxproject.org/ [Accessed November 8, 2011].

  5. http://www.w3.org/DOM/ [Accessed November 8, 2011].

  6. http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html [Accessed November 8, 2011].

  7. A stop word is a word frequently appearing in a text, such as “the”, “and”, “a”.

  8. http://www.oracle.com/technetwork/java/javase/downloads/index.html [Accessed November 8, 2011].

  9. http://www.r-project.org/ [Accessed November 8, 2011].

Acknowledgments

The authors acknowledge the financial support given to the Promep group “Analysis and Information Management” and thank the Universidad Autónoma Metropolitana, Unidad Azcapotzalco (UAM-A) for the financial support given to the project “Study and Systems Modelling” (number 2270217). They thank Dr. Nicolas Dominguez Vergara (former head of the Systems Department) and Dr. Silvia Gonzalez Brambila (former Coordinator of the Master’s Computer Science Program at the UAM Azcapotzalco) for their support and encouragement. They also thank Peggy Currid for her copyediting advice.

Author information

Corresponding authors

Correspondence to Cristal Karina Galindo-Durán or Héctor Javier Vázquez.

Copyright information

© 2012 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Mathieu, M.J., Galindo-Durán, C.K., Vázquez, H.J. (2012). Configurable Application Designed for Mining XML Document Collections. In: Ao, S., Castillo, O., Huang, X. (eds) Intelligent Control and Innovative Computing. Lecture Notes in Electrical Engineering, vol 110. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1695-1_18

  • DOI: https://doi.org/10.1007/978-1-4614-1695-1_18

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-1694-4

  • Online ISBN: 978-1-4614-1695-1

  • eBook Packages: Engineering, Engineering (R0)
