Abstract
In this chapter we present a flexible and configurable application for mining large XML document collections. The work centers on extracting document features related to structure and content. This process generates an attribute frequency matrix which, depending on the clustering algorithm, is transformed and/or used to obtain similarity measures.
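The pipeline summarized above (feature extraction, then an attribute frequency matrix, then similarity measures) can be illustrated with a minimal sketch. It is not the application described in the chapter. The function names, the `path:` and `word:` feature prefixes, the three-word stop list, and the choice of cosine similarity are all assumptions made for this example.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny illustrative stop list (assumption; a real list would be much longer).
STOP_WORDS = {"the", "and", "a"}


def extract_features(xml_text):
    """Count structure features (root-to-node tag paths) and content features (words)."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    # Structure: one feature per element, named by its tag path from the root.
    stack = [(root, "")]
    while stack:
        node, parent_path = stack.pop()
        path = f"{parent_path}/{node.tag}"
        counts["path:" + path] += 1
        stack.extend((child, path) for child in node)
    # Content: lowercase words from all text nodes, minus stop words.
    text = " ".join(root.itertext()).lower()
    for word in re.findall(r"[a-z]+", text):
        if word not in STOP_WORDS:
            counts["word:" + word] += 1
    return counts


def frequency_matrix(docs):
    """Return (attributes, matrix): one row per document, one column per attribute."""
    per_doc = [extract_features(doc) for doc in docs]
    attributes = sorted(set().union(*per_doc))
    matrix = [[counts[attr] for attr in attributes] for counts in per_doc]
    return attributes, matrix


def cosine_similarity(u, v):
    """Cosine of the angle between two frequency vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Hypothetical three-document collection, invented for this example.
DOCS = [
    "<article><title>XML mining</title><body>Clustering the XML documents</body></article>",
    "<article><title>XML clustering</title><body>Mining documents</body></article>",
    "<recipe><name>Bread</name><step>Mix flour and water</step></recipe>",
]

attributes, matrix = frequency_matrix(DOCS)
for i in range(len(DOCS)):
    print([round(cosine_similarity(matrix[i], row), 2) for row in matrix])
```

In this toy collection the two `article` documents share both tag paths and vocabulary, so they score as more similar to each other than to the `recipe` document, with which they share no features. A clustering algorithm could take either the matrix itself or the pairwise similarities as input.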
Notes
1. http://www.uml.org [Accessed November 8, 2011].
2. This collection contains an image of Wikipedia articles, and was extracted from a 2007 copy.
3. http://www.postgresql.org/ [Accessed November 8, 2011].
4. http://www.saxproject.org/ [Accessed November 8, 2011].
5. http://www.w3.org/DOM/ [Accessed November 8, 2011].
6. http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html [Accessed November 8, 2011].
7. A stop word is a word frequently appearing in a text, such as “the”, “and”, “a”.
8. http://www.oracle.com/technetwork/java/javase/downloads/index.html [Accessed November 8, 2011].
9. http://www.r-project.org/ [Accessed November 8, 2011].
Acknowledgments
The authors acknowledge the financial support given to the Promep group “Analysis and Information Management” and thank the Universidad Autónoma Metropolitana, Unidad Azcapotzalco (UAM-A) for the financial support given to the project “Study and Systems Modelling,” number 2270217. They thank Dr. Nicolas Dominguez Vergara (former head of the Systems Department) and Dr. Silvia Gonzalez Brambila (former Coordinator of the Master’s Computer Science Program at the UAM Azcapotzalco) for their support and encouragement. They also thank Peggy Currid for her copyediting advice.
Copyright information
© 2012 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Mathieu, M.J., Galindo-Durán, C.K., Vázquez, H.J. (2012). Configurable Application Designed for Mining XML Document Collections. In: Ao, S., Castillo, O., Huang, X. (eds) Intelligent Control and Innovative Computing. Lecture Notes in Electrical Engineering, vol 110. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1695-1_18
DOI: https://doi.org/10.1007/978-1-4614-1695-1_18
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-1694-4
Online ISBN: 978-1-4614-1695-1
eBook Packages: Engineering, Engineering (R0)