Configurable Application Designed for Mining XML Document Collections

Mathieu, Mihaela Juganaru; Galindo-Durán, Cristal Karina; Vázquez, Héctor Javier

doi:10.1007/978-1-4614-1695-1_18

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 110)

Abstract

In this chapter we present a flexible and configurable application for mining large XML document collections. The work centers on the process of extracting document features related to structure and content. This process generates an attribute frequency matrix which, depending on the clustering algorithm, is transformed and/or used to obtain similarity measures.
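
As an illustration only, the pipeline described in the abstract could be sketched as follows. The feature definitions (root-to-node tag paths for structure, lowercased words for content), the stop-word list, the function names, and the choice of cosine similarity are all assumptions made for this sketch; the chapter's actual implementation is not shown in this preview.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical stop-word list, used only for this sketch.
STOP_WORDS = {"the", "and", "a", "of", "in"}


def extract_features(xml_text):
    """Count structure features (tag paths) and content features (words)."""
    counts = Counter()
    root = ET.fromstring(xml_text)

    def walk(node, path):
        current = f"{path}/{node.tag}"
        counts[f"path:{current}"] += 1
        # Text owned by this node: its leading text plus the tail of each child.
        pieces = [node.text or ""] + [child.tail or "" for child in node]
        for word in re.findall(r"[a-z]+", " ".join(pieces).lower()):
            if word not in STOP_WORDS:
                counts[f"word:{word}"] += 1
        for child in node:
            walk(child, current)

    walk(root, "")
    return counts


def frequency_matrix(documents):
    """Return (attribute names, documents x attributes frequency matrix)."""
    per_doc = [extract_features(doc) for doc in documents]
    attributes = sorted(set().union(*per_doc))
    matrix = [[counts[name] for name in attributes] for counts in per_doc]
    return attributes, matrix


def cosine_similarity(u, v):
    """Cosine similarity between two frequency rows; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    docs = [
        "<article><title>XML mining</title><body>Clustering XML documents</body></article>",
        "<article><title>XML clustering</title><body>Mining document collections</body></article>",
        "<recipe><name>Soup</name><step>Boil the water</step></recipe>",
    ]
    attributes, matrix = frequency_matrix(docs)
    for i in range(len(docs)):
        print([round(cosine_similarity(matrix[i], row), 3) for row in matrix])
```

A clustering algorithm that expects distances rather than similarities could work from `1 - cosine_similarity(u, v)`, or the raw matrix could be reweighted first; which transformation applies would depend on the algorithm chosen.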

Notes

1. http://www.uml.org [Accessed November 8, 2011].
2. This collection contains an image of Wikipedia articles, and was extracted from a 2007 copy.
3. http://www.postgresql.org/ [Accessed November 8, 2011].
4. http://www.saxproject.org/ [Accessed November 8, 2011].
5. http://www.w3.org/DOM/ [Accessed November 8, 2011].
6. http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html [Accessed November 8, 2011].
7. A stop word is a word frequently appearing in a text, such as “the”, “and”, “a”.
8. http://www.oracle.com/technetwork/java/javase/downloads/index.html [Accessed November 8, 2011].
9. http://www.r-project.org/ [Accessed November 8, 2011].

Acknowledgments

The authors acknowledge the financial support given to the Promep group “Analysis and Information Management” and thank the Universidad Autónoma Metropolitana, Unidad Azcapotzalco (UAM-A) for the financial support given to the project “Study and Systems Modelling,” Number: 2270217. They thank Dr. Nicolas Dominguez Vergara (former head of the Systems Department) and Dr. Silvia Gonzalez Brambila (former Coordinator of the Master’s Computer Science Program at the UAM Azcapotzalco) for their support and encouragement. They also thank Peggy Currid for her copyediting advice.

Author information

Authors and Affiliations

Laboratoire en Sciences et Technologies de l’Information, Institut H. Fayol, Ecole Nationale Supérieure des Mines de Saint Etienne, 158, cours Fauriel, 42023, Saint Etienne Cedex 2, France
Mihaela Juganaru Mathieu
Computer Science, MSE Program, Universidad Autónoma Metropolitana, Unidad Azcapotzalco, Mexico
Cristal Karina Galindo-Durán
Departamento de Sistemas, Universidad Autónoma Metropolitana, Unidad Azcapotzalco, Avenida San Pablo 180, Col. Reynosa Tamaulipas, Mexico D.F., C.P. 02200, Mexico
Héctor Javier Vázquez

Corresponding authors

Correspondence to Cristal Karina Galindo-Durán or Héctor Javier Vázquez.

Editor information

Editors and Affiliations

International Association of Engineers, Hung To Road, Unit 1, 1/F, 37-39, Hong Kong, China
Sio Iong Ao
Computer Science, Tijuana Institute of Technology, Chula Vista, 91909, California, USA
Oscar Castillo
Faculty of Information Sciences & Eng., University of Canberra, Canberra, 2601, Australian Capital Territory, Australia
Xu Huang

About this chapter

Cite this chapter

Mathieu, M.J., Galindo-Durán, C.K., Vázquez, H.J. (2012). Configurable Application Designed for Mining XML Document Collections. In: Ao, S., Castillo, O., Huang, X. (eds) Intelligent Control and Innovative Computing. Lecture Notes in Electrical Engineering, vol 110. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1695-1_18

DOI: https://doi.org/10.1007/978-1-4614-1695-1_18
Published: 11 December 2011
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-1694-4
Online ISBN: 978-1-4614-1695-1
eBook Packages: Engineering, Engineering (R0)
