Unsupervised Classification of Text-Centric XML Document Collections

Doucet, Antoine; Lehtonen, Miro

doi:10.1007/978-3-540-73888-6_46

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4518)

Included in the following conference series:

International Workshop of the Initiative for the Evaluation of XML Retrieval

Abstract

This paper addresses the unsupervised classification of text-centric XML documents. In the context of the INEX 2006 mining track, we present methods that exploit the inherent structural information of XML documents in the document clustering process. Using the k-means algorithm, we experimented with a couple of feature sets and found that a promising direction is to use structural information as a preliminary step to detect and set aside structural outliers. This approach improves the semantic quality of the clustering significantly more than combining the structural and textual feature sets does.
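The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature definitions (element-name counts for structure, word counts for text), the cosine threshold of 0.7, the seeding with the first k vectors, and all function names are assumptions made for illustration only.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

def tag_counts(xml_text):
    """Structural features (assumed): how often each element name occurs."""
    return Counter(el.tag for el in ET.fromstring(xml_text).iter())

def term_counts(xml_text):
    """Textual features (assumed): lowercase word counts over all text content."""
    text = " ".join(ET.fromstring(xml_text).itertext())
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mean(vectors):
    total = Counter()
    for v in vectors:
        total.update(v)  # Counter.update adds counts key by key
    return {k: c / len(vectors) for k, c in total.items()}

def split_structural_outliers(docs, threshold=0.7):
    """Stage 1: set aside documents whose tag profile is far from the mean profile."""
    profiles = [tag_counts(d) for d in docs]
    centre = mean(profiles)
    keep, outliers = [], []
    for i, p in enumerate(profiles):
        (keep if cosine(p, centre) >= threshold else outliers).append(i)
    return keep, outliers

def kmeans(vectors, k, iterations=10):
    """Stage 2: plain k-means with cosine similarity, seeded with the first k vectors."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean(members)
    return labels

# Toy collection (hypothetical data): four article-like documents and one table-like outlier.
docs = [
    "<article><title>xml retrieval</title><p>xml retrieval evaluation</p></article>",
    "<article><title>xml search</title><p>retrieval of xml elements</p></article>",
    "<article><title>protein folding</title><p>protein structure folding</p></article>",
    "<article><title>protein genes</title><p>genes and protein folding</p></article>",
    "<table><row><cell>1</cell><cell>2</cell></row><row><cell>3</cell><cell>4</cell></row></table>",
]
keep, outliers = split_structural_outliers(docs)
labels = kmeans([term_counts(docs[i]) for i in keep], k=2)
print("structural outliers:", outliers)
print("text clusters:", dict(zip(keep, labels)))
```

In this toy run the table-like document is set aside before clustering, and the remaining documents are grouped on text alone. A real collection would need a more robust outlier criterion and a better seeding strategy than the ones assumed here.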

The paper also discusses the problem of the evaluation of XML clustering. Currently, in the INEX mining track, XML clustering techniques are evaluated against semantic categories. We believe there is a mismatch between the task (to exploit the document structure) and the evaluation, which disregards structural aspects. An illustration of this fact is that, over all the clustering track submissions, our text-based runs obtained the 1st rank (Wikipedia collection, out of 7) and 2nd rank (IEEE collection, out of 13).
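To make the evaluation issue concrete, the sketch below scores a clustering against semantic category labels using purity. Purity is used here only as a familiar example of a category-based measure; the abstract does not say which measure the track applies, and the function name and data are hypothetical. The point to notice is that the score depends only on category labels, so nothing about document structure enters it.

```python
from collections import Counter

def purity(cluster_labels, category_labels):
    """Fraction of documents that belong to the majority category of their cluster."""
    per_cluster = {}
    for cluster, category in zip(cluster_labels, category_labels):
        per_cluster.setdefault(cluster, Counter())[category] += 1
    majority = sum(max(counts.values()) for counts in per_cluster.values())
    return majority / len(cluster_labels)

# Hypothetical clustering of six documents and their semantic categories.
clusters = [0, 0, 0, 1, 1, 1]
categories = ["physics", "physics", "biology", "biology", "biology", "biology"]
score = purity(clusters, categories)
print(round(score, 3))
```

Two clusterings that group documents identically by topic receive the same score under such a measure, regardless of how structurally coherent their clusters are.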

Author information

Authors and Affiliations

IRISA-INRIA, Campus de Beaulieu, F-35042 Rennes Cedex, France
Antoine Doucet
Department of Computer Science, P. O. Box 68 (Gustaf Hällströmin katu 2b), FI–00014 University of Helsinki, Finland
Antoine Doucet & Miro Lehtonen

Editor information

Norbert Fuhr, Mounia Lalmas, Andrew Trotman

About this paper

Cite this paper

Doucet, A., Lehtonen, M. (2007). Unsupervised Classification of Text-Centric XML Document Collections. In: Fuhr, N., Lalmas, M., Trotman, A. (eds) Comparative Evaluation of XML Information Retrieval Systems. INEX 2006. Lecture Notes in Computer Science, vol 4518. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73888-6_46

DOI: https://doi.org/10.1007/978-3-540-73888-6_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73887-9
Online ISBN: 978-3-540-73888-6
eBook Packages: Computer Science, Computer Science (R0)
