Abstract
This paper addresses the unsupervised classification of text-centric XML documents. In the context of the INEX 2006 mining track, we present methods that exploit the inherent structural information of XML documents in the document clustering process. Using the k-means algorithm, we experimented with a couple of feature sets and found that a promising direction is to use structural information as a preliminary step to detect and set aside structural outliers. This approach improves the semantic quality of the clustering significantly more than combining the structural and textual feature sets does.
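The two-stage idea can be illustrated with a minimal sketch. The abstract does not specify the feature sets or the outlier criterion, so everything here is an assumption made for illustration: the structural features are relative element-name frequencies (`tag_profile`), a document counts as a structural outlier when it lies farther than a hand-picked `threshold` from the structural centroid, and the text vectors are toy two-dimensional values. Only the use of k-means comes from the paper.

```python
import math
import random
from collections import Counter


def tag_profile(tags, vocabulary):
    """Hypothetical structural feature: relative frequency of each element name."""
    counts = Counter(tags)
    total = sum(counts.values()) or 1
    return [counts[t] / total for t in vocabulary]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def split_structural_outliers(struct_vectors, threshold):
    """Assumed criterion: outliers lie farther than `threshold` from the centroid."""
    centre = mean(struct_vectors)
    inliers, outliers = [], []
    for i, vector in enumerate(struct_vectors):
        (outliers if distance(vector, centre) > threshold else inliers).append(i)
    return inliers, outliers


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means: assign each vector to its nearest centroid, then update."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: distance(v, centroids[c])) for v in vectors
        ]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean(members)
    return labels


# Toy collection: four article-like documents and one table-like document.
vocabulary = ["article", "p", "table", "row"]
doc_tags = [["article", "p", "p", "p"]] * 4 + [["table", "row", "row", "row"]]
struct_vectors = [tag_profile(tags, vocabulary) for tags in doc_tags]

# Stage 1: set structural outliers aside.
inliers, outliers = split_structural_outliers(struct_vectors, threshold=0.5)

# Stage 2: cluster the remaining documents on their (toy) text features only.
text_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]]
labels = kmeans([text_vectors[i] for i in inliers], k=2)
print(inliers, outliers, labels)
```

In this sketch the structural features never enter the k-means distance; they only decide which documents take part in the text-based clustering, which is what distinguishes the approach from concatenating both feature sets.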
The paper also discusses the evaluation of XML clustering. Currently, in the INEX mining track, XML clustering techniques are evaluated against semantic categories. We believe there is a mismatch between the task (to exploit the document structure) and the evaluation, which disregards structural aspects. As an illustration, among all the clustering track submissions, our text-based runs ranked 1st out of 7 on the Wikipedia collection and 2nd out of 13 on the IEEE collection.
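To make the mismatch concrete, here is a small sketch of scoring a clustering against semantic categories. The abstract does not name the evaluation measure, so purity is used here purely as an assumed, commonly used example; the function name `purity` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def purity(cluster_labels, category_labels):
    """Fraction of documents that belong to the majority category of their cluster."""
    clusters = defaultdict(list)
    for cluster, category in zip(cluster_labels, category_labels):
        clusters[cluster].append(category)
    majority_total = sum(
        Counter(categories).most_common(1)[0][1] for categories in clusters.values()
    )
    return majority_total / len(cluster_labels)


# Four documents: two semantic topics, each written in two different layouts.
semantic_categories = ["physics", "physics", "biology", "biology"]
text_based_clusters = [0, 0, 1, 1]       # groups documents by topic
structure_based_clusters = [0, 1, 0, 1]  # groups documents by layout

text_score = purity(text_based_clusters, semantic_categories)
structure_score = purity(structure_based_clusters, semantic_categories)
print(text_score, structure_score)
```

Under a measure of this kind, a clustering that groups documents perfectly by layout can still score no better than chance, because the reference categories carry no structural information.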
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Doucet, A., Lehtonen, M. (2007). Unsupervised Classification of Text-Centric XML Document Collections. In: Fuhr, N., Lalmas, M., Trotman, A. (eds) Comparative Evaluation of XML Information Retrieval Systems. INEX 2006. Lecture Notes in Computer Science, vol 4518. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73888-6_46
DOI: https://doi.org/10.1007/978-3-540-73888-6_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73887-9
Online ISBN: 978-3-540-73888-6
eBook Packages: Computer Science, Computer Science (R0)