Classification of Documents Based on the Structure of Their DOM Trees

Geibel, Peter; Pustylnikov, Olga; Mehler, Alexander; Gust, Helmar; Kühnberger, Kai-Uwe

doi:10.1007/978-3-540-69162-4_81

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4985)

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

In this paper, we discuss kernels that can be applied to the classification of XML documents based on their DOM trees. DOM trees are ordered trees in which every node may be labeled by a vector of attributes, including its XML tag and its textual content. We describe five new kernels suitable for such structures: a kernel based on predefined structural features; a tree kernel derived from the well-known parse tree kernel; the set tree kernel, which allows permutations of children; the string tree kernel, an extension of the so-called partial tree kernel; and the soft tree kernel, a more efficient alternative. We evaluate the kernels experimentally on a corpus containing the DOM trees of newspaper articles and on the well-known SUSANNE corpus.
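
To make the idea of a tree kernel on DOM trees concrete, the following is a minimal sketch and not any of the five kernels described in the paper, whose definitions are not reproduced on this page. It assumes a simplified variant of the parse tree kernel: two nodes match only if they have the same tag and the same ordered sequence of child tags, and the kernel sums the (decayed) number of common subtree fragments over all node pairs. The function names `production`, `tree_kernel`, and `normalized_kernel`, the `decay` parameter, and the toy documents are all illustrative assumptions; attributes and textual content are ignored.

```python
import math
import xml.etree.ElementTree as ET


def production(node):
    """Tag of a node plus the ordered tags of its children (assumed matching rule)."""
    return (node.tag, tuple(child.tag for child in node))


def tree_kernel(root_a, root_b, decay=1.0):
    """Sum, over all node pairs, of the decayed count of common subtree fragments."""
    memo = {}

    def common(a, b):
        key = (id(a), id(b))
        if key in memo:
            return memo[key]
        if production(a) != production(b):
            value = 0.0
        else:
            # Equal productions imply equally many children, so zip pairs them up in order.
            value = decay
            for child_a, child_b in zip(a, b):
                value *= 1.0 + common(child_a, child_b)
        memo[key] = value
        return value

    return sum(common(a, b) for a in root_a.iter() for b in root_b.iter())


def normalized_kernel(root_a, root_b, decay=1.0):
    """Cosine-style normalization so that a tree compared with itself scores 1."""
    k_ab = tree_kernel(root_a, root_b, decay)
    k_aa = tree_kernel(root_a, root_a, decay)
    k_bb = tree_kernel(root_b, root_b, decay)
    return k_ab / math.sqrt(k_aa * k_bb)


# Hypothetical toy documents: same children, different order.
doc_a = ET.fromstring("<article><title/><p/></article>")
doc_c = ET.fromstring("<article><p/><title/></article>")

print(tree_kernel(doc_a, doc_a))  # 6.0
print(tree_kernel(doc_a, doc_c))  # 2.0
```

Because this sketch compares children position by position, swapping `title` and `p` removes the match at the root and only the two leaf matches remain. A kernel that allows permutations of children, as the set tree kernel in the abstract is said to do, would treat the two documents as more similar.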

Author information

Authors and Affiliations

Institute of Cognitive Science, AI Group, University of Osnabrück, Germany
Peter Geibel, Helmar Gust & Kai-Uwe Kühnberger
Text Technology Group, University of Bielefeld, Germany
Olga Pustylnikov & Alexander Mehler

Editor information

Masumi Ishikawa, Kenji Doya, Hiroyuki Miyamoto, Takeshi Yamakawa

Cite this paper

Geibel, P., Pustylnikov, O., Mehler, A., Gust, H., Kühnberger, K.-U. (2008). Classification of Documents Based on the Structure of Their DOM Trees. In: Ishikawa, M., Doya, K., Miyamoto, H., Yamakawa, T. (eds) Neural Information Processing. ICONIP 2007. Lecture Notes in Computer Science, vol 4985. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69162-4_81

DOI: https://doi.org/10.1007/978-3-540-69162-4_81
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69159-4
Online ISBN: 978-3-540-69162-4
eBook Packages: Computer Science, Computer Science (R0)
