Classification of Documents Based on the Structure of Their DOM Trees

  • Conference paper
Neural Information Processing (ICONIP 2007)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4985)

Abstract

In this paper, we discuss kernels that can be applied for the classification of XML documents based on their DOM trees. DOM trees are ordered trees in which every node might be labeled by a vector of attributes including its XML tag and the textual content. We describe five new kernels suitable for such structures: a kernel based on predefined structural features, a tree kernel derived from the well-known parse tree kernel, the set tree kernel that allows permutations of children, the string tree kernel being an extension of the so-called partial tree kernel, and the soft tree kernel as a more efficient alternative. We evaluate the kernels experimentally on a corpus containing the DOM trees of newspaper articles and on the well-known SUSANNE corpus.
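As a point of orientation, the well-known parse tree kernel mentioned above (Collins and Duffy) can be sketched as follows. This is a minimal illustration of that baseline in its commonly described form, not any of the five kernels proposed in the paper. The names `Node`, `production`, `tree_kernel`, and the decay parameter `lam` are assumptions introduced here, and node labels stand in for XML tags only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Ordered tree node; `label` stands in for an XML tag (illustrative)."""
    label: str
    children: tuple = ()


def production(node):
    """A node's label together with the ordered labels of its children."""
    return (node.label, tuple(c.label for c in node.children))


def nodes(tree):
    """Yield every node of the tree in pre-order."""
    yield tree
    for child in tree.children:
        yield from nodes(child)


def tree_kernel(t1, t2, lam=1.0):
    """Sum, over all node pairs, of the weighted number of common
    subtrees rooted at that pair (parse-tree-kernel recursion)."""
    memo = {}

    def common(n1, n2):
        # Leaves root no production, so they contribute nothing themselves.
        if not n1.children or not n2.children:
            return 0.0
        key = (id(n1), id(n2))
        if key not in memo:
            if production(n1) != production(n2):
                memo[key] = 0.0
            else:
                value = lam
                # Each child pair is either cut off (1) or expanded further.
                for c1, c2 in zip(n1.children, n2.children):
                    value *= 1.0 + common(c1, c2)
                memo[key] = value
        return memo[key]

    return sum(common(a, b) for a in nodes(t1) for b in nodes(t2))


# Two toy DOM-like trees that differ only in the number of <p> children.
doc_a = Node("article", (Node("head", (Node("title"),)),
                         Node("body", (Node("p"), Node("p")))))
doc_b = Node("article", (Node("head", (Node("title"),)),
                         Node("body", (Node("p"),))))

k_ab = tree_kernel(doc_a, doc_b)
k_aa = tree_kernel(doc_a, doc_a)
k_bb = tree_kernel(doc_b, doc_b)
similarity = k_ab / (k_aa * k_bb) ** 0.5  # cosine-style normalization
```

Because the recursion requires identical productions, the two `body` nodes above do not match at all even though they differ by a single child. Strictness of this kind is one reason to consider kernels that tolerate permuted or partially matching child sequences, as the set, string, and soft tree kernels named in the abstract are described as doing.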

Editor information

Masumi Ishikawa, Kenji Doya, Hiroyuki Miyamoto, Takeshi Yamakawa

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Geibel, P., Pustylnikov, O., Mehler, A., Gust, H., Kühnberger, KU. (2008). Classification of Documents Based on the Structure of Their DOM Trees. In: Ishikawa, M., Doya, K., Miyamoto, H., Yamakawa, T. (eds) Neural Information Processing. ICONIP 2007. Lecture Notes in Computer Science, vol 4985. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69162-4_81

  • DOI: https://doi.org/10.1007/978-3-540-69162-4_81

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-69159-4

  • Online ISBN: 978-3-540-69162-4

  • eBook Packages: Computer Science, Computer Science (R0)
