Abstract
In this paper, we investigate strategies for automatically classifying documents in different languages thematically, geographically or according to other criteria. A novel linguistically motivated text representation scheme is presented that can be used with machine learning algorithms in order to learn classifications from pre-classified examples and then automatically classify documents that might be provided in entirely different languages. Our approach makes use of ontologies and lexical resources but goes beyond a simple mapping from terms to concepts by fully exploiting the external knowledge manifested in such resources and mapping to entire regions of concepts. For this, a graph traversal algorithm is used to explore related concepts that might be relevant. Extensive testing has shown that our methods lead to significant improvements compared to existing approaches.
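The idea of mapping terms to entire regions of concepts can be illustrated with a small sketch. This is not the authors' algorithm, only one plausible reading of the abstract: each term is looked up in a multilingual lexicon, and a best-first graph traversal then spreads a decaying weight to related concepts until the weight falls below a threshold. The toy `ONTOLOGY`, the `LEXICON`, the edge weights, the `threshold` value, and the function names `expand` and `concept_vector` are all hypothetical and chosen purely for illustration.

```python
import heapq
from collections import defaultdict

# Hypothetical toy ontology: concept -> [(related concept, relation weight in (0, 1])].
ONTOLOGY = {
    "dog.n": [("canine.n", 0.8), ("pet.n", 0.6)],
    "canine.n": [("mammal.n", 0.8)],
    "pet.n": [("animal.n", 0.4)],
    "mammal.n": [("animal.n", 0.8)],
    "animal.n": [],
}

# Hypothetical multilingual lexicon: (language, term) -> language-independent concepts.
LEXICON = {
    ("en", "dog"): ["dog.n"],
    ("de", "hund"): ["dog.n"],
    ("es", "perro"): ["dog.n"],
}


def expand(start, threshold=0.3):
    """Best-first traversal from `start`.

    Returns {concept: weight}, where the weight is the largest product of
    edge weights over any path from `start`, restricted to weights that
    stay at or above `threshold`.
    """
    best = {start: 1.0}
    heap = [(-1.0, start)]  # max-heap emulated with negated weights
    while heap:
        neg_weight, concept = heapq.heappop(heap)
        weight = -neg_weight
        if weight < best.get(concept, 0.0):
            continue  # stale entry: a better path was already found
        for neighbour, edge_weight in ONTOLOGY.get(concept, []):
            new_weight = weight * edge_weight
            if new_weight >= threshold and new_weight > best.get(neighbour, 0.0):
                best[neighbour] = new_weight
                heapq.heappush(heap, (-new_weight, neighbour))
    return best


def concept_vector(tokens, lang, threshold=0.3):
    """Build a language-independent feature vector of weighted concepts."""
    vector = defaultdict(float)
    for token in tokens:
        for concept in LEXICON.get((lang, token.lower()), []):
            for related, weight in expand(concept, threshold).items():
                vector[related] += weight
    return dict(vector)


if __name__ == "__main__":
    print(concept_vector(["Hund"], "de"))
    print(concept_vector(["dog"], "en"))
```

Under these assumptions, an English and a German document that mention the same thing end up with the same concept features, so a classifier trained on vectors from one language could in principle be applied to documents in another. A real system would also need word sense disambiguation and a full lexical resource, neither of which this sketch attempts.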
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
de Melo, G., Siersdorfer, S. (2007). Multilingual Text Classification Using Ontologies. In: Amati, G., Carpineto, C., Romano, G. (eds) Advances in Information Retrieval. ECIR 2007. Lecture Notes in Computer Science, vol 4425. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71496-5_49
DOI: https://doi.org/10.1007/978-3-540-71496-5_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71494-1
Online ISBN: 978-3-540-71496-5
eBook Packages: Computer Science, Computer Science (R0)