Bilingual Text Classification

Civera, Jorge; Cubel, Elsa; Vidal, Enrique

doi:10.1007/978-3-540-72847-4_35

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 4477)

Included in the following conference series:

Iberian Conference on Pattern Recognition and Image Analysis

Abstract

Bilingual documentation has become common in official institutions and private companies. In this scenario, the categorization of bilingual text is a useful tool. This paper proposes several approaches to this bilingual classification task. On the one hand, three finite-state transducer algorithms from the grammatical inference framework are presented. On the other hand, a naive combination of smoothed n-gram models is introduced. To evaluate the performance of the bilingual classifiers, two categorized bilingual corpora of different complexity were considered. Experiments on a limited-domain task show that all the models obtain similar results. However, results on a more open-domain task show the superiority of the naive approach.
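
To make the idea of a naive combination of smoothed n-gram models concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the two halves of a bilingual sentence pair are independent given the class, so that a pair (x, y) is assigned to the class c maximizing p(c) p(x|c) p(y|c), with one n-gram model per class and language. The choice of bigrams, add-one smoothing, the class names, and the toy Spanish-English sentences are all illustrative assumptions; the abstract does not specify the n-gram order or the smoothing technique.

```python
import math
from collections import Counter, defaultdict


class BigramLM:
    """Bigram language model with add-one smoothing (an assumed, simple smoother)."""

    def __init__(self, sentences):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = {"</s>"}
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.vocab.update(tokens[1:])
            for prev, word in zip(tokens, tokens[1:]):
                self.bigrams[(prev, word)] += 1
                self.contexts[prev] += 1

    def log_prob(self, sentence, vocab_size):
        """Smoothed log-probability of a whitespace-tokenized sentence."""
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        total = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, word)] + 1
            denominator = self.contexts[prev] + vocab_size
            total += math.log(numerator / denominator)
        return total


class NaiveBilingualClassifier:
    """Scores each class as log p(c) + log p(x|c) + log p(y|c)."""

    def __init__(self, training_data):
        # training_data: iterable of (source_sentence, target_sentence, label)
        by_class = defaultdict(lambda: ([], []))
        for src, tgt, label in training_data:
            by_class[label][0].append(src)
            by_class[label][1].append(tgt)
        total = sum(len(srcs) for srcs, _ in by_class.values())
        self.log_prior = {c: math.log(len(s) / total) for c, (s, _) in by_class.items()}
        self.src_lm = {c: BigramLM(s) for c, (s, _) in by_class.items()}
        self.tgt_lm = {c: BigramLM(t) for c, (_, t) in by_class.items()}
        # One shared vocabulary size per language, plus one slot for unseen words.
        self.src_v = len(set().union(*(m.vocab for m in self.src_lm.values()))) + 1
        self.tgt_v = len(set().union(*(m.vocab for m in self.tgt_lm.values()))) + 1

    def classify(self, src, tgt):
        def score(c):
            return (self.log_prior[c]
                    + self.src_lm[c].log_prob(src, self.src_v)
                    + self.tgt_lm[c].log_prob(tgt, self.tgt_v))
        return max(self.log_prior, key=score)


# Hypothetical toy data, invented for illustration only.
toy_training_data = [
    ("quiero una habitación doble", "i want a double room", "room"),
    ("tiene alguna habitación libre", "do you have a free room", "room"),
    ("prepare la cuenta por favor", "please prepare the bill", "bill"),
    ("hay un error en la cuenta", "there is a mistake in the bill", "bill"),
]
clf = NaiveBilingualClassifier(toy_training_data)
print(clf.classify("quiero una habitación libre", "i want a free room"))
```

The independence assumption is what makes the combination "naive" in this sketch: the source and target models are trained and scored separately, and the translation relation between the two sentences is ignored. A transducer-based classifier would instead model the pair jointly.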

Work supported by the EC (FEDER) and the Spanish MEC under grant TIN2006-15694-CO2-01 and the Consellería d’Empresa, Universitat i Ciència - Generalitat Valenciana under contract GV06/252 and the Ministerio de Educación y Ciencia.

Author information

Authors and Affiliations

Instituto Tecnológico de Informática, Universidad Politécnica de Valencia
Jorge Civera, Elsa Cubel & Enrique Vidal

Editor information

Joan Martí, José Miguel Benedí, Ana Maria Mendonça, Joan Serrat

About this paper

Cite this paper

Civera, J., Cubel, E., Vidal, E. (2007). Bilingual Text Classification. In: Martí, J., Benedí, J.M., Mendonça, A.M., Serrat, J. (eds) Pattern Recognition and Image Analysis. IbPRIA 2007. Lecture Notes in Computer Science, vol 4477. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72847-4_35

DOI: https://doi.org/10.1007/978-3-540-72847-4_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72846-7
Online ISBN: 978-3-540-72847-4
eBook Packages: Computer Science, Computer Science (R0)
