Abstract
Bilingual documentation is now common in official institutions and private companies, which makes the categorization of bilingual text a useful tool. This paper proposes several approaches to this bilingual classification task. On the one hand, three finite-state transducer algorithms from the grammatical inference framework are presented. On the other hand, a naive combination of smoothed n-gram models is introduced. To evaluate the bilingual classifiers, two categorized bilingual corpora of different complexity were considered. Experiments on a limited-domain task show that all the models obtain similar results. However, results on a more open-domain task indicate that the naive approach is superior.
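As an illustration of what a naive combination of smoothed n-gram models could look like, the following minimal sketch trains one bigram model per class and per language and scores a sentence pair by summing the two log-probabilities with the class log-prior, i.e. it assumes the two languages are independent given the class. The abstract does not specify the n-gram order, the smoothing method or the combination rule, so the add-one smoothing, the bigram order, the class names `BigramModel` and `NaiveBilingualClassifier`, and the toy data are all assumptions made here for illustration.

```python
import math
from collections import Counter


class BigramModel:
    """Add-one smoothed bigram model (smoothing choice is an assumption)."""

    def __init__(self, sentences):
        self.unigrams = Counter()   # counts of each token as a bigram history
        self.bigrams = Counter()    # counts of (history, token) pairs
        self.vocab = {"</s>"}
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.vocab.update(tokens[1:])
            for prev, cur in zip(tokens, tokens[1:]):
                self.unigrams[prev] += 1
                self.bigrams[(prev, cur)] += 1

    def log_prob(self, sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        v = len(self.vocab) + 1  # one extra slot reserved for unseen words
        return sum(
            math.log((self.bigrams[(p, c)] + 1) / (self.unigrams[p] + v))
            for p, c in zip(tokens, tokens[1:])
        )


class NaiveBilingualClassifier:
    """Scores a pair (x, y) as log p(c) + log p(x | c) + log p(y | c)."""

    def __init__(self, labelled_pairs):
        # labelled_pairs: iterable of (source_sentence, target_sentence, label)
        by_class = {}
        for src, tgt, label in labelled_pairs:
            sources, targets = by_class.setdefault(label, ([], []))
            sources.append(src)
            targets.append(tgt)
        total = sum(len(s) for s, _ in by_class.values())
        self.log_prior = {c: math.log(len(s) / total) for c, (s, _) in by_class.items()}
        self.src_lm = {c: BigramModel(s) for c, (s, _) in by_class.items()}
        self.tgt_lm = {c: BigramModel(t) for c, (_, t) in by_class.items()}

    def classify(self, src, tgt):
        def score(c):
            return (self.log_prior[c]
                    + self.src_lm[c].log_prob(src)
                    + self.tgt_lm[c].log_prob(tgt))
        return max(self.log_prior, key=score)


# Toy English-Spanish data with hypothetical class labels (not the paper's corpora).
train = [
    ("a room with a view", "una habitación con vistas", "hotel"),
    ("i want a double room", "quiero una habitación doble", "hotel"),
    ("the committee approved the report", "el comité aprobó el informe", "politics"),
    ("the parliament voted on the report", "el parlamento votó el informe", "politics"),
]
clf = NaiveBilingualClassifier(train)
print(clf.classify("a double room", "una habitación doble"))
```

The independence assumption keeps training trivial, since each monolingual model is estimated separately; a transducer-based classifier would instead model the sentence pair jointly.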
Work supported by the EC (FEDER) and the Spanish MEC under grant TIN2006-15694-CO2-01 and the Consellería d’Empresa, Universitat i Ciència - Generalitat Valenciana under contract GV06/252 and the Ministerio de Educación y Ciencia.
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Civera, J., Cubel, E., Vidal, E. (2007). Bilingual Text Classification. In: Martí, J., Benedí, J.M., Mendonça, A.M., Serrat, J. (eds) Pattern Recognition and Image Analysis. IbPRIA 2007. Lecture Notes in Computer Science, vol 4477. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72847-4_35
DOI: https://doi.org/10.1007/978-3-540-72847-4_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72846-7
Online ISBN: 978-3-540-72847-4
eBook Packages: Computer Science (R0)