Improving POS Tagging Across Portuguese Variants with Word Embeddings

  • Conference paper
Computational Processing of the Portuguese Language (PROPOR 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9727)

Abstract

Brazilian Portuguese (BP) and European Portuguese (EP) each have their own NLP resources and tools for many tasks. It is generally agreed that applying them to the variant they were not built for causes a performance drop; however, very little research has measured it. We evaluated a POS tagger in a cross-variant setting under multiple combinations of word embeddings, training corpora and test corpora, and found that (i) BP is easier to tag than EP, (ii) word embeddings significantly increase tagger performance, but not enough to close the accuracy gap in a cross-variant setting, and (iii) embeddings generated from a corpus with both variants are useful in cross-variant scenarios. While we cannot generalize observations from POS tagging to every NLP task, this is an important first step for such evaluations.
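As a rough illustration of the experimental design described in the abstract, the sketch below enumerates a grid of settings and computes token-level accuracy. The list of embedding sources (`EMBEDDINGS`) and the function names are assumptions made for this example; the abstract only states that several combinations of embeddings, training corpora and test corpora were evaluated, so this should not be read as the authors' actual setup.

```python
from itertools import product

# Variants named in the abstract.
VARIANTS = ("BP", "EP")
# Hypothetical embedding sources; the exact set used in the paper is not given here.
EMBEDDINGS = ("none", "BP", "EP", "BP+EP")


def accuracy(gold_tags, predicted_tags):
    """Token-level accuracy over two flat, aligned lists of tags."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_tags:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


def evaluation_grid(variants=VARIANTS, embeddings=EMBEDDINGS):
    """Enumerate every (embedding source, train variant, test variant) setting."""
    return [
        {"embeddings": e, "train": tr, "test": te, "cross_variant": tr != te}
        for e, tr, te in product(embeddings, variants, variants)
    ]
```

A setting counts as cross-variant when the training and test variants differ, which is the case the abstract reports as hardest. In a full experiment, each setting would be passed to a tagger and scored with `accuracy`.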

Notes

  1. The texts explored here predate the Spelling Agreement of the Portuguese language.

  2. More information about the workshops is available at http://ttg.uni-saarland.de/lt4vardial2015/.

  3. Many of these OOV words are common in both variants, but since the Bosque corpus is very small, they only appear in one of the halves.

  4. Available at http://www.linguateca.pt/cetempublico/.

  5. The Bosque corpus is composed of sentences from CETENFolha and CETEMPúblico. We removed all those sentences to avoid any overlap with the labeled corpus.

  6. Remember that their BP and EP corpora are not the same as ours.
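Two of the notes above describe data-handling steps: measuring out-of-vocabulary (OOV) words between corpus halves, and removing labeled sentences from the unlabeled corpora. A minimal sketch of both follows. The function names, the lowercasing of tokens and the exact-match comparison of sentences are assumptions made for this example, not details given in the notes.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens whose lowercased form never occurs in training."""
    vocabulary = {token.lower() for token in train_tokens}
    if not test_tokens:
        return 0.0
    unseen = sum(token.lower() not in vocabulary for token in test_tokens)
    return unseen / len(test_tokens)


def remove_overlap(unlabeled_sentences, labeled_sentences):
    """Drop unlabeled sentences that also occur verbatim in the labeled corpus.

    Each sentence is a list of tokens; two sentences match when their
    token sequences are identical (an assumption of this sketch).
    """
    labeled = {tuple(sentence) for sentence in labeled_sentences}
    return [s for s in unlabeled_sentences if tuple(s) not in labeled]
```

With a small labeled corpus, `oov_rate` can be high even for words shared by both variants, which is the effect the third note points out.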

Acknowledgments

This work was funded by grant #2013/22973-0 of the São Paulo Research Funding Agency (FAPESP).

Author information

Correspondence to Erick Rocha Fonseca.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Fonseca, E.R., Aluísio, S.M. (2016). Improving POS Tagging Across Portuguese Variants with Word Embeddings. In: Silva, J., Ribeiro, R., Quaresma, P., Adami, A., Branco, A. (eds) Computational Processing of the Portuguese Language. PROPOR 2016. Lecture Notes in Computer Science, vol 9727. Springer, Cham. https://doi.org/10.1007/978-3-319-41552-9_22

  • DOI: https://doi.org/10.1007/978-3-319-41552-9_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41551-2

  • Online ISBN: 978-3-319-41552-9

  • eBook Packages: Computer Science, Computer Science (R0)