Towards a Statistical-Enriched Corpus Containing Portuguese Collocations in Use: Reviewing Possible Extraction Tools

Costa, Ângela; Coheur, Luísa

doi:10.1007/978-3-319-41552-9_32

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9727)

Included in the following conference series:

International Conference on Computational Processing of the Portuguese Language

Abstract

Collocations pose a major problem for natural language processing tasks, from machine translation to summarization. With the goal of building a corpus of collocations enriched with statistical information about them, we survey in this paper four tools for extracting collocations. These tools allow us to collect sentences containing collocations and to gather statistics on this particular type of co-occurrence, such as Mutual Information and log-likelihood values.

Notes

1. http://corpora.informatik.uni-leipzig.de.
2. https://gramtrans.com/deepdict/.
3. http://www.clul.ul.pt/pt/recursos/.
4. https://www.sketchengine.co.uk.
5. http://www.linguateca.pt/COMPARA/listas_freq.php.
6. http://beta.visl.sdu.dk/constraint_grammar.html.
7. http://linguateca.dei.uc.pt/Floresta/InicialFloresta.html.
8. http://www.linguateca.pt/floresta/principal.html.
9. This measure is based only on the frequencies of the words w1 and w2 and of the bigram w1 w2; it is not affected by the size of the corpus.
10. If the clustering option is selected, the collocates within a word sketch are grouped according to the clusters of the distributional thesaurus in which they appear. The words in the thesaurus are clustered according to their distributional similarity scores.
11. Salience is a statistical measure of how salient a word or lemma is in a given context, given the frequency of the word and of the context. It is measured with logDice.
12. [7] suggests that MI is generally used for lexicographical purposes, while MI3 is probably more useful for second language learning.
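The association measures named in these notes can be made concrete with a small sketch. It uses the commonly cited textbook definitions of MI, MI3 and logDice; the chapter itself is not quoted here, so the exact formulas the authors or the surveyed tools apply are an assumption, as are the function and parameter names (`f_xy` for the bigram frequency, `f_x` and `f_y` for the word frequencies, `n` for the corpus size). Note that logDice takes no corpus size at all, which is the kind of size independence described in note 9; whether note 9 refers to logDice specifically is also an assumption.

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """MI = log2(f_xy * N / (f_x * f_y)); the value depends on corpus size N."""
    return math.log2(f_xy * n / (f_x * f_y))


def mi3(f_xy, f_x, f_y, n):
    """MI3 cubes the bigram frequency, which reduces MI's bias towards rare pairs."""
    return math.log2(f_xy ** 3 * n / (f_x * f_y))


def log_dice(f_xy, f_x, f_y):
    """logDice = 14 + log2(2 * f_xy / (f_x + f_y)); no corpus size involved."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


if __name__ == "__main__":
    # Hypothetical counts, chosen only for illustration.
    f_xy, f_x, f_y, n = 10, 100, 50, 10_000
    print("MI      =", round(mutual_information(f_xy, f_x, f_y, n), 3))
    print("MI3     =", round(mi3(f_xy, f_x, f_y, n), 3))
    print("logDice =", round(log_dice(f_xy, f_x, f_y), 3))
```

Doubling `n` while keeping the three frequencies fixed raises MI and MI3 by exactly one bit, whereas logDice is unchanged, so logDice scores can be compared across corpora of different sizes.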

References

1. Anagnostou, N.K., Weir, G.R.S.: Review of software applications for deriving collocations. In: ICT in the Analysis, Teaching and Learning of Languages, Preprints of the ICTATLL Workshop 2006, Glasgow, pp. 91–100 (2006)
2. van den Bosch, A., Daelemans, W.: Memory-based morphological analysis. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, ACL 1999, pp. 285–292. Association for Computational Linguistics, Stroudsburg (1999). http://dx.doi.org/10.3115/1034678.1034726
3. Branco, A., Silva, J.: Evaluating solutions for the rapid development of state-of-the-art POS taggers for Portuguese. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC-2004). European Language Resources Association (ELRA), Lisbon (2004). http://www.lrec-conf.org/proceedings/lrec2004/pdf/572.pdf, ACL Anthology Identifier: L04-1354
4. Correia, J.M.P.: Syntax Deep Explorer. Ph.D. thesis. Instituto Superior Técnico (2015)
5. Daelemans, W., Zavrel, J., Berck, P., Gillis, S.: MBT: a memory-based part of speech tagger-generator. In: Proceedings of Fourth Workshop on Very Large Corpora, pp. 14–27. ACL SIGDAT (1996)
6. Kilgarriff, A., Rychly, P., Smrz, P., Tugwell, D.: The sketch engine. In: Proceedings of EURALEX (2004)
7. McEnery, T., Xiao, R., Tono, Y.: Corpus-Based Language Studies: An Advanced Resource Book. Taylor & Francis (2006)
8. Mendes, A., Antunes, S., do Nascimento, M.F.B., Miguel, J., Casteleiro, L.P., Sá, T.: Combina-pt: a large corpus-extracted and hand-checked lexical database of Portuguese multiword expressions. In: Proceedings of LREC, pp. 1900–1905 (2006)
9. Santos, D., Rocha, P.: Evaluating CETEMPúblico, a free resource for Portuguese. In: Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, pp. 450–457. Association for Computational Linguistics (2001)
10. Tutin, A., Grossmann, F.: Collocations régulières et irrégulières: esquisse de typologie du phénomène collocatif. Revue française de linguistique appliquée 7(1), 7–25 (2002)

Acknowledgments

This work was partially supported by national funds through FCT - Fundação para a Ciência e a Tecnologia, reference UID/CEC/50021/2013. Ângela Costa is supported by a PhD fellowship from FCT (SFRH/BD/85737/2012).

Author information

Authors and Affiliations

INESC-ID Lisboa, Lisbon, Portugal
Ângela Costa & Luísa Coheur
Instituto Superior Técnico, Lisbon, Portugal
Luísa Coheur
Centro de Linguística da Universidade Nova de Lisboa, Lisbon, Portugal
Ângela Costa

Corresponding author

Correspondence to Ângela Costa.

Editor information

Editors and Affiliations

Universidade de Lisboa, Portugal
João Silva
ISCTE-IUL, Lisbon, Portugal
Ricardo Ribeiro
Universidade de Évora, Évora, Portugal
Paulo Quaresma
Universidade de Caxias do Sul, Caxias do Sul, Brazil
André Adami
Universidade de Lisboa, Lisboa, Portugal
António Branco

About this paper

Cite this paper

Costa, Â., Coheur, L. (2016). Towards a Statistical-Enriched Corpus Containing Portuguese Collocations in Use: Reviewing Possible Extraction Tools. In: Silva, J., Ribeiro, R., Quaresma, P., Adami, A., Branco, A. (eds) Computational Processing of the Portuguese Language. PROPOR 2016. Lecture Notes in Computer Science, vol 9727. Springer, Cham. https://doi.org/10.1007/978-3-319-41552-9_32

DOI: https://doi.org/10.1007/978-3-319-41552-9_32
Published: 21 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41551-2
Online ISBN: 978-3-319-41552-9
eBook Packages: Computer Science, Computer Science (R0)