Abstract
Encapsulators are linguistic units which establish coherent referential connections to the preceding discourse in a text. In this paper, we address the challenge of automatically analysing the pronominal encapsulator ello in Spanish text. For each occurrence of the pronoun, our method identifies three things: the antecedent of the pronoun (including its grammatical type); the connective phrase which combines with the pronoun to express a discourse relation linking the antecedent text segment to the following text segment; and the type of semantic relation expressed by the complex discourse marker formed by the connective phrase and pronoun. We describe our annotation of a corpus to inform the development of our method and to fine-tune an automatic analyser based on Bidirectional Encoder Representations from Transformers (BERT). On testing our method, we find that it is more accurate than three baselines (achieving 0.76 on the resolution task), and that it sets a promising benchmark for the automatic annotation of occurrences of the pronoun ello, their antecedents, and the semantic relations between the two text segments linked by the connective in combination with the pronoun.
Notes
The complex discourse marker is not explicitly annotated. Only the component pronouns and connective phrases are annotated.
https://spacy.io/. Last accessed 4th July 2019.
In this paper, we consider gerund phrases to be noun phrases due to their distributional similarity to the latter.
Available at https://github.com/google-research/bert. Last accessed 3rd July 2019.
Available at https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip. Last accessed 26th May 2021. Further details on the derivation of BERT’s multilingual models are presented at https://github.com/google-research/bert/blob/master/multilingual.md. Last accessed 26th May 2021.
Associating each occurrence of ello with a context of 512 neighbouring tokens.
Tagging each sequence of 512 tokens independently of other sequences in the text.
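The two preceding notes describe two ways of fitting text into 512-token sequences. The following is a minimal sketch of both, not the implementation used in the paper. The function names are hypothetical, and centring the window on the pronoun is an assumption made for illustration, since the notes do not say how the 512 neighbouring tokens are positioned around ello.

```python
from typing import List, Tuple

MAX_LEN = 512  # sequence length stated in the notes


def context_window(tokens: List[str], target_idx: int,
                   size: int = MAX_LEN) -> Tuple[List[str], int]:
    """Return up to `size` tokens around tokens[target_idx].

    Also returns the position of the target inside the window.
    Centring on the target is an illustrative assumption.
    """
    if not 0 <= target_idx < len(tokens):
        raise IndexError("target_idx out of range")
    start = max(0, target_idx - size // 2)
    end = min(len(tokens), start + size)
    # If the window ran past the end of the text, shift it left.
    start = max(0, end - size)
    return tokens[start:end], target_idx - start


def fixed_chunks(tokens: List[str], size: int = MAX_LEN) -> List[List[str]]:
    """Split a text into consecutive sequences of at most `size` tokens,
    each of which can then be tagged independently of the others."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


if __name__ == "__main__":
    text = ["w%d" % i for i in range(1000)]
    text[700] = "ello"
    window, pos = context_window(text, 700)
    print(len(window), window[pos])
    print([len(chunk) for chunk in fixed_chunks(text)])
```

In the first approach every occurrence of ello gets its own window, so one stretch of text may appear in several windows. In the second, each token belongs to exactly one sequence, but an antecedent can be separated from its pronoun by a chunk boundary.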
In the literature, this additional layer is usually described as being situated “on top of” the BERT layer.
System to Automatically Classify and Resolve Ello.
We computed \(\kappa \) scores using the implementation in the scikit-learn machine learning library for Python.
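To show what such a \(\kappa \) computation involves, here is a minimal pure-Python sketch of Cohen's kappa (Cohen, 1960) for two annotators. It is illustrative only. The scores reported in the paper were computed with scikit-learn, the function name and the example labels below are hypothetical, and the sketch assumes that the statistic in question is the two-annotator form of kappa.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable],
                labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed proportion of items on which the two annotators
    agree; p_e is the agreement expected by chance, derived from each
    annotator's own label distribution.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Hypothetical relation labels assigned by two annotators.
    annotator_1 = ["causal", "causal", "counter", "counter"]
    annotator_2 = ["causal", "causal", "counter", "causal"]
    print(cohen_kappa(annotator_1, annotator_2))
```

In scikit-learn, the corresponding function is `sklearn.metrics.cohen_kappa_score`, which takes the two label sequences as its first two arguments.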
According to the scale proposed by Viera and Garrett (2005).
By contrast, token T2 in column Pred. class label (Method 3) is not of this type because it is two tokens away from the true start of the antecedent.
Available from http://cs.famaf.unc.edu.ar/~ccardellino/SBWCE/SBW-vectors-300-min5.bin.gz. Last accessed 22nd August 2019. These word embeddings were derived from the Spanish Billion Word corpus, available from http://crscardellino.github.io/SBWCE/. Last accessed 22nd August 2019.
Adjusted from \(\alpha =0.05\) for comparisons between two systems.
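The note does not say which adjustment was applied. One common choice when several systems are compared pairwise is the Bonferroni correction, which divides the significance level by the number of comparisons. The sketch below assumes that correction purely for illustration. The function names and the number of comparisons are hypothetical and are not taken from the paper.

```python
def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance level under a Bonferroni correction."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons


def is_significant(p_value: float, alpha: float, n_comparisons: int) -> bool:
    """True if p_value falls below the Bonferroni-adjusted threshold."""
    return p_value < bonferroni_alpha(alpha, n_comparisons)


if __name__ == "__main__":
    # With a single comparison between two systems the threshold stays 0.05.
    print(bonferroni_alpha(0.05, 1))
    # A hypothetical case with five pairwise comparisons.
    print(bonferroni_alpha(0.05, 5))
```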
References
Ariel, M. (1988). Referring and accessibility. Journal of Linguistics, 24, 65–87.
Ariel, M. (1991). The function of accessibility in a theory of grammar. Journal of Pragmatics, 16, 443–463.
Ariel, M. (1999). Cognitive universals and linguistic conventions. The case of resumptive pronouns. Studies in Language, 23, 217–269.
Baum, L. E., & Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6), 1554–1563. https://doi.org/10.1214/aoms/1177699147
Bello, A. (1911). Gramática de la Lengua Castellana. Roger & Chernovitz Editores.
Benveniste, E. (1980). Problemas de Lingüística general. Tomo I. Siglo XXI Editores.
Borreguero, M. (2006). Naturaleza y función de los encapsuladores en los textos informativamente densos (la noticia periodística). Cuadernos de Filología Italiana, 13, 73–95.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
Cornish, F. (1999). Anaphora, discourse, and understanding. Oxford University Press.
Cornish, F. (2008). How indexicals function in texts: Discourse, text, and one neo-Gricean account of indexical reference. Journal of Pragmatics, 40(6), 997.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota. Retrieved from https://www.aclweb.org/anthology/N19-1423
Fernández, O. (1999). El pronombre personal. Formas y distribuciones. Pronombres átonos y tónicos. In I. Bosque & V. C. Demonte (Eds.), Gramática Descriptiva de la Lengua Española (pp. 1209–1273). Espasa Calpe.
Figueras, C. (2002). La jerarquía de accesibilidad de las expresiones referenciales en español. Revista Española de Lingüística, 32, 53–96.
Francis, N. (1986). Anaphoric nouns. University of Birmingham.
Gómez-Rodríguez, C., & Vilares, D. (2018). Constituent parsing as sequence labeling. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, (pp. 1314–1324). Association for Computational Linguistics. Retrieved from http://aclweb.org/anthology/D18-1162
González-Ruiz, R. (2009). Algunas notas en torno a un mecanismo de cohesión textual: La anáfora conceptual. In Estudios sobre el Texto: Nuevos Enfoques y Propuestas (pp. 247–278). Peter Lang.
Halliday, M., & Hasan, R. (1976). Cohesion in English. Longman.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
Honnibal, M., & Johnson, M. (2015). An improved non-monotonic transition system for dependency parsing. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, (pp. 1373–1378). Association for Computational Linguistics, Lisbon, Portugal. Retrieved from https://aclweb.org/anthology/D/D15/D15-1162
Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106(4), 620–630. https://doi.org/10.1103/PhysRev.106.620
Kennison, S. (2003). Comprehending the pronouns her, him, and his: Implications for theories of referential processing. Journal of Memory and Language, 49(3), 335–352.
Kennison, S., & Trofe, J. (2003). Comprehending pronouns: A role for word-specific gender stereotype information. Journal of Psycholinguistic Research, 32(3), 355–378.
Kintsch, W. (1998). Comprehension: A paradigm for cognition. Academic Press.
Kudo, T. (2005). CRF++: Yet another CRF toolkit. Retrieved from http://crfpp.sourceforge.net
Lafferty, J., McCallum, A., & Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning, (pp. 282–289). Morgan Kaufmann.
Lakshmi, S., Ram, R. V. S., & Sobha, L. D. (2012). Clause Boundary Identification for Malayalam Using CRF. In: Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages (MTPIL-2012), (pp. 83–92). Association for Computational Linguistics, Mumbai, India.
López Samaniego, A. (2011). La categorización de entidades del discurso en la escritura profesional. PhD thesis, Universitat de Barcelona, Barcelona, España.
McCallum, A., & Li, W. (2003). Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In: Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4, (pp. 188–191). Association for Computational Linguistics.
Mitkov, R., Evans, R., Orasan, C., Barbu, C., Jones, L., & Sotirova, V. (2000). Coreference and anaphora: Developing annotating tools, annotated resources and annotation strategies. In: Proceedings of DAARC-2000, UK, (pp. 49–58).
Montolío, E. (2013). Construcciones conectivas que encapsulan. [A pesar de + SN] y la escritura experta. Cuadernos AISPI, 2, 115–132.
Montolío, E. (2014). Mecanismos de cohesión (II). Los conectores. In: E.M. (Dir) (ed.) Manual de Escritura Académica y Profesional, (pp. 9–92). Ariel, Barcelona.
Orăsan, C. (2003). PALinkA: A highly customizable tool for discourse annotation. In: Proceedings of the 4th SIGdial Workshop on Discourse and Dialog, (pp. 39–43). Sapporo, Japan. Retrieved from http://clg.wlv.ac.uk/papers/palinka-final.pdf.
Parodi, G. (2014). Comprensión de Textos Escritos. Teoría de la Comunicabilidad. Eudeba.
Parodi, G., & Burdiles, G. (2016). Encapsulación y tipos de coherencia referencial y relacional: el pronombre “ello” como mecanismo encapsulador en el discurso escrito de la economía. Onomázein, 33(1), 107–129.
Parodi, G., & Burdiles, G. (2019). Los pronombres neutros ‘esto’, ‘eso’ y ‘aquello’ como mecanismos encapsuladores: coherencia referencial y relacional. Spanish in Context, 16(1), 104–127.
Parodi, G., Julio, C., Nadal, L., Burdiles, G., & Cruz, A. (2018). Always look back: Eye movements as a reflection of anaphoric encapsulation in Spanish while reading the neuter pronoun ello. Journal of Pragmatics, 132, 47–58.
Parodi, G., Julio, C., Nadal, L., Cruz, A., & Burdiles, G. (2019). Stepping back to look ahead: Neuter encapsulation and referent extension in counter-argumentative and causal semantic relations in Spanish. Language and Cognition, 11(3), 431–454.
Portolés, J. (2004). Pragmática para Hispanistas. Longman.
Prandi, M. (2004). The building blocks of meaning. Benjamins.
RAE. (2005). Diccionario Panhispánico de Dudas. Santillana, Bogotá.
RAE & ASALE. (2010). Nueva Gramática de la Lengua Española. Manual [New grammar of Spanish language. Handbook]. Espasa, Buenos Aires.
Recasens, M., Hu, Z., & Rhinehart, O. (2016). Sense anaphoric pronouns: Am I one? In: Proceedings of the Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2016), pp. 1–6. Association for Computational Linguistics, San Diego, California. Retrieved from https://doi.org/10.18653/v1/W16-0701. https://www.aclweb.org/anthology/W16-0701
Rohanian, O., Taslimipoor, S., Yaneva, V., & Ha, L. A. (2017). Using gaze data to predict multiword expressions. In: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, (pp. 601–609). INCOMA Ltd., Varna, Bulgaria. Retrieved from https://doi.org/10.26615/978-954-452-049-6_078.
Sanders, T., Spooren, W., & Noordman, L. (1992). Toward a taxonomy of coherence relations. Discourse Processes, 15, 1–35. https://doi.org/10.1080/01638539209544800
Sanders, T., Spooren, W., & Noordman, L. (1993). Coherence relations in a cognitive theory of discourse representation. Cognitive Linguistics (pp. 93–134). Retrieved from https://doi.org/10.1515/cogl.1993.4.2.93.
Schmid, H. (2000). English abstract nouns as conceptual shells: From corpus to cognition. de Gruyter.
Sha, F., & Pereira, F. (2003). Shallow parsing with conditional random fields. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, (pp. 134–141). Association for Computational Linguistics.
Shimbo, M., & Hara, K. (2007). A discriminative learning model for coordinate conjunctions. In: Proceedings of the 2007 Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, (pp. 610–619). Prague.
Sinclair, J. (1993). Written discourse structure. In J. Sinclair, M. Hoey, & G. Fox (Eds.), Techniques of description. Spoken and written discourse (pp. 6–31). Routledge.
Sinclair, J. (1994). Trust the text. In M. Coulthard (Ed.), Advances in written text analysis (pp. 6–31). Routledge.
Sutton, C., & McCallum, A. (2011). An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4), 268–373.
Tadros, A. (1994). Predictive categories in expository texts. In M. Coulthard (Ed.), Advances in written text analysis (pp. 69–82). Routledge.
Taslimipoor, S., Desantis, A., Cherchi, M., Mitkov, R., & Monti, J. (2016). Language resources for Italian: Towards the development of a corpus of annotated Italian multiword expressions. In: P. Basile, A. Corazza, F. Cutugno, S. Montemagni, M. Nissim, V. Patti, G. Semeraro, R. Sprugnoli (eds.) Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), CEUR Workshop Proceedings, vol. 1749. Napoli, Italy.
Taslimipoor, S., & Mitkov, R. (2016). Computational phraseology light: Automatic translation of multiword expressions without translation resources. Yearbook of Phraseology, 7(1), 149–166.
van Dijk, T. A., & Kintsch, W. (1983). Strategies of discourse comprehension. Academic Press.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In: I. Guyon, U.V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (eds.) Advances in Neural Information Processing Systems 30, (pp. 5998–6008). Curran Associates, Inc. Retrieved from http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
Viera, A. J., & Garrett, J. M. (2005). Understanding interobserver agreement: The kappa statistic. Family Medicine, 37(5), 360–363.
Wang, X., Bruno, J., Molloy, H., Evanini, K., & Zechner, K. (2017). Discourse Annotation of Non-native Spontaneous Spoken Responses Using the Rhetorical Structure Theory Framework. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), (pp. 263–268). Association for Computational Linguistics, Vancouver, Canada.
Zulaica, I., & Gutiérrez, J. (2009). Hacia una semántica computacional de las anáforas demostrativas. Linguamática, 1(2), 81–90.
Cite this article
Parodi, G., Evans, R., Ha, L.A. et al. A sequence labelling approach for automatic analysis of ello: tagging pronouns, antecedents, and connective phrases. Lang Resources & Evaluation 56, 139–164 (2022). https://doi.org/10.1007/s10579-021-09559-z
DOI: https://doi.org/10.1007/s10579-021-09559-z