Abstract
The advent of the computer era, which enabled the development of large text corpora and of sophisticated corpus processing tools, led to unprecedented advances in the area of collocational analysis. These advances were paralleled by significant achievements in the area of syntactic analysis, with parsing technologies becoming available for an increasing number of languages. But more often than not, these developments have taken place independently. The coupling of collocational and syntactic analyses has seldom been considered, despite the fact that one type of analysis could benefit the other. In this chapter, we focus on the integration of syntactic parsing and collocational analysis. First, we review the literature describing syntactically-informed approaches to collocation extraction. Second, we survey the work devoted to exploiting collocational resources for syntactic parsing. Finally, we refer to more recent work that proposes a joint approach to collocational and syntactic analysis, arguing that the two analyses are interdependent to such a degree that only a simultaneous process, one in which structure decoding and pattern identification go hand in hand, can provide a solid bridge between them.
Notes
1. The same metaphor is used by Uhrig and Proisl (2012).
2. Although there is no generally accepted list of collocation patterns (as it is widely accepted that the parameters of a collocation extraction procedure may vary according to the intended use of results), most authors agree that typical collocation patterns include the ones enumerated in Hausmann’s definition (1989, 1010): ‘We shall call collocation a characteristic combination of two words in a structure like the following: (a) noun + adjective (epithet); (b) noun + verb; (c) verb + noun (object); (d) verb + adverb; (e) adjective + adverb; (f) noun + (prep) + noun’.
3. This decision is also motivated by statistical considerations, as most statistical methods are unreliable for low-frequency data. However, it is contested by the lexicographic community, because a significant part of lexicographically interesting candidates occurs only once or twice in a corpus (Piao et al., 2005, 379).
4. https://github.com/seretan/collocation-extraction-toy (accessed 1 February 2018).
5. A French verb, for instance, may have as many as 48 forms (Tzoukermann and Radev, 1996).
6. The same approach has been used by Uhrig and Proisl (2012), among others.
7. The extraction system was named FipsCo, as it relies on the output of the Fips parser (Laenzlinger and Wehrli, 1991; Wehrli, 1997; Wehrli and Nerima, 2015). It is available online at http://latlapps.unige.ch (accessed 1 February 2018).
8. PARSEME (2013–2017) was a European COST Action focusing on the link between complex lexical items and a comprehensive linguistic analysis of text. With more than 200 members from 33 countries, the Action fostered research on the integration of complex lexical items in parsing and translation.
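The pattern filter of note 2 and the frequency cut-off of note 3 can be illustrated with a minimal sketch. It assumes that candidate word pairs have already been lemmatised and POS-tagged; the tag names, the `PATTERNS` table, the function `extract_candidates`, and the sample data are all hypothetical, and the code is not taken from the chapter or from the FipsCo system.

```python
from collections import Counter

# Hypothetical mapping of Hausmann's patterns (note 2) to coarse POS-tag pairs.
# Tag names and element order are assumptions made for this illustration.
PATTERNS = {
    ("ADJ", "NOUN"),   # (a) noun + adjective (epithet)
    ("NOUN", "VERB"),  # (b) noun + verb
    ("VERB", "NOUN"),  # (c) verb + noun (object)
    ("VERB", "ADV"),   # (d) verb + adverb
    ("ADV", "ADJ"),    # (e) adjective + adverb
    ("NOUN", "NOUN"),  # (f) noun + (prep) + noun, preposition left out
}

def extract_candidates(pairs, min_freq=2):
    """Count pairs whose POS pattern is allowed; keep those seen at least min_freq times."""
    counts = Counter(
        (w1, w2) for (w1, p1), (w2, p2) in pairs if (p1, p2) in PATTERNS
    )
    return {pair: n for pair, n in counts.items() if n >= min_freq}

# Invented sample: each item is ((lemma, tag), (lemma, tag)).
SAMPLE_PAIRS = [
    (("heavy", "ADJ"), ("rain", "NOUN")),
    (("heavy", "ADJ"), ("rain", "NOUN")),
    (("the", "DET"), ("rain", "NOUN")),       # rejected: pattern not allowed
    (("take", "VERB"), ("decision", "NOUN")), # allowed pattern, but seen only once
]

print(extract_candidates(SAMPLE_PAIRS, min_freq=2))
print(extract_candidates(SAMPLE_PAIRS, min_freq=1))
```

With `min_freq=2` the pair seen only once is discarded, which is the loss of low-frequency candidates that note 3 reports as contested by lexicographers; lowering the threshold to 1 keeps it.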
References
Breidt, E. (1993). Extraction of V-N-collocations from text corpora: A feasibility study for German. In Proceedings of the Workshop on Very Large Corpora: Academic and Industrial Perspectives, Columbus (pp. 74–83).
Brun, C. (1998). Terminology finite-state preprocessing for computational LFG. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Morristown (pp. 196–200).
Charest, S., Brunelle, E., Fontaine, J., & Pelletier, B. (2007). Élaboration automatique d’un dictionnaire de cooccurrences grand public. In Actes de la 14e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2007), Toulouse (pp. 283–292).
Choueka, Y. (1988). Looking for needles in a haystack, or locating interesting collocational expressions in large textual databases. In Proceedings of the International Conference on User-Oriented Content-Based Text and Image Handling, Cambridge, MA (pp. 609–623).
Church, K., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29.
Constant, M., & Sigogne, A. (2011). MWU-aware part-of-speech tagging with a CRF model and lexical resources. In Proceedings of the Workshop on Multiword Expressions: From Parsing and Generation to the Real World (pp. 49–56). Portland: Association for Computational Linguistics.
Constant, M., Sigogne, A., & Watrin, P. (2012). Discriminative strategies to integrate multiword expression recognition and parsing. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 204–212). Jeju Island: Association for Computational Linguistics.
Daille, B. (1994). Approche mixte pour l’extraction automatique de terminologie: Statistiques lexicales et filtres linguistiques. Ph.D. thesis, Université Paris 7.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.
Evert, S. (2004). The statistics of word cooccurrences: Word pairs and collocations. Ph.D. thesis, University of Stuttgart.
Evert, S., & Krenn, B. (2005). Using small random samples for the manual evaluation of statistical association measures. Computer Speech & Language, 19(4), 450–466.
Gross, M. (1984). Lexicon-grammar and the syntactic analysis of French. In Proceedings of the 10th International Conference on Computational Linguistics and 22nd Annual Meeting of the Association for Computational Linguistics, Morristown (pp. 275–282).
Hausmann, F. J. (1989). Le dictionnaire de collocations. In F. Hausmann, O. Reichmann, H. Wiegand, & L. Zgusta (Eds.), Wörterbücher: Ein internationales Handbuch zur Lexikographie. Dictionaries, Dictionnaires (pp. 1010–1019). Berlin: de Gruyter.
Hornby, A. S., Cowie, A. P., & Lewis, J. W. (1948). Oxford advanced learner’s dictionary of current English. London: Oxford University Press.
Huang, C. R., Kilgarriff, A., Wu, Y., Chiu, C. M., Smith, S., Rychly, P., Bai, M. H., & Chen, K. J. (2005). Chinese sketch engine and the extraction of grammatical collocations. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, Jeju Island (pp. 48–55).
Kilgarriff, A., Rychly, P., Smrz, P., & Tugwell, D. (2004). The sketch engine. In Proceedings of the Eleventh EURALEX International Congress, Lorient (pp. 105–116).
Korkontzelos, I., & Manandhar, S. (2010). Can recognising multiword expressions improve shallow parsing? In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 636–644). Los Angeles: Association for Computational Linguistics.
Krenn, B. (2000). Collocation mining: Exploiting corpora for collocation identification and representation. In Proceedings of the KONVENS 2000, Ilmenau (pp. 209–214).
Laenzlinger, C., & Wehrli, E. (1991). Fips, un analyseur interactif pour le français. TA Informations, 32(2), 35–49.
Lafon, P. (1984). Dépouillements et statistiques en lexicométrie. Genève/Paris: Slatkine – Champion.
Lea, D., & Runcie, M. (Eds.). (2002). Oxford collocations dictionary for students of English. Oxford: Oxford University Press.
Lin, D. (1998). Extracting collocations from text corpora. In Proceedings of the First Workshop on Computational Terminology, Montreal (pp. 57–63).
Lin, D. (1999). Automatic identification of non-compositional phrases. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, Morristown (pp. 317–324).
Lü, Y., & Zhou, M. (2004). Collocation translation acquisition using monolingual corpora. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Barcelona (pp. 167–174).
Mel’čuk, I. (2003). Collocations: Définition, rôle et utilité. In F. Grossmann, & A. Tutin (Eds.), Les collocations: Analyse et traitement (pp. 23–32). Amsterdam: Editions De Werelt.
Monti, J., Seretan, V., Pastor, G. C., & Mitkov, R. (2018). Multiword units in machine translation and translation technology. In R. Mitkov, J. Monti, G. C. Pastor, & V. Seretan (Eds.), Multiword units in machine translation and translation technology (Current issues in linguistic theory, Vol. 341). Amsterdam/Philadelphia: John Benjamins.
Nivre, J. (2006). Inductive dependency parsing (Text, speech and language technology). Secaucus: Springer.
Nivre, J., & Nilsson, J. (2004). Multiword units in syntactic parsing. In MEMURA 2004 – Methodologies and Evaluation of Multiword Units in Real-World Applications (LREC Workshop) (pp. 39–46).
Nugues, P. M. (2014). Corpus processing tools (pp. 23–64). Berlin/Heidelberg: Springer.
Orliac, B., & Dillinger, M. (2003). Collocation extraction for machine translation. In Proceedings of Machine Translation Summit IX, New Orleans (pp. 292–298).
Pearce, D. (2001). Synonymy in collocation extraction. In Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, Pittsburgh (pp. 41–46).
Pearce, D. (2002). A comparative evaluation of collocation extraction techniques. In Third International Conference on Language Resources and Evaluation, Las Palmas (pp. 1530–1536).
Pecina, P. (2005). An extensive empirical study of collocation extraction methods. In Proceedings of the ACL Student Research Workshop, Ann Arbor (pp. 13–18).
Pecina, P. (2008). Lexical association measures: Collocation extraction. Ph.D. thesis, Charles University.
Piao, S. S., Rayson, P., Archer, D., & McEnery, T. (2005). Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language Special Issue on Multiword Expressions, 19(4), 378–397.
Rani, A., Mehla, K., & Jangra, A. (2015). Parsers and parsing approaches: Classification and state of the art. In Proceedings of the 2015 International Conference on Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE), New Delhi (pp. 34–38).
Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2002), Mexico City (pp. 1–15).
Seretan, V. (2008). Collocation extraction based on syntactic parsing. Ph.D. thesis, University of Geneva.
Seretan, V. (2011). Syntax-based collocation extraction (Text, speech and language technology, Vol. 44). Dordrecht: Springer.
Seretan, V., & Wehrli, E. (2006). Accurate collocation extraction using a multilingual parser. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney (pp. 953–960).
Seretan, V., Nerima, L., & Wehrli, E. (2003). Extraction of multi-word collocations using syntactic bigram composition. In Proceedings of the Fourth International Conference on Recent Advances in NLP (RANLP-2003), Borovets (pp. 424–431).
Shimohata, S., Sugio, T., & Nagata, J. (1997). Retrieving collocations by co-occurrences and word order constraints. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Madrid (pp. 476–481).
Sinclair, J. (1995). Collins COBUILD English dictionary. London: Harper Collins.
Smadja, F. (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1), 143–177.
Tzoukermann, E., & Radev, D. R. (1996). Using word class for part-of-speech disambiguation. In Proceedings of the Fourth Workshop on Very Large Corpora, Copenhagen (pp. 1–13).
Uhrig, P., & Proisl, T. (2012). Less hay, more needles – using dependency-annotated corpora to provide lexicographers with more accurate lists of collocation candidates. Lexicographica, 28(1), 141–180.
Villada Moirón, M. B. (2005). Data-driven identification of fixed expressions and their modifiability. Ph.D. thesis, University of Groningen.
Villavicencio, A., Kordoni, V., Zhang, Y., Idiart, M., & Ramisch, C. (2007). Validation and evaluation of automatically acquired multiword expressions for grammar engineering. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague (pp. 1034–1043).
Wehrli, E. (1997). L’analyse syntaxique des langues naturelles: Problèmes et méthodes. Paris: Masson.
Wehrli, E., & Nerima, L. (2015). The Fips multilingual parser. In N. Gala, R. Rapp, & G. Bel-Enguix (Eds.), Language production, cognition, and the lexicon (Text, speech and language technology, Vol. 48, pp. 473–489). Cham: Springer.
Wehrli, E., Seretan, V., & Nerima, L. (2010). Sentence analysis and collocation identification. In Proceedings of the Workshop on Multiword Expressions: From Theory to Applications (MWE 2010), Beijing (pp. 27–35).
Wehrli, E., Seretan, V., & Nerima, L. (to appear). Verbal collocations and pronominalization. In G. C. Pastor & U. Heid (Eds.), Current trends in computational phraseology, research in linguistics and literature. Amsterdam/Philadelphia: John Benjamins.
Wu, H., & Zhou, M. (2003). Synonymous collocation extraction using translation information. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2003), Sapporo (pp. 120–127).
Zhang, Y., & Kordoni, V. (2006). Automated deep lexical acquisition for robust open texts processing. In Proceedings of LREC-2006, Genoa (pp. 275–280).
Acknowledgements
I am grateful to the anonymous reviewers, whose comments and suggestions allowed me to improve the chapter.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Seretan, V. (2018). Bridging Collocational and Syntactic Analysis. In: Cantos-Gómez, P., Almela-Sánchez, M. (eds) Lexical Collocation Analysis. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-92582-0_2
DOI: https://doi.org/10.1007/978-3-319-92582-0_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92581-3
Online ISBN: 978-3-319-92582-0
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)