Abstract
The goal of the Semantic Textual Similarity task is to automatically quantify the semantic similarity of two text snippets. Since 2012, the task has been organized on a yearly basis as a part of the SemEval evaluation campaign. This paper presents a method that aims to combine different sentence-based vector representations in order to improve the computation of semantic similarity values. Our hypothesis is that such a combination of different representations allows us to pinpoint different semantic aspects, which improves the accuracy of similarity computations. The method’s main difficulty lies in the selection of the most complementary representations, for which we present an optimization method. Our final system is based on the winning system of the 2015 evaluation campaign, augmented with the complementary vector representations selected by our optimization method. We also present evaluation results on the dataset of the 2016 campaign, which confirm the benefit of our method.
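The core idea of the abstract — scoring a sentence pair under several vector representations and combining the scores — can be sketched as follows. This is a minimal illustration, not the authors' system: the representation vectors are toy values, and simple averaging of cosine similarities stands in for the paper's learned combination.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def combined_similarity(reps_a, reps_b):
    """Average the cosine similarities obtained from several
    sentence-level vector representations of the same two snippets.
    reps_a[i] and reps_b[i] must come from the same representation."""
    return float(np.mean([cosine(a, b) for a, b in zip(reps_a, reps_b)]))

# Two hypothetical representations of the same sentence pair
# (e.g. one word2vec-based, one GloVe-based), with random toy values.
rng = np.random.default_rng(0)
reps_s1 = [rng.normal(size=50), rng.normal(size=100)]
reps_s2 = [rng.normal(size=50), rng.normal(size=100)]
score = combined_similarity(reps_s1, reps_s2)
```

In the paper, the combination is not a plain average: which representations enter the mix is itself the output of the optimization method described above.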
Notes
- 1. Gensim is a tool integrating different methods from distributional semantics, including methods to perform similarity measurements. https://radimrehurek.com/gensim/.
- 2.
- 3. We define a model as a set of parameter assignments for the corpus preprocessing (e.g. lemmatization, stopword removal) and vector building (e.g. dimension size, window size) within the corresponding intervals (mentioned in Sect. 3.2), which subsequently produce a set of vector representations for each text snippet in the corpus.
- 4. Scikit-learn is a tool written in Python integrating several machine learning methods. http://scikit-learn.org/ (Pedregosa et al. 2011).
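The notion of a model in note 3 — one full assignment of preprocessing and vector-building parameters — amounts to enumerating a parameter grid. The sketch below illustrates this with hypothetical parameter names and intervals; the authors' actual settings are those of Sect. 3.2, not these.

```python
from itertools import product

# Hypothetical parameter intervals for preprocessing and vector building
# (names and values are illustrative, not the authors' exact settings).
PARAM_GRID = {
    "lemmatize": [True, False],
    "remove_stopwords": [True, False],
    "dimension": [100, 300],
    "window": [5, 10],
}

def enumerate_models(grid):
    """Yield every model, i.e. every full assignment of the parameters."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

models = list(enumerate_models(PARAM_GRID))  # 2 * 2 * 2 * 2 = 16 models
```

Each such model, once trained, yields one vector representation per text snippet; the optimization method then searches this space for a complementary subset.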
References
Afzal, N., Wang, Y., & Liu, H. (2016). MayoNLP at SemEval-2016 task 1: Semantic textual similarity based on lexical semantic net and deep learning semantic model. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016) (pp. 674–679), San Diego, California: Association for Computational Linguistics.
Agirre, E., Banea, C., Cardie, C., Cer, D., Diab, M., Gonzalez-Agirre, A., et al. (2015). SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015) (pp. 252–263). Association for Computational Linguistics.
Agirre, E., Cer, D., Diab, M., & Gonzalez-Agirre, A. (2012). SemEval-2012 task 6: A pilot on semantic textual similarity. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics–Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012) (pp. 385–393). Montréal, Canada: Association for Computational Linguistics.
Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (pp. 238–247). Association for Computational Linguistics.
Baroni, M., & Lenci, A. (2010). Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4), 673–721.
Baroni, M., & Zamparelli, R. (2010). Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (pp. 1183–1193). Cambridge, MA: Association for Computational Linguistics.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., & Specia, L. (2017). SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017) (pp. 1–14). Association for Computational Linguistics.
Curran, J. R. (2004). From distributional to semantic similarity. Ph.D. thesis, University of Edinburgh, UK.
Dolan, W. B., & Brockett, C. (2005). Automatically constructing a corpus of sentential paraphrases. In Third International Workshop on Paraphrasing (IWP2005). Asian Federation of Natural Language Processing.
Goldberg, D. E., & Holland, J. H. (1988). Genetic algorithms and machine learning. Machine Learning, 3(2), 95–99.
Harris, Z. (1954). Distributional structure. Word, 10(2–3), 146–162.
Huang, P.-S., He, X., Gao, J., Deng, L., Acero, A., & Heck, L. (2013). Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM 2013 (pp. 2333–2338), New York, NY, USA: ACM.
Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21–26 June 2014 (pp. 1188–1196).
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013a). Efficient estimation of word representations in vector space. In Proceedings of the Workshop at ICLR.
Mikolov, T., Yih, W., & Zweig, G. (2013b). Linguistic regularities in continuous space word representations. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9–14, 2013 (pp. 746–751). Atlanta, Georgia, USA: Westin Peachtree Plaza Hotel.
Mitchell, J., & Lapata, M. (2008). Vector-based models of semantic composition. In ACL 2008, Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, June 15–20, 2008, Columbus, Ohio, USA (pp. 236–244).
Pagliardini, M., Gupta, P., & Jaggi, M. (2017). Unsupervised learning of sentence embeddings using compositional n-gram features. CoRR. abs/1703.02507.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25–29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL (pp. 1532–1543).
Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks (pp. 45–50). Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.
Rychalska, B., Pakulska, K., Chodorowska, K., Walczak, W., & Andruszkiewicz, P. (2016). Samsung poland NLP team at SemEval-2016 task 1: Necessity for diversity; combining recursive autoencoders, wordnet and ensemble methods to measure semantic similarity. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016) (pp. 602–608). San Diego, California: Association for Computational Linguistics.
Schapire, R. E. (2003). The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification (pp. 149–171). Springer.
Sultan, M. A., Bethard, S., & Sumner, T. (2015). DLS@CU: Sentence similarity from word alignment and semantic vector composition. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015) (pp. 148–153). Denver, Colorado: Association for Computational Linguistics.
Tian, J., Zhou, Z., Lan, M., & Wu, Y. (2017). ECNU at SemEval-2017 task 1: Leverage kernel-based traditional NLP features and neural networks to build a universal model for multilingual and cross-lingual semantic textual similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017) (pp. 191–197). Association for Computational Linguistics.
Turney, P., & Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1), 141–188.
Van de Cruys, T., Poibeau, T., & Korhonen, A. (2013). A tensor-based factorization model of semantic compositionality. In Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL) (pp. 1142–1151).
© 2019 Springer Nature Switzerland AG
Cite this chapter
Hay, J., Van de Cruys, T., Muller, P., Doan, BL., Popineau, F., Ait-Elhara, O. (2019). Automatically Selecting Complementary Vector Representations for Semantic Textual Similarity. In: Pinaud, B., Guillet, F., Gandon, F., Largeron, C. (eds) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol 834. Springer, Cham. https://doi.org/10.1007/978-3-030-18129-1_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-18128-4
Online ISBN: 978-3-030-18129-1
eBook Packages: Intelligent Technologies and Robotics (R0)