Morphosyntactic Preprocessing Impact on Document Embedding: An Empirical Study on Semantic Similarity

Yahi, Nourelhouda; Belhadef, Hacene

doi:10.1007/978-3-030-33582-3_12

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1073)

Included in the following conference series:

International Conference of Reliable Information and Communication Technology

Abstract

Word embedding is among the most widely known and used representations of the vocabulary of text documents. It captures the context of a word in a document, but many applications need to understand the content of a text that is longer than a single word; this is what we call “document embedding”. This paper presents an empirical study that evaluates the impact of morphosyntactic data preprocessing on document embedding techniques in a textual semantic similarity task. It compares the most widely known text preprocessing techniques: (1) cleaning, comprising stop-word removal, lowercase conversion, and punctuation and number elimination; (2) stemming, using the best-known algorithms in the literature, namely the Porter, Snowball and Lancaster stemmers; and (3) lemmatization, using the WordNet lemmatizer. Experimental analysis on the MSRP (Microsoft Research Paraphrase) dataset reveals that preprocessing improves classifier accuracy, with stemming methods outperforming the other techniques.
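To make the cleaning step in (1) concrete, the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the function name `clean`, the tiny `STOP_WORDS` set, and the choice to replace punctuation with spaces are all assumptions made for illustration, and the abstract does not say which stop-word list or tokenizer the study used.

```python
import re
import string

# Illustrative stop-word list (an assumption; the study's actual list is not given).
STOP_WORDS = frozenset({
    "a", "an", "the", "is", "are", "was", "were", "of", "in",
    "on", "to", "and", "or", "that", "this", "it",
})

# Map every ASCII punctuation character to a space so that tokens
# joined by punctuation (e.g. "12.5") are split rather than fused.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean(text, stop_words=STOP_WORDS):
    """Lowercase, drop ASCII punctuation and digits, then remove stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    no_digits = re.sub(r"\d+", " ", no_punct)
    return [tok for tok in no_digits.split() if tok not in stop_words]


if __name__ == "__main__":
    print(clean("The company posted a profit of 12.5 million dollars in 2003."))
```

The returned token list is the natural input for the other two variants: a stemmer or lemmatizer (for example, the Porter, Snowball and Lancaster stemmers or the WordNet lemmatizer available in NLTK) could be mapped over it before the tokens are passed to a document embedding model.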

Notes

1. http://research.microsoft.com/nlp/msrparaphrase.htm

Author information

Authors and Affiliations

NTIC Faculty, MISC Laboratory, University of Constantine 2-Abdelhamid Mehri, Constantine, Algeria
Nourelhouda Yahi & Hacene Belhadef

Corresponding author

Correspondence to Nourelhouda Yahi.

Editor information

Editors and Affiliations

College of Computer Science and Engineering, Taibah University, Medina, Saudi Arabia
Faisal Saeed
School of Computing, Universiti Utara Malaysia (UUM), Sintok, Kedah Darul Aman, Malaysia
Fathey Mohammed
Management of Information Systems Department College of Business Administration, Taibah University, Yanbu, Saudi Arabia
Nadhmi Gazem

About this paper

Cite this paper

Yahi, N., Belhadef, H. (2020). Morphosyntactic Preprocessing Impact on Document Embedding: An Empirical Study on Semantic Similarity. In: Saeed, F., Mohammed, F., Gazem, N. (eds) Emerging Trends in Intelligent Computing and Informatics. IRICT 2019. Advances in Intelligent Systems and Computing, vol 1073. Springer, Cham. https://doi.org/10.1007/978-3-030-33582-3_12

DOI: https://doi.org/10.1007/978-3-030-33582-3_12
Published: 02 November 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33581-6
Online ISBN: 978-3-030-33582-3
eBook Packages: Intelligent Technologies and Robotics (R0)
