Morphosyntactic Preprocessing Impact on Document Embedding: An Empirical Study on Semantic Similarity

  • Conference paper
Emerging Trends in Intelligent Computing and Informatics (IRICT 2019)

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 1073))

Abstract

Word embedding is among the most widely known and used techniques for representing the vocabulary of text documents. It captures the context of a word within a document, but many applications need to understand the content of text units longer than a single word; representing such units is what we call “document embedding”. This paper presents an empirical study of how morphosyntactic preprocessing affects document embedding techniques on a textual semantic similarity task. It compares the most widely known text preprocessing techniques: (1) cleaning, comprising stop-word removal, lowercase conversion, and punctuation and number elimination; (2) stemming, using the best-known algorithms in the literature (the Porter, Snowball and Lancaster stemmers); and (3) lemmatization, using the WordNet Lemmatizer. Experimental analysis on the MSRP (Microsoft Research Paraphrase) dataset reveals that preprocessing techniques improve classifier accuracy, with stemming methods outperforming the other techniques.
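
As a rough illustration of the cleaning step listed in the abstract (lowercase conversion, punctuation and number elimination, stop-word removal), the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the function name `clean` and the tiny `STOP_WORDS` set are hypothetical, and the abstract does not say which stop-word list or tokenizer the study uses.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the abstract does not state which list the study relies on.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}


def clean(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation and digits, then drop stop words."""
    lowered = text.lower()
    # Translation table that deletes every ASCII punctuation mark and digit.
    table = str.maketrans("", "", string.punctuation + string.digits)
    stripped = lowered.translate(table)
    # Whitespace tokenization is an assumption made for this sketch.
    return [tok for tok in stripped.split() if tok not in STOP_WORDS]


tokens = clean("The 2 cats are sitting on the mat, in 2019!")
print(tokens)  # ['cats', 'sitting', 'mat']
```

The stemming and lemmatization steps would then be applied to tokens like these, for example with a library that provides Porter, Snowball and Lancaster stemmers and a WordNet-based lemmatizer. That part is left out here because it depends on external packages and data.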

Notes

  1. http://research.microsoft.com/nlp/msrparaphrase.htm

Author information

Corresponding author

Correspondence to Nourelhouda Yahi.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Yahi, N., Belhadef, H. (2020). Morphosyntactic Preprocessing Impact on Document Embedding: An Empirical Study on Semantic Similarity. In: Saeed, F., Mohammed, F., Gazem, N. (eds) Emerging Trends in Intelligent Computing and Informatics. IRICT 2019. Advances in Intelligent Systems and Computing, vol 1073. Springer, Cham. https://doi.org/10.1007/978-3-030-33582-3_12
