Sentence Paraphrase Detection Using Classification Models

  • Conference paper
Text Processing (FIRE 2016)

Abstract

In this paper, we address the task of sentence paraphrase detection, which is to decide whether two sentences stand in a paraphrase relationship. We describe a supervised learning strategy in which a pair of sentences is classified to decide the paraphrase relationship, using only lexical features computed over n-grams as the classification features. Gradient Boosting, K-Nearest Neighbor, Decision Tree, and Support Vector Machine are chosen as the classifiers. The performance of the classifiers is compared, and the features are analyzed to determine which are most important for paraphrase detection. Evaluation is performed on the corpus of the 2016 Detecting Paraphrase in Indian Languages task proposed by the Forum for Information Retrieval Evaluation (DPIL-FIRE2016). The experimental results show that Gradient Boosting achieves the highest Overall Score. Using the learned classifier, we obtained the highest F1 measure for both Task1 and Task2 on Malayalam and Tamil, and the highest F1 measure for Task2 on Punjabi in DPIL-FIRE2016.
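
The abstract does not list the exact lexical features, so the following is only a minimal sketch of what n-gram overlap features for a sentence pair might look like. Whitespace tokenization, Jaccard overlap, the choice of n = 1..3, and the function names `ngrams`, `jaccard`, and `lexical_features` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Jaccard overlap of two sets; defined here as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lexical_features(s1: str, s2: str, max_n: int = 3) -> List[float]:
    """Hypothetical feature vector for a sentence pair.

    Word n-gram overlap for n = 1..max_n, followed by character
    n-gram overlap for n = 1..max_n (assumed features, not the paper's).
    """
    w1, w2 = s1.lower().split(), s2.lower().split()
    c1, c2 = list(s1.lower()), list(s2.lower())
    feats = [jaccard(ngrams(w1, n), ngrams(w2, n)) for n in range(1, max_n + 1)]
    feats += [jaccard(ngrams(c1, n), ngrams(c2, n)) for n in range(1, max_n + 1)]
    return feats


# One feature vector per sentence pair; a labeled set of such vectors
# would be the input to a classifier.
pair = ("the cat sat on the mat", "the cat sat on a mat")
features = lexical_features(*pair)
```

Vectors like `features`, paired with paraphrase labels, could then be passed to any of the classifier families named in the abstract, for instance a gradient boosting classifier from a standard machine learning library.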

Notes

  1. http://nlp.amrita.edu/dpil_cen/

Acknowledgments

This work is supported by the Social Science Fund of Heilongjiang Province (No. 16XWB02).

Corresponding author

Correspondence to Leilei Kong.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Tian, L., Ning, H., Kong, L., Chen, K., Qi, H., Han, Z. (2018). Sentence Paraphrase Detection Using Classification Models. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J. (eds) Text Processing. FIRE 2016. Lecture Notes in Computer Science, vol 10478. Springer, Cham. https://doi.org/10.1007/978-3-319-73606-8_13

  • DOI: https://doi.org/10.1007/978-3-319-73606-8_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73605-1

  • Online ISBN: 978-3-319-73606-8

  • eBook Packages: Computer Science, Computer Science (R0)
