Abstract
Every language admits several plausible interpretations. With the evolution of the web, smart devices and social media, identifying these syntactic and semantic ambiguities has become a challenging task. In Natural Language Processing, two statements that use different words but carry the same meaning are called paraphrases. At FIRE 2016, we worked on the problem of detecting paraphrases in the Hindi language for the shared task DPIL (Detecting Paraphrases in Indian Languages). This paper proposes a novel approach to identify whether two statements are paraphrases, using machine learning algorithms such as Random Forest, Support Vector Machine, Gradient Boosting and Gaussian Naïve Bayes on the training data provided for the two subtasks. In cross-validation experiments, Random Forest outperforms the other methods with an F1-score of 0.94. We extended our work by adding a few more features and reusing the best classifier, which improved the F1-score by 1%. The experimental results show that our algorithm achieved the highest F1-score and accuracy, and hence secured first rank for the Hindi language among all participants in this shared task. Our approach can be used in applications such as question answering, document clustering, machine translation, text summarization and plagiarism detection.
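As a rough illustration of how the F1-score quoted above is computed for a paraphrase / non-paraphrase decision, the following minimal sketch may help. It assumes a binary labelling (1 = paraphrase, 0 = not a paraphrase). The word-overlap feature `jaccard_overlap` is a hypothetical stand-in, since the abstract does not list the features actually used, and neither function name comes from the paper.

```python
def jaccard_overlap(sentence_a, sentence_b):
    """Hypothetical pair feature (not from the paper): share of distinct
    whitespace-separated tokens that the two sentences have in common."""
    tokens_a = set(sentence_a.split())
    tokens_b = set(sentence_b.split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty sentences are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy gold labels and predictions, invented for illustration only.
    gold = [1, 1, 0, 0]
    predicted = [1, 0, 0, 1]
    print(f1_score(gold, predicted))
    print(jaccard_overlap("a b c", "b c d"))
```

In a setup like the one described, features of this kind would be computed for each sentence pair and passed to a classifier such as Random Forest, with the F1-score then averaged over cross-validation folds.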
Acknowledgments
We would like to thank the organizers of FIRE 2016 for conducting this shared task on Detecting Paraphrases in Indian Languages (DPIL) and for building the paraphrase corpora. We would also like to thank Sapient Corporation and Hays Business Solutions for giving us the opportunity to work in and explore the world of text analytics.
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Saini, A., Verma, A. (2018). Anuj@DPIL-FIRE2016: A Novel Paraphrase Detection Method in Hindi Language Using Machine Learning. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J. (eds) Text Processing. FIRE 2016. Lecture Notes in Computer Science, vol 10478. Springer, Cham. https://doi.org/10.1007/978-3-319-73606-8_11
DOI: https://doi.org/10.1007/978-3-319-73606-8_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73605-1
Online ISBN: 978-3-319-73606-8
eBook Packages: Computer Science, Computer Science (R0)