Efficient Approach for Near Duplicate Document Detection Using Textual and Conceptual Based Techniques

Roul, Rajendra Kumar; Mittal, Sahil; Joshi, Pravin

doi:10.1007/978-3-319-07353-8_23

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 27)

Abstract

With the rapid growth and use of the World Wide Web, a huge number of duplicate web pages now exist. For a search engine to return results free of duplicates, those duplicates must be detected and eliminated. The proposed approach combines the strengths of state-of-the-art duplicate detection algorithms such as Shingling and Simhash to efficiently detect and eliminate near-duplicate web pages while taking into account important factors such as word order. In addition, it employs Latent Semantic Indexing (LSI) to detect conceptually similar documents, which textual duplicate detection techniques such as Shingling and Simhash often miss. The approach uses the Hamming distance (for textual duplicate detection) and cosine similarity (for conceptual duplicate detection) between two documents as similarity measures. Performance is evaluated by comparing the F-measure of the proposed approach with that of the traditional Simhash technique. Experimental results show that the proposed approach can outperform traditional Simhash.
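
As a rough illustration of the two similarity measures named in the abstract, the following is a minimal Python sketch and not the authors' implementation. It assumes word-level shingles of length k = 3 are used as Simhash features (one plausible way to make the fingerprint sensitive to word order), a 64-bit fingerprint built from MD5 digests, and plain Python lists standing in for the LSI document vectors. The function names and all parameter values are hypothetical, and the LSI projection itself is not shown.

```python
import hashlib
import math
from collections import Counter


def shingles(text, k=3):
    """Return overlapping k-word shingles, which preserve local word order."""
    words = text.lower().split()
    if len(words) < k:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def simhash(features, bits=64):
    """Simhash-style fingerprint: a weighted per-bit vote over feature hashes.

    Assumes bits is a multiple of 8 and at most 128 (the MD5 digest size).
    """
    votes = [0] * bits
    for feature, weight in Counter(features).items():
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    # A bit is set in the fingerprint when its vote total is positive.
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"
textual_distance = hamming_distance(simhash(shingles(doc_a)), simhash(shingles(doc_b)))
```

Under these assumptions, a small `textual_distance` would flag a textual near duplicate, while `cosine_similarity` would be applied to the documents' vectors in the reduced LSI space to flag conceptual duplicates. The thresholds for both measures are not given in the abstract and would have to be chosen empirically.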

Author information

Authors and Affiliations

BITS Pilani K. K. Birla Goa Campus, Zuarinagar, Goa, 403726, India
Rajendra Kumar Roul, Sahil Mittal & Pravin Joshi

Corresponding author

Correspondence to Rajendra Kumar Roul.

Editor information

Editors and Affiliations

Indian Statistical Institute, Machine Intelligence Unit, Kolkata, India
Malay Kumar Kundu
Dept. of Computer Science and Engineering, National Institute of Technology Rourkela, Rourkela, India
Durga Prasad Mohapatra
Dept. of Electronics and Tele-Communication Engineering, Jadavpur University Artificial Intelligence Laboratory, Kolkata, India
Amit Konar
Dept. of Computer Science and Engineering, St. Thomas' College of Engineering & Technology, Kidderpore, West Bengal, India
Aruna Chakraborty

About this paper

Cite this paper

Roul, R.K., Mittal, S., Joshi, P. (2014). Efficient Approach for Near Duplicate Document Detection Using Textual and Conceptual Based Techniques. In: Kumar Kundu, M., Mohapatra, D., Konar, A., Chakraborty, A. (eds) Advanced Computing, Networking and Informatics- Volume 1. Smart Innovation, Systems and Technologies, vol 27. Springer, Cham. https://doi.org/10.1007/978-3-319-07353-8_23

DOI: https://doi.org/10.1007/978-3-319-07353-8_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07352-1
Online ISBN: 978-3-319-07353-8
eBook Packages: Engineering, Engineering (R0)
