Organizing News Archives by Near-Duplicate Copy Detection in Digital Libraries

  • Conference paper
Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers (ICADL 2007)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4822)

Abstract

Digital libraries hold huge numbers of documents, and organizing them so that people can easily browse and reference them is a challenging task. Existing classification methods and chronological or geographical ordering provide only partial views of news articles, so the relationships among articles may not be easily grasped. In this paper, we propose a near-duplicate copy detection approach to organizing news archives in digital libraries. Conventional copy detection methods use word-level features, which can be time-consuming to compute and are not robust to term substitutions. We instead propose a sentence-level, statistics-based approach to detecting near-duplicate documents that is language independent and simple but effective. It is orthogonal to word-based approaches and can be used to complement them. It is also insensitive to the actual page layout of articles. Experimental results showed high efficiency and good accuracy for the proposed approach in detecting near-duplicates in news archives.
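As an illustration only, the idea of comparing documents through sentence-level statistics rather than words can be sketched as follows. The abstract does not say which statistics the paper uses, so this sketch assumes one plausible choice: the number of non-whitespace characters in each sentence. The function names `sentence_lengths` and `length_similarity`, the delimiter set, and the multiset Jaccard measure are all hypothetical and are not taken from the paper.

```python
import re
from collections import Counter

# Sentence delimiters covering Latin and CJK punctuation (an assumption of this sketch).
_DELIMS = re.compile(r"[.!?。！？]+")


def sentence_lengths(text):
    """Return the length of each sentence, counting non-whitespace characters only."""
    lengths = []
    for sentence in _DELIMS.split(text):
        n = len(re.sub(r"\s+", "", sentence))
        if n > 0:
            lengths.append(n)
    return lengths


def length_similarity(doc_a, doc_b):
    """Multiset Jaccard similarity between two documents' sentence-length profiles."""
    ca = Counter(sentence_lengths(doc_a))
    cb = Counter(sentence_lengths(doc_b))
    inter = sum((ca & cb).values())  # per-length minimum counts
    union = sum((ca | cb).values())  # per-length maximum counts
    return inter / union if union else 0.0


original = "The cat sat on the mat. It rained all day! Who knows?"
reflowed = "The cat sat\non the   mat.  It rained all day! Who knows?"
unrelated = "Completely different text here with one long sentence only."

print(length_similarity(original, reflowed))
print(length_similarity(original, unrelated))
```

Because whitespace is discarded before counting, re-wrapping or re-spacing an article leaves its profile unchanged, and no word tokenizer or vocabulary is needed. A score near 1 would only flag a candidate pair; a real system would presumably verify candidates with a stricter comparison.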

Editor information

Dion Hoe-Lian Goh, Tru Hoang Cao, Ingeborg Torvik Sølvberg, Edie Rasmussen

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chang, HC., Wang, JH. (2007). Organizing News Archives by Near-Duplicate Copy Detection in Digital Libraries. In: Goh, D.HL., Cao, T.H., Sølvberg, I.T., Rasmussen, E. (eds) Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers. ICADL 2007. Lecture Notes in Computer Science, vol 4822. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77094-7_52

  • DOI: https://doi.org/10.1007/978-3-540-77094-7_52

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-77093-0

  • Online ISBN: 978-3-540-77094-7

  • eBook Packages: Computer Science, Computer Science (R0)
