Using Link-Based Content Analysis to Measure Document Similarity Effectively

  • Conference paper
Advances in Data and Web Management (APWeb 2009, WAIM 2009)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 5446))

Abstract

As massive amounts of information are placed online, it is a challenge to exploit both the internal and the external information of documents when assessing the similarity between them. A variety of approaches have been proposed to model document similarity on different foundations, but they usually cannot combine internal and external information. In this paper, we introduce a link-based method, based on random walks on graphs, into content analysis. By defining similarity as the meeting probability of two random surfers, we propose a computational model for content analysis that can also be integrated with the external information of documents. An empirical study shows that our method achieves good accuracy, acceptable performance, and a fast convergence rate in multi-relational document similarity measurement.
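The abstract does not give the paper's actual model, so the following is only an illustrative sketch of the general idea of scoring similarity by how likely two random surfers are to meet. It uses a SimRank-style recurrence on a bipartite document-term graph, which is a stand-in technique and not necessarily the authors' formulation. The function names `build_graph` and `simrank`, the decay factor of 0.8, the iteration count, and the toy documents are all assumptions made for illustration.

```python
from itertools import product


def build_graph(docs):
    """Build an undirected document-term graph (an assumed representation).

    docs maps a document id to its list of terms. Each document node is
    linked to the term nodes it contains, and vice versa.
    """
    neighbors = {}
    for doc_id, terms in docs.items():
        doc_node = ("doc", doc_id)
        neighbors.setdefault(doc_node, set())
        for term in set(terms):
            term_node = ("term", term)
            neighbors[doc_node].add(term_node)
            neighbors.setdefault(term_node, set()).add(doc_node)
    return neighbors


def simrank(neighbors, decay=0.8, iterations=10):
    """SimRank-style iteration: two nodes are similar if their neighbors are.

    The score of (a, b) is the decayed average similarity over all pairs of
    neighbors, which can be read as a discounted chance that two surfers
    starting at a and b meet while walking the graph in lockstep.
    """
    nodes = list(neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            na, nb = neighbors[a], neighbors[b]
            if not na or not nb:
                new_sim[(a, b)] = 0.0
                continue
            total = sum(sim[(i, j)] for i in na for j in nb)
            new_sim[(a, b)] = decay * total / (len(na) * len(nb))
        sim = new_sim
    return sim


# Toy corpus (hypothetical data, not from the paper).
docs = {
    "d1": ["random", "walk", "graph"],
    "d2": ["random", "walk", "similarity"],
    "d3": ["price", "tax"],
}
sim = simrank(build_graph(docs))
print(sim[(("doc", "d1"), ("doc", "d2"))])  # documents sharing terms
print(sim[(("doc", "d1"), ("doc", "d3"))])  # documents sharing nothing
```

In this sketch, external information such as citation or authorship links could be added as further edges in the same graph, which is one plausible reading of how a link-based model integrates internal and external information. The all-pairs loop is quadratic in the number of nodes per iteration and is meant for small examples only.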

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Li, P., Li, Z., Liu, H., He, J., Du, X. (2009). Using Link-Based Content Analysis to Measure Document Similarity Effectively. In: Li, Q., Feng, L., Pei, J., Wang, S.X., Zhou, X., Zhu, QM. (eds) Advances in Data and Web Management. APWeb/WAIM 2009. Lecture Notes in Computer Science, vol 5446. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00672-2_40

  • DOI: https://doi.org/10.1007/978-3-642-00672-2_40

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-00671-5

  • Online ISBN: 978-3-642-00672-2

  • eBook Packages: Computer Science, Computer Science (R0)
