Using Link-Based Content Analysis to Measure Document Similarity Effectively

Li, Pei; Li, Zhixu; Liu, Hongyan; He, Jun; Du, Xiaoyong

doi:10.1007/978-3-642-00672-2_40

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5446)

Abstract

With massive amounts of information now placed online, it is a challenge to exploit both the internal and the external information of documents when assessing the similarity between them. A variety of approaches have been proposed to model document similarity on different foundations, but they are usually not suited to combining internal and external information. In this paper, we introduce a link-based method, based on random walks on graphs, into content analysis. By defining similarity as the meeting probability of two random surfers, we propose a computational model for content analysis that can also be integrated with the external information of documents. An empirical study shows that our method achieves good accuracy, acceptable performance, and a fast convergence rate in multi-relational document similarity measurement.
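
The idea of similarity as a meeting probability of two random surfers can be illustrated with a SimRank-style iteration (Jeh and Widom, 2002). The sketch below is not the model proposed in this paper, which the abstract does not specify in detail. It assumes a toy graph in which terms link to the documents that contain them; the function name `simrank`, the decay value 0.8, and the node names are all hypothetical choices made for illustration.

```python
from itertools import product


def simrank(in_neighbors, decay=0.8, iterations=10):
    """SimRank-style similarity; an illustration, not the paper's own model.

    in_neighbors maps each node to the list of nodes that link to it.
    Two nodes are similar when surfers walking backward from each of
    them are likely to meet at the same node.
    """
    nodes = list(in_neighbors)
    # Start: every node is fully similar to itself and to nothing else.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            ins_a, ins_b = in_neighbors[a], in_neighbors[b]
            if not ins_a or not ins_b:
                # A surfer with nowhere to step back to can never meet the other.
                new_sim[(a, b)] = 0.0
                continue
            # Average the similarity over all pairs of in-neighbors, then decay.
            total = sum(sim[(u, v)] for u in ins_a for v in ins_b)
            new_sim[(a, b)] = decay * total / (len(ins_a) * len(ins_b))
        sim = new_sim
    return sim


# Hypothetical toy graph: term t1 occurs in d1 and d2, term t2 in d2 and d3.
graph = {
    "t1": [],
    "t2": [],
    "d1": ["t1"],
    "d2": ["t1", "t2"],
    "d3": ["t2"],
}
scores = simrank(graph)
```

In this toy graph, d1 and d2 share a term and receive a positive score, while d1 and d3 share none and score zero. External information, such as citation links between documents, could in principle be added as further edges in the same graph, though how the paper combines the two is not stated in the abstract.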

Author information

Authors and Affiliations

Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China
Pei Li, Zhixu Li, Jun He & Xiaoyong Du
School of Information, Renmin University of China, Beijing, China
Pei Li, Zhixu Li, Jun He & Xiaoyong Du
Department of Management Science and Engineering, Tsinghua University, Beijing, China
Hongyan Liu

Editor information

Editors and Affiliations

Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong, China
Qing Li
Department of Computer Science & Technology, Tsinghua University, Beijing, China
Ling Feng
School of Computing Science, Simon Fraser University, 8888 University Drive, V5A 1S6, Burnaby BC, Canada
Jian Pei
Department of Computer Science, University of Vermont, VT 05405, Burlington, USA
Sean X. Wang
School of Information Technology and Electrical Engineering, The University of Queensland, QLD 4072, Brisbane, Australia
Xiaofang Zhou
Jiangsu Provincial Key Lab of Computer Information Processing Technology, School of Computer Science & Technology, Soochow University, 1 Shizi Street, Suzhou, 215006, Jiangsu, China
Qiao-Ming Zhu

About this paper

Cite this paper

Li, P., Li, Z., Liu, H., He, J., Du, X. (2009). Using Link-Based Content Analysis to Measure Document Similarity Effectively. In: Li, Q., Feng, L., Pei, J., Wang, S.X., Zhou, X., Zhu, QM. (eds) Advances in Data and Web Management. APWeb/WAIM 2009. Lecture Notes in Computer Science, vol 5446. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00672-2_40

DOI: https://doi.org/10.1007/978-3-642-00672-2_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00671-5
Online ISBN: 978-3-642-00672-2
eBook Packages: Computer Science, Computer Science (R0)
