Blog Post and Comment Extraction Using Information Quantity of Web Format

Cao, Donglin; Liao, Xiangwen; Xu, Hongbo; Bai, Shuo

doi:10.1007/978-3-540-68636-1_29


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4993)

Included in the following conference series:

Asia Information Retrieval Symposium


Abstract

As research on the blogosphere develops, extracting the post and the comments from a blog page has become increasingly important for improving search performance. In this paper, we present a two-stage method. First, we combine the advantages of visual information and effective text information to locate the main text, which represents the theme of the blog page. Second, we use the information quantity of separators to detect the boundary between the post and the comments. In our experiments, this method achieves good extraction performance and improves the performance of blog search.
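
The abstract does not define "information quantity", so the following is only an illustrative sketch of the second stage, not the authors' algorithm. It assumes the measure is Shannon self-information, -log2(p), where p is the relative frequency of a separator's format among all separators found in the main text, and it assumes the boundary is the separator that scores highest. The function names and the sample separator labels are hypothetical.

```python
import math
from collections import Counter


def information_quantity(separators):
    """Self-information -log2(p) of each separator (assumed measure).

    p is the relative frequency of the separator's format label
    within the given list.
    """
    counts = Counter(separators)
    total = len(separators)
    return [-math.log2(counts[s] / total) for s in separators]


def guess_boundary(separators):
    """Index of the separator with the highest score, or None if empty.

    Assumption: a format that occurs rarely among the separators is the
    most likely post/comment boundary. Ties go to the earliest separator.
    """
    if not separators:
        return None
    scores = information_quantity(separators)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical format labels for the separators between text blocks:
# paragraph breaks inside the post, one unique container, then rules
# between comments.
separators = ["p", "p", "p", "div.comments", "hr", "hr", "hr"]
boundary_index = guess_boundary(separators)
```

With this sample input, `boundary_index` is 3: the single `div.comments` separator is the rarest format and so carries the most information under the assumed measure.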



Author information

Authors and Affiliations

Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100080
Donglin Cao, Xiangwen Liao, Hongbo Xu & Shuo Bai
Graduate School, the Chinese Academy of Sciences, Beijing, 100039
Donglin Cao & Xiangwen Liao
Dept. of Cognitive Science, Xiamen University, Xiamen, 361005, P.R. China
Donglin Cao


Editor information

Hang Li, Ting Liu, Wei-Ying Ma, Tetsuya Sakai, Kam-Fai Wong, Guodong Zhou


About this paper

Cite this paper

Cao, D., Liao, X., Xu, H., Bai, S. (2008). Blog Post and Comment Extraction Using Information Quantity of Web Format. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_29


DOI: https://doi.org/10.1007/978-3-540-68636-1_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68633-0
Online ISBN: 978-3-540-68636-1
eBook Packages: Computer Science, Computer Science (R0)
