Blog Post and Comment Extraction Using Information Quantity of Web Format

  • Conference paper
Information Retrieval Technology (AIRS 2008)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4993)

Abstract

As research on the blogosphere develops, extracting the post and the comments from a blog page becomes increasingly important for improving search performance. In this paper, we present a two-stage method. First, we combine visual information with effective text information to locate the main text, which represents the theme of the blog page. Second, we use the information quantity of separators to detect the boundary between the post and the comments. In our experiments, this method achieves good extraction performance and improves the performance of blog search.
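
To make the second stage more concrete, here is a minimal sketch of one way "information quantity of a separator" could be read. It assumes the term means Shannon self-information, -log2(p), of a separator's relative frequency within the main text; the abstract does not give the paper's actual definition, so this reading is an assumption. The function names and the sample separator list are hypothetical.

```python
import math
from collections import Counter


def self_information(separators):
    """Map each distinct separator to -log2(p), where p is its relative frequency.

    Assumption: this stands in for the paper's "information quantity";
    the abstract does not define the measure.
    """
    counts = Counter(separators)
    total = len(separators)
    return {sep: -math.log2(n / total) for sep, n in counts.items()}


def guess_boundary(separators):
    """Return the index of the first separator with the highest self-information."""
    info = self_information(separators)
    return max(range(len(separators)), key=lambda i: info[separators[i]])


# Hypothetical separators between consecutive text blocks of one blog page.
# Separators between comments repeat often, so they carry little information;
# the rare separator is taken as the post/comment boundary candidate.
seps = ["br", "br", "h3", "hr", "hr", "hr", "hr"]
boundary = guess_boundary(seps)
```

In this toy input the single `h3` separator is the rarest, so it is picked as the candidate boundary. A real page would need the main text located first (the first stage), and a rarest-separator heuristic is only one possible decision rule.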

Editor information

Hang Li, Ting Liu, Wei-Ying Ma, Tetsuya Sakai, Kam-Fai Wong, Guodong Zhou

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cao, D., Liao, X., Xu, H., Bai, S. (2008). Blog Post and Comment Extraction Using Information Quantity of Web Format. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_29

  • DOI: https://doi.org/10.1007/978-3-540-68636-1_29

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68633-0

  • Online ISBN: 978-3-540-68636-1

  • eBook Packages: Computer Science, Computer Science (R0)
