An FW-DTSS Based Approach for News Page Information Extraction

  • Conference paper
Data Mining and Big Data (DMBD 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9714)

Abstract

With the explosive growth of news information, automatically identifying and extracting the main text of a news page has become a critical task in many web content analysis applications. However, the body content is usually surrounded by presentation elements such as dynamic flashing logos, navigational menus, and a multitude of ad blocks. In this paper, we propose a function word (FW) based approach that incorporates the concept of DOM tree structure similarity (DTSS). Function words are words that carry little lexical meaning of their own and instead serve a grammatical or functional role. Experimental statistics show that function words occur frequently in the main text, whereas they are absent from presentation elements or appear there only once or twice. Our approach involves three separate stages. Stage 1 is the learning stage. In stage 2, the number of function words in each paragraph is counted, and the paragraph with the most function words is chosen as the sample. In stage 3, all body paragraphs are extracted according to their DOM tree structure similarity to the sample paragraph. Experimental results on real-world data show that the FW-DTSS based approach achieves excellent efficiency and accuracy compared with statistics-based and vision-based approaches.
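
Stages 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: the function-word list is a small hand-picked English set standing in for whatever stage 1 learns, a "paragraph" is taken to be any non-empty text node, and two paragraphs count as structurally similar only when their root-to-node tag paths are identical. The names FUNCTION_WORDS, BlockCollector, count_function_words and extract_main_text are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: a tiny hand-picked list standing in for the output of stage 1.
FUNCTION_WORDS = {"the", "of", "and", "to", "in", "a", "is",
                  "that", "for", "on", "with", "by"}

# Tags that never have a closing tag and so must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class BlockCollector(HTMLParser):
    """Collect (tag_path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.blocks = []  # list of (tuple_of_tags, text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.blocks.append((tuple(self.stack), text))


def count_function_words(text):
    """Count tokens that are function words, ignoring edge punctuation."""
    tokens = (w.strip(".,;:!?\"'") for w in text.lower().split())
    return sum(1 for w in tokens if w in FUNCTION_WORDS)


def extract_main_text(html):
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return []
    # Stage 2: the block with the most function words becomes the sample.
    sample_path, _ = max(parser.blocks,
                         key=lambda b: count_function_words(b[1]))
    # Stage 3 (simplified): keep blocks whose tag path equals the sample's.
    return [text for path, text in parser.blocks if path == sample_path]


PAGE = """
<html><body>
<div><ul><li>Home</li><li>Sports</li></ul></div>
<div><p>The mayor of the city spoke to the press on Monday.</p>
<p>Budget cuts are planned.</p></div>
<div><span>Buy now</span></div>
</body></html>
"""

for paragraph in extract_main_text(PAGE):
    print(paragraph)
```

In this toy page the second paragraph contains no listed function words, yet it is still extracted because it shares the sample's tag path, which is the role stage 3 plays. Inline markup inside a paragraph (links, emphasis) would split it into several text nodes here, so a fuller version would need to merge them and use a graded similarity measure rather than exact path equality.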

Author information

Corresponding author

Correspondence to Zhengyou Xia.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Ma, L., Xia, Z. (2016). An FW-DTSS Based Approach for News Page Information Extraction. In: Tan, Y., Shi, Y. (eds) Data Mining and Big Data. DMBD 2016. Lecture Notes in Computer Science, vol 9714. Springer, Cham. https://doi.org/10.1007/978-3-319-40973-3_22

  • DOI: https://doi.org/10.1007/978-3-319-40973-3_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-40972-6

  • Online ISBN: 978-3-319-40973-3

  • eBook Packages: Computer Science, Computer Science (R0)
