An FW-DTSS Based Approach for News Page Information Extraction

Ma, Leiming; Xia, Zhengyou

doi:10.1007/978-3-319-40973-3_22

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9714)

Included in the following conference series:

International Conference on Data Mining and Big Data

Abstract

Automatically identifying and extracting the main text of a news page has become a critical task in many web content analysis applications, given the explosive growth of news information. However, the body content is usually surrounded by presentation elements such as dynamic flashing logos, navigation menus, and a multitude of ad blocks. In this paper, we propose a function word (FW) based approach that incorporates the concept of DOM tree structure similarity (DTSS). Function words are words that have no real lexical meaning of their own but serve a grammatical or functional role. Experimental statistics show that function words occur frequently in the main text, while in presentation elements they are absent or appear only once or twice. Our approach involves three separate stages. Stage 1 is the learning stage. In stage 2, the number of function words in each paragraph is counted, and the paragraph with the most function words is chosen as the sample. In stage 3, all body paragraphs are extracted according to the similarity of their DOM tree structure to that of the sample paragraph. Experimental results on real-world data show that the FW-DTSS based approach compares favorably with statistics-based and vision-based approaches in both efficiency and accuracy.
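Stages 2 and 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define how DTSS is computed, so the sketch assumes two paragraphs are "similar" when their tag paths from the document root are identical. The function-word list, the `ParagraphCollector` class, and the `extract_main_text` function are hypothetical names introduced here for illustration, and a small hard-coded English word list stands in for whatever stage 1 would learn.

```python
from html.parser import HTMLParser

# Hypothetical function-word list; in the paper this would come from stage 1.
FUNCTION_WORDS = {"the", "of", "and", "to", "in", "a", "is", "that", "but", "was"}

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ParagraphCollector(HTMLParser):
    """Collect a (tag_path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root down to the current node
        self.blocks = []  # list of (tag_path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates mildly malformed HTML.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.blocks.append((tuple(self.stack), text))


def count_function_words(text):
    """Count tokens that appear in the function-word list."""
    tokens = (w.strip(".,;:!?\"'") for w in text.lower().split())
    return sum(1 for w in tokens if w in FUNCTION_WORDS)


def extract_main_text(html):
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return []
    # Stage 2: the block with the most function words becomes the sample.
    sample_path, _ = max(parser.blocks, key=lambda b: count_function_words(b[1]))
    # Stage 3 (assumed similarity test): keep blocks with the same tag path.
    return [text for path, text in parser.blocks if path == sample_path]


PAGE = """
<html><body>
<ul><li>Home</li><li>Sports</li></ul>
<div><p>The mayor said that the budget of the city was approved.</p>
<p>Council members voted late on Tuesday.</p></div>
<div><span>Buy now</span></div>
</body></html>
"""

for paragraph in extract_main_text(PAGE):
    print(paragraph)
```

On this toy page the first `<p>` is selected as the sample, and the second `<p>` is extracted as well even though it contains no listed function words, because it shares the sample's tag path; the menu items and the ad text are dropped. Exact path equality is the simplest possible stand-in for a similarity measure and would split paragraphs that contain inline markup, so a real system would need a more tolerant comparison.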

Author information

Authors and Affiliations

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, 211106, China
Leiming Ma & Zhengyou Xia

Corresponding author

Correspondence to Zhengyou Xia.

Editor information

Editors and Affiliations

Peking University, Beijing, China
Ying Tan
Xi'an Jiaotong-Liverpool University, Suzhou, China
Yuhui Shi

About this paper

Cite this paper

Ma, L., Xia, Z. (2016). An FW-DTSS Based Approach for News Page Information Extraction. In: Tan, Y., Shi, Y. (eds) Data Mining and Big Data. DMBD 2016. Lecture Notes in Computer Science, vol 9714. Springer, Cham. https://doi.org/10.1007/978-3-319-40973-3_22

DOI: https://doi.org/10.1007/978-3-319-40973-3_22
Published: 14 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40972-6
Online ISBN: 978-3-319-40973-3
eBook Packages: Computer Science, Computer Science (R0)
