STEM: a suffix tree-based method for web data records extraction

Fang, Yixiang; Xie, Xiaoqin; Zhang, Xiaofeng; Cheng, Reynold; Zhang, Zhiqiang

doi:10.1007/s10115-017-1062-0

STEM: a suffix tree-based method for web data records extraction

Regular Paper
Published: 09 May 2017

Volume 55, pages 305–331, (2018)
Cite this article

Knowledge and Information Systems Aims and scope Submit manuscript

Yixiang Fang¹,
Xiaoqin Xie²,
Xiaofeng Zhang³,
Reynold Cheng¹ &
…
Zhiqiang Zhang²

731 Accesses
12 Citations
Explore all metrics

Abstract

To automatically extract data records from Web pages, the data record extraction algorithm is required to be robust and efficient. However, most of existing algorithms are not robust enough to cope with rich information or noisy data. In this paper, we propose a novel suffix tree-based extraction method (STEM) for this challenging task. First, we extract a sequence of identifiers from the tag paths of Web pages. Then, a suffix tree is built on top of this sequence and four refining filters are proposed to screen out data regions that might not contain data records. To evaluate model performance, we define an evaluation metric called pattern similarity and perform rigorous experiments on five real data sets. The promising experimental results have demonstrated that the proposed STEM is superior to the state-of-the-art algorithms like MDR, TPC and CTVS with respect to precision, recall and pattern similarity. Moreover, the time complexity of STEM is linear to the total number of HTML tags contained in Web pages, which indicates the potential applicability of STEM in a wide range of Web-scale data record extraction applications.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A comprehensive and analytical review of text clustering techniques

Article 08 April 2024

A Review of Tools and Techniques for Preprocessing of Textual Data

The pattern frequency distribution theory: a mathematic establishment toward rational and reliable pattern mining

Article 20 August 2022

Notes

Without further explanation, the DOM tree nodes mentioned in this paper are element nodes.
JTidy: http://sourceforge.net/projects/jtidy/?source=directory.
http://i.cs.hku.hk/~yxfang/public_data.rar.
For TB2 and TB3, the number of websites are not specified in [28], and so we put “–” there.

References

Laender A, Riberro-Neto B, Silva A, Teixira J (2002) A brief survey of web data extraction tools. In: SIGMOD record, vol 31(2). ACM, New York, pp 84–93
Chang C-H, Kayed M, Girgis MR, Shaala KF (2006) A survey of web information extraction systems. IEEE Trans Knowl Data Eng. 18(10):1411–1428
Article Google Scholar
Sleiman H, Corchuelo R et al (2013) A survey on region extractors from web documents. IEEE Trans Knowl Data Eng 25(9):1960–1981
Article Google Scholar
Ferrara E, De Meo P, Fiumara G, Baumgartner R (2014) Web data extraction, applications and techniques: a survey. Knowl Based Syst 70:301–323
Article Google Scholar
Miao G, Tatemura J, Hsiung W-P, Sawires A, Moser L (2009) Extracting data records from the web using tag path clustering. In: Proceedings of the 18th international conference on World Wide Web (WWW). ACM, New York, pp 981–990
Su W, Wang J, Wang J, Lochovsky FH, Liu Y (2012) Combing tag and value similarity for data extraction and alignment. IEEE Trans Knowl Data Eng 24(7):1186–1200
Article Google Scholar
Liu B, Grossman R, Zhai Y (2003) Mining data records in web pages. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 601–606
Baumgartner R, Gottlob G, Herzog M (2009) Scalable web data extraction for online market intelligence. Proc VLDB Endow 2(2):1512–1523
Article Google Scholar
Liu L, Pu C, Han W (2000) Xwrap: an xml-enabled wrapper construction system for web information sources. In: Proceedings of the IEEE 16th international conference on data engineering, pp 611–621
Gulhane P, Madaan A, Mehta R, et al (2011) Web-scale information extraction with vertex. In: ICDE, pp 209–220
Etzioni O, Cafarella M, Downey D, Kok S, Popescu A-M, Shaked T, Soderland S, Weld DS, Yates A (2004) Web-scale information extraction in knowitall (preliminary results). In: Proceedings of the 13th international conference on World Wide Web. ACM, New York, pp 100–110
Gupta S, Kaiser G, Neistadt D, Grimm P (2003) Dom-based content extraction of html documents. In: Proceedings of the 12th international conference on World Wide Web, pp 207–214
Gupta S, Kaiser GE, Grimm P, Chiang MF, Starren J (2005) Automating content extraction of html documents. In: Proceedings of the 14th international conference on World Wide Web, vol 8, no. 2, pp 179–224, 6
Zheng S, Song R, Wen J-R, Giles CL (2009) Efficient record-level wrapper induction. In: Proceedings of the 18th international conference on information and knowledge management (CIKM). ACM, New York, pp 47–56
Dalvi N, Kumar R, Soliman M (2011) Automatic wrappers for large scale web extraction. Proc VLDB Endow 4(4):219–230
Article Google Scholar
Buttler D, Liu L, Pu C (2001) A fully automated object extraction system for the world wide web. In: Proceedings of IEEE 21st international conference on distributed computing systems, pp 361–370
Chang C, Lui S (2001) Iepad: information extraction based on pattern discovery. In: Proceedings of the 10th international conference on World Wide Web. ACM, New York, pp 681–688
Kayed M (2012) Peer matrix alignment: a new algorithm. In: Advances in knowledge discovery and data mining, pp 268–279
Zhai Y, Liu B (2005) Web data extraction based on partial tree alignment. In: Proceedings of the 14th international conference on World Wide Web (WWW). ACM, New York, pp 76–85
Liu B, Zhai Y (2005) Net-a system for extracting web data from flat and nested data records. In: WISE, vol 2005. Springer, pp 487–495
Arasu A, Garcia-Molina H (2003) Extracting structured data from web pages. In: Proceedings of ACM SIGMOD international conference on management of data, pp 337–348
Zhai Y, Liu B (2006) Structured data extraction from the web based on partial tree alignment. IEEE Trans Knowl Data Eng 18(12):1614–1628
Article Google Scholar
Zhao H, Meng W, Yu C (2006) Automatic extraction of dynamic record sections from search engine result pages. In: Proceedings of the 32nd international conference on very large data bases, pp 989–1000
Bing L, Lam W, Gu Y (2011) Towards a unified solution: Data record region detection and segmentation. In: Proceedings of the 20th ACM international conference on information and knowledge management, pp 1265–1274
Gatterbauer W, Bohunsky P, Herzog M, Krüpl B, Pollak B (2007) Towards domain-independent information extraction from web tables. In: Proceedings of the 16th international conference on World Wide Web. ACM, New York, pp 71–80
Cai D, Yu S, Wen J-R, Ma W-Y (2003) Extracting content structure for web pages based on visual representation. In: Web technologies and applications, pp 406–417
Simon K, Lausen G (2005) Viper: augmenting automatic information extraction with visual perceptions. In: Proceedings of the 14th ACM international conference on information and knowledge management, pp 381–388
Bing L, Lam W, Wong T-L (2013) Robust detection of semi-structured web records using a dom structure-knowledge-driven model. ACM Trans Web 7(4):21:1–21:32
Article Google Scholar
Zhao H, Meng W, Wu Z, Raghavan V, Yu C (2005) Fully automatic wrapper generation for search engines. In: Proceedings of the 14th international conference on World Wide Web (WWW). ACM, New York, pp 66–75
Liu W, Meng X, Meng W (2010) Vide: a vision-based approach for deep web data extraction. IEEE Trans Knowl Data Eng 22(3):447–460
Article Google Scholar
Fumarola F, Weninger T, Barber R, Malerba D, Han J (2011) Hylien: a hybrid approach to general list extraction on the web. In: Proceedings of the 20th international conference on World Wide Web (WWW), pp 35–36
Furche T, Gottlob G, Grasso G, Guo X, Orsi G, Schallhart C, Wang C (2014) Diadem: thousands of websites to a single database. Proc VLDB Endow 7(14):1845–1856
Article Google Scholar
Crescenzi V, Merialdo P, Qiu D (2013) Alfred: crowd assisted data extraction. In: Proceedings of the 22th international conference on World Wide Web (WWW), pp 297–300
Kondreddi SK, Triantafillou P, Weikum G (2014) Combining information extraction and human computing for crowdsourced knowledge acquisition. In: IEEE 30th international conference on data engineering, pp 988–999
Pasternack J, Roth D (2009) Extracting article text from the web with maximum subsequence segmentation. In: Proceedings of the 18th international conference on World Wide Web (WWW), pp 971–980
Weninger T, Hsu WH (2008) Text extraction from the web via text-to-tag ratio. In: 19th international workshop on database and expert systems application, pp 23–28
Weninger T, Hsu WH, Han J (2010) Cetr: content extraction via tag ratios. In: WWW, pp 971–980
Sun F, Song D, Liao L (2011) Dom based content extraction via text density. In: SIGIR, pp 245–254
Wu S, Liu J, Fan J (2015) Automatic web content extraction by combination of learning and grouping. In: Proceedings of the 24th international conference on World Wide Web (WWW), pp 1264–1274
Song D, Sun F, Liao L (2015) A hybrid approach for content extraction with text density and visual importance of dom nodes. Knowl Inf Syst 42(1):75–96
Article Google Scholar
Chiu D-Y, Wu Y-H, Chen A (2009) Efficient frequent sequence mining by a dynamic strategy switching algorithm. VLDB J 18(1):303–327
Article Google Scholar
Loh W-K, Ahn H (2011) A storage-efficient suffix tree construction algorithm for human genome sequences. IEICE Trans 94–D(12):2557–2560
Article Google Scholar
Pei J, Han J, Mortazavi-Asl B, Pinto H, Chen Q, Dayal U, Hsu M-C (2001) Prefixspan: mining sequential patterns efficiently by prefix-projected pattern growth. In: ICCN, pp 215–224
Pei J, Han J, Mortazavi-Asl B, Wang J, Pinto H, Chen Q, Dayal U, Hsu M-C (2004) Mining sequential patterns by pattern-growth: the prefixspan approach. IEEE Trans Knowl Data Eng, vol 16
Xie X, Fang Y, Zhang Z, Li L (2012) Extracting data records from web using suffix tree. In: Proceedings of the 18th SIGKDD workshop on mining data semantics. ACM
Maaß M (1999) Suffix trees and their applications. Ferienakademie
Grossi R, Italiano GF (1993) Suffix trees and their applications in string algorithms. In: Proceedings of the 1st south American workshop on string processing, pp 57–76
Zamir O, Etzioni O (1998) Web document clustering: a feasibility demonstration. In: Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval, pp 46–54
Chim H, Deng X (2007) A new suffix tree similarity measure for document clustering. In Proceedings of the 16th international conference on World Wide Web. ACM, pp 121–130
Fang Y, Zhang H, Ye Y, Li X (2014) Detecting hot topics from twitter: a multiview approach. J Inf Sci 40(5):578–593
Article Google Scholar
Ukkonen E (1995) On-line construction of suffix trees. Algorithmica 14(3):249–260
Article MathSciNet MATH Google Scholar
Farach M (1997) Optimal suffix tree construction with large alphabets. In: FOCS, pp 137–143
Greenberg RI (2003) Bounds on the number of longest common subsequences. CoRR, vol. cs.DM/0301030. [Online]. Available: http://arxiv.org/abs/cs.DM/0301030
Knuth DE, Morris JH, Pratt VR (1977) Fast pattern matching in strings. SAIM J Comput 6(2):323–350
Yamada Y, Craswell N, Nakatoh T, Hirokawa S (2004) Testbed for information extraction from deep web. In: Proceedings of the 13th international conference on World Wide Web (WWW). ACM, New York, pp 346–347
Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv 33(1):31–88
Article Google Scholar

Download references

Acknowledgements

Reynold Cheng and Yixiang Fang were supported by the Research Grants Council of Hong Kong (RGC Projects HKU 17229116 and 17205115) and the University of Hong Kong (Projects 102009508 and 104004129). Xiaoqin Xie was supported by the China Scholarship Council and the National Natural Science Foundation of China (Nos. 61202090, 61370084, 61272184), the Science and Technology Innovation Talents Special Fund of Harbin under Grant (No. 2015RQQXJ067), the Fundamental Research Funds for the Central Universities under Grant (No. HEUCF10060). Xiaofeng Zhang was partially supported by the National Science Foundation of China under Grant No. 61370213, National Key Technology Support Program of China under Grant No. 2014BAL05B06, and Shenzhen Science and Technology Program under Grant No. JCYJ20160330163900579. We would like to thank the reviewers for their insightful comments.

Author information

Authors and Affiliations

Department of Computer Science, The University of Hong Kong, Pokfulam Road, Pokfulam, Hong Kong
Yixiang Fang & Reynold Cheng
College of Computer Science and Technology, Harbin Engineering University, Harbin, China
Xiaoqin Xie & Zhiqiang Zhang
School of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, G709, HIT Campus, University Town, Xili, Shenzhen, China
Xiaofeng Zhang

Authors

Yixiang Fang
View author publications
You can also search for this author in PubMed Google Scholar
Xiaoqin Xie
View author publications
You can also search for this author in PubMed Google Scholar
Xiaofeng Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Reynold Cheng
View author publications
You can also search for this author in PubMed Google Scholar
Zhiqiang Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Xiaofeng Zhang.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Fang, Y., Xie, X., Zhang, X. et al. STEM: a suffix tree-based method for web data records extraction. Knowl Inf Syst 55, 305–331 (2018). https://doi.org/10.1007/s10115-017-1062-0

Download citation

Received: 15 April 2016
Revised: 10 March 2017
Accepted: 29 April 2017
Published: 09 May 2017
Issue Date: May 2018
DOI: https://doi.org/10.1007/s10115-017-1062-0

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

STEM: a suffix tree-based method for web data records extraction

Abstract

Access this article

Similar content being viewed by others

A comprehensive and analytical review of text clustering techniques

A Review of Tools and Techniques for Preprocessing of Textual Data

The pattern frequency distribution theory: a mathematic establishment toward rational and reliable pattern mining

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

STEM: a suffix tree-based method for web data records extraction

Abstract

Access this article

Similar content being viewed by others

A comprehensive and analytical review of text clustering techniques

A Review of Tools and Techniques for Preprocessing of Textual Data

The pattern frequency distribution theory: a mathematic establishment toward rational and reliable pattern mining

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation