Structured Data Extraction: Wrapper Generation

Liu, Bing

doi:10.1007/978-3-642-19460-3_9

Part of the book series: Data-Centric Systems and Applications (DCSA)

Abstract

Web information extraction is the problem of extracting target information items from Web pages. There are two general problems: extracting information from natural language text and extracting structured data from Web pages. This chapter focuses on extracting structured data. A program for extracting such data is usually called a wrapper. Extracting information from text is studied mainly in the natural language processing community.
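
To make the term "wrapper" concrete, here is a minimal hand-written sketch in Python. It is only an illustration and is not taken from the chapter: the page fragment, the class name RowWrapper, and the function extract_records are hypothetical, and the sketch assumes each table row holds one data record and each cell one field. Only the standard-library html.parser module is used.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: each <tr> is one data record, each <td> one field.
PAGE = """
<table>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""


class RowWrapper(HTMLParser):
    """Collects the text of every <td> cell, grouped by its enclosing <tr>."""

    def __init__(self):
        super().__init__()
        self.records = []   # finished records, one list of fields per row
        self._row = None    # fields of the row being parsed, or None
        self._cell = None   # text pieces of the cell being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a cell; layout whitespace is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.records.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed in this row


def extract_records(html):
    """Return the data records found in html as lists of field strings."""
    wrapper = RowWrapper()
    wrapper.feed(html)
    wrapper.close()
    return wrapper.records


records = extract_records(PAGE)
```

A wrapper written by hand like this is tied to one page template and has to be rewritten whenever the markup changes, so it should be read as a toy illustration of what a wrapper does rather than as any particular method from the chapter.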

Author information

Authors and Affiliations

Department of Computer Science, University of Illinois, Chicago, 851 S. Morgan St., Chicago, IL, 60607-7053, USA
Bing Liu

Corresponding author

Correspondence to Bing Liu.

About this chapter

Cite this chapter

Liu, B. (2011). Structured Data Extraction: Wrapper Generation. In: Web Data Mining. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19460-3_9

DOI: https://doi.org/10.1007/978-3-642-19460-3_9
Published: 15 April 2011
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19459-7
Online ISBN: 978-3-642-19460-3
eBook Packages: Computer Science, Computer Science (R0)
