Structured Data Extraction: Wrapper Generation

Web Data Mining

Part of the book series: Data-Centric Systems and Applications (DCSA)

Abstract

Web information extraction is the problem of extracting target information items from Web pages. There are two general problems: extracting information from natural language text and extracting structured data from Web pages. This chapter focuses on extracting structured data. A program for extracting such data is usually called a wrapper. Extracting information from text is studied mainly in the natural language processing community.
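To make the term concrete, the following is a minimal sketch of a hand-written wrapper, assuming a page whose data records are table rows and whose fields are table cells. The class name `TableWrapper`, the function `extract_records`, and the sample page are hypothetical and chosen only for illustration. It is not one of the wrapper generation methods this chapter covers, which aim to produce such programs rather than have them written by hand.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Hypothetical wrapper: collects the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.records = []   # one list of field strings per data record
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside a <td> is treated as a data item.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows with no <td> cells, e.g. header rows
                self.records.append(self._row)
            self._row = None
            self._cell = None


def extract_records(html):
    """Run the wrapper over one page and return its data records."""
    wrapper = TableWrapper()
    wrapper.feed(html)
    wrapper.close()
    return wrapper.records


# Assumed sample page: each row is one record with a name and a price.
PAGE = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Apple</td><td>1.20</td></tr>
  <tr><td>Pear</td><td>0.95</td></tr>
</table>
"""

records = extract_records(PAGE)
```

A wrapper like this is tied to one page layout: if the site changes its markup, the extraction rules have to change with it.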

Author information

Corresponding author

Correspondence to Bing Liu.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Liu, B. (2011). Structured Data Extraction: Wrapper Generation. In: Web Data Mining. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19460-3_9

  • DOI: https://doi.org/10.1007/978-3-642-19460-3_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19459-7

  • Online ISBN: 978-3-642-19460-3

  • eBook Packages: Computer Science, Computer Science (R0)