Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7487)

Abstract

The web has become a pig sty: everyone dumps information at random places and in random shapes. Try to find the cheapest apartment in Oxford considering rent, travel, tax and heating costs; or a cheap, reasonably reviewed 11” laptop with an SSD.

Data extraction flushes structured information out of this sty: it turns mostly unstructured web pages into highly structured knowledge. In this chapter, we give a gentle introduction to data extraction, including pointers to existing systems. We start with an overview and classification of data extraction systems along two primary dimensions: the level of supervision and the scale considered. The rest of the chapter follows the major division of these approaches into site-specific, supervised ones and domain-specific, unsupervised ones. We first discuss supervised data extraction, where a human user identifies examples of the relevant data for each site and the system generalizes these examples into extraction programs. We focus particularly on declarative and rule-based paradigms. In the second part, we turn to fully automated (or unsupervised) approaches, where the system identifies the relevant data by itself and extracts it from many websites without human intervention. Ontologies or schemata have proven invaluable for guiding unsupervised data extraction, and we present an overview of the existing approaches and the different ways in which they use ontologies.
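To make the supervised setting concrete, the following is a toy sketch of generalizing user-marked examples into an extraction rule. It is not the method of any system discussed in the chapter: the page snippet, the path representation, and the helper names `path_to`, `generalize`, and `apply_rule` are all assumptions made for illustration, and only Python's standard `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed result page with three apartment listings.
PAGE = """
<html><body>
  <div class="listing"><span class="addr">12 High St</span><span class="rent">950</span></div>
  <div class="listing"><span class="addr">3 Mill Lane</span><span class="rent">1100</span></div>
  <div class="listing"><span class="addr">7 Park Rd</span><span class="rent">875</span></div>
</body></html>
"""

def path_to(root, target):
    """Return the path from root to target as a list of (tag, child_index) steps."""
    if root is target:
        return []
    for i, child in enumerate(root):
        sub = path_to(child, target)
        if sub is not None:
            return [(child.tag, i)] + sub
    return None  # target is not below root

def generalize(paths):
    """Keep a child index where all examples agree; use None as a wildcard elsewhere."""
    first = paths[0]
    tags = [t for t, _ in first]
    if any([t for t, _ in p] != tags for p in paths):
        raise ValueError("examples must share the same tag sequence")
    rule = []
    for steps in zip(*paths):
        indexes = {i for _, i in steps}
        rule.append((steps[0][0], indexes.pop() if len(indexes) == 1 else None))
    return rule

def apply_rule(root, rule):
    """Follow the rule from the root and return the text of every matching node."""
    nodes = [root]
    for tag, index in rule:
        nodes = [
            child
            for node in nodes
            for i, child in enumerate(node)
            if child.tag == tag and (index is None or i == index)
        ]
    return [n.text for n in nodes]

root = ET.fromstring(PAGE.strip())
rents = root.findall(".//span[@class='rent']")
# The "user" marks only the first two rent values as examples.
examples = [path_to(root, rents[0]), path_to(root, rents[1])]
rule = generalize(examples)
print(rule)
print(apply_rule(root, rule))
```

In this sketch the two examples differ only in the position of the enclosing `div`, so that step becomes a wildcard and the rule also picks up the third, unmarked listing. Real wrapper induction has to cope with noisy markup and varying templates, which this sketch ignores.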

The research leading to these results has received funding from the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007–2013) / ERC grant agreement DIADEM, no. 246858.


Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Flesca, S., Furche, T., Oro, L. (2012). Reasoning and Ontologies in Data Extraction. In: Eiter, T., Krennwallner, T. (eds) Reasoning Web. Semantic Technologies for Advanced Query Answering. Reasoning Web 2012. Lecture Notes in Computer Science, vol 7487. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33158-9_5

  • DOI: https://doi.org/10.1007/978-3-642-33158-9_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33157-2

  • Online ISBN: 978-3-642-33158-9

  • eBook Packages: Computer Science, Computer Science (R0)
