Reasoning and Ontologies in Data Extraction

Flesca, Sergio; Furche, Tim; Oro, Linda

doi:10.1007/978-3-642-33158-9_5

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7487)

Included in the following conference series:

Reasoning Web International Summer School

Abstract

The web has become a pig sty: everyone dumps information at random places and in random shapes. Try to find the cheapest apartment in Oxford considering rent, travel, tax and heating costs; or a cheap, reasonably well reviewed 11-inch laptop with an SSD.

Data extraction flushes structured information out of this sty: it turns mostly unstructured web pages into highly structured knowledge. In this chapter, we give a gentle introduction to data extraction, including pointers to existing systems. We start with an overview and classification of data extraction systems along two primary dimensions: the level of supervision and the considered scale. The rest of the chapter is organized along the major division of these approaches into site-specific and supervised versus domain-specific and unsupervised. We first discuss supervised data extraction, where a human user identifies examples of the relevant data for each site and the system generalizes these examples into extraction programs. We focus particularly on declarative and rule-based paradigms. In the second part, we turn to fully automated (or unsupervised) approaches, where the system itself identifies the relevant data and extracts it from many websites without human intervention. Ontologies or schemata have proven invaluable for guiding unsupervised data extraction, and we present an overview of the existing approaches and the different ways in which they use ontologies.
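
To make the supervised setting concrete, the following is a minimal sketch of how two user-marked examples might be generalized into a single extraction rule. It is not taken from the chapter or from any of the systems it surveys; the page markup, the path notation, and the function names `generalize` and `extract` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical result page; the markup is invented for this sketch.
PAGE = """
<html><body>
  <ul>
    <li><span>12 High St</span><span>950</span></li>
    <li><span>3 Mill Lane</span><span>875</span></li>
    <li><span>7 Park Rd</span><span>1020</span></li>
  </ul>
</body></html>
"""


def generalize(example_paths):
    """Turn example paths into one rule (assumed, simplistic strategy).

    Steps on which all examples agree are kept as they are. Steps that
    differ only in their position, such as li[1] versus li[3], are
    relaxed to the bare tag name so that the rule matches every sibling.
    """
    split = [path.split("/") for path in example_paths]
    if len({len(steps) for steps in split}) != 1:
        raise ValueError("examples must have the same depth")
    rule = []
    for steps in zip(*split):
        if len(set(steps)) == 1:
            rule.append(steps[0])
            continue
        tags = {step.split("[")[0] for step in steps}
        if len(tags) != 1:
            raise ValueError("examples disagree on a tag name")
        rule.append(tags.pop())
    return "/".join(rule)


def extract(page, rule):
    """Apply the rule, relative to the root element, and return the text."""
    root = ET.fromstring(page.strip())
    return [element.text for element in root.findall(rule)]


# The user marks the rent of the first and the third listing.
examples = ["body/ul/li[1]/span[2]", "body/ul/li[3]/span[2]"]
rule = generalize(examples)
rents = extract(PAGE, rule)
print(rule)
print(rents)
```

The rule produced here also covers the second listing, which the user never marked. Generalizing from a few examples to all records of a site is the point of the supervised approach; practical systems use far more robust generalization than this positional relaxation.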

The research leading to these results has received funding from the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007–2013) / ERC grant agreement DIADEM, no. 246858.

Author information

Authors and Affiliations

DEIS, University of Calabria, Via P. Bucci 41C, 87036, Rende, Italy
Sergio Flesca
Department of Computer Science, Oxford University, Wolfson Building, Parks Road, Oxford, OX1 3QD, UK
Tim Furche
ICAR-CNR, University of Calabria, Via P. Bucci 41C, 87036, Rende, Italy
Linda Oro

Editor information

Editors and Affiliations

Institute of Information Systems, Knowledge-Based Systems Group, Vienna University of Technology, Favoritenstraße 9–11/1843, A-1040, Vienna, Austria
Thomas Eiter & Thomas Krennwallner

About this chapter

Cite this chapter

Flesca, S., Furche, T., Oro, L. (2012). Reasoning and Ontologies in Data Extraction. In: Eiter, T., Krennwallner, T. (eds) Reasoning Web. Semantic Technologies for Advanced Query Answering. Reasoning Web 2012. Lecture Notes in Computer Science, vol 7487. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33158-9_5

DOI: https://doi.org/10.1007/978-3-642-33158-9_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33157-2
Online ISBN: 978-3-642-33158-9
eBook Packages: Computer Science (R0)
