Crawling the Content Hidden Behind Web Forms

Álvarez, Manuel; Raposo, Juan; Pan, Alberto; Cacheda, Fidel; Bellas, Fernando; Carneiro, Víctor

doi:10.1007/978-3-540-74477-1_31

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4706)

Included in the following conference series:

International Conference on Computational Science and Its Applications

Abstract

Today's crawler engines cannot reach most of the information on the Web. A great amount of valuable information is “hidden” behind the query forms of online databases, is dynamically generated by technologies such as JavaScript, or both. This portion of the Web is usually known as the Deep Web or the Hidden Web. We have built DeepBot, a prototype hidden-web crawler able to access such content. DeepBot receives as input a set of domain definitions, each describing a specific data-collection task, and automatically identifies the forms relevant to them and learns to execute queries on those forms. In this paper we describe the techniques used to build DeepBot and report the experimental results obtained when testing it on several real-world data-collection tasks.

This research was partially supported by the Spanish Ministry of Education and Science under project TSI2005-07730.
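
The abstract does not say how a domain definition is represented or how relevant forms are recognized, so the following is only an illustrative sketch, not DeepBot's actual method. It assumes a domain definition is a named list of attributes with textual aliases, and that a form counts as relevant when enough of its field labels resemble those aliases. The names `Attribute`, `DomainDefinition`, `match_form`, and `is_relevant`, the thresholds, and the use of Python's `difflib` similarity ratio are all assumptions made for the example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Attribute:
    # Hypothetical: one queryable attribute of the domain, with label aliases.
    name: str
    aliases: list


@dataclass
class DomainDefinition:
    # Hypothetical: a data-collection task described by its attributes.
    name: str
    attributes: list


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] using difflib's ratio."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_form(domain, field_labels, threshold=0.8):
    """Map each form field label to its best-matching attribute, if any."""
    mapping = {}
    for label in field_labels:
        best_name, best_score = None, 0.0
        for attr in domain.attributes:
            for alias in attr.aliases:
                score = similarity(label, alias)
                if score > best_score:
                    best_name, best_score = attr.name, score
        # Keep the match only if it is close enough (assumed threshold).
        if best_score >= threshold:
            mapping[label] = best_name
    return mapping


def is_relevant(domain, field_labels, min_matched=2):
    """Assumed rule: a form is relevant if it covers enough distinct attributes."""
    matched = set(match_form(domain, field_labels).values())
    return len(matched) >= min_matched


books = DomainDefinition(
    name="books",
    attributes=[
        Attribute("title", ["title", "book title"]),
        Attribute("author", ["author", "written by"]),
        Attribute("isbn", ["isbn"]),
    ],
)

search_form = ["Title", "Author", "Keyword"]
login_form = ["Username", "Password"]

print(match_form(books, search_form))
print(is_relevant(books, search_form), is_relevant(books, login_form))
```

In this sketch the book-search form is accepted because two of its labels map to domain attributes, while the login form is rejected. A real crawler would also need to associate labels with form fields on the rendered page and learn how to fill and submit them, which the sketch leaves out.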

Author information

Authors and Affiliations

Department of Information and Communications Technologies, University of A Coruña, 15071 A Coruña, Spain
Manuel Álvarez, Juan Raposo, Alberto Pan, Fidel Cacheda, Fernando Bellas & Víctor Carneiro

Editor information

Osvaldo Gervasi, Marina L. Gavrilova

About this paper

Cite this paper

Álvarez, M., Raposo, J., Pan, A., Cacheda, F., Bellas, F., Carneiro, V. (2007). Crawling the Content Hidden Behind Web Forms. In: Gervasi, O., Gavrilova, M.L. (eds) Computational Science and Its Applications – ICCSA 2007. ICCSA 2007. Lecture Notes in Computer Science, vol 4706. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74477-1_31

DOI: https://doi.org/10.1007/978-3-540-74477-1_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74475-7
Online ISBN: 978-3-540-74477-1
eBook Packages: Computer Science, Computer Science (R0)