
Crawling the Content Hidden Behind Web Forms

  • Conference paper
Computational Science and Its Applications – ICCSA 2007 (ICCSA 2007)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4706)

Abstract

The crawler engines of today cannot reach most of the information contained in the Web. A great amount of valuable information is “hidden” behind the query forms of online databases, and/or is dynamically generated by technologies such as JavaScript. This portion of the Web is usually known as the Deep Web or the Hidden Web. We have built DeepBot, a prototype hidden-web crawler able to access such content. DeepBot receives as input a set of domain definitions, each one describing a specific data-collecting task, and automatically identifies and learns to execute queries on the forms relevant to them. In this paper we describe the techniques employed for building DeepBot and report the experimental results obtained when testing it with several real-world data collection tasks.
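
As a rough illustration of the idea of matching a domain definition against a web form, the following Python sketch may help. It is not DeepBot's actual method, which the abstract does not detail. Every name here (`Attribute`, `DomainDefinition`, `match_form`, `is_relevant`), the use of alias lists, the `difflib` text similarity, and the threshold values are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Attribute:
    """Hypothetical domain attribute: a name plus alternative labels."""
    name: str
    aliases: list[str] = field(default_factory=list)


@dataclass
class DomainDefinition:
    """Hypothetical description of one data-collecting task."""
    name: str
    attributes: list[Attribute]


def similarity(a: str, b: str) -> float:
    # Case-insensitive text similarity in [0, 1]; an assumed stand-in
    # for whatever label-matching measure a real crawler would use.
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_form(domain: DomainDefinition, field_labels: list[str],
               threshold: float = 0.8) -> dict[str, str]:
    """Map each domain attribute to its best-matching form label, if any."""
    mapping: dict[str, str] = {}
    for attr in domain.attributes:
        candidates = [attr.name, *attr.aliases]
        best_label, best_score = None, 0.0
        for label in field_labels:
            score = max(similarity(c, label) for c in candidates)
            if score > best_score:
                best_label, best_score = label, score
        if best_label is not None and best_score >= threshold:
            mapping[attr.name] = best_label
    return mapping


def is_relevant(domain: DomainDefinition, field_labels: list[str],
                min_fraction: float = 0.5) -> bool:
    """Treat a form as relevant if enough attributes found a matching field."""
    if not domain.attributes:
        return False
    matched = match_form(domain, field_labels)
    return len(matched) / len(domain.attributes) >= min_fraction


# Usage: a made-up "books" domain checked against two made-up forms.
books = DomainDefinition("books", [
    Attribute("title", ["book title"]),
    Attribute("author", ["writer"]),
    Attribute("isbn"),
])
search_form = ["Title", "Author", "Publisher"]
login_form = ["Username", "Password"]
print(match_form(books, search_form))
print(is_relevant(books, search_form), is_relevant(books, login_form))
```

In this sketch a form counts as relevant when at least half of the domain's attributes can be paired with a field label. That cutoff is arbitrary, and a real crawler would likely also need to learn how to fill in and submit the matched fields.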

This research was partially supported by the Spanish Ministry of Education and Science under project TSI2005-07730.



Editor information

Osvaldo Gervasi, Marina L. Gavrilova

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Álvarez, M., Raposo, J., Pan, A., Cacheda, F., Bellas, F., Carneiro, V. (2007). Crawling the Content Hidden Behind Web Forms. In: Gervasi, O., Gavrilova, M.L. (eds) Computational Science and Its Applications – ICCSA 2007. ICCSA 2007. Lecture Notes in Computer Science, vol 4706. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74477-1_31

  • DOI: https://doi.org/10.1007/978-3-540-74477-1_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-74475-7

  • Online ISBN: 978-3-540-74477-1

  • eBook Packages: Computer Science, Computer Science (R0)
