Crawling Web Pages with Support for Client-Side Dynamism

  • Conference paper
Advances in Web-Age Information Management (WAIM 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4016)

Abstract

A large amount of information on the web cannot be accessed by conventional crawler engines. This portion of the web is usually known as the Hidden Web. Dealing with it requires solving two tasks: crawling the client-side hidden web and crawling the server-side hidden web. In this paper we present an architecture and a set of related techniques for accessing the information placed in web pages with support for client-side dynamism, dealing with aspects such as JavaScript technology, non-standard session maintenance mechanisms, client redirections, pop-up menus, etc. Our approach leverages current browser APIs and implements novel crawling models and algorithms.

This research was partially supported by the Spanish Ministry of Education and Science under project TSI2005-07730.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Álvarez, M., Pan, A., Raposo, J., Hidalgo, J. (2006). Crawling Web Pages with Support for Client-Side Dynamism. In: Yu, J.X., Kitsuregawa, M., Leong, H.V. (eds) Advances in Web-Age Information Management. WAIM 2006. Lecture Notes in Computer Science, vol 4016. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11775300_22

  • DOI: https://doi.org/10.1007/11775300_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-35225-9

  • Online ISBN: 978-3-540-35226-6

  • eBook Packages: Computer Science, Computer Science (R0)
