Abstract
A key problem in retrieving, integrating and mining rich, high-quality information from the massive number of Deep Web Databases (WDBs) online is how to automatically and effectively discover and recognize the entry points of domain-specific WDBs, i.e., forms, on the Web. This is a challenging task because the forms of domain-specific WDBs have dynamic and heterogeneous properties and are very sparsely distributed over several trillion Web pages. Although significant efforts have been made to address the problem and its special cases, more effective solutions are still needed to achieve a satisfactory harvest rate and coverage rate of domain-specific WDBs’ forms simultaneously. In this paper, an Enhanced Form-Focused Crawler for domain-specific WDBs (E-FFC) is proposed as a novel framework to address the limitations of existing solutions. The E-FFC, based on a divide-and-conquer strategy, employs a series of novel and effective strategies and algorithms, including a two-step page classifier, a link scoring strategy, classifiers for advanced searchable and domain-specific forms, and crawling stopping criteria, with the aim of optimizing the harvest rate and coverage rate of domain-specific WDBs’ forms simultaneously. Experiments with the E-FFC over a number of real Web pages in a set of representative domains have been conducted, and the results show that the E-FFC outperforms existing domain-specific Deep Web form-focused crawlers in terms of harvest rate, coverage rate and crawling robustness.
Acknowledgements
We are grateful to the anonymous reviewers and the Editor for their valuable comments and suggestions. This work was supported by the National Natural Science Foundation of China (No. 61272119).
About this article
Cite this article
Li, Y., Wang, Y. & Du, J. E-FFC: an enhanced form-focused crawler for domain-specific deep web databases. J Intell Inf Syst 40, 159–184 (2013). https://doi.org/10.1007/s10844-012-0221-8
DOI: https://doi.org/10.1007/s10844-012-0221-8