Automated Extraction of Hit Numbers from Search Result Pages

Ling, Yanyan; Meng, Xiaofeng; Meng, Weiyi

doi:10.1007/11775300_7

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4016)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

When a query is submitted to a search engine, the search engine returns a dynamically generated result page that contains the number of hits (i.e., the number of matching results) for the query. The hit number is a useful piece of information in many important applications, such as obtaining document frequencies of terms, estimating the sizes of search engines, and generating search engine summaries. In this paper, we propose a novel technique for automatically identifying the hit number for any search engine and any query. The technique consists of three steps: first, segment each result page into a set of blocks; then, identify the block(s) that contain the hit number using a machine learning approach; and finally, extract the hit number from the identified block(s) by comparing the patterns in multiple blocks from the same search engine. Experimental results indicate that this technique is highly accurate.
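The abstract does not spell out how the third step works, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes the hit-number blocks have already been identified for several queries on the same search engine, and it treats the one numeric field whose value changes across those blocks as the hit number. The names `numbers_in` and `extract_hit_numbers`, and the sample block strings, are hypothetical.

```python
import re
from typing import List, Optional

# A run of digits, optionally containing thousands separators (e.g. "45,600").
NUMBER = re.compile(r"\d[\d,]*")


def numbers_in(block: str) -> List[int]:
    """Return the integers found in a block, in order of appearance."""
    return [int(m.group().replace(",", "")) for m in NUMBER.finditer(block)]


def extract_hit_numbers(blocks: List[str]) -> Optional[List[int]]:
    """Guess the hit number in each block by comparing blocks with each other.

    Assumption (not from the paper): all blocks share one numeric layout,
    and exactly one numeric field varies from query to query.
    """
    rows = [numbers_in(b) for b in blocks]
    # Comparison needs at least two blocks with the same count of numbers.
    if len(rows) < 2 or len({len(r) for r in rows}) != 1:
        return None
    # Positions whose value differs between blocks are candidate hit numbers.
    varying = [i for i in range(len(rows[0])) if len({r[i] for r in rows}) > 1]
    if len(varying) != 1:
        return None  # no candidate, or more than one: ambiguous
    return [r[varying[0]] for r in rows]


blocks = [
    "Results 1 - 10 of about 1,230 for apple",
    "Results 1 - 10 of about 45,600 for banana",
    "Results 1 - 10 of about 87 for cherry",
]
print(extract_hit_numbers(blocks))
```

In this sketch the constant fields ("1" and "10") are discarded because they are identical in every block, which leaves only the field that depends on the query. The heuristic fails when a query itself contains digits or when an engine changes its layout between pages, which are cases a real system would need to handle.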

Author information

Authors and Affiliations

School of Information, Renmin University of China, China
Yanyan Ling & Xiaofeng Meng
Dept. of Computer Science, SUNY at Binghamton, Binghamton, NY, 13902, USA
Weiyi Meng

Editor information

Editors and Affiliations

Chinese University of Hong Kong, Hong Kong, China
Jeffrey Xu Yu
Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, 153-8505, Tokyo, Japan
Masaru Kitsuregawa
Department of Computing, Hong Kong Polytechnic University, Hong Kong
Hong Va Leong

About this paper

Cite this paper

Ling, Y., Meng, X., Meng, W. (2006). Automated Extraction of Hit Numbers from Search Result Pages. In: Yu, J.X., Kitsuregawa, M., Leong, H.V. (eds) Advances in Web-Age Information Management. WAIM 2006. Lecture Notes in Computer Science, vol 4016. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11775300_7

DOI: https://doi.org/10.1007/11775300_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35225-9
Online ISBN: 978-3-540-35226-6
eBook Packages: Computer Science, Computer Science (R0)
