Abstract
Web document indexing is an important part of every search engine (SE), and indexing quality has a strong effect on retrieval effectiveness. A document index is a set of terms that represents the content (topic) of a document and helps distinguish it from the other documents in the collection. An index that is too small can lead to poor results and may miss relevant items. An index that is too large retrieves many useful documents along with a significant number of irrelevant ones, and it reduces both search speed and search effectiveness. Although the problem has been studied for many years, there is still no algorithm for finding the optimal index size and set of index terms. This paper shows how different attributes of a web document (namely Title, Anchor and Emphasize) contribute to average precision in the search process. The experiments are carried out on the WT10g collection, a corpus of 1.69 million pages.
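Since the abstract reports results in terms of average precision, the following minimal sketch shows how that measure is commonly computed for a single query. It uses the standard non-interpolated definition; the abstract does not state the exact evaluation procedure, so the function name, the document identifiers and the assumption that a ranking contains no duplicates are illustrative only.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one query.

    ranked_ids:   document ids in the order the engine returned them
                  (assumed to contain no duplicates).
    relevant_ids: ids judged relevant for the query.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0

    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant document appears.
            precision_sum += hits / rank

    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


# Hypothetical ranking: relevant documents appear at ranks 1 and 3.
score = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

Averaging this value over a set of queries gives the mean figure that is typically used to compare index configurations, for example an index built from title terms only against one that also includes anchor text.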
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hyusein, B., Patel, A. (2003). Web Document Indexing and Retrieval. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2003. Lecture Notes in Computer Science, vol 2588. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36456-0_62
DOI: https://doi.org/10.1007/3-540-36456-0_62
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00532-2
Online ISBN: 978-3-540-36456-6
eBook Packages: Springer Book Archive