Metadata Based Web Mining for Topic-Specific Information Gathering

  • Conference paper
Electronic Commerce and Web Technologies (EC-Web 2000)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 1875))

Abstract

As the World Wide Web grows at an exponential rate, we face the problem of rating pages in terms of quality and trust. In this situation, with significant linkage among web pages, what other pages say about a web page can be as important as, and more objective than, what the page says about itself. The cumulative knowledge of such recommendations (or the lack of them) can help a system decide whether or not to pursue a page. This metadata can also be used by a web robot program, for example, to derive summary information about web documents written in a foreign language. In this paper, we describe how we exploit this type of metadata to drive a web information gathering system, which forms the backend of a topic-specific search engine. The system uses metadata from hyperlinks to guide its crawl of the web so that it stays focused on a target topic. The crawler follows links that point to information related to the topic and avoids following links to irrelevant pages. Moreover, the system uses the metadata to improve its definition of the target topic through association mining. Ultimately, the guided crawling system builds a rich repository of metadata, which is used to serve the search engine.
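
The guided crawl described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how hyperlink metadata is scored, so the sketch assumes the metadata is the anchor text of each link and that relevance is the fraction of topic terms found in that text. The names `link_score`, `focused_crawl`, `threshold`, and the in-memory `graph` are hypothetical and stand in for a real fetcher and link extractor.

```python
import heapq


def link_score(anchor_text, topic_terms):
    """Assumed relevance measure: fraction of topic terms present in the anchor text."""
    if not topic_terms:
        return 0.0
    words = set(anchor_text.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def focused_crawl(graph, seeds, topic_terms, threshold=0.3, max_pages=100):
    """Visit pages in order of link relevance, skipping links that score below threshold.

    graph maps a URL to a list of (target_url, anchor_text) pairs; it is a
    hypothetical stand-in for downloading a page and extracting its links.
    """
    # heapq is a min-heap, so scores are negated to pop the best link first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    queued = set(seeds)
    visited = []

    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for target, anchor_text in graph.get(url, []):
            if target in queued:
                continue
            score = link_score(anchor_text, topic_terms)
            if score >= threshold:
                # What the linking page says about the target decides whether to follow it.
                queued.add(target)
                heapq.heappush(frontier, (-score, target))
    return visited


if __name__ == "__main__":
    graph = {
        "a": [("b", "xml schema tutorial"), ("c", "cheap flights")],
        "b": [("d", "xml parser")],
        "c": [("e", "xml tools")],
    }
    print(focused_crawl(graph, ["a"], {"xml", "schema"}))
```

In this toy graph, page `c` is never fetched because the link pointing to it carries no topic terms, so page `e` is never discovered either. A link rejected once can still be accepted later if another page describes the same target in more relevant terms.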

Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yi, J., Sundaresan, N., Huang, A. (2000). Metadata Based Web Mining for Topic-Specific Information Gathering. In: Bauknecht, K., Madria, S.K., Pernul, G. (eds) Electronic Commerce and Web Technologies. EC-Web 2000. Lecture Notes in Computer Science, vol 1875. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44463-7_31

  • DOI: https://doi.org/10.1007/3-540-44463-7_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-67981-3

  • Online ISBN: 978-3-540-44463-3

  • eBook Packages: Springer Book Archive
