Measuring structural similarity among web documents: preliminary results

  • Part III: EP'98
  • Conference paper
Electronic Publishing, Artistic Imaging, and Digital Typography (RIDT 1998)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1375)

Abstract

When we describe a Web page informally, we often use phrases like “it looks like a newspaper site”, “there are several unordered lists” or “it's just a collection of links”. Unfortunately, no Web search or classification tools provide the capability to retrieve information using such informal descriptions that are based on the appearance, i.e., structure, of the Web page. In this paper, we take a look at the concept of structurally similar Web pages. We note that some structural properties can be identified with semantic properties of the data and provide measures for comparison between HTML documents.

Research supported in part by the National Science Foundation under CAREER Award IRI-9896052. URLs: www.cs.wpi.edu/~ifc and casa.wpi.edu

Editor information

Roger D. Hersch, Jacques André, Heather Brown

Copyright information

© 1998 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cruz, I.F., Borisov, S., Marks, M.A., Webb, T.R. (1998). Measuring structural similarity among web documents: preliminary results. In: Hersch, R.D., André, J., Brown, H. (eds) Electronic Publishing, Artistic Imaging, and Digital Typography. RIDT 1998. Lecture Notes in Computer Science, vol 1375. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0053296

  • DOI: https://doi.org/10.1007/BFb0053296

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-64298-5

  • Online ISBN: 978-3-540-69718-3

  • eBook Packages: Springer Book Archive
