Structure Formation in the Web

Toward A Graph Model of Hypertext Types

Linguistic Modeling of Information and Markup Languages

Part of the book series: Text, Speech and Language Technology (TLTB, volume 41)

Abstract

In this chapter we develop a representation model of web document networks. Based on the notion of uncertain web document structures, the model is defined as a template that captures the nested manifestation levels of hypertext types. We then specify the model on the conceptual, formal and physical levels and exemplify it by reconstructing competing web document models.

Acknowledgements

Financial support of the Deutsche Forschungsgemeinschaft (DFG) via the project Induction of Web Genre Document Grammars of the Research Group 437 Text Technological Information Modeling and via the Project KnowCIT of the Excellence Cluster 277 Cognitive Interaction Technology is gratefully acknowledged.

Author information

Correspondence to Alexander Mehler.

Copyright information

© 2010 Springer Science+Business Media B.V.

About this chapter

Cite this chapter

Mehler, A. (2010). Structure Formation in the Web. In: Witt, A., Metzing, D. (eds) Linguistic Modeling of Information and Markup Languages. Text, Speech and Language Technology, vol 41. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-3331-4_12

  • DOI: https://doi.org/10.1007/978-90-481-3331-4_12

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-90-481-3330-7

  • Online ISBN: 978-90-481-3331-4

  • eBook Packages: Computer Science, Computer Science (R0)
