Mathematical Models of Textual Data: A Short Review

Mathematical Models and Methods for Planet Earth

Part of the book series: Springer INdAM Series (SINDAMS, volume 6)

Abstract

This contribution puts into a more systematic, updated and readable form some notes I used in preparing my talk at the INdAM Workshop Mathematical Models and Methods for Planet Earth, held in Rome in May 2013. The aim was to discuss some recent mathematical approaches to textual data analysis, focusing on literary texts and on a few specific topics and examples: universal statistical properties of written language, and the nature of long-range correlations in literary texts, with specific applications to authorship attribution, keyword extraction, and automatic text generation.

Notes

  1. To be more precise, $d_n$ is a pseudo-distance, since it does not satisfy the triangle inequality and it is not even positive definite: two texts $X, Y$ can be at distance $d_n(X, Y) = 0$ without being the same.
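
The note can be made concrete with a small sketch. The definition of d_n is not reproduced in this excerpt, so the code below assumes a common n-gram profile dissimilarity from the authorship attribution literature: the sum of squared relative frequency differences over all n-grams seen in either text, divided by the total number of distinct n-grams in the two profiles. The function names `ngram_frequencies` and `d_n`, and this exact normalization, are assumptions made for illustration and may differ from the chapter's own definition.

```python
from collections import Counter


def ngram_frequencies(text: str, n: int) -> dict:
    """Relative frequencies of the overlapping character n-grams of `text`."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def d_n(x: str, y: str, n: int) -> float:
    """ASSUMED n-gram dissimilarity (illustrative, not taken from the chapter).

    Sums the squared relative frequency differences over every n-gram
    occurring in x or y, then divides by the number of distinct n-grams
    of x plus the number of distinct n-grams of y.
    """
    fx, fy = ngram_frequencies(x, n), ngram_frequencies(y, n)
    grams = set(fx) | set(fy)
    if not grams:  # both texts are shorter than n
        return 0.0
    total = 0.0
    for gram in grams:
        a, b = fx.get(gram, 0.0), fy.get(gram, 0.0)
        total += ((a - b) / (a + b)) ** 2
    return total / (len(fx) + len(fy))


if __name__ == "__main__":
    # Not positive definite: distinct texts with identical 1-gram profiles.
    print(d_n("abab", "baba", 1))
    # Triangle inequality fails: going through "ab" is shorter than the direct path.
    print(d_n("a", "b", 1), d_n("a", "ab", 1) + d_n("ab", "b", 1))
```

Under this assumed definition, "abab" and "baba" are different texts at distance zero for n = 1, because they share the same letter frequencies. The texts "a", "ab" and "b" violate the triangle inequality, because the squared terms make the direct distance between "a" and "b" larger than the sum of the two distances through "ab".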

References

  1. Allegrini, P., Grigolini, P., Palatella, L.: Intermittency and scale-free networks: a dynamical model for human language complexity. Chaos, Solitons and Fractals 20, 95–105 (2004)

  2. Altmann, E.G., Cristadoro, G., Degli Esposti, M.: On the origin of long-range correlations in texts. Proceedings of the National Academy of Sciences 109, 11582–11587 (2012)

  3. Alvarez-Lacalle, E., Dorow, B., Eckmann, J.P., Moses, E.: Hierarchical structures induce long-range dynamical correlations in written texts. Proc Natl Acad Sci USA 103, 7956–7961 (2006)

  4. Amit, M., Shmerler, Y., Eisenberg, E., Abraham, M., Shnerb, N.: Language and codification dependence of long-range correlations in texts. Fractals 2, 7–13 (1994)

  5. Basile, C., Benedetto, D., Caglioti, E., Degli Esposti, M.: An example of mathematical authorship attribution. J. Math. Phys. 41, 125–211 (2008)

  6. Benedetto, D., Caglioti, E., Degli Esposti, M.: The unreasonable effectiveness of Mathematics in Human Science: the attribution of texts by Antonio Gramsci. In: Emmer, M. (ed.) Imagine Math-Between Culture and Mathematics, pp. 143–154. Springer-Verlag Italia, Milano (2012)

  7. Benedetto, D., Degli Esposti, M., Maspero, G.: The puzzle of Basil’s Epistula 38: a mathematical approach to a philological problem. Journal of Quantitative Linguistics (2013)

  8. Benedetto, D., Caglioti, E., Loreto, V.: Language trees and zipping. Physical Review Letters 88(4), 48702 (2002)

  9. Bennet, W.R.: Scientific and engineering problem-solving with the computer. Prentice-Hall, Englewood Cliffs, NJ (1976)

  10. Bernhardsson, S., da Rocha, L.E.C., Minnhagen, P.: The meta book and size-dependent properties of written language. New Journal of Physics 11, 123015 (2009)

  11. Clement, R., Sharp, D.: Ngram and Bayesian classification of documents for topic and Authorship. Lit. Ling. Comp. 18, 423 (2003)

  12. Conrad, B., Mitzenmacher, M.: Power laws for monkeys typing randomly: the case of unequal probabilities. IEEE Transactions on Information Theory 50, 1403–1414 (2004)

  13. Dickman, R., Moloney, N.R., Altmann, E.G.: Analysis of an information-theoretic model for communication. Journal of Statistical Mechanics: Theory and Experiment 12, 12022 (2012)

  14. Ebeling, W., Neiman, A.: Long-range correlations between letters and sentences in texts. Physica A 215, 233–241 (1995)

  15. Ebeling, W., Pöschel, T.: Entropy and long-range correlations in literary English. Europhys Lett 26, 241–246 (1994)

  16. http://www.voynich.nu/extra/eva.html

  17. Fedwick, P.J.: Bibliotheca Basiliana Universalis I. Brepols-Turnhout, pp. 620–623, 674–678 (1993)

  18. Ferrer i Cancho, R., Solé, R.V.: Two regimes in the frequency of words and the origins of complex lexicons: Zipf’s law revisited. Journal of Quantitative Linguistics 8(3), 165–173 (2001)

  19. Ferrer i Cancho, R., Solé, R.V.: Least effort and the origins of scaling in human language. Proc. Natl. Acad. Sci. USA 100(3), 788–791 (2003)

  20. Ferrer i Cancho, R., Elvevag, B.: Random texts do not exhibit the real Zipf’s law-like rank distribution. PLoS ONE 5, 9411 (2010)

  21. Kešelj, V., Peng, F., Cercone, N., Thomas, C.: N-gram-based author profiles for authorship attribution. In: Kešelj, V., Endo, T. (eds.) PACLING’03. Proceedings of the Conference Pacific Association for Computational Linguistics, pp. 255–264. Dalhousie University, Halifax (2003)

  22. Juola, P.: Authorship Attribution. FNT in Information Retrieval 1, 233–334 (2007)

  23. Landini, G.: Evidence of linguistic structure in the Voynich manuscript using spectral analysis. Cryptologia 25, 275–295 (2001)

  24. Lempel, A., Ziv, J.: On the complexity of finite sequences. IEEE Trans. Inform. Theory IT 22(1), 75–81 (1976)

  25. Lempel, A., Ziv, J.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23(3), 337–343 (1977)

  26. Lempel, A., Ziv, J.: Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory IT 24(5), 530–536 (1978)

  27. Lü, L., Zhang, Z.K., Zhou, T.: Zipf’s law leads to Heaps’ law: analyzing their relation in finite-size systems. PLoS ONE 5, e14139 (2010)

  28. Li, W.: Random texts exhibit Zipf’s-law-like word frequency distribution. IEEE T Inform Theory 38, 1842–1845 (1992)

  29. Mandelbrot, B.: An informational theory of the statistical structure of language. In: Jackson, W. (ed.) Communication Theory. Butterworths, London (1953)

  30. Maspero, G., Leal, J.: Revisiting Tertullian’s Authorship of the Passio Perpetuae through Quantitative Analysis. In: Grzybek, P., Kelih, E., Mačutek, J. (eds.) Text and Language: Structures, Functions, Interrelations, Quantitative Perspectives, pp. 99–108. Wien (2010)

  31. Melnyk, S.S., Usatenko, O.V., Yampolskii, V.A.: Competition between two kinds of correlations in literary texts. Phys Rev E 72, 026140 (2005)

  32. Meredith, A.: Gregory of Nyssa. Routledge, London, New York (1999)

  33. Mitzenmacher, M.: A brief history of generative models for power law and lognormal distributions. Internet Mathematics 1, 226–251 (2003)

  34. Montemurro, M.A.: Beyond the Zipf-Mandelbrot law in quantitative linguistics. Physica A: Statistical Mechanics and its Applications 300, 567–578 (2001)

  35. Montemurro, M.A., Zanette, D.: Towards the quantification of the semantic information encoded in written language. Adv Comp Syst 13, 135–153 (2010)

  36. Montemurro, M.A., Zanette, D.: Keywords and Co-Occurrence Patterns in the Voynich Manuscript: An Information-Theoretic Analysis. PLoS ONE 8, e66344 (2013)

  37. Newman, M.E.J.: Power laws, Pareto distributions and Zipf’s law. Contemporary physics 46, 323–351 (2005)

  38. Rousseau, P.: Basil of Caesarea. University of California Press, Berkeley (CA), Los Angeles (CA), London (1998)

  39. Schenkel, A., Zhang, J., Zhang, Y.: Long range correlation in human writings. Fractals 1, 47–55 (1993)

  40. Schinner, A.: The Voynich Manuscript: Evidence of the Hoax Hypothesis. Cryptologia 31, 95–107 (2007)

  41. Stamatatos, E.: A Survey of Modern Authorship Attribution Methods. Jour. Am. Soc. Infor. Sci. Tech. 60, 538–556 (2009)

  42. Suzuki, R., Tyack, P.L., Buck, J.: The use of Zipf’s law in animal communication analysis. Anim. Behav. 69, 9–17 (2005)

  43. Wyner, A.D., Ziv, J., Wyner, A.J.: On the role of pattern matching in information theory. IEEE Transactions on Information Theory 44(6), 2045–2056 (1998)

  44. Zanette, D.: Statistical Patterns in Written Language, available at http://fisica.cab.cnea.gov.ar/estadistica/zanette/

  45. Zanette, D., Montemurro, M.A.: Dynamics of text generation with realistic Zipf’s distribution. J Quantitative Linguistics 12, 29–40 (2005)

  46. Ziv, J., Merhav, N.: A measure of relative entropy between individual sequences with application to universal classification. IEEE Transactions on Information Theory 39(4), 1270–1279 (1993)

  47. The digital library has been developed by University of California, Irvine. http://www.tlg.uci.edu

Author information

Corresponding author

Correspondence to Mirko Degli Esposti.

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Esposti, M.D. (2014). Mathematical Models of Textual Data: A Short Review. In: Celletti, A., Locatelli, U., Ruggeri, T., Strickland, E. (eds) Mathematical Models and Methods for Planet Earth. Springer INdAM Series, vol 6. Springer, Cham. https://doi.org/10.1007/978-3-319-02657-2_8
