DZ A text compression algorithm for natural languages

  • Conference paper
Combinatorial Pattern Matching (CPM 1992)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 644))

Abstract

«Texts written in a natural language are essentially made of words of this language». We use this obvious fact, together with an extensive lexicon, to define a good model of the statistical behavior of letters in texts. This model is combined with arithmetic coding to build an efficient universal data compression method. Our method was initially specialized for the compression of French texts, but it can easily be adapted to other languages. Tests show that our method achieves an average compression ratio of 30% on French texts, whereas Ziv & Lempel's method yields an average ratio of 40% on the same texts. On other kinds of test files (English text, executable files, source code), using an order 1 Markov chain gives results of the same order as Ziv & Lempel's. We also present a new approach to dynamic dictionary construction for natural language compression. The fact, well known to linguists, that the number of different words is small makes a dynamic construction possible.
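As an illustration of the order 1 Markov chain mentioned above, the following minimal sketch estimates how many bits an ideal arithmetic coder would spend on a byte string when each symbol is predicted from the one before it. It is not the DZ method: the lexicon-based model is not reproduced here, and the function names, the 256-symbol alphabet, the add-one smoothing, and the fixed starting context are all assumptions made for the example. The ratio is taken to mean compressed size divided by original size, which is consistent with the abstract treating 30% as better than 40%.

```python
import math
from collections import defaultdict

ALPHABET_SIZE = 256  # assumption: the model works on bytes


def order1_code_length_bits(data: bytes) -> float:
    """Ideal code length, in bits, of `data` under an adaptive order 1 model.

    An arithmetic coder driven by the same probabilities would produce an
    output close to this length; the coder itself is not implemented here.
    """
    # counts[prev][sym] = number of times `sym` has followed `prev` so far
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)  # totals[prev] = symbols seen after `prev`
    bits = 0.0
    prev = 0  # assumption: a fixed context before the first symbol
    for sym in data:
        # Add-one smoothing keeps unseen symbols at a non-zero probability.
        p = (counts[prev][sym] + 1) / (totals[prev] + ALPHABET_SIZE)
        bits += -math.log2(p)
        # Update the model after coding, as a decoder could do the same.
        counts[prev][sym] += 1
        totals[prev] += 1
        prev = sym
    return bits


def compression_ratio(data: bytes) -> float:
    """Estimated compressed size divided by original size (non-empty input)."""
    return order1_code_length_bits(data) / (8 * len(data))
```

Calling `compression_ratio(text.encode("utf-8"))` on a sample text gives a rough estimate of what such a model could achieve; a lexicon-based model as described in the abstract would replace the single-symbol context with information about the word being read.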

Bibliography

  1. D. ABRAHAMSON, An adaptive dependency source model for data compression, Commun. ACM, 32,1 (1989), 77–83.

  2. T.C. BELL, J.G. CLEARY, I.H. WITTEN, Text Compression, Prentice Hall advanced reference series, 1990, ISBN 0-13-911991-4.

  3. P.F. BROWN, S.A. DELLA PIETRA, V.J. DELLA PIETRA, J.C. LAI, R.L. MERCER, An Estimate of an Upper Bound for the Entropy of English, Preprint (1991).

  4. G.V. CORMACK, R.N.S. HORSPOOL, Data compression using dynamic Markov modelling, Comput. J., 30,6 (1987), 541–550.

  5. M. GUAZZO, A general minimum-redundancy source-coding algorithm, IEEE Trans. Inform. Theory, 26,1 (1980), 15–25, January.

  6. D.A. HUFFMAN, A method for the construction of minimum redundancy codes, Proc. IRE, 40 (1952), 1098–1101, September.

  7. D. REVUZ, Dictionnaires et Lexiques Méthodes et Algorithmes, Thèse de Doctorat, Université Paris 7, (1991).

  8. J. RISSANEN, Generalized Kraft inequality and Arithmetic coding, IBM J. Res. Dev., 20 (1976), 198–203, May.

  9. J. RISSANEN, G.G. LANGDON Jr., Arithmetic coding, IBM J. Res. Dev., 23,2 (1979), 149–162, March.

  10. T.A. WELCH, A technique for high-performance data compression, IEEE Computer, 17,6 (1984), 8–19, June.

  11. I.H. WITTEN, R.M. NEAL, J.G. CLEARY, Arithmetic coding for data compression, Commun. ACM, 30,6 (1987), 520–540, June.

  12. J. ZIV, A. LEMPEL, A Universal algorithm for sequential data compression, IEEE Trans. Inform. Theory, 23,3 (1977), 337–343, May.

  13. J. ZIV, A. LEMPEL, Compression of individual sequences via variable-rate coding, IEEE Trans. Inform. Theory, 24,5 (1978), 530–536, September.

Editor information

Alberto Apostolico Maxime Crochemore Zvi Galil Udi Manber

Copyright information

© 1992 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Revuz, D., Zipstein, M. (1992). DZ A text compression algorithm for natural languages. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds) Combinatorial Pattern Matching. CPM 1992. Lecture Notes in Computer Science, vol 644. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-56024-6_16

  • DOI: https://doi.org/10.1007/3-540-56024-6_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-56024-1

  • Online ISBN: 978-3-540-47357-2

  • eBook Packages: Springer Book Archive
