Discovery of Frequent Word Sequences in Text

  • Conference paper
Pattern Detection and Discovery

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2447)

Abstract

We have developed a method that extracts all maximal frequent word sequences from the documents of a collection. A sequence is frequent if it appears in more than σ documents, where σ is a given frequency threshold. A sequence is maximal if no other frequent sequence contains it. The words of a sequence do not have to appear consecutively in the text.
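The definitions above can be illustrated with a minimal brute-force sketch. It is not the extraction method of the paper, only a direct reading of the definitions: it assumes that "contains" means subsequence containment with gaps allowed, and the function names, the `max_len` bound, and the sample documents are all hypothetical. Maximality is therefore only checked among candidates of at most `max_len` words.

```python
from itertools import combinations

def is_subsequence(pattern, words):
    """True if the items of pattern occur in words in the same order (gaps allowed)."""
    it = iter(words)
    return all(w in it for w in pattern)

def document_frequency(pattern, docs):
    """Number of documents that contain pattern as a subsequence."""
    return sum(is_subsequence(pattern, d) for d in docs)

def maximal_frequent_sequences(docs, sigma, max_len=4):
    """Brute force: enumerate subsequences up to max_len words, keep those that
    occur in more than sigma documents, then drop any that is contained in
    another frequent sequence."""
    candidates = set()
    for d in docs:
        for n in range(1, max_len + 1):
            candidates.update(combinations(d, n))
    frequent = {c for c in candidates if document_frequency(c, docs) > sigma}
    return {s for s in frequent
            if not any(s != t and is_subsequence(s, t) for t in frequent)}

# Hypothetical toy collection; sigma = 1 means "in more than one document".
docs = [
    "the frequent word sequences in text".split(),
    "frequent long word sequences".split(),
    "word order matters in text".split(),
]
result = maximal_frequent_sequences(docs, sigma=1)
```

Here `("frequent", "word", "sequences")` is frequent even though "long" intervenes in the second document, and shorter sequences such as `("word",)` are frequent but not maximal. Enumerating every subsequence is exponential in document length, so this sketch is only practical for tiny inputs.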

In this paper, we briefly describe the method for finding all maximal frequent word sequences in text and then extend it to extract generalized sequences from annotated texts, in which each word has a set of additional features (e.g. morphological features) attached to it. We aim to discover patterns that preserve as many features as possible while the frequency of the pattern still exceeds the given frequency threshold.
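The trade-off between preserved features and frequency can be sketched as follows. The abstract does not specify how annotations are represented or matched, so everything here is an assumption for illustration: each token is modelled as a set of `name=value` features, a pattern item matches a token when its features are a subset of the token's, and the helper names and example data are hypothetical.

```python
def tok(**feats):
    """Hypothetical helper: build an annotated token as a frozenset of 'name=value' features."""
    return frozenset(f"{k}={v}" for k, v in feats.items())

def matches(pattern, doc):
    """True if each pattern item matches some token of doc, in order, gaps allowed.
    A pattern item matches a token when all of its features are present on the token."""
    it = iter(doc)
    return all(any(item <= token for token in it) for item in pattern)

def document_frequency(pattern, docs):
    """Number of documents in which the generalized pattern occurs."""
    return sum(matches(pattern, d) for d in docs)

# Hypothetical annotated collection: "dogs bark", "dog barked", "cats sleep".
docs = [
    [tok(form="dogs", lemma="dog", pos="noun"), tok(form="bark", lemma="bark", pos="verb")],
    [tok(form="dog", lemma="dog", pos="noun"), tok(form="barked", lemma="bark", pos="verb")],
    [tok(form="cats", lemma="cat", pos="noun"), tok(form="sleep", lemma="sleep", pos="verb")],
]

# Three patterns, from most specific to most general.
full = [tok(form="dogs", lemma="dog", pos="noun"), tok(form="bark", lemma="bark", pos="verb")]
by_lemma = [tok(lemma="dog", pos="noun"), tok(lemma="bark", pos="verb")]
by_pos = [tok(pos="noun"), tok(pos="verb")]

freqs = [document_frequency(p, docs) for p in (full, by_lemma, by_pos)]
```

With a threshold of σ = 1, the fully specified pattern occurs in only one document and is not frequent, while both generalized patterns are. Of those two, `by_lemma` keeps more features, which is the kind of pattern the stated aim would prefer under these assumptions.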


Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ahonen-Myka, H. (2002). Discovery of Frequent Word Sequences in Text. In: Hand, D.J., Adams, N.M., Bolton, R.J. (eds) Pattern Detection and Discovery. Lecture Notes in Computer Science, vol 2447. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45728-3_14

  • DOI: https://doi.org/10.1007/3-540-45728-3_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44148-9

  • Online ISBN: 978-3-540-45728-2

  • eBook Packages: Springer Book Archive
