Efficient Text Mining with Optimized Pattern Discovery

  • Conference paper
Combinatorial Pattern Matching (CPM 2002)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2373)

Abstract

Rapid progress in computer and network technologies has made it easy to collect and store large amounts of unstructured or semi-structured text, such as Web pages, HTML/XML archives, e-mails, and text files. These text data can be thought of as large-scale text databases, and thus it becomes important to develop efficient tools to discover interesting knowledge from such text databases.

There is a large body of data mining research on discovering interesting rules or patterns from well-structured data, such as transaction databases with boolean or numeric attributes [1,8,13]. However, it is difficult to apply traditional data mining technologies directly to the text or semi-structured data mentioned above, since these text databases consist of (i) heterogeneous and (ii) huge collections of (iii) unstructured or semi-structured data. As a result, there have so far been only a small number of studies on text mining, e.g., [4,5,12,17].

Our research goal is to devise an efficient semi-automatic tool that supports human discovery from large text databases. To respond in real time, we therefore require a fast pattern discovery algorithm that runs in, e.g., O(n) to O(n log n) time on an unstructured data set of total size n. Furthermore, such an algorithm has to be robust in the sense that it can work on a large amount of noisy and incomplete data without any assumption about an unknown hypothesis class.

To achieve this goal, we adopt the framework of optimized pattern discovery [11], also known as agnostic PAC learning [10] in computational learning theory. In optimized pattern discovery, an algorithm tries to find a pattern from a hypothesis space that optimizes a given statistical measure, such as classification error [10], information entropy [11], or Gini index [6], to discriminate a set of interesting documents from a set of uninteresting ones. Recent developments in computational learning theory show that such an algorithm can approximate arbitrary distributions on data within a given class of hypotheses very well in the sense of classification accuracy [6,10].
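
As a rough illustration of the optimization problem described above, the following Python sketch takes the hypothesis space to be all substrings up to a fixed length and picks the one whose occurrence best separates interesting from uninteresting documents. This is a brute-force sketch under stated assumptions, not the algorithm of the paper: it enumerates every candidate and is far from the O(n) to O(n log n) running time the abstract asks for. The function names (`split_cost`, `candidate_substrings`, `best_pattern`), the `max_len` bound, the tie-breaking rule, and the use of a weighted two-way split as the objective are all hypothetical choices made for illustration.

```python
from math import log2


def error(p: float) -> float:
    """Classification error of predicting the majority class (positive fraction p)."""
    return min(p, 1.0 - p)


def entropy(p: float) -> float:
    """Binary information entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)


def gini(p: float) -> float:
    """Gini index of a binary distribution."""
    return 2.0 * p * (1.0 - p)


def split_cost(pattern, positives, negatives, impurity=gini):
    """Weighted impurity of the two groups: documents containing `pattern` and the rest."""
    pos_hit = sum(pattern in d for d in positives)
    neg_hit = sum(pattern in d for d in negatives)
    hit = pos_hit + neg_hit
    total = len(positives) + len(negatives)
    miss = total - hit
    pos_miss = len(positives) - pos_hit
    cost = 0.0
    if hit:
        cost += hit / total * impurity(pos_hit / hit)
    if miss:
        cost += miss / total * impurity(pos_miss / miss)
    return cost


def candidate_substrings(docs, max_len):
    """Assumed hypothesis space: every substring of length 1..max_len in any document."""
    return {
        d[i:i + k]
        for d in docs
        for k in range(1, max_len + 1)
        for i in range(len(d) - k + 1)
    }


def best_pattern(positives, negatives, impurity=gini, max_len=8):
    """Return (pattern, cost) minimizing the cost; ties go to longer, then smaller, strings."""
    candidates = candidate_substrings(positives + negatives, max_len)
    best = min(
        candidates,
        key=lambda s: (split_cost(s, positives, negatives, impurity), -len(s), s),
    )
    return best, split_cost(best, positives, negatives, impurity)
```

Passing `impurity=error` or `impurity=entropy` swaps the statistical measure without changing the search, which mirrors the way the framework treats the measure as a parameter of the optimization.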

This work is partially supported by the Ministry of Education, Science, Sports, and Culture, Grant-in-Aid for Scientific Research on Priority Areas Informatics (No. 14019070) 2002.

References

  1. R. Agrawal, R. Srikant, Fast algorithms for mining association rules, In Proc. VLDB’94, 487–499, 1994.

  2. H. Arimura, A. Wataki, R. Fujino, S. Arikawa, A fast algorithm for discovering optimal string patterns in large text databases, In Proc. 9th Int. Workshop on Algorithmic Learning Theory, LNAI 1501, 247–261, 1998.

  3. T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, S. Arikawa, Efficient substructure discovery from large semi-structured data, In Proc. 2nd SIAM Int’l. Conf. on Data Mining, 158–174, 2002.

  4. W. W. Cohen, Y. Singer, Context-sensitive learning methods for text categorization, J. ACM, 17(2), 141–173, 1999.

  5. M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery, Learning to construct knowledge bases from the World Wide Web, Artificial Intelligence, 118, 69–114, 2000.

  6. L. Devroye, L. Györfi, G. Lugosi, A probabilistic theory of pattern recognition, Springer-Verlag, 1996.

  7. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, The MIT Press, Cambridge, 1996.

  8. D. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, The MIT Press, Cambridge, 1996.

  9. T. Kasai, G. Lee, H. Arimura, S. Arikawa, K. Park, Linear-time longest-common-prefix computation in suffix arrays and its applications, In Proc. 12th Combinatorial Pattern Matching, LNCS 2089, 181–192, Springer-Verlag, 2001.

  10. M. J. Kearns, R. E. Schapire, L. M. Sellie, Toward efficient agnostic learning, Machine Learning, 17(2–3), 115–141, 1994.

  11. S. Morishita, On classification and regression, In Proc. 1st Int’l. Conf. on Discovery Science, LNAI 1532, 49–59, 1998.

  12. L. Parida, I. Rigoutsos, A. Floratos, D. Platt, Y. Gao, Pattern discovery on character sets and real-valued data: linear bound on irredundant motifs and an efficient polynomial time algorithm, In Proc. 11th ACM-SIAM Symposium on Discrete Algorithms, 297–308, 2000.

  13. F. Provost, M. Schkolnick, R. Srikant (eds.), Proc. 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, 2001.

  14. H. Sakamoto, H. Arimura, and S. Arikawa, Extracting partial structures from html documents, In Proc. 14th Florida Artificial Intelligence Research Symposium (FLAIRS’2001), Florida, AAAI, 264–268, May, 2001.

  15. K. Taniguchi, H. Sakamoto, H. Arimura, S. Shimozono and S. Arikawa, Mining semi-structured data by path expressions, In Proc. 4th Int’l. Conf. on Discovery Science, LNAI 2226, 378–388, Springer-Verlag, 2001.

  16. S. Shimozono, H. Arimura, S. Arikawa, Efficient discovery of optimal word-association patterns in large text databases, New Gener. Comp., 18, 49–60, 2000.

  17. J. T. L. Wang, G. W. Chirn, T. G. Marr, B. Shapiro, D. Shasha and K. Zhang, Combinatorial pattern discovery for scientific data: Some preliminary results, In Proc. SIGMOD’94, 115–125, 1994.

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Arimura, H. (2002). Efficient Text Mining with Optimized Pattern Discovery. In: Apostolico, A., Takeda, M. (eds) Combinatorial Pattern Matching. CPM 2002. Lecture Notes in Computer Science, vol 2373. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45452-7_2

  • DOI: https://doi.org/10.1007/3-540-45452-7_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43862-5

  • Online ISBN: 978-3-540-45452-6

  • eBook Packages: Springer Book Archive
