Improving Recall of Regular Expressions for Information Extraction

  • Conference paper
Web Information Systems Engineering - WISE 2012 (WISE 2012)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7651)

Abstract

Learning or writing regular expressions that identify instances of a specific concept in text documents with high precision and recall is challenging. Improving the precision of an initial regular expression is relatively easy: identify the false positives it covers and tweak the expression to exclude them. Improving recall is harder, because in the absence of tools for finding missed instances, false negatives can only be identified by manually analyzing all documents. We focus on partially automating the discovery of missing instances while soliciting minimal user feedback. We present a technique that identifies good generalizations of a regular expression, improving recall while retaining high precision. We empirically demonstrate the effectiveness of the proposed technique compared to existing methods and show results for a variety of tasks, such as identifying dates, phone numbers, product names, and course numbers, on real-world datasets.
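To make the precision/recall trade-off in the abstract concrete, the following is a minimal sketch, not the paper's method. It only scores three hand-written candidate expressions against a tiny labeled sample. The documents, the gold annotations, the three patterns, and the helper name `precision_recall` are all assumptions invented for illustration; the technique described above searches for generalizations with user feedback, which this sketch does not attempt.

```python
import re

def precision_recall(pattern, documents, gold):
    """Score `pattern` against `gold`, a set of (doc_index, matched_text) pairs."""
    regex = re.compile(pattern)
    found = set()
    for doc_id, text in enumerate(documents):
        for match in regex.finditer(text):
            found.add((doc_id, match.group(0)))
    true_positives = len(found & gold)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical sample: three phone numbers, written with different separators.
documents = [
    "Call 555-1234 before noon.",
    "Office line: 555 9876, fax 555.4321.",
    "Invoice 12-2012 is overdue.",
]
gold = {(0, "555-1234"), (1, "555 9876"), (1, "555.4321")}

# Hypothetical candidate expressions, from most specific to most general.
initial = r"\b\d{3}-\d{4}\b"            # misses the space and dot separators
generalized = r"\b\d{3}[-. ]\d{4}\b"    # relaxes only the separator
overgeneral = r"\b\d+[-. ]\d+\b"        # also relaxes the digit counts

for name, pattern in [("initial", initial),
                      ("generalized", generalized),
                      ("overgeneral", overgeneral)]:
    p, r = precision_recall(pattern, documents, gold)
    print(f"{name:12s} precision={p:.2f} recall={r:.2f}")
```

On this sample the initial expression is precise but finds only one of the three numbers. Relaxing the separator recovers the other two without adding false positives, while relaxing the digit counts as well also matches the invoice number and so lowers precision. Spotting the missed instances here required the gold labels, which is the manual effort the abstract describes as the bottleneck.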

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Murthy, K., P., D., Deshpande, P.M. (2012). Improving Recall of Regular Expressions for Information Extraction. In: Wang, X.S., Cruz, I., Delis, A., Huang, G. (eds) Web Information Systems Engineering - WISE 2012. WISE 2012. Lecture Notes in Computer Science, vol 7651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35063-4_33

  • DOI: https://doi.org/10.1007/978-3-642-35063-4_33

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35062-7

  • Online ISBN: 978-3-642-35063-4

  • eBook Packages: Computer Science, Computer Science (R0)
