Restoring Punctuation and Casing in English Text

  • Conference paper
AI 2009: Advances in Artificial Intelligence (AI 2009)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5866)

Abstract

This paper explores the use of machine learning techniques to restore punctuation and case in English text, as part of which it investigates the co-dependence of case information and punctuation. We achieve an overall F-score of .619 for the task using a variety of lexical and contextual features, and iterative retagging.
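
As a rough illustration of how such a task can be framed and scored, the sketch below strips case and punctuation from text to produce per-token labels, then computes an F-score over the non-default labels. The abstract does not specify the label set, features, learners, or exact evaluation protocol, so the tag names (`CAP`, `ALLCAPS`, `LOWER`, `NONE`), the punctuation inventory, and both helper functions are assumptions made for this example, not the authors' method.

```python
from typing import List, Tuple

# Hypothetical punctuation inventory; the paper's actual label set is not given here.
PUNCT = {".", ",", "?", "!", ";", ":"}


def to_tagged_tokens(text: str) -> List[Tuple[str, str, str]]:
    """Convert cased, punctuated text into (lowercased word, case tag, punctuation tag)."""
    tagged = []
    for raw in text.split():
        punct = "NONE"
        if raw[-1] in PUNCT:
            # Treat a single trailing punctuation mark as the token's label.
            punct = raw[-1]
            raw = raw[:-1]
        if not raw:
            continue  # token was punctuation only
        if raw.isupper() and len(raw) > 1:
            case = "ALLCAPS"
        elif raw[0].isupper():
            case = "CAP"
        else:
            case = "LOWER"
        tagged.append((raw.lower(), case, punct))
    return tagged


def f_score(gold: List[str], pred: List[str], default: str) -> float:
    """Balanced F-score over labels other than `default` (e.g. NONE or LOWER)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    true_pos = sum(1 for g, p in zip(gold, pred) if g == p and g != default)
    pred_pos = sum(1 for p in pred if p != default)
    gold_pos = sum(1 for g in gold if g != default)
    if true_pos == 0:
        return 0.0
    precision = true_pos / pred_pos
    recall = true_pos / gold_pos
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    tokens = to_tagged_tokens("Hello, world. NASA rocks")
    gold_punct = [punct for _, _, punct in tokens]
    # A made-up system output that places the full stop one token too late.
    pred_punct = [",", "NONE", ".", "NONE"]
    print(tokens)
    print(f_score(gold_punct, pred_punct, default="NONE"))
```

Scoring only the non-default labels keeps the many tokens with no punctuation or plain lowercase from inflating the result; whether the reported figure was computed this way is not stated in the abstract.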

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Baldwin, T., Joseph, M.P.A.K. (2009). Restoring Punctuation and Casing in English Text. In: Nicholson, A., Li, X. (eds) AI 2009: Advances in Artificial Intelligence. AI 2009. Lecture Notes in Computer Science, vol 5866. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10439-8_55

  • DOI: https://doi.org/10.1007/978-3-642-10439-8_55

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-10438-1

  • Online ISBN: 978-3-642-10439-8

  • eBook Packages: Computer Science, Computer Science (R0)
