Hybrid Text Segmentation for Hungarian Clinical Records

  • Conference paper
Advances in Artificial Intelligence and Its Applications (MICAI 2013)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8265)

Abstract

Clinical documents are becoming widely available to researchers who aim to develop resources and tools that may help clinicians in their work. While several approaches exist for processing English medical text, there are only a few for other languages. Moreover, word and sentence segmentation are commonly treated as simple engineering issues. In this study, we present the difficulties that arise when segmenting Hungarian clinical records and describe a complex method that produces normalized and segmented text. Our approach is a hybrid combination of a rule-based and an unsupervised statistical solution. The presented system is compared with other available and commonly used algorithms. These fail to segment clinical text (all of them reach F-scores below 75%), while our method scores above 90%. This means that, of the tools compared, only the hybrid tool described in this study can be used for the segmentation of Hungarian clinical texts in practical applications.
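The F-scores quoted in the abstract combine precision and recall of the detected boundaries. The abstract does not state the exact evaluation protocol, so the following is only a minimal sketch under one assumption: gold and predicted sentence boundaries are compared as sets of character offsets. The function name `boundary_f_score` and the sample offsets are hypothetical and chosen for illustration.

```python
def boundary_f_score(gold, predicted, beta=1.0):
    """Return (precision, recall, F-score) for two collections of boundary offsets.

    Assumption: a predicted boundary counts as correct only if its offset
    matches a gold boundary exactly. This is an illustrative protocol,
    not necessarily the one used in the paper.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0

    if precision + recall == 0.0:
        return precision, recall, 0.0

    beta_sq = beta * beta
    f_score = (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
    return precision, recall, f_score


# Hypothetical example: four gold boundaries, three predicted, two of them correct.
gold_offsets = [10, 25, 40, 58]
predicted_offsets = [10, 25, 47]
precision, recall, f1 = boundary_f_score(gold_offsets, predicted_offsets)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

With `beta=1.0` the score is the harmonic mean of precision and recall, so a segmenter that misses many boundaries or inserts many spurious ones is penalized either way.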

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Orosz, G., Novák, A., Prószéky, G. (2013). Hybrid Text Segmentation for Hungarian Clinical Records. In: Castro, F., Gelbukh, A., González, M. (eds) Advances in Artificial Intelligence and Its Applications. MICAI 2013. Lecture Notes in Computer Science, vol 8265. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45114-0_25

  • DOI: https://doi.org/10.1007/978-3-642-45114-0_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-45113-3

  • Online ISBN: 978-3-642-45114-0

  • eBook Packages: Computer Science, Computer Science (R0)
