Abstract
Clinical documents are becoming widely available to researchers who aim to develop resources and tools that may help clinicians in their work. While several approaches exist for processing English medical text, there are only a few for other languages. Moreover, word and sentence segmentation are commonly treated as simple engineering tasks. In this study, we present the difficulties that arise when segmenting Hungarian clinical records and describe a complex method that produces normalized and segmented text. Our approach is a hybrid combination of a rule-based and an unsupervised statistical solution. We compare the presented system with other available and commonly used algorithms. These fail to segment clinical text (all of them reach F-scores below 75%), while our method scores above 90%. This means that, among the tools compared, only the hybrid tool described in this study is suitable for segmenting Hungarian clinical texts in practical applications.
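The F-scores quoted above can be illustrated with a small sketch. The abstract does not say which F-score variant is used or how boundaries are matched, so the following assumes the balanced F1 measure computed over exact boundary positions. The function name `boundary_f_score` and the offsets are hypothetical, chosen only for illustration.

```python
def boundary_f_score(gold, predicted):
    """Return (precision, recall, F1) for two collections of boundary offsets.

    Assumption: a predicted boundary counts as correct only if it matches
    a gold boundary position exactly.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical character offsets at which a sentence ends.
gold_boundaries = [12, 40, 57, 90]
predicted_boundaries = [12, 40, 63]

precision, recall, f1 = boundary_f_score(gold_boundaries, predicted_boundaries)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Under this assumption, a segmenter that misses boundaries lowers recall and one that inserts spurious boundaries lowers precision, so a single score penalizes both kinds of error.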
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Orosz, G., Novák, A., Prószéky, G. (2013). Hybrid Text Segmentation for Hungarian Clinical Records. In: Castro, F., Gelbukh, A., González, M. (eds) Advances in Artificial Intelligence and Its Applications. MICAI 2013. Lecture Notes in Computer Science, vol 8265. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45114-0_25
DOI: https://doi.org/10.1007/978-3-642-45114-0_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45113-3
Online ISBN: 978-3-642-45114-0
eBook Packages: Computer Science (R0)