A Clustering Algorithm Based on Document Embedding to Identify Clinical Note Templates

Tang, Chunlei; Plasek, Joseph Michael; Xiong, Yun; Zhang, Zhikun; Bates, David Westfall; Zhou, Li

doi:10.1007/s40745-020-00296-8


Published: 06 June 2020

Annals of Data Science, volume 8, pages 497–515 (2021)

Chunlei Tang ORCID: orcid.org/0000-0002-6460-0246¹,
Joseph Michael Plasek¹,
Yun Xiong²,
Zhikun Zhang²,
David Westfall Bates¹ &
Li Zhou¹


Abstract

This paper proposes a novel unsupervised clustering algorithm, based on document embedding, to generate clinical note templates. We adapted Charikar’s SimHash to embed each clinical document into a vector representation. We modified the traditional K-means algorithm to merge any two clusters whose centroids are very close. Under the K-means paradigm, our algorithm designates as the prototype template the cluster representative, that is, the document whose vector is closest to the centroid. We evaluated the feasibility of applying our algorithm at the individual author level on a corpus of 1,063,893 clinical notes from 19,146 unique providers, dated between January 2011 and July 2016. Our algorithm achieved more than 80% precision and runs in O(n) time. We further validated our algorithm with human annotators, who reported that it can efficiently detect a real clinical document that represents the other documents in the same cluster, at both the department level and the individual clinician level.
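The pipeline described in the abstract (SimHash embedding, K-means with centroid merging, prototype selection) can be illustrated with a minimal sketch. This is not the authors' implementation: the 64-bit width, the MD5 token hash, whitespace tokenization, the Euclidean distance, the `merge_dist` threshold, and all function names are assumptions made only for illustration, and the sample notes are invented.

```python
import hashlib
import math
import random

BITS = 64  # assumed fingerprint width; the abstract does not state one


def simhash_vector(text):
    """Embed text as a 0/1 vector: each token votes +1/-1 per hash bit."""
    votes = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    return [1.0 if v > 0 else 0.0 for v in votes]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def merge_close(centroids, merge_dist):
    """Fold any centroid lying within merge_dist of a kept one into it."""
    kept = []
    for c in centroids:
        for i, m in enumerate(kept):
            if distance(c, m) < merge_dist:
                kept[i] = mean([c, m])
                break
        else:
            kept.append(c)
    return kept


def cluster(vectors, k, merge_dist, iters=10, seed=0):
    """K-means loop that merges near-duplicate centroids before each assignment."""
    centroids = random.Random(seed).sample(vectors, k)
    groups = []
    for _ in range(iters):
        centroids = merge_close(centroids, merge_dist)
        groups = [[] for _ in centroids]
        for idx, v in enumerate(vectors):
            nearest = min(range(len(centroids)),
                          key=lambda j: distance(v, centroids[j]))
            groups[nearest].append(idx)
        groups = [g for g in groups if g]  # drop clusters that lost all members
        centroids = [mean([vectors[i] for i in g]) for g in groups]
    return groups, centroids


def prototypes(vectors, groups, centroids):
    """Index of the real document closest to each cluster centroid."""
    return [min(g, key=lambda i: distance(vectors[i], c))
            for g, c in zip(groups, centroids)]


# Invented example notes (hypothetical, not from the study corpus).
notes = [
    "patient presents with cough and fever plan rest fluids follow up one week",
    "patient presents with cough and fever plan rest fluids follow up two weeks",
    "patient presents with cough and chills plan rest fluids follow up one week",
    "mri of left knee shows no fracture impression normal study no acute findings",
    "mri of right knee shows no fracture impression normal study no acute findings",
    "mri of left knee shows no tear impression normal study no acute findings",
]
vectors = [simhash_vector(n) for n in notes]
groups, centroids = cluster(vectors, k=3, merge_dist=2.0)
for g, p in zip(groups, prototypes(vectors, groups, centroids)):
    print(len(g), "notes; prototype:", notes[p])
```

Because the prototype is always an existing document rather than the centroid itself, the output of such a sketch is a real note that can be read as a candidate template. The number of clusters can end up smaller than `k` when centroids are merged.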


Abbreviations

EHR: Electronic health record

References

Dubois S, Romano N, Kale DC, Shah N, Jung K (2017) Effective representations of clinical notes. arXiv preprint arXiv:1705.07025
Tan P, Steinbach M, Karpatne A, Kumar V (2019) Introduction to data mining, 2nd edn. Pearson Education India, London
Naming clusters (2017) Dataiku.com. https://academy.dataiku.com/cluster-models/513439. Accessed 5 June 2020
Doing-Harris K, Patterson O, Igo S, Hurdle J (2013) Document sublanguage clustering to detect medical specialty in cross-institutional clinical texts. In: Proceedings of the 7th international workshop on data and text mining in biomedical informatics. ACM, pp 9–12
Patterson O, Hurdle JF (2011) Document clustering of clinical narratives: a systematic study of clinical sublanguages. In: AMIA annual symposium proceedings. American Medical Informatics Association, p 1099
Zhang R, Pakhomov S, Melton GB (2014) Longitudinal analysis of new information types in clinical notes. In: AMIA joint summits on translational science proceedings. American Medical Informatics Association, pp 232–237
Cohen R, Elhadad M, Elhadad N (2013) Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies. BMC Bioinform 14:10
Downey D, Etzioni O, Soderland S (2010) Analysis of a probabilistic model of redundancy in unsupervised information extraction. Artif Intell 174(11):726
Zhang R, Pakhomov S, McInnes BT, Melton GB (2011) Evaluating measures of redundancy in clinical texts. In: AMIA annual symposium proceedings, pp 1612–1620
Steinbach M, Karypis G, Kumar V (2000) A comparison of document clustering techniques. In: KDD workshop on text mining, vol 400, no 1, pp 525–526
Keogh E, Mueen A (2017) Curse of dimensionality. In: Sammut C, Webb GI (eds) Encyclopedia of machine learning and data mining. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7687-1
Charikar MS (2002) Similarity estimation techniques from rounding algorithms. In: Proceedings of the thirty-fourth annual ACM symposium on theory of computing. ACM, pp 380–388
Sadowski C, Levin G (2007) Simhash: hash-based similarity detection. Technical report, Google
Boley D, Gini M, Gross R, Han E, Hastings K, Karypis G, Kumar V, Mobasher B, Moore J (1999) Partitioning-based clustering for web document categorization. Decis Support Syst 27(3):329–341
Zhao Y, Karypis G, Fayyad U (2005) Hierarchical clustering algorithms for document datasets. Data Min Knowl Discov 10(2):141–168
Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining. AAAI, pp 226–231
Liao W, Liu Y, Choudhary A (2004) A grid-based clustering algorithm using adaptive mesh refinement. In: 7th workshop on mining scientific and engineering datasets of SIAM international conference on data mining, vol 22. SIAM, pp 61–69
Zhong S, Ghosh J (2003) A unified framework for model-based clustering. J Mach Learn Res 4:1001–1037
Wu H, Luk R, Wong K, Kwok K (2008) Interpreting TF-IDF term weights as making relevance decisions. ACM Trans Inf Syst 26(3):13
Hui S, Dechao Z (2016) A weighted topical document embedding based clustering method for news text. In: 2016 IEEE information technology, networking, electronic and automation control conference. IEEE, pp 1060–1065
Sood S (2011) Probabilistic simhash matching. Doctoral dissertation. Texas A&M University
Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of ACM STOC, pp 604–613
Svenstrup DT, Hansen J, Winther O (2017) Hash embeddings for efficient word representations. In: Advances in neural information processing systems, pp 4928–4936
Sim J, Wright CC (2005) The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Phys Ther 85(3):257–268
McHugh ML (2012) Interrater reliability: the kappa statistic. Biochem Med 22(3):276–282
Institute of Medicine Committee on Quality of Health Care in America (2001) Crossing the quality chasm: a new health system for the 21st century. National Academies Press, Washington
Clark A (1998) Being there: putting brain, body, and world together again. MIT Press, Cambridge
Kashyap V, Turchin A, Morin L, Chang F, Li Q, Hongsermeier T (2006) Creation of structured documentation templates using natural language processing techniques. In: AMIA annual symposium proceedings. American Medical Informatics Association, p 977
Ferrández O, South BR, Shen S, Friedlin FJ, Samore MH, Meystre SM (2013) BoB, a best-of-breed automated text de-identification system for VHA clinical documents. J Am Med Inform Assoc 20(1):77–83

Acknowledgements

This work was partially funded by the Partners Innovation Fund, the National Natural Science Foundation of China Projects No. U1636207 and No. U1936213, and the Shanghai Science and Technology Development Fund No. 19511121204 and No. 19DZ1200802. The authors would like to thank Wenxuan Shen, Hai Cao, and Siyuan Cheng for their helpful comments on an early draft of the manuscript, and Lynn A. Volk and Frank Y. Chang for help with the annotation.

Author information

Authors and Affiliations

Division of General Internal Medicine and Primary Care, Harvard Medical School, Brigham and Women’s Hospital, Boston, MA, 02115, USA
Chunlei Tang, Joseph Michael Plasek, David Westfall Bates & Li Zhou
Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai, 201203, China
Yun Xiong & Zhikun Zhang


Corresponding author

Correspondence to Yun Xiong.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Tables 3 and 4.

Table 3 Values for Table 1 as Fleiss’ Kappa = 0.721 (N = 67, n = 4, k = 3)


Table 4 Values for Table 2 as Fleiss’ Kappa = 0.838 (N = 27, n = 4, k = 3)
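The Fleiss' kappa values quoted in the captions of Tables 3 and 4 can be computed from a matrix of rating counts. The sketch below assumes the standard notation, in which N is the number of rated items, n the number of raters per item, and k the number of categories. The function name and the sample counts are hypothetical and are not the study's annotation data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = raters assigning item i to category j.

    Every row must sum to the same number of raters n (n >= 2).
    Undefined (division by zero) if all ratings fall in a single category.
    """
    N = len(counts)           # number of rated items
    n = sum(counts[0])        # raters per item
    k = len(counts[0])        # number of categories
    if any(sum(row) != n for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    # Share of all ratings that fall in each category.
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    # Observed agreement per item, then averaged over items.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    # Agreement expected by chance.
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)


# Hypothetical ratings: 5 items, 4 raters, 3 categories.
example = [
    [4, 0, 0],
    [3, 1, 0],
    [0, 4, 0],
    [0, 1, 3],
    [0, 0, 4],
]
print(round(fleiss_kappa(example), 3))
```

With the study's settings, the input would have 67 rows for Table 3 and 27 rows for Table 4, each row holding three counts that sum to four.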



About this article

Cite this article

Tang, C., Plasek, J.M., Xiong, Y. et al. A Clustering Algorithm Based on Document Embedding to Identify Clinical Note Templates. Ann. Data. Sci. 8, 497–515 (2021). https://doi.org/10.1007/s40745-020-00296-8


Received: 15 March 2019
Revised: 25 March 2020
Accepted: 26 May 2020
Published: 06 June 2020
Issue Date: September 2021
DOI: https://doi.org/10.1007/s40745-020-00296-8
