Hierarchical Dirichlet Process for Tracking Complex Topical Structure Evolution and Its Application to Autism Research Literature

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9077)

Abstract

In this paper we describe a novel framework for discovering the topical content of a data corpus and tracking its complex structural changes over time. In contrast to previous work, our model does not impose a prior on the rate at which documents are added to the corpus, nor does it adopt the Markovian assumption, which overly restricts the types of change that a model can capture. Our key technical contribution is a framework based on (i) discretization of time into epochs, (ii) epoch-wise topic discovery using a hierarchical Dirichlet process-based model, and (iii) a temporal similarity graph which allows for the modelling of complex topic changes: emergence and disappearance, evolution, splitting, and merging. The power of the proposed framework is demonstrated on the medical literature corpus concerned with autism spectrum disorder (ASD), a research subject of growing social and healthcare importance. In addition to the collected ASD literature corpus, which we have made freely available, our contributions include two free online tools we built as aids to ASD researchers. These can be used for semantically meaningful navigation and searching of this large and rapidly growing body of literature, as well as for knowledge discovery from it.
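
The temporal similarity graph of step (iii) can be illustrated with a minimal sketch. The abstract does not state which similarity measure or threshold the authors use, so the Bhattacharyya coefficient, the 0.5 cut-off, the function names, and the toy topic distributions below are all assumptions made for illustration. Topics are taken to be word distributions over a shared vocabulary, as an epoch-wise topic model might produce them.

```python
import math
from collections import defaultdict


def bhattacharyya(p, q):
    """Similarity in [0, 1] between two word distributions (assumed measure)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def link_epochs(prev_topics, next_topics, threshold=0.5):
    """Edges (i, j) from topic i in one epoch to topic j in the next epoch.

    The threshold is a hypothetical parameter, not a value from the paper.
    """
    return [
        (i, j)
        for i, p in enumerate(prev_topics)
        for j, q in enumerate(next_topics)
        if bhattacharyya(p, q) >= threshold
    ]


def classify_changes(n_prev, n_next, edges):
    """Label the structural changes implied by edges between adjacent epochs."""
    out_links, in_links = defaultdict(list), defaultdict(list)
    for i, j in edges:
        out_links[i].append(j)
        in_links[j].append(i)

    events = []
    for i in range(n_prev):
        if not out_links[i]:
            events.append(("disappearance", i))
        elif len(out_links[i]) > 1:
            events.append(("split", i, tuple(out_links[i])))
    for j in range(n_next):
        if not in_links[j]:
            events.append(("emergence", j))
        elif len(in_links[j]) > 1:
            events.append(("merge", tuple(in_links[j]), j))
    for i, j in edges:
        # A one-to-one link is read as a single topic evolving across epochs.
        if len(out_links[i]) == 1 and len(in_links[j]) == 1:
            events.append(("evolution", i, j))
    return events


# Toy data: two topics in the earlier epoch, three in the later one,
# over a four-word vocabulary.
epoch_a = [[0.7, 0.3, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
epoch_b = [[0.6, 0.4, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

edges = link_epochs(epoch_a, epoch_b)
events = classify_changes(len(epoch_a), len(epoch_b), edges)
print(edges)
print(events)
```

In this toy case the first topic carries over with a small shift in word weights, while the second divides into two narrower topics. Because edges are built only from pairwise similarity between adjacent epochs, one topic may have several successors or predecessors, which is what lets a graph of this kind represent splits and merges.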

Author information

Corresponding author

Correspondence to Adham Beykikhoshk.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Beykikhoshk, A., Arandjelović, O., Venkatesh, S., Phung, D. (2015). Hierarchical Dirichlet Process for Tracking Complex Topical Structure Evolution and Its Application to Autism Research Literature. In: Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D., Motoda, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2015. Lecture Notes in Computer Science, vol 9077. Springer, Cham. https://doi.org/10.1007/978-3-319-18038-0_43

  • DOI: https://doi.org/10.1007/978-3-319-18038-0_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-18037-3

  • Online ISBN: 978-3-319-18038-0

  • eBook Packages: Computer Science, Computer Science (R0)
