Study of Twitter Communications on Cardiovascular Disease by State Health Departments

Musaev, Aibek; Britt, Rebecca K.; Hayes, Jameson; Britt, Brian C.; Maddox, Jessica; Sheinidashtegol, Pezhman

doi:10.1007/978-3-030-23499-7_12


Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 11512)

Included in the following conference series:

International Conference on Web Services


Abstract

The present study examines Twitter conversations around cardiovascular health in order to assess the topical foci of these conversations as well as the role of various state departments of health. After scraping tweets containing relevant keywords, Latent Dirichlet Allocation (LDA) was used to identify the most important topics discussed around the issue, while PageRank was used to determine the relative prominence of different users. The results indicate that a small number of state departments of health play an especially significant role in these conversations. Furthermore, irregular events such as Ebola outbreaks also exert a strong influence over the overall volume of tweets posted by state departments of health.

1 Introduction

Cardiovascular disease (CVD) is the leading cause of death in the United States [6], with contributing factors including poor health and risk factors such as obesity and diabetes, among others [9]. While prevention is the optimal approach to reducing CVD, the potential of social media communication to support it remains understudied. Health care professionals have increased their use of social media to engage with the public and to improve health care education, patient compliance, and organizational promotion [22]. To date, social media based health communication research has prioritized studies of theory, message effects, or the dissemination of interventions to end users [15]. However, state health departments can also use social media communication to engage with patients and improve their care [23].

Concurrently, research using social media has advanced considerably in recent years. API scrapers [1] enable the rapid collection of data from social media, while data analysis approaches such as time series analysis of user behaviors [14] and topic modeling through Latent Dirichlet Allocation (LDA) [13] allow large textual datasets to be rapidly analyzed in order to draw insights for basic research as well as in applied contexts, such as those related to public health care and associated campaigns [3].

In this study, we analyzed social media activity of state health departments related to cardiovascular disease. The objectives of this study were (1) to determine the most active state health departments on Twitter with respect to cardiovascular disease, and (2) to determine the most important topics that were discussed and the most important terms used in those discussions.

See the next section for an overview of the proposed methodology and related work. Section 3 presents the experimental results using real data and Sect. 4 concludes the paper.

2 Overview and Related Work

In this study, we analyzed both the tweets posted by state departments of health and their Twitter accounts.

In analyzing tweets, we first propose to determine the peaks of public activity by aggregating the tweets posted by all users in the collected dataset for each month under study. To understand the key drivers of those peaks, we then perform a detailed analysis by identifying the most popular topics discussed during those peaks. See Sect. 2.1 for a description of the topic modeling approach used in this study.

For the user analysis, we first examine the total number of messages posted by state health departments since opening their Twitter accounts. Then we analyze their communications with respect to cardiovascular disease using an extension of the PageRank algorithm. See Sect. 2.2 for a description of the algorithm used to identify the most important users in the collected dataset.

2.1 Towards Understanding the Most Important Topics Discussed by the Public

Topic modeling in machine learning and natural language processing is a popular approach to uncover hidden topics in a collection of documents. Intuitively, given that a document, such as a tweet, is about a particular topic, one would expect certain words to appear more or less frequently than others. For example, words such as ‘cardiovascular’, ‘heart’, and ‘stroke’ will appear more frequently in tweets on cardiovascular disease, ‘congress’, ‘vote’, and ‘policy’ in documents about politics, and ‘the’, ‘a’, and ‘is’ may appear equally in both. Furthermore, a document typically discusses multiple topics in different proportions, e.g., 60% about politics and 40% about cardiovascular disease in a news article about passing a bill on CVD.

Popular topic modeling algorithms include Latent Semantic Analysis (LSA) [5, 24], Hierarchical Dirichlet Process (HDP) [4, 10], and Latent Dirichlet Allocation (LDA) [12, 13]. In this project, we use LDA to uncover the most important topics in tweets posted by or mentioning state departments of health. We leave the study of alternative topic modeling algorithms to future work.

In LDA, each document can be described by a distribution of topics and each topic can be described by a distribution of words [8]. Here topics are introduced as a hidden (i.e., latent) layer connecting documents to words. Note that the number of topics is a fixed number that can be chosen either as an informed estimate based on a previous analysis or via a simple trial-and-error approach. See Sect. 3 for an application of the LDA approach to determine the topics discussed during the peaks of public activity.
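The two distributions described above can be combined into a single expression. The notation here is ours rather than the paper's: K is the number of topics, θ_d is the topic distribution of document d, and φ_k is the word distribution of topic k. Under these assumptions, the probability of seeing word w in document d is a topic-weighted mixture:

```latex
p(w \mid d) \;=\; \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)
            \;=\; \sum_{k=1}^{K} \phi_{k,w}\, \theta_{d,k}
```

As a purely illustrative calculation with made-up numbers, take the news article from Sect. 2.1 that is 60% about politics and 40% about cardiovascular disease. If 'heart' had probability 0.001 under the politics topic and 0.05 under the CVD topic, then p('heart' | d) = 0.6 × 0.001 + 0.4 × 0.05 = 0.0206.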

2.2 Ranking State Departments of Health by Social Media Influence

With the rise of social media platforms, such as Twitter, the identification of the most influential users has garnered a huge amount of interest [19]. Different Twitter influence measures have been proposed. Some are based on simple metrics provided by the Twitter API [7, 17], while others are based on complex mathematical models [11, 16].

In this study, we wanted to rank state departments of health by social media influence. Specifically, we wanted to identify the Twitter accounts whose posts on cardiovascular disease attracted the most attention.

Given the dataset of Twitter users and their tweets, we built a graph in which the users serve as nodes. The links between nodes represent reply relationships: each reply adds a directed edge from the author of the reply to the author of the original tweet that was replied to.
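The graph construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record keys `author` and `in_reply_to`, the handles, and the choice to skip self-replies are all assumptions made for the example.

```python
from collections import Counter


def build_reply_edges(tweets):
    """Count directed edges (reply author -> author of the original tweet).

    Each tweet is a dict with the hypothetical keys "author" and
    "in_reply_to" (None when the tweet is not a reply).
    """
    edges = Counter()
    for tweet in tweets:
        source = tweet["author"]
        target = tweet.get("in_reply_to")
        # Assumption: tweets that are not replies, and replies to oneself,
        # contribute no edge.
        if target and target != source:
            edges[(source, target)] += 1
    return edges


# Made-up records with placeholder handles, for illustration only.
sample_tweets = [
    {"author": "user_a", "in_reply_to": "dept_x"},
    {"author": "user_b", "in_reply_to": "dept_x"},
    {"author": "user_a", "in_reply_to": "dept_x"},
    {"author": "dept_x", "in_reply_to": None},
    {"author": "user_c", "in_reply_to": "user_c"},
]
reply_edges = build_reply_edges(sample_tweets)
```

Keeping a count per edge preserves how often one user replied to another, which could later serve as an edge weight if desired.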

Given the directed graph, we can now apply the PageRank algorithm, which measures the importance of nodes in a directed graph, such as pages on the web [20]. PageRank works by counting the number and quality of the links pointing to a node to determine a rough estimate of its importance.
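To make the ranking step concrete, here is a minimal sketch of the standard PageRank power iteration on a small reply graph. It is the textbook algorithm, not the extension used in this study, and the damping factor 0.85, the iteration count, and the handles are assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Standard PageRank by power iteration over (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly,
        # so the scores keep summing to 1.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


# Made-up reply edges: (author of reply, author of original tweet).
example_edges = [
    ("user_a", "dept_x"),
    ("user_b", "dept_x"),
    ("user_c", "dept_y"),
    ("dept_x", "dept_y"),
]
scores = pagerank(example_edges)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph, `dept_y` ends up ranked above `dept_x` even though it receives the same number of replies, because one of its replies comes from the well-connected `dept_x`. That is the "quality of links" effect described above.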

See Fig. 1 and the corresponding discussion in Sect. 3 for a visualization of the proposed approach.

3 Evaluation Using Real Data

In this section, we present a pilot study of Twitter communications involving state health departments. We begin by describing the data collection process used in this study in Sect. 3.1. Then we design two sets of experiments as follows. The first set of experiments (Sect. 3.2) analyzes the Twitter activity of all communication involving those departments while the second set (Sect. 3.3) focuses on the communication related to cardiovascular disease.

3.1 Data Collection

In this pilot study, we performed a set of experiments based on real-world data collected from Twitter. For data collection, we used a two-step process. In step 1, we downloaded basic data about tweets containing the keywords that we are interested in. Specifically, we used a JavaScript module called scrape-twitter^{Footnote 1}. It allows for querying Twitter for matching tweets based on keywords. In step 2, we used Twitter’s statuses/lookup^{Footnote 2} feature that returns a complete set of data for up to 100 tweets at a time.
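Because the lookup step accepts at most 100 tweets per request, the IDs gathered in step 1 have to be split into batches before step 2. The sketch below shows only that batching logic, with no network calls; the function name and the placeholder IDs are hypothetical.

```python
def batch_ids(tweet_ids, batch_size=100):
    """Yield consecutive slices of at most batch_size tweet IDs."""
    for start in range(0, len(tweet_ids), batch_size):
        yield tweet_ids[start:start + batch_size]


# Placeholder IDs standing in for the output of step 1.
collected_ids = [str(n) for n in range(1, 251)]

batches = list(batch_ids(collected_ids))
# Each batch is joined into one comma-separated string, the form in which
# a list of IDs would be passed to a single lookup request.
id_params = [",".join(batch) for batch in batches]
```

With 250 placeholder IDs this produces three batches of 100, 100, and 50 IDs, so step 2 would need three requests.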

For keywords, we used the Twitter handles of state departments of health for each state, such as @ALPublicHealth for Alabama or @HealthNYGov for New York. This allowed us to collect 319k tweets ranging from November 2, 2007 to December 26, 2018 posted by 52.5k users including the 50 state departments of health. This represents a full dataset of all tweets either posted by or mentioning the state health departments.

For the dataset on cardiovascular disease, we filtered the full dataset based on the keywords related to CVD as follows. We included the keywords directly related to CVD, such as ‘cardiovascular’ and ‘CVD.’ We also added the keywords related to CVD symptoms and risk factors, including ‘heart failure’, ‘heart disease’, ‘heart stroke’, ‘blood pressure’, ‘atherosclerosis’, ‘arrhythmia’, ‘cardiac’, and ‘obesity.’
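A keyword filter of this kind could look like the sketch below. The keyword list is taken from the paragraph above, but the matching rule (case-insensitive substring search) is our assumption, since the text does not specify how matches were determined. The sample tweets are made up.

```python
# Keywords listed in the text, lowercased for case-insensitive matching.
CVD_KEYWORDS = (
    "cardiovascular", "cvd", "heart failure", "heart disease",
    "heart stroke", "blood pressure", "atherosclerosis",
    "arrhythmia", "cardiac", "obesity",
)


def mentions_cvd(text):
    """Return True if the text contains any CVD keyword as a substring."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in CVD_KEYWORDS)


# Made-up tweets for illustration only.
sample_texts = [
    "Know your blood pressure numbers this month",
    "Flu shots are available at county clinics",
    "New report on CVD risk factors",
]
cvd_texts = [text for text in sample_texts if mentions_cvd(text)]
```

Plain substring matching is simple but can over-match short keywords such as 'cvd' inside longer strings; matching on word boundaries would be a stricter alternative.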

We released both datasets as a contribution to the research community^{Footnote 3}. These are the first published datasets that contain Twitter activity related to all 50 state departments of health covering an 11-year period. The datasets are provided as listings of tweet IDs in accordance with the Twitter policy^{Footnote 4}.

3.2 Analysis of Twitter Activity Involving State Health Departments

In this experiment, we analyzed the overall Twitter activity involving state departments of health before focusing on the topic of cardiovascular disease in the next section. Specifically, we aggregated the total number of tweets posted by each department since they opened their Twitter accounts and plotted the results in Fig. 2.

As expected, we observed that the departments that opened their Twitter accounts earlier have a larger number of tweets compared to the departments that joined Twitter at a later date. However, there are recent accounts that managed to generate an unusually high amount of activity despite joining Twitter later, such as the Pennsylvania Department of Health account (@PAHealthDept) which joined Twitter on April 28th, 2015.

Next, we analyzed the monthly Twitter activity across all state health departments and plotted the results in a time series graph in Fig. 3. Although we observed an overall increase in the total number of tweets over time, there are several peaks with an unusually high amount of activity compared to other months, such as October 2014 and October 2018.
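The monthly aggregation behind such a time series can be sketched with the standard library alone. Two things here are assumptions: timestamps are taken to be ISO 8601 strings, and a month is flagged as a peak when its count exceeds twice the median monthly count. The text does not define a peak threshold, and the timestamps below are synthetic.

```python
from collections import Counter
from datetime import datetime
from statistics import median


def monthly_counts(timestamps):
    """Count tweets per calendar month, keyed as 'YYYY-MM'."""
    return Counter(
        datetime.fromisoformat(ts).strftime("%Y-%m") for ts in timestamps
    )


def peak_months(counts, factor=2.0):
    """Return months whose count exceeds factor times the median count."""
    threshold = factor * median(counts.values())
    return sorted(month for month, count in counts.items() if count > threshold)


# Synthetic timestamps, for illustration only.
sample_timestamps = (
    ["2014-08-03T10:00:00"]
    + ["2014-09-12T09:30:00"]
    + ["2014-10-%02dT12:00:00" % day for day in range(1, 6)]
    + ["2014-11-02T08:00:00", "2014-11-20T17:45:00"]
)
counts = monthly_counts(sample_timestamps)
peaks = peak_months(counts)
```

Using the median rather than the mean keeps the threshold from being pulled upward by the peaks themselves.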

To understand the key drivers of the peak in October 2014, we analyzed the topics discussed during that month. Using sklearn [18], we trained an LDA model with an arbitrarily chosen number of topics, n_topics = 10. For illustration purposes, we visualized the computed topics and the most important words in those topics using t-SNE, or t-distributed stochastic neighbor embedding [21].
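For readers who want to reproduce this kind of analysis, the following is a minimal sketch of fitting an LDA model with scikit-learn. It is not the authors' code: the four documents are toy examples, two topics are used instead of ten, and the preprocessing (bag-of-words counts with English stop words removed) is an assumption. Note that recent scikit-learn releases name the number-of-topics parameter `n_components`.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents for illustration; not taken from the study's dataset.
docs = [
    "heart disease and blood pressure screening",
    "check blood pressure to protect heart health",
    "flu vaccine clinic open this week",
    "flu vaccine available at the county clinic",
]

# Bag-of-words counts, which is the input representation LDA expects.
vectorizer = CountVectorizer(stop_words="english")
word_counts = vectorizer.fit_transform(docs)

# Fit LDA; each row of doc_topics is one document's topic distribution.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(word_counts)

# Map column indices back to terms and list the top terms of each topic.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
top_terms = [
    [terms[i] for i in topic_weights.argsort()[::-1][:3]]
    for topic_weights in lda.components_
]
```

Here `doc_topics` corresponds to the per-document topic distributions and `lda.components_` to the (unnormalized) per-topic word weights described in Sect. 2.1. The topics found on such a tiny corpus are not meaningful in themselves.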

Based on the visualization shown in Fig. 4, we observed that Ebola was an important topic of discussion during October 2014, which is when Ebola spread outside of Africa^{Footnote 5}. For example, one topic contains keywords such as ‘says’, ‘dallas’, ‘test’, ‘health’, and ‘negative’, as illustrated by the following tweet: JUST IN: Texas Health Presbyterian says test results for a Dallas Co Sheriff’s deputy came back negative for ebola.

3.3 Analysis of Twitter Activity on Cardiovascular Disease

In this experiment, we analyzed Twitter communication on cardiovascular disease across all state health departments and plotted the results in a time series graph in Fig. 5. We observed that the largest peak of activity occurred in February 2018.

Then, we plotted a diagram of user rankings based on reply relationships in Fig. 1. The diagram was implemented using the JavaScript visualization library D³ [2]. Each node represents a Twitter user in the dataset, directed edges are based on reply relationships, and the color intensity represents the number of tweets by each user that contain the search keywords. Lastly, the size of a node is computed based on the PageRank algorithm as described in Sect. 2.2.

For visualization purposes, we added labels to the users with the highest ranking scores. Based on these results, Arizona Department of Health (@AZDHS) and South Carolina Department of Public Health (@scdhec) are among the most important social media users on cardiovascular disease.

4 Conclusion and Future Work

This study demonstrates the increasing Twitter usage of state departments of health and the role of cardiovascular disease and other health-related issues in their communication with the public. It also demonstrates the wide disparity in the Twitter presence of different state departments of health, with a minority of institutions taking an especially central role in conversations about health.

More broadly, this paper demonstrates an approach to mine and analyze social media data in order to draw conclusions about the behaviors and relative importance of various human and institutional actors, as well as the topics that drive conversation and the points in time at which those conversations shift.

Future research should address how individual users’ activities affect their influence over time and how prominent actors, in turn, shape the social media conversation. Researchers should also study finer aspects of discourse about cardiovascular disease and other health issues, such as the social structure of conversations among underserved communities, as well as how health-related organizations can influence those conversations to promote healthy behaviors.

References

1. Batrinca, B., Treleaven, P.C.: Social media analytics: a survey of techniques, tools and platforms. AI Soc. 30(1), 89–116 (2015)
2. Bostock, M., Ogievetsky, V., Heer, J.: D³ data-driven documents. TVCG 17(12), 2301–2309 (2011)
3. Britt, B.C., et al.: Finding the invisible leader: when a priori opinion leader identification is impossible. In: NCA (2017)
4. Burkhardt, S., Kramer, S.: Multi-label classification using stacked hierarchical Dirichlet processes with reduced sampling complexity. Knowl. Inf. Syst. 59(1), 93–115 (2019)
5. Cai, Z., et al.: Impact of corpus size and dimensionality of LSA spaces from Wikipedia articles on AutoTutor answer evaluation. In: Proceedings of the 11th International Conference on Educational Data Mining, EDM 2018, Buffalo, NY, USA, 15–18 July 2018 (2018)
6. Centers for Disease Control and Prevention. Heart disease in the United States. https://www.cdc.gov/heartdisease/facts.htm/. Accessed 14 Jan 2019
7. Cha, M., et al.: Measuring user influence in Twitter: the million follower fallacy. In: ICWSM, p. 30 (2010). 10.10-17
8. Debortoli, S., et al.: Text mining for information systems researchers: an annotated topic modeling tutorial. In: CAIS 39, p. 7 (2016)
9. Van Gaal, L.F., Mertens, I.L., De Block, C.E.: Mechanisms linking obesity with cardiovascular disease. Nature 444, 875–880 (2006)
10. Kaltsa, V., et al.: Multiple hierarchical Dirichlet processes for anomaly detection in traffic. Comput. Vis. Image Underst. 169, 28–39 (2018)
11. Katsimpras, G., Vogiatzis, D., Paliouras, G.: Determining influential users with supervised random walks. In: WWW, pp. 787–792. ACM (2015)
12. Kim, D.H., et al.: Multi-co-training for document classification using various document representations: TF-IDF, LDA, and Doc2Vec. Inf. Sci. 477, 15–29 (2019)
13. Li, C., et al.: Mining dynamics of research topics based on the combined LDA and WordNet. IEEE Access 7, 6386–6399 (2019)
14. Matei, S.A., Britt, B.C.: Structural Differentiation in Social Media: Adhocracy, Entropy and the “1% Effect”. Springer, Heidelberg (2017). https://doi.org/10.1007/978-3-319-64425-7
15. Moorehead, S.A., et al.: A new dimension of health care: systematic review of the uses, benefits, and limitations of social media for health communication. JMIR 15(4), e85 (2013)
16. More, J.S., Lingam, C.: A gradient-based methodology for optimizing time for influence diffusion in social networks. Soc. Netw. Anal. Min. 9(1), 5:1–5:10 (2019)
17. Noro, T., et al.: Twitter user rank using keyword search. In: 22nd European-Japanese Conference on Information Modelling and Knowledge Bases (EJC 2012), XXIV, Prague, Czech Republic, 4–9 June 2012, pp. 31–48 (2012)
18. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. JMLR 12, 2825–2830 (2011)
19. Riquelme, F., González-Cantergiani, P.: Measuring user influence on Twitter: a survey. IPM 52(5), 949–975 (2016)
20. Sugihara, K.: Using complex numbers in website ranking calculations: a non-ad hoc alternative to Google’s PageRank. JSW 14(2), 58–64 (2019)
21. Van Der Maaten, L.: Accelerating t-SNE using tree-based algorithms. JMLR 15(1), 3221–3245 (2014)
22. Ventola, C.L.: Social media and health care professionals: benefits, risks, and best practices. P&T 39, 491–499 (2014)
23. Widmer, R.J., et al.: Social media platforms and heart failure. J. Cardiol. Fail. 23(11), 809–812 (2017)
24. Yadav, C.S., Sharan, A.: A New LSA and entropy-based approach for automatic text document summarization. Int. J. Semantic Web Inf. Syst. 14(4), 1–32 (2018)

Author information

Authors and Affiliations

The University of Alabama, Tuscaloosa, AL, 35487, USA
Aibek Musaev, Rebecca K. Britt, Jameson Hayes, Brian C. Britt, Jessica Maddox & Pezhman Sheinidashtegol

Corresponding author

Correspondence to Aibek Musaev.

Editor information

Editors and Affiliations

University of Georgia, Athens, GA, USA
John Miller
University of Alberta, Edmonton, AB, Canada
Eleni Stroulia
Louisiana State University, Baton Rouge, LA, USA
Kisung Lee
Kingdee International Software Group Co., Ltd., Shenzhen, China
Liang-Jie Zhang

Cite this paper

Musaev, A., Britt, R.K., Hayes, J., Britt, B.C., Maddox, J., Sheinidashtegol, P. (2019). Study of Twitter Communications on Cardiovascular Disease by State Health Departments. In: Miller, J., Stroulia, E., Lee, K., Zhang, LJ. (eds) Web Services – ICWS 2019. ICWS 2019. Lecture Notes in Computer Science, vol 11512. Springer, Cham. https://doi.org/10.1007/978-3-030-23499-7_12

DOI: https://doi.org/10.1007/978-3-030-23499-7_12
Published: 14 June 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-23498-0
Online ISBN: 978-3-030-23499-7
eBook Packages: Computer Science, Computer Science (R0)
