
1 Introduction

Social media data are accumulating rapidly from bulletin board systems (BBS), wikis, instant messaging, blogs, and videos/images shared on channels such as Facebook, Twitter, and Sina Weibo. On the one hand, the information in these data is relatively consistent and persistent, which makes it valuable for studying how social events unfold [1, 2]. On the other hand, these data cover a wide variety of topics with complex hidden relationships. Accurately and effectively mining the hidden relationships among these topics is challenging, but the mining results are critical for identifying the root causes of public opinion and for taking timely actions in response.

The mining of topic relationships is related to research on Topic Detection and Tracking (TDT) [3,4,5]. Topic detection finds unknown topics by analyzing the relationships between new stories and known ones, clustering relevant stories into specific topics. Topic tracking identifies and follows events by monitoring the progressive relationships between a given topic and its follow-up stories; it mainly emphasizes internal relationships among multiple stories related to a single given topic. After topics are detected, there are few studies that track their progress and find relationships among new stories that span more than one topic.

Some researchers have improved the Apriori algorithm for mining association rules [6,7,8,9]. Association rules are in fact well suited to detecting relationships among multiple topics. In this paper, we propose an approach for detecting multiple-topic relationships based on parallel association rules (PARMTRD). Our main contributions include:

  (1) PARMTRD can detect unobvious but critical relationships among multiple different topics hidden beneath the superficial phenomena of events, and can thus explore the root causes of multiple events. This differs from existing work, which focuses on detecting relationships among stories within a single topic.

  (2) We improve the Apriori algorithm and apply it to find topic relationships in complex scenarios by mining association rules in parallel. After obtaining the frequent keyword sets and the association rules, we obtain the association keyword sets for each topic, from which we can select and assemble keywords to find the relevance among multiple seemingly unrelated events.

The rest of the paper is organized as follows. Section 2 presents an overview of related work. Section 3 introduces the definitions of the concepts used in this paper. Section 4 proposes the idea of parallel association rules for efficient multiple-topic relationship mining. Section 5 presents the PARMTRD method. Section 6 presents experiments and results. Section 7 concludes the paper and presents future work.

2 Related Work

A vector space model (VSM) ranks a document with respect to its similarity to a given user query. This similarity can be estimated by calculating the cosine of the angle between a document vector and a query vector [2]. However, the VSM assumes that terms are independent of each other and completely ignores the implicit relationships among the terms in a document, which leads to the loss of sequential information about keywords.

LSI (Latent Semantic Indexing) has been introduced into text representation models. For example, the topic model represented by LDA (Latent Dirichlet Allocation) [5, 10,11,12] is widely used. LDA is a three-layer Bayesian parameter model that introduces a Dirichlet prior distribution on top of PLSA (Probabilistic Latent Semantic Analysis) [13, 14]. The implicit themes of a text are modeled by a probabilistic generative model to describe the relationships among documents, topics, and words. Improved models based on LDA have been proposed; for example, the TOT (Topics over Time) model [15] includes time as an observable variable in LDA. Some models use a time window for detecting relationships among stories, such as the DTM (Dynamic Topic Model) [16, 17], CTDTM (Continuous Time Dynamic Topic Model) [18, 19], DMM (Dynamic Mixture Model) [18, 20], and OLDA (Online Latent Dirichlet Allocation) [21,22,23]. Although semantic information is introduced in these topic models, word co-occurrence relations are not explicitly considered. In addition, LDA needs to perform the sampling operation repeatedly, which increases space complexity.

Some research is based on the co-occurrence relationships of keywords [24,25,26,27,28]. Word co-occurrence refers to the fact that two or more keywords often appear together in the same part of a text, such as an article or a passage. We assume that such keywords are related, and that the higher the probability of their co-occurrence, the closer their relationship. Sayyadi and Raschid [24] proposed the KeyGraph approach based on co-occurrence relationships and demonstrated its accuracy on small data sets. Since the scenario graph visualized by KeyGraph is machine-oriented, Wang et al. [26] proposed a human-oriented algorithm called IdeaGraph. Neither method takes semantic relations between keywords into account, so Chen et al. [28] combined LDA with KeyGraph, proposing a hybrid term-term relationship analysis approach. In addition, Li et al. [27] and Zhao et al. [25] applied the word co-occurrence graph to microblog topic detection, demonstrating the effectiveness of this method. Although word co-occurrence expresses the semantic relations between words to a certain degree, it is mainly concerned with the co-occurrence relationship between two keywords, ignoring the practical case in which multiple words co-occur. Detecting relationships among multiple topics requires relationships among multiple keywords. Therefore, we propose a method for topic relationship detection based on parallel association rules.

3 Concept Definitions

Definition 1.

Support. Support indicates how frequently an itemset appears in a dataset \( I \); in TDT, it reflects the hotness of a keyword set. An association rule is an implication of the form \( A \Rightarrow B \), where \( A,B \subseteq I \). The support of the itemset \( \{ A,B\} \) is defined as (1):

$$ sup(A \Rightarrow B) = P(A \cup B) = num(A \cup B)/num(I) $$
(1)

\( sup(A \Rightarrow B) \) is the support of the itemset \( \{ A,B\} \); \( P(A \cup B) \) is the probability that the itemset \( \{ A,B\} \) occurs in the dataset \( I \); \( num(A \cup B) \) is the number of occurrences of the itemset \( \{ A,B\} \) in \( I \); and \( num(I) \) is the number of records in the dataset.

Definition 2.

Confidence. Confidence indicates the probability that \( B \) occurs given that \( A \) occurs. In TDT, it expresses how often the items of \( A \) and \( B \) appear together in a record of the dataset \( I \), given that the record contains \( A \). The confidence of an association rule \( A \Rightarrow B \) is defined as (2):

$$ conf(A \Rightarrow B) = P(B\left| A \right.) = \,sup(A \cup B)/sup(A) = num(A \cup B)/num(A) $$
(2)

where \( num(A \cup B) \) is the number of occurrences of the itemset \( \{ A,B\} \) in the dataset \( I \), and \( num(A) \) is the number of occurrences of the itemset \( \{ A\} \) in the dataset \( I \).
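To make Definitions 1 and 2 concrete, the following minimal Python sketch computes support and confidence over a toy set of keyword records; the records and values are invented for illustration and are not taken from our corpus.

```python
# Toy records: each one is the keyword set extracted from one story.
dataset = [
    {"subway", "explosion", "attack"},
    {"subway", "explosion", "terror"},
    {"subway", "church"},
    {"explosion", "church", "attack"},
]

def support(itemset, records):
    """Eq. (1): sup(A => B) = num(A u B) / num(I)."""
    itemset = set(itemset)
    return sum(itemset <= r for r in records) / len(records)

def confidence(antecedent, consequent, records):
    """Eq. (2): conf(A => B) = num(A u B) / num(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, records) / support(antecedent, records)

print(support({"subway", "explosion"}, dataset))       # 0.5
print(confidence({"subway"}, {"explosion"}, dataset))  # 0.666...
```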

Definition 3.

Candidate k_itemset. In TDT, a candidate k_itemset is an itemset containing \( k \) \( (k = 1,2, \ldots ,n) \) keywords. The i-th candidate k_itemset is denoted by \( c_{k} [i] = \{ w_{1} [i],w_{2} [i], \ldots ,w_{k} [i]\} \), where \( w_{1} [i] \) is the first keyword in \( c_{k} [i] \). The set \( C_{k} = \{ c_{k} [1],c_{k} [2], \ldots ,c_{k} [i]\} \) contains all candidate k_itemsets in the dataset \( I \), where \( i \) is the number of candidate k_itemsets in \( C_{k} \).

Definition 4.

Frequent k_itemset. A frequent k_itemset is a set of \( k \) keywords whose frequency exceeds a given support threshold. The j-th frequent k_itemset is denoted by \( l_{k} [j] = \{ w_{1} [j],w_{2} [j], \ldots ,w_{k} [j]\} \). The set \( L_{k} = \{ l_{k} [1],l_{k} [2], \ldots ,l_{k} [j]\} \) contains all frequent k_itemsets in the dataset \( I \), where \( j \le i \).

Definition 5.

Association k_itemset. An association k_itemset is a special frequent k_itemset for which the confidences of all its association rules exceed a given threshold. The h-th association k_itemset is denoted by \( a_{k} [h] = \{ w_{1} [h],w_{2} [h], \ldots ,w_{k} [h]\} \). The set \( AS_{k} = \{ a_{k} [1],a_{k} [2], \ldots ,a_{k} [h]\} \) contains all association k_itemsets in the dataset \( I \), where \( h \le j \).
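On the same toy records, the containment chain implied by Definitions 3-5 (every association k_itemset is frequent, and every frequent k_itemset is a candidate) can be sketched as follows, reusing the `support` and `confidence` helpers above; the thresholds are again illustrative.

```python
# Reuses `dataset`, `support`, and `confidence` from the sketch above.
C2 = {frozenset(p) for p in [("subway", "explosion"),
                             ("subway", "church"),
                             ("explosion", "attack")]}  # candidate 2_itemsets

# Definition 4: frequent 2_itemsets are candidates whose support meets min_sup.
L2 = {c for c in C2 if support(c, dataset) >= 0.5}

# Definition 5: association 2_itemsets are frequent sets whose rules
# all meet min_conf (for k = 2 the rules are {w1} => {w2} and {w2} => {w1}).
AS2 = {l for l in L2
       if all(confidence({w}, l - {w}, dataset) >= 0.6 for w in l)}

assert AS2 <= L2 <= C2  # AS_k is a subset of L_k, which is a subset of C_k
```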

4 Parallel Association Rules for Mining Association Keyword Sets

To improve the performance of mining topic relationships, we propose parallel association rules. Compared with the traditional Apriori algorithm, parallel association rules have two advantages:

  • Parallel association rules improve the mining speed of frequent keyword sets by processing public opinion data in parallel based on the MapReduce paradigm, which makes them more suitable for big data processing.

  • Parallel association rules introduce the concept of association keyword sets. By calculating the confidence of the frequent keyword sets from each intermediate step, we obtain the association keyword sets in which all association rules satisfy the confidence threshold, capturing important hidden information that the Apriori algorithm ignores.

Parallel association rules divide a computation task into \( N \) separate subtasks, each of which handles \( 1/N \) of the work. Based on \( L_{1} \), each subtask carries out the iteration from \( AS_{k - 1} \) to \( L_{k} \) for its assigned (k-1)_item association keyword sets. The global variable \( L_{k} \) is obtained by combining the results of all the subtasks and removing duplicates; from it, the association rules are derived and all k_item association keyword sets are obtained. On this basis, the next iteration is performed until \( AS_{k + 1} \) is empty. The process for obtaining the association keyword sets of one topic is shown in Fig. 1.

Fig. 1. Flow chart of association keyword set acquisition by parallel association rules

4.1 K_Item Frequent Keyword Set Mining

The acquisition of \( L_{k} \) consists of three steps. First, the \( L_{1} \) of each topic is taken as the premise of each iteration for that topic. Second, the \( AS_{k - 1} \) of each topic is divided into \( N \) subtasks, which independently form \( C_{k} \). Finally, the global variable \( L_{k} \) is obtained from the results of the subtasks.

1_item Frequent Keyword Set Mining

Using the TOP keywords of all topics, we filter the corresponding data so that the data we obtain contain all the details of the popular topics. The specific steps for obtaining the \( L_{1} \) of each topic are as follows:

First, the collection of candidate 1_item keyword sets is composed of all the keywords of each topic-related dataset. From Definition 3 in Sect. 3, the candidate 1_item keyword set at position \( i \) is \( c_{1} [i] \), and \( C_{1} = \{ c_{1} [1],c_{1} [2], \ldots ,c_{1} [t]\} \), where \( t \) is the number of all keywords in the data for each topic.

Second, we scan each topic-related dataset, counting the frequency \( num(c_{1} [i]) \) of each \( c_{1} [i] \). Derived from (1), the support of \( c_{1} [i] \) is calculated as follows:

$$ sup\_c_{1} [i] = num(c_{1} [i])/num(I) $$
(3)

Finally, we set the support threshold \( min\_sup \). If \( min\_sup \le sup\_c_{1} [i] \), then \( c_{1} [i] \) is added to \( L_{1} \); otherwise it is discarded. Thus, \( L_{1} = \{ l_{1} [1],l_{1} [2], \ldots ,l_{1} [j]\} \), \( j \le i \).
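A minimal sketch of this step, where `stories` stands for one topic's keyword records and `min_sup` is the chosen threshold (0.13 in the experiments of Sect. 6):

```python
from collections import Counter

def mine_L1(stories, min_sup):
    """Eq. (3): keep every keyword whose document frequency divided by
    the number of stories reaches min_sup."""
    n = len(stories)
    counts = Counter(w for story in stories for w in set(story))
    return {frozenset([w]) for w, c in counts.items() if c / n >= min_sup}
```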

Generating Candidate k_item Keyword Sets

The generation of candidate keyword sets includes a joining step and a pruning step. The joining step divides \( AS_{k - 1} \) into \( N \) separate subtasks, each of which consists of one or \( m \) non-repeating \( a_{k - 1} \), where the value of \( m \) is determined by the number of \( a_{k - 1} \). We then combine each \( a_{k - 1} \) with each \( l_{1} \) one by one, independently generating \( C_{k} \) in each subtask. The pruning step relies on the prior knowledge that all non-empty subsets of a frequent keyword set must also be frequent: it matches every subset of each \( c_{k} \) in \( C_{k} \) against all x_item association keyword sets \( (1 \le x \le k - 1) \), pruning any \( c_{k} \) that does not satisfy the prior knowledge and thereby obtaining the \( C_{k} \) used to generate frequent keyword sets.

For example, the candidate 3_item keyword set {St.Petersburg, subway, explosion} of "the explosion of St.Petersburg" has the 2_item subsets {St.Petersburg, subway}, {St.Petersburg, explosion}, and {subway, explosion}, and the 1_item subsets {St.Petersburg}, {subway}, and {explosion}. If any of these subsets fails to match the 2_item association keyword sets in \( AS_{2} \) or the 1_item frequent keyword sets in \( L_{1} \), the candidate 3_itemset cannot be a frequent keyword set according to the prior knowledge and is pruned.
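The following sketch illustrates the join and prune steps; for brevity it checks only the (k-1)_item subsets, whereas the description above matches all x_item subsets, and the function names are illustrative.

```python
from itertools import combinations

def gen_candidates(AS_prev, L1):
    """Join each a_{k-1} in AS_{k-1} with each l_1 in L_1, then prune
    candidates that have a (k-1)-subset outside AS_{k-1}."""
    joined = {a | w for a in AS_prev for w in L1 if not w <= a}
    return {c for c in joined
            if all(frozenset(s) in AS_prev
                   for s in combinations(c, len(c) - 1))}

# The pruning example from the text:
AS2 = {frozenset(p) for p in [("St.Petersburg", "subway"),
                              ("St.Petersburg", "explosion"),
                              ("subway", "explosion")]}
L1 = {frozenset([w]) for w in ("St.Petersburg", "subway", "explosion")}
print(gen_candidates(AS2, L1))
# -> {frozenset({'St.Petersburg', 'subway', 'explosion'})}
```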

Generating k_item Frequent Keyword Sets

The process of obtaining the global variable \( L_{k} \) from the pruned \( C_{k} \) of each subtask is similar to obtaining \( L_{1} \). The specific steps are as follows:

First, we scan the data set of the corresponding topic, counting the frequency \( num(c_{k} [i]) \) of each \( c_{k} [i] \). Derived from (1), the support of \( c_{k} [i] \) is as follows:

$$ sup\_c_{k} [i] = num(c_{k} [i])/num(I) $$
(4)

Second, we compare \( sup\_c_{k} [i] \) with the preset threshold \( min\_sup \). If \( min\_sup \le sup\_c_{k} [i] \), then the \( c_{k} [i] \) corresponding to \( sup\_c_{k} [i] \) is added to \( L_{k} \) and denoted \( l_{k} [j] \); otherwise \( c_{k} [i] \) is discarded. Thus, \( L_{k} \) is independently generated in each subtask.

Finally, the global variable \( L_{k} \) is obtained by combining the results of the \( N \) separate subtasks and removing duplicates.
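The per-subtask counting and the final merge can be sketched as follows; the MapReduce machinery is elided, and `subtask_results` simply stands for the local \( L_{k} \) sets returned by the \( N \) subtasks.

```python
def mine_Lk(candidates, stories, min_sup):
    """Eq. (4): keep the candidates whose support meets min_sup."""
    n = len(stories)
    return {c for c in candidates
            if sum(c <= set(s) for s in stories) / n >= min_sup}

def merge_subtasks(subtask_results):
    """Global L_k: the deduplicated union of the subtasks' local L_k
    (frozensets make duplicate removal automatic)."""
    global_Lk = set()
    for local_Lk in subtask_results:
        global_Lk |= local_Lk
    return global_Lk
```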

4.2 Association Keyword Set Mining

Support reflects how hotly a keyword set is discussed in the public opinion dataset, while confidence reflects the strength of the relationship among the keywords within a keyword set. Thus, the support and confidence of a keyword set directly indicate its relationship to the topic. We can filter out the association keyword sets that satisfy both the support and confidence thresholds, and then find potential relations by selecting and assembling the obtained association keyword sets. The specific steps to get the k_item association keyword sets are as follows:

First, we calculate the confidence of all association rules. Each \( l_{k} [j] \) in the global variable \( L_{k} \) can generate multiple association rules. Let \( l_{k} [j_{[s]} ] \) be the keyword set consisting of \( s \) keywords of \( l_{k} [j] \), where \( 1 \le s < k \), and let \( l_{k} [j_{[k - s]} ] \) be the set of the remaining keywords. Derived from (2), the confidence of the association rule \( l_{k} [j_{[s]} ] \Rightarrow l_{k} [j_{[k - s]} ] \) is as follows:

$$ conf(l_{k} [j_{[s]} ] \Rightarrow l_{k} [j_{[k - s]} ]) = \,sup\_l_{k} [j]/sup\_l_{k} [j_{[s]} ] $$
(5)

where \( sup\_l_{k} [j] \) is the support of \( l_{k} [j] \), and \( sup\_l_{k} [j_{[s]} ] \) is the support of the keyword set consisting of \( s \) keywords of \( l_{k} [j] \).

Second, we set the confidence threshold \( min\_conf \). If \( min\_conf \le conf(l_{k} [j_{[s]} ] \Rightarrow l_{k} [j_{[k - s]} ]) \), then the association rule \( l_{k} [j_{[s]} ] \Rightarrow l_{k} [j_{[k - s]} ] \) is saved; otherwise it is discarded.

Finally, we check whether all the rules of \( l_{k} [j] \) satisfy the given confidence threshold. If so, \( l_{k} [j] \) is added to \( AS_{k} \); if not, it is discarded.
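Combining Eq. (5) with the two filtering steps, a frequent keyword set is promoted to an association keyword set only if every split into antecedent and consequent passes \( min\_conf \). A minimal sketch:

```python
from itertools import combinations

def sup(itemset, stories):
    return sum(set(itemset) <= set(s) for s in stories) / len(stories)

def is_association_set(l_k, stories, min_conf):
    """Check every rule l_k[j_[s]] => l_k[j_[k-s]] for 1 <= s < k."""
    for s in range(1, len(l_k)):
        for antecedent in combinations(l_k, s):
            # Eq. (5): conf = sup(l_k[j]) / sup(l_k[j_[s]])
            if sup(l_k, stories) / sup(antecedent, stories) < min_conf:
                return False
    return True
```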

5 Topic Relationship Detection Using Parallel Association Rules

We propose the PARMTRD method to detect relationships among multiple topics. First, PARMTRD selects the public opinion data related to each topic. Second, PARMTRD applies parallel association rules to the data of each topic to obtain its association keyword sets, processing the multiple topics in parallel. Finally, PARMTRD uncovers the hidden relationships among the topics by selecting and assembling the association keyword sets.

The PARMTRD algorithm is described as follows:

Algorithm 1. PARMTRD

Input: the relevant data sets for all topics

Output: the association keyword sets for all topics

  (1) Scan the relevant datasets of all topics and filter out the dataset corresponding to each topic according to its TOP keyword.

  (2) Obtain the \( L_{1} \) of each topic that satisfies \( min\_sup \) by counting the frequency of each of the topic's keywords in the corresponding dataset. At this point \( k = 1 \).

  (3) Treat the processing of each topic as one subtask. Copy the \( L_{1} \) corresponding to each topic into its subtask, and perform steps 4 to 8 for each subtask.

  (4) Set \( k = k + 1 \). \( AS_{k - 1} \) is divided into \( N \) subtasks, which independently form \( C_{k} \).

  (5) Obtain \( L_{k} \) by counting the frequency of each \( c_{k} \) in \( C_{k} \) and keeping those that satisfy \( min\_sup \).

  (6) Obtain the global variable \( L_{k} \) by combining the results of the \( N \) separate subtasks and removing duplicates, then generate the association rules.

  (7) Obtain \( AS_{k} \) by selecting the \( l_{k} \) whose association rules all satisfy \( min\_conf \).

  (8) Repeat steps 4 to 7 until \( AS_{k + 1} \) is empty, and record the maximum size of an association keyword set as \( n \).

  (9) Obtain all association keyword sets by combining the n_item association keyword sets and removing duplicates; the related information from the multiple topics is then used to detect their potential relevance, where \( 2 \le n \le k \).
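Assuming the helper functions sketched in Sect. 4 (`mine_L1`, `gen_candidates`, `mine_Lk`, `is_association_set`), the core of Algorithm 1 condenses to the loop below. A process pool stands in for the MapReduce setting, and the further split of each topic's \( AS_{k - 1} \) into \( N \) subtasks is elided for brevity.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def parmtrd_topic(stories, min_sup, min_conf):
    """Mine all association keyword sets of one topic (steps 2, 4-8)."""
    L1 = mine_L1(stories, min_sup)                 # step (2)
    AS_prev, all_AS = L1, set()
    while AS_prev:
        Ck = gen_candidates(AS_prev, L1)           # step (4)
        Lk = mine_Lk(Ck, stories, min_sup)         # steps (5)-(6)
        AS_prev = {l for l in Lk                   # step (7)
                   if is_association_set(l, stories, min_conf)}
        all_AS |= AS_prev                          # repeat until empty, step (8)
    return all_AS

def parmtrd(topic_datasets, min_sup, min_conf):
    """Step (3): run the per-topic mining as parallel subtasks."""
    worker = partial(parmtrd_topic, min_sup=min_sup, min_conf=min_conf)
    with ProcessPoolExecutor() as pool:
        return list(pool.map(worker, topic_datasets))
```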

6 Experiments and Evaluation

6.1 Experimental Data

A web spider was used to collect news stories on 7 topics from 2017/4/1 to 2017/4/28, with 50 to 300 stories per topic. We used Ansj to extract 10, 15, 20, and 25 keywords from each story separately, and found that extracting 15 keywords best expresses a topic. We therefore selected 15 keywords from each story as experimental data. The dataset for the topics is shown in Table 1.

Table 1. Dataset for topic relationship mining

6.2 Evaluation

Set the Threshold of Support

Taking the topic "US military strike Syria" as an example, the association keyword sets for different support thresholds are shown in Table 2, with \( min\_conf = 0.60 \).

Table 2. Association keyword sets under different support thresholds

We can see that when \( min\_sup \) is between 0.10 and 0.13, the association keyword sets contain all the necessary information about the topic. When \( min\_sup = 0.14 \), the association keyword sets {strike, military} and {United States, strike} are lost, so some information about the topic is missed. When \( min\_sup = 0.09 \), an additional association keyword set such as {terror, terrorism} appears; it has no relationship with the other association keyword sets, so it is not necessary information about the topic. We therefore take 0.13 as the support threshold in the experiments.

Set the Threshold of Confidence

We randomly pick the keywords of two topics as experimental data and set 0.12, 0.15, and 0.18 as support thresholds. We study how the association keyword sets are influenced by different support thresholds and by the number of stories. Table 3 presents the support thresholds for different numbers of stories, and Table 4 shows the resulting association keyword sets. From Table 3 we obtain the trend graph (Fig. 2), which shows the relationship between the confidence thresholds and the number of stories.

Table 3. Support thresholds for different numbers of stories
Table 4. Association keyword sets for different numbers of stories
Fig. 2. Trend graph of the relationship between confidence thresholds and the number of stories

Increasing the number of stories increases the number of association keyword sets for a topic. To filter out redundant keyword sets, we should therefore reduce the confidence threshold as the number of stories increases.

When the dataset is fixed, increasing the support threshold filters out some valuable keyword sets, whereas reducing it yields redundant association keyword sets. Thus, for the same number of stories, the polylines with higher support thresholds always lie below those with lower ones. That is, the higher the support threshold, the lower the confidence threshold should be, and vice versa.

6.3 The Results of Topic Relationship Detection

This experiment takes the "explosion" theme as an example, with three topics: the explosion in St. Petersburg, the explosion at a church in Egypt, and the explosion targeting the bus of a German football team. The association keyword sets of each topic are obtained from the public opinion dataset. Table 5 presents the specific parameter settings and the experimental results.

Table 5. The parameter setting and experimental results

We treat each keyword in the association keyword sets as a data node (repeated keywords share a single node). Keywords in the same association keyword set are linked, building up the relationships among the 3 topics. Figure 3 shows the topology of the relationships among the three "explosion" topics.

Fig. 3. The relationship topology for the three "explosion" topics mined by PARMTRD

From Table 5 and Fig. 3 we can see that there are obvious relationships among the three topics under the "explosion" theme. All three topics include the keywords "explosion", "attack", "happen", "Islamic State", and "terror", suggesting that the three events may all be related to attacks by the terrorist organization "Islamic State". In addition, the keywords {unidentified, Islamic State} in the bus explosion topic also point to the actual cause of that case.

6.4 Comparison and Evaluation

To verify that PARMTRD can accurately and efficiently detect relationships among multiple topics, we compare it with the word co-occurrence graph [25].

The word co-occurrence graph is based on words co-occurring in the same time slice. First, we divide the data by time slice. Second, to obtain a keyword set for each time slice, we select keywords according to their frequency in the current time slice and in the previous time slice. The topic keyword sets are then obtained by integrating the keyword sets of all time slices. Finally, we calculate the word co-occurrence value for each pair of keywords.

If the co-occurrence value of a pair exceeds a given threshold, we add a link between the two keywords. The experimental results are shown in Fig. 4, where each connected subgraph is a cluster representing one topic.
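For reference, the baseline can be sketched as follows; the use of networkx is our assumption, since the cited work does not prescribe a library, and each connected component is read as one topic cluster.

```python
from collections import Counter
from itertools import combinations
import networkx as nx

def cooccurrence_clusters(stories, min_cooc):
    """Count how often each keyword pair co-occurs in a story; link the
    pairs above the threshold and return the connected components."""
    pair_counts = Counter()
    for story in stories:
        pair_counts.update(combinations(sorted(set(story)), 2))
    g = nx.Graph()
    g.add_edges_from(p for p, c in pair_counts.items() if c >= min_cooc)
    return list(nx.connected_components(g))
```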

Fig. 4. Keyword co-occurrence network graph

In Fig. 4, the keywords of the word co-occurrence graph are clearly divided into three clusters, in sharp contrast to Fig. 3. Each cluster represents one topic, and the clusters have no links to each other, meaning the method fails to detect the relationships among the topics. As shown in Table 6, PARMTRD obtains more keywords than the word co-occurrence method and each node has a higher degree, which means PARMTRD extracts more information about the topics and more relevance among them.

Table 6. Comparison between PARMTRD and Word co-occurrence

7 Conclusion and Future Work

Mining topic relationships from social big data is valuable for understanding the originating sources of specific events, yet research in this direction is lacking. This paper proposes a mining approach for multiple-topic relationship detection based on parallel association rules, called PARMTRD. It mines the association keyword sets of multiple topics with parallel association rules from public opinion data of low value density. By selecting and assembling the association keyword sets, it obtains related information from multiple topics and detects their potential relevance. The experiments show that our approach can discover the root causes of multiple events, yielding valuable information that existing approaches cannot mine.

Our future work focuses on the tracking and early warning of multiple topics. We aim to grasp the dynamic development and evolution trends of multiple topics over time in the presence of topic drift, and then predict the unknown public opinion behind them.