1 Introduction

Organizations generate data daily that could be used to improve their performance, yet many of them lack the in-house expertise required to analyse these data. This lack of expertise is particularly prominent in resource-constrained environments such as rural or remote developing-world regions, where limited access to computational power, reliable electricity and the Internet poses a further challenge. A cost-effective solution is to outsource the data to a professional third-party data analytics service provider.

A study [1] carried out in technologically resource-constrained environments revealed that collected crime data are usually not studied or analysed to support crime resolution. A possible reason for this is the lack of the necessary in-house expertise, both in terms of human capital and computational processing power [5, 15, 24]. This deprives policy makers in these regions of the benefits that could be derived through data analytics. A possible solution is to involve a third-party data analytics service provider [1, 2]. However, because of the sensitive nature of crime data, the outsourced data must be protected from all unauthorized access, including that of an honest-but-curious data mining service provider. This paper therefore focuses on developing a test-bed framework for preserving privacy during real-time information sharing, using the crime domain as an application scenario. It is important to stress, however, that the ideas and approaches considered in this study are equally applicable to other domains.

2 Related Work

A naive approach to preserving privacy or anonymity in data is to exclude explicit identifiers such as the name and/or identification number. However, linking attacks aimed at de-anonymising the data can be mounted successfully by combining non-explicit identifiers (such as date of birth, address and sex) with external or publicly available data [3, 25, 26]. To illustrate how a linking attack works, consider Fig. 1, which shows two data compartments. The upper compartment contains a portion of a publicly available table in which "name" is an explicit identifier attribute, while the lower compartment shows a portion of a data stream that has been sanitized to exclude the explicit identifier (name) in order to disguise the identities of the individuals associated with the data. However, when a join is performed on the attributes common to both compartments, the supposedly anonymized individual is successfully re-identified as Ade, who lives at 10 Pope Street, and her sensitive information, namely that she has been a victim of rape, is revealed.

Fig. 1. Illustration of linking attack
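The attack can be reproduced in a few lines of code. The following is a minimal Python sketch (not part of the study's implementation) that joins a public table with a sanitized stream on shared quasi-identifiers using pandas; the attribute names and records are illustrative, loosely following Fig. 1.

import pandas as pd

public_table = pd.DataFrame([
    {"name": "Ade",  "dob": "1990-05-01", "sex": "F", "address": "10 Pope Street"},
    {"name": "Bola", "dob": "1985-11-23", "sex": "M", "address": "3 Main Road"},
])

sanitized_stream = pd.DataFrame([          # explicit identifier (name) removed
    {"dob": "1990-05-01", "sex": "F", "address": "10 Pope Street", "crime": "rape"},
    {"dob": "1985-11-23", "sex": "M", "address": "3 Main Road",    "crime": "burglary"},
])

# Joining on the non-explicit identifiers re-attaches names to sensitive values.
linked = public_table.merge(sanitized_stream, on=["dob", "sex", "address"])
print(linked[["name", "address", "crime"]])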

According to Sweeney [18, 19], 87% of the population of the United States could be uniquely identified by combining non-explicit identifiers such as gender, zip code and date of birth from the 1990 census dataset using a linking attack. Sweeney et al. therefore proposed a better approach, named k-anonymity, which anonymizes data in a manner that minimizes linking attacks.

K-anonymity preserves privacy by hiding each individual in a cluster that contains at least k individuals, so that an adversary cannot learn information about a specific individual but only about a group of k individuals [18, 20]. To understand how k-anonymity works, assume an attacker tries to identify a friend in a k-anonymized table, but the only information he has is her date of birth and gender. K-anonymity makes it difficult for the adversary to identify the individual by guaranteeing that at least k people share the same date of birth and gender, thus bounding the success probability of a linking attack to at most 1/k. K-anonymity algorithms can generally be grouped into two categories, namely hierarchy-based generalization and hierarchy-free generalization [3]. In hierarchy-based generalization, anonymization requires a generalization tree as an input to guide the process, whereas in hierarchy-free generalization no generalization tree is required; instead, the algorithm relies on clustering concepts and heuristics.
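As a concrete illustration of the definition above, the following minimal Python sketch (our own illustration, not the authors' algorithm) checks whether a table satisfies k-anonymity by requiring every equivalence class over the chosen quasi-identifiers to contain at least k records; the toy records are assumptions made purely for illustration.

from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    # records: list of dicts; quasi_identifiers: attribute names; k: threshold.
    class_sizes = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(size >= k for size in class_sizes.values())

rows = [
    {"dob": "199*", "sex": "F", "crime": "rape"},
    {"dob": "199*", "sex": "F", "crime": "assault"},
    {"dob": "198*", "sex": "M", "crime": "burglary"},
    {"dob": "198*", "sex": "M", "crime": "theft"},
]
print(is_k_anonymous(rows, ["dob", "sex"], k=2))  # True: each class has 2 records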

The evolution of k-anonymity has led to the birth of newer privacy models that address its inherent limitations. Two popular models that extend k-anonymity are \(\ell \)-diversity and t-closeness [29, 30]. The main purpose of \(\ell \)-diversity is to address the homogeneity attack to which k-anonymity is vulnerable; it does this by requiring that each cluster in a k-anonymized table contains at least \(\ell \) distinct sensitive values [29]. T-closeness further complements \(\ell \)-diversity by requiring that the distribution of sensitive values in each cluster is close to that of the entire anonymized table [30].
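The two extensions can be checked in the same spirit. The sketch below (again our own illustration) tests \(\ell \)-diversity by counting distinct sensitive values per equivalence class, and approximates the t-closeness test with total variation distance as a simple stand-in for the Earth Mover's Distance used in the original t-closeness proposal.

from collections import Counter, defaultdict

def equivalence_classes(records, qi):
    # Group records by their quasi-identifier values.
    grouped = defaultdict(list)
    for r in records:
        grouped[tuple(r[a] for a in qi)].append(r)
    return grouped.values()

def is_l_diverse(records, qi, sensitive, l):
    # Every class must contain at least l distinct sensitive values.
    return all(len({r[sensitive] for r in ec}) >= l
               for ec in equivalence_classes(records, qi))

def is_t_close(records, qi, sensitive, t):
    # Every class's sensitive-value distribution must stay within distance t
    # of the overall distribution (total variation distance used here).
    def dist(rs):
        counts = Counter(r[sensitive] for r in rs)
        return {v: c / len(rs) for v, c in counts.items()}
    overall = dist(records)
    for ec in equivalence_classes(records, qi):
        local = dist(ec)
        tvd = 0.5 * sum(abs(local.get(v, 0.0) - overall.get(v, 0.0))
                        for v in set(overall) | set(local))
        if tvd > t:
            return False
    return True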

An equally fast-growing privacy-preservation technique is differential privacy. Differential privacy achieves anonymization by altering the data with the addition of mathematical "noise" [13]; in other words, privacy is preserved through the "difference" between the data supplied and the noise added to it. Interestingly, recent research [16, 17] has shown that the use of t-closeness with k-anonymity can yield privacy results similar to those of differential privacy. In this research we focus on k-anonymity and its complementary techniques because of the simplicity [29], effectiveness [19] and high utility [31] they offer, especially when compared to an evolving counterpart such as differential privacy. In addition, recent research [31, 32] has shown that differential privacy is achieved as long as a dataset is anonymized using k-anonymity and t-closeness. This paper therefore focuses on the use of k-anonymity, \(\ell \)-diversity and t-closeness to achieve anonymization.
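For contrast, the tiny sketch below shows how differential privacy perturbs a query answer with calibrated Laplace noise. It is illustrative only, since this paper does not adopt differential privacy, and the counting query and epsilon value are assumptions.

import random

def laplace_noise(scale):
    # A Laplace(0, scale) sample as the difference of two i.i.d. exponentials.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def noisy_count(true_count, epsilon, sensitivity=1.0):
    # Laplace mechanism for a counting query (sensitivity 1).
    return true_count + laplace_noise(sensitivity / epsilon)

print(noisy_count(42, epsilon=0.5))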

Adapting k-anonymity and its complementary techniques to data streams (real-time data) has led current research to integrate a buffering (or sliding-window) mechanism and a delay constraint into data stream anonymization [3, 4, 22, 25,26,27]. The buffer holds a portion of the data stream at each instant of time, after which an anonymization algorithm is applied to the data in the buffer. The delay constraint puts a check on each tuple so that it does not stay in the buffer beyond a pre-defined deadline. In spite of this, many of the existing algorithms adapted for the anonymization of data streams face the following challenges:

  • First, buffering according to delay constraints can result in certain records being held in the buffer for long periods [3, 8, 10]. When such records are time-sensitive or need to be processed in real time, this delay usually results in high levels of information loss. Since a key requirement of a good anonymization scheme is high data utility, high levels of information loss due to expired tuples or dropped (suppressed or unanonymizable) records are undesirable.

  • Second, building on the first problem, we note that many of the existing data stream anonymization schemes based on k-anonymity and its derivatives do not take the distribution of future data streams into consideration during anonymization [4]. The implication is that a record that could be anonymized with lower information loss in a future sliding window is instead anonymized with the current one. Therefore, there is a need for a model that can predict the best sliding window (or stream) with which a record should be anonymized.

Therefore, the focus of this paper is to present a data-stream anonymization framework that addresses the aforementioned challenges inherent in existing frameworks. A more detailed literature review can be found in [14].

3 Data Stream Anonymization Framework

Figure 2 presents an overview of our data stream anonymization framework, using the crime domain as an application scenario. Users make crime reports electronically and the reports are anonymized in real time at the anonymization layer. The results from the anonymization layer are transferred to a third party for data mining at the application layer.

Fig. 2. System overview

3.1 Users Layer: Crime Reporting Layer

As noted in previous sections, this research considers the crime domain as an application scenario for achieving data stream anonymization. However, the ideas in this research extend to any other domain that requires real-time anonymization of sensitive data. To create an application that allows people to report crime in a secure and covert way, we converted the existing paper-based crime reporting system of a university campus in South Africa into a digitized crime reporting system. We chose mobile devices as the reporting platform because they provide a good security platform for crime reporting [11, 23].

In order to ensure that our mobile crime reporting system is acceptable and usable in real life, we applied an iterative user-centered design methodology. To this end, we interviewed key stakeholders within law enforcement agencies as well as crime victims, and went through several design iterations until we arrived at a final prototype acceptable to all stakeholders. Figure 3 presents screenshots of our final prototype. More details about the development and deployment of the crime reporting application, CryHelp, can be found in [5].

Fig. 3. Screenshot of our crime reporting application

3.2 Anonymization Layer: Data Stream Anonymization

In our proposed crime reporting application scenario, data arrive in the form of streams and contain information that is analyzed for statistical or data mining predictions. These data are stored temporarily in a buffer so that anonymization can take place. The buffer holds portions of the continuous data stream subject to delay constraints that specify how long tuples may remain in the buffer before anonymization takes place. As illustrated in Fig. 4, the buffer optimizer uses a time-based sliding window and Poisson probability to monitor the data in the buffer, ensuring that tuples are anonymised before the expiry time threshold is reached, while Fig. 5 illustrates the data anonymizer, which uses k-anonymity, \(\ell \)-diversity and t-closeness for privacy preservation. We opted for the Poisson distribution because it models the number of successes that occur within a given unit of measure. This property allows the arrival of reported crime data to be viewed as a series of events occurring within a fixed time interval at an average rate that is independent of the time since the last event [6]. Only one parameter needs to be known: the rate at which the events occur, which in our case is the rate at which crime reports arrive.
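A minimal sketch of the buffering idea, under an assumed in-memory data model (it is not the framework's actual code), is shown below: tuples are timestamped on arrival and the buffer is flushed to the anonymizer before the oldest tuple reaches the expiry threshold.

import time
from collections import deque

class StreamBuffer:
    def __init__(self, expiry_threshold_s):
        self.expiry_threshold_s = expiry_threshold_s
        self.tuples = deque()          # (arrival_time, record) pairs

    def add(self, record):
        self.tuples.append((time.time(), record))

    def must_flush(self):
        # Flush when the oldest tuple is about to reach its deadline.
        if not self.tuples:
            return False
        oldest_arrival, _ = self.tuples[0]
        return time.time() - oldest_arrival >= self.expiry_threshold_s

    def flush(self):
        batch = [record for _, record in self.tuples]
        self.tuples.clear()
        return batch                   # handed to the data anonymizer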

Fig. 4. Details of anonymization layer: buffer optimizer

Fig. 5. Details of anonymization layer: data anonymizer

Fig. 6. Overview of dynamic buffer sizing process

4 Adaptive Buffer Re-sizing Scheme

As illustrated in Fig. 6, a "sliding window (buffer)", \(\text {sw}_i\), is a subset of the data stream, DS, where \(\text {DS} = \{\text {sw}_{1},\, \text {sw}_{2},\, \text {sw}_{3},\ldots ,\, \text {sw}_{m}\}\) implies that the data stream consists of m sliding windows. The sliding windows obey a total ordering such that for \(i<j\), \(\text {sw}_{i}\) precedes \(\text {sw}_{j}\). Each sliding window \(\text {sw}_i\) only exists for a specific period of time T and consists of a finite and varying number of records, n, so that \(\text {sw}_{i}=\{R_{0},\ldots ,R_{n-1}\}\).

Our adaptive buffer re-sizing scheme, as illustrated in Fig. 7, is organized into six phases; a detailed explanation of how each phase works is given in our earlier publication [12]. We summarize them as follows. First, we set the size of the buffer to some initial threshold value. Given the time-sensitivity of the data, we set the size of the sliding window, \(\text {sw}_i\), to a value T, where T is a time value bounded by a lower bound \(t_l\) and an upper bound \(t_u\). The anonymization algorithm is applied to the data collected in the sliding window \(\text {sw}_i\) during the period T, so essentially the duration of \(\text {sw}_{i}\) is T. All records from \(\text {sw}_i\) that are not anonymizable are either included in a subsequent sliding window, say \(\text {sw}_{i+1}\), or incorporated into already anonymized clusters with similar content.

Fig. 7. Phases of adaptive buffer re-sizing scheme

In order to determine whether or not an unanonymizable record can be included in a subsequent sliding window, say \(\text {sw}_{i+1}\), we compute its expiry time \(T_{E}\) and compare its value to the bounds for acceptable sliding window sizes, \([t_l,t_u]\). We compute \(T_{E}\) as follows:

$$\begin{aligned} T_{E} = \text {sw}_{i} - T_{S} - T_{A} \end{aligned}$$
(1)

where \(T_{E}\) is the time to expiry of a record, \(T_{S}\) is the time for which the record was stored in \(\text {sw}_i\), and \(T_{A}\) is the time required to anonymize the data in \(\text {sw}_i\).
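A small sketch of Eq. (1) under the definitions above; all arguments are assumed to be in the same time unit (e.g. milliseconds), and the sample values are illustrative.

def time_to_expiry(window_duration, time_stored, time_to_anonymize):
    # Eq. (1): T_E = sw_i - T_S - T_A, all in the same time unit.
    return window_duration - time_stored - time_to_anonymize

# e.g. a 5000 ms window, record buffered for 1200 ms, anonymization takes 300 ms
print(time_to_expiry(5000, 1200, 300))   # 3500 ms left before the record expires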

Starting with the unanonymizable record, \(R_i\), that has the lowest \(T_{E}\) whose value falls within the acceptable bound \([t_l,t_u]\), we count the other unanonymizable records, n, that belong to the same data anonymization group as \(R_i\). We then find the rate of arrival, \(\lambda \), of data in that anonymization group, computing the expected number of records of that group arriving within time \(T_{E}\) as follows:

$$\begin{aligned} \lambda = \frac{n}{\text {sw}_{i}} \times T_{E} \end{aligned}$$
(2)

The expected arrival rate, \(\lambda \), is used to determine the probability that at least the required number of records arrives, which guarantees that delaying the anonymization of the unanonymizable record, \(R_i\), to the next sliding window \(\text {sw}_{i+1}\) will not adversely increase information loss. We next determine the minimum number of records, m, required to guarantee the anonymization of \(R_i\) in the next sliding window, \(\text {sw}_{i+1}\). When the decision is to include \(R_i\) in the next sliding window \(\text {sw}_{i+1}\), we then need to compute the optimal size for \(\text {sw}_{i+1}\) in order to minimize information loss from record expiry. We achieve this by finding the probability that m records will actually arrive in the data stream within time \(T_{E}\) to anonymize the unanonymizable record \(R_i\). We use Eq. 3 to compute the probability of having exactly \(i\) records, for \(i = 0, \ldots, m-1\), arrive in the stream within \(T_{E}\):

$$\begin{aligned} pr(i) = \frac{\lambda ^{i}\, e^{-\lambda }}{i!} \end{aligned}$$
(3)

where \(\lambda \) is the expected arrival rate, e is the base of the natural logarithm (i.e. \(e \approx 2.71828\)) and \(i\) is the number of records under observation.

Therefore the probability of having m or more records arrive in the stream within time \(T_{E}\) is

$$\begin{aligned} 1 - \sum _{i=0}^{m-1} pr(i) \end{aligned}$$
(4)

where \(pr(i)\) is the probability computed using Eq. 3.

If the result of Eq. 4 is greater than a preset probability threshold, \(\delta \), we set the size of the subsequent sliding window, \(\text {sw}_{i+1}\), to the expiry time of the unanonymizable record under consideration. We then mark the unanonymizable record for inclusion in the subsequent sliding window, along with other unanonymizable records whose \(T_{E}\) falls within the bounds for acceptable sliding window sizes \([t_l,t_u]\). In the event that the probability for all unanonymizable records is below the preset probability threshold, we set the subsequent sliding window size to a random value within the time bound \([t_l,t_u]\). A more detailed explanation of the adaptive buffer scheme can be found in our previous work [12].
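Putting Eqs. (2)-(4) together, the following sketch shows one possible reading of the decision rule (not the authors' code); the variable names follow the paper, while the sample values for n, the window duration, m and \(\delta \) are assumptions made for illustration.

import math
import random

def expected_arrivals(n_group, window_duration, time_to_expiry):
    # Eq. (2): lambda = (n / sw_i) * T_E, the expected number of same-group
    # records arriving before the unanonymizable record expires.
    return (n_group / window_duration) * time_to_expiry

def prob_at_least(m, lam):
    # Eqs. (3)-(4): P(at least m arrivals) = 1 - sum of the first m Poisson terms.
    pmf = lambda i: math.exp(-lam) * lam ** i / math.factorial(i)
    return 1.0 - sum(pmf(i) for i in range(m))

def next_window_size(T_E, m, lam, delta, t_l, t_u):
    # Size the next window to the record's expiry time if enough helper records
    # are likely to arrive; otherwise fall back to a random size in [t_l, t_u].
    if t_l <= T_E <= t_u and prob_at_least(m, lam) > delta:
        return T_E
    return random.uniform(t_l, t_u)

lam = expected_arrivals(n_group=4, window_duration=5000, time_to_expiry=3500)
print(prob_at_least(2, lam))                                   # about 0.77
print(next_window_size(3500, 2, lam, delta=0.7, t_l=2000, t_u=5000))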

5 Experiments and Results

This section presents the implementation and results of the crime-reporting application, CryHelp, and of the adaptive buffering scheme algorithm.

5.1 CryHelp Application Evaluation

In order to evaluate the usability of our mobile crime reporting application, CryHelp, we developed a questionnaire based on the IBM CSUQ [23]. The advantage of the IBM CSUQ is that it divides responses into scores for specific categories: System Overall, System Usefulness, Information Quality and Interface Quality. These categories allow each component of the system to be evaluated individually, to gauge which aspects perform well or poorly on average. The results directly address the question of whether a mobile device can be used to send a crime report effectively and securely.

Fig. 8. The chart of the questionnaire score breakdown, with standard deviation 0.05

Figure 8 shows the result for each component of the system. It can be seen that overall the system was well received, with an overall system score of 77.06%, suggesting that users found the system very usable; the standard deviation across the contributing scores (System Usefulness, Information Quality and Interface Quality) was 0.05. It is not surprising that interface quality (78.33%), though marginally, is the most appreciated aspect of the system, as the design process was centered on the users. These results bode very well for the feasibility of a mobile solution for crime reporting.

5.2 Experiments on Anonymization

Our feasibility study and the experiment conducted on the prototype crime data collection application, CryHelp [5], informed the generation of additional datasets for the second phase of experiments. The additional crime data were generated using a random data generator and a pseudo-random algorithm based on a Gaussian distribution to populate the crime data stream, based on ground truth provided by the users, the UCT Campus Protection Service and the South African Police Service. The data comprise two sets containing 1,000 and 10,000 records respectively, which is a reasonable bound for daily average crime report rates in South Africa [5].
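A hedged sketch of how a Gaussian-based pseudo-random generator might populate such a synthetic crime stream is given below; the field names, categories and distribution parameters are illustrative and not those used in the actual study.

import random

CRIME_TYPES = ["theft", "assault", "burglary", "robbery"]
LOCATIONS = ["campus residence", "library", "parking lot", "sports field"]

def synth_record(rng):
    # One synthetic crime report; the age is drawn from a clipped Gaussian.
    return {
        "age": max(16, int(rng.gauss(mu=24, sigma=6))),
        "sex": rng.choice(["F", "M"]),
        "location": rng.choice(LOCATIONS),
        "crime": rng.choice(CRIME_TYPES),
    }

rng = random.Random(42)                                   # reproducible stream
dataset_1 = [synth_record(rng) for _ in range(1000)]      # 1,000-record set
dataset_2 = [synth_record(rng) for _ in range(10000)]     # 10,000-record set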

This section therefore discusses the gains obtained by using the Poisson probability distribution to predict how long a sliding window should exist, while ensuring that records do not expire, that the number of unanonymizable (or suppressed) records is minimal, and that privacy is maintained using k-anonymity, \(\ell \)-diversity and t-closeness. The gains are explained in the following sub-sections.

Recovered Unanonymizable Tuples: During anonymization there is usually a trade-off between the rates of information loss (IL), suppression and generalization. If an equivalence class (cluster) is unable to satisfy the privacy requirement, the class is either merged with another class or all its records are suppressed. A higher suppression rate implies that vital information is likely to be concealed from the recipient of the anonymized table, while merging classes increases IL, which has the drawback of lower data utility. To curb this, the Poisson probability distribution predicts the chance of such unanonymizable (suppressed) records undergoing anonymization in the next sliding window, in a manner that preserves privacy and maximizes data utility while minimizing delay or the expiration of records.

Figures 9 and 10 show the rate at which unanonymizable records were anonymized again, based on the predictions of the Poisson probability distribution. It is evident from the figures that many unanonymizable records were recovered and passed on for anonymization again. It was also observed that the probability threshold influenced the number of unanonymizable records recovered: the higher the probability threshold, the lower the probability that unanonymizable records are given a chance to be reconsidered for anonymization in subsequent sliding window(s). The implication is that fewer records are given the chance of another round of anonymization when higher threshold values are used. Another observation is that with a higher threshold value there are fewer changes or movements of records between sliding windows.

Fig. 9. Poisson probability threshold versus recovered tuples for dataset 1

Fig. 10. Poisson probability threshold versus recovered tuples for dataset 2

Privacy Value/Level Versus Recovered Unanonymizable Tuples: "Privacy level" simply means the degree of anonymity offered, while unanonymizable tuples are those tuples that belong to an equivalence class whose size is less than k. To accommodate sliding windows that start with a small number of tuples, the minimum privacy level threshold was set at k = 2 and the maximum at k = 15; the \(\ell \)-diversity value, \(\ell \), was varied between 3 and 5, and the t-closeness value, t, was alternated between 0.1 and 0.15 for the two datasets.

As illustrated in Figs. 11 and 12, it was observed that as the privacy value/level increases, the number of unanonymizable records that can be recovered using the Poisson probability prediction decreases. The main reason is that as the privacy level increases, achieving anonymization becomes increasingly challenging, which in turn lowers the chances of anonymizing unanonymizable records. To better understand the decline in the rate of recovered unanonymizable records, assume the privacy level is set to k = 3 and an equivalence class, \(\text {EC}_i\), has two records; this implies that at least one more record is needed for \(\text {EC}_i\) to satisfy k-anonymity. Using the Poisson probability, the adaptive buffer re-sizing model predicts the chance that at least one record belonging to \(\text {EC}_i\) arrives within time t in the next sliding window. If k is set to four, the model must instead predict the chance of at least two records arriving in the next sliding window, which is a more demanding event than the arrival of just one record. This explains why the model shows a drop in recovered unanonymizable records as the privacy level increases. Therefore, the rate at which unanonymizable records in a current sliding window can be anonymized in a subsequent window depends mainly on the privacy value.
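The worked example can be checked numerically. The sketch below compares the probability of at least one versus at least two further arrivals for the same expected arrival rate, confirming that the latter is always smaller; the rate used here is an assumed value chosen purely for illustration.

import math

def prob_at_least(m, lam):
    # P(at least m Poisson(lam) arrivals) = 1 - sum of the first m pmf terms.
    pmf = lambda i: math.exp(-lam) * lam ** i / math.factorial(i)
    return 1.0 - sum(pmf(i) for i in range(m))

lam = 1.5                             # assumed expected arrivals within T_E
print(prob_at_least(1, lam))          # k = 3, two records held: about 0.78
print(prob_at_least(2, lam))          # k = 4, two records held: about 0.44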

Fig. 11. Relationship between privacy level and recovered tuples for dataset 1

Fig. 12. Relationship between privacy level and recovered tuples for dataset 2

5.3 Benchmarking: Comparing the Poisson Solution with Non-Poisson Solutions

As a baseline for evaluating our proposed adaptive buffering scheme, we implemented proactive-FAANST and passive-FAANST. These algorithms are a good comparison benchmark because they represent the current state of the art in streaming data anonymization and reduce IL with minimal delay and few expired tuples [3]. Proactive-FAANST decides whether an unanonymizable record will expire if included in the next sliding window, while passive-FAANST searches for unanonymizable records that have already expired. A major drawback of both variants is that there is no way of deciding whether such unanonymizable records would actually be anonymizable during the next sliding window, which is necessary to avoid repeatedly cycling a tuple that has a low chance of anonymization in subsequent sliding window(s). Moreover, these algorithms do not consider that the flow or speed of a data stream could change. These weaknesses of proactive-FAANST and passive-FAANST are what we address by using the Poisson probability distribution to predict whether such tuples would be anonymizable in subsequent sliding window(s), taking into consideration the arrival rate of records, the success rate of anonymization per sliding window, the time a tuple can exist and the rate of suppressed records.

Expired Tuples and Information Loss in Delay: A tuple expires when it remains in the system for longer than a pre-specified threshold called the delay [3, 10]. To decide whether a tuple has exceeded its time-delay constraint, additional attributes such as arrival time, expected waiting time and entry time were included. As a heuristic, the choice of delay values, \(t_l = 2000\, \text {ms}\) and \(t_u = 5000\, \text {ms}\), was guided by delay values used in published experimental results [3].

In general, our approach produces fewer expired tuples than the passive-FAANST and proactive-FAANST solutions. This is because, before our Poisson prediction transfers suppressed records to another sliding window, it checks the likelihood of their anonymization. In the other solutions, there is no mechanism to check the likelihood that a suppressed record is anonymizable before allowing it to go to the next sliding window/round; as a result, such tuples are sent to the next sliding window and have a high tendency to expire eventually. Our solution also shows that the lower the k-value, the higher the number of expired tuples. This is because the outcome of the Poisson prediction is lower for higher k-values; as a result, there are fewer sliding-window changes as the k-value increases, which means a lower possibility of expired tuples.

One of the main goals of our solution is to reduce IL in delay (i.e. to lower the number of expired tuples). Figure 13 shows that our solution achieves this goal: the IL (delay) of our solution is lower than that of the passive and proactive solutions. To determine the total number of expired records, a simple count function retrieved all records that had remained in the buffer longer than the upper limit threshold, \(t_u\). To determine the average number of expired records, we summed the expired records over all experiments and divided by the total number of experiments.
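A minimal sketch of this bookkeeping (the record layout is an assumption made for illustration): count, per experiment, the records whose residence time exceeded \(t_u\), then average across experiments.

def count_expired(records, t_u_ms):
    # Records that stayed in the buffer beyond the upper delay bound t_u.
    return sum(1 for r in records if r["time_in_buffer_ms"] > t_u_ms)

def average_expired(experiments, t_u_ms=5000):
    counts = [count_expired(records, t_u_ms) for records in experiments]
    return sum(counts) / len(counts)

runs = [
    [{"time_in_buffer_ms": 5400}, {"time_in_buffer_ms": 3100}],
    [{"time_in_buffer_ms": 4800}, {"time_in_buffer_ms": 6200}],
]
print(average_expired(runs))   # 1.0 expired record on average across the two runs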

Fig. 13. Privacy level versus expired tuples for Poisson solution, passive-FAANST and proactive-FAANST

Data Utility and Information Loss in Record: An important factor considered in anonymization is the degree to which anonymized data remain usable for data analysis or data mining tasks [28]. We therefore compared the degree of IL in records of our solution with that of passive-FAANST and proactive-FAANST. Our results, as illustrated in Fig. 14, show that at the minimal level of privacy enforcement the information loss of our solution is on par with the other two schemes, while at the maximal level our solution offers better data utility.

Fig. 14. Privacy level versus information loss for Poisson solution, passive-FAANST and proactive-FAANST

6 Conclusions

We started this paper by noting that resource-constrained environments lack the data analytics expertise needed to analyse and mine crime data in real time. Anonymization is important because it enables a third party with this expertise to carry out the analysis in a timely fashion without compromising privacy. We adopted k-anonymity, \(\ell \)-diversity and t-closeness as our anonymization scheme because of their simplicity, efficiency and applicability in real life. However, the current literature on integrating these techniques with data streams has issues in terms of performance and privacy; the performance issue concerns information loss in terms of delay and running cost.

To address the challenge of ensuring that delay is optimal during the anonymization process, we adaptively re-sized the buffer, using the Poisson distribution, to handle intermittent flows of crime reporting traffic optimally. Results from our prototype implementation demonstrate that, in addition to preserving the privacy of the data, our proposed scheme outperforms the others, with an information loss rate of 1.95% compared to 12.7% when varying the privacy level of the crime report data records.