1 Introduction

Clustering is an unsupervised learning technique aimed at discovering the natural grouping of unlabeled objects according to the similarity of their measured intrinsic characteristics [10]. Formally, let \({\mathbf {X}}=\{{\mathbf {x}}_{1},\ldots ,{\mathbf {x}}_{N}\} \) be a set of N objects (or patterns) to be partitioned into K non-overlapping groups (or clusters) \({\mathbf {C}}=\{ {\mathbf {c}}_{1},\ldots ,{\mathbf {c}}_{K}\} \), such that the following three conditions are satisfied: \({\mathbf {c}}_{i}\ne \emptyset \); \({\mathbf {c}}_{1}\cup \ldots \cup {\mathbf {c}}_{K}={\mathbf {X}}\); and \({\mathbf {c}}_{i}\cap {\mathbf {c}}_{j}=\emptyset \) for \({i,j}=1,\ldots ,K\) and \({i}\ne {j}\).

In addition, when the number of clusters is unknown a priori, the problem is referred to as automatic clustering (AC) [3, 4, 6], which consists of discovering both the number of clusters and the clustering that best fits the actual data structure.

Finding the optimal clustering solution that partitions N objects into K clusters is a very difficult combinatorial optimization problem, which has been proved to be NP-complete when \({K}>3\) [6]. Therefore, the AC problem is frequently formulated as a numerical optimization problem in which prototypes (e.g., medoids or centroids) are used as representative points of the clusters; that is, a prototype-based representation is considered. This optimization problem consists of finding the optimal locations of the prototypes that best represent the clusters in the dataset. This formulation has been proved to be NP-hard [1]; such complexity has motivated the use of diverse nature-inspired metaheuristics to address the AC problem [4, 6], where the best partition is achieved by optimizing an objective function known as a cluster validity index (CVI).

Generally, to evaluate the quality of a clustering solution, the CVI considers both the intracluster dispersion of patterns in every cluster and the intercluster separation among cluster prototypes. Thus, a proper solution representation and an effective evaluation function (or CVI) are important components in the design of an automatic clustering algorithm based on metaheuristics.

On the other hand, data clustering is a difficult problem because clusters can differ in shape, size, density, degree of overlap, etc. Moreover, the presence of non-linearly separable clusters makes their detection even more difficult. A set of patterns is linearly separable if two actual clusters, represented by a pair of prototypes, can be correctly separated by a single hyperplane. A prototype-based CVI estimates this condition well when the input data are linearly separable; however, it performs poorly when the data are non-linearly separable, because the criteria of cohesion and separation cannot be satisfied simultaneously.

In this work, we propose a multi-prototype clustering representation to encode arbitrary-shaped and non-linearly separable clusters. The proposed representation allows encoding clustering solutions by using a set of prototypes to represent a single cluster. Moreover, a clustering criterion based on the connectivity of the data is incorporated into the well-known Silhouette index to measure the intracluster dispersion. Finally, both components, the multi-prototype representation and the connectivity-based CVI, are integrated into an automatic clustering algorithm based on differential evolution (MACDE).

The outline of this paper is as follows: Sect. 2 surveys the related work; Sect. 3 presents the proposed approach; Sect. 4 describes the experimental setup; Sect. 5 summarizes the results; and Sect. 6 gives the conclusion.

2 Related Work

In the early years, automatic clustering was performed mainly by modifying the K-means and FCM algorithms [6]. Nowadays, diverse nature-inspired metaheuristics are used to address the AC problem. Indeed, a recent survey [6] reported that the design, development, and application of nature-inspired algorithms to automatic clustering has increased notably during the last decade. In particular, it reported that evolutionary algorithms are the most widely used for this problem.

Regardless of the search mechanisms, the success of nature-inspired clustering algorithms relies mostly on a proper solution representation as well as an effective evaluation function. Related work on these two components is reviewed next.

2.1 Encoding Schemes

The centroid-based representation of fixed length [3] is commonly used in automatic clustering algorithms. All the individuals in the population have the same length, defined as \({K}_{\max }+{K}_{\max }\times {D}\), where \({K}_{\max }\) is the maximum number of clusters and D is the dimensionality of the dataset. The first \({K}_{\max }\) entries are real numbers in the range [0, 1] called activation thresholds, represented by \( \mathbf T = \{ {T}_{k}\mid {k}=1,\ldots ,{K}_{\max }\} \). The remaining \({K}_{\max }\times D\) entries are reserved for the cluster centroids, \(\overline{\mathbf C }= \{\bar{{\mathbf {c}}}_{k}\mid {k}=1,\ldots ,{K}_{\max }\} \), where \(\bar{{\mathbf {c}}}_{k} \in {\mathbb {R}}^{D}\). To vary the number of clusters, the following activation rule is considered: the kth centroid is activated if and only if \({T}_{k}>0.5\); otherwise, it is deactivated. Hence, only activated prototypes participate in the clustering process.
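
As an illustration, the following Python sketch decodes such a fixed-length individual into its set of activated centroids (the function name and the example values are ours, purely for illustration):

```python
import numpy as np

def decode_centroid_individual(z, k_max, d):
    """Decode a fixed-length individual into its activated centroids.

    Layout (length k_max + k_max * d):
      z[:k_max]  -> activation thresholds T_k in [0, 1]
      z[k_max:]  -> k_max candidate centroids, each of dimension d
    """
    thresholds = z[:k_max]
    centroids = z[k_max:].reshape(k_max, d)
    return centroids[thresholds > 0.5]  # activation rule: T_k > 0.5

# Usage: k_max = 3 candidate centroids in 2-D; centroids 1 and 3 are active.
z = np.array([0.9, 0.2, 0.7,
              0.1, 0.3, 0.5, 0.5, 0.9, 0.3])
print(decode_centroid_individual(z, k_max=3, d=2))
```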

2.2 Cluster Validity Indices

A cluster validity index (CVI) is a mathematical function used to quantitatively evaluate a clustering solution by considering the intracluster dispersion and the intercluster separation. Generally, a CVI is used for two purposes: to estimate the number of clusters and to find the corresponding best partition. CVIs are optimization functions by nature, that is, the maximum or minimum values indicate the appropriate partitions. Therefore, CVIs have been used as objective functions by evolutionary clustering algorithms to address the AC problem [3].

CVIs usually use representative points (e.g., centroids or medoids) to represent the groups in a clustering solution. This approach is suitable for compact and hyperspherical-shaped clusters [10]. However, when clusters are non-linearly separable and present arbitrary shapes, their prototypes may share the same region of the feature space; in this case, the CVI yields poor clustering performance.

In order to discover arbitrary-shaped and non-linearly separable clusters, some approaches based on the connectivity of data have been proposed in the literature. Pan and Biswas [7] incorporated graph theory concepts into some CVIs to compute the intracluster cohesion criteria. Also, Saha and Bandyopadhyay [8] presented a new measure of connectivity based on a relative neighborhood graph, which was incorporated into some conventional CVIs such as the Davies–Bouldin, Dunn, and Xie–Beni indices. However, the time complexity of these approaches makes them unsuitable for addressing the AC problem with evolutionary algorithms.

3 Proposed Approach

3.1 Differential Evolution Algorithm

Differential evolution (DE) is an evolutionary algorithm proposed to solve optimization problems over continuous spaces [9]. Let \(\mathbf P ^{g}=\left\{ \mathbf z _{1},\ldots ,\mathbf z _{NP}\right\} \) be the current population with NP members at generation g, where the ith individual is an l-dimensional vector denoted by \(\mathbf z _{i}=\left[ {z}_{{i},1},\ldots ,{z}_{i,l}\right] \).

At the beginning of the algorithm, the variables of the NP individuals are randomly initialized according to a uniform distribution within the bounds \({z}_{j}^{\text {low}}\le {z}_{i,j}\le {z}_{j}^{\text {up}}\), for \({j}=1,2,\ldots ,{l}\). After initialization, DE enters a loop of evolutionary operators (mutation, crossover, and selection) until convergence is reached or the maximum number of generations is attained.

Mutation: At each generation g, mutant vectors \(\mathbf v _{i}^{g}\) are created from the current parent population \(\mathbf P ^{g}=\{ \mathbf z _{1},\ldots ,\mathbf z _{NP}\} \). The “DE/rand/1” mutation strategy is frequently used:

$$\begin{aligned} \mathbf v _{i}^{g}=\mathbf z _{r0}^{g}+{F}\left( \mathbf z _{r1}^{g}-\mathbf z _{r2}^{g}\right) , \end{aligned}$$
(1)

where the indices r0, r1 and r2 are distinct integers randomly chosen from the set \(\left\{ 1,2,\ldots ,{NP}\right\} {\setminus }\left\{ {i}\right\} \) and F is a mutation factor which usually ranges within \(\left( 0,1\right) \).

Crossover: After mutation, a binomial crossover operation creates a trial vector \(\mathbf u _{i}^{g}=\left[ u_{i,1}^{g},\ldots ,u_{i,l}^{g}\right] \) as

$$\begin{aligned} u_{i,j}^{g}={\left\{ \begin{array}{ll} v_{i,j}^{g} &{} \text {if }\,\text {rand}_{j}\left( 0,1\right) <CR\,\text {or}\,j=j_{\text {rand}},\\ z_{i,j}^{g} &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$
(2)

where \(\text {rand}_{j}\left( 0,1\right) \) is a uniform random number in the range \(\left[ 0,1\right] \), \(j_{\text {rand}}=\text {randint}\left( 1,l\right) \) is an integer randomly chosen in the range \(\left[ 1,l\right] \), and \(CR\in \left[ 0,1\right] \) is the crossover rate.

Selection: This operator selects the better solution between the target vector \(\mathbf z _{i}^{g}\) and the trial vector \(\mathbf u _{i}^{g}\) according to their fitness values \({f}(\cdot )\). Without loss of generality, for a minimization problem the vector that survives to the next generation \({g}+1\) is given by

$$\begin{aligned} \mathbf z _{i}^{g+1}={\left\{ \begin{array}{ll} \mathbf u _{i}^{g} &{} \text {if }\,{f}\left( \mathbf u _{i}^{g}\right) <{f}\left( \mathbf z _{i}^{g}\right) ,\\ \mathbf z _{i}^{g} &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(3)
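
To make the interplay of the three operators concrete, the following minimal Python sketch (ours, not taken from [9]) implements one generation of DE/rand/1/bin for a minimization problem; the values of F and CR follow Sect. 4.1:

```python
import numpy as np

rng = np.random.default_rng(42)

def de_generation(P, fitness, F=0.8, CR=0.9):
    """One generation of DE/rand/1/bin for minimization, following
    Eqs. (1)-(3). A minimal sketch."""
    NP, l = P.shape
    P_next = P.copy()
    for i in range(NP):
        # Mutation (Eq. 1): three distinct indices, all different from i.
        r0, r1, r2 = rng.choice([r for r in range(NP) if r != i],
                                size=3, replace=False)
        v = P[r0] + F * (P[r1] - P[r2])
        # Binomial crossover (Eq. 2): j_rand guarantees at least one
        # component is inherited from the mutant vector.
        j_rand = rng.integers(l)
        mask = rng.random(l) < CR
        mask[j_rand] = True
        u = np.where(mask, v, P[i])
        # Selection (Eq. 3): greedy one-to-one replacement.
        if fitness(u) < fitness(P[i]):
            P_next[i] = u
    return P_next

# Usage: minimize the sphere function over a small population.
sphere = lambda z: np.sum(z ** 2)
P = rng.uniform(-5, 5, size=(20, 4))   # NP = 20 individuals, l = 4
for g in range(100):
    P = de_generation(P, sphere)
print(min(sphere(z) for z in P))       # typically close to 0
```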

3.2 Multi-prototype Representation

In this paper, we introduce a new multi-prototype representation in which every individual in the population, \({\mathbf {z}}_{i}\), is a fixed-length vector composed of activation thresholds, cluster centroids, and cluster labels. Thus, every individual is a vector of size \({K}_{\max }+({K}_{\max }\times {D})+{K}_{\max }\). The first \({K}_{\max }\) entries are real numbers in the range [0, 1], called the activation thresholds denoted by the set \( \mathbf T = \{ {T}_{k}\mid {k}=1,\ldots ,{K}_{\max }\} \). The next \({K}_{\max }\times {D}\) entries correspond to the cluster centroids denoted by the set \( \overline{\mathbf C }= \{\bar{{\mathbf {c}}}_{k}\mid {k}=1,\ldots ,{K}_{\max }\} \), where \(\bar{{\mathbf {c}}}_{k} \in {\mathbb {R}}^{D}\). The remaining \({K}_{\max }\) entries are integer numbers in the set \(\left\{ 1,\ldots ,{K}_{\max }\right\} \), called the cluster labels, denoted by \(\mathbf L = \{{L}_{k}\mid {k}=1,\ldots ,{K}_{\max }\}\). Then, the ith individual in the DE algorithm is expressed by

$$\begin{aligned} {\mathbf {z}}_i=\left[ {T}_{i,1},\ldots ,{T}_{{i,K}_{\max }},\bar{{\mathbf {c}}}_{i,1},\ldots ,\bar{{\mathbf {c}}}_{{i,K}_{\max }},{L}_{i,1},\ldots ,{L}_{{i,K}_{\max }}\right] =\left\{ \mathbf T _{i},\overline{\mathbf C }_{i},\mathbf L _{i}\right\} . \end{aligned}$$
(4)

The kth centroid is activated if and only if \({T}_{i,k}\ge 0.5\); otherwise, it is deactivated. An activated centroid participates in the clustering process to form “subclusters”. Finally, these subclusters are merged according to their corresponding cluster labels in \(\mathbf L _{i}\) to form the clusters of a potential solution. It is worth mentioning that each centroid is approximated by its closest pattern point, so that every cluster is represented by a medoid.

Fig. 1.

Example of the multi-prototype representation with \({K}_{\max }=5\) and \({D}=2\). The activated centroids are represented by blue stars and the deactivated ones by red stars. The clustering solution is obtained by assigning to every pattern the cluster label in \(\mathbf L \). (Color figure online)

Figure 1 shows an example of the multi-prototype representation with \({K}_{\max }=5\) and \(D=2\). First, the positions 1, 3, and 5 satisfy the aforementioned activation rule; therefore, their centroids (0.1, 0.3), (0.6, 0.6), and (0.9, 0.3) form the corresponding subclusters. Finally, the subclusters sharing the same cluster label are merged: the third subcluster keeps its label “3”, whereas the first and fifth subclusters share the label “1” and are therefore merged into a single cluster.
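
A minimal Python sketch of this decoding step, reproducing the example of Fig. 1, is given below (the threshold values and the deactivated centroids in positions 2 and 4 are placeholder values of our own choosing):

```python
import numpy as np

def decode_multiprototype(T, C, L):
    """Decode (T, C, L) into activated centroids and their merge labels.

    T: (k_max,) activation thresholds; centroid k is active iff T[k] >= 0.5.
    C: (k_max, d) candidate centroids.
    L: (k_max,) integer cluster labels used to merge subclusters.
    """
    active = np.flatnonzero(T >= 0.5)
    return C[active], L[active]

# The example of Fig. 1 (k_max = 5, D = 2): positions 1, 3, and 5 are
# activated. Deactivated entries are placeholders of our own choosing.
T = np.array([0.9, 0.2, 0.8, 0.4, 0.7])
C = np.array([[0.1, 0.3], [0.4, 0.8], [0.6, 0.6], [0.2, 0.1], [0.9, 0.3]])
L = np.array([1, 2, 3, 4, 1])
centroids, labels = decode_multiprototype(T, C, L)
# centroids -> (0.1, 0.3), (0.6, 0.6), (0.9, 0.3); labels -> [1, 3, 1]:
# the first and fifth subclusters share label 1 and are merged.
```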

3.3 A Connectivity-Based CVI

The proposed index CSil (Connectivity-Silhouette index) is based on a recent proximity measure called the maximum edge distance (MED) [2], which is capable of detecting groups of different shape, size, and convexity. Computing CSil involves the following steps:

  1.

    Build an undirected complete graph \(\mathbf G \left( {V,E}\right) \) from the input dataset \(\mathbf X =\left\{ \mathbf x _1,\ldots ,\mathbf x _{N}\right\} \), where \({v}_{i}\in {V}\) corresponds to \(\mathbf x _{i}\in \mathbf X \), the edge \({e}_{ij}\in E\) denotes the pairwise Euclidean distance (\({d}_{e}\)) between the vertices \({v}_{i}\) and \({v}_{j}\), and N is the total number of patterns in the dataset.

  2.

    Construct the minimum spanning tree \({MST}\left( {V,E}^{t}\right) \) from \(\mathbf G \), where \({E}^{t}\subset {E}\) such that \(\left| {E}^{t}\right| ={N}-1\).

  3.

    Compute the MED distance (\({d}_{med}\)) between two patterns \({\mathbf {x}}_{i}\) and \({\mathbf {x}}_{j}\) as:

    $$\begin{aligned} {d}_{med}(v_{i},v_{j}) = \max _{e_{p}^{t}\in E_{p}^{t}} e_{p}^{t}, \end{aligned}$$
    (5)

    where \({{\mathcal {P}}}_{ij}={MST}^{\prime }\left( V_{p},E_{p}^{t}\right) \) is the subgraph of \(MST\left( V,E^t\right) \) representing the path between \(v_{i}\) and \(v_{j}\), with \({V}_{p}\subset V\) and \(E_{p}^{t}\subset E^{t}\); that is, \({d}_{med}\) is the weight of the longest edge \(e_{p}^{t}\) along the MST path connecting \(v_{i}\) and \(v_{j}\).

  4.

    Finally, compute the proposed CSil index as

    $$\begin{aligned} \text {CSil}({\mathbf {C}})=\frac{1}{N}\sum _{{\mathbf {c}}_{k}\in {\mathbf {C}}}\,\sum _{{\mathbf {x}}_{i}\in {\mathbf {c}}_{k}}\frac{\text {b}\left( {\mathbf {x}}_{i},{\mathbf {c}}_{k}\right) -\text {a}\left( {\mathbf {x}}_{i},{\mathbf {c}}_{k}\right) }{\max \left\{ \text {b}\left( {\mathbf {x}}_{i},{\mathbf {c}}_{k}\right) ,\text {a}\left( {\mathbf {x}}_{i},{\mathbf {c}}_{k}\right) \right\} }, \end{aligned}$$
    (6)

    where

    $$\begin{aligned} \text {a}\left( {\mathbf {x}}_{i},{\mathbf {c}}_{k}\right)&=\frac{1}{n_{k}}\sum _{{\mathbf {x}}_{j}\in {\mathbf {c}}_{k}}\text {d}_{\text {med}}\left( {\mathbf {x}}_{i},{\mathbf {x}}_{j}\right) ,&\text {b}\left( {\mathbf {x}}_{i},{\mathbf {c}}_{k}\right)&=\min _{{\mathbf {c}}_{r}\in {\mathbf {C}}\setminus {\mathbf {c}}_{k}}\left\{ \frac{1}{n_{r}}\sum _{{\mathbf {x}}_{j}\in {\mathbf {c}}_{r}}\text {d}_{\text {med}}\left( {\mathbf {x}}_{i},{\mathbf {x}}_{j}\right) \right\} . \end{aligned}$$

In order to achieve the proper partitioning, the value of CSil is maximized.
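
Assuming that all pairwise distances are positive and that a candidate solution contains at least two clusters, steps 1-4 can be sketched in Python as follows, using SciPy's MST routine (function names are ours):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def med_distances(X):
    """All-pairs MED (Eq. 5) over the Euclidean MST of the dataset.

    d_med(i, j) is the longest edge on the unique MST path between i
    and j; one traversal per source vertex yields all pairs in O(N^2)
    once the MST is built."""
    n = len(X)
    mst = minimum_spanning_tree(squareform(pdist(X))).toarray()
    adj = mst + mst.T                        # symmetric MST adjacency
    d_med = np.zeros((n, n))
    for src in range(n):
        stack, seen = [src], {src}
        while stack:
            u = stack.pop()
            for v in np.flatnonzero(adj[u]):
                if v not in seen:
                    seen.add(v)
                    # the path maximum grows monotonically along the tree
                    d_med[src, v] = max(d_med[src, u], adj[u, v])
                    stack.append(v)
    return d_med

def csil(labels, d_med):
    """CSil (Eq. 6): the Silhouette index computed on MED distances;
    larger values indicate better partitions."""
    clusters = np.unique(labels)
    total = 0.0
    for i in range(len(labels)):
        own = labels[i]
        a = d_med[i, labels == own].mean()    # intracluster cohesion
        b = min(d_med[i, labels == c].mean()  # intercluster separation
                for c in clusters if c != own)
        total += (b - a) / max(a, b)
    return total / len(labels)
```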

3.4 Avoiding Erroneous Clustering Solutions

When a new individual is created by the DE evolutionary operators, erroneous clustering solutions could be generated; therefore, the following cases must be considered (a sketch implementing these repairs is given after the list):

  • Minimum number of subclusters: if all the activation thresholds in \(\mathbf T _i\) are smaller than 0.5, two threshold entries are randomly selected and re-initialized in the range \(\left( 0.5,1\right) \). Likewise, if all the activated cluster labels in \(\mathbf L _i\) are identical, one label entry is randomly selected and re-initialized with a different label randomly chosen from \(\{1,\ldots ,K_{\max }\}\).

  • Empty clusters: if any activated cluster has fewer than two patterns assigned, all the clusters are re-initialized such that every cluster receives approximately \(\frac{N}{K}\) patterns, and the corresponding prototypes are recalculated by averaging the patterns of every cluster.

  • Out-of-bound variables: if any threshold value in \(\mathbf T _i\) exceeds one or becomes negative, it is truncated to “1” or “0”, respectively. Similarly, if any cluster label in \(\mathbf L _i\) exceeds \(K_{\max }\) or becomes negative, it is truncated to \(K_{\max }\) or “1”, respectively.
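
A sketch of these repair rules for the thresholds and labels follows (names are ours; the empty-cluster rule requires the data itself and is applied in step (iii) of Sect. 3.5):

```python
import numpy as np

rng = np.random.default_rng()

def repair(T, L, k_max):
    """Repair rules of Sect. 3.4 for thresholds T and labels L (a sketch)."""
    # Out-of-bound variables: truncate thresholds to [0, 1] and labels
    # to {1, ..., k_max}.
    T = np.clip(T, 0.0, 1.0)
    L = np.clip(np.rint(L).astype(int), 1, k_max)
    # Minimum number of subclusters: force at least two activated
    # thresholds ...
    if np.all(T < 0.5):
        idx = rng.choice(k_max, size=2, replace=False)
        T[idx] = rng.uniform(0.5, 1.0, size=2)
    # ... and at least two distinct labels among the activated entries.
    active = np.flatnonzero(T >= 0.5)
    if len(np.unique(L[active])) == 1:
        k = rng.choice(active)
        L[k] = rng.choice([c for c in range(1, k_max + 1) if c != L[k]])
    return T, L
```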

3.5 Pseudocode of MACDE Algorithm

The complete pseudocode for the proposed MACDE algorithm is detailed below:

  • Step 1: Generate an initial random population \(\mathbf P ^{0}= \{ \mathbf z _{i}^{0}\mid i=1,\ldots ,NP\}\) as described in Sect. 3.2.

  • Step 2: For each \(\mathbf z _{i}\in \mathbf P ^{0}\), find out the activated centroids in \(\overline{\mathbf C }_i\) and cluster labels in \(\mathbf L _i\) by applying the activation rule to the threshold values in \(\mathbf T _{i}\).

  • Step 3: For \(t=1\) to \(G_{\max }\) do:

    (i)

      For each \(\mathbf z _{i}\in \mathbf P ^{t}\) and considering each pattern \(\mathbf x _{p}\in \mathbf X \) for \(p=1,\ldots ,N\), obtain the \(K_i\) medoids \(\mathbf M _{i}=\{ \mathbf m _{j}\mid j=1,\ldots ,K_i\} \) by assigning each activated centroid \(\bar{\mathbf{c }}_{j}\in \overline{\mathbf C }_{i}\) to the closest \(\mathbf x _{p}\) such that

      $$\begin{aligned} d_{e}(\bar{\mathbf{c }}_{j},\mathbf m _{j}) = \min _{\mathbf{x }_{p}\in \mathbf X }\{d_{e}(\bar{\mathbf{c }}_{j},\mathbf x _{p})\}. \end{aligned}$$
    (ii)

      For each \(\mathbf z _{i}\in \mathbf P ^{t}\) and considering each \(\mathbf x _{p}\in \mathbf X \), obtain the clustering solution \(\mathbf C _{i}^{\prime }\) by assigning \(\mathbf x _{p}\) to the closest medoid \(\mathbf m _{j}\in \mathbf M _{i}\) such that

      $$\begin{aligned} d_{e}(\mathbf x _{p},\mathbf m _{j}) = \min _{\mathbf{m }_{r}\in \mathbf{M }_{i}}\{d_{e}(\mathbf x _{p},\mathbf m _{r})\}. \end{aligned}$$
    (iii)

      For each \(\mathbf z _{i}\in \mathbf P ^{t}\), check if the number of patterns belonging to any cluster in \(\mathbf C _{i}^{\prime }\) is less than two. If so, update the cluster medoids \(\mathbf M _{i}\) using the concept described in Sect. 3.4.

    (iv)

      For each \(\mathbf z _{i}\in \mathbf P ^{t}\), create the merged-clustering solution \(\mathbf C _{i}\) from \(\mathbf C _{i}^{\prime }\) using the activated cluster labels in \(\mathbf L _{i}\).

    (v)

      Perform the evolutionary operators on each individual \(\mathbf z _{i}\in \mathbf P ^{t}\) to create a mutant vector \(\mathbf v _{i}\) using (1) and then a trial vector \(\mathbf u _{i}\) using (2).

    (vi)

      For each \(\mathbf u _{i}\), find out the activated centroids and cluster labels by applying the activation rule. The real values in \(\mathbf L _i\) must be rounded to their nearest integers in order to generate valid cluster labels.

    (vii)

      Repeat steps (i)-(iv) for each trial vector \(\mathbf u _{i}\) in order to verify its validity.

    (viii)

      Evaluate the fitness of both the target vector \(\mathbf z _{i}\) and the trial vector \(\mathbf u _{i}\) according to the CSil index based on the MED distance in (6), using only the merged-clustering solution of each vector. Replace the target vector \(\mathbf z _{i}^{t}\) with the trial vector \(\mathbf u _{i}^{t}\) only if the latter yields a better fitness value.

  • Step 4: Report the final merged-clustering solution obtained by the best individual (the one yielding the highest fitness) at time \(t=G_{\max }\).
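
The core of steps (i)-(iv) and the fitness evaluation of step (viii) can be sketched as follows, reusing repair, decode_multiprototype, med_distances, and csil from the previous sketches. The slicing of \(\mathbf z\) is an assumption consistent with Eq. (4); since CSil is maximized, the selection of the earlier DE sketch must keep the vector with the larger fitness (or, equivalently, minimize \(-\)CSil):

```python
import numpy as np
from scipy.spatial.distance import cdist

def decode_and_cluster(z, X, k_max):
    """Steps (i)-(iv): decode z = [T | centroids | L], snap activated
    centroids to their closest patterns (medoids), assign every pattern
    to its closest medoid, and merge subclusters sharing a label."""
    d = X.shape[1]
    T = z[:k_max]
    C = z[k_max:k_max + k_max * d].reshape(k_max, d)
    L = z[k_max + k_max * d:]
    T, L = repair(T.copy(), L.copy(), k_max)          # Sect. 3.4 rules
    centroids, labels = decode_multiprototype(T, C, L)
    medoids = X[cdist(centroids, X).argmin(axis=1)]   # step (i): medoids
    sub = cdist(X, medoids).argmin(axis=1)            # step (ii): assignment
    return labels[sub]                                # step (iv): merge

def fitness(z, X, k_max, d_med):
    """Step (viii): CSil of the merged-clustering solution (maximized).
    d_med is precomputed once with med_distances(X), since Eq. (5)
    depends only on the data."""
    return csil(decode_and_cluster(z, X, k_max), d_med)
```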

4 Experimental Setup

The proposed algorithm (MACDE) was compared with four different clustering approaches: an automatic clustering algorithm based on DE, ACDE [3]; a partitional clustering approach, K-means [10]; a hierarchical clustering method, WARD [10]; and a density-based algorithm, DBSCAN [10].

The testing platform was a Linux-based computer with eight cores at 2.7 GHz and 8 GB of RAM. All the algorithms were implemented in Matlab R2014a (The MathWorks, Natick, Massachusetts, USA).

4.1 Parameter Settings

The settings adopted in MACDE and ACDE are: number of fitness function evaluations, \(FE=1\times 10^{5}\); population size, \(NP=D\times 10\); crossover rate, \(CR=0.9\); mutation factor, \(F=0.8\); and maximum number of clusters, \(K_{\max }=20\).

The Silhouette index is used as the model-selection criterion for K-means (executed for \(K=\{2,\ldots ,K_{\max }\}\)) and WARD (the dendrogram is cut at different levels), whereas the parameters of DBSCAN were tuned to their best-performing settings.

4.2 Data Sets

Four categories of datasets are studied: (i) linearly separable data having well-separated clusters, \(\mathcal {G}_{1}\); (ii) linearly separable data having overlapping clusters, \(\mathcal {G}_{2}\); (iii) non-linearly separable data having well-separated clusters, \(\mathcal {G}_{3}\); and (iv) real-life datasets, \(\mathcal {G}_{4}\). An example dataset of each category is shown in Fig. 2(a).

Fig. 2.

Example of datasets in categories \(\mathcal {G}_{1}\), \(\mathcal {G}_{2}\), \(\mathcal {G}_{3}\), and \(\mathcal {G}_{4}\) (left to right): (a) actual cluster labels and (b) clustering solutions obtained by MACDE. Distinct colors represent different clusters. (Color figure online)

4.3 Clustering Quality Evaluation

The adjusted Rand index (ARI) [5] takes two partitionings as input and returns a value in the interval \([{\sim }0,1]\), where “1” indicates perfect agreement between them and “\({\sim }0\)” indicates disagreement. Let \(\mathbf T \) be the true partitioning and \(\mathbf C \) the partitioning obtained by a clustering algorithm. Also, let a, b, c, and d denote, respectively, the number of pairs of data points belonging to the same cluster in both \(\mathbf T \) and \(\mathbf C \); the number of pairs belonging to the same cluster in \(\mathbf T \) but to different clusters in \(\mathbf C \); the number of pairs belonging to different clusters in \(\mathbf T \) but to the same cluster in \(\mathbf C \); and the number of pairs belonging to different clusters in both \(\mathbf T \) and \(\mathbf C \). The ARI value is then computed as follows

$$\begin{aligned} ARI\left( \mathbf T ,\mathbf C \right) =\frac{2\left( ad-bc\right) }{\left( a+b\right) \left( b+d\right) +\left( a+c\right) \left( c+d\right) }. \end{aligned}$$
(7)
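
For reference, Eq. (7) can be evaluated directly from the four pair counts, as in the following sketch (degenerate inputs for which the denominator vanishes are not guarded):

```python
from itertools import combinations

def adjusted_rand_index(t, c):
    """ARI from pair counts, following Eq. (7). t and c are label
    sequences over the same N points."""
    a = b = c_ = d = 0
    for (t1, c1), (t2, c2) in combinations(zip(t, c), 2):
        same_t, same_c = t1 == t2, c1 == c2
        if same_t and same_c:
            a += 1                 # same cluster in both T and C
        elif same_t:
            b += 1                 # same in T, different in C
        elif same_c:
            c_ += 1                # different in T, same in C
        else:
            d += 1                 # different in both
    return 2 * (a * d - b * c_) / ((a + b) * (b + d) + (a + c_) * (c_ + d))

# Usage: identical partitions (up to relabeling) yield ARI = 1.
print(adjusted_rand_index([1, 1, 2, 2], [0, 0, 1, 1]))  # -> 1.0
```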

5 Experimental Results

We investigated the effectiveness of the proposed approach focusing on two major issues: (i) quality of the clustering solution and (ii) ability to find the actual number of clusters.

Table 1. Number of clusters and quality of the best solution in terms of the ARI (mean values from 31 runs) generated by MACDE, ACDE, K-means, WARD, and DBSCAN. The statistically best (\(\alpha =0.05\)) results are highlighted in bold face. The Kruskal–Wallis test with Bonferroni correction was applied to compare the algorithms.

The experimental results in terms of the ARI index and the estimated number of groups, K, are shown in Table 1. Note that the ACDE, K-means, and WARD algorithms achieved high performance on linearly separable data (i.e., categories \(\mathcal {G}_{1}\) and \(\mathcal {G}_{2}\)), since these datasets contain spherical Gaussian clusters, which is the cluster model assumed by these methods. In contrast, these algorithms performed poorly on non-linearly separable data having well-separated clusters (i.e., category \(\mathcal {G}_{3}\)). DBSCAN achieved good performance on the datasets of category \(\mathcal {G}_{3}\), as these contain dense and spatially well-separated clusters, which is the main assumption made by this algorithm, whereas it obtained lower performance on the datasets of categories \(\mathcal {G}_{1}\) and \(\mathcal {G}_{2}\). In general, the ACDE, K-means, WARD, and DBSCAN algorithms performed poorly on the datasets of \(\mathcal {G}_{4}\), as these contain more complex cluster structures derived from real-life scenarios.

On the other hand, the results for MACDE indicate good performance across the datasets in categories \(\mathcal {G}_{1}\) and \(\mathcal {G}_{3}\). This is reflected in both the high ARI values and the slight difference between the estimated number of clusters K and the actual number of clusters \(K^*\). Notably, the performance of MACDE is affected by the degree of overlap between clusters: as the overlap increases, the performance diminishes. This disadvantage arises when the minimum separation between clusters is smaller than the maximum first-neighbor distance (a restriction related to the use of the MST in the CSil index). However, as expected, the multi-prototype representation and the connectivity-based cluster validity index (CSil) allow discovering arbitrary-shaped clusters regardless of the linear separability of the data. Finally, Fig. 2(b) presents some clustering solutions for the different studied categories of datasets.

6 Conclusions

In this paper, an automatic clustering approach based on the DE algorithm was proposed. A new multi-prototype representation and a connectivity-based cluster validity index were proposed to represent and evaluate, respectively, clustering solutions having non-linearly separable clusters. The proposed algorithm (MACDE) was shown to outperform some well-known clustering techniques across a diverse range of datasets separated by categories. The proposed approach has two main advantages: (i) it automatically discovers the number of clusters, and (ii) it clusters data independently of their linear separability (i.e., it handles arbitrary-shaped clusters). Finally, it should be noted that data having overlapping clusters may decrease the performance of MACDE.

In future work, it would be interesting to investigate and improve the performance of MACDE on data having strongly overlapping clusters.