
1 Introduction

Hyperspectral images are remotely sensed images containing hundreds of spectral bands. Although this large number of bands helps in identifying objects uniquely by providing a significant amount of information, it also increases the computational complexity of processing such images. Moreover, as the bands are nearly contiguous, neighbouring bands are highly correlated, which leads to redundancy [2]. Besides, due to the availability of only a limited number of labelled samples, such images suffer from the curse of dimensionality, which leads to the well-known Hughes phenomenon [6]. To avoid these issues, dimensionality reduction is an essential pre-processing step for the analysis of hyperspectral images. Dimensionality reduction can be achieved by performing either feature extraction or feature selection. Principal component analysis, independent component analysis (ICA) and autoencoders [3, 4, 11] are examples of widely used feature extraction techniques. Although feature extraction techniques result in a discriminating set of features, the physical interpretation of the spectral bands gets compromised due to the linear or non-linear transformation applied over the original feature set. In contrast, feature selection methods select a subset of highly discriminating features from the original set of features without applying any transformation. Thus, band selection methods not only preserve the original interpretation of the spectral bands but also lessen the storage and processing time requirements for hyperspectral images. Feature selection methods can broadly be categorized into supervised and unsupervised methods. For supervised methods, additional information such as class signatures or ground truth needs to be provided along with the dataset. Most supervised band-selection algorithms select the subset of bands that results in the highest class separability. Regression trees and instance-based methods are some of the techniques used for supervised band selection [2].

This work presents a supervised band selection algorithm that requires labelled samples as additional information. These labelled samples have been used for calculating the information gain ratio of each spectral band of the input hyperspectral image. Initially, the correlation among the spectral bands was utilized to cluster the highly correlated bands together, and then from each cluster the band with the highest information gain ratio was selected. Further reduction was carried out by applying a band pruning method, which acts as a second level of filtering. The method is discussed in detail in Sect. 2. Experimental results have shown that the dataset with only the bands selected by the proposed supervised method gave classification accuracy comparable to that achieved by the original high-dimensional dataset.

2 Proposed Methodology

This section provides a brief outline of the related techniques followed by a detailed description of the proposed methodology.

2.1 Background

Information Gain: Let \(A=(a_1, a_2, \cdots , a_m, C)\) be the feature vector, where \(a_i\) represents the \(i^{th}\) feature and C stands for the class label. Let \(X = \{x_1, x_2,\cdots , x_n\} \) be the set of training samples such that \(\Vert {X}\Vert =n\). Each element \(x_i\in X\) is an ordered tuple and can be expressed as \(x_i= (x_{i1}, x_{i2}, \cdots , x_{im}, c_i)\), where \(x_{ij}\) is the value of the sample \(x_i\) for the feature \(a_j \in A\) and \(c_i\) is the corresponding class label. The function \(values(X, a_j)\) denotes the set of all possible values of the feature \(a_j\) in the given training set X. For each value \(v \in values(X,a_j)\), a subset \(S_v\) of X can be defined as [9]-

$$\begin{aligned} S_v=\{x_i|x_i \in X \wedge x_{ij}= v\wedge v \in values(X,a_j)\} \end{aligned}$$
(1)

Using Eq. 1, the information gain of a feature \(a_j\) for the given training set X can be calculated as-

$$\begin{aligned} IG(X, a_j)=H(X)-\sum _{v \in values(X, a_j)}{\frac{\Vert {S_v}\Vert }{n}}\cdot H(S_v) \end{aligned}$$
(2)

where H(X) is the entropy of the training set X and \(H(S_v)\) is the entropy of the subset \(S_v\) obtained using Eq. 1.

The entropy of a random variable measures the average information content of the variable [9]. Mathematically, the entropy of a random variable Y with probability distribution p(y) can be defined using the following equation

$$\begin{aligned} H(Y)=-\sum _{y\in Y}p(y)\log p(y) \end{aligned}$$
(3)

Using Eq. 3, the entropies of the training set X and the subset \(S_v\) may be calculated as

$$\begin{aligned} H(X)=-\sum _{c\in values(X,C)}p(c)\log p(c) \end{aligned}$$
(4)
$$\begin{aligned} H(S_v)=-\sum _{c\in values(S_v,C)}p(c)\log p(c) \end{aligned}$$
(5)

Information gain is widely used as a measure of the relevance of a feature. The higher the information gain of a feature, the more relevant it is. However, information gain is biased towards features having many distinct values.

Intrinsic Gain: Intrinsic gain of a feature \(a_j\) for a training set X may be defined as [9]

$$\begin{aligned} IV(X,a_j)=-\sum _{v \in values(X, a_j)}\frac{\Vert {S_v}\Vert }{n}\cdot \log _2\frac{\Vert {S_v}\Vert }{n} \end{aligned}$$
(6)

Information Gain Ratio: For a given training set X, the ratio of the information gain to the intrinsic gain of a feature \(a_j\) is referred to as the information gain ratio.

$$\begin{aligned} IGR(X,a_j )=\frac{IG(X,a_j )}{IV(X,a_j )} \end{aligned}$$
(7)
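As a concrete illustration of Eqs. 2, 6 and 7, the following minimal Python sketch computes the gain ratio of a single discrete feature with respect to the class labels. It assumes that continuous reflectance values have already been discretised (e.g. by binning) and uses base-2 logarithms throughout (Eq. 3 leaves the base unspecified); the function names are illustrative only.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a 1-D array of class labels (Eq. 3)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(feature, labels):
    """Information gain ratio (Eq. 7) of one discrete feature a_j."""
    n = len(labels)
    ig = entropy(labels)                  # start from H(X), Eq. 4
    iv = 0.0
    for v in np.unique(feature):          # values(X, a_j)
        mask = feature == v               # subset S_v, Eq. 1
        w = mask.sum() / n                # |S_v| / n
        ig -= w * entropy(labels[mask])   # Eq. 2
        iv -= w * np.log2(w)              # Eq. 6
    return ig / iv if iv > 0 else 0.0
```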

Correlation: Correlation is a measure of the statistical relationship between two variables. In statistics, Pearson’s product moment correlation between two random samples X and Y can be calculated as [10]

$$\begin{aligned} \rho (X,Y)=\frac{Cov(X,Y)}{\sigma _X\sigma _Y} \end{aligned}$$
(8)

where \(Cov(X,Y)\) is the covariance of X and Y, and \(\sigma _X\) and \(\sigma _Y\) are the standard deviations of X and Y respectively. The correlation coefficient can take values in the range −1 to +1. A positive value indicates a direct linear relationship between the samples, whereas a negative value indicates an inverse linear relationship. A high magnitude of the correlation coefficient implies that the variables are strongly related, and a low magnitude indicates that the variables are almost independent.

2.2 Description of the Proposed Method

In the proposed supervised band selection method, we have considered each labelled pixel of the input hyperspectral image as a training sample. All the spectral bands along with the class label represent the set of features. So, if the input hyperspectral image contains \(n\) labelled pixels and \(d\) spectral bands, then the set of training samples X can be represented as-

$$\begin{aligned} X=\begin{bmatrix} x_{11} &{}x_{12} &{}\cdots &{}x_{1d} &{}c_1\\ x_{21} &{}x_{22} &{}\cdots &{}x_{2d} &{}c_2\\ \vdots &{} \vdots &{}\ddots &{}\vdots &{}\vdots \\ x_{n1} &{}x_{n2} &{}\cdots &{}x_{nd} &{}c_n\\ \end{bmatrix} \end{aligned}$$
(9)

where, \(x_{ij}\) represents the reflectance value of the pixel \(x_i\) at spectral band \(b_j\) and \(c_i\) is the corresponding class label.
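A possible way to assemble the training matrix of Eq. 9 is sketched below, assuming the hyperspectral image is available as an (H, W, d) reflectance cube and the ground truth as an (H, W) label map in which 0 marks unlabelled pixels; these array conventions are an assumption, not part of the original description.

```python
import numpy as np

def build_training_matrix(cube, gt):
    """Collect the labelled pixels as rows of X (Eq. 9).

    cube : (H, W, d) reflectance cube; gt : (H, W) labels, 0 = unlabelled.
    Returns the (n, d) reflectance matrix and the (n,) class-label vector.
    """
    mask = gt > 0                                   # keep only labelled pixels
    return cube[mask].astype(float), gt[mask]
```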

Using Eqs. 1 to 8, we then calculated the information gain ratio of every spectral band in the input hyperspectral image with respect to X. The calculated information gain ratios of the spectral bands can be represented as a row vector using the following notation-

$$\begin{aligned} IGR_{bands}=[IGR_1, IGR_2, \ldots , IGR_d] \end{aligned}$$
(10)

where \(IGR_i\) is the information gain ratio of the \(i^{th}\) spectral band with respect to X. In the next phase of the proposed method, similar bands were grouped together using a clustering technique. For our experiments we used two different clustering techniques, namely hierarchical clustering and spectral clustering. The two methods were evaluated separately, and the obtained results are discussed in Sect. 4. The clustering phase resulted in k clusters such that \(k\le d\). From each cluster we then chose the spectral band having the highest information gain ratio and added it to the set of representative bands. Next, a classifier was trained and tested using samples consisting of only the k representative bands; let the accuracy achieved be \(ACC_k\). We further reduced the set of representative bands by applying a band pruning method to get the final set of selected bands. In the band pruning phase, the set of representative bands was first arranged in descending order of information gain ratio. Then, from this sorted list of k bands, only the leading m bands were selected such that \(ACC_m\ge ACC_k\).
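Putting the two phases together, a minimal sketch of representative-band selection followed by band pruning could look as follows; here evaluate_accuracy stands for any classifier wrapped in cross-validation and is a hypothetical placeholder, not a function defined in the paper.

```python
import numpy as np

def select_representative_bands(igr, cluster_labels):
    """Pick, from every cluster, the band index with the highest gain ratio."""
    reps = []
    for c in np.unique(cluster_labels):
        members = np.where(cluster_labels == c)[0]
        reps.append(members[np.argmax(igr[members])])
    return np.array(reps)

def prune_bands(reps, igr, evaluate_accuracy):
    """Keep only the leading m bands (by gain ratio) whose accuracy
    matches that of the full representative set (ACC_m >= ACC_k)."""
    order = reps[np.argsort(igr[reps])[::-1]]    # representative bands, descending IGR
    acc_k = evaluate_accuracy(order)             # ACC_k with all k representative bands
    for m in range(1, len(order) + 1):
        if evaluate_accuracy(order[:m]) >= acc_k:
            return order[:m]
    return order
```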

3 Experimental Set Up

This section presents a brief overview of the experimental set up used for analysis of the proposed method.

3.1 Dataset Description

For carrying out the experiments, two datasets, namely Indian Pines and Kennedy Space Centre (KSC), were used. A brief description of the datasets is presented below.

Indian Pines Dataset. was acquired by AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) over the Indian Pines test site in north-western Indiana. The original scene consists of 224 bands, from which a corrected image of 200 bands was constructed by eliminating 24 noisy and water absorption bands. The image comprises 21025 pixels (10249 labelled and 10776 unlabelled) and 16 classes.

Kennedy Space Centre Dataset. was captured by the AVIRIS sensor over the Kennedy Space Centre, Florida. The original dataset contained 224 bands, of which only 176 bands were retained in the corrected dataset. The dataset contains a total of 314368 pixels, out of which 5211 are labelled. The image contains 13 classes.
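For reference, these scenes are commonly distributed as MATLAB files; a typical way to load the corrected Indian Pines cube and its ground truth with SciPy is shown below. The file and variable names follow the widely used public distribution and are an assumption; adjust them to the local copies.

```python
from scipy.io import loadmat

# File and variable names follow the commonly available public distribution;
# adjust them if the local copies are named differently.
cube = loadmat("Indian_pines_corrected.mat")["indian_pines_corrected"]  # (145, 145, 200)
gt = loadmat("Indian_pines_gt.mat")["indian_pines_gt"]                  # (145, 145)
```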

3.2 Clustering Techniques Used

For clustering the similar bands of the input hyperspectral image, we have used two clustering techniques, namely Agglomerative Hierarchical Clustering (HC) and Spectral Clustering (SC). In agglomerative hierarchical clustering, each individual band image is initially treated as a cluster; then, following a “bottom-up” approach, in every iteration the pair of clusters exhibiting the highest similarity is merged, until only one cluster remains [7]. For our experiment we have used the correlation distance between the bands as the dissimilarity metric, defined as \(|{1-\rho (b_i,b_j)}|\), where \(b_i\) and \(b_j\) are two band images. Average linkage has been used as the merging criterion for the clusters. The output of hierarchical clustering can be depicted with the help of a dendrogram. The dendrogram may be cut either by specifying the desired height or the desired number of clusters. For our experiment, the desired number of clusters, k, has been used to cut the dendrogram.
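A minimal sketch of this step using SciPy is given below, assuming X is the n × d matrix of labelled-pixel reflectances from Eq. 9; the function name is illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_bands_hierarchical(X, k):
    """Group the d band images into k clusters using average-linkage
    agglomerative clustering on the correlation distance |1 - rho|."""
    rho = np.corrcoef(X.T)                            # d x d band-to-band correlations
    dist = np.abs(1.0 - rho)                          # correlation distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=k, criterion="maxclust")     # cluster label per band
```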

In spectral clustering, considering the bands as a set of vertices, a similarity matrix among the vertices is first generated. From this similarity matrix, the graph Laplacian is calculated in the subsequent step. Then the eigenvalues and eigenvectors of the Laplacian matrix are obtained by performing eigen-analysis. If k clusters are desired, only the first k eigenvectors are computed and treated as the new feature representation of the bands. Finally, using k-means clustering, these points are grouped into k clusters [8].
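The corresponding step could be sketched with scikit-learn as below; the paper does not state which similarity was used to build the affinity matrix for spectral clustering, so the absolute band-to-band correlation is assumed here.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_bands_spectral(X, k):
    """Group the d bands into k clusters by spectral clustering, using the
    absolute band-to-band correlation as the similarity (affinity) matrix."""
    affinity = np.abs(np.corrcoef(X.T))               # d x d similarity matrix
    sc = SpectralClustering(n_clusters=k, affinity="precomputed",
                            random_state=0)
    return sc.fit_predict(affinity)                   # cluster label per band
```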

3.3 Classifiers

For evaluating the worth of the bands selected by the proposed algorithm, we have used two classifiers, namely K-Nearest Neighbour (KNN) and multi-class Support Vector Machine (SVM) adopting the one-against-one scheme [1, 5]. For the SVM classifier we have used the polynomial kernel with parameters \(C=46\) and degree \(d=5\). For the KNN classifier we have used a neighbourhood parameter equal to 101 (approximately the square root of the number of samples) and Euclidean distance as the distance metric. For training and testing the classifiers, the labelled pixels of the dataset were first extracted to be used as the input set of samples. The reflectance values of a pixel across the selected bands were considered as the associated feature vector. Cross-validation with 10-fold stratified partitions was performed to estimate the classification accuracy. Section 4 presents the detailed classification results achieved by the classifiers using different numbers of selected bands. The classification accuracy is reported as the mean accuracy over the 10 folds along with the standard deviation. The accuracy of the \(i^{th}\) fold can be calculated using the following formula-

$$\begin{aligned} AC_i= \frac{TP_i}{Total_i} \end{aligned}$$
(11)

where \(TP_i\) and \(Total_i\) denote the number of true positives (i.e., the number of cases where the actual and predicted class labels are the same) and the total number of testing samples, respectively, in the \(i^{th}\) fold. Over the 10 folds, the mean accuracy and standard deviation are defined as-

$$\begin{aligned} ACC=\frac{\sum _{i=1}^{10}AC_i }{10} \end{aligned}$$
(12)
$$\begin{aligned} stdv=\sqrt{\frac{\sum _{i}{(AC_i-ACC)^2}}{10}} \end{aligned}$$
(13)
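A sketch of this evaluation protocol with scikit-learn is given below, using the stated kernel and neighbourhood parameters; the remaining settings (e.g. the polynomial kernel's gamma and the fold shuffling) are left at library defaults or chosen arbitrarily and are assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def evaluate(X_sel, c):
    """10-fold stratified cross-validation accuracy (Eqs. 11-13)
    for both classifiers, using only the selected bands in X_sel."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    svm = SVC(kernel="poly", C=46, degree=5)            # multi-class handled one-vs-one
    knn = KNeighborsClassifier(n_neighbors=101, metric="euclidean")
    results = {}
    for name, clf in (("SVM", svm), ("KNN", knn)):
        acc = cross_val_score(clf, X_sel, c, cv=cv, scoring="accuracy")
        results[name] = (acc.mean(), acc.std())         # ACC (Eq. 12) and stdv (Eq. 13)
    return results
```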

4 Results and Discussion

At first, the information gain ratio of each band of the input hyperspectral image was calculated. Figure 1a and b show the information gain ratio of each spectral band of the Indian Pines and KSC datasets respectively. For each dataset, experiments were performed with different numbers of clusters ranging from 5 to 40 in steps of 5. Tables 1 and 2 summarize the results obtained by the classifiers for each clustering technique. The highest achieved accuracy is presented in bold.

For the KSC dataset, the SVM classifier achieved its highest accuracy using only 35 representative bands for both clustering techniques. However, the accuracy obtained using SC+SVM (spectral clustering and SVM) was 0.94, which is much superior to the highest accuracy of 0.89 achieved by the HC+SVM (hierarchical clustering and SVM) technique. After application of the band pruning method, similar accuracy could be achieved using only 25 and 31 selected bands for the SC+SVM and HC+SVM techniques respectively. Similarly, for the HC+KNN (hierarchical clustering and KNN) and SC+KNN (spectral clustering and KNN) techniques, the highest accuracies of 0.87 and 0.93 respectively were achieved using only 30 representative bands initially, which were further refined to 27 and 24 bands respectively in the band pruning phase. The accuracy achieved using all 176 bands of the KSC dataset is 0.94 for the SVM classifier and 0.93 for the KNN classifier. So, it can be claimed that, using less than 18% of the total bands, accuracy comparable to that obtained using all bands could be achieved. Table 3 presents the final number of selected bands for the different techniques. Here, we have shown the results of applying the band pruning technique only to the set of representative bands which resulted in the highest accuracy for the respective techniques.

Fig. 1. Information gain ratios for the bands of (a) the Indian Pines dataset and (b) the KSC dataset

Table 1. Experimental results for KSC dataset

For the Indian Pines dataset, applying the HC+SVM and SC+SVM techniques, highest accuracies of 0.82 and 0.81 could be achieved using 40 representative bands. The band pruning technique further reduced the number of bands to 37 and 36 respectively by eliminating the insignificant bands. However, when the KNN classifier was used, the highest accuracy degraded. Accuracy scores of only 0.76 and 0.80 were achieved using 35 and 40 initial representative bands respectively, which were subsequently reduced to 25 and 34 selected bands. Using all the bands, accuracy scores of 0.87 and 0.79 were achieved.

Table 2. Experimental results of Indian Pines dataset
Table 3. Final number of selected bands after band pruning

From the obtained results, it may be observed that in most cases the SC+SVM technique gave better performance than the other techniques. Individually as well, spectral clustering and the SVM classifier performed much better than their counterparts.

5 Conclusion

In this work, an information gain ratio based supervised band selection technique for hyperspectral images has been presented. Experimental analysis of the proposed algorithm over two hyperspectral images revealed that, with a very small percentage of bands selected by the proposed algorithm, a satisfactory accuracy score could be achieved. In the first round, representative bands, one from each cluster of similar bands, were selected based on the highest information gain ratio. A second round of reduction was applied to this set of representative bands to ensure the elimination of insignificant bands that might have been selected in the first round. This step further reduced the size of the input image and made it more suitable for processing and storage.