
1 Introduction

Recently, both the quantity and quality of remote sensing (RS) images have been growing at a rapid rate, and these images help solve a wide range of important problems. To make full use of remote sensing big data, there is a pressing need for efficient information management, mining and interpretation methods. Significant efforts have therefore been made to develop efficient retrieval methods for searching data of interest in large RS databases. Newer generations of high-resolution satellite imagery (HRSI) are of increasing importance for Earth observation because they allow a wide range of objects to be recognized. A considerable body of literature has addressed the extraction of discriminative features for HRSI retrieval.

Feature extraction methods can be divided into three categories. The first category includes methods based on global features, which describe an image as a whole in terms of color [1], texture [2], and shape distributions [3]. Several studies have employed multi-resolution features to index HRSI, such as the Discrete Wavelet Transform (DWT) [4], Gabor filters [5], and steerable pyramids [6]. The drawback of global features is that they lose the detail information of the image. The second category comprises local approaches, which have shown better performance than global approaches thanks to their strong invariance properties [7, 8]. The raw local features are encoded using the bag-of-visual-words (BoVW) model, Fisher vectors (FV) or the vector of locally aggregated descriptors (VLAD). The third category consists of deep learning methods such as Convolutional Neural Networks (CNNs), which have produced impressive results in object recognition [9, 10]. However, despite their outstanding efficiency, several factors limit the applicability of deep learning methods: they require a large amount of labeled data, and they are extremely computationally expensive to train.
In this work, we propose an efficient CBIR system for HRSI based on boosted SURF features, obtained by adding local color, texture and structural content around key points and by using a category-specific strategy for dictionary creation. The rest of the paper is organized as follows: Sect. 2 describes the feature extraction method, Sect. 3 presents the experimental setup, Sect. 4 reports the obtained results together with a comparison against state-of-the-art references, and finally we draw conclusions.

2 Feature Extraction

Feature extraction is carried out with the Speeded Up Robust Features (SURF) algorithm [11], a scale- and rotation-invariant interest point detector and descriptor that is also computationally very fast, properties that make it well suited to our retrieval system. The SURF descriptor is boosted by adding color, texture and structural content around the key points, as described below:
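Much of SURF's speed comes from its use of integral images, which make box-filter responses computable in constant time at any scale. A minimal sketch of that building block (the helper names are ours, not from any particular library):

```python
import numpy as np

def integral_image(img):
    """Zero-padded cumulative sum over rows and columns."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] with four lookups, independent of box size."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]
```

SURF approximates Gaussian second-order derivatives with combinations of such box sums, which is why the detector scales so well.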

(a) Color features: color information plays a vital role in retrieval. Since the SURF descriptor was designed mainly for gray-scale images, it struggles to distinguish color regions that appear very similar once the color is discarded. Several SURF variants incorporate color data, e.g., color histograms around the key point, or the concatenation of the SURF descriptors computed separately for the three RGB channels, yielding a 64 × 3-dimensional feature vector. In this work, color is handled in a simpler manner: the RGB values of the key point itself are taken as the color feature. Our evaluation shows that using the key point color performs better than using a color histogram around that point.
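A sketch of this choice, assuming key points arrive as (x, y) pixel coordinates (the function name is hypothetical):

```python
import numpy as np

def keypoint_colors(img_rgb, keypoints):
    """Take the raw RGB triplet at each key point as its color feature."""
    rows = np.array([int(round(y)) for x, y in keypoints])
    cols = np.array([int(round(x)) for x, y in keypoints])
    return img_rgb[rows, cols]  # shape: (n_keypoints, 3)
```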

(b) Texture features: these rest on the assumption that a large amount of the information around a key point is encoded in its local variance (LV) distribution, which is in turn illumination- and rotation-invariant. We compute the LV around each key point for each color channel. Quantizing the local variance into bins requires a priori knowledge of the maximum variance of the dataset, yet in the presence of noise the LV becomes unreliable, since noise significantly inflates its value. To overcome this limitation, we compute the LV on a smoothed neighborhood. Smoothing is obtained with an anisotropic diffusion filter: anisotropic diffusion, also called Perona–Malik diffusion, reduces image noise without removing significant parts of the image content, typically edges, lines or other details that matter for interpretation. Smoothed local variance histograms are then computed over a given neighborhood. The size of the analysis window has an important impact on the ability to capture textural information; we tried several neighborhood sizes ranging from 3 × 3 up to 13 × 13 pixels, and the local variance technique performed best with a 5 × 5 window.
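The texture description can be sketched as follows; for brevity, the channel is assumed to have already been smoothed by the anisotropic diffusion filter, and the double loop is kept for clarity rather than speed:

```python
import numpy as np

def local_variance(channel, half=2):
    """Variance of each (2*half+1) x (2*half+1) neighborhood
    (half=2 gives the 5 x 5 window that worked best here)."""
    h, w = channel.shape
    out = np.zeros((h, w))
    for r in range(half, h - half):
        for c in range(half, w - half):
            out[r, c] = channel[r - half:r + half + 1,
                                c - half:c + half + 1].var()
    return out
```

The per-channel LV maps are then quantized into histograms around each key point.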

(c) Structural features: these are extracted with the Local Binary Pattern (LBP) operator, first introduced in 1994 by Ojala et al. [12]. Originally it operated on a 3 × 3 rectangular neighborhood, using the center pixel as a threshold. The LBP code of the central pixel is obtained by multiplying the thresholded values by power-of-two weights and summing the results. An improved uniform, gray-scale- and rotation-invariant operator, denoted \( LBP_{P,R}^{riu2} \), yields a histogram of P + 2 bins, each bin recording the occurrence of a given shape primitive in the image (different types of edges, spots, flat areas, line ends, corners, etc.). The drawback of the LBP operator is that a small change in the input image can change the output, so it may not work properly on noisy images. To make it more robust, we compute it on the smoothed image produced by the anisotropic diffusion. In this work, \( LBP_{8,1}^{riu2} \) was used; the smoothed LBP histograms of the three color channels are computed over a 5 × 5 neighborhood around the key point and then concatenated.

The number of local features per image may be immense. The Bag of Visual Words (BoVW) model addresses this by quantizing descriptors into “visual words”. After each image has been assigned a set of local descriptors, these are clustered with the K-means algorithm to generate a dictionary of visual words; a histogram is then computed by counting how many descriptors are assigned to each visual word. Nevertheless, the traditional visual-vocabulary strategy has two disadvantages. First, clustering a large pool of feature vectors is heavy in terms of memory and computation time. Second, since the clustering is performed on the whole image set, it cannot guarantee a visual vocabulary with good discriminative ability for image retrieval. To generate more discriminative visual words and reduce the memory demand, we propose a category-specific visual word creation: k-means clustering is performed on one category at a time, and the resulting small codebooks are then grouped into a single codebook. The feature extraction process is illustrated in Fig. 1.
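The category-specific codebook construction can be sketched as below; a plain Lloyd's k-means is used here as a stand-in for whatever implementation the experiments relied on:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the k centroids (visual words)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return centers

def category_specific_codebook(descriptors_by_class, words_per_class):
    """Cluster one category at a time, then stack the small codebooks."""
    return np.vstack([kmeans(X, words_per_class)
                      for X in descriptors_by_class])

def bovw_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest visual word and count."""
    d = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(d.argmin(1), minlength=len(codebook))
    return hist / hist.sum()
```

Clustering each class separately keeps every k-means run small, and the stacked codebook retains words tuned to each category.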

Fig. 1. Feature extraction process.
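The \( LBP_{8,1}^{riu2} \) code used in (c) can be sketched as follows; for brevity the 8 raw neighbors of a 3 × 3 window are used instead of circularly interpolated samples:

```python
import numpy as np

def lbp8_riu2(nb):
    """Rotation-invariant uniform LBP code of a 3 x 3 neighborhood.
    Uniform patterns (at most 2 circular 0/1 transitions) are coded by
    their number of 1s (0..8); everything else falls into bin P + 1 = 9."""
    c = nb[1, 1]
    ring = np.array([nb[0, 0], nb[0, 1], nb[0, 2], nb[1, 2],
                     nb[2, 2], nb[2, 1], nb[2, 0], nb[1, 0]])
    bits = (ring >= c).astype(int)
    transitions = np.abs(np.diff(np.append(bits, bits[0]))).sum()
    return int(bits.sum()) if transitions <= 2 else 9
```

Per image, a histogram of P + 2 = 10 such codes is accumulated for each color channel, and the three histograms are concatenated.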

3 Experimental Setup

The database used in this study, illustrated in Fig. 2, is the land use/land cover UC Merced dataset (UCMD) [8]. It contains 21 categories of high-resolution satellite images; each category holds 100 images of size 256 × 256 × 3 with a spatial resolution of 30 cm, corresponding to different regions of the Earth: agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium-density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis courts. In the indexing phase, our system extracts SURF, color, texture and LBP descriptors for all the images of the dataset. These are used to generate color, texture and LBP codebooks, once with traditional clustering over the whole dataset and once with category-specific clustering, and a histogram is then computed for each description. As a result, each image in the dataset is represented by boosted features consisting of four histograms. In the query phase, the feature vectors are extracted from the query image and retrieval is performed with the chi-square similarity measure, with coefficients weighting each description. The experiments indicate that the best precision is obtained when giving more importance to the SURF feature and almost equal weights to the added features (ksurf = 0.8, kcolor = 0.15, ktexture = 0.15, kLBP = 0.2). Retrieval performance is assessed with the most commonly used evaluation measures: Precision (Pr), Recall (Re), Mean Average Precision (MAP), and the Average Normalized Modified Retrieval Rank (ANMRR).
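The weighted chi-square matching described above can be sketched as follows (the weight keys and the small epsilon are our own conventions):

```python
import numpy as np

def chi_square(h1, h2, eps=1e-10):
    """Chi-square distance between two L1-normalised histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def weighted_distance(query, candidate, weights):
    """Combine the per-description distances with the weights from the text."""
    return sum(w * chi_square(query[name], candidate[name])
               for name, w in weights.items())

WEIGHTS = {"surf": 0.8, "color": 0.15, "texture": 0.15, "lbp": 0.2}
```

Ranking the database by this combined distance gives more influence to the SURF histogram while still letting the added cues break ties between visually similar classes.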

Fig. 2. Land use dataset samples.

4 Results and Discussion

In this section, we present retrieval results on the full 21-class dataset, as well as on a reduced dataset containing only eight classes (agricultural, airplane, beach, buildings, chaparral, dense residential, forest, and harbor), mainly for comparison purposes.

The experimental evaluation started with an assessment of SURF features alone for indexing HRSI. SURF performs very well for some classes, such as chaparral, forest, harbor, medium residential, and parking lot, but is less efficient for the others. This is due to confusion between classes that share similar local characteristics. The most notable confusions were observed between buildings and intersection, dense residential and mobile home park, freeway and overpass, golf course and sparse residential, river and sparse residential, storage tanks and airplane, as well as between tennis court and mobile home park; the visual similarity of these classes is high even for human observers. We also noticed that each image class prefers a different codebook size; in terms of global accuracy, however, the maximum was obtained for a codebook size of 420, which we chose for all subsequent tests as a time/accuracy compromise. These preliminary results justify introducing additional visual features, such as color and texture, that can discriminate similar classes, and generating specific dictionaries for each image class to produce more discriminative visual words. Table 1 reports the results for the 21 classes in terms of different percentages of relevant images over the retrieved ones. Several observations can be made. First, using the category-specific (CS) learned dictionaries leads to an improvement of 3% in average precision and 3.6% in MAP, and a decrease of 0.04 in ANMRR, for conventional SURF features; for boosted SURF features, it yields an improvement of 7.89% in average precision, a gain of 8.22% in MAP, and a decrease of 0.02 in ANMRR. Second, boosting the SURF features clearly increases average precision by 14.27% and MAP by 14.38%, and decreases ANMRR by 0.18, with the general codebook; with the category-specific codebook, it increases average precision by 19.16% and MAP by 19%, and decreases ANMRR by 0.16. Finally, combining the category-specific dictionaries with boosted SURF features achieves a total improvement of 22.16% in average precision and 22.6% in MAP, and a decrease of 0.2 in ANMRR.
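The ANMRR values reported above can be computed with the MPEG-7-style procedure sketched below; the penalty window K = 2·NG is an assumption (MPEG-7 takes min(4·NG(q), 2·GTM)), so treat this as a sketch rather than the exact protocol used:

```python
import numpy as np

def nmrr(found_ranks, ng):
    """Normalised Modified Retrieval Rank for one query.
    found_ranks: 1-based ranks at which ground-truth items were retrieved;
    ng: total number of ground-truth items for the query."""
    k = 2 * ng                       # assumed penalty window
    penalty = 1.25 * k               # rank assigned to missed / late items
    ranks = [r if r <= k else penalty for r in found_ranks]
    ranks += [penalty] * (ng - len(found_ranks))
    avr = sum(ranks) / ng            # average rank
    mrr = avr - 0.5 - ng / 2         # modified retrieval rank
    return mrr / (penalty - 0.5 - ng / 2)

def anmrr(per_query):
    """Mean NMRR over (found_ranks, ng) pairs; 0 is perfect, 1 is worst."""
    return float(np.mean([nmrr(r, ng) for r, ng in per_query]))
```

The normalisation makes scores comparable across queries with different numbers of ground-truth items, which is why lower ANMRR means better retrieval.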

Table 1. Pr@k, MAP and ANMRR values for 21 classes

Table 2 presents a comparative evaluation of the proposed approach against several state-of-the-art retrieval methods [8, 9, 13, 14], according to the metrics Pr@10, Pr@50, Pr@100, average precision, MAP, and ANMRR. As can be seen from Table 2, the global descriptors (simple statistics, Gabor texture features and the HLS color histogram) have the lowest performance among all the listed descriptors, with simple statistics being the least efficient. Steerable pyramids, local features, as well as global and local morphological descriptors perform better, with ANMRR values ranging from 0.460 to 0.591. Indeed, global methods represent each image by capturing information from the whole scene, without attending to the constituents of the image, and consider only a single aspect such as texture or color, which explains their poor performance. On the other hand, dense SIFT and SIFT local features perform worse than the proposed approach because they only capture gradient information around key points and ignore the other cues. The CNN-based descriptors achieve a best MAP value of 61.85 and a best ANMRR value of 0.316. Our proposed method achieves a MAP value of 67.73, about 5.88% higher than that of the CNN-based descriptors, and an ANMRR value of 0.26, about 0.056 lower. Looking at the different precision levels Pr@k, our Pr@10 of 90.60 is approximately equivalent to that of the CNN descriptors, after which our method performs better, ending with a Pr@100 about 6.04% higher. These results confirm that with a small image set, traditional methods are more adequate than deep learning solutions. The algorithmic complexity of the proposed approach is similar to that of most local and global approaches; it requires little processing power and a small training set, which makes it fast and simple to implement. A few minutes were enough to train on the whole UC Merced dataset on an i5 CPU (2.2 GHz, 8 GB RAM). It has fewer parameters to set, and therefore less memory is needed for storing the image features. By contrast, a CNN requires GPUs that most people cannot afford and millions of labeled training images, which makes the training process very time consuming; it also needs a lot of memory to store the network weights, and experience to make design decisions.

Table 2. Retrieval feature evaluation for 21 image classes

For the reduced set, Table 3 presents the obtained results in terms of MAP and of different percentages of relevant images over the retrieved ones. We observe an increase of 11.95% in average precision and of 12.59% in MAP. Comparison with several state-of-the-art methods [1, 6, 15], shown in Fig. 3, demonstrates that our approach is consistently better.

Table 3. Pr@k, MAP values for 8 classes
Fig. 3. Retrieval feature evaluation for 8 image classes

Precision/Recall curves are plotted for comparison with references [1, 6, 8, 9] on the 21-class dataset in Fig. 4, and with [1, 6, 15] on the 8-class dataset in Fig. 5. Clearly, the proposed image representation, which combines SURF, color, texture and structural features around key points with category-specific codebook creation, yields a complete description of the key points and highly discriminative visual words, leading to the best performance among the state-of-the-art methods.

Fig. 4. Pr/Re curves for 21 classes

Fig. 5. Pr/Re curves for 8 classes

5 Conclusion

In this paper, an effective image representation method for high-resolution remote sensing image retrieval was proposed. It boosts SURF features by adding color, texture and structural content around the key points, and adopts a category-specific codebook generation strategy, which produces more discriminative visual words. The experimental results obtained on the UC Merced dataset outperform many local and global techniques and even deep neural networks. The strong descriptive power of the proposed method for HRSI also makes it suitable for the retrieval and classification of natural color images. In future work, it would be interesting to explore other descriptors that provide complementary information, especially for categories with high inter-class correlation such as buildings and dense residential, or freeway and runway.