
1 Introduction

In recent years, convolutional neural networks have once again attracted widespread attention from academic circles and are actively applied to image classification and recognition. The basic idea of the neural network method is to train a multi-layer perceptron to obtain a decision function, and then use it to classify pixels [1]. This method requires a large amount of data, and a deep neural network has a huge number of connections, so it readily exploits spatial information and can better handle noise and unevenness in the image. Nowadays, the application of deep learning in remote sensing image processing has become widespread.

So far, the GaoFen (high-resolution) Earth observation system has launched the GF-1 high-resolution wide-swath satellite, the GF-2 sub-meter panchromatic satellite, the GF-3 1 m radar satellite, the GF-4 geosynchronous gaze satellite, and others. These satellites differ in spatial resolution, coverage width, spectral bands, and other performance parameters, and have enriched the data sources of China's Earth observation system. High-resolution remote sensing image classification has been used extensively in fields such as land planning, surveying and mapping, national defense security, agriculture, and disaster prevention and mitigation [2].

With the development of deep learning, remote sensing image processing has also advanced to some extent. However, its industrial application is still at an early stage; a crucial reason is that traditional machine learning methods are inefficient. Thus, semantic segmentation based on deep learning has become a research hotspot in this field [3]. A deep convolutional network model is trained on big data and used to automatically extract surface geomorphology information from high-resolution remote sensing images. Moreover, using deep learning to extract the complex information contained in big data enables more accurate predictions on unknown data [4]. Academic circles have therefore begun to apply deep learning theory to high-resolution remote sensing images, exploring multi-layer representations of their latent distributions through deep neural networks and extracting deep features. Semantic segmentation is thus incorporated into the deep learning framework and used for high-resolution remote sensing image classification [5].

In this paper, the DeepLabv3 network is used to process a GF-2 image of Chenzhou, Hunan Province, and the semantic segmentation method is studied and improved. The original image is labeled according to local geomorphological features and the dataset is augmented so that the model can be trained. A model suited to GF-2 image data is explored in terms of network depth, input image size, and network parameter settings.

2 Related Work

Semantic segmentation refers to assigning the pixels of the same object in an image to the same class and separating different objects, predicting the category of each pixel in the image. In semantic segmentation, the visual input must be divided into distinct, semantically interpretable categories [6].

The fully convolutional network proposed by Long [7] adjusted the general convolutional network structure to enable dense prediction without fully connected layers, replacing the fully connected layers with convolutional layers. This model can segment an input image of arbitrary size to obtain the expected output size, and it also greatly reduces running time compared with the traditional image-block classification method. Therefore, most subsequent semantic segmentation network models adopt this structure.

One of the main challenges for semantic segmentation based on deep convolutional networks is the pixel-information loss caused by pooling layers. Although a pooling layer can further extract abstract features and increase the receptive field [8], the spatial location information of the pixels is lost. Semantic segmentation requires the output image to have the same resolution and size as the input image, so the spatial position information of the pixels must be reintroduced. A different structure can solve this pixel loss problem.

Atrous convolution [9] structure: this structure uses atrous convolutional layers to replace pooling layers. It is a feature sampling method that moves from dense to sparse sampling of the data, so it can be inserted effortlessly into a pre-trained base network: there is no need to change the structure of the network, and the number of training parameters is not greatly increased. For a given convolution kernel, it increases the receptive field of subsequent convolutions. Since large convolution kernels use parameters inefficiently, this method is better than increasing the kernel size: it significantly expands the receptive field without reducing the spatial dimension [10]. Representative networks include DeepLab, PSPNet, and RefineNet (Fig. 1).

Fig. 1. Atrous convolution: (a) dilation rate 1, (b) dilation rate 2, (c) dilation rate 4
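To make the effect of the dilation rate concrete, the following minimal sketch (the helper function name is ours, not from the paper) computes the effective extent of a 3 × 3 kernel at the three rates shown in Fig. 1; the parameter count stays fixed while the receptive field grows.

```python
# Minimal sketch: effective extent of a dilated kernel.
# A k x k kernel with dilation rate r covers k + (k - 1)(r - 1) pixels
# per side, without adding any parameters.
def effective_kernel_size(k: int, r: int) -> int:
    return k + (k - 1) * (r - 1)

for r in (1, 2, 4):  # the three rates of Fig. 1
    size = effective_kernel_size(3, r)
    print(f"3x3 kernel, rate {r}: covers {size}x{size} pixels")
# rate 1 -> 3x3, rate 2 -> 5x5, rate 4 -> 9x9
```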

3 Method

DeepLabv3 takes the second-generation residual network (ResNet) as its base network. The residual network contains 4 residual blocks, each containing a different number of residual units; a standard residual unit consists of two 1 × 1 convolutions and one 3 × 3 convolution. Assuming the neural network input is x and its expected output is H(x), the goal to be learned becomes:

$$ F\left( x \right) = H\left( x \right) - x $$
(1)

The original output is F(x) + x. The residual structure is equivalent to changing the learning goal: instead of learning the complete output H(x), the network learns the difference between the output and the input, H(x) − x [11] (Fig. 2).

Fig. 2. Residual unit structure
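A minimal Keras sketch of the bottleneck unit in Fig. 2 follows; batch normalization and the exact filter counts of the paper's ResNet are omitted for brevity, so this illustrates the F(x) + x structure rather than reproducing the paper's exact layer.

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_unit(x, filters):
    """Bottleneck residual unit: two 1x1 convolutions around one 3x3
    convolution, with the identity shortcut realizing F(x) + x."""
    shortcut = x
    y = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(y)
    y = layers.Conv2D(x.shape[-1], 1, padding="same")(y)  # restore channel count
    return layers.ReLU()(layers.Add()([y, shortcut]))     # F(x) + x
```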

To handle multi-scale feature learning, DeepLabv3 effectively extracts multi-scale context information through a parallel architecture of atrous convolutions with different dilation rates. The last residual block uses atrous convolution instead of regular convolution [12]. In addition, on top of this residual block DeepLabv3 applies atrous spatial pyramid pooling (ASPP), which combines multiple atrous convolution kernels with batch normalization to capture multi-scale context information and uses image-level feature information to optimize the training performance of the network [13], further improving the classification result. The block diagram of the initial implementation of the algorithm is shown below (Fig. 3).

Fig. 3. Block diagram of the initial implementation of the DeepLabv3 algorithm

3.1 Atrous Convolution for Detail Feature Extraction

Deep neural networks in the fully convolutional style have been shown to be effective for semantic segmentation. However, the repeated combination of max-pooling and strided convolution at successive layers of these networks significantly reduces the spatial resolution of the output feature map, so a deconvolution layer is typically used to recover the spatial resolution. DeepLabv3 uses atrous convolution instead of deconvolution.

Consider a two-dimensional signal [14]. For each location i, let y be the corresponding output, w a convolution kernel with weight index k, and r the dilation rate. The atrous convolution over the input feature map x is computed as

$$ y\left[ i \right] = \sum\limits_{k} {x\left[ {i + r \cdot k} \right]w\left[ k \right]} $$
(2)

The input x is convolved with an upsampled kernel obtained by inserting r − 1 zeros between two consecutive kernel weight values along each spatial dimension. Standard convolution is the special case r = 1. Atrous convolution thus allows us to modify the receptive field of the kernel simply by changing the dilation rate r.
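A small NumPy sketch of Eq. (2) in one dimension may clarify this (the function name is ours); it also checks the zero-insertion equivalence just described.

```python
import numpy as np

def atrous_conv1d(x, w, r):
    """1-D atrous convolution per Eq. (2): y[i] = sum_k x[i + r*k] * w[k]."""
    k = len(w)
    span = r * (k - 1) + 1                     # extent of the dilated kernel
    return np.array([sum(x[i + r * j] * w[j] for j in range(k))
                     for i in range(len(x) - span + 1)])

x = np.arange(10.0)
w = np.array([1.0, 2.0, 1.0])
print(atrous_conv1d(x, w, 2))                  # rate-2 atrous convolution

# Inserting r - 1 = 1 zero between consecutive weights and applying a
# standard (rate-1) convolution yields the same output.
w_dilated = np.array([1.0, 0.0, 2.0, 0.0, 1.0])
print(atrous_conv1d(x, w_dilated, 1))
```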

3.2 Atrous Spatial Pyramid Pooling

ASPP has been shown to be effective for resampling features at multiple scales in order to classify regions of arbitrary scale precisely and efficiently. Multiple parallel atrous convolutions with different dilation rates are applied on top of the feature map, yielding feature maps that capture context at diverse scales. The feature maps are then pooled to generate a fixed-length representation. This makes it possible to robustly segment a single object at multiple scales, thereby capturing the context information of the image. The features extracted at each dilation rate are processed in separate single-scale branches and finally merged to produce the result [15].

To incorporate global context information into the model, global average pooling is applied to the last feature map of the model, the resulting image-level features are fed into a 1 × 1 convolution with 256 filters, and the features are then bilinearly upsampled to the desired spatial dimension. In the end, the improved ASPP consists of (a) one 1 × 1 convolution and three 3 × 3 convolutions with rates = (6, 12, 18), all with 256 filters and batch normalization, and (b) the image-level features. The resulting features from all branches are concatenated and passed through another 1 × 1 convolution before a final 1 × 1 convolution recovers the desired resolution [16], producing the final classification result.
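A hedged Keras sketch of this ASPP head, assembled from the description above, follows; it is an illustration rather than the paper's exact implementation, and the rates argument makes it easy to instantiate the variant of Sect. 3.3 with aspp(x, rates=(3, 6, 9, 12)).

```python
import tensorflow as tf
from tensorflow.keras import layers

def aspp(x, rates=(6, 12, 18), filters=256):
    """ASPP sketch: one 1x1 conv, parallel 3x3 atrous convs, and
    image-level features, concatenated and fused by a 1x1 conv."""
    def conv_bn(y, k, r=1):
        y = layers.Conv2D(filters, k, dilation_rate=r,
                          padding="same", use_bias=False)(y)
        return layers.ReLU()(layers.BatchNormalization()(y))

    branches = [conv_bn(x, 1)] + [conv_bn(x, 3, r) for r in rates]

    # Image-level features: global average pooling, 1x1 conv, then
    # bilinear upsampling back to the feature-map resolution.
    img = layers.GlobalAveragePooling2D(keepdims=True)(x)
    img = conv_bn(img, 1)
    img = tf.image.resize(img, tf.shape(x)[1:3], method="bilinear")
    branches.append(img)

    return conv_bn(layers.Concatenate()(branches), 1)
```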

3.3 Improved DeepLabv3 Model

According to the characteristics of GF-2 imagery, the network model is adjusted, mainly by modifying the ASPP structure.

DeepLabv3 replicates several copies of residual block 4 and cascades them up to residual block 7. However, this consecutive cascade design is detrimental to semantic segmentation: it destroys detail information in the image and requires more computation and memory, reducing the running speed. After removing the cascaded blocks, the impact on the results was found to be limited, so the cascaded blocks are not used in this model.

DeepLabv3 processes the feature map with four parallel atrous convolutions of different dilation rates. ASPP can thus extract information at multiple scales very efficiently. However, the effective convolution weight within the kernel becomes smaller as the dilation rate increases. Therefore, to match the multi-scale characteristics of GF-2 imagery, the dilation rates are reduced and the number of parallel atrous convolutions is increased (rates (3, 6, 9, 12)), so that a wider range of context information can be captured well (Fig. 4).

Fig. 4. Improved DeepLabv3 atrous spatial pyramid pooling layer structure

4 Experiments

The experiments run on Ubuntu, using a residual network pre-trained on ImageNet as the base network framework. The training GPU is an NVIDIA GTX 1080 Ti. We use TensorFlow to build the deep neural network that processes the GF-2 images; it is trained and then used for data processing. Furthermore, the control-variable method is used to study the influence of the network parameters on the classification results and to find an appropriate combination of parameters.

4.1 Dataset Introduction

The dataset is derived from imagery of GF-2, a satellite developed and launched by China, covering the Chenzhou area in 2016. The Nanling Mountains contain complex, well-preserved primary and secondary forest communities and low-altitude broad-leaved forests, and the forest plant resources are rich and diverse. The imagery has a spatial resolution of 0.8 m, covers 112°623′–112°188′ E and 26°535′–25°882′ N, and has 4 bands. The original GF-2 image is pre-processed with ENVI, then cropped into 2000 * 2000 pixel tiles and annotated with Matlab. Ground-truth images are made by manual labeling, marking the different land-cover types with distinct colors.

According to the geomorphological characteristics, the classification targets are divided into seven categories: forest, water, house, road, furrow, others, and background (farmland is treated as the background); each category except the background is covered with a corresponding color (Fig. 5).

Fig. 5. Manually annotated images

4.2 Image Preprocessing

The original GF-2 dataset includes 12 training images (6 of which are processed versions of the originals) and 1 test image. The GF-2 images are too large to be fed directly into the network for training: the device's memory is limited and the input image sizes differ, hence the training images are cropped. On the other hand, a deep neural network model needs a large number of training samples. To avoid over-fitting to part of the dataset, the existing training data is augmented and expanded, enriching the GF-2 training dataset and improving the generalization ability of the segmentation model. Therefore, each GF-2 image is cropped into small image blocks for use as training data, implemented with OpenCV as follows (a sketch of steps (2) and (3) is given after the list).

(1) First, scale the 2000 * 2000 training images to the specific output size required.

(2) Then sample and crop them: the sampling method is random window sampling, which yields a small image of a fixed output size at each sampled coordinate.

(3) Perform data augmentation on the cropped images by random rotation, horizontal and vertical flipping, gamma transformation, blurring, erosion, noise addition, and bilinear filtering.

(4) Observation of the data reveals imbalances between categories; for example, there are few roads. The minority categories therefore need extra augmentation: before training starts, the cropped data is searched for images containing more roads, and the number of such samples is increased by image transformation. This yields a large amount of extended data used as the training dataset.
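The following OpenCV/NumPy sketch illustrates steps (2) and (3); the parameter values are illustrative assumptions rather than the paper's settings, and only geometric transforms are applied to the label images.

```python
import cv2
import numpy as np

def random_crop(image, label, size=512):
    """Step (2): random window sampling of an image/label pair."""
    h, w = image.shape[:2]
    y = np.random.randint(0, h - size + 1)
    x = np.random.randint(0, w - size + 1)
    return image[y:y + size, x:x + size], label[y:y + size, x:x + size]

def augment(image, label):
    """Step (3): random flips, rotation, gamma transform, blur, noise."""
    if np.random.rand() < 0.5:                   # horizontal flip
        image, label = cv2.flip(image, 1), cv2.flip(label, 1)
    if np.random.rand() < 0.5:                   # vertical flip
        image, label = cv2.flip(image, 0), cv2.flip(label, 0)
    k = np.random.randint(4)                     # rotation by k * 90 degrees
    image, label = np.rot90(image, k), np.rot90(label, k)
    gamma = np.random.uniform(0.7, 1.3)          # gamma transformation
    image = np.clip((image / 255.0) ** gamma * 255.0, 0, 255).astype(np.uint8)
    if np.random.rand() < 0.3:                   # blurring
        image = cv2.GaussianBlur(image, (3, 3), 0)
    noise = np.random.normal(0, 5, image.shape)  # additive noise
    image = np.clip(image + noise, 0, 255).astype(np.uint8)
    return image, label
```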

4.3 Semantic Segmentation Performance Evaluation Method

The following four standards are commonly used in image semantic segmentation to measure the accuracy of an algorithm [17]; a sketch computing all four from a confusion matrix is given after the list.

(1) Pixel Accuracy (PA): the ratio of correctly classified pixels in the test image to the total number of pixels in the ground-truth image.

(2) Mean Pixel Accuracy (MPA): the proportion of correctly classified pixels is computed for each class and then averaged over all classes.

(3) Mean Intersection over Union (MIoU): the standard measure for semantic segmentation. For each class, the intersection of two sets is divided by their union; in semantic segmentation, the two sets are the ground-truth image and the test image. The IoU is calculated for each class and then averaged.

(4) Frequency Weighted Intersection over Union (FWIoU): an enhancement of MIoU that weights each class by its frequency.
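A minimal NumPy sketch of the four metrics follows, assuming a confusion matrix hist in which entry (i, j) counts pixels of ground-truth class i predicted as class j (the function and variable names are ours).

```python
import numpy as np

def segmentation_metrics(hist):
    """PA, MPA, MIoU, and FWIoU from a class confusion matrix."""
    tp = np.diag(hist)                          # correctly classified pixels
    pa = tp.sum() / hist.sum()                  # Pixel Accuracy
    mpa = np.nanmean(tp / hist.sum(axis=1))     # Mean Pixel Accuracy
    union = hist.sum(axis=1) + hist.sum(axis=0) - tp
    iou = tp / union                            # per-class IoU
    miou = np.nanmean(iou)                      # Mean IoU
    freq = hist.sum(axis=1) / hist.sum()        # class frequencies
    fwiou = np.nansum(freq * iou)               # Frequency Weighted IoU
    return pa, mpa, miou, fwiou
```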

4.4 Comparative Analysis of Network Training and Prediction Results

4.4.1 DeepLabv3 Network Training and Prediction Results

The improved network parameters are fine-tuned according to the characteristics of the GF-2 imagery. The base network is the 101-layer residual network (Res101). The training parameters are: batch norm decay = 0.997, batch size = 4, learning rate = 0.0001, weight decay = 0.00001, and 100,000 iterations, with the learning rate reduced to one tenth at 40,000 and at 80,000 iterations; the training-to-testing ratio of the dataset is 4:1.
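The stepped schedule can be expressed directly in TensorFlow; the optimizer choice below is an assumption, since the text does not name one.

```python
import tensorflow as tf

# Learning rate starts at 1e-4 and drops to one tenth at 40,000 and
# 80,000 iterations, matching the schedule described above.
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[40000, 80000],
    values=[1e-4, 1e-5, 1e-6])

optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```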

The test image is then fed into the trained model for prediction. The test remote sensing image is too large, usually 2000 * 2000. Therefore, the prediction program first crops the test image into small image blocks, predicts each block by moving-window sampling, and then combines the predicted blocks to obtain the final predicted image (Fig. 6).

Fig. 6. (a) Original image (b) Ground-truth image (c) Predicted image
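A hedged sketch of this block-wise prediction follows; model.predict stands in for the trained network, and non-overlapping windows with zero padding at the edges are assumed for simplicity.

```python
import numpy as np

def predict_large_image(model, image, block=512):
    """Crop a large image into blocks, predict each, and stitch results."""
    h, w = image.shape[:2]
    out = np.zeros((h, w), dtype=np.uint8)
    for y in range(0, h, block):
        for x in range(0, w, block):
            patch = image[y:y + block, x:x + block]
            ph, pw = patch.shape[:2]
            padded = np.zeros((block, block, image.shape[2]), image.dtype)
            padded[:ph, :pw] = patch             # pad edge patches
            probs = model.predict(padded[None, ...])[0]
            pred = np.argmax(probs, axis=-1).astype(np.uint8)
            out[y:y + ph, x:x + pw] = pred[:ph, :pw]
    return out
```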

Comparing the predicted image with the ground-truth image shows which image information can be classified accurately. DeepLabv3 uses parallel atrous convolutions with diverse dilation rates to extract multi-scale context information effectively, and the improved ASPP structure's ability to extract global context information is strengthened by combining image-level information. As a result, image boundaries and details are smoother, without blurring or aliasing. The model classifies each pixel while considering the relationships between pixels, rather than neglecting the spatial regularization that the usual pixel-classification-based segmentation methods ignore, so it has spatial consistency. For example, large areas of rivers, buildings, and forests are effectively identified.

The downside is that ASPP is not sensitive enough to slender objects and cannot extract their context information efficiently. For instance, roads and streams are not completely recognized by the model. Atrous convolution is sparse: a slender object falls into the zero-weight gaps of the atrous kernel, so part of the convolution result is zero and the pixels are recognized as background, leading to incorrect classification decisions.

4.4.2 Improved Model Compared to the Original DeepLabv3

The original DeepLabv3 uses a pre-trained Res50 as its base network and the poly strategy for the learning rate; the batch norm decay of the BN layer is 0.9997, the initial learning rate is 0.007, and the image size is 513, trained on PASCAL VOC 2012, a near-view dataset. The improved model has five parallel atrous convolution pooling structures with dilation rates (3, 6, 9, 12), among other changes. Comparing the original DeepLabv3 with the improved model gives Table 1.
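For reference, the poly learning-rate strategy mentioned above decays the rate polynomially with the iteration count; a common form is shown below, where the power value 0.9 is the usual DeepLab setting, stated here as an assumption since the text does not give it.

$$ lr = lr_{0} \cdot \left( {1 - \frac{iter}{max\_iter}} \right)^{power} ,\quad power = 0.9 $$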

Table 1. Performance evaluation of original DeepLabv3 and improved models

The improved network performs better than the original network in every index. The original network used larger dilation rates, under which the effective convolution weights in the kernel become smaller and smaller, so the global context information of the image cannot be extracted efficiently. The improved model has more parallel atrous convolutions, which helps extract deeper image details and handle complex remote sensing landform information. From the performance indicators on this dataset, the dilation rate setting (3, 6, 9, 12) performs best.

4.4.3 Network Depth and Input Size Impact on the Network

The improved DeepLabv3 is trained using residual networks of 24, 50, 74, and 101 layers as the base network framework, with training input image sizes of 512 * 512 or 256 * 256. The test images were not used in training, which tests the versatility of the model. They are predicted with the prediction program and then passed through the image quality evaluation program to obtain Table 2.

Table 2. Performance evaluation of different depth residual networks and input sizes

Table 2 shows that the Res101 base network with a training input size of 512 * 512 performs best.

(1) The depth of the residual network plays a major role in the performance indexes of the classification targets, because it determines the quality of the features extracted from the GF-2 image in aspects such as invariance and abstraction [18]. Considering the computational complexity of the network, the depth also affects the running time of training: Res101 generally takes more time than Res24, while the maximum batch size of Res24 can be set to 16 and its GPU memory requirements are lower.

(2) On the other hand, for the same base network, the input image size has only a small effect on the image quality evaluation, although there is a definite improvement. The proportions of the classification targets in the training dataset are unbalanced: some categories occupy a large proportion and others a small one. When classifying an image by moving window, each window should contain as many categories as possible, and a 256 * 256 image is less likely to contain a given classification target than a 512 * 512 image. Moreover, when the large image is divided for prediction, a larger crop size means fewer cuts and a lower error rate in recognizing the cut edges, which affects the final result.

4.4.4 Small Sample Impact on the Network

To explore the deviation of the model's predictions when training samples are insufficient, the influence of network depth on the classification decision is tested with few samples. The number of training images is reduced to 4 and to 2, giving 30,000 and 10,000 samples of size 512 * 512, respectively. Res24 and Res101 were each trained, and the test images were predicted to obtain Table 3.

Table 3. Performance evaluation of Res24 and Res101

When the number of samples is reduced, the performance indicators decline in every aspect. With insufficient training samples, the model misidentifies classification targets that differ within a class or are similar between classes.

(1) With 10,000 training samples, classification quality declines significantly and many categories are not classified at all.

(2) With 30,000 training samples, the results approach the values obtained from training on the full dataset, and the basic classification effect is reached.

From the table, the gap between the performance values of Res24 and Res101 is small. Judged by running speed alone, Res24 is the more suitable choice. With limited samples, the lightweight neural network offers a better balance between running-time efficiency and classification accuracy.

4.4.5 Analysis of Different Time-Phase Data

In addition, to verify the generalization ability of the model, images of the Chenzhou area from a different time phase are selected for verification, mainly to check whether the model has learned the image category information from different data samples. The verification images from different phases in other parts of Chenzhou are heterogeneous data: the classification targets differ greatly between the classes, and the distributions of the training and verification images are inconsistent, so the forests, buildings, and roads seen in training and in verification are dissimilar. Moreover, the amount of data for each classification target is unbalanced. The base network is Res101 and the input image size is 512 * 512.

From Table 4, the evaluation index MPA for the different phase in other regions is higher than for the test image, and the PA is only slightly lower. The model is therefore general in its classification of GF-2 images of the Chenzhou area and has good generalization ability.

Table 4. Performance indicators for different phases in other parts of Chenzhou

5 Conclusions

In this experiment, the improved DeepLabv3 network is tested on image classification of GF-2 imagery of the Chenzhou area. First, the original remote sensing image is pre-processed and labeled to form the dataset. Next, the training dataset is cropped, augmented, and used for training. Finally, the test image is predicted, and the performance of the result is analyzed numerically with the semantic segmentation evaluation metrics.

By comparing the classification results under different model parameters, the Res101 network with an input image size of 512 * 512 is found to classify best: the larger the training size, the fewer segments the image is divided into, which reduces edge-recognition errors and provides sufficient classification targets per window. On the other hand, a shallow residual network converges easily but falls into a local optimum and cannot reach the best result, while a deeper residual network extracts more abstract features and captures more image details, which benefits the optimization of the classification result. The impact of network depth with insufficient samples is also explored: when training samples are few, the residual network easily under-fits, and misjudgments occur for similar pixel and texture features between classes.

The model is further verified by adding images of a different time phase, testing whether the classification effect degrades due to factors such as intra-image differences and pixel differences, thereby verifying its generalization ability. Unfortunately, the model is not sensitive enough to slender objects in the image, such as roads and streams, which are not fully recognized. Future work will focus on improving the recognition of slender objects.