1 Introduction

Remote sensing images play a vital role in many practical applications [4], such as object detection [36], environment monitoring [5], and resource exploration [37]. Since remote sensing images pack large amounts of information about the earth into a small spatial extent, spatial resolution is one of the key factors of image quality. The goal of super-resolution (SR) is to restore the latent high-resolution (HR) image from a low-resolution (LR) observation. With the rapid development of deep learning (DL), SRCNN [6] first applied convolutional neural networks (CNNs) to the SR task, and many subsequent works have designed various network architectures to address the SR problem. Although DL-based remote sensing image SR methods have achieved superior performance, they still have many limitations that need to be solved. Specifically, the general degradation process of an image can be described as:

$$\begin{aligned} I_{\text {LR}}=(k\otimes {I_{\text {HR}}})\downarrow _{s} + n \end{aligned}$$
(1)

where \(I_{\text {HR}}\) represents the real HR image, \(I_{\text {LR}}\) represents the generated synthetic LR image, \(\downarrow _{s}\) denotes the downsampling operation with scale factor s, k and \(\otimes\) denote the blur kernel and the convolution operation, and n is typically additive white Gaussian noise (AWGN). However, most existing DL-based methods [11, 19, 39, 41] assume a simplified degradation model (e.g., bicubic downsampling) that deviates from the degradation process of real images. Therefore, it is challenging to directly apply traditional DL-based methods to the SR of real remote sensing images.
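To make the degradation model concrete, the following minimal PyTorch sketch synthesizes an LR image according to Eq. (1); the function name, the reflective padding, and the direct s-fold subsampling are illustrative assumptions rather than the exact pipeline used later in this paper.

```python
import torch
import torch.nn.functional as F

def degrade(hr, kernel, scale=4, noise_sigma=0.01):
    """Eq. (1): blur the HR image with k, downsample by s, and add AWGN.
    hr: (1, C, H, W) float tensor; kernel: (kh, kw) float tensor."""
    c = hr.shape[1]
    weight = kernel.view(1, 1, *kernel.shape).repeat(c, 1, 1, 1)   # one copy of k per channel
    pad = kernel.shape[-1] // 2
    blurred = F.conv2d(F.pad(hr, [pad] * 4, mode="reflect"), weight, groups=c)
    lr = blurred[..., ::scale, ::scale]                            # direct s-fold subsampling
    return lr + noise_sigma * torch.randn_like(lr)                 # additive white Gaussian noise
```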

Generative adversarial networks (GANs) consist of a generator (G) and a discriminator (D) and aim to generate samples with the same distribution as the input data [3]. Specifically, G learns to transform the LR image distribution into the HR image distribution, while D evaluates whether the generated distribution matches the HR distribution. Following SRGAN [15], a range of GAN-based approaches [10, 16, 28, 42] emerged to improve SR performance. However, with a GAN there are many mappings from LR images to HR images that can produce the same distribution. Because image SR is highly under-constrained in pixel space, using a GAN to directly translate real-world LR images into the corresponding HR images is challenging, which limits the further application of these methods [46].

In general, the degradation model adopted in CNN-based algorithms departs from real-world degradation, and it is also difficult to directly transform LR images into the corresponding HR images with a GAN. Therefore, we present a practical multi-degradation SR method that combines supervised CNN-based and unsupervised GAN-based methods. Specifically, instead of transforming the LR image into the latent HR image, we estimate blur kernels from LR images, which is a much more tractable alternative. These blur kernels are then used to synthesize LR images, which in turn train our CNN-based SR network. Our method therefore consists of two stages, as indicated in Fig. 1. In the first stage, we estimate real blur kernels from real remote sensing images with a GAN-based model. To increase the diversity of the obtained blur kernels, we then train another GAN-based network to learn the distribution of the extracted kernels and generate a kernel pool with sufficiently diverse blur kernels. In the second stage, we train a deep CNN for SR with HR images and the corresponding LR images degraded by the generated kernels. The main contributions of this paper are as follows:

1. A GAN-based network is proposed to estimate the blur kernel using only the internal information of the real LR image.

2. We propose a novel network architecture for SR of remote sensing images. The densely connected attention mechanism (DCM) is introduced to make full use of the image feature information and improve the performance of the network with a limited number of network parameters and low complexity.

3. Extensive experiments demonstrate that the proposed method outperforms most current SR methods when dealing with LR images under multiple degradation patterns, while achieving a favorable balance between the number of network parameters and network performance.

Fig. 1
figure 1

Framework of the proposed SR method

Fig. 2
figure 2

Framework of the proposed enhanced kernel estimation method

2 Related work

2.1 CNN-based image super-resolution

Image SR algorithms are generally categorized into traditional and deep learning (DL)-based methods. DL-based methods learn the mapping from LR images to the corresponding HR images from a specific training dataset. Since SRCNN [6] was first proposed by Dong et al., many DL-based methods, including EDSR [19], DBPN [11], RNAN [41] and RCAN [39], have significantly improved the performance of image SR. Furthermore, many effective network architectures, such as residual learning strategies [20], have enhanced the network's learning ability and convergence, contributing to superior network models. However, these methods can only handle a single degradation type. For instance, SRCNN, EDSR, and RCAN assume that the image degradation is bicubic down-sampling; when the actual degradation pattern does not match the predefined one, their performance drops severely. Blind SR methods, which can handle different degradation types with a single model, have therefore attracted much attention. Some methods, such as DPSR [43], USRNet [44], IKC [8] and SRMD [40], integrate degradation parameter estimation and image SR to improve robustness. However, SRMD can only handle Gaussian blurring with bicubic down-sampling, and USRNet estimates the blur kernel in a traditional way, which limits its real-time applicability. Mei et al. [23] proposed a Non-Local Sparse Attention (NLSA) module with a dynamic sparse attention pattern and non-local (NL) operations. Xing et al. [34] presented an end-to-end solution for joint demosaicing, denoising, and SR with specially designed deep CNNs. Although these methods achieve promising results on synthetic LR images, most of them rely on the accuracy of degradation parameter estimation, which limits their ability to process real LR images.

2.2 GAN-based image super-resolution

GANs consist of a generator and a discriminator that are trained in a “max–min” manner to produce samples with the same distribution as the target data, and they have been widely applied to SR tasks [17, 26, 29]. SRGAN [17] improved subjective visual quality for the bicubic degradation type by introducing a perceptual loss and adopting an adversarial training strategy. ESRGAN [29] significantly improved SRGAN in terms of the loss function, network structure, and other factors, resulting in superior performance. However, training GANs is unstable [12]. DCGAN [38] therefore proposed useful guidelines for building and training GAN networks. Subsequently, WGAN [1] and WGAN-GP [9] further improved GAN training by alleviating the difficulty of maintaining the balance between the generative and discriminative networks and simplifying network design. [16] developed an unpaired SR method based on GAN with an unpaired kernel correction network and a pseudo-paired SR network that outperforms most current unpaired SR models. Wang et al. [30] presented an unsupervised degradation representation learning approach. However, it is still challenging to directly translate real-world LR images into the corresponding HR images via GAN [46].

2.3 Remote sensing images for super-resolution

In comparison to natural photographs, remote sensing images have richer texture details, larger spatial dimensions, and more diverse ground elements. Although SR algorithms for natural images have achieved promising performance, they cannot be directly applied to remote sensing images. In addition, the lack of sufficient remote sensing image datasets limits the further application of DL-based SR methods. Therefore, many works on remote sensing image SR have emerged. Specifically, Ducournau et al. [7] and Ma et al. [21] fine-tuned the SRCNN [6] and SRGAN [25] models on several remote sensing image datasets to achieve SR of remote sensing images. RDBPN [24] addressed median-scale and large-scale factors to improve the resolution of RGB remote sensing images. Ma et al. [22] realized remote sensing image SR by incorporating the wavelet transform and a recursive ResNet. Lei et al. [18] specially designed the discriminator to address the “discrimination-ambiguity” problem. Zhang et al. [45] presented a multi-scale attention network (MSAN) to capture the multi-level features of remote sensing images. However, most of these existing methods are only suitable for a single degradation and cannot be directly extended to the SR of multi-degradation remote sensing images.

3 Methodology

3.1 Enhanced kernel estimation method for blind SR

The goal of the kernel estimation method is to extract the latent blur kernel k from the input LR image. Many kernel estimation methods train a predictor to estimate the latent blur kernel from the input LR image, as shown in Eq. (2):

$$\begin{aligned} \begin{aligned}&\theta _{F} = {\text {arg}}\mathop {min}\limits _{\theta _{F}}||k-F(I_{\text {LR}};\theta _{F})||^{2}_{2} \\&{\hat{k}} = F(I_{\text {LR}};\theta _{F}) \end{aligned} \end{aligned}$$
(2)

where F and \(\theta _{F}\) denote the predictor and its parameters, respectively. k denotes the latent blur kernel, which is usually predefined as an isotropic or anisotropic Gaussian kernel, \({\hat{k}}\) is the blur kernel estimated by F, and \(I_{\text {LR}}\) denotes the observed LR image. However, most existing methods still cannot solve the multi-degradation remote sensing SR problem, because the real-world blur kernel k is usually unknown and more complex than the predefined one.

To this end, we propose the enhanced kernel estimation method, which directly obtains the latent blur kernel of the LR input by modeling the degradation process with a GAN; the network is shown in Fig. 2. The generator thereby implicitly records the degradation process of the real LR image. In detail, patches of the HR image are fed into the generator (\(G_1\)) to generate patches with the same distribution as patches of the LR image; note that the HR and LR images are unpaired. The discriminator (\(D_1\)) then tries to determine whether the distribution generated by \(G_1\) matches that of the LR image patches. The framework is optimized with a “max–min” strategy to solve the adversarial problem:

$$\begin{aligned} \begin{aligned} {{\text {arg}} \mathop {min}\limits _{G_{1}} \mathop {max}\limits _{D_{1}}{{\mathbb {E}}[D_{1}(P_{\text {LR}})] - {\mathbb {E}}[D_{1}(G_{1}(P_{\text {HR}}))]+R}} \end{aligned} \end{aligned}$$
(3)

where \(G_{1}\) and \(D_{1}\) denote the generator and the discriminator, respectively, and \(P_{\text {LR}}\) and \(P_{\text {HR}}\) are patches selected from \(I_{\text {LR}}\) and \(I_{\text {HR}}\), respectively. As shown in Eq. (4), R is the same regularization term as in [2]:

$$\begin{aligned} R=\alpha \times {L_{\text {sum}}}+\beta \times {L_{\text {b}}}+\gamma \times {L_{\text {sparse}}}+\delta \times {L_{\text {c}}} \end{aligned}$$
(4)

where we empirically set \(\alpha\)=0.5, \(\beta\)=0.5, \(\gamma\)=0.5, \(\delta\)=1, and:

$$\begin{aligned} \begin{aligned}&L_{\text {sum}}=\left| 1-\sum \limits _{i,j}k_{i,j}\right| \\&L_{\text {b}}=\sum \limits _{i,j} {\left| k_{i,j}\times {m_{i,j}}\right| }\\&L_{\text {sparse}}=\sum \limits _{i,j} {\left| k_{i,j}\right| ^{1/2}}\\&L_{\text {c}}=\left\| (x_{0},y_{0})-\dfrac{\sum \nolimits _{i,j} k_{i,j}\times {(i,j)}}{\sum \nolimits _{i,j} k_{i,j}}\right\| _{2} \end{aligned} \end{aligned}$$
(5)

where \(L_{\text {sum}}\) encourages the sum of k to approach 1; \(L_{\text {b}}\) penalizes values close to the boundaries, with m a constant mask of weights growing with distance from the center of k; \(L_{\text {sparse}}\) is a sparsity loss that prevents the network from generating over-smooth kernels; and \(L_{\text {c}}\) ensures that the center of mass of the generated kernel lies at the center of the output, where \((x_{0},y_{0})\) denotes the center index.
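For reference, the regularization term R of Eqs. (4)-(5) can be sketched in PyTorch as follows; the concrete form of the boundary mask m (Euclidean distance from the kernel center) is an assumption, since the paper only states that its weights grow with that distance.

```python
import torch

def kernel_regularization(k, alpha=0.5, beta=0.5, gamma=0.5, delta=1.0):
    """Sketch of R in Eqs. (4)-(5) for a single (ks, ks) kernel k produced by G1."""
    ks = k.shape[-1]
    idx = torch.arange(ks, dtype=k.dtype, device=k.device)
    yy, xx = idx.view(-1, 1).expand(ks, ks), idx.view(1, -1).expand(ks, ks)
    y0 = x0 = (ks - 1) / 2.0                                 # kernel center (x0, y0)

    l_sum = (1.0 - k.sum()).abs()                            # the kernel should sum to 1
    m = ((yy - y0) ** 2 + (xx - x0) ** 2).sqrt()             # weights grow toward the borders
    l_b = (k * m).abs().sum()                                # penalize mass near the boundaries
    l_sparse = k.abs().pow(0.5).sum()                        # discourage over-smooth kernels
    com_y = (k * yy).sum() / k.sum()                         # center of mass of the kernel
    com_x = (k * xx).sum() / k.sum()
    l_c = ((y0 - com_y) ** 2 + (x0 - com_x) ** 2).sqrt()     # keep the kernel centered
    return alpha * l_sum + beta * l_b + gamma * l_sparse + delta * l_c
```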

The network architectures of \(G_{1}\) and \(D_{1}\) are shown in Fig. 2. In the image generation stage, the patches of \(I_{\text {HR}}\) are fed into a Conv-group to extract image features. These features are then passed through three consecutive Res-blocks, each consisting of “Conv3 \(\times\) 3-ReLU-Conv3 \(\times\) 3”, to learn a representation of the degradation process. In the image identification stage, the patches of \(I_{\text {LR}}\) and the generated image patches are fed into \(D_{1}\), respectively. Moreover, since the instance normalization (IN) layer is more effective for training GANs, the network of \(D_{1}\) consists of several blocks composed of “Conv4 \(\times\) 4-IN-Conv4 \(\times\) 4” layers.

Although \(G_{1}\) only implicitly learns the real degradation process for a specific LR image, we can obtain a compact blur kernel by convolving the filters of the generator \(G_{1}\). Specifically, we iteratively convolve the initial blur kernel \(P_0\) (all values set to 1) with the filters in the generator \(G_1\), and finally obtain the corresponding compact blur kernel through some post-processing operations. After that, we build a kernel pool \(K=\left\{ k_{1},\ldots ,k_{n}\right\}\), where \(k_{n}\) denotes the \(n_{\text {th}}\) kernel estimated from real remote sensing images.
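A hedged sketch of this kernel extraction step is given below: it chains the Conv2d filters of G1 starting from the all-ones kernel P0 and applies simple post-processing. Collapsing the channel dimensions by summation and the clip/crop/normalize post-processing are assumptions made for illustration, not the exact procedure of the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def extract_compact_kernel(g1, kernel_size=13):
    """Collapse the convolutional filters of G1 into one compact blur kernel."""
    with torch.no_grad():
        k = torch.ones(1, 1, 1, 1)                                # initial kernel P0
        for m in g1.modules():
            if isinstance(m, nn.Conv2d):
                w = m.weight.sum(dim=(0, 1), keepdim=True)        # reduce filters to a single 2-D kernel
                k = F.conv2d(k, w, padding=w.shape[-1] - 1)       # chain the filters by convolution
        k = k.squeeze().clamp(min=0)                              # post-processing: drop negative values
        h, w_ = k.shape
        top, left = max(0, (h - kernel_size) // 2), max(0, (w_ - kernel_size) // 2)
        k = k[top:top + kernel_size, left:left + kernel_size]     # center-crop to a compact size
        return k / k.sum()                                        # normalize the kernel to sum to 1
```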

In this section, we propose the enhanced kernel estimation method based on KernelGAN [2]. However, there are many differences between KernelGAN and our proposed method. Specifically, KernelGAN is an image-specific GAN that trains solely on the LR image and implicitly learns the downscaling kernel from that specific input. Its underlying assumption, that further down-sampling the LR image preserves the same patch distribution as the LR image itself, becomes implausible when the degradation of the LR image is severe, and in practical scenarios the degradation of real remote sensing images is indeed more severe. Therefore, we implicitly learn the degradation process from real HR images to real LR images through the generator, which can better handle various scenarios. Furthermore, to improve the accuracy of kernel estimation, we change the generator from a simple linear structure to a residual network structure, and the discriminator is based on PatchGAN. The ablation studies in Sect. 4.5 also demonstrate that the performance of the proposed kernel estimation method is superior to that of KernelGAN under multiple degradation settings.

3.2 Kernel augmentation with GAN

Because extracting implicit kernels from LR inputs is time-consuming, the diversity and quantity of collected kernels are limited, which decreases the performance of the proposed SR network. To obtain enough kernels, we design a kernel augmentation network that generates large numbers of blur kernels. All of the generated blur kernels form the set \(K^{'}\), and the final kernel pool is obtained as \(K^{+}=K \cup K^{'}\). The objective function of this GAN is as follows:

$$\begin{aligned} \begin{aligned} L={\mathbb {E}}[D_{2}(f_{g})-D_{2}(f)] +\lambda {\mathbb {E}}[(||\nabla D_{2}(f_{s})||_{2}-1)^2] \end{aligned} \end{aligned}$$
(6)

where \(D_{2}\) is the discriminative network, \(f_{g}\) and f denote the blur kernels synthesized by \(G_2\) and the blur kernels selected from K, respectively, \(f_s\) is sampled from the distribution interpolated between f and \(f_g\), and \(\lambda\) is the regularization factor weighting the gradient penalty term.
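A minimal sketch of this critic objective, written in the WGAN-GP style of [9], is shown below; the latent input z, the batch layout of the kernels, and the default weight lam=10 are assumptions.

```python
import torch

def d2_objective(d2, g2, real_kernels, z, lam=10.0):
    """Sketch of Eq. (6): Wasserstein critic loss with a gradient penalty.
    real_kernels: (B, 1, ks, ks) samples from the kernel pool K."""
    f_g = g2(z).detach()                                            # kernels synthesized by G2
    eps = torch.rand(real_kernels.size(0), 1, 1, 1, device=real_kernels.device)
    f_s = (eps * real_kernels + (1 - eps) * f_g).requires_grad_(True)
    grad = torch.autograd.grad(d2(f_s).sum(), f_s, create_graph=True)[0]
    penalty = ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()  # (||grad D2(f_s)|| - 1)^2
    return (d2(f_g) - d2(real_kernels)).mean() + lam * penalty
```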

As shown in Fig. 3, we adopt the same network architecture as [9] for \(G_{2}\), and the architecture of \(D_{2}\) is the same as that of \(D_{1}\) (shown in Fig. 2).

Fig. 3
figure 3

Network architecture of \(G_2\)

3.3 Deeply connected multi-scale feature attention network

We build paired datasets by degrading HR images with blur kernels randomly selected from \(K^{+}\), and propose a novel SR network that can handle multiple degradations, called DCMNet.

Fig. 4
figure 4

Framework of our proposed DCMNet consisting of n MFABs

3.3.1 Overall framework

As shown in Fig. 4, the deeply connected multi-scale feature attention network (DCMNet) consists of multi-scale feature attention estimation blocks (MFABs) and a densely connected attention mechanism (DCM). The goal of DCMNet is to obtain a mapping function that estimates the latent HR image from the corresponding LR input:

$$\begin{aligned} I_{\text {HR}} = G_{3}(I_{\text {LR}};\theta _{G_{3}}) \end{aligned}$$
(7)

where \(I_{\text {LR}}\) and \(I_{\text {HR}}\) stand for the input LR image and the output of our model, respectively, \(G_{3}\) denotes the DCMNet network, and \(\theta _{G_{3}}\) denotes the parameters of \(G_{3}\). DCMNet adopts an iterative architecture that increases the resolution progressively.

Specifically, at the beginning of DCMNet, a 3\(\times\)3 convolutional layer extracts features from the input image. Then, n MFABs extract multi-scale features and iteratively estimate high-frequency details. Each MFAB consists of multi-scale feature extraction blocks (MFEBs) and a channel-spatial attention module. Moreover, we adopt the DCM for the attention module, which hierarchically transfers attention information from low-level to high-level features and takes full advantage of the deep network, improving performance with limited network parameters and inference time. Finally, several up-sampling blocks up-sample the input to a higher resolution by a scaling factor of s; for 2-time and 4-time SR, the number of up-sampling blocks m is 1 and 2, respectively.
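The overall pipeline can be summarized by the following hedged PyTorch skeleton; the MFAB constructor `mfab` is a placeholder, and the channel width and pixel-shuffle up-sampling are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DCMNetSketch(nn.Module):
    """Skeleton of Fig. 4: a 3x3 head convolution, n MFABs that share a densely
    connected attention state, and m pixel-shuffle up-sampling blocks
    (m = 1 for x2 SR, m = 2 for x4 SR)."""
    def __init__(self, mfab, n_blocks=9, scale=4, channels=64):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        self.blocks = nn.ModuleList(mfab(channels) for _ in range(n_blocks))
        ups = []
        for _ in range(scale // 2):
            ups += [nn.Conv2d(channels, 4 * channels, 3, padding=1), nn.PixelShuffle(2)]
        self.upsample = nn.Sequential(*ups)
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, lr):
        x, att = self.head(lr), None
        for block in self.blocks:
            x, att = block(x, att)          # DCM: each MFAB receives the previous attention map
        return self.tail(self.upsample(x))
```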

3.3.2 Densely connected attention mechanism

As shown in Fig. 4, attention over both the spatial and channel dimensions is adopted to model the inter-dependencies of local regions in the SR model. The channel attention can be written as:

$$\begin{aligned} \begin{aligned} M_{\text {c}}(F)&= \sigma ({\text {FC}}({\text {Avg}}(F))+{\text {FC}}({\text {Max}}(F))) \\&= \sigma (W_{1}(W_{0}(F_{\text {avg}}^{\text {c}}))+W_{1}(W_{0}(F_{\text {max}}^{\text {c}}))) \end{aligned} \end{aligned}$$
(8)

where F denotes the image feature maps, FC represents the fully connected network, \(W_{0} \in R^{C/r\times {C}\times {1}\times {1}}\) and \(W_{1} \in R^{C\times {C}/r\times {1}\times {1}}\) are the weights of the fully connected layers, and r is the reduction ratio. \(\sigma\) is the sigmoid function, and Avg and Max denote the average-pooling and max-pooling operators, respectively. \(F_{\text {avg}}^{\text {c}}\) and \(F_{\text {max}}^{\text {c}}\) represent the feature maps after channel-wise average-pooling and max-pooling operations. Similarly, the spatial attention is computed as follows:

$$\begin{aligned} \begin{aligned} M_{\text {s}}(F)&= \sigma (f({\text {Avg}}(F);{\text {Max}}(F))) \\&= \sigma (f(F_{\text {avg}}^{\text {s}};F_{\text {max}}^{\text {s}})) \end{aligned} \end{aligned}$$
(9)

where \(\sigma\) denotes the sigmoid function, f stands for a convolution filter of size 1 \(\times\) 1, and \(F_{\text {avg}}^{\text {s}}\) and \(F_{\text {max}}^{\text {s}}\) denote the feature maps after spatial average-pooling and max-pooling operators, respectively.
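The two attention steps of Eqs. (8)-(9) can be sketched as the module below; the reduction ratio r=16 is an assumed default, and the module follows the stated formulas (a shared MLP over pooled channel descriptors, then a 1x1 convolution over the concatenated spatial maps).

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Sketch of the channel (Eq. 8) and spatial (Eq. 9) attention used in MFAB."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // r),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(channels // r, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, f):
        b, c, _, _ = f.shape
        # Eq. (8): shared MLP over channel-wise average- and max-pooled features
        m_c = torch.sigmoid(self.mlp(f.mean(dim=(2, 3))) + self.mlp(f.amax(dim=(2, 3))))
        f = f * m_c.view(b, c, 1, 1)
        # Eq. (9): 1x1 convolution over concatenated spatial average- and max-maps
        m_s = torch.sigmoid(self.spatial(torch.cat(
            [f.mean(dim=1, keepdim=True), f.amax(dim=1, keepdim=True)], dim=1)))
        return f * m_s
```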

The work in [31] may appear similar to ours, but many existing works only consider the attention information within each block, neglecting the relationship between any two positions in the feature maps and leaving network parameters and performance unbalanced. Most existing attention blocks can be modeled as:

$$\begin{aligned} X^{i+1}=\sigma (M^{i})\odot {X^{i}} \end{aligned}$$
(10)

where \(X^{i}\) denotes the input feature map of the \(i_{\text {th}}\) block, \(M^{i}\) is the output of the channel or spatial attention module, and \(\odot\) is the element-wise product.

It is well known that blindly increasing the depth of the network alone cannot boost performance, because information from earlier layers is gradually lost as the network deepens, which limits its performance. To prevent this, we introduce a densely connected attention mechanism (DCM) that transfers the former attention information to the present block:

$$\begin{aligned} X^{i+1}=\sigma (f(\eta M^{i-1}, \xi M^{i}))\odot {X^{i}} \end{aligned}$$
(11)

where \(f(\cdot )\) denotes the connection function, and \(\eta\) and \(\xi\) are learnable parameters. We adopt a direct connection scheme, so the connection function \(f(\cdot )\) can be written as:

$$\begin{aligned} f(\eta M^{i-1}, \xi M^{i})=\eta M^{i-1}+\xi M^{i} \end{aligned}$$
(12)

The densely connected mechanism preserves the network's representational ability with limited parameters and computation by increasing the flow of feature information within the network.
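The following sketch shows how the DCM update of Eqs. (11)-(12) can be realized; here `m_cur` and `m_prev` are the pre-activation attention maps of the current and previous blocks, and the initial values of eta and xi are assumptions.

```python
import torch
import torch.nn as nn

class DenselyConnectedAttention(nn.Module):
    """Sketch of Eqs. (11)-(12): fuse the previous block's attention map with the
    current one through learnable weights before gating the features."""
    def __init__(self):
        super().__init__()
        self.eta = nn.Parameter(torch.tensor(0.5))   # weight for the previous attention map
        self.xi = nn.Parameter(torch.tensor(0.5))    # weight for the current attention map

    def forward(self, x, m_cur, m_prev=None):
        if m_prev is None:                               # first block: nothing to carry over
            fused = m_cur
        else:
            fused = self.eta * m_prev + self.xi * m_cur  # direct connection, Eq. (12)
        return torch.sigmoid(fused) * x, fused           # Eq. (11), plus the state for the next block
```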

Fig. 5
figure 5

Structure of MFEB

3.3.3 Multi-scale feature extraction block

A larger receptive field eases the extraction of global features of the input image, while a smaller receptive field extracts more local and detailed features. Therefore, we adopt the MFEB, which contains filters of different sizes, to adaptively detect image features at different scales; the resulting features are then fused via 1 \(\times\) 1 convolution filters, as shown in Fig. 5. However, a larger convolution kernel increases the computational complexity, which is not conducive to the convergence of the network. To this end, we introduce dilated convolutions, which enlarge the receptive field without changing the number of parameters. Specifically, the dilation factors of the Conv 5\(\times\)5, Conv 7\(\times\)7, and Conv 9\(\times\)9 branches are set to 2, 3, and 4, respectively.
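Our reading of this design is sketched below: each nominal Conv 5x5, 7x7, and 9x9 branch is realized as a dilated 3x3 convolution (dilations 2, 3, 4) so the receptive field grows without extra parameters, and the branch outputs are fused by a 1x1 convolution. The branch widths and the plain concatenation are assumptions.

```python
import torch
import torch.nn as nn

class MFEB(nn.Module):
    """Sketch of the multi-scale feature extraction block of Fig. 5."""
    def __init__(self, channels=64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),                 # local details
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2),     # ~5x5 receptive field
            nn.Conv2d(channels, channels, 3, padding=3, dilation=3),     # ~7x7 receptive field
            nn.Conv2d(channels, channels, 3, padding=4, dilation=4),     # ~9x9 receptive field
        ])
        self.fuse = nn.Conv2d(4 * channels, channels, kernel_size=1)     # 1x1 fusion of all scales

    def forward(self, x):
        return self.fuse(torch.cat([branch(x) for branch in self.branches], dim=1))
```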

3.4 Training with adversarial learning strategy

3.4.1 Discriminator network

We adopt a perceptual loss and the generative adversarial training strategy to obtain better visual quality. In detail, we use the proposed DCMNet as the generator \(G_{3}\) and introduce a discriminator \(D_{3}\), which contains eight convolutional layers with an increasing number of 3\(\times\)3 convolution filters; the number of filters increases by a factor of 2 from 64 to 512. Finally, the feature maps are fed into dense layers followed by a sigmoid activation to obtain the probability used for sample classification. \(D_3\) is based on PatchGAN, which distinguishes patches of the input image, whereas a traditional discriminator judges the whole input image at once. This strategy exploits the local self-similarity of the image and discriminates the input many times, leading to more accurate results; for more details, please refer to [25]. The discriminator \(D_{3}\) and the generator \(G_{3}\) are optimized in the “min–max” manner:

$$\begin{aligned} \begin{aligned} {\text {arg}}\mathop {min}\limits _{G_{3}}\mathop {max}\limits _{D_{3}} {\mathbb {E}}[logD_{3}(I_{\text {HR}})+log(1-D_{3}(G_{3}(I_{\text {LR}})))] \end{aligned} \end{aligned}$$
(13)

where \(I_{\text {HR}}\) and \(I_{\text {LR}}\) denote real image patches from the HR and LR inputs, respectively.
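The discriminator described above can be sketched as follows; the stride pattern, LeakyReLU slopes, pooling before the dense layers, and the omission of the patch-level evaluation details are assumptions made for a compact illustration.

```python
import torch
import torch.nn as nn

def build_d3(in_channels=3):
    """Sketch of D3: eight 3x3 conv layers whose widths double from 64 to 512,
    followed by dense layers and a sigmoid that outputs a real/fake probability."""
    layers, c_in = [], in_channels
    for i, c_out in enumerate([64, 64, 128, 128, 256, 256, 512, 512]):
        stride = 2 if i % 2 else 1                     # halve the resolution every other layer
        layers += [nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1),
                   nn.LeakyReLU(0.2, inplace=True)]
        c_in = c_out
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(),
               nn.Linear(512, 1024), nn.LeakyReLU(0.2, inplace=True),
               nn.Linear(1024, 1), nn.Sigmoid()]       # probability that the input is a real HR patch
    return nn.Sequential(*layers)
```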

3.4.2 Loss function with relativistic adversarial

Under the adversarial learning framework, the final loss function for the generator is:

$$\begin{aligned} L_{\text {G}}=\alpha L_{\text {percep}}+\beta L_{\text {content}}+\gamma L_{\text {adv}} \end{aligned}$$
(14)

where \(\alpha\), \(\beta\), \(\gamma\) are weighting coefficients that balance the different loss terms. The perceptual loss, content loss, and adversarial loss are defined as follows:

$$\begin{aligned} \begin{aligned}&L_{\text {percep}}=\frac{1}{H_{j}W_{j}C_{j}}\sum |\Phi _{j}(G_{3}(I_{\text {LR}}))-\Phi _{j}(I_{\text {HR}})|\\&L_{\text {content}}=\frac{1}{{HWC}}\sum |G_{3}(I_{\text {LR}})-I_{\text {HR}}|\\&L_{\text {adv}}=\sum -logD_{\theta _{D_{3}}}(G_{\theta _{G_{3}}}(I_{\text {LR}})) \end{aligned} \end{aligned}$$
(15)

where \(H_{j}\), \(W_{j}\), \(C_{j}\) are the height, width, and number of channels of the extracted feature maps, respectively, and \(\Phi _{j}\) denotes the feature map obtained from the \(j_{\text {th}}\) convolutional layer of the VGG-19 network [27]. H, W, C are the height, width, and number of channels of the evaluated images. \(D_{\theta _{D_{3}}}(G_{\theta _{G_{3}}}(I_{\text {LR}}))\) is the probability that the reconstructed image \(G_{\theta _{G_{3}}}(I_{\text {LR}})\) is a ground-truth (GT) HR image.
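A compact sketch of the generator loss in Eqs. (14)-(15) is given below; `vgg_feat` (the j-th VGG-19 feature extractor) and `d3_prob_fake` (D3's probability for the SR output) are placeholders, and the default weights are illustrative.

```python
import torch

def generator_loss(sr, hr, d3_prob_fake, vgg_feat, alpha=0.2, beta=0.8, gamma=1e-3):
    """L_G = alpha * L_percep + beta * L_content + gamma * L_adv (Eq. 14)."""
    l_percep = (vgg_feat(sr) - vgg_feat(hr)).abs().mean()    # L1 distance in VGG feature space
    l_content = (sr - hr).abs().mean()                       # L1 distance in pixel space
    l_adv = -torch.log(d3_prob_fake + 1e-8).mean()           # adversarial term of Eq. (15)
    return alpha * l_percep + beta * l_content + gamma * l_adv
```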

4 Experiments

4.1 Datasets

We adopt three different remote sensing datasets, the AID dataset [32], the Dota dataset [33] and the UC-Merced dataset [35], to train one single universal model for variously degraded remote sensing images. Specifically, the AID dataset is an aerial scene dataset containing 30 aerial scene types such as airport, bare land, center, and industrial. The Dota dataset contains 2806 aerial images from different sensors and platforms, such as the JL-1 satellite, the GF-2 satellite, and Google Earth. The UC-Merced dataset consists of 21 classes of land-use images selected from aerial orthoimagery with a pixel resolution of 1 ft.

4.2 Training details and parameters

The AID dataset is used to train the enhanced kernel estimation GAN. In the SR stage, the Dota dataset is used as the training dataset, 100 images selected from the UC-Merced dataset are used as the validation dataset, and the rest of the UC-Merced dataset is used as the testing dataset. During the training of the enhanced kernel estimation and kernel augmentation GANs, we first estimate 1000 realistic blur kernels from 1000 images selected from the AID dataset; the network is trained for 3000 iterations, with 96\(\times\)96 image patches randomly cropped in each iteration. We adopt the Adam optimizer (\(\beta _{1}=0.9\), \(\beta _{2}=0.999\)) with an initial learning rate of \(2{\text {e}}^{-4}\), decayed by a factor of 0.1 every 100 iterations. For the SR reconstruction network, we use all the images of the Dota dataset to synthesize the corresponding LR images by blurring with blur kernels randomly selected from \(K^{+}\). We crop the input images into non-overlapping 600 \(\times\) 600 image patches to improve the SR results. The parameters of the loss function in Eq. (14) are \(\alpha =0.2\), \(\beta =0.8\), \(\gamma =1{\text {e}}^{-3}\). We implement the proposed model in the PyTorch framework with an NVIDIA 2080Ti GPU and an Intel Core i7 7700K CPU. All the networks are optimized by the Adam optimizer (\(\beta _{1}=0.9\), \(\beta _{2}=0.999\)) with a batch size of 16; the learning rate is initialized as \(1{\text {e}}^{-4}\) and decayed by a factor of 0.1 every 1000 iterations.

We use not only PSNR and SSIM as evaluation metrics, but also ERGAS and sCC, which are commonly used for remote sensing images, to evaluate the effectiveness of the proposed method more extensively. These metrics evaluate the quality of the reconstructed images in the pixel domain; the LPIPS metric is also adopted to compare the similarity of the images in terms of high-level characteristics. However, as described in Sect. 4.5.4, superior subjective visual quality of the reconstructed images does not necessarily coincide with higher objective metrics (e.g., PSNR, SSIM). In addition to the traditional bicubic up-sampling (BIC) method, several state-of-the-art SR networks are selected and evaluated at ratios 2 and 4, including RCAN [39], SRMD [40], USRNet [44] and RealSR [13]. For a comprehensive comparison, we adhere to the original code settings and pre-trained weights provided by the corresponding authors. For clarity, the \(\uparrow\) next to a metric indicates that a higher value means better image quality, whereas \(\downarrow\) indicates that a lower value means better image quality.

Fig. 6
figure 6

Five blur kernels for evaluating the proposed model. a, b are isotropic Gaussian blur kernels with standard deviations (std) of 1.6 and 2.2. c–e are blur kernels extracted from realistic remote sensing images

Table 1 Average results of the airplane class in the UC-Merced dataset on degradation with the blur kernel shown in Fig. 6c for scale factor 2
Table 2 Average results of the airplane class in the UC-Merced dataset on degradations with different blur kernels for scale factor 4
Fig. 7
figure 7

Qualitative results of several SR methods at a ratio of 4, where the top-left corner represents the blur kernel (shown in Fig. 6a)

Fig. 8
figure 8

Qualitative results of several SR methods at a ratio of 4, where the top-left corner represents the blur kernel (shown in Fig. 6d)

4.3 Results on synthetic images

For the multi-degradation setting, we adopt the 5 blur kernels shown in Fig. 6, where Fig. 6a, b are isotropic Gaussian blur kernels with different standard deviations (std), and Fig. 6c–e are real blur kernels extracted from the AID dataset. To comprehensively verify the effectiveness of our proposed network, we also train a CNN version of the model, called DCMNet-C; the corresponding GAN version is called DCMNet-G. The two models have different emphases: DCMNet-G is mainly used to obtain better perceptual performance (photo-realistic images), while DCMNet-C is more concerned with the objective evaluation metrics (e.g., PSNR and SSIM). Specifically, DCMNet-C is trained with only the classical MSE loss function, while DCMNet-G introduces the perceptual loss into the loss function and is trained in a generative adversarial manner with a discriminator. In the multi-degradation testing stage, we take the Dota and UC-Merced datasets as HR images and adopt the blur kernels shown in Fig. 6 to obtain the corresponding LR images. The best results are highlighted in bold.

Table 1 reports the test results of different SR models; our CNN version, DCMNet-C, achieves the best results compared with other CNN-based (non-GAN) models, such as SRMD [40], RCAN [39], and USRNet [44]. Though RealSR [13] achieves the best result on LPIPS, DCMNet-G performs better on the other metrics, indicating that DCMNet-G can improve the visual performance while maintaining the content of the image. As shown in Table 2, compared with both GAN-based and CNN-based SR methods on the airplane class of the UC-Merced dataset, our proposed method achieves the best results for different blur kernels. In the case of the standard Gaussian blur shown in Fig. 6a, b, USRNet [44] and RealSR [13] also achieve competitive results. However, for the other degradation types shown in Fig. 6c–e, both DCMNet-C and DCMNet-G achieve outstanding SR reconstruction performance, while the performance of the other methods drops severely. DCMNet-G outperforms DCMNet-C on the human-oriented metric LPIPS while keeping the other traditional metrics (PSNR, SSIM, etc.) close to those of DCMNet-C, which indicates that DCMNet-G better balances image content and human visual quality. With the blur kernel in Fig. 6c, DCMNet-G improves LPIPS by 0.034, while its PSNR is only 0.11 dB lower than that of DCMNet-C. Although RealSR achieves the best LPIPS result, both DCMNet-C and DCMNet-G perform significantly better on the other traditional measures (PSNR, SSIM, etc.), indicating that RealSR fails to maintain the image content; the visual results of RealSR (see Figs. 7 and 8) contain undesirable aberrations and artifacts. To further compare the stability of our proposed method, we calculate the standard deviation of PSNR for the blur kernel shown in Fig. 6d. The results of DCMNet-C and DCMNet-G are 24.88 ± 0.05 dB and 24.19 ± 0.16 dB, superior to the 22.08 ± 0.21 dB of USRNet, indicating that both DCMNet-C and DCMNet-G have superior stability.

With the ground-truth HR images presented as reference, the perceptual comparisons are shown in Figs. 7 and 8, which correspond to the blur kernels in Fig. 6a, d, respectively. The results indicate that our method has significant advantages for LR images degraded with a real blur kernel. The zoomed regions show that DCMNet-G restores image information while maintaining good details on images degraded by the Gaussian blur kernel. Although the representative model RealSR [13] achieves competitive results, its outputs contain many fake textures, whereas DCMNet-G recovers more details and achieves the best qualitative results. Compared with USRNet [44], DCMNet-G has lower PSNR, SSIM, and sCC values, but its outputs achieve better visual results.

4.4 Results on real images

In addition to the experiments on synthetic images, we also conduct experiments on real remote sensing images. To further verify the robustness of our proposed model, we randomly select real remote sensing images taken by aerial cameras of unknown models; the results are presented in Fig. 9. ESRGAN [28] is used as a representative GAN-based method for comparison. We find that DCMNet-G maintains the details of the GT images while generating less artificial noise than ESRGAN [28]. Figure 9 shows that, for SR of a real image, the reconstruction result of ESRGAN [28] has fine details but also introduces many artificial artifacts, which significantly reduces the quality of the result. In contrast, our model significantly improves the visual quality with less artificial noise, achieving excellent reconstruction results.

Fig. 9
figure 9

Visual results on a real remote sensing image obtained by an arbitrary aerial camera at a ratio of 4. The zoomed part of ESRGAN clearly has more fake texture than that of DCMNet-G

4.5 Ablation study

4.5.1 The enhanced kernel modeling method

We propose an enhanced GAN-based blur kernel estimation algorithm, which can extract the blur kernel of a real remote sensing image using only unpaired images. To verify its effectiveness, we first generate five different types of blur kernels and obtain LR images by degrading the corresponding HR images with these kernels. We then estimate kernels from these LR images using our estimation method and compare against another GAN-based kernel modeling method; the results are shown in Fig. 10. It can be seen that our method obtains better performance under multiple degradations.

Fig. 10
figure 10

Results of our kernel estimation algorithm. The top row shows the five ground-truth blur kernels. The results are compared with KernelGAN [2]

4.5.2 The number of MFABs

In our method, we propose the multi-scale feature extraction block (MFEB) to adaptively detect image features at different scales, and we compare the SR performance with different numbers of MFEBs. Table 3 shows the PSNR, SSIM, ERGAS, and LPIPS results of DCMNet-G with 6, 7, 8, 9, 10, and 11 MFEBs on the "Airplane" class selected from the UC-Merced dataset. It can be seen that as the number of MFEBs increases, ERGAS and LPIPS gradually decrease while PSNR and SSIM gradually increase, owing to the enhanced feature extraction ability of the network; the optimal performance is achieved with 9 MFEBs. With a further increase in the number of MFEBs, the model becomes more difficult to train, and the SR results do not improve significantly and may even drop slightly. Consequently, 9 MFEBs are used as the default setting for both DCMNet-G and DCMNet-C.

Table 3 SR results of the DCMNet-G model with different numbers of MFABs

4.5.3 The densely connected attention mechanism

To consider the inter-dependencies of local regions in the SR model, we introduce attention over the spatial and channel dimensions, respectively. Different from most existing attention mechanisms, we take the relationships of feature maps at any positions into consideration and introduce the DCM into our network design. Since the MFEB extracts multi-scale information from the feature maps and the proposed attention mechanism effectively recalibrates them, we combine the MFEB and DCM to achieve higher performance.

Table 4 SR results of models with different methods

We compare three variants of applying the MFEB and DCM in the SR network: MFEB only, DCM only, and both MFEB and DCM. As Table 4 shows, compared with the model with both MFEB and DCM, the model with only MFEB or only DCM achieves poorer performance, because neither makes full use of the feature-map and attention information simultaneously. In addition, to compare the ability of detail reconstruction, the HR images reconstructed by the different variants at the 4-time scale factor are shown in Fig. 11.

Fig. 11
figure 11

SR results of DCMNet in different ways of applying MFEB and DCM on the test image selected from Dota dataset at 4-time scale factor

4.5.4 The relationship between loss function and visual results

We compare the effects of different weights in Eq. (14) on the SR results. As shown in Fig. 12, when \(\alpha =1\), \(\beta =0\), we obtain a pure GAN-based model; the output has sharper edges and richer textures, but unpleasant artifacts are unfortunately generated. When \(\alpha =0\), \(\beta =1\), we obtain a pure PSNR-oriented CNN-based model, and the outputs are relatively blurry. We gradually adjust the weights of the perceptual loss and content loss and find that the output has fewer unpleasant artifacts while maintaining fine textures when \(\alpha =0.8\), \(\beta =0.2\). Therefore, \(\alpha =0.8\), \(\beta =0.2\) are used as the default setting of DCMNet-G.

Fig. 12
figure 12

Comparison among different weights of the perceptual loss and content loss

4.5.5 The comparison of computational complexity

The computational complexity of a neural network is typically described in terms of its multiply-add operations (MACs), which are closely related to the network's inference time; a greater MAC count indicates a longer running time and higher computational complexity. To ensure a fair comparison, we calculate the multi-adds by assuming the size of the HR image is 256 \(\times\) 256. Table 5 presents the network parameters, MACs, and inference time. As shown in Table 5, compared with SRMD, our proposed DCMNet-G has 2.33G fewer MACs, although the number of parameters and running time are similar. However, as shown in Tables 1 and 2, SRMD is inferior to DCMNet-G in terms of PSNR, SSIM, and other metrics. Specifically, as shown in Table 2, DCMNet-G achieves improvements of 2.51 dB in PSNR and 0.071 in SSIM under the degradation shown in Fig. 6c. Compared with other algorithms such as RealSR, DCMNet-G performs about 95% better in terms of the number of parameters, MACs, and running time; in addition, it achieves improvements of 3.02 dB in PSNR and 0.071 in SSIM under the degradation shown in Fig. 6c. Although EDSR [20], SRMD [40] and VDSR [14] have fewer network parameters than our DCMNet-G, their performance is significantly worse, and they can only handle specific degradation types, so they cannot be effectively generalized to the real remote sensing image SR task. DCMNet-G also outperforms USRNet [44], ESRGAN [28] and RealSR [13] with fewer parameters. Therefore, DCMNet-G obtains a superior trade-off between performance and parameters compared with other current methods.

Table 5 Real-time analysis for scale 2, 4

5 Conclusions

In this paper, a practical SR method for multi-degradation remote sensing images based on deep neural networks is proposed. In this method, we first estimate accurate blur kernels from real blurred images through a GAN-based deep learning method. Next, to increase the diversity of the estimated blur kernels, we learn their distribution and build a kernel pool consisting of various blur kernels. Finally, we select blur kernels from the kernel pool to degrade the HR remote sensing images and use the obtained HR/LR image pairs to train the SR network. The introduction of the densely connected mechanism and multi-scale feature extraction blocks guarantees the network's representational ability with limited parameters and complexity by increasing the flow of feature information within the network. To best balance the visual quality and the objective metrics, we introduce the perceptual loss in the training process and gradually adjust the loss weights. The experimental results show that the proposed method is effective in handling a wide range of degradation types and outperforms current methods in terms of network parameter count and computational effort. However, some research directions remain, such as introducing image degradation information into the SR reconstruction stage to further improve the performance, and investigating how to deploy the proposed algorithm on mobile devices.