1 Introduction

Timely and accurate population distribution data play an essential role in a wide range of critical applications, including but not limited to epidemiological studies (Li et al., 2020), natural disaster management (Chen, Wu, Jin et al., 2022), climate change analysis (O’Neill et al., 2020), ecological vulnerability assessment (He et al., 2018), emergency response (Tian et al., 2020), environmental monitoring (He et al., 2021), and sustainable urban planning (Long et al., 2022). In most scenarios, the population data used in these applications are derived from officially published census data. Although census data are collected through a rigorous process and are highly accurate, their low update frequency (5–10 years) provides too little temporal information for many real-world applications, which can increase analysis bias, unbalance resource allocation, and reduce the efficiency of government management (Song, 2019; Xu et al., 2021).

To address this limitation, scientists seek observation data that reflect human activity and provide richer spatial and temporal information for population mapping. Remote sensing data are therefore extensively used because they record continuous observations across time and space (Chen et al., 2020). The most popular remote sensing data include nighttime light (NTL) and optical satellite data (Wang et al., 2018). NTL data capture the footprint of light on the earth’s surface generated by human activities, and their brightness directly indicates the intensity of those activities, which supports the inference of population distribution (Song et al., 2020). NTL data from the US Air Force Defense Meteorological Satellite Program Operational Linescan System (DMSP/OLS), released in the early 1990s, are widely used for population mapping (Yu et al., 2018). However, their low spatial and radiometric resolutions, blooming effect, and over-saturation lead to underestimation in high-density areas (e.g., urban centers), making them applicable only to large-scale estimations (e.g., city or larger scales) (Elvidge et al., 1999). Another frequently used NTL dataset comes from the Visible Infrared Imaging Radiometer Suite (VIIRS) sensor carried by the weather satellite Suomi NPP. Given the more advanced low-light sensing technology of VIIRS, NPP/VIIRS NTL images have a higher spatial resolution and have been shown to perform better in population spatialization than their predecessor, DMSP/OLS (Wang et al., 2018).

Optical remote sensing images record reflection information of the earth’s surface in multiple wavebands. Based on the principle that different types of land cover have different population carrying capacities, optical remote sensing images with different resolutions are also widely employed for population spatialization (Wang et al., 2018). Some works also indicate that optical remote sensing images and derived land cover/use products could perform better than NTL data in population mapping (Zeng et al., 2011), and that a higher spatial resolution of optical data contributes to higher accuracy (Linard et al., 2011). The development of geographic information and remote sensing technology has increased the accessibility of numerous datasets. Therefore, rather than relying on a single type of dataset, most of the well-known population count data products are obtained by fusing multiple data sources, for example, WorldPop from the University of Southampton, UK (Stevens et al., 2015; Tatem, 2017), LandScan from Oak Ridge National Laboratory, US (Bright et al., 2016), Gridded Population of the World (GPW) from the Socioeconomic Data and Applications Center (SEDAC), NASA, US (Doxsey-Whitfield et al., 2015), and the Global Human Settlement Layer (GHSL) from the Joint Research Centre, European Commission (Freire et al., 2016). Despite the satisfactory spatial accuracy of these population data, short-term population distribution dynamics caused by intra- and inter-city human mobility are essential in various applications but still cannot be provided by these static datasets, which have an update frequency of 1 year or longer (Cheng et al., 2022).

Since 2010, the gradual popularization of smartphones, along with the rapid development of the mobile Internet triggered by 4G communication technology, has completely changed the way human behaviors are observed. Specifically, smart device users actively or passively upload their spatial-temporal information when they access various mobile Internet services and applications, and the collection of such a vast amount of information makes it possible to continuously observe people’s real-time spatial behavior at a large spatial scale (Song et al., 2019), which is also called “social sensing” (Liu et al., 2015). Social sensing big data have therefore attracted much attention from the scientific community and are gradually being adopted together with satellite data in population mapping to improve accuracy and enrich information in the temporal dimension. Widely used datasets include geotagged social media data (Patel et al., 2017), mobile phone data (Deville et al., 2014), point-of-interest (POI) data (Bakillah et al., 2014), and smart card data (Ma et al., 2017).

Besides the fusion of multiple data sources, estimation methods (regression models) that explore the relationship between population and various features also play a critical role in population mapping. By algorithm type, these models can be divided into three categories: (1) statistical models, (2) spatial statistics models, and (3) machine learning models. Statistical models are a powerful and widely used class, including the linear regression model (Bagan & Yamagata, 2015) and the log-linear regression model (Liu et al., 2018). To address the complex relationships between variables arising from spatial heterogeneity, many works employed local-model-based spatial statistics models, including geographically weighted regression (GWR) (Wang et al., 2018; Xu et al., 2021), geographically and temporally weighted regression (GTWR) (Liu et al., 2021), and the Bayesian spatio-temporal model (Wang et al., 2021). In the past few years, the boom of Artificial Intelligence (AI) has led to machine learning models being increasingly employed in population spatialization, where they play a vital role in improving estimation accuracy. These machine learning models include not only single models like random forest regression models (Cheng et al., 2022), but also ensemble models like XGBoost (Tu et al., 2022).

While many well-designed population spatialization frameworks have been proposed and have reported satisfactory accuracy, some new issues have emerged. (1) The fusion of an excessive number of data sources increases the difficulty of data acquisition, especially for studies that use non-public data, making these methods difficult for others to replicate. (2) Although spatial statistical models have good fit and prediction performance, their transferability is limited by their nature as local models, which means that such models may not be suitable for estimating non-modeled areas or may increase the uncertainty of the results. (3) For machine learning models, in addition to the cost of parameter tuning, selecting a suitable model is also a challenge. To alleviate these problems, this study proposes a population estimation framework based on publicly available spatial datasets and an automatic ensemble learning framework. The framework is used to estimate monthly county-level population densities and to reveal population distribution dynamics in China in 2015.

2 Materials and methods

Four main procedures are involved in mapping dynamic population density in this study: (1) data collection and feature extraction; (2) automatic ensemble learning model training and prediction; (3) accuracy assessment and comparison; (4) monthly population density mapping.

2.1 Data collection and feature extraction

We incorporated multiple categories of geospatial datasets (Table 1) for dynamic population density mapping, including demographics data, social sensing data, medium-resolution (MR) multispectral images, NTL data, land use data, and topography data. All the datasets except topography data were obtained for 2015, and details of each dataset and extracted features for model training are provided as follows.

Table 1 List of datasets and extracted features used in population density mapping

2.1.1 Demographics

County-level demographics data from the 1% population sample survey in 2015 and the corresponding administrative boundaries were collected, spatially combined, and adopted for both model training and validation. The county level in China is equivalent to level 3 of the Global Administrative Unit Layer (GAUL) as defined by the Food and Agriculture Organization of the United Nations (Chen, Wu, Song et al., 2022). A total of 2851 counties in China were included in this study, excluding records for Hong Kong, Macau, and Taiwan. We further calculated each county’s population density (persons/km2) from its total population and administrative area.

2.1.2 Tencent LBS data

Thanks to Tencent’s large number of active users, the LBS data from Tencent describe the digital footprints of human activities well (Chen, Song, Kwan, et al., 2018), and have been successfully applied in many fields such as population distribution mapping (Xu et al., 2021), land use classification (Gong et al., 2020), environmental exposure assessment (Chen, Song, Jiang, et al., 2018; Song et al., 2021), and human mobility mining (Zhu et al., 2018). We collected Tencent LBS data (https://heat.qq.com) generated in 2015 using the method introduced in Song et al. (2018). The raw data are tabulated as the number of location service requests with a spatial resolution of 30 arc-seconds and a 5-minute update frequency. We then transformed the tabular data into hourly aggregated raster data and generated the average hourly raster for each month in 2015. Finally, the monthly mean and sum of the LBS data were derived over each county as features.
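The temporal aggregation described above can be sketched for a single grid cell as follows. This is only an illustration, not the processing code used in the study: the record layout (timestamp, request count) and the function name are assumptions, and the sketch reads "hourly aggregated" as the sum of the 5-minute counts within each hour and "average hourly raster" as the mean of those hourly totals over a month.

```python
from collections import defaultdict
from datetime import datetime


def monthly_mean_hourly_requests(records):
    """Aggregate 5-minute LBS request counts for one grid cell.

    records: iterable of (datetime, count) pairs (hypothetical layout).
    Returns {(year, month): mean of the hourly request totals in that month}.
    """
    hourly = defaultdict(int)
    for ts, count in records:
        # Collapse each 5-minute record into its containing hour.
        hourly[(ts.year, ts.month, ts.day, ts.hour)] += count

    by_month = defaultdict(list)
    for (year, month, _day, _hour), total in hourly.items():
        by_month[(year, month)].append(total)

    # Average the hourly totals observed in each month.
    return {ym: sum(v) / len(v) for ym, v in by_month.items()}
```

Applying the same function to every cell would yield one monthly raster, from which county-level mean and sum features can then be derived.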

2.1.3 Landsat-8 OLI imagery

Launched in February 2013 as part of the long-term Landsat program led by the U.S. Geological Survey and NASA, Landsat-8 is designed to collect medium-resolution multispectral image data and provide seasonal coverage of the land surface (Roy et al., 2014). The Operational Land Imager (OLI) sensor carried by Landsat-8 has nine reflective wavelength bands, six of which (i.e., blue, green, red, NIR, SWIR-1, SWIR-2) are designed for land applications with a spatial resolution of 30 m and a 16-day repeat cycle (Zhang et al., 2018). Here we collected Landsat-8 imagery from January 1 to December 31, 2015, for model training. A pixel-based quality check was first performed to eliminate observations contaminated by clouds and shadows from the entire Landsat-8 archive, using cloud masking and quality assessment information from the Landsat-8 metadata. We then calculated the normalized difference vegetation index (NDVI = (NIR - Red) / (NIR + Red)) for each retained pixel. The whole-year maximum NDVI values were further selected and used as the quality index to generate the 2015 cloud-free greenest Landsat-8 composite. Based on this composite image, we calculated the normalized difference built-up index (NDBI = (SWIR - NIR) / (SWIR + NIR)) and the normalized difference water index (NDWI = (NIR - SWIR) / (NIR + SWIR)). The mean and sum of NDVI, NDBI, and NDWI were then derived for each county.
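As a per-pixel illustration of the compositing and index calculation, the sketch below selects the observation with the maximum NDVI and then computes the three indices from it. The dictionary layout and function names are hypothetical, and because the text does not state which SWIR band is used, a single `swir` value is assumed.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or None when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else None


def greenest_observation(observations):
    """Pick the observation with the highest NDVI for one pixel.

    observations: list of dicts with 'red', 'nir', and 'swir' reflectances
    (hypothetical layout; cloud and shadow records are assumed removed).
    """
    def ndvi(obs):
        value = normalized_difference(obs["nir"], obs["red"])
        return value if value is not None else float("-inf")

    return max(observations, key=ndvi)


def pixel_indices(obs):
    """NDVI, NDBI, and NDWI following the formulas given in the text."""
    return {
        "NDVI": normalized_difference(obs["nir"], obs["red"]),
        "NDBI": normalized_difference(obs["swir"], obs["nir"]),
        "NDWI": normalized_difference(obs["nir"], obs["swir"]),
    }
```

With these definitions NDWI is the negative of NDBI for every pixel, so the two indices carry the same information with opposite sign.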

2.1.4 NPP-VIIRS nighttime light data

NPP-VIIRS NTL data, with a spatial resolution of 15 arc-seconds, perform well in characterizing various human activities, such as determining urban expansion patterns (Song et al., 2020) and monitoring human mobility (Cai et al., 2017). In this study, we collected the annual cloud-free composite of 2015, which is free of interference from stray light and lunar illumination. Each county’s annual mean and sum radiance values were then derived as features.

2.1.5 Land-use data

We collected land-use data in 2015 from the Land-use Status Remote Sensing Monitoring Database of China provided by the Chinese Academy of Sciences Resource and Environmental Science Data Center (www.resdc.cn). The data with a spatial resolution of 30-m were produced based on Landsat TM/ETM remote sensing images, having six first-class and 25 second-class land-use types. The county-level coverage rate and total area of land-use types of “urban area”, “rural area”, “water”, “forest”, “cropland”, and “grassland” were derived as features.

2.1.6 DEM and slope

The Shuttle Radar Topography Mission (SRTM) V3 digital elevation model (DEM) data (Farr et al., 2007), which were provided by NASA in 2000 with a spatial resolution of 1 arc-second, were collected for China. Since changes in the earth’s surface elevation are slow, the temporal inconsistency between these data and the other data (collected in 2015) will have little effect on the estimation results. The mean and sum of DEM and slope were then derived for each county.
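Every dataset in Section 2.1 is reduced to county-level "mean" and "sum" features. A minimal sketch of such a zonal summary is given below, assuming that the feature raster and a raster of county identifiers share the same grid; the function name and the list-of-rows layout are illustrative only, and a production workflow would normally rely on a GIS library or a cloud platform instead.

```python
from collections import defaultdict


def zonal_mean_and_sum(values, zones, nodata=None):
    """Per-zone mean and sum for two equally shaped 2-D grids.

    values: raster values as a list of rows.
    zones: county identifiers for the same cells (None = outside any county).
    Cells equal to `nodata` or without a zone are skipped.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if zone is None or value == nodata:
                continue
            totals[zone] += value
            counts[zone] += 1
    return {
        zone: {"sum": totals[zone], "mean": totals[zone] / counts[zone]}
        for zone in totals
    }
```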

2.2 Mapping population density with automatic ensemble learning

We utilized the automatic ensemble learning framework embedded in AutoGluon (Version: 0.5.2) (https://github.com/awslabs/autogluon) to train the regression models for population density mapping. AutoGluon adopts an automated machine learning (AutoML) framework that employs a multi-layer stacking strategy (Erickson et al., 2020). As shown in Fig. 1, the multi-layer stacking framework of AutoGluon consists of one base layer and at least one stacking layer. The base layer has several base models, whose prediction outputs are concatenated and fed into the stacker models in the next layer, which in turn serve as base models for any higher stacking layers. To avoid the expensive costs of algorithm selection and hyperparameter optimization, AutoGluon simply reuses the same models, with the same hyperparameter values, as stackers in each layer (Erickson et al., 2020). This technique can be seen as another form of deep learning that uses layer-wise training, where the units connected between layers can be arbitrary machine learning models. AutoGluon also enables stacker models in any higher layer to take both the predictions from the previous layer and the original input features as input during training. The final stacking layer combines the predictions of the stacker models in a weighted manner by applying ensemble selection (Caruana et al., 2004).

Fig. 1 An example of the multi-layer stacking framework of AutoGluon with two stacking layers
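The weighted combination in the final layer can be illustrated with a simplified version of greedy ensemble selection in the spirit of Caruana et al. (2004). The sketch below is not AutoGluon's implementation; the function names, the fixed number of rounds, and the use of RMSE on held-out predictions are assumptions made for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equally long lists."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def ensemble_selection(y_true, model_preds, n_rounds=10):
    """Greedy forward selection with replacement.

    model_preds: {model_name: list of validation predictions}.
    Returns {model_name: weight} with weights summing to 1.
    """
    counts = {name: 0 for name in model_preds}
    running_sum = [0.0] * len(y_true)
    for step in range(1, n_rounds + 1):
        best_name, best_score = None, float("inf")
        for name, preds in model_preds.items():
            # Score the ensemble average if this model were added once more.
            candidate = [(s + p) / step for s, p in zip(running_sum, preds)]
            score = rmse(y_true, candidate)
            if score < best_score:
                best_name, best_score = name, score
        counts[best_name] += 1
        running_sum = [s + p for s, p in zip(running_sum, model_preds[best_name])]
    # A model's weight is the share of rounds in which it was selected.
    return {name: c / n_rounds for name, c in counts.items() if c > 0}
```

Because models can be selected repeatedly, stronger models receive larger weights, while models that never reduce the validation error receive none.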

AutoGluon can automate data pre-processing, base model search, hyperparameter tuning, and model ensembling during training. In this study, we employed automatic multi-layer stacking and 5-fold cross-validation as the training parameters of AutoGluon’s Tabular Prediction. The customized base models used in the Tabular Prediction (for regression) include neural network algorithms (e.g., NeuralNetFastAI, NeuralNetTorch), random forest algorithms (e.g., RandomForestMSE), extremely randomized tree algorithms (e.g., ExtraTreesMSE), K-nearest neighbor algorithms (e.g., KNeighborsDist, KNeighborsUnif), and boosting tree algorithms (e.g., LightGBM, CatBoost, XGBoost, LightGBMXT, LightGBMLarge). In addition, an ensemble learning algorithm is used to combine predictions in the final stacking layer.

All the 26 variables listed in Table 1 were used for model training which was performed using CPU. Note that since the national 1% sample survey population data we used represent the population with a standard time of November 1, 2015, at 00:00 (National Bureau of Statistics of China, 2016), we only used the Tencent LBS data collected from November for model training, and the data from other months were only used for dynamic population density mapping presented in Section 3.4.
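A training call consistent with this description might look like the sketch below. It is not the authors' script: the label column `pop_density`, the output directory, and the mapping of "automatic multi-layer stacking and 5-fold cross-validation" onto the `auto_stack` and `num_bag_folds` arguments of `TabularPredictor.fit` are assumptions.

```python
def build_fit_kwargs(num_folds=5, auto_stack=True):
    """Keyword arguments for TabularPredictor.fit (assumed mapping of the
    settings described in the text: automatic stacking, 5-fold bagging)."""
    return {"auto_stack": auto_stack, "num_bag_folds": num_folds}


def train_density_model(train_df, label="pop_density", output_dir="ag_models"):
    """Train an AutoGluon regressor on county-level features.

    train_df: pandas DataFrame holding the feature columns plus `label`.
    `label` and `output_dir` are hypothetical names.
    """
    # Imported lazily so that build_fit_kwargs works without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(
        label=label,
        problem_type="regression",
        eval_metric="root_mean_squared_error",
        path=output_dir,
    )
    return predictor.fit(train_df, **build_fit_kwargs())
```

After training, `predictor.leaderboard(test_df)` and `predictor.feature_importance(test_df)` could be used to obtain per-model scores and permutation importance on held-out data, and `predictor.predict(monthly_df)` to map other months.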

2.3 Accuracy assessment and comparison

We split the 2851 samples into 80% for training and 20% for validation. Root mean square error (RMSE, Eq. 1) was used as the indicator for accuracy validation. To assess the contribution of the different inclusive features, we calculated feature importance scores for the final weighted ensemble model via permutation importance. Specifically, the importance score of a feature represents the decrease in the model’s prediction accuracy (measured by RMSE in this study) when the values of the feature are randomly shuffled across rows. The higher a feature’s score, the more important the feature is for the prediction accuracy of the model. A negative importance score means the feature may be harmful to the model’s prediction, and a model trained without a feature that has a negative score is likely to achieve better predictive performance (Erickson et al., 2020). We further included three more indicators for model comparison: relative root mean square error (%RMSE, Eq. 2), mean absolute error (MAE, Eq. 3), and the coefficient of determination (r2).

$$RMSE=\sqrt{\frac{\sum_{i=1}^n{\left({p}_i-\hat{p_i}\right)}^2}{n}}$$
(1)
$$\% RMSE=\frac{RMSE}{\frac{1}{n}{\sum}_{i=1}^n\hat{p_i}}\times 100\%$$
(2)
$$MAE=\frac{\sum_{i=1}^n\left|{p}_i-\hat{p_i}\right|}{n}$$
(3)

where \(p_i\) is the estimated population density of the ith county, \(\hat{p}_i\) is the observed population density of the ith county, and n is the number of counties included in this study.
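The four indicators can be computed as in the following sketch. Two points are assumptions rather than statements from the text: %RMSE is scaled to percent, as the magnitudes reported in Section 3.1 suggest, and r2 is taken as one minus the ratio of the residual to the total sum of squares.

```python
import math


def regression_metrics(estimated, observed):
    """Return RMSE, %RMSE (in percent), MAE, and r2 for paired lists."""
    n = len(observed)
    errors = [e - o for e, o in zip(estimated, observed)]
    ss_res = sum(err ** 2 for err in errors)

    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(err) for err in errors) / n

    # %RMSE normalizes RMSE by the mean observed density (Eq. 2).
    mean_obs = sum(observed) / n
    pct_rmse = rmse / mean_obs * 100.0

    # Assumed definition: r2 = 1 - SS_res / SS_tot.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot

    return {"RMSE": rmse, "%RMSE": pct_rmse, "MAE": mae, "r2": r2}
```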

Four mainstream population distribution datasets (Table 1), namely WorldPop, LandScan, GPW, and GHSL, were further collected for comparison with the estimation results. Using census population density as the benchmark, we compared the RMSE of our estimated population density with that of these four gridded population datasets at the county level and calculated the county-average %RMSE at the national scale.

3 Results and discussion

3.1 Comparison of different models

A two-layer stacking structure was finally generated by AutoGluon for county-level population density estimation in China, which used 11 individual machine learning models and one weighted ensemble model. Table 2 shows the comparison of performance between different models.

Table 2 Performance of different machine learning models in terms of root mean square error (RMSE), relative root mean square error (%RMSE), mean absolute error (MAE), and R-squared (r2)

In the training phase, the NeuralNetFastAI model scores highest among the 11 individual models with an RMSE of 752.12, followed by ExtraTreesMSE (RMSE = 841.74), CatBoost (847.61), LightGBMXT (850.97), and LightGBMLarge (862.86). The weighted ensemble model, which combines predictions from all the individual models, achieved the best prediction performance with an RMSE of 728.37. The training time also varied significantly across models, from 0.01 to 32.52 seconds (Table 2), with a mean of 8.19 seconds and a standard deviation of 11.06 seconds. The K-Neighbors models take the least training time, while the neural network models take the longest to train. The total time used for training the weighted ensemble model is 79.63 seconds.

Four criteria were further used to assess the testing performance of the models. Among the individual models, NeuralNetFastAI has the best performance in RMSE (497.58), %RMSE (55.86), and r2 (0.967), and ExtraTreesMSE has the lowest MAE (151.45). The weighted ensemble model outperforms all individual models in RMSE, %RMSE, and r2 (443.01, 49.74, and 0.974, respectively), with an MAE of 158.65. At the same time, we found that increasing the number of layers did not contribute significantly to prediction accuracy in this study. For example, when we increased the number of stacking layers from 2 to 3, the model performance improved by only 0.005 in r2 (and 4.45 in RMSE).

3.2 Population density mapping for China

We estimated the population density for 2851 Chinese counties using the generated weighted ensemble model. The high goodness of fit (r2 = 0.974) and low RMSE (427.61) for all counties indicate that the inclusive features and the weighted ensemble model can estimate county-level population density well. As shown in Fig. 2a, the derived map accurately characterizes the numerical and spatial distribution patterns of county population density in China. According to the map, apart from the counties within or around provincial capitals, counties with high population density are mainly concentrated in the eastern and southeastern coastal regions. In particular, counties located in the provinces of Hebei, Henan, Shandong, Zhejiang, and Jiangsu form the main high-density population concentration areas in China. In addition, counties located in the Sichuan Basin (mainly Sichuan and Chongqing) have the higher population densities in southwest China.

Fig. 2 Estimation and error of county-level population density in China in 2015: (a) AutoGluon-based population density mapping; (b) AutoGluon-based and (c) NeuralNetFastAI-based estimation error compared to census population density

The difference between the census population density and the estimation was then calculated as (census - estimation) / census * 100%. As shown in Fig. 2b, owing to the good prediction capability of the weighted ensemble model, the vast majority of the estimation errors are within ±50%. Some underestimations appear in counties with higher population density, while counties with overestimation are mainly concentrated in less densely populated regions in western China, such as the regions (or provinces) of Tibet, Xinjiang, and Qinghai. We also performed the same estimation using the NeuralNetFastAI model, which achieved the second-best performance after the weighted ensemble model in both the training and testing processes. Compared with the NeuralNetFastAI-based result (Fig. 2c), the AutoGluon framework not only has better fitting accuracy but also alleviates the severe overestimation in less populated areas in the western region. Statistically, 1483 counties showed underestimation, accounting for 51.91% of all the selected counties.
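The error measure and the share of underestimated counties can be expressed as in the short sketch below, in which a positive error means that the estimate is lower than the census value. The function names are illustrative.

```python
def relative_error_pct(census, estimate):
    """(census - estimate) / census * 100; positive means underestimation."""
    return (census - estimate) / census * 100.0


def underestimation_share(census_values, estimates):
    """Count and percentage of counties whose estimate is below the census."""
    errors = [relative_error_pct(c, e) for c, e in zip(census_values, estimates)]
    n_under = sum(1 for err in errors if err > 0)
    return n_under, n_under / len(errors) * 100.0
```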

We further compared the estimated county-level population density with the other four population datasets, namely LandScan, WorldPop, GPW, and GHSL. The accuracy of each dataset was evaluated by RMSE against the census population data, and the relationship between any two of them was measured by r2. As shown in the scatterplot matrix in Fig. 3, the population density estimated by the proposed model obtained the best performance in both r2 and RMSE, followed by LandScan (r2 = 0.949, RMSE = 535.12), GHSL (r2 = 0.941, RMSE = 633.02), WorldPop (r2 = 0.938, RMSE = 684.28), and GPW (r2 = 0.926, RMSE = 836.07). The results also show that all tested population datasets are reliable in characterizing the spatial pattern of population density in China, with LandScan and GHSL being the more accurate of the four.

Fig. 3 Comparison of different population datasets with census data in county-level population density (per km2)

3.3 Importance of inclusive features

Table 3 lists the inclusive features’ importance scores measured by RMSE. “Importance” and “Standard deviation” are the estimated importance score of a feature and its standard deviation. “p-value” measures the confidence level of the importance score; a p-value of 0.01 indicates a 1% probability that the feature is useless or harmful to the model’s prediction. “p99_high” and “p99_low” refer to the upper and lower ends of the 99% confidence interval for the true feature importance score, respectively.
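The permutation importance behind these scores can be illustrated with the simplified sketch below. It is not AutoGluon's implementation; the generic `predict` callable, the row layout, and the number of repeats are assumptions. The returned value is the mean increase in RMSE after one feature is shuffled, so a value near zero or below zero marks a feature that the model does not benefit from.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean square error between two equally long lists."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(predict, rows, y_true, feature, n_repeats=5, seed=0):
    """Mean increase in RMSE after shuffling one feature across rows.

    predict: callable mapping a list of feature dicts to predictions.
    rows: list of dicts, one per county; feature: the key to shuffle.
    """
    rng = random.Random(seed)
    baseline = rmse(y_true, predict(rows))
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[feature] for row in rows]
        rng.shuffle(shuffled)
        # Copy each row so the original data stay untouched.
        permuted = [dict(row, **{feature: value}) for row, value in zip(rows, shuffled)]
        increases.append(rmse(y_true, predict(permuted)) - baseline)
    return sum(increases) / n_repeats
```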

Table 3 Importance scores of inclusive features in terms of root mean square error (RMSE)

According to Table 3, “mean” features in different categories play a more critical role in county-level population density prediction: nine of the ten features with the highest importance scores are mean values. Of all the inclusive features, mean Tencent LBS (LBS_mean), urban area coverage (Urban_mean), mean nighttime light brightness (NTL_mean), and mean normalized difference built-up index (NDBI_mean) are the four most important features for prediction, with both high importance scores and low p-values. This result is reasonable for the following reasons. (1) The Tencent LBS data are generated by active mobile phone users and record their real-time locations, so they characterize the population distribution well. (2) Similarly, the brightness of nighttime light represents the intensity of human socio-economic activity. Except for industrial facilities such as docks and power plants, areas with high lighting levels usually have high densities of human activity. (3) Urban areas usually have relatively high population density, so higher urban coverage is naturally associated with higher population density. (4) The NDBI can be considered a representation of the intensity of human modification of the land surface, and a higher NDBI generally corresponds to a higher population density. More importantly, the significant contribution of the variables’ mean values to model performance is a valuable finding of this study. Specifically, the greater contribution of the “mean” values makes the proposed framework reasonable and interpretable, as the observation (ground truth for training) and the prediction target in the model were set as population density rather than the total population of a county. Simply speaking, the “mean” of the features predicts the “mean” of the population (i.e., population density) well. In addition, the lower importance of the “sum” values provides evidence that the proposed framework has potential for cross-scale prediction, for example, town-level and fine-grid-level (e.g., 1-km) population density estimation.

Negative importance scores were also found for six features, namely NDVI_sum, Grassland_mean, Rural_sum, Forest_sum, DEM_sum, and Slop_sum, which means they are likely to play a harmful role in the prediction results. However, the p-values of all these features’ importance scores are larger than 0.5, indicating that their harmful effect should be very weak. We then retrained the model after removing these features and found that the prediction performance remained almost the same (r2 = 0.973, RMSE = 448.13).

3.4 Intra-annual population density dynamics in China

We used the weighted ensemble model to estimate county-level population density for the months other than November 2015, changing only the mean and sum of the Tencent LBS data to the corresponding month and keeping the other variables the same. Pearson correlation coefficients between the monthly county-level population densities (Fig. 4a) reveal intra-annual population distribution dynamics in China. Using November as the reference month, we find that September and October have the highest similarity with November, while January and February differ most from November. This result is consistent with earlier studies (Li et al., 2016; Pan & Lai, 2019) and reasonably reflects the characteristics of population mobility in China. Specifically, the largest population movement in China occurs around the Spring Festival, when large numbers of workers and students return to their hometowns before the festival and go back to the cities where they work within a short period after it. The Spring Festival in 2015 fell on February 19, which explains the relatively large differences between February and the other months, especially September to November.
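The month-to-month comparison can be reproduced with a plain Pearson correlation over the county-level estimates, as sketched below. The data layout (a dictionary of month labels mapped to lists of county densities in a fixed county order) and the function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def monthly_correlation_matrix(density_by_month):
    """Pairwise Pearson r between months.

    density_by_month: {month_label: list of county densities}, with the
    counties in the same order for every month.
    """
    months = list(density_by_month)
    return {
        (m1, m2): pearson_r(density_by_month[m1], density_by_month[m2])
        for m1 in months
        for m2 in months
    }
```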

Fig. 4 Intra-year population density dynamics in China: (a) Pearson correlation coefficients between monthly county-level population density in 2015; (b) difference in population density between November and February 2015. M1 to M12 denote January to December

We further calculated the difference in population density between November and February (population_densityNov – population_densityFeb) to uncover the spatial pattern of population distribution dynamics during the Spring Festival in 2015. As shown in Fig. 4b, the most dramatic population density changes generally occur in areas that are more economically developed. These cities or towns are more attractive to the labor force because they offer more job opportunities and higher salaries. Specifically, cities with significant declines in population density include those in the three major economic zones, namely the Pearl River Delta city cluster (e.g., Guangzhou, Shenzhen, Dongguan, Foshan, and Zhuhai), the Yangtze River Delta city cluster (e.g., Shanghai, Wuxi, Suzhou, Ningbo, Changzhou, Hangzhou, Taizhou, and Wenzhou), and the Beijing-Tianjin-Hebei city cluster (e.g., Beijing, Tianjin, Zhangjiakou, Tangshan, and Baoding), as well as provincial capital cities (e.g., Shenyang, Jilin, Harbin, Jinan, Fuzhou, Haikou, Nanning, Kunming, Guiyang, Chengdu, Wuhan, Changsha, Xi’an, Zhengzhou, Xining, Lanzhou, and Urumqi), eastern coastal cities (e.g., Dalian, Yantai, Qingdao, and Xiamen), and Chongqing. This result is generally consistent with earlier studies that used different study designs (Zhou et al., 2020; Zhu et al., 2021), illustrating that our estimated monthly population densities characterize the dynamics of population spatial distribution in China well.

3.5 Advantages and limitations

Using China as an example, this study presents a new framework for population spatialization using multi-source geospatial data. The framework alleviates some of the problems arising in current population estimation methods and offers the potential to extend to estimations at larger spatial scales for the following reasons. First, it uses publicly available, easily accessible, and limited data sources. Only five categories of data are used in this framework, including Landsat-8, nighttime light data, land-use map, DEM map, and Tencent LBS data, which are all globally available. The collection and pre-processing of some data can also be performed quickly and efficiently on cloud computing platforms, such as Google Earth Engine, NASA Earth Exchange, Amazon AWS, and PIE-Engine. While the Tencent data values only apply to the China study, alternative digital footprint data, such as geo-tagged Twitter data, can be used to map countries and regions outside China. Second, it employs cost-effective automatic ensemble learning models. AutoGluon not only achieves satisfactory accuracy in population mapping but also dramatically reduces the cost of model selection and parameter tuning, making the estimation framework proposed in this study easy for other scholars to reproduce and refine. Third, its predictions are transferable across time and regions. The satisfactory results of monthly mapping and the revealed characteristics of the intra-annual population dynamics in China suggest that the proposed model can be applied across time by changing only the social sensing data (digital footprints of the population), without being influenced by changes in the quality of remote sensing data (e.g., cloudiness, phenology), which could directly contribute to high-temporal-resolution population mapping (e.g., seasonal, monthly, daily). Moreover, the proposed framework’s stable performance and overall high accuracy suggest its transferability across regions. Therefore, for regions where it is challenging to train models due to the lack of observation data, well-trained models from other regions have the potential to deliver good accuracy.

Nevertheless, some limitations of this framework should be pointed out. (1) Since we do not have quality-assured observations at smaller scales, such as the community scale, the performance of the proposed framework at small scales is unknown. (2) The estimation bias in China shows a spatial pattern, with overestimation in the west (low population density areas) and underestimation in the east (high population density areas), which is an issue that could be addressed. We will attempt to solve this problem in our future work by adding variables or introducing spatial information as appropriate.

4 Conclusions

Using AutoGluon and multi-source geospatial data, this study proposed an efficient framework for population spatialization. Based on this framework, we estimated the county-level population density in China using a limited number of publicly available datasets, including Tencent LBS data, Landsat-8 OLI imagery, nighttime light data, land-use maps, and a DEM. The results showed that the proposed framework could estimate the population density of 2851 counties in China well, with a high goodness of fit (r2) of 0.974 and a low RMSE of 427.61. The comparisons with WorldPop, LandScan, GPW, and GHSL data also illustrate that the framework outperforms the current mainstream population mapping frameworks in terms of estimation accuracy. Of all the features involved in the modeling, mean Tencent LBS, urban area coverage, mean nighttime light brightness, and mean normalized difference built-up index are the four features that contribute most to the estimation capacity. Furthermore, the derived monthly county-level population density and the revealed spatial pattern of population dynamics in China are consistent with earlier studies, corroborating the robustness of the proposed framework. This study is the first to apply AutoGluon to population estimation and mapping, and its efficient and automated modeling capabilities are expected to contribute to population spatialization studies at larger scales and finer spatial-temporal resolutions.