Green Space Reverse Pixel Shuffle Network: Urban Green Space Segmentation Using Reverse Pixel Shuffle for Down-Sampling from High-Resolution Remote Sensing Images

1. Introduction

According to United Nations projections, the world’s population is expected to increase by 2.25 billion, reaching a total of 9.15 billion by 2050 [1]. Such rapid population growth imposes significant ecological pressure on cities [2], intensifying problems such as the urban heat island (UHI) effect [3] and the various forms of pollution that accompany urbanization. Urban green spaces (UGS), the vegetated areas within cities, are a crucial component of urban ecosystems [4] and play a pivotal role in sustainable urban development [5]. On the one hand, UGS contribute significantly to improving the urban ecological environment [6] and mitigating environmental problems such as air pollution [7] and noise pollution [8]. On the other hand, UGS improve the physical and mental well-being of residents [9], reduce stress and anxiety [10], and promote a healthier lifestyle [11].
However, the environmental pollution and land-use pressures resulting from urban expansion and construction pose a significant threat to urban ecosystems [12]. Therefore, for the sustainable and healthy development of cities and the advancement of United Nations Sustainable Development Goal 11 [13], the rapid and accurate acquisition of UGS information has become increasingly critical. Over the past decade, local government authorities have surveyed and mapped green spaces through field investigations. However, these traditional survey methods are time-consuming and resource-intensive, often yield incomplete and outdated information, and tend to overlook small, scattered green spaces. This has hindered the development and implementation of policies aimed at sustainability. Innovative methods for UGS survey and statistics therefore remain an open research problem.
With the rapid advancement of Earth-observation technology, the data acquisition capability of remote sensing has improved significantly, marking the dawn of a new era of multi-platform, multi-angle, multi-sensor, all-weather, all-time Earth observation [14]. Developing methods for the accurate and rapid extraction of UGS information from multispectral or hyperspectral remote sensing imagery is therefore a critical issue. Gandhi et al. [15] proposed a satellite-image vegetation change detection method based on the Normalized Difference Vegetation Index (NDVI), using Landsat TM remote sensing data together with NDVI and DEM data for multisource vegetation classification. Zhou et al. [16] introduced an urban forest type discrimination method based on Linear Spectral Mixture Analysis (LSMA) and a Support Vector Machine (SVM), using LSMA to extract three vegetation endmembers (broadleaf forest, coniferous forest, and low vegetation) and their abundances. Zhang et al. [17] generated a fine classification system for the 2015 Global 30 m Land Cover Classification (GLC_FCS30-2015) using Landsat image time series and high-quality training data from the Global Spatial Temporal Spectra Library (GSPECLib) on the Google Earth Engine platform. Nevertheless, UGS are characterized by fragmentation and complex backgrounds [18], and numerous small-scale UGS exist, such as roadside trees and isolated trees. These small-scale UGS are frequently overlooked because the spatial resolution of multispectral or hyperspectral remote sensing imagery is often limited compared with natural imagery, and they cannot be detected at all in low-spatial-resolution imagery. These factors lead to significant discrepancies between extracted UGS and ground truth. Hence, it is crucial to improve the accuracy of identifying fragmented and scattered UGS.
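
Several of these classical methods hinge on the NDVI, the normalized difference between near-infrared and red reflectance. The following minimal NumPy sketch shows how a simple NDVI-based vegetation mask could be computed; the GF-2 band ordering and the 0.3 threshold are illustrative assumptions, not values taken from the cited studies:

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    return (nir - red) / (nir + red + eps)  # eps guards against zero denominators

# Hypothetical usage on a multispectral array of shape (bands, H, W),
# assuming bands are ordered blue, green, red, NIR (as in GF-2 products):
# vegetation_mask = ndvi(image[3], image[2]) > 0.3   # 0.3 is an illustrative threshold
```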

High-spatial-resolution remote sensing imagery plays a pivotal role in addressing the aforementioned challenge. Its spatial resolution reaches a few meters or even finer than 1 m, offering detailed texture and spectral information about ground objects. This enables a comprehensive understanding of urban environments, supporting decisions for sustainable urban development. However, even high-spatial-resolution remote sensing imagery still lags behind natural imagery in resolution and quality because of factors such as satellite altitude, the performance of the satellite’s optical system, and manufacturing costs; mixed pixels therefore remain widespread, and the edge contours and surface texture information of ground objects are eroded. Moreover, UGS encompass various plant types whose spectral signatures differ greatly between species, so a model must store a wider range of information to capture this diversity effectively. At the same time, high-spatial-resolution remote sensing imagery contains many objects with similar spectral information, such as UGS, farmland, forests, and some plant-rich water bodies, making spectral discrimination challenging. These characteristics make the spectral boundaries of ground objects in the feature space “steeper”, and it is therefore difficult to extract UGS information accurately using threshold methods or shallow learning methods.

With the development of deep learning, image segmentation methods based on deep neural networks have found wide application in tasks such as pedestrian detection [19], lane recognition [20], and object identification [21], and have become vital tools for urban planning, environmental monitoring, ecological research, and more. These methods also offer new possibilities for extracting UGS from high-spatial-resolution remote sensing imagery, because deep learning models can extract complex features without manual feature design or substantial prior knowledge and can learn the nonlinear mapping between inputs and outputs [22]. Accordingly, Xu et al. [23] proposed a deep learning classification method for UGS that uses phenological characteristics as constraints. This approach takes full advantage of the spectral and spatial information provided by high-resolution remote sensing imagery from different periods, introducing vegetation phenological features as auxiliary bands for training and classification. Wang et al. [21] introduced a multi-level UGS segmentation architecture based on DeepLab V3+, aimed at extracting UGS information from high-resolution remote sensing imagery. Shi et al. [24] presented a general deep learning (DL) framework for large-scale urban green space mapping and generated fine-grained UGS maps (UGS-1 m) for 31 major cities in mainland China. Liu et al. [25] introduced a hybrid approach, the Multi-Scale Feature Fusion and Transformer Network (MFFTNet), as a new deep learning method for extracting UGS from GF-2 high-resolution remote sensing imagery. However, the task of extracting UGS is fundamentally different from natural-image semantic segmentation tasks [19,20,21]. First, compared with objects such as people, cars, and buildings, UGS exhibit highly irregular and unpredictable edges, resulting in irregular shapes in remote sensing imagery. In addition, UGS span a wide range of spatial scales, from large areas such as parks to small isolated green spaces such as individual trees and garden plots. Moreover, different UGS share similar surface texture features. Because of these factors, directly applying natural-image semantic segmentation models to UGS segmentation leads to a decline in performance.

To address the aforementioned issues, and following a thorough investigation and analysis of UGS, in this paper we propose an end-to-end UGS segmentation model, named the Green Space Reverse Pixel Shuffle Network (GSRPnet), for extracting UGS from GF-2 remote sensing imagery. The main contributions of this paper can be summarized as follows: (1) To minimize the loss of UGS information during down-sampling, and in line with the characteristics of UGS (a low-rank feature; see the analysis in a later section), we propose an enhanced UGS feature extraction backbone network called RPS-ResNet. RPS-ResNet replaces the large-kernel convolutional layer and the max-pooling layer in the original ResNet-50 with the reverse Pixel Shuffle approach proposed in this paper (see the sketches below); the last residual convolutional layer of the original ResNet-50 is also removed to reduce the model’s parameters without compromising accuracy. (2) Instead of using cross-entropy or binary cross-entropy alone for the segmentation task, Focal Loss and the Dice coefficient are combined for GSRPnet training, and the effects of these two losses on UGS segmentation accuracy under different weights are discussed. (3) To validate the correctness and effectiveness of the proposed ideas and modules, ablation studies were conducted, and five segmentation models were introduced for comparison to illustrate the superiority of GSRPnet.
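
As a rough illustration of contribution (1), the following PyTorch sketch shows how reverse Pixel Shuffle (available as PixelUnshuffle in PyTorch) can stand in for a stride-2 convolution plus max-pooling; the channel widths and the 3×3 projection are our illustrative assumptions, not the exact RPS-ResNet stem. Each 2×2 pixel block is rearranged into channels, halving the spatial resolution without discarding any values, after which a convolution mixes the rearranged channels:

```python
import torch
import torch.nn as nn

class RPSStem(nn.Module):
    """Hypothetical down-sampling stem in the spirit of RPS-ResNet."""
    def __init__(self, in_channels: int = 4, out_channels: int = 64):
        super().__init__()
        # (N, C, H, W) -> (N, 4C, H/2, W/2): lossless 2x spatial down-sampling
        self.unshuffle = nn.PixelUnshuffle(downscale_factor=2)
        self.proj = nn.Sequential(
            nn.Conv2d(in_channels * 4, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.unshuffle(x))

x = torch.randn(1, 4, 256, 256)   # one 4-band GF-2 patch
print(RPSStem()(x).shape)          # torch.Size([1, 64, 128, 128])
```

For contribution (2), a minimal sketch of a combined Focal and Dice objective follows; alpha, gamma, and the weight w are illustrative placeholders rather than the tuned settings studied in the paper:

```python
import torch
import torch.nn.functional as F

def focal_dice_loss(logits, targets, alpha=0.25, gamma=2.0, w=0.5, eps=1e-6):
    """Weighted sum of binary Focal Loss and Dice loss for a binary mask.
    logits and targets share the same shape; targets are float 0/1."""
    probs = torch.sigmoid(logits)
    # Focal Loss: down-weights easy examples via the (1 - p_t)^gamma factor.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = probs * targets + (1 - probs) * (1 - targets)
    a_t = alpha * targets + (1 - alpha) * (1 - targets)
    focal = (a_t * (1 - p_t) ** gamma * bce).mean()
    # Dice loss: 1 minus the Dice coefficient, computed over the whole batch.
    inter = (probs * targets).sum()
    dice = 1 - (2 * inter + eps) / (probs.sum() + targets.sum() + eps)
    return w * focal + (1 - w) * dice
```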

6. Conclusions

In this paper, we demonstrate that UGS constitute a low-rank feature. This implies that whether a location is UGS can be determined accurately from just one or a few surrounding pixels, and that segmentation accuracy does not depend strongly on model depth. Excessive down-sampling, which yields a large receptive field, does not contribute significantly to accuracy; in contrast, preserving more texture and spectral features enhances the model’s accuracy. We therefore propose a novel UGS segmentation network, named GSRPnet, which improves UGS segmentation accuracy with a small number of parameters. The feature extraction backbone used in GSRPnet, RPS-ResNet, is an enhancement of ResNet-50. In particular, it replaces the original down-sampling convolutional layers and max-pooling layers with the reverse Pixel Shuffle method, transforming relationships between adjacent pixels into relationships between channels. This minimizes the loss of UGS feature information caused by down-sampling. Experimental results on GaoFen-2 remote sensing imagery show that, with a parameter count of only 17.999 M, GSRPnet outperforms U-Net, PSPNet, SegNet, DeepLab V3+, and SGCN-Net in terms of precision, F1-score, IoU, and OA. This strongly validates the correctness of our proposed approach.
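
The claim that reverse Pixel Shuffle causes no information loss can be checked directly: unlike max-pooling or a strided convolution, the rearrangement is exactly invertible. A minimal PyTorch check (shapes are arbitrary, chosen for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 4, 128, 128)    # a batch of 4-band patches
down = nn.PixelUnshuffle(2)(x)     # (2, 16, 64, 64): 2x down-sampled, channels x4
restored = nn.PixelShuffle(2)(down)  # exact inverse of the unshuffle
assert torch.equal(x, restored)    # every pixel value is recovered
```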
