Urban Vegetation Classification for Unmanned Aerial Vehicle Remote Sensing Combining Feature Engineering and Improved DeepLabV3+


1. Introduction

Urban vegetation is a key component of urban ecosystems, influencing the urban landscape pattern; it has functions such as absorbing noise, reducing haze, and mitigating the urban heat island effect [1,2,3]. Classifying and extracting urban vegetation information have important research significance and application value in various fields such as urban land use change, ecological environment monitoring, urban vegetation monitoring, and urban planning [4,5].
Traditional vegetation classification is mainly based on field surveys, which are inefficient for complex terrain and large areas. With the development of remote sensing technology, remote sensing images have become widely used in vegetation survey work [6]. Satellite remote sensing is widely used in large-scale vegetation monitoring due to its advantages of wide coverage and long time-series observation. However, the spatial resolution of satellite remote sensing is usually limited by technology and cost; even satellites that offer high resolution may fail to capture fine vegetation and small features in urban environments accurately, and satellite observations are often affected by cloud cover. In contrast, unmanned aerial vehicle (UAV) remote sensing, characterized by ultra-high resolution and flexibility, is gradually becoming an important tool for urban vegetation surveys [7]. Orthophotos generated from processed UAV imagery provide detailed spatial and textural vegetation information, making them more suitable for the fine classification of urban vegetation [8].
Traditional remote sensing classification methods for vegetation fall into two categories: pixel-based and object-based. The pixel-based method takes the pixel of the remote sensing image as the smallest classification unit and uses the feature information within the pixel to determine the vegetation category. This method is mostly used for medium-/low-resolution images, but in high-resolution images a single pixel may contain multiple feature types. Because the method considers neither contextual information nor the features of surrounding pixels, the phenomenon of “the same object with different spectra, and the same spectrum from different objects” arises [9]. The object-based classification method segments the image into objects with semantic features and uses them as the basic classification unit, comprehensively considering spectral, shape, textural, contextual, and other information; this method performs better in high-resolution remote sensing vegetation classification [10,11,12]. However, object-based methods are strongly influenced by preset parameters, and improper parameter selection degrades classification accuracy; they also suffer from over- and under-segmentation [13].
In recent years, with the rapid development of artificial intelligence technology, scholars have begun using deep learning methods for the remote sensing classification of vegetation [14,15]. Among deep learning methods, the convolutional neural network (CNN) was the first to be widely used for the remote sensing classification of vegetation. A CNN can automatically extract and learn vegetation features in an image through convolutional and pooling layers, and it simplifies the whole classification process by optimizing directly from the original input to the final output in an end-to-end manner. However, the fully connected layer in the CNN structure connects every neuron in the previous layer to every neuron in the current layer, resulting in a very large number of parameters. Hence, CNNs are not well suited to pixel-level classification tasks [16,17]. Long et al. [18] proposed the fully convolutional network (FCN) for pixel-level segmentation, which replaces the fully connected layers of the CNN with convolutional layers and introduces multi-scale feature maps, improving the model’s ability to perceive features within an image. Since then, scholars have mostly used improved semantic segmentation networks to classify urban vegetation. Xu Zhiyu et al. [19] utilized an improved U-Net to classify urban vegetation into evergreen trees, deciduous trees, and grasslands, with an overall accuracy of 92.73%. Kuaiyu et al. [20] designed a multi-scale feature-aware network to extract and classify urban vegetation from UAV images, with an average overall accuracy of 89.54%. Lin Na et al. [21] proposed a Sep-UNet semantic segmentation model to extract vegetation information in multiple urban scenes and obtained good results.
Although all of the above networks can effectively classify urban vegetation, they assign larger weights to pixels at the edges of different vegetation types during segmentation, resulting in lower edge segmentation accuracy between neighboring classes. To address this problem, the Google team proposed the DeepLab series of image segmentation networks [22]. This series continues the fully convolutional operation adopted by the FCN, optimizes and improves upon it, and has been widely used in image processing tasks in recent years. Among them, DeepLabV3+ is the latest improvement in the series; it combines an encoder–decoder architecture with atrous spatial pyramid pooling (ASPP), gradually recovering spatial information to capture clear target boundaries [23]. Studies have shown that DeepLabV3+ is suitable for extracting green space or vegetation information in cities; e.g., Wenya Liu et al. [24] achieved high-precision and high-efficiency automatic extraction of urban green space with a DeepLabV3+ network. However, the conventional DeepLabV3+ model still suffers from unrefined classification, a large number of network parameters, and long training times in urban vegetation classification [25]. Some scholars have tried to improve the DeepLabV3+ network with a lightweight approach and implemented urban vegetation classification for UAV images [26]. However, the feature learning capability of deep learning models relies on a large amount of training data, and the above scholars only used visible-band images as the data source; the number of features the model can learn from such samples is small, limiting the network’s performance [27]. To overcome this limitation, additional remote sensing feature data can be introduced to compensate for the insufficient information in visible light images [28].
Early shallow machine learning algorithms have a limited hypothesis space and cannot precisely express some complex problems [29]. For this reason, scholars construct feature engineering: the original data undergo a series of computational processes that refine more informative features, facilitating model learning and improving accuracy [30]. Studies have shown that feature engineering is not limited to improving the accuracy of shallow machine learning algorithms; well-constructed feature engineering can also improve the learning efficiency and classification accuracy of deep learning models. Sun et al. [31] demonstrated that combining a digital surface model with an FCN can improve the semantic segmentation of remote sensing images and significantly improve classification results. Lin Yi et al. [32] constructed a feature space containing spectral, textural, and spatial information, which effectively improved the fine classification accuracy of urban vegetation. Cui Bingge et al. [33] improved the information extraction accuracy of wetland vegetation by adding a vegetation index to a deep semantic segmentation network. Therefore, introducing feature engineering to improve the accuracy of deep learning networks in urban vegetation classification has clear research significance.

Based on the above discussion, this research proposes a UAV remote sensing urban vegetation classification method that combines feature engineering with an improved DeepLabV3+. Feature engineering containing vegetation indices and textural features was constructed under feature optimization to increase the number of features in the samples, and the DeepLabV3+ network was improved to increase the classification accuracy and efficiency of the model. Experiments were conducted in several areas of Zunyi City, used as the study area, with a self-constructed sample dataset to achieve the accurate and complete classification of trees, shrubs, mixed tree-shrubs, natural grasslands, and artificial grasslands.

3. Experiments

3.1. Constructing the Sample Dataset

The vector label data used in the experiments were all constructed by visual interpretation of the UAV remote sensing imagery: vegetation in the training area was classified into five categories (trees, shrubs, mixed trees and shrubs, natural grassland, and artificial grassland), and non-vegetation features such as buildings, roads, and water bodies were all categorized as background values.

In order to make reasonable use of computer memory, the remote sensing data and label data were simultaneously cut into 256 × 256-pixel sample slices using a sliding window with a 10% overlap rate. It has been shown that balancing positive and negative samples can improve model performance [60]. In this study, the positive samples are the multi-class vegetation samples, with pixel values from 1 to 5, and the negative samples are all other features, with a pixel value of 0 [61]. Following the principle of selecting high-quality samples, slices in which 0-valued pixels accounted for more than 80% of the pixels were removed using histograms; moreover, slices in which a single pixel value accounted for 100% of the pixels were removed from the positive samples in order to balance the multi-category sample sizes. In addition, to keep the training samples sufficient, data augmentation was used for sample expansion: horizontal flipping, vertical flipping, 90° rotation, 270° rotation, and diagonal mirroring were performed on a sample-by-sample basis, as outlined in the sketch below. Finally, 11,478 image sample slices and label sample slices each were obtained in the training region, of which 80% were randomly assigned to the training set and 20% to the validation set. Some of the sample slices are shown in Figure 8.
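The slicing, filtering, and augmentation steps described above can be outlined as a minimal NumPy sketch. It assumes `image` is an (H, W, C) array and `label` an (H, W) array with class values 0–5; the stride derived from the 10% overlap, the function names, and the filtering details are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

TILE = 256
STRIDE = int(TILE * 0.9)  # 10% overlap between neighbouring tiles (assumption)

def slice_pairs(image, label):
    """Cut an image/label pair into 256 x 256 tiles and drop low-quality samples."""
    tiles = []
    h, w = label.shape
    for y in range(0, h - TILE + 1, STRIDE):
        for x in range(0, w - TILE + 1, STRIDE):
            img_t = image[y:y + TILE, x:x + TILE]
            lab_t = label[y:y + TILE, x:x + TILE]
            if np.mean(lab_t == 0) > 0.8:       # >80% background pixels: discard
                continue
            if np.unique(lab_t).size == 1:      # tile made of a single value: discard
                continue
            tiles.append((img_t, lab_t))
    return tiles

def augment(img_t, lab_t):
    """Horizontal/vertical flips, 90/270 degree rotations, diagonal mirroring."""
    ops = [
        lambda a: a[:, ::-1],            # horizontal flip
        lambda a: a[::-1, :],            # vertical flip
        lambda a: np.rot90(a, k=1),      # rotate 90 degrees
        lambda a: np.rot90(a, k=3),      # rotate 270 degrees
        lambda a: np.swapaxes(a, 0, 1),  # diagonal mirroring (transpose)
    ]
    return [(op(img_t), op(lab_t)) for op in ops]
```

Applying the five augmentation operations to each retained tile expands every sample into six training examples (the original plus five transformed copies).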

3.2. Feature Optimization

The ReliefF algorithm was executed on the PyCharm platform. Random points are generated on each feature image (the vegetation index and textural feature maps), and the gray values extracted at each point form a sample for ReliefF. First, the data are normalized to ensure that the scale of each feature is consistent, and a weight vector over the features is initialized. In each iteration, a sample is randomly selected and its Euclidean distance to the other samples is calculated to find its nearest neighbors of the same class (samples belonging to the same category as the current sample) and its nearest neighbors of different classes (samples belonging to other categories). The weights are then updated according to the feature differences between the sample and these neighbors: if a feature differs more between dissimilar-class samples than between same-class samples, its weight is increased. This process is iterated several times until the algorithm converges. Finally, the features are ranked by their final weights, with higher-weighted features ranked first.
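For illustration, a simplified ReliefF weighting can be sketched in Python as follows, using one nearest hit and one nearest miss per class and assuming the features are already normalized. The neighbor count, iteration budget, and the function name `relieff_weights` are assumptions for this sketch; the authors' exact settings are not reported.

```python
import numpy as np

def relieff_weights(X, y, n_iter=200, seed=0):
    """X: (n_samples, n_features) normalised features; y: class label per sample."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))
    for _ in range(n_iter):
        i = rng.integers(n)
        xi, yi = X[i], y[i]
        dist = np.linalg.norm(X - xi, axis=1)  # Euclidean distance to all samples
        dist[i] = np.inf                       # exclude the sample itself
        # nearest hit: a same-class neighbour lowers the weight of features
        # that vary within the class
        hit_idx = np.where(y == yi)[0]
        hit = hit_idx[np.argmin(dist[hit_idx])]
        w -= np.abs(xi - X[hit]) / n_iter
        # nearest miss from each other class raises the weight of features
        # that differ between classes, weighted by the class prior
        for c in classes:
            if c == yi:
                continue
            miss_idx = np.where(y == c)[0]
            miss = miss_idx[np.argmin(dist[miss_idx])]
            w += prior[c] / (1 - prior[yi]) * np.abs(xi - X[miss]) / n_iter
    return w  # larger weight = more useful feature for separating the classes
```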

The weights of the 7 vegetation indices were calculated first and ranked to obtain Table 2; the weights of the 24 textural features were then calculated and ranked to obtain Table 3. According to the tables, among the vegetation indices the VDVI contributes the most to urban vegetation classification, and among the textural features the entropy calculated from the green band (G_Entropy) contributes the most, indicating that VDVI and G_Entropy have the greatest influence on urban vegetation classification in terms of spectral and textural information, respectively. Therefore, VDVI and G_Entropy were selected to construct the feature engineering for urban vegetation classification; these feature layers were fused with the remote sensing sample set to build the sample dataset combining feature engineering, which was input into the improved DeepLabV3+ model for training. Table 2 shows the ranking of the vegetation index weights.
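For reference, the two selected feature layers can be derived from the RGB bands as sketched below: VDVI is computed as (2G − R − B)/(2G + R + B), and G_Entropy is a gray-level co-occurrence matrix (GLCM) entropy computed over the green band. The window size, quantization level, and GLCM offsets here are illustrative assumptions, not values reported in the paper, and the per-pixel loop is deliberately naive.

```python
import numpy as np
from skimage.feature import graycomatrix  # scikit-image >= 0.19

def vdvi(rgb):
    """Visible-band difference vegetation index: (2G - R - B) / (2G + R + B)."""
    r, g, b = (rgb[..., i].astype(np.float32) for i in range(3))
    return (2 * g - r - b) / (2 * g + r + b + 1e-6)

def glcm_entropy(gray, window=15, levels=32):
    """Per-pixel GLCM entropy of a single band (e.g. the green band)."""
    scale = max(float(gray.max()), 1.0)
    q = (gray.astype(np.float32) / scale * (levels - 1)).astype(np.uint8)
    half = window // 2
    out = np.zeros(q.shape, dtype=np.float32)
    for y in range(half, q.shape[0] - half):
        for x in range(half, q.shape[1] - half):
            patch = q[y - half:y + half + 1, x - half:x + half + 1]
            glcm = graycomatrix(patch, distances=[1], angles=[0],
                                levels=levels, symmetric=True, normed=True)
            p = glcm[glcm > 0]
            out[y, x] = -np.sum(p * np.log2(p))  # Shannon entropy of the GLCM
    return out
```

In practice, a moving-window texture tool in a remote sensing package would be used instead of this pixel-by-pixel loop; the sketch only makes the definitions of the two features explicit.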

3.3. Experimental Environment and Model Training

The experiments were conducted on a 64-bit Windows 10 operating system, with TensorFlow 2.9 + Keras as the deep learning framework and Python 3.9 as the programming language. The GPU is an NVIDIA RTX 4090 with 24 GB of video memory, and the CPU is an Intel i9-12900K with 24 GB of RAM. All deep learning models in this study use ReLU (rectified linear unit) as the activation function and He_Normal as the weight initializer, with appropriate dropout layers added to reduce overfitting. The hyperparameters were kept the same for all models during training: the batch size is 32, training runs for 200 epochs, the number of input channels is 6, cross entropy is used as the loss function, and Adam is used as the gradient descent optimizer. To enable the network to converge quickly and effectively during training, the learning rate was set using piecewise constant decay: the initial learning rate was 0.001, and the learning rate was automatically decreased by a factor of 10 every 20 epochs [62]. Feeding the entire test area image into the model during prediction would cause memory overflow, so the test image was cropped into 256 × 256-pixel slices before prediction, allowing the model to read and predict it piece by piece; the prediction results were then mosaicked and output.
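A minimal Keras sketch of this training configuration is given below. It assumes `model` is the improved DeepLabV3+ with (256, 256, 6) inputs and that the label tiles hold integer class values 0–5; the helper names `lr_schedule` and `compile_and_train` and the use of sparse categorical cross entropy are illustrative assumptions.

```python
import tensorflow as tf

def lr_schedule(epoch, lr):
    """Piecewise constant decay: start at 1e-3, divide by 10 every 20 epochs."""
    return 1e-3 * (0.1 ** (epoch // 20))

def compile_and_train(model, x_train, y_train, x_val, y_val):
    """model: improved DeepLabV3+ taking (256, 256, 6) inputs;
    x_*: image tile arrays, y_*: integer label tiles with values 0-5."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="sparse_categorical_crossentropy",  # cross-entropy on integer labels
        metrics=["accuracy"],
    )
    return model.fit(
        x_train, y_train,
        validation_data=(x_val, y_val),
        epochs=200,
        batch_size=32,
        callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)],
    )
```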

3.4. Precision Evaluation

In order to quantify the model’s vegetation classification accuracy on the test images, the accuracy evaluation metrics commonly used in semantic segmentation tasks were selected: overall accuracy (OA), the macro average of the F1-score (MacroF1), intersection over union (IOU), and mean intersection over union (MIOU). OA is the ratio of the number of correctly classified pixels to the total number of pixels and is an overall index of classification performance; MacroF1 is the mean of the per-class F1-scores, each computed from precision and recall. MIOU is the mean of the per-class IOU values and is used to evaluate the overall segmentation accuracy of the model in vegetation classification. In addition, to evaluate the efficiency of the deep learning models, the number of model parameters and the training time were included as evaluation indices; the fewer the parameters and the shorter the training time, the more efficient the model.
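For clarity, these metrics can all be computed from a pixel-level confusion matrix, as in the following sketch; the function name and the small epsilon terms used to avoid division by zero are illustrative.

```python
import numpy as np

def segmentation_metrics(conf):
    """conf: (K, K) confusion matrix, conf[i, j] = pixels of class i predicted as j."""
    conf = conf.astype(np.float64)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    eps = 1e-12

    oa = tp.sum() / conf.sum()                       # overall accuracy (OA)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    macro_f1 = f1.mean()                             # MacroF1
    iou = tp / (tp + fp + fn + eps)                  # per-class IOU
    miou = iou.mean()                                # MIOU
    return {"OA": oa, "MacroF1": macro_f1, "IOU": iou, "MIOU": miou}
```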

5. Discussion

In this study, UAV RGB images were used as the data source. By replacing the backbone network, adjusting the atrous rates of the ASPP, and adding an attention mechanism, we effectively improved the performance of DeepLabV3+. Zhang et al. [26] employed an improved model akin to ours to extract vegetation from two residential areas in Nanjing, China, using UAV imagery, and also obtained excellent results. Their model improvement reduced the training time by 25.70%, whereas our method achieves an even larger reduction of 32.60%. The key distinction is that the former incorporates only channel attention, while we introduce both channel attention and spatial attention. This difference indicates that CBAM enhances model performance by combining these two attention mechanisms. Lin et al. [32] proposed a methodology combining feature engineering with an improved deep learning model to classify vegetation in a plot in the Binhe Garden District of Jiaozuo City, Henan Province, China, based on remotely sensed imagery. Owing to constraints such as small sample sizes and a lack of migratable sample datasets in their study, the OA for vegetation classification was limited to 83.30%, falling short of the OA achieved in our study (92.27%). Our experiment benefits from an ample supply of training samples, allowing the deep learning model to achieve higher accuracy.
We added feature engineering to the improved DeepLabV3+, which further improves the model’s classification results for urban vegetation; in particular, the MIOU is 4.91% higher than that of the method without feature engineering. This result is consistent with previous findings on combining deep learning and feature engineering for vegetation extraction [65]. Xu et al. [66] added a vegetation index to a deep learning model for urban vegetation remote sensing classification and also achieved high accuracy. However, the extraction accuracy for grassland in that study was 75%, while the accuracy for both natural and artificial grassland extracted by our method is above 80%. This may be because we added more GLCM-based textural features to our feature engineering: since the texture of grassland is smoother and clearly different from that of trees and shrubs, the GLCM features improve the segmentation accuracy of grassland. In addition, the UAV images used in our study were all acquired at the same altitude in the same season. Considering the possible effects of images from different seasons and altitudes on vegetation classification, we will combine remote sensing data from different shooting altitudes and seasons to support vegetation classification in future work. At the same time, we will consider adding more types of feature information to the feature engineering to strengthen the confidence of the model’s classification decisions and further improve its performance.

6. Conclusions

Existing fine classification methods for urban vegetation require a great deal of time and do not categorize vegetation effectively. Therefore, this research proposed an automatic urban vegetation classification method that combines feature engineering and an improved DeepLabV3+, with UAV images as the data source. Through comparison experiments with different methods, validation of the effectiveness of each improvement, and a migration test, the following main conclusions are drawn: The proposed method can accurately and completely categorize the vegetation in UAV images into trees, shrubs, mixed tree-shrubs, natural grasslands, and artificial grasslands, and the segmentation of trees is the best, achieving a segmentation accuracy of 91.83%. Meanwhile, the feature engineering constructed under feature optimization significantly improves the overall segmentation accuracy of the deep learning model. Replacing the backbone network and adjusting the atrous rate shorten the model training time while improving its segmentation accuracy. After adding the CBAM, the classification accuracy for urban vegetation is further improved. In conclusion, the improvement mechanisms of this study’s method are all effective in enhancing urban vegetation classification. In addition, the method has high classification efficiency and a certain migration ability, making it suitable for rapid investigations of urban vegetation. Overall, the method proposed in this paper can quickly and accurately classify urban vegetation in UAV images, which is of great significance for exploring vegetation changes and their applications.

