Individual Tree Species Identification for Complex Coniferous and Broad-Leaved Mixed Forests Based on Deep Learning Combined with UAV LiDAR Data and RGB Images


1. Introduction

Tree species information plays a crucial role in the dynamic monitoring of forest resources, biodiversity assessment, and the estimation of forest biomass and carbon storage. How to quickly and accurately obtain forest tree species information and map its spatial distribution has become a pressing research question [1]. Traditional tree species information acquisition relies on manual field investigation, which is costly and time-consuming. Identification of forest tree species with the help of remote sensing technology has been carried out for nearly forty years. Initially, less efficient methods such as visual interpretation were used to identify individual tree species, which, like fieldwork, was labor intensive [2]. With advancements in computer and sensor technologies, machine learning, and deep learning methods, multi-source remote sensing technologies have increasingly been utilized to enhance the automation and accuracy of individual tree information acquisition [3].
Accurate identification of individual tree species based on remote sensing technology requires ultra-high spatial resolution. Unmanned aerial vehicles (UAVs) have the advantages of cost-effectiveness, high flexibility, and adaptability across various terrains. To date, ultra-high spatial resolution remote sensing data from UAVs have been extensively used for small to medium-scale forest resource monitoring [4,5]. UAVs can carry different sensors depending on their carrying capacity, and most sensors fall into one of two categories based on their data acquisition principles. The first category is passive remote sensing, including RGB images, multispectral images, and hyperspectral images. Passive remote sensing technologies can obtain the spectral and textural information of the measured objects, which is of great help in identifying tree species [6]. Multispectral and hyperspectral sensors in particular can provide more abundant spectral information about the targets and offer more powerful classification capabilities [7], but at a higher cost. In addition, hyperspectral images are composed of dozens to hundreds of bands, leading to substantial data redundancy [8]. Hyperspectral images are also more susceptible to noise and to changes in external lighting during data collection, thus requiring more stringent acquisition conditions. Under equivalent flight conditions, RGB images can be obtained at higher spatial resolution and quality at a lower cost, which makes RGB data more widely used in practice [9]. The second category is active remote sensing, specifically Light Detection and Ranging (LiDAR) [10]. LiDAR technology captures the three-dimensional structure of target objects and can effectively perform individual tree segmentation and tree structure parameter extraction even over undulating tree crowns and terrain. However, it generally lacks spectral information and has only a limited ability to identify tree species [11,12].
Due to the complementary advantages and disadvantages of RGB images and LiDAR data, some researchers have attempted to combine these two types of data for tree species identification [13,14,15,16]. Previous studies combining the two data types for individual tree species identification generally involve two steps: individual tree segmentation and tree species identification [17,18]. Individual tree segmentation is based on either image data or point cloud data [11,19]. Commonly used individual tree segmentation methods for ultra-high spatial resolution RGB images, or for Canopy Height Model (CHM) images generated from LiDAR data, include image binarization, local maximum filtering, watershed segmentation, region growing, and so on [20]. Direct point cloud-oriented methods mainly achieve individual tree segmentation through clustering, such as k-means, mean-shift, adaptive distance, and others [21,22]. Subsequently, spectral features, texture features, point cloud spatial features, and point cloud echo features are extracted from the RGB images and LiDAR data, respectively, and machine learning algorithms are applied for individual tree species identification [2]. Among machine learning algorithms, support vector machines and random forests are widely used [23,24]. Although individual tree species identification can be achieved based on RGB images and LiDAR data, the process is relatively complex, and the accuracy of tree species identification is limited by the quality of the individual tree segmentation. In addition, machine learning algorithms require further feature analysis and extraction as well as parameter tuning [25], making them time-consuming, labor-intensive, and poorly generalizable when extended to different forest types.
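To make the conventional two-step pipeline concrete, the sketch below illustrates a typical CHM-based individual tree segmentation baseline: local maximum filtering to detect treetops, followed by marker-controlled watershed segmentation. It is a minimal illustration of the methods cited above, not the approach used in this study; the `chm` array and the height and distance thresholds are assumptions.

```python
import numpy as np
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def segment_crowns(chm, min_height=2.0, min_distance=20):
    """Treetop detection by local-maximum filtering, then marker-controlled
    watershed on the inverted CHM; a common baseline, not this paper's method."""
    # Candidate treetops: local maxima above a minimum tree height.
    peaks = peak_local_max(chm, min_distance=min_distance,
                           threshold_abs=min_height)      # (row, col) pairs
    markers = np.zeros(chm.shape, dtype=int)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    # Flood from treetops downhill on the inverted surface; the mask
    # restricts segments to canopy pixels above the height threshold.
    return watershed(-chm, markers, mask=chm > min_height)
```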
Deep learning, as a branch of machine learning, has made tremendous progress in recent years, benefiting from the development of high-performance computing platforms [26]. As an end-to-end approach, it does not require the tedious feature analysis and extraction that other machine learning algorithms do, which greatly improves the level of automation. In the image domain, deep learning tasks can be divided into three types: semantic segmentation, instance segmentation, and object detection. Among these, instance segmentation and object detection can directly achieve the goal of identifying individual tree species [25]. Instance segmentation can delineate tree crowns more accurately under simple stand conditions. For example, Hao et al. [27] and Li et al. [6] utilized Mask R-CNN for individual tree detection in plantations. However, in complex forest environments, significant overlap and occlusion between tree crowns make it challenging to outline individual tree crowns accurately and completely. Furthermore, producing such datasets requires manual delineation of tree crown contours, which is extremely laborious [6]. In contrast, object detection determines individual tree crown positions and boundaries by identifying rectangular candidate boxes, significantly reducing the cost of dataset creation. Object detection is generally classified into one-stage and two-stage methods. Faster R-CNN, a representative two-stage network, was introduced in 2015 [28], and both the original and its improved variants have been widely used in a variety of fields. For example, Luo et al. [29] and Xia et al. [30] successfully implemented individual tree detection using Faster R-CNN in sparse plantations and on ginkgo trees growing in urban environments, respectively. However, some studies have pointed out that Faster R-CNN has lower detection accuracy in high canopy density forests [31,32]. In recent years, one-stage networks have made significant advancements [33], mainly including You Only Look Once (YOLO) [34], the Single Shot MultiBox Detector (SSD) [35], and RetinaNet [36]. RetinaNet was introduced in 2017 and exhibited higher accuracy than the two-stage Faster R-CNN [36]. The YOLO network was initially proposed in 2015; its regression-based design directly generates detection boxes [34,37] and has been rapidly applied in various fields. Chen et al. [38] applied an improved YOLO v4 for individual bayberry tree detection. Wang et al. [39] successfully detected dead trees in protected forests based on the improved LDS-YOLO. Jintasuttisak et al. [40] compared different YOLO models and SSD for individual date palm tree detection, and the results showed that YOLO v5 had the highest accuracy. Puliti et al. [41] utilized YOLO v5 for detecting individual trees damaged by snow accumulation. Dong et al. [42] implemented individual tree crown detection and width extraction in Metasequoia glyptostroboides forests using an improved YOLO v7. Although object detection methods have been introduced into the forestry field, there are few studies on individual tree species identification. The latest version, YOLO v8 [43], was released in early 2023; research results on YOLO v8 remain scarce, and its performance needs further exploration.
In summary, current research on individual tree species identification based on RGB images, LiDAR data, or a combination of both from UAVs mostly targets low canopy density forest stands [44,45,46], and most studies applying object detection to tree crown detection do not involve tree species identification [47,48]. There have been no reports on individual tree detection and tree species identification in complex coniferous and broad-leaved mixed forests based on the fusion of multi-source remote sensing data from a UAV platform and object detection methods. Additionally, UAV data acquisition efficiency drops as flight altitude is lowered to obtain finer spatial resolution. Moreover, multi-source remote sensing data are typically acquired through separate flight missions. Since UAV attitude and positioning cannot be kept fully consistent across missions, registration errors arise, and precise data registration has become a significant challenge in multi-source remote sensing fusion [13]. Therefore, further in-depth research is needed on the application of different object detection models in complex forest stands, on the precise registration of different data sources, and on the impact of different data fusion methods and spatial resolutions on the performance of tree species identification models [49].

Based on the current research status, this study proposes an object detection method that combines LiDAR point cloud data and ultra-high spatial resolution RGB images in natural coniferous and broad-leaved mixed forests under complex conditions, achieving highly automated and high-precision identification of individual tree species. The specific objectives of this study are to:

(1) Explore the individual tree species identification ability of the YOLO v8 model in natural coniferous and broad-leaved mixed forests under complex conditions, compare YOLO v8 with current mainstream object detection models (RetinaNet, Faster R-CNN, SSD, YOLO v5), and reveal the impact of different image spatial resolutions and YOLO v8 model scales on individual tree species identification results.

(2) Evaluate the effectiveness of current band combination methods for multi-source remote sensing data in identifying individual tree species in natural coniferous and broad-leaved mixed forests under complex conditions, compared to single data sources.

(3) Propose an improved YOLO v8 model suited to the characteristics of multi-source remote sensing forest data to achieve more precise individual tree species identification in natural coniferous and broad-leaved mixed forests under complex conditions.

3. Methods

Firstly, a UAV equipped with both visible light and LiDAR sensors was used, enabling the acquisition of both RGB images and LiDAR data of the study area in a single flight mission. CHM images were derived from the preprocessed LiDAR data, and a PCA transformation was performed on the RGB data. Various fused images were obtained through band combination and, together with the plot investigation data, were annotated to produce a multi-source remote sensing forest dataset. Subsequently, the YOLO v8, RetinaNet, SSD, Faster R-CNN, and YOLO v5 networks were used to identify individual tree species in the RGB dataset. Then, RGB datasets with different spatial resolutions and YOLO v8 scales were used to study the identification ability of YOLO v8 in complex forest stands, and the impact of spatial resolution on individual tree species identification was analyzed. The effects of different fusion methods for multi-source remote sensing data on tree species identification were also explored. Finally, based on the characteristics of the multi-source remote sensing forest data, the AMF GD YOLO v8 model was proposed. The model uses the Attention Multi-level Fusion Network (AMFNet) as its backbone, which allows multi-source remote sensing data to be input into two branches for feature extraction and fusion at multiple scales. It replaces the original model's PAN-FPN (Path Aggregation Network with Feature Pyramid Network) neck with a gather-and-distribute mechanism, enhancing the model structure. The performance of the improved model was verified through ablation experiments. The overall research workflow is shown in Figure 3.

3.1. Data Preprocessing

The raw data were processed using DJI Terra 3.4 software, which registered and stitched the RGB data. Given the UAV flight altitude and RGB camera parameters, an orthophoto with a maximum spatial resolution of 2.7 cm could be produced; this 2.7 cm RGB imagery was selected for subsequent analyses, including the exploration of how spatial resolution affects tree species identification. The original point cloud data were filtered and denoised using LiDAR360 4.1.3 software, normalized against the ground points, and cropped to the study area. CHM images were generated using inverse distance weighted (IDW) interpolation. To facilitate data fusion, the CHM spatial resolution was set to match that of the RGB images, i.e., 2.7 cm.
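As an illustration of this step, the following sketch rasterizes a normalized point cloud into a CHM using IDW interpolation. It is a simplified stand-in for the LiDAR360 workflow; the `points` array (x, y, height above ground, in metres) and the neighbor count and power settings are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def idw_chm(points, resolution=0.027, k=8, power=2.0):
    """Rasterize normalized LiDAR points (x, y, height-above-ground)
    into a CHM grid using inverse distance weighted interpolation."""
    xy, z = points[:, :2], points[:, 2]
    x_min, y_min = xy.min(axis=0)
    x_max, y_max = xy.max(axis=0)
    xs = np.arange(x_min, x_max, resolution)
    ys = np.arange(y_min, y_max, resolution)
    gx, gy = np.meshgrid(xs, ys)
    grid = np.column_stack([gx.ravel(), gy.ravel()])

    tree = cKDTree(xy)
    dist, idx = tree.query(grid, k=k)      # k nearest points per grid cell
    dist = np.maximum(dist, 1e-6)          # guard against division by zero
    w = 1.0 / dist**power                  # inverse distance weights
    chm = (w * z[idx]).sum(axis=1) / w.sum(axis=1)
    return chm.reshape(gy.shape)
```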

3.2. Dataset Creation and Data Fusion

The UAV multi-source remote sensing forest dataset was based on the tree locations and species information obtained from the plot investigation. The tree location information (GNSS RTK points) and RGB images were imported into ArcGIS Pro and combined with the point cloud visualization function of LiDAR360, which allowed the edges of individual tree crowns to be accurately determined. The RGB images were manually annotated using LabelImg 1.8.6 software to obtain the RGB image dataset. In accordance with general practice, the dataset was divided into training, validation, and test sets at a ratio of 6:2:2. Partial data annotation results are shown in Figure 4.
Figure 4 shows the annotation results for a part of the study area in which all trees were annotated. The trees were divided into eight categories covering seven main tree species: Betula platyphylla (BP), Pinus koraiensis (PK), Juglans mandshurica (JM), Larix gmelinii (LG), Fraxinus mandshurica (FM), Picea asperata (PA), and Ulmus pumila (UP). The less abundant tree species were grouped together as other tree species (OT).

Since the input to an object detection model is generally a 3-channel RGB image, depth information can be fused by replacing one channel of the RGB image with the CHM, generating RG-D, R-D-B, and D-GB images. However, this approach directly discards one third of the RGB data, whereas a PCA transformation concentrates the main information of the data into the leading components. Therefore, ENVI 5.3 software was used to perform a principal component transformation of the RGB images, and the first two principal components were fused with the depth information to obtain PCA-D images. Since the image size and spatial resolution of the fused images were identical to those of the RGB images, the corresponding image datasets were obtained by tiling them in the same way as the RGB images.
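The band combination and PCA-D fusion can be sketched as follows, assuming co-registered `rgb` (H×W×3, uint8) and `chm` (H×W, float) arrays; scikit-learn's PCA stands in for the ENVI transformation, and the rescaling choices are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def to_uint8(a):
    """Rescale an array to 0-255 so it can be stacked with 8-bit bands."""
    a = a.astype(np.float32)
    return ((a - a.min()) / (np.ptp(a) + 1e-6) * 255).astype(np.uint8)

def fuse(rgb, chm, mode):
    """Build a 3-channel fused image: replace one RGB band with the CHM
    ('RG-D', 'R-D-B', 'D-GB'), or stack two RGB principal components
    with the CHM ('PCA-D')."""
    d = to_uint8(chm)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    if mode == "RG-D":
        return np.dstack([r, g, d])
    if mode == "R-D-B":
        return np.dstack([r, d, b])
    if mode == "D-GB":
        return np.dstack([d, g, b])
    if mode == "PCA-D":
        h, w, _ = rgb.shape
        pcs = PCA(n_components=2).fit_transform(
            rgb.reshape(-1, 3).astype(np.float32)).reshape(h, w, 2)
        return np.dstack([to_uint8(pcs[..., 0]), to_uint8(pcs[..., 1]), d])
    raise ValueError(f"unknown fusion mode: {mode}")
```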

Data augmentation helps avoid overfitting and improves model robustness and generalization ability [51]. Therefore, the same data augmentation pipeline was applied to the training sets above so that the effect of different data fusion methods on individual tree identification accuracy could be compared fairly. The augmentation methods included flip, rotation, shear, hue, capture, brightness, exposure, noise, and mosaic operations.
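The exact augmentation parameters are not specified; as a hedged sketch, a comparable pipeline could be assembled with the Albumentations library (mosaic is usually applied inside the YOLO dataloader rather than as a per-image transform), with all magnitudes below being illustrative:

```python
import albumentations as A

# Geometric and photometric transforms approximating the listed operations;
# bounding boxes in YOLO format are transformed along with the images.
train_transforms = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.Rotate(limit=15, p=0.5),
        A.Affine(shear=(-10, 10), p=0.3),
        A.HueSaturationValue(hue_shift_limit=10, p=0.3),
        A.RandomBrightnessContrast(p=0.3),
        A.GaussNoise(p=0.2),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)
```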

3.3. Performance Comparison of Different Object Detection Models

To compare the individual tree species identification capabilities of different models, the RGB dataset was used to train various models, including the one-stage algorithms RetinaNet, SSD, YOLO v5, and YOLO v8. In addition, the classic two-stage object detection algorithm Faster R-CNN was trained to explore its capability to identify individual tree species in complex natural coniferous and broad-leaved mixed forests.
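For the YOLO family, training and evaluation can be driven through the Ultralytics API; the sketch below is illustrative only (the dataset YAML name and hyperparameters are assumptions, and RetinaNet, SSD, and Faster R-CNN would be trained in their own frameworks):

```python
from ultralytics import YOLO

# Hypothetical dataset config listing the train/val/test image folders
# and the eight tree species categories.
DATA = "rgb_trees.yaml"

model = YOLO("yolov8x.pt")                        # pretrained weights
model.train(data=DATA, imgsz=640, epochs=100, batch=16)
metrics = model.val(split="test")                 # reports mAP@50, etc.
print(metrics.box.map50)
```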

3.4. Tree Species Identification Effectiveness of Different Scales and Spatial Resolutions in YOLO v8

YOLO v8 is available at five scales defined by different scaling factors, namely n, s, m, l, and x, each with an increasing number of parameters. With additional parameters, model accuracy rises, but the model also becomes larger, more complex, and slower to run. This study obtained RGB datasets with spatial resolutions of 2.7, 3.6, 5.4, 8.1, 10, 15, 20, 30, 40, 50, and 80 cm through resampling and trained on them to explore the impact of different spatial resolutions and model scales on the accuracy of individual tree species identification.
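The resampling itself is straightforward; below is a sketch using OpenCV area interpolation, where the resampling kernel and file handling are assumptions rather than the study's stated procedure:

```python
import cv2

SRC_RES_CM = 2.7                                  # native orthophoto GSD
TARGET_RES_CM = [3.6, 5.4, 8.1, 10, 15, 20, 30, 40, 50, 80]

def resample(image, target_cm):
    """Downsample the orthophoto to a coarser ground sampling distance;
    INTER_AREA is a sensible kernel when aggregating pixels."""
    scale = SRC_RES_CM / target_cm
    h, w = image.shape[:2]
    new_size = (max(1, round(w * scale)), max(1, round(h * scale)))
    return cv2.resize(image, new_size, interpolation=cv2.INTER_AREA)
```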

3.5. Tree Species Identification Performance of Different Data Fusion Methods

Currently, there is limited research on the optimal fusion method for RGB and CHM image data in the field of forestry. Therefore, based on the YOLO v8 model, this study trained on the RG-D, R-D-B, D-GB, and PCA-D datasets to compare the effects of the different fusion methods on individual tree species identification accuracy.

3.6. AMF GD YOLO v8 Model

The YOLO v8 model structure is shown in Figure 5. It integrates the advantages of previous YOLO versions, enabling YOLO v8 to achieve higher accuracy in various object detection tasks and establishing it as the state of the art (SOTA) in the YOLO series at the time of this study [43,52].
Although YOLO v8 has robust object detection capabilities, it cannot directly accept multi-source remote sensing data as input, and the aforementioned band combination fusion methods lose some data information. When band-combined RGB and CHM data are fed into the network, the two data types contribute differently to the final detection results because they carry different kinds of effective information. Feeding two different forms of data into the same network structure can easily lead to the loss of effective CHM information [6]; dual-branch feature extraction networks are better suited to processing multimodal data [53,54]. To fully explore the potential of combined multi-source remote sensing data for identifying individual tree species in complex natural coniferous and broad-leaved mixed forests, the AMF GD YOLO v8 object detection network, based on YOLO v8, was proposed, as shown in Figure 6.

3.6.1. AMFNet

The Attention Multi-level Fusion Network (AMFNet) consists of two feature extraction branches and data fusion modules, which take RGB and CHM images as simultaneous inputs to achieve multi-level fusion of the images. The two feature extraction branches are based on the YOLO v8 backbone, and the extracted features are fused through a data fusion module. In the data fusion module, the Convolutional Block Attention Module (CBAM) attention mechanism was added [55]. The CBAM module sequentially generates attention maps in the channel and spatial dimensions and multiplies them with the original input feature map for adaptive feature refinement, highlighting important data and suppressing irrelevant information. After several trials, attention modules were added to each branch separately, before feature fusion, allowing the attention mechanism to emphasize the useful information of each data source without mutual interference. However, this can leave the different data features lacking interaction after concatenation. To address this issue, we drew on the ShuffleNet structure [56] and added inter-channel information exchange, enabling the network to learn more mixed features and adapt to the characteristics of both data types, thus improving the model's expressive capability and predictive performance. The specific structure of the data fusion module is shown in Figure 7. Figure 7a–c show the structural variants used in ablation experiments to further explore their effects.
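A hedged PyTorch sketch of one such data fusion module is given below: CBAM on each branch, concatenation, ShuffleNet-style channel shuffle for inter-channel exchange, and a 1×1 projection. It follows the description above rather than the exact Figure 7 implementation; the channel counts, reduction ratio, and projection layer are assumptions.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Simplified CBAM: channel attention followed by spatial attention [55]."""
    def __init__(self, c, r=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(c, c // r, 1), nn.ReLU(), nn.Conv2d(c // r, c, 1))
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x):
        # Channel attention from average- and max-pooled descriptors.
        ca = torch.sigmoid(self.mlp(x.mean((2, 3), keepdim=True))
                           + self.mlp(x.amax((2, 3), keepdim=True)))
        x = x * ca
        # Spatial attention from channel-wise mean and max maps.
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)))
        return x * sa

def channel_shuffle(x, groups=2):
    """ShuffleNet-style shuffle so RGB- and CHM-derived channels mix."""
    b, c, h, w = x.shape
    return (x.view(b, groups, c // groups, h, w)
             .transpose(1, 2).reshape(b, c, h, w))

class AMFFusion(nn.Module):
    """Fuse same-scale RGB and CHM feature maps: per-branch CBAM,
    concatenation, channel shuffle, then a 1x1 conv back to c channels."""
    def __init__(self, c):
        super().__init__()
        self.att_rgb, self.att_chm = CBAM(c), CBAM(c)
        self.proj = nn.Sequential(
            nn.Conv2d(2 * c, c, 1), nn.BatchNorm2d(c), nn.SiLU())

    def forward(self, f_rgb, f_chm):
        fused = torch.cat([self.att_rgb(f_rgb), self.att_chm(f_chm)], dim=1)
        return self.proj(channel_shuffle(fused))
```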

3.6.2. Gather-and-Distribute Mechanism

The original YOLO v8 fuses multi-scale features using the PAN-FPN (Path Aggregation Network with Feature Pyramid Network) structure in the model neck, as shown in Figure 5. However, the FPN approach can only fully integrate the features of adjacent layers; information from other layers is obtained only indirectly and 'recursively'. Therefore, based on the neck module of the GOLD-YOLO model [57], the gather-and-distribute (GD) mechanism was introduced into the improved model neck, as shown in Figure 6. Unlike the traditional FPN, the GD mechanism uses a unified module to collect and fuse information from all levels and then distributes it back to the different levels, which effectively avoids the information loss inherent in traditional FPN structures.
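The idea can be sketched as follows; this is a deliberately simplified stand-in for GOLD-YOLO's low-stage and high-stage gather-and-distribute branches [57], with the channel sizes, reference scale, and injection scheme all being assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatherDistribute(nn.Module):
    """Conceptual gather-and-distribute neck: all backbone levels are
    resized to one reference scale and fused by a shared conv ('gather'),
    then the fused global feature is injected back into each level
    ('distribute'). Far simpler than GOLD-YOLO's actual modules [57]."""
    def __init__(self, channels):        # e.g. channels = [64, 128, 256, 512]
        super().__init__()
        c_all = sum(channels)
        self.fuse = nn.Sequential(
            nn.Conv2d(c_all, c_all // 2, 3, padding=1),
            nn.BatchNorm2d(c_all // 2), nn.SiLU())
        self.inject = nn.ModuleList(
            nn.Conv2d(c_all // 2 + c, c, 1) for c in channels)

    def forward(self, feats):            # feats: P2..P5, fine to coarse
        ref = feats[1].shape[2:]         # gather at the P3 resolution
        gathered = torch.cat(
            [F.interpolate(f, size=ref, mode="bilinear") for f in feats],
            dim=1)
        g = self.fuse(gathered)
        outs = []
        for f, inj in zip(feats, self.inject):
            g_i = F.interpolate(g, size=f.shape[2:], mode="bilinear")
            outs.append(inj(torch.cat([f, g_i], dim=1)))
        return outs
```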

The AMF GD YOLO v8 model is thus an improvement tailored to the characteristics of the different modalities of multi-source forest remote sensing data. Compared to the original YOLO v8, the improved model can take RGB and CHM images as simultaneous inputs for tree species identification and achieves multi-level fusion of RGB and CHM features through the feature fusion module. Because individual trees are mostly small to medium-sized targets, the gather-and-distribute mechanism was introduced to comprehensively utilize the features extracted by the backbone. Compared to PAN-FPN, a P2 detection layer was added to enhance the detection of small targets, thereby better achieving individual tree species identification in complex natural coniferous and broad-leaved mixed forests.

3.7. Accuracy Evaluation and Experimental Environment

This study used the F1 score, the harmonic mean of precision and recall, as well as the mean average precision at an IoU threshold of 0.5 (mAP@50) as accuracy evaluation metrics [32,44]. Model efficiency was assessed using the frames per second (FPS) metric. To validate the performance of the proposed AMF GD YOLO v8 model, ablation experiments were designed to evaluate the effects of the different improvement modules: the performance of the AMFNet backbone, the different feature combination methods of the AMF module, the CBAM attention mechanism, and the gather-and-distribute neck for individual tree species identification on the UAV multi-source remote sensing datasets.
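For reference, with TP, FP, and FN denoting the true positive, false positive, and false negative counts, these metrics take their standard forms:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F1 = \frac{2PR}{P + R}, \qquad
\text{mAP@50} = \frac{1}{N}\sum_{i=1}^{N} AP_i ,
```

where $AP_i$ is the area under the precision-recall curve of class $i$ at an IoU threshold of 0.5 and $N$ is the number of tree species categories.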

The experiments in this study used PyTorch as the deep learning framework and were conducted on a desktop computer running Windows 11. The hardware included an NVIDIA GeForce RTX 3090 GPU with 24 GB of VRAM, 32 GB of DDR4 RAM, and an Intel Core i7-12700 CPU, providing a robust platform for the deep learning experiments and evaluations in this study.

5. Discussion

Compared to other object detection models, YOLO v8 has higher detection accuracy [62,63]. In this study's comparison with current mainstream object detection algorithms, YOLO v8 demonstrated its advantages in tree species identification. At the same time, the effects of spatial resolution and data fusion method on YOLO v8's identification of individual tree species in complex natural coniferous and broad-leaved mixed forests were explored. Furthermore, the AMF GD YOLO v8 model was proposed based on the characteristics of multi-source remote sensing forest data, achieving precise identification of individual tree species in complex natural coniferous and broad-leaved mixed forests by combining RGB images and LiDAR data from UAVs.
With the same hardware configuration, UAV remote sensing platforms can only achieve higher spatial resolution by lowering the flight altitude. While UAVs offer excellent flexibility, their data acquisition efficiency decreases significantly as flight altitude decreases. It is therefore important to clarify the impact of spatial resolution on tree species identification in complex forest stands and to balance regional coverage against spatial resolution, which is crucial for improving the efficiency of UAV data acquisition [64]. The spatial resolution of UAV RGB images has a significant impact on individual tree species identification: higher spatial resolution yields higher identification accuracy, but at the cost of lower efficiency. However, when the spatial resolution is finer than 8 cm, the accuracy improvement becomes less significant. This trend has also been confirmed in similar studies [32].

YOLO v8 provides five scales for researchers to use. As the number of parameters increases, accuracy improves, but the model becomes larger and more complex, and computational efficiency decreases. This study used the different scales to train datasets at different spatial resolutions to explore the differences in model accuracy and speed. The x scale of YOLO v8 achieved the highest individual tree species identification accuracy but the lowest speed. In contrast, the n and s scales had lower accuracy but faster speeds, while the l and m scales achieved accuracy close to that of the x scale with faster detection speeds. Thus, in practical applications, it is important to choose appropriate scales and spatial resolutions based on the research objectives to strike an efficient balance between accuracy and detection speed. This consideration is vital for optimizing tree species identification performance and data collection efficiency in UAV remote sensing.

The difficulty of identification varies from species to species. According to the individual tree species identification results for different species (Table 3), the identification accuracy for coniferous trees such as Pinus koraiensis, Larix gmelinii, and Picea asperata was higher than that for broadleaf trees, which is similar to the research results of Beloiu et al. and Fricker et al. [31,65]. This is because coniferous trees have more regular appearances, with significant and uniform height variation at the top and edges of the crown, whereas broadleaf trees have smoother crown tops and irregular crown shapes. Overall, the models achieved good results in identifying individual trees of the dominant species in the study area, but the accuracy for the 'other tree species' category was relatively low, possibly due to its small sample size and heterogeneous species composition.
Band combination is the simplest and most commonly used method of multi-source remote sensing data fusion [6,27,29,66,67], and different combination methods showed different effects for different tree species. The identification accuracies of D-GB and PCA-D were 75.5% and 76.2%, respectively, indicating good identification performance. The identification accuracy of tree species using the PCA transformation for image fusion proposed in this study is superior to conventional band combinations. The reason is that the PCA transformation concentrates information in the leading components, reducing some data noise while feeding more useful information into the deep learning network, thereby enhancing predictive accuracy. The experimental results showed that band combination is a simple but effective approach for fusing multi-source remote sensing data. However, band combination complicates data preprocessing, reduces automation, and involves some information loss. Li et al. [6] noted that because different remote sensing data come from different sensors, their characteristics differ greatly; when band-combined data were input into a single network structure, the network could not fully extract each data source's features, leading to the loss of some effective information. In this study, as much information as possible was input into the model through band combination, but its effectiveness was still inferior to that of AMF GD YOLO v8.
From the model structure (Figure 6), it can be seen that AMF GD YOLO v8 avoids the problems of information loss and of a single feature extraction network being unable to adapt to multi-source remote sensing data. Comparing the band combination experiment results (Table 2) with the improved model results (Table 3), using only the dual-branch backbone, without the CBAM attention mechanism, channel communication, or GD neck (Figure 7a), increased mAP by 3.5%. This validates the suitability of the proposed dual-branch backbone for multi-source remote sensing data. Further, according to the ablation experiment results, the CBAM attention mechanism highlighted important features of the data, helping the model focus on the most informative parts of the input and ignore unimportant information, thus enhancing the model's ability to detect and identify individual trees. Compared to the fusion method without feature interaction (Figure 7b) in the AMF module, the feature interaction method (Figure 7c) obtained a 1.1% improvement in individual tree identification accuracy. This indicates that the proposed feature interaction method can address the lack of interaction between features extracted by the different branches of the model backbone, thereby improving tree species identification accuracy. In the model's neck, the ablation results proved that the GD mechanism is superior to PAN-FPN; the improvement in identification accuracy comes not only from the GD mechanism's unified module for collecting and fusing information across levels, but also from its avoidance of the information loss inherent in traditional FPN structures [57]. As the model structure diagram (Figure 6) shows, the GD mechanism also integrates P2-layer detection information into the module, thereby enhancing the detection of individual trees in the forest stand.

This study achieved automated and precise identification of individual tree species in natural coniferous and broad-leaved mixed forests based on deep learning. It explored the impacts of different models, spatial resolutions, and data fusion methods on individual tree species identification, and the proposed AMF GD YOLO v8 model achieved encouraging results in individual tree identification. However, there are still some limitations worthy of further research. The dual-branch feature extraction and fusion structure of the AMF module improves accuracy but also increases the computational complexity of the model. Future research could focus on developing more lightweight models that can be deployed on small-scale devices to enable real-time acquisition of tree species information, or on developing more advanced model architectures to further improve the accuracy of tree species identification.

While the creation of a multi-source remote sensing forest dataset has validated the efficacy of the AMF GD YOLO v8 model, its generalizability across forest types under varying geographical, climatic, or ecological conditions still requires further verification. Moreover, current research on individual tree species identification using deep learning is hindered by the lack of comprehensive, large-scale public datasets encompassing a wide variety of tree species, which would be crucial for enhancing model performance and universality.
