Improved YOLOv7 Target Detection Algorithm Based on UAV Aerial Photography


1. Introduction

In recent years, UAV remote sensing has been applied more and more widely across many fields, with outstanding performance in scenarios such as battlefield inspection, disaster rescue, environmental survey, electric power overhaul, and monitoring and inspection. The use of drones has greatly improved the efficiency with which such tasks are accomplished. Compared with traditional satellite remote sensing and similar means, UAV remote sensing images offer significantly better resolution and accuracy, but they still suffer from long shooting distances, small targets, serious occlusion, and weak recognizable features. In addition, because of the limited payload of the UAV, the airborne edge computing platform can rarely meet the computational demands of common deep learning algorithms, which also poses a problem for practical applications.

Target detection from the UAV perspective is difficult because UAV images exhibit scale changes, both sparse and dense target distributions, and a higher proportion of small targets; in particular, the high computational demand of high-resolution UAV images is hard to reconcile with the limited computing power of current low-power chips. Compared with natural images taken from a ground viewpoint, the wide field of view of the UAV viewpoint provides richer visual information, but it also implies more complex scenes and more diverse targets, introducing more useless noise into target detection. Moreover, in aerial views, targets are often harder to detect because of long shooting distances, background occlusion, or lighting effects, so high-resolution images must be used. This greatly increases the computational overhead and memory requirements of target detection algorithms, and directly applying general-purpose detection algorithms that have not been specially designed incurs unbearable computational and memory costs, further exacerbating the difficulty of target detection. Real-world application scenarios also often involve fine-grained classification problems, such as identifying vehicle types, and these visually similar targets pose a huge challenge for correct recognition.

Traditional target detection consists of region selection, feature extraction, and classification. Candidate regions are first searched in the image to be detected; then, features are extracted and classified. Since a target may appear at any position in the image and its aspect ratio and size cannot be determined beforehand, a sliding window with different scales must be used to traverse the image. This strategy can locate possible targets, but it suffers from high time complexity, redundant windows, and poor region matching, which seriously affect the speed and quality of subsequent feature extraction. In practice, because of this time complexity, and for targets with widely varying aspect ratios, it is difficult to obtain well-matched feature regions even after traversing the whole image. In the feature-extraction stage, features such as local binary patterns, scale-invariant feature transforms, and histograms of oriented gradients are often used. Because of the uncertainty of target morphology, the diversity of lighting changes, and the complexity of the background, it is very difficult to make such features robust. In summary, traditional detection methods are unstable, easily affected by a variety of conditions, and difficult to put into practical use.
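The cost of this sliding-window pipeline can be seen in a minimal sketch, assuming HOG features and a linear SVM classifier; the window size, stride, feature parameters, and classifier here are illustrative choices, not any specific published implementation:

```python
# Minimal sketch of the traditional sliding-window detection pipeline
# (HOG features + linear classifier). All hyperparameters are illustrative.
from skimage.feature import hog
from sklearn.svm import LinearSVC  # assumed to be pre-trained on HOG features


def sliding_windows(image, window=(64, 64), stride=16):
    """Yield (x, y, patch) for every window position in a grayscale image."""
    h, w = image.shape[:2]
    for y in range(0, h - window[1] + 1, stride):
        for x in range(0, w - window[0] + 1, stride):
            yield x, y, image[y:y + window[1], x:x + window[0]]


def detect(image, classifier, window=(64, 64), stride=16):
    """Classify every window; return boxes the classifier scores as positive."""
    boxes = []
    for x, y, patch in sliding_windows(image, window, stride):
        feat = hog(patch, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2))
        if classifier.decision_function([feat])[0] > 0:
            boxes.append((x, y, x + window[0], y + window[1]))
    # Note: even one fixed window size already costs O(H * W / stride^2)
    # feature extractions and classifications per image and per scale.
    return boxes
```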

As technology continues to advance and mature, visual target detection plays a key role in practical applications. In recent years, numerous related tech unicorns, such as Shangtang Technology (SenseTime) and Kuangwei Technology (Megvii), have emerged in the industry. Meanwhile, computer vision has become crucial to autonomous driving, with representative companies such as Tesla using visual perception to lead its development. Despite many advances, UAV visual detection still faces many challenges: first, aerial images differ from natural-scene images, making it difficult to identify targets accurately; second, the UAV target-detection task has high requirements for both real-time performance and accuracy.

To address the above problems, this study improves YOLOv7. The improved algorithm was evaluated on the public VisDrone2019 dataset, demonstrating high detection accuracy. First, the improved algorithm incorporates dynamic snake convolution (DSC) into the Backbone, which significantly improves detection accuracy. Second, an improved SPPF replaces the original spatial pyramid pooling structure. Finally, the original loss function is replaced with EIoU.

4. Analysis of Experimental Results

4.1. Datasets

To test the algorithm's performance, experiments were conducted on the VisDrone2019 dataset. The dataset was compiled by the AISKYEYE team of Tianjin University and contains 288 videos, 261,908 frames, and 10,209 still images. The data were captured by various drone platforms and cover a wide range of scenarios, spanning 14 cities in China separated by long distances, different urban and rural environments, and various objects (e.g., pedestrians, vehicles, etc.) at varying densities.

VisDrone2019 contains 6471 training images, 548 validation images, and 1610 test images, covering a wide range of traffic scenes, such as highways, intersections, and T-junctions, as well as varied conditions from day and night to hazy and rainy weather. The dataset can therefore be used to validate UAV-based detection of small ground targets. All methods in the experiments are trained on the training set and evaluated on the validation set.

4.2. Experimental Steps

Table 1 displays the experimental hardware setup: an Intel(R) Core(TM) i9-113500F CPU @ 3.50 GHz, with model training on a GeForce RTX 3090 with 24 GB of video RAM and 40 GB of system RAM. The experiments ran on a Windows operating system, were programmed in Python 3.8.6, and were built on the PyTorch 1.11.0 framework with CUDA 11.6 for GPU acceleration.

The experiments used the Adam optimizer to update the network's weights during training. The optimizer was configured with a batch size of 16, a learning rate of 0.01, a momentum of 0.937, a weight decay of 0.0005, and a training duration of 300 epochs to ensure the network was trained thoroughly.

This research adopts the Adam optimizer to refine our model, blending the strengths of Momentum and RMSprop algorithms for effective weight adjustment. The essence of Adam, or adaptive moment estimation, lies in its ability to calculate both the mean (first-order moment) and the gradients’ uncentered variance (second-order moment). It then dynamically tailors the learning rate for each parameter based on these calculations. This approach enables Adam to adjust its step size based on the parameter update history, thus offering faster convergence and enhancing both the training’s efficiency and stability, in contrast to conventional stochastic gradient descent (SGD) methods.
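As a minimal sketch of this training configuration in PyTorch, assuming the common practice of mapping the listed momentum value to Adam's first-moment decay and using a placeholder module in place of the full YOLOv7 network:

```python
# Sketch of the stated optimizer settings; the placeholder model and loss
# stand in for the actual YOLOv7 network and detection loss.
import torch

model = torch.nn.Conv2d(3, 16, 3)  # placeholder standing in for YOLOv7

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.01,               # initial learning rate from the experimental setup
    betas=(0.937, 0.999),  # momentum 0.937 mapped to Adam's first-moment decay (assumption)
    weight_decay=0.0005,   # weight decay from the experimental setup
)

# One illustrative training step: batch size 16 (300 epochs in the full run).
images = torch.randn(16, 3, 640, 640)
loss = model(images).mean()          # stand-in for the detection loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```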

4.3. Evaluation Indicators

The model employs precision (P), recall (R), average precision (AP), and mean average precision (mAP) as metrics to assess its performance. AP measures the detection accuracy for an individual category, while mAP is calculated by summing the AP values across all categories and dividing by the number of categories. In this study, mAP0.5 denotes the mAP computed at an IoU threshold of 0.5, where IoU measures the overlap ratio between the predicted and ground-truth bounding boxes.

$$P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}$$

$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R$$

$$mAP = \frac{1}{N}\sum_{i=1}^{N}\int_{0}^{1} P_i(R)\,\mathrm{d}R$$

In the model’s performance evaluation context, TP is the number of positive samples correctly identified as positive by the model. FP represents the count of negative samples incorrectly classified as positive. Meanwhile, FN denotes the number of positive samples that were mistakenly categorized as negative. These metrics are crucial for calculating precision, recall, and other related performance indicators.
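To make these definitions concrete, the following sketch computes the metrics for made-up counts and an illustrative precision–recall curve; the all-point trapezoidal integration used here is an assumption, not necessarily the interpolation rule of the VisDrone evaluation toolkit:

```python
# Numerical sketch of P, R, AP, and mAP. All counts and curve values are
# illustrative, not results from the paper.
import numpy as np

tp, fp, fn = 80, 20, 40
precision = tp / (tp + fp)   # P = TP / (TP + FP) -> 0.8
recall = tp / (tp + fn)      # R = TP / (TP + FN) -> ~0.667

# AP: area under the precision-recall curve for one class (trapezoidal rule).
recall_curve = np.array([0.0, 0.2, 0.4, 0.6, 0.667])
precision_curve = np.array([1.0, 0.95, 0.9, 0.85, 0.8])
ap = np.trapz(precision_curve, recall_curve)

# mAP: mean of the per-class APs over N classes (10 classes in VisDrone2019).
ap_per_class = [ap] * 10
map_50 = float(np.mean(ap_per_class))
print(precision, recall, ap, map_50)
```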

4.4. Ablation Experiments

To demonstrate the efficacy of the introduced components, this study performed ablation tests on the VisDrone2019 dataset using YOLOv7 as the baseline algorithm. These experiments measured mean average precision (mAP), parameter count, and frames per second (FPS) to gauge the performance enhancements. The outcomes are summarized in Table 2 below.
Seven sets of ablation experiments were performed under equivalent conditions, as detailed below in Table 2:
  • The first set of experiments uses the baseline model, i.e., the original YOLOv7 algorithm, as a reference; it achieves a mAP of 50.47% on the VisDrone2019 dataset;

  • The second group replaces the ELAN of the benchmark model with the improved ELAN_DSC; the number of parameters increases by 16.32M, but compared with the benchmark model, the mAP improves by 3.4%, the precision by 2.6%, and the recall by 3.5%. The main reason for the improvement is that the targets are fine structures that occupy a very small proportion of the overall image, are composed of few pixels, and are easily interfered with by the complex background, which makes it difficult for the model to identify their subtle changes; adding dynamic snake convolution to ELAN allows the network to focus effectively on slender, curved targets, thus improving detection performance. Since dynamic snake convolution has better segmentation performance and higher complexity than ordinary convolution, the number of parameters of the improved module rises compared with the original model;

  • The third group replaces the original structure with the improved SPPF module, which improves the mAP by 1.3%, the precision by 1.1%, and the recall by 1.2% over the baseline model while reducing the number of parameters. The improvements in mAP, P, and R and the reduction in parameters are explained as follows: the improved SPPF module applies Maxpool operations with kernels of different sizes to distinguish different targets, enlarge the receptive field, and extract more important feature information, so mAP, P, and R improve; because the module applies the different kernel sizes serially rather than in parallel, the model complexity is reduced and the number of parameters decreases (a minimal sketch of such a serial pooling block is given after this list);

  • The fourth group replaces the loss function with EIoU; the mAP improves by 0.43%, the precision by 0.5%, and the recall by 0.6%, while the number of parameters is unchanged from the baseline model because the network structure itself is unaltered. The detection performance improves because the EIoU loss directly minimizes the differences in height and width between the target box and the anchor box, which yields faster convergence and better localization (a sketch of this loss is also given after this list);

  • Across the second through seventh sets of ablation experiments, the introduction of DSCNet provided the key improvement, contributing a 3.4% gain in mAP, as shown in Figure 7.
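The serial pooling idea behind the improved SPPF can be sketched as follows; the exact kernel sizes, channel widths, and activation layers of the paper's module are assumptions, chosen only to illustrate how serial Maxpool operations with different kernels enlarge the receptive field while keeping the block lightweight:

```python
# Hedged sketch of a serial spatial pyramid pooling block; not the paper's
# exact improved SPPF, whose internal dimensions are not specified here.
import torch
import torch.nn as nn


class SerialSPPF(nn.Module):
    def __init__(self, in_ch, out_ch, kernels=(5, 9, 13)):
        super().__init__()
        hid = in_ch // 2
        self.cv1 = nn.Conv2d(in_ch, hid, 1)
        # Stride-1, padded max-pools: spatial size is preserved while the
        # effective receptive field grows with each successive kernel.
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels
        )
        self.cv2 = nn.Conv2d(hid * (len(kernels) + 1), out_ch, 1)

    def forward(self, x):
        y = [self.cv1(x)]
        for pool in self.pools:
            y.append(pool(y[-1]))  # each pool runs on the previous output (serial, not parallel)
        return self.cv2(torch.cat(y, dim=1))


# Shape check: a 20x20 feature map keeps its resolution through the block.
out = SerialSPPF(1024, 512)(torch.randn(1, 1024, 20, 20))
print(out.shape)  # torch.Size([1, 512, 20, 20])
```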
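Similarly, a hedged sketch of the EIoU loss used in the fourth ablation, following the published EIoU formulation (an IoU term, a center-distance term, and separate width and height penalties); the box format and numerical constants are assumptions:

```python
# Sketch of the EIoU bounding-box loss. Boxes are (x1, y1, x2, y2) tensors.
import torch


def eiou_loss(pred, target, eps=1e-7):
    # Intersection and union -> IoU.
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = inter_w * inter_h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Smallest enclosing box: its diagonal, width, and height normalize the penalties.
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Center-distance penalty plus separate width and height penalties.
    dx = (pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) / 2
    dy = (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) / 2
    rho2 = dx ** 2 + dy ** 2
    dw2 = ((pred[:, 2] - pred[:, 0]) - (target[:, 2] - target[:, 0])) ** 2
    dh2 = ((pred[:, 3] - pred[:, 1]) - (target[:, 3] - target[:, 1])) ** 2

    return 1 - iou + rho2 / c2 + dw2 / (cw ** 2 + eps) + dh2 / (ch ** 2 + eps)


# Example: one predicted box against one ground-truth box.
print(eiou_loss(torch.tensor([[10., 10., 50., 60.]]), torch.tensor([[12., 8., 48., 62.]])))
```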

The following figure visualizes the change in mAP during training for the final improved algorithm and the benchmark model. The original algorithm's mAP rises rapidly during the first 50 epochs, increases slowly from the 50th epoch until it reaches its final value at around 150 epochs, and then converges. In comparison, the improved algorithm's mAP rises rapidly during the first 30 epochs, increases slowly from the 30th epoch to around the 90th epoch, and then converges. The improved algorithm clearly converges faster, and its mAP is 4.33% higher than that of the benchmark algorithm.

4.5. Comparative Experiments

Various UAV aerial-image target detection algorithms, such as YOLOv4, YOLOv3-LITE, YOLOv5s, Faster RCNN, and DMNet, are selected for comparison with the improved algorithm of this study on the VisDrone2019 test set. As shown in Table 3, the proposed algorithm improves mAP by 33.0% compared with Faster RCNN, 11.6% compared with YOLOv4, 24.4% compared with DMNet, and 23.6% compared with YOLOv5s. The algorithm not only improves significantly in mAP over mainstream target detection algorithms but also achieves markedly higher per-category AP; for example, car detection accuracy reaches 82.4%, van detection accuracy reaches 58.6%, and truck detection accuracy reaches 51.7%, higher than those of the other target detection algorithms. The experiments illustrate the effectiveness and practicality of this algorithm for detecting weak and small targets in UAV aerial images.
Additionally, in order to reflect the advancement of this algorithm relative to the current state of YOLOv7-based methods in the UAV field, YOLOv7-UAV [20], PDWT-YOLO [21], and an improved YOLOv7 algorithm [22] are selected for comparison. As shown in Table 4, YOLOv7-UAV is superior to the proposed algorithm in terms of parameter count, but the proposed algorithm is superior to PDWT-YOLO and the improved YOLOv7 algorithm on this index. In addition, the proposed algorithm outperforms all of the above algorithms on the mAP metric, reaching 54.7% on the VisDrone2019 dataset.

4.6. Analysis of Detection Effects

UAV aerial images from different complex scenes in the VisDrone2019 test set are selected for detection, and the detection results are shown in Figure 8. The proposed algorithm attenuates the interference of trees and buildings in complex backgrounds and correctly segments and localizes small targets in such scenes, which shows that it maintains good detection performance in real scenes with varying lighting conditions, backgrounds, and target distributions. The confidence threshold is set to 0.25; detections below this threshold are not displayed.
To evaluate the detection performance on UAV aerial images, images of very small targets, dark scenes, occluded targets, and complex backgrounds were randomly selected from the VisDrone2019 test-challenge set and compared against the original algorithm in Figures 9–16.

5. Conclusions

In this study, the problem of detecting small targets in complex backgrounds, which makes UAV ground target detection difficult, is successfully addressed by introducing dynamic snake convolution (DSC) and an improved SPPCSPC into the YOLOv7 model and by employing the EIoU loss function. In experiments, the improved algorithm shows excellent detection results in different aerial photography scenes and achieves the best detection results in all nine categories, proving its strong practicality and effectiveness. Considering the similarity in processing requirements between satellite image analysis and UAV aerial images, especially in target detection, handling of background complexity, and small-target recognition, we believe the method is also applicable to satellite image analysis. Satellite images are commonly used in geographic information systems (GIS), environmental monitoring, urban planning, and disaster management, where accurate detection and classification of small targets are also crucial. Although the resolution and scale of satellite images may differ from those of UAV images, the improved algorithm proposed in this study can still play an important role in satellite image processing with appropriate parameter adjustments or slight modifications.
