Accurate Recognition of Jujube Tree Trunks Based on Contrast Limited Adaptive Histogram Equalization Image Enhancement and Improved YOLOv8


3.2. YOLOv8 Improvement of Backbone Network GhostNetv2

Traditional convolutional neural networks suffer from redundant feature information, large numbers of model parameters, and expensive training computation. The backbone network of YOLOv8 is based on the CSPDarknet53 structure, which employs a large number of convolutional layers to extract features from the input image. Although this structure has a strong feature-extraction ability, in the practical jujube-garden application the image size, resolution, and frame rate are high, and, because of equipment limitations, the traditional network structure can often only run at a very low batch size, so its speed cannot meet the requirements.

The feature maps in traditional deep neural networks usually contain rich, even redundant, information. GhostNet reduces the computational complexity of deep neural networks by introducing the Ghost module, which generates more features using fewer parameters [34]. The ordinary convolution and the Ghost module convolution process are shown in Figure 7. Specifically, the Ghost module divides the original convolution layer into two parts: the first part consists of ordinary convolution, and the second part generates more feature maps from the first through cheap linear transformations, denoted "φ" in the figure. When the two parts are finally concatenated, the parameters and computational complexity are reduced without changing the size of the output feature maps.
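The two-part structure described above can be sketched in PyTorch as follows. This is a minimal illustrative implementation, not the exact configuration used in the paper: the split ratio, kernel sizes, and layer choices here are assumptions, with depthwise convolution standing in for the cheap linear transformation "φ".

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Sketch of a Ghost module: an ordinary convolution produces a few
    "intrinsic" feature maps, cheap depthwise convolutions generate the
    remaining "ghost" maps, and the two sets are concatenated."""
    def __init__(self, in_ch, out_ch, ratio=2, dw_kernel=3):
        super().__init__()
        init_ch = out_ch // ratio          # intrinsic maps from ordinary conv
        cheap_ch = out_ch - init_ch        # ghost maps from cheap operations
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, 1, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(        # depthwise conv = cheap linear op
            nn.Conv2d(init_ch, cheap_ch, dw_kernel, padding=dw_kernel // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(cheap_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)
        # Concatenation leaves the spatial size of the output unchanged.
        return torch.cat([y, self.cheap(y)], dim=1)
```

With a split ratio of 2, half of the output channels come from the ordinary convolution and half from the cheap operations, which is where the parameter and computation savings arise.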
The lightweight GhostNet structure still has problems: although the Ghost module reduces the computational cost, the representation capability is also necessarily reduced. GhostNetv2 combines the Ghost module with the DFC attention module and extracts information from different viewpoints in parallel. The Ghost module reduces computation by generating feature maps through cheap operations [35], while the DFC attention module reduces computation by decreasing the size of the features before the attention is computed, as shown in Figure 8.
The GhostNetv2 bottleneck structure, shown in Figure 9, is composed of a Ghost module and a DFC attention module. The DFC attention module augments the output features of the Ghost module to capture the long-range dependencies between pixels at different spatial positions. Compared with the Self-Attention model [36], the computation of the DFC attention module is simpler and more intuitive: the FC layer directly generates the attention map, computed as follows:

$a_{hw} = \sum_{h', w'} F_{h,w,h'w'} \odot z_{h'w'}$,

where $F$ denotes the learnable weights in the FC layer and $\odot$ denotes element-wise multiplication. Given the feature $Z \in \mathbb{R}^{H \times W \times C}$, it can be regarded as $HW$ tokens $z_i \in \mathbb{R}^{C}$, and $A = \{a_{11}, a_{12}, \ldots, a_{HW}\}$ is the generated attention map.

The DFC attention module significantly improves the expressive power of GhostNet by decomposing the fully connected layer into horizontal and vertical fully connected layers; that is, the input features are aggregated along the horizontal and vertical directions, respectively, in order to capture long-range dependencies along these two directions. The expressions for the horizontal direction $F^H$ and the vertical direction $F^W$ in the DFC attention computation are as follows:

$a'_{hw} = \sum_{h'=1}^{H} F^{H}_{h,h'w} \odot z_{h'w}, \quad h = 1, 2, \ldots, H, \ w = 1, 2, \ldots, W$,

$a_{hw} = \sum_{w'=1}^{W} F^{W}_{w,hw'} \odot a'_{hw'}, \quad h = 1, 2, \ldots, H, \ w = 1, 2, \ldots, W$,

In this paper, the network structure of YOLOv8 is improved through replacement. In a fully connected layer, each pixel is directly connected to all other pixels, resulting in a computational complexity of O(H²W²), which is very high on high-resolution images. In contrast, DFC attention decomposes the fully connected layer into horizontal and vertical fully connected layers, reducing the complexity to O(H²W + HW²) and greatly reducing the computation. In addition, the features are down-sampled by average pooling before attention is computed and then restored to their original size by bilinear interpolation. This design gives GhostNetv2 a larger receptive field, so it better captures the long-distance dependencies between different locations in the image and improves the expressive power of the model.
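The decomposition and the down-sample/up-sample trick can be sketched together. The following is an illustrative approximation under stated assumptions: following common GhostNetv2 implementations, the horizontal and vertical fully connected layers are realized as depthwise convolutions with kernels (1, K) and (K, 1); the kernel size K = 5 and the 2× pooling factor are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DFCAttention(nn.Module):
    """Sketch of decomposed fully connected (DFC) attention: the dense
    attention map is approximated by a horizontal pass followed by a
    vertical pass over a down-sampled copy of the features."""
    def __init__(self, channels, k=5):
        super().__init__()
        # Depthwise (1,K) and (K,1) convs play the role of the horizontal
        # and vertical fully connected layers.
        self.horizontal = nn.Conv2d(channels, channels, (1, k),
                                    padding=(0, k // 2), groups=channels, bias=False)
        self.vertical = nn.Conv2d(channels, channels, (k, 1),
                                  padding=(k // 2, 0), groups=channels, bias=False)

    def forward(self, x):
        h, w = x.shape[-2:]
        a = F.avg_pool2d(x, 2)                 # down-sample to cut computation
        a = self.vertical(self.horizontal(a))  # O(H^2 W + H W^2) vs. O(H^2 W^2)
        a = torch.sigmoid(a)
        # Bilinear interpolation restores the attention map to full size.
        a = F.interpolate(a, size=(h, w), mode="bilinear", align_corners=False)
        return x * a                           # re-weight the input features
```

Because each position attends along its row and column rather than to every pixel, the cost grows with H + W per position instead of H × W.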

Part of the Conv modules are replaced with GhostConv, and the Conv and Bottleneck in C2f are replaced with GhostConv and GhostBottleneck; the improved C2fGhost structure is shown in Figure 10. Because the size of the output feature map is unchanged, the number of parameters and the computational complexity are reduced and the detection speed is improved.

3.3. YOLOv8 Improvement of the CA_H Attention Mechanism

In this paper, the object detection region of interest is the trunk area of the jujube trees, and we tried to keep the height and speed of the UAV as stable as possible during trunk image acquisition. The object region in the acquired images therefore has the following characteristics: first, it lies in the lower part of the image and its position is relatively fixed; second, the object moves across the image at a uniform speed. Given these characteristics, it is necessary to add an attention mechanism tailored to them and to suppress the accuracy degradation caused by lightweighting.

The base attention mechanism chosen in this paper is coordinate attention (CA), whose most significant feature is that it embeds position information into the design of the mobile network. The core algorithm of the CA attention mechanism is divided into two steps: coordinate information embedding and coordinate attention generation.

Coordinate information embedding comprises the X Avg Pool and Y Avg Pool operations. Specifically, the input feature maps first undergo two one-dimensional global pooling operations, using pooling kernels of dimensions (H, 1) and (1, W) to aggregate the features along the vertical and horizontal directions, yielding two direction-aware feature maps. The computational formulae are as follows:

$z_c^h(h) = \dfrac{1}{W} \sum_{0 \le i < W} x_c(h, i)$,

$z_c^w(w) = \dfrac{1}{H} \sum_{0 \le j < H} x_c(j, w)$,

where, given the input $x$, $c$ denotes the channel index, $z_c^h(h)$ denotes the output of the $c$-th channel at height $h$, and $z_c^w(w)$ denotes the output of the $c$-th channel at width $w$.
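The two directional poolings reduce to simple averages over one spatial axis each, which can be sketched directly (the tensor sizes below are illustrative assumptions):

```python
import torch

# Sketch of coordinate information embedding: two 1-D global poolings
# producing direction-aware descriptors z^h and z^w.
x = torch.randn(1, 8, 16, 12)          # (N, C, H, W); sizes are illustrative
z_h = x.mean(dim=3, keepdim=True)      # average over width  -> (N, C, H, 1)
z_w = x.mean(dim=2, keepdim=True)      # average over height -> (N, C, 1, W)
```

Each descriptor retains exact position information along one axis while summarizing the other, which is what later allows the attention weights to encode coordinates.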

Coordinate attention generation is the remaining process in the structure diagram. Specifically, the two feature maps from the Concat stage are first transformed with a shared 1 × 1 convolution that reduces the channel dimension to C/r, where r denotes the reduction ratio [37], generating the feature maps $F^h$ and $F^w$; the intermediate feature map produced by the convolution is then sliced and normalized, and passed through the nonlinear activation function δ to obtain the intermediate feature map $f$, which encodes spatial information in the horizontal and vertical directions and is calculated as follows:

$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right)$,

Then, the intermediate feature map $f$ is split and convolved to generate feature maps with the same number of channels as the input. The CA module finally uses the Sigmoid activation function; in this paper, it is replaced with the H-Sigmoid activation function to form the new attention mechanism CA_H, whose structure is shown in Figure 10. With this improvement, the attention weight $g^h$ in the height direction and the attention weight $g^w$ in the width direction of the feature map are computed as

$g^h = \sigma\left(F_h\left(f^h\right)\right)$, $\quad g^w = \sigma\left(F_w\left(f^w\right)\right)$,

where $f^h$ and $f^w$ are the two slices of $f$, $F_h$ and $F_w$ are 1 × 1 convolutions, and $\sigma$ is the H-Sigmoid activation.
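The motivation for the H-Sigmoid replacement is that it is a piecewise-linear approximation of the Sigmoid, avoiding the exponential. A minimal sketch of the two functions (pure Python, values chosen for illustration):

```python
import math

def sigmoid(x):
    """Standard logistic sigmoid, requires an exponential."""
    return 1.0 / (1.0 + math.exp(-x))

def h_sigmoid(x):
    """Hard sigmoid: ReLU6(x + 3) / 6, a cheap piecewise-linear approximation."""
    return min(max(x + 3.0, 0.0), 6.0) / 6.0

for v in (-4.0, 0.0, 4.0):
    print(v, round(sigmoid(v), 3), round(h_sigmoid(v), 3))
```

Both functions agree at 0 (value 0.5) and saturate toward 0 and 1, but H-Sigmoid reaches the saturation values exactly at x = -3 and x = 3, with only an add, a clamp, and a divide per element.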
Finally, the output $y_c$, carrying the attention weights in the height and width directions, is obtained by multiplicative weighting with the following equation:

$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$
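The per-pixel weighting above is a pair of broadcast multiplications, since $g^h$ varies only with height and $g^w$ only with width. A sketch with illustrative tensor sizes:

```python
import torch

# Sketch of y_c(i, j) = x_c(i, j) * g_c^h(i) * g_c^w(j):
# the two directional weights broadcast over width and height respectively.
x   = torch.rand(1, 4, 8, 6)      # (N, C, H, W); sizes are illustrative
g_h = torch.rand(1, 4, 8, 1)      # height-direction attention weights
g_w = torch.rand(1, 4, 1, 6)      # width-direction attention weights
y = x * g_h * g_w                 # broadcasting applies both weights per pixel
```

Because the weights are stored as H + W values per channel rather than H × W, applying them is much cheaper than a full 2-D attention map.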

This attention mechanism decomposes channel attention into two one-dimensional feature encoding processes that aggregate features along the horizontal and vertical directions, as shown in Figure 11. This captures long-range dependencies while preserving precise positional information, helping the model more accurately localize and identify the object detection regions that require more attention.
The position information is embedded into the channel attention, taking both spatial and channel attention into account. The improved CA_H attention mechanism is added at the end of the Neck part of the YOLOv8 network architecture, i.e., at the intermediate position after the Concat stage of each up-sampling step and before the Head detection layer, following the multi-scale fusion of the three effective feature layers of the FPN and the PAN. The overall network structure of the trunk detection model based on the improved YOLOv8 is shown in Figure 12. This attention embedding design filters and analyzes a large amount of input feature information in both the channel and spatial dimensions, enhancing the impact of important features and yielding performance gains [38].
