Using Computer Vision to Collect Information on Cycling and Hiking Trails Users

This section presents a performance comparison between the most promising models identified above: YOLOv3-Tiny, MobileNet-SSD V2, and Faster R-CNN with ResNet-50. These models are evaluated for detecting pedestrians, cyclists, and other users on walking, hiking, and cycling trails and routes. Firstly, the dataset created for this work is presented. Secondly, the benchmark scenario and the performance metrics are described. Thirdly, the results are presented, analyzed, and discussed.

4.1. Dataset Description

After extensive research, it was decided to create a new dataset of images allowing joint and individual detection of people, motorcycles, and bicycles. This dataset is used to train and validate the models under analysis. Various images were selected from [48,49]; neither website requires special licenses for the use of its images. In addition to these images, some proprietary images were captured in an environment resembling the real context of this work. These images were taken with an iPhone 13 in vertical orientation, from two different positions: some at ground level and others about 1 m above the ground. The final dataset is available in [50]. It is categorized into three classes: persons, bicycles, and motorcycles. It consists of 440 images, organized into three subsets: 70% for training (309 images), 20% for validation (89 images), and 10% for testing (42 images). The discrepancy in the number of images between the person class and the other two classes is due to the variety of accessories that people can carry, such as backpacks, hiking poles, overcoats, and raincoats. For the object detection model to be able to recognize these accessories, a greater number of images of this class was required. Table 2 shows the number of images per set and per class in the dataset.
The Roboflow tool [10] was used for the labeling process [51], where annotations are created for each image. The goal of this phase is to indicate to the model the location of the objects in the images, as well as identify their class. Figure 17 illustrates the process of creating image annotations with Roboflow, while Figure 18 shows the images already annotated.
Images captured in adverse weather and lighting conditions were annotated and incorporated into the dataset, as shown in Figure 19. These images enrich the diversity of the dataset, giving the detection model the ability to deal effectively with different real-world scenarios.

4.2. Benchmark Scenario

The Google Colab platform [52], which provides computing resources, was used to train the models. To train the three models, a machine with the following characteristics was allocated: an NVIDIA T4 graphics card with 16 GB of VRAM and 13 GB of RAM. With a free Google Colab subscription, usage time on this platform is limited. As the graphics card is the most important piece of hardware in the model training process [53], this subscription only allows approximately 6 h of GPU use. This number of hours is not exact, nor is it disclosed by Google, as the usage limits are dynamic and vary according to how the platform is used [54]. Therefore, the model was saved to Google Drive [55] at the end of each training epoch. It was also necessary to use a second Google account to split the YOLOv3-Tiny training process across sessions.
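To reduce the risk of losing progress when a Colab session expires, checkpoints can be copied to Google Drive directly from the notebook. The following minimal sketch illustrates this idea; the mount point, folder, and file names are placeholders rather than the exact paths used in this work.

```python
# Minimal sketch: persist training checkpoints to Google Drive from a Colab session.
# The folder and file names below are illustrative, not the exact paths used in this work.
import os
import shutil
from google.colab import drive

drive.mount('/content/drive')  # requests authorization on first use

def backup_checkpoint(local_path, drive_dir='/content/drive/MyDrive/yolov3_tiny_backups'):
    """Copy the latest weights file to Google Drive so that training can resume
    after the Colab runtime is reclaimed."""
    os.makedirs(drive_dir, exist_ok=True)
    shutil.copy(local_path, drive_dir)

# Example call after each saved checkpoint (file name is hypothetical):
# backup_checkpoint('backup/yolov3-tiny-custom_last.weights')
```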
To train the YOLOv3-Tiny model, an existing Google Colab notebook created by Roboflow was used, adapted from YOLOv4 to YOLOv3-Tiny [56]. On the machine provided, the NVIDIA CUDA Compiler Driver [57] was installed to associate the graphics processing unit (GPU) hardware with the execution. The Darknet framework was also installed; it is an open-source neural network framework implemented in C and CUDA, recognized for its speed, ease of installation, and support for computation on both the central processing unit (CPU) and the GPU [58]. Next, the YOLOv3-Tiny base weights [59] were downloaded in YOLO Darknet format. With the help of the Roboflow library for Python [60], it was easy to download and prepare the dataset for the framework, as shown in Figure 20.
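A sketch of this download step, similar to the one shown in Figure 20, is given below. It is illustrative only: the API key, workspace, project, and version identifiers are placeholders, not the values used in this work.

```python
# Illustrative sketch of downloading the dataset with the Roboflow Python library.
# The API key, workspace, project, and version identifiers are placeholders.
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("your-workspace").project("your-project")
dataset = project.version(1).download("darknet")  # YOLO Darknet format for this model

print(dataset.location)  # local folder containing the images and label files
```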
After this stage, the images and labels were placed in the correct directories, and a training configuration file adjusted to the dataset was built. The main change made to this file was the maximum number of training iterations (max_batches), which was set to 6000. The value recommended by the notebook follows the instructions in the Darknet repository, which state that max_batches should be defined as the number of classes multiplied by 2000, while ensuring that it is not less than the number of training images and not less than 6000 [61].
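As a worked example of this rule, the recommended value for this dataset (3 classes, 309 training images) can be computed as follows; the helper function is purely illustrative.

```python
# Worked example of the Darknet rule for max_batches described above:
# classes * 2000, but never below the number of training images and never below 6000.
def recommended_max_batches(num_classes, num_train_images):
    return max(num_classes * 2000, num_train_images, 6000)

# For this dataset: 3 classes and 309 training images.
print(recommended_max_batches(3, 309))  # -> 6000, the value used in the configuration file
```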
In the next phase, the training was carried out using the command shown in Figure 21, where Darknet took control of the process.
Once the training was complete, it was possible to analyze the first 1000 iterations, followed by analysis at each increment of 100 subsequent iterations. These analyses were extracted to a results.log file that reported, among other information, the current training iteration number, the average training loss (loss), the mean average precision of detection (mAP), the best mAP achieved so far (best), the time remaining to complete training based on current progress, details of the average loss (avg loss), and the total number of images processed so far (images). Figure 22 illustrates this analysis in the file, at iterations 1299 and 1300 of the training.
The MobileNet-SSD V2 model was implemented using the TensorFlow framework in a Google Colab notebook [62], which was later copied and adapted to our case [63]. Initially, the dependencies for TensorFlow object detection were installed, guaranteeing the availability of the libraries needed for the process. Next, to prepare the training data, the dataset was downloaded from the Roboflow platform in Pascal VOC and TFRecord formats. One step of this notebook converts the images and their XML annotation files, in Pascal VOC format, into the TFRecord format used by TensorFlow during training. In addition, tests were carried out to evaluate the performance of the trained model, also using test images extracted from the dataset in Pascal VOC format. The use of the Roboflow platform considerably simplified this task.
In the training configuration, the MobileNet-SSD V2 model to be used was selected from the TensorFlow 2 Object Detection Model Zoo [64]. At this stage, training hyperparameters were also specified, such as the number of training steps (num_steps) and the batch_size, which represents the number of images used per training step. In this case, 6000 steps and a batch size of 16 were selected, respectively. Every 100 steps, the time taken and metrics such as classification_loss, localization_loss, regularization_loss, total_loss, and learning_rate were displayed. Figure 23 provides a detailed view of these outputs, focusing particularly on step 5900.
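In notebooks of this kind, these hyperparameters are typically written into the pipeline.config file of the selected model. The sketch below shows one common way of doing this by simple text substitution; the file path is a placeholder, and this is not necessarily the exact code of the notebook used here.

```python
# Illustrative sketch: set num_steps and batch_size in a TF2 Object Detection API
# pipeline.config by text substitution. The file path is a placeholder.
import re

pipeline_path = 'models/mobilenet_ssd_v2/pipeline.config'
num_steps, batch_size = 6000, 16

with open(pipeline_path) as f:
    config_text = f.read()

config_text = re.sub(r'num_steps: \d+', f'num_steps: {num_steps}', config_text)
config_text = re.sub(r'batch_size: \d+', f'batch_size: {batch_size}', config_text)

with open(pipeline_path, 'w') as f:
    f.write(config_text)
```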

Subsequently, the model resulting from the training was converted into the TensorFlow Lite format, a version optimized for devices with low computing power. Still in the notebook, the model was tested on 10 test images before the mAP was calculated to assess the model’s effectiveness in terms of mean average precision.
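A minimal sketch of this conversion step is shown below, assuming the trained model has already been exported as a TensorFlow SavedModel suitable for TensorFlow Lite conversion; the directory and output file names are placeholders.

```python
# Minimal sketch: convert an exported SavedModel to TensorFlow Lite.
# The directory and output file names are placeholders.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('exported_model/saved_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional size/latency optimization
tflite_model = converter.convert()

with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
```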

The Faster R-CNN with ResNet-50 model was implemented in a notebook based on [65], using the PyTorch framework [66] for training. The repository created in [65] was cloned, and the Roboflow library [60] was then used to download the dataset in Pascal VOC format. Next, the training configuration file was prepared to assign the paths of the training and test images and labels. This file, shown in Figure 24, also defines the classes present in the dataset and their IDs.
Next, the “train.py” file needs to be executed to start training the model. As can be seen in Figure 25, several input parameters need to be set: the location of the training configuration file, the number of epochs to train for, the model to be used, the output name for the trained model, and the batch size.
The model selected was “fasterrcnn_resnet_50_fpn_v2”, the output name was “custom_training”, and the batch size value was 8. For each training epoch, the metrics loss, loss_classifier, loss_box_reg, loss_objectness, loss_rpn_box_reg, and execution time are displayed, as shown in Figure 26.
At the end of each epoch, the model was also evaluated, as shown in Figure 27, producing a summary of validation metrics such as Average Precision and Average Recall. The mAP is also displayed whenever these results improve on those obtained previously.
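The model name selected above corresponds to the torchvision implementation of Faster R-CNN with a ResNet-50 FPN v2 backbone. As an illustrative sketch (not the exact code of the repository in [65]), an equivalent model can be instantiated and adapted to the three dataset classes plus background as follows.

```python
# Illustrative sketch (not the exact code of the repository in [65]): instantiate
# Faster R-CNN with a ResNet-50 FPN v2 backbone in torchvision and adapt its
# detection head to the dataset classes (person, bicycle, motorcycle) plus background.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 4  # 3 classes + background
model = torchvision.models.detection.fasterrcnn_resnet50_fpn_v2(weights="DEFAULT")

# Replace the classification head so it predicts the correct number of classes.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
```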

The tests were then carried out on a device with an Intel Core i7-11370H CPU, 16 GB of RAM, and an NVIDIA RTX 3050 GPU. All the models were tested on this device, allowing a fair comparison between models in terms of accuracy, processing speed, and model efficiency.

4.3. Performance Metrics

In order to assess the effectiveness of models in detecting and classifying objects, it is essential to understand the metrics associated with these methods. AP (average precision) is a commonly used metric in binary classification and information retrieval. It serves to summarize the precision–recall curve and provides a single numerical value that characterizes the quality of retrieval results for a given class or category. This is particularly relevant in tasks like object detection [67]. It is calculated according to Equation (1).

AP = \sum_{k=0}^{n-1} \left[ \text{Recalls}(k) - \text{Recalls}(k+1) \right] \cdot \text{Precisions}(k)
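As a small numerical illustration of Equation (1), the summation can be evaluated over a set of precision–recall pairs; the values below are invented purely for illustration and are not results from this work.

```python
# Numerical illustration of Equation (1). The precision/recall pairs are invented
# purely for illustration and are not results from this work.
recalls = [1.0, 0.8, 0.6, 0.4, 0.0]      # Recalls(k), decreasing
precisions = [0.5, 0.6, 0.7, 0.9, 1.0]   # Precisions(k)

ap = sum((recalls[k] - recalls[k + 1]) * precisions[k] for k in range(len(recalls) - 1))
print(round(ap, 2))  # -> 0.72
```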

One of the most important and widely used metrics is the mAP (mean average precision). It summarizes the model’s detection effectiveness by averaging the AP values obtained for each class. The mAP is an important metric because it is insensitive to the size of the objects, allowing effective comparison of models. It is calculated using Equation (2), as the average of the APs of all the classes considered.

mAP = \frac{1}{k} \sum_{i=1}^{k} AP_i
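For illustration, with hypothetical per-class AP values the computation of Equation (2) reduces to a simple average.

```python
# Equation (2): mAP is the mean of the per-class AP values.
# The AP values below are hypothetical, used only to illustrate the computation.
ap_per_class = {"person": 0.82, "bicycle": 0.74, "motorcycle": 0.69}
map_value = sum(ap_per_class.values()) / len(ap_per_class)
print(round(map_value, 2))  # -> 0.75
```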

The F1-Score is a metric commonly employed to evaluate a model’s overall detection performance, combining both precision and recall into a single measure. Consequently, it offers an assessment of a model’s performance across varying levels of precision and recall. It is calculated from true positives (TP), false positives (FP), and false negatives (FN), as delineated in Equation (3).

F1\text{-}score = \frac{TP}{TP + \frac{1}{2}(FP + FN)}

Intersection over union (IoU) is also used to assess the accuracy of a model. It is calculated as the ratio between the area of intersection of the detection and the reference (ground-truth) rectangle and the area of their union. It is also a metric that is independent of the size of the objects, which makes it useful for comparing models of different sizes.

The loss metric represents the collective error in the model’s predictions, calculated as the difference between the model’s output and the desired value. The loss_classifier measures the discrepancy between the predicted classes and the actual classes of the objects in the image, quantifying how far off the model’s class predictions are and encouraging the model to adjust its parameters to minimize this discrepancy. The loss_box_reg refers to the localization (bounding box) loss, assessing the discrepancy between the coordinates of the bounding box predicted by the model and the actual coordinates of the object’s bounding box in the image. The loss_rpn_box_reg specifies the loss associated with adjusting the bounding boxes generated by the Region Proposal Network (RPN). The loss_objectness assesses how well the region proposals distinguish objects from background. The overall loss is a weighted combination of these components, reflecting the model’s overall accuracy in locating and classifying objects.
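For clarity, a minimal sketch of the IoU computation for two axis-aligned bounding boxes in (x1, y1, x2, y2) format is shown below.

```python
# Minimal sketch: IoU between two axis-aligned boxes given as (x1, y1, x2, y2).
def iou(box_a, box_b):
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # -> ~0.143
```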

Overfitting [68] represents a significant challenge when training CNNs. It manifests when a model is overtrained to the point of memorizing specific details of the training data. As a result, the model demonstrates outstanding performance on the training data but fails when dealing with new data. To mitigate this problem, a balance must be struck between accurately capturing meaningful patterns and avoiding an overly complex model that adapts too closely to the training data. Figure 28 provides an illustration of the concept of overfitting.
Early stopping [69] is a common technique for preventing overfitting. Figure 29 illustrates the concept, showing the training error in blue and the validation error in red as a function of the training epochs. If the model continues learning after a certain point, the validation error increases while the training error continues to decrease. The aim is to find the right moment to stop training, avoiding both underfitting and overfitting.
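A framework-agnostic sketch of early stopping with a patience counter is given below; the training and validation routines are placeholders supplied by the caller, not part of the original pipeline.

```python
# Framework-agnostic sketch of early stopping with a patience counter.
# The training and validation routines are callables supplied by the caller.
def train_with_early_stopping(train_one_epoch, validation_loss, max_epochs=100, patience=5):
    """Stop training once the validation loss has not improved for `patience` epochs."""
    best, stale = float('inf'), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = validation_loss()
        if val_loss < best:
            best, stale = val_loss, 0   # improvement: reset the patience counter
            # save_checkpoint()         # keep the best model seen so far
        else:
            stale += 1
            if stale >= patience:
                print(f"Early stopping at epoch {epoch}")
                break
    return best
```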
