AI-Generated Text Detector for Arabic Language Using Encoder-Based Transformer Architecture

5.1. Large Dataset vs. Custom Dataset Content Variation

The performance disparities observed when the two models were applied to the two datasets can be attributed to several factors. First, the AIGT content of the large dataset contains some non-Arabic script introduced by translation. Moreover, the narrative structure within this dataset is generally limited to one or two paragraphs per HWT instance, resulting in a narrower range of writing styles than the custom dataset. In contrast, the custom dataset was meticulously handcrafted to encompass a diverse range of HWT patterns. Additionally, the large dataset’s AIGT is confined to translated versions of ChatGPT 3.5 outputs. This is a limitation, as it fails to capture the more advanced and varied writing patterns characteristic of GPT-4 and BARD outputs, which are critical for a robust evaluation of AIGT detection.

Furthermore, it is important to highlight that the HWT in the large dataset was obtained from Arabic-SQuAD [60], whose content is translated from the English version of SQuAD [49]. This translation aspect is crucial, as the quality of the translations inherently affects the dataset’s robustness. Because it relies on translated content, the large dataset may not fully capture the nuances and complexity of native Arabic text, which could limit the effectiveness of models trained on it at understanding and processing authentic Arabic language constructs, compared with a native dataset such as the custom dataset.

5.2. XLM-R vs. AraELECTRA Performance

The comparative analysis reveals that both the fine-tuned XLM-R and AraELECTRA models perform exceptionally well at classifying HWT and AIGT on the large and custom datasets. On the large dataset, which contains a mix of Arabic and non-Arabic vocabulary, the XLM-R model was expected to perform well owing to its capacity to recognize over 100 languages [51]. It correctly classified 4362 of 4369 examples, a slight improvement over the AraELECTRA model, which correctly identified 4352 examples and misclassified 17, as shown in Figure 14.
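
As a concrete illustration of this fine-tuning setup, the sketch below trains XLM-R as a binary HWT/AIGT classifier with Hugging Face Transformers. This is a minimal sketch, not the authors’ exact pipeline: the checkpoint name, toy examples, sequence length, and training arguments are illustrative assumptions.

```python
# Minimal sketch: fine-tune XLM-R as a binary HWT (0) vs. AIGT (1) classifier.
# Checkpoint, toy data, and hyperparameters are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)

# Hypothetical examples; the paper draws these from the large/custom datasets.
data = Dataset.from_dict({
    "text": ["...human-written Arabic paragraph...",
             "...model-generated Arabic paragraph..."],
    "label": [0, 1],
})
data = data.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                        padding="max_length", max_length=256),
                batched=True)

args = TrainingArguments(output_dir="xlmr-aigt-detector",
                         per_device_train_batch_size=32,
                         num_train_epochs=10)
Trainer(model=model, args=args, train_dataset=data).train()
```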

The high accuracy of the AraELECTRA model, even on a dataset mixing Arabic and non-Arabic vocabulary, can be attributed to several factors. First, the replaced token detection (RTD) technique, which distinguishes between real and replaced words within a text, is a core component of the ELECTRA training method and plays an important role in enhancing the model’s performance. In the RTD setup, a small generator network proposes words to replace tokens in a text, and the discriminator (the main model) must determine whether each word is original or was replaced by the generator. This approach is more efficient than traditional language modeling and encourages the model to learn finer distinctions in word choice and context, which is particularly beneficial for distinguishing AIGT from HWT. Second, having been trained explicitly on Arabic datasets, AraELECTRA is likely to have a more nuanced understanding of Arabic linguistic features, which contributes to its strong performance on predominantly Arabic datasets, even when mixed with other languages.
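
To make the RTD mechanism concrete, the sketch below runs an ELECTRA discriminator over a sentence and flags, token by token, which words it judges to have been replaced. It uses Hugging Face’s ElectraForPreTraining head; the AraELECTRA checkpoint id is an assumption based on the publicly released aubmindlab models, not something specified in the paper.

```python
# Sketch of ELECTRA's replaced token detection (RTD) head at inference time:
# the discriminator emits one logit per token; a positive logit means the
# token is judged to be a replacement rather than part of the original text.
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

name = "aubmindlab/araelectra-base-discriminator"  # assumed checkpoint id
tokenizer = ElectraTokenizerFast.from_pretrained(name)
discriminator = ElectraForPreTraining.from_pretrained(name)

sentence = "..."  # an Arabic sentence, possibly containing substituted words
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = discriminator(**inputs).logits  # shape: (1, sequence_length)

for token, flag in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
                       (logits > 0).squeeze(0).tolist()):
    print(token, "replaced" if flag else "original")
```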

This distinction becomes more evident when examining the custom dataset. While both models performed admirably, the AraELECTRA model achieved perfect accuracy, correctly classifying all instances, as shown in Figure 15a. In an experiment on the custom dataset, the XLM-R model was tested under the same hyperparameters as the AraELECTRA model and initially misclassified three instances, as shown in Figure 15b. Further adjustments were made by lowering both the initial and main learning rates; after these modifications, the model misclassified six instances, as shown in Figure 15c. However, this tuning improved the XLM-R model’s performance on other datasets during the inference phase, particularly on the custom and AIRABIC datasets, when operating with the lowered initial and default learning rates.

A major challenge in fine-tuning pre-trained models is preventing overfitting. Our study addressed this by carefully adjusting the learning rates for the AraELECTRA model during its application to the custom dataset. Specifically, the initial learning rate was reduced to 1 × 10⁻⁸, and the main learning rate was reduced to approximately 1.5 × 10⁻⁶. These adjustments enabled the model to be trained effectively for up to 30 epochs. Optimal performance was observed at epoch 27, where the model achieved its lowest loss value of 0.008, as illustrated in Figure 16.
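A minimal sketch of this learning-rate regime, assuming PyTorch’s AdamW optimizer and a LambdaLR schedule that ramps linearly from the initial rate to the main rate and then holds it, is shown below. The warmup length and the checkpoint id are assumptions; the paper specifies only the two rates and the epoch budget.

```python
# Minimal sketch of the low-learning-rate regime described above: linear
# warmup from an initial rate of 1e-8 to a main rate of 1.5e-6, then held
# constant. The checkpoint id and warmup length are assumptions.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "aubmindlab/araelectra-base-discriminator", num_labels=2)  # assumed id

initial_lr, main_lr = 1e-8, 1.5e-6
warmup_steps = 500                      # assumed; not stated in the paper
optimizer = torch.optim.AdamW(model.parameters(), lr=main_lr)

def lr_lambda(step):
    # Ramp linearly from initial_lr to main_lr, then hold at main_lr.
    if step < warmup_steps:
        frac = step / warmup_steps
        return (initial_lr + frac * (main_lr - initial_lr)) / main_lr
    return 1.0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Call optimizer.step() and then scheduler.step() once per training batch.
```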

In contrast, the XLM-R model exhibited a different pattern. Its validation loss started relatively low, at 0.1, even with the previously mentioned hyperparameters. Subsequent efforts to reduce the loss, such as increasing the dropout ratio to 0.2, 0.3, and then 0.5, did not significantly improve the results or mitigate the overfitting problem. Consequently, the XLM-R model appears particularly susceptible to overfitting. The most favorable results for the XLM-R model on the custom dataset were obtained by training for 10 epochs with the following hyperparameters: batch size 32 and a warmup-enhanced learning rate schedule with learning_rate: 3.2 × 10⁻⁵, initial_learning_rate: 1 × 10⁻⁸, and warmup_epochs: 2. The model’s best loss was obtained at epoch 6, with a validation loss of 0.0002. Beyond this point, the loss escalated, indicating increasing overfitting.
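
Because the validation loss bottomed out at epoch 6 and escalated afterwards, a best-epoch checkpointing loop of the following form captures how such a run can be rolled back to its optimum. This is a sketch: train_one_epoch, evaluate, optimizer, and val_loader are hypothetical placeholders for the usual training and validation steps.

```python
# Sketch: keep the weights from the epoch with the lowest validation loss
# (epoch 6 in the run above) instead of the final, overfit epochs.
# train_one_epoch, evaluate, optimizer, and val_loader are hypothetical.
import copy
import math

best_loss, best_state = math.inf, None
for epoch in range(10):                        # 10 epochs, as in the run above
    train_one_epoch(model, optimizer)          # hypothetical training helper
    val_loss = evaluate(model, val_loader)     # hypothetical validation helper
    if val_loss < best_loss:
        best_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)              # roll back to the best epoch
```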

In the experiments conducted on the large dataset, both the XLM-R and AraELECTRA models initially demonstrated low validation losses, starting at 0.03. Despite this promising start, the XLM-R model exhibited notable overfitting, more so than its counterpart. Extensive experimentation was conducted to optimize hyperparameters specifically for this dataset. Despite variability in the results, both models demonstrated proficiency in detecting AIGT and HWT, though validation losses fluctuated and eventually increased. Optimal results on this dataset were achieved by running the XLM-R model for ten epochs with a batch size of 64, an initial learning rate of 5 × 10⁻⁸, a learning rate of 3.2 × 10⁻⁶, and five warmup epochs. These hyperparameters yielded the most favorable outcomes during the inference phase on both the custom and AIRABIC datasets, though they remained less effective than runs trained on the custom dataset alone.
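
For cross-dataset inference checks of this kind, an evaluation helper along the following lines could be used. It is a sketch that assumes the fine-tuned Hugging Face model and tokenizer from the earlier sketches; loading the custom and AIRABIC examples is left out.

```python
# Sketch: score a fine-tuned detector on a held-out dataset (e.g. AIRABIC).
# Assumes a Hugging Face sequence-classification model and its tokenizer.
import torch

@torch.no_grad()
def accuracy(model, tokenizer, texts, labels, device="cpu"):
    model.eval().to(device)
    correct = 0
    for text, label in zip(texts, labels):
        enc = tokenizer(text, truncation=True, max_length=256,
                        return_tensors="pt").to(device)
        pred = model(**enc).logits.argmax(dim=-1).item()
        correct += int(pred == label)
    return correct / len(texts)

# e.g. accuracy(model, tokenizer, airabic_texts, airabic_labels)
```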

In contrast, AraELECTRA displayed superior performance during the inference phase on both the custom and AIRABIC datasets compared with XLM-R, suggesting greater robustness, although its results were less optimal than when it was trained solely on the custom dataset. This disparity underscores the crucial role of dataset quality and composition in model training: native datasets, with their inherent linguistic authenticity, often provide a more conducive training environment than datasets composed predominantly of translated material.

The robustness of each model becomes evident during the inference phase, when tested against various datasets. Notably, XLM-R performs poorly after training on the large dataset, whereas AraELECTRA achieves satisfactory results under the same conditions. However, when both models are trained on the custom dataset, each shows improved performance, particularly on the benchmark dataset, which was the measurement criterion of our study. More details follow in the next section.
