AI-Generated Text Detector for Arabic Language Using Encoder-Based Transformer Architecture
5.1. Large Dataset vs. Custom Dataset Content Variation
The performance disparities observed when the two models were applied to the two datasets can be attributed to several factors. First, the AIGT content of the large dataset contains some non-Arabic script as a by-product of translation. Moreover, the narrative structure within this dataset is generally limited to one or two paragraphs per HWT instance, resulting in a narrower range of writing styles than the custom dataset. In contrast, the custom dataset was meticulously handcrafted to encompass a diverse range of HWT patterns. Additionally, the large dataset's AIGT is confined to translated versions of ChatGPT 3.5 outputs. This presents a limitation, as it fails to capture the more advanced and varied writing patterns characteristic of GPT-4 and BARD outputs, which are critical for a robust evaluation of AIGT detection.
5.2. XLM-R vs. AraELECTRA Performance
The high accuracy of the AraELECTRA model, even on a dataset mixing Arabic and non-Arabic vocabulary, can be attributed to several factors. First, the replaced token detection (RTD) technique, a core component of the ELECTRA training method, distinguishes between original and replaced tokens within a text and plays an important role in enhancing the model's performance. In the RTD setup, a small generator network proposes words to replace tokens in a text, and the discriminator (the main model) must determine whether each word is original or was replaced by the generator. This approach is more efficient than traditional language modeling and encourages the model to learn finer distinctions in word choice and context, which is particularly beneficial for distinguishing between AIGT and HWT. Second, because AraELECTRA was trained explicitly on Arabic datasets, it likely has a more nuanced understanding of Arabic linguistic features, which contributes to its strong performance on datasets dominated by Arabic text, even when mixed with other languages.
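As a rough illustration of how RTD training pairs are formed, the sketch below builds corrupted token sequences with per-position original/replaced labels. This is a minimal plain-Python sketch: the dictionary-based substitution is a toy stand-in for ELECTRA's small generator network, and all names here are illustrative, not the paper's code.

```python
import random

def make_rtd_example(tokens, generator_sample, mask_prob=0.15, seed=0):
    """Build a replaced-token-detection training pair.

    Each token is replaced by a generator proposal with probability
    `mask_prob`. The discriminator's target is 1 for replaced
    positions and 0 for original ones.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            fake = generator_sample(tok)
            corrupted.append(fake)
            # If the generator happens to propose the original token,
            # the position still counts as original (label 0).
            labels.append(0 if fake == tok else 1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

# Toy generator: a fixed substitution table stands in for a
# learned masked-language-model sampler.
subs = {"quick": "fast", "lazy": "idle"}
tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, labels = make_rtd_example(
    tokens, lambda t: subs.get(t, t), mask_prob=0.5, seed=1
)
```

The discriminator is then trained to predict `labels` from `corrupted`, a denser training signal than masked language modeling because every position, not just the masked ones, contributes to the loss.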
In contrast, the XLM-R model exhibited a different pattern. Its validation loss started relatively low, at 0.1, even with the previously mentioned hyperparameters. Subsequent efforts to address the loss, such as increasing the dropout ratio to 0.2, 0.3, and finally 0.5, failed to significantly improve the results or mitigate the overfitting problem. Consequently, the XLM-R model proved particularly susceptible to overfitting. The most favorable results from running the XLM-R model on the custom dataset were obtained by training for 10 epochs with the following hyperparameters: batch size 32, with learning-rate enhancements via a warmup phase (learning_rate: , initial_learning_rate: , warmup_epochs: 2). The model's best loss was reached at epoch 6, with a validation loss of 0.0002. Beyond this point, the loss escalated, indicative of increasing overfitting.
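The pattern above, where validation loss bottoms out at epoch 6 and then climbs, is the classic trigger for early stopping on the best checkpoint. A minimal sketch of that selection logic follows; the loss values and function names are illustrative, not the paper's exact curve or code.

```python
def best_epoch(val_losses, patience=3):
    """Return the index of the epoch with the lowest validation loss,
    halting the scan once the loss has failed to improve for
    `patience` consecutive epochs (a simple early-stopping guard
    against the rising-loss overfitting pattern described above).
    """
    best, best_idx, bad = float("inf"), -1, 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_idx, bad = loss, i, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_idx

# Illustrative curve: minimum at epoch 6 (index 5), then escalating.
losses = [0.10, 0.03, 0.01, 0.004, 0.001, 0.0002, 0.05, 0.09, 0.2, 0.4]
checkpoint = best_epoch(losses)  # index of the epoch to keep
```

In practice one would save model weights at each improvement and restore the checkpoint from the returned epoch rather than the final one.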
In the experiments conducted on the large dataset, both the XLM-R and AraELECTRA models initially demonstrated low validation losses, starting at 0.03. Despite this promising start, the XLM-R model exhibited notable overfitting, more so than its counterpart. Extensive experimentation was conducted to optimize hyperparameters specifically for this dataset. Despite the variability in results, both models demonstrated proficiency in detecting AIGT and HWT, though validation losses fluctuated and eventually increased. Optimal results on this dataset were achieved with the XLM-R model trained for ten epochs, with a batch size of 64, an initial learning rate of , a learning rate of , and warmup epochs set to five. These hyperparameters yielded the most favorable outcomes during the inference phase on both the custom and AIRABIC datasets, despite being less effective than runs on the custom dataset alone.
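The warmup scheme used in these runs can be sketched as a linear ramp from an initial rate to a target rate over the first few epochs. The specific learning-rate values below are placeholders, since the exact rates are not reproduced in this text; only the warmup-epoch counts (2 and 5) come from the runs described above.

```python
def warmup_lr(epoch, initial_lr, target_lr, warmup_epochs):
    """Linearly ramp the learning rate from `initial_lr` to
    `target_lr` over the first `warmup_epochs` epochs, then hold
    it constant. Epochs are 0-indexed.
    """
    if epoch >= warmup_epochs:
        return target_lr
    frac = (epoch + 1) / warmup_epochs
    return initial_lr + frac * (target_lr - initial_lr)

# Example with hypothetical rates and the five warmup epochs used
# for the large-dataset XLM-R run.
schedule = [warmup_lr(e, 1e-6, 2e-5, 5) for e in range(8)]
```

Warming up avoids large, destabilizing updates to the pretrained weights in the earliest epochs, which is one common way to tame the overfitting volatility noted above.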
In contrast, AraELECTRA displayed superior performance during the inference phase on both the custom and AIRABIC datasets compared to XLM-R. This suggests greater robustness in AraELECTRA, although its results were less optimal than when trained solely on the custom dataset. This disparity underscores the crucial role of dataset quality and composition in model training. Native datasets, with their inherent linguistic authenticity, often provide a more conducive environment for effective training than datasets composed predominantly of translated material.
The robustness of each model becomes evident during the inference phase, when it is tested against various datasets. Notably, XLM-R performs poorly after training on the large dataset, whereas AraELECTRA achieves satisfactory results under the same conditions. However, when both models are trained on the custom dataset, each shows improved performance, particularly against the benchmark dataset, which served as the evaluation criterion of our study. More details are provided in the following section.
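The inference-phase comparison amounts to scoring each trained detector against a held-out benchmark. The sketch below shows that scoring step in its simplest form; the classifier, texts, and labels are toy stand-ins for illustration only, not the actual models or the AIRABIC data.

```python
def accuracy(model_predict, texts, labels):
    """Fraction of benchmark examples classified correctly.

    `model_predict` maps a text to 1 (AIGT) or 0 (HWT); the name
    and interface are illustrative, not the paper's actual API.
    """
    correct = sum(model_predict(t) == y for t, y in zip(texts, labels))
    return correct / len(texts)

# Toy stand-in classifier: a lookup table of fixed predictions.
texts = ["sample aigt output", "a human essay", "another aigt output"]
labels = [1, 0, 1]
preds = {"sample aigt output": 1, "a human essay": 0, "another aigt output": 0}
acc = accuracy(lambda t: preds[t], texts, labels)
```

Running each model's prediction function through the same benchmark in this way is what makes the cross-dataset robustness comparison in this section possible.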