Q8VaxStance: Dataset Labeling System for Stance Detection towards Vaccines in Kuwaiti Dialect
1. Introduction
This research aims to label a large dataset of tweets written in the Kuwaiti dialect. The tweets are classified pragmatically depending on their attitude towards vaccines in order to track negative views on social media. This research is an integral part of a more comprehensive attempt to understand the elements that cause vaccine hesitancy and to create practical approaches for addressing it. Furthermore, by analyzing social media data we can better understand how misinformation and vaccine-related conspiracy theories spread and how they affect public opinion. Ultimately, this knowledge can help public health officials propose initiatives to protect the health of individuals and communities.
The main contribution of this research is the first dataset of tweets labeled for stance towards vaccines in the Kuwaiti dialect (42,764 labeled tweets). This dataset is a valuable resource for researchers studying vaccine hesitancy and its impact on public health. Additionally, this research implements the first Kuwaiti dialect annotation system for vaccine stance detection (Q8VaxStance), which uses weak supervised learning and applies prompt engineering to zero-shot models that serve as labeling functions, so that the dataset can be annotated programmatically. Combining zero-shot labeling functions with a weak supervised learning framework allows a large dataset to be annotated with minimal assistance from subject matter experts and little manual labeling, which saves time and money, as recruiting expert annotators is expensive and time-consuming.
Finally, considering the limited availability of linguistic resources for the Kuwaiti dialect, this research tries to fill this gap in the field of natural language processing by providing a dataset to develop and evaluate machine learning models for stance detection in the Kuwaiti dialect. The following are the research questions of our study:
- How can we create a labeling system to annotate a large dataset of Kuwaiti dialect tweets for stance detection towards vaccines with or without help from subject matter experts (SMEs)?
- What experimental setup produces the best performance for the proposed labeling system?
This paper is organized as follows. In the Background section, we review the relevant literature on vaccine hesitancy and stance detection towards the COVID-19 vaccine, natural language processing (NLP) research involving the Kuwaiti dialect, and dataset annotation approaches in NLP. In the Methodology section, we describe the dataset collection and preparation process, explain how the dataset was labeled manually, and describe the steps and architecture of the proposed Q8VaxStance labeling system. In the Experimental Results and Discussion section, we present the results of our performance evaluation based on the Q8VaxStance labeling system experiments. Finally, in the Conclusion section, we summarize the study’s main findings and propose several directions for future work.
2. Background
2.1. Vaccine Hesitancy and Stance Detection Using Social Network Analysis and Natural Language Processing
2.2. Natural Language Processing (NLP) of Kuwaiti Dialect
Based on the above review, there is an opportunity for researchers in the field of NLP to fill the gap with respect to the Kuwaiti dialect, which remains underrepresented in this academic field.
2.3. Dataset Labeling Approaches
- SMEs write labeling functions (LFs) that express weak supervision sources such as distant supervision, patterns, and heuristics.
- Snorkel applies the LFs on unlabeled data and learns a generative model to combine the LF outputs into probabilistic labels.
- Snorkel uses these labels to train a discriminative classification model such as a deep neural network.
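The first two steps above can be sketched in plain Python. This is a minimal illustration and not Snorkel's API: the labeling functions and keyword lists are hypothetical, and a simple majority vote stands in for the generative label model that Snorkel learns. The third step (training a discriminative classifier on the resulting labels) is omitted.

```python
from collections import Counter

# Label values; ABSTAIN means a labeling function has no opinion on the text.
ABSTAIN, ANTI, PRO = -1, 0, 1

# Hypothetical keyword lists, for illustration only.
PRO_KEYWORDS = ("get vaccinated", "vaccine works")
ANTI_KEYWORDS = ("no to vaccine", "vaccine is dangerous")


def lf_pro_keywords(text):
    """Vote PRO when a pro-vaccine keyword occurs in the text."""
    return PRO if any(k in text for k in PRO_KEYWORDS) else ABSTAIN


def lf_anti_keywords(text):
    """Vote ANTI when an anti-vaccine keyword occurs in the text."""
    return ANTI if any(k in text for k in ANTI_KEYWORDS) else ABSTAIN


def lf_refusal_pattern(text):
    """Heuristic pattern: an explicit refusal is treated as ANTI."""
    return ANTI if "i will not take" in text else ABSTAIN


LABELING_FUNCTIONS = [lf_pro_keywords, lf_anti_keywords, lf_refusal_pattern]


def apply_lfs(texts):
    """Build the label matrix: one row per text, one column per labeling function."""
    return [[lf(text) for lf in LABELING_FUNCTIONS] for text in texts]


def majority_vote(row):
    """Combine one row of votes; abstain when nothing voted or the vote is tied."""
    votes = Counter(v for v in row if v != ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def label_texts(texts):
    """Apply all labeling functions and combine their votes into one label per text."""
    return [majority_vote(row) for row in apply_lfs(texts)]
```

Unlike this majority vote, a learned label model weights each labeling function by its estimated accuracy and outputs probabilistic rather than hard labels.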
3. Methodology
3.1. Dataset Collection
To collect the dataset containing tweets related to the COVID-19 pandemic in Kuwait, we implemented the following steps:
3.2. Dataset Preparation
To prepare our dataset and make sure that it only contained tweets from Kuwait, we filtered out tweets that did not have one of the following keywords in the user_location field: Koweït, Q8, kw, kwt, kuwait, الكويت, كويتيه, كويتي, وطن النهار, and KU. We programmatically removed unrelated tweets by excluding all posts that were not written in Arabic or that contained keywords related to Arabic spam posts. Next, we cleaned the text of the tweets by removing digits, special characters, URLs, emojis, mentions, tashkīl (diacritics), and punctuation. We did not remove hashtags because, based on our observations of the dataset, hashtags are heavily used to express the stance towards vaccination; instead, we only removed the hash (#) and underscore (_) characters between the hashtag keywords, which allowed the hashtags to be processed as regular text. After the dataset preparation and cleaning process, the total number of extracted unlabeled tweets was 42,815.
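The location filter and the text-cleaning steps described above could look roughly like the following sketch. The regular expressions, the case-insensitive substring match, and the choice to replace underscores with spaces are our assumptions, not the exact rules used to build the dataset; the Arabic-language and spam filters are not shown.

```python
import re

# Location keywords from the text, lowercased for case-insensitive matching (an assumption).
LOCATION_KEYWORDS = (
    "koweït", "q8", "kw", "kwt", "kuwait",
    "الكويت", "كويتيه", "كويتي", "وطن النهار", "ku",
)

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
TASHKIL_RE = re.compile(r"[\u064B-\u0652]")  # Arabic diacritics, fathatan through sukun
DIGIT_RE = re.compile(r"\d")                 # also matches Arabic-Indic digits
NON_WORD_RE = re.compile(r"[^\w\s]")         # punctuation, emojis, other symbols


def is_kuwait_location(user_location):
    """Naive substring match on the user_location field; short keys such as 'ku' will overmatch."""
    location = user_location.lower()
    return any(keyword in location for keyword in LOCATION_KEYWORDS)


def clean_tweet(text):
    """Strip URLs, mentions, diacritics, digits, and symbols, but keep hashtag words."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Keep hashtag keywords as regular text: drop '#' and split on '_'.
    text = text.replace("#", " ").replace("_", " ")
    text = TASHKIL_RE.sub("", text)   # delete marks without splitting the word
    text = DIGIT_RE.sub("", text)
    text = NON_WORD_RE.sub(" ", text)
    return " ".join(text.split())     # collapse repeated whitespace
```

For example, `clean_tweet("@user1 #covid_vaccine is great!! https://t.co/abc 2021")` returns `"covid vaccine is great"`.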
3.3. Dataset Labeling
3.4. Q8VaxStance Labeling System
Our first research question aimed to investigate whether a weak supervised learning approach combined with the prompt engineering of zero-shot models could label a large dataset of tweets for stance detection towards vaccines with limited help from SMEs. To obtain an answer to our first research question, we performed the following steps:
- We selected the weak supervised learning framework to use in our experiments. After examining several Python packages and frameworks that support weak supervised learning for natural language processing, we decided to use the Snorkel open-source software framework [33] based on the good results we were able to establish in [16] for the sentiment classification of the Kuwaiti dialect.
- We set up 52 experiments, as described in Table 1; for each experiment, we created the labeling functions that determine the stance towards vaccines. Figure 1 illustrates the general Q8VaxStance labeling system architecture used in the KHZSLF experiment setup; the system architecture for the KHLF and ZSLF experiments is similar, with a few labeling functions being excluded depending on the specific experimental setup.
- We applied the labeling functions on 42,815 unlabeled tweets and trained the model using the Snorkel package to predict the dataset labels. As a first experiment, we created labeling functions to label the dataset based on the presence of specific pro-vaccine and anti-vaccine keywords and hashtags in the tweet texts. In this experiment, we used the same keywords and hashtags that had been used to obtain the dataset from Twitter.
- We conducted several experiments to compare the performance of using only zero-shot (ZS) learning-based labeling functions versus combining keyword-based labeling functions with zero-shot learning-based labeling functions. We used the inference code published by the ZS models’ creators on the Hugging Face website. The following pretrained zero-shot models were used in the ZS labeling functions:
- We applied prompt engineering to check the effect of using different prompts and labels on the labeling system performance, then determined the combinations of labels and prompts that produced the best performance when using the zero-shot learning-based labeling functions. To apply prompt engineering, we varied the text of labels and prompts; in addition, we tested different combinations consisting of English labels and prompts, Arabic labels and prompts, and mixed language labels and prompts to check the effect of the language used in the labels and prompts on system performance. Table 2 and Table 3 contain a list of the labels and prompts used in our experiments.
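The sketch below illustrates how zero-shot labeling functions can be generated for every combination of label language and prompt language. It rests on several assumptions: the label texts and prompt templates are hypothetical stand-ins for the entries in Table 2 and Table 3, the `min_score` abstention threshold is our own addition, and `classify` is any callable that returns a dictionary with `labels` and `scores` sorted by descending score, which is the output shape of the Hugging Face zero-shot-classification pipeline. The two-letter keys (label language, then template language) are this sketch's own naming and are not claimed to match the experiment codes.

```python
from itertools import product

ABSTAIN, ANTI, PRO = -1, 0, 1

# Hypothetical label texts per language, mapped to stance values.
LABEL_SETS = {
    "E": {"pro-vaccine": PRO, "anti-vaccine": ANTI},
    "A": {"مع التطعيم": PRO, "ضد التطعيم": ANTI},
}

# Hypothetical prompt (hypothesis) templates per language; "{}" receives a label.
TEMPLATES = {
    "E": "The stance of this tweet is {}.",
    "A": "موقف هذه التغريدة {}.",
}


def make_zero_shot_lf(classify, label_map, template, min_score=0.6):
    """Wrap a zero-shot classifier as a labeling function that may abstain."""
    def labeling_function(text):
        result = classify(
            text,
            candidate_labels=list(label_map),
            hypothesis_template=template,
        )
        top_label, top_score = result["labels"][0], result["scores"][0]
        # Abstain when the model is not confident enough (assumed threshold).
        return label_map[top_label] if top_score >= min_score else ABSTAIN
    return labeling_function


def build_lf_grid(classify):
    """One labeling function per (label language, template language) pair."""
    return {
        label_lang + template_lang: make_zero_shot_lf(
            classify, LABEL_SETS[label_lang], TEMPLATES[template_lang]
        )
        for label_lang, template_lang in product(LABEL_SETS, TEMPLATES)
    }
```

Passing the classifier in as a parameter keeps the grid logic independent of any particular pretrained model, so the same labels and prompts can be tried with several zero-shot models.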
Our second research question aimed to evaluate the performance of the Q8VaxStance system on labeling a large dataset for stance detection towards vaccines. To address this question, we trained the model with the Snorkel package on the 42,815 unlabeled samples and evaluated it on the human-labeled dataset; then, we compared the accuracy, macro-F1 score, and total number of generated labels for each experiment. The details of the experimental results are presented in the next section. Finally, we used ANOVA and Tukey’s HSD statistical tests to determine whether the differences between the experiments were statistically significant, as well as to discover the main factors affecting the experimental performance and the labeling functions’ ability to generate more labels.
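For reference, the evaluation metrics named here, together with Cohen’s kappa, which the results section also reports, can be computed as in the following plain-Python sketch. How abstained predictions were treated during evaluation is not stated, so the sketch simply counts the non-abstain labels separately; the `ABSTAIN` value of -1 is an assumption.

```python
from collections import Counter

ABSTAIN = -1  # assumed marker for "no label generated"


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the human labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(y_true, y_pred):
    """Agreement between two labelings, corrected for chance agreement."""
    n = len(y_true)
    observed = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n ** 2
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0


def count_generated_labels(y_pred):
    """Number of samples for which the labeling system did not abstain."""
    return sum(p != ABSTAIN for p in y_pred)
```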
4. Experimental Results and Discussion
The best accuracy, Macro-F1, and Cohen’s kappa score values were achieved in experiments KHZSLF-EE4 and KHZSLF-EA1, both of which reached accuracy and Macro-F1 values of approximately 0.83. The Cohen’s kappa scores achieved in these experiments were 0.66 and 0.67, respectively. Moreover, the best accuracy for the experiments in the groups using Arabic labels and templates was in experiments KHZSLF-AA8 and KHZSLF-AA9, with accuracy, Macro-F1, and Cohen’s kappa score values of 0.83, 0.82, and 0.65, respectively.
Next, the results were analyzed to detect which experiments generated a more balanced distribution of the generated dataset labels and which experiments abstained and could not generate many labels. The results show that, on average, the experimental groups KHZSLF-AA, ZSLF-AA, and KHZSLF-EA created nearly balanced datasets. In contrast, experiments KHZSLF-EE, ZSLF-EE, and ZSLF-EA created imbalanced datasets.
The following is a description of each experimental group:
Furthermore, the results indicate that there is a statistically significant difference between the means of the three evaluation metrics (accuracy, macro-averaged F1 score, and the total number of labels) when using zero-shot model labeling functions with any language (AA, AE, or EE) compared to not using zero-shot models (NN), indicating that experiments using zero-shot model labeling functions outperform experiments using only keyword labeling functions.
Therefore, we can conclude that when using mixed zero-shot models with mixed language labels and prompts (AAAEEE), the differences between the experiments are not statistically significant compared to using only zero-shot models, indicating that this experimental setup does not significantly improve the evaluation metrics.
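To make the comparison between experimental groups concrete, the one-way ANOVA F statistic can be computed as in the sketch below. It is a plain-Python illustration only: the function name is ours, it returns the F statistic and degrees of freedom without a p-value, and it does not perform the Tukey’s HSD pairwise comparisons, for which a statistics library would normally be used.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of metric values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

Each group would hold one metric value per experiment, for example the accuracy of every experiment in a given label and prompt language setting; a large F value relative to the F distribution with the returned degrees of freedom indicates that at least one group mean differs.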
5. Conclusions
In this study, we have attempted to fill a gap in the field of NLP by creating Kuwaiti dialect language resources, as currently the Kuwaiti dialect is underrepresented in the available Arabic language models. These language resources are critical for developing high-performance approaches and systems for different NLP problems. To overcome data annotation challenges, we have proposed an automated system to programmatically label a tweet dataset to detect the stance towards vaccines in the Kuwaiti dialect (Q8VaxStance). The proposed system is based on an approach combining the benefits of weak supervised learning and zero-shot learning.
This research is an essential part of a more comprehensive attempt to understand the elements that cause vaccine hesitancy in Kuwait and to create practical approaches for addressing it. This labeled dataset is the first Kuwaiti dialect dataset for vaccine stance detection. In this research, we conducted 52 experiments to identify the best experimental setup and the main factors that affect the annotation system’s performance metrics by comparing the accuracy value, Macro-F1 score, Cohen’s kappa score, and total number of generated labels. In addition, we studied the statistical significance of the experiments by applying ANOVA and pairwise Tukey’s HSD post hoc statistical tests.
Based on our results, we achieved the best accuracy, Macro-F1 score, and Cohen’s kappa score values in the experiments that used both zero-shot models and keyword detection as labeling functions; experiments KHZSLF-EE4 and KHZSLF-EA1 both reached accuracy and Macro-F1 values of approximately 0.83. The Cohen’s kappa scores achieved in these experiments were 0.66 and 0.67, respectively, which are considered good annotator agreement scores. As part of our future work, we plan to conduct additional experimentation and refinement in order to achieve perfect agreement and improved performance metrics.
The results of the ANOVA and pairwise Tukey’s HSD post hoc statistical tests showed that the experiments using both zero-shot models and keyword detection as labeling functions (KHZSLF) significantly outperformed those using only the keyword detection labeling functions (KHLF) or only the zero-shot model labeling functions (ZSLF) for all evaluation metrics. When changing the language of the labels and prompts used in zero-shot models, our results showed that the mean total number of generated labels when using Arabic in both labels and prompts (AA) or mixed Arabic and English labels and prompts (AE) differed significantly from the mean obtained when using English in both labels and prompts (EE), indicating that our proposed annotation system generates more labels when the Arabic language is used in both prompts and labels or in at least one of them.
In our future research, we first intend to experiment more with the proposed annotation system by applying zero-shot and few-shot learning on large language models supporting the Arabic language. Second, we plan to use this generated dataset to fine-tune and compare available Arabic BERT-based language models and large multilingual models to create a trained model for Kuwaiti dialect stance detection. Finally, we plan to use graph neural network algorithms to predict vaccine stances and compare the findings with the results of this research.