Cybercrime Risk Found in Employee Behavior Big Data Using Semi-Supervised Machine Learning with Personality Theories


2.7. Big Data Exploration with Selected ML Techniques

In research design step 6 from Figure 1, the ML data exploration phase was where the training model was developed. First came the data collection process, which involved coding and cleaning the data. The big data was processed to remove confidential company information and employee identifying attributes and then preprocessed in Python to develop the features and target variables.
The case study company used Microsoft's Office software and Microsoft's Azure platform, so the author also utilized Microsoft's Azure AI, ML, and cognitive computing services. For the benefit of other researchers, Microsoft [15] recently opened its AI and ML cognitive services to the public, providing 25 ML tools and at least 55 services free of charge [see: https://azure.microsoft.com/en-us/pricing/free-services/ (accessed on 1 March 2024)]. Microsoft stated that the AI/ML programming libraries contain emotion/sentiment detection, vision/speech recognition, and language understanding as used by its Bing, Cortana, and Skype Translation products [15]. Google offers comparable services.
The research programming environment consisted of the Microsoft Azure CLI ML extension version 2 and the Python SDK Azure-AI-ML version 2. Specifically, these libraries were accessed: azure.ai.ml, azure.identity, azure.ai.ml.entities, azure.ai.ml.constants, and other general-purpose routines commonly leveraged for research programming tasks. The author primarily used Jupyter Notebook as a structured programming editor to prototype and then train an ML model in Python, with Pandas installed through Conda. This environment was considered comparable to the one used by Dalal et al. [9] (Python 3.6 with the NumPy, Pandas, and Scikit-learn libraries) and more powerful than a 1 GHz CPU with 2 GB of RAM. The author initially set up a workspace instance attached to serverless compute to offload lifecycle management to Azure Machine Learning. Then, a datastore was defined in the cloud as an Azure Data Lake, which was where the case study company securely placed the extracted data. The same work environment and resources were reused by the author for subsequent iterations of model building. The AI/ML documentation is stored at: https://learn.microsoft.com/en-us/azure/machine-learning/ (accessed on 1 March 2024).
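For illustration, a minimal sketch of connecting to an Azure ML SDK v2 workspace and retrieving the cloud datastore is shown below; the subscription, resource group, workspace, and datastore names are placeholders, not the study's actual values.

```python
# Minimal sketch of connecting to an Azure ML v2 workspace (placeholder names).
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Authenticate with the default Azure credential chain (CLI login, managed identity, etc.)
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",      # placeholder
    resource_group_name="<resource-group>",   # placeholder
    workspace_name="<workspace-name>",        # placeholder
)

# Retrieve the cloud datastore (Azure Data Lake) where the extracted data was placed
datastore = ml_client.datastores.get("<datastore-name>")  # placeholder name
print(datastore.type)
```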

The case study company created an application programming interface (API) with the help of the author to clean up the big data. The API extracted a year of text blobs from the big data source, first ensuring confidential protected information was removed by using lookup registries of employee personal identifier constructs (usernames, identification numbers, etc.) and trade secret phrases. The API returned only text, with no images, graphics, or attachments, from employee postings or Microsoft Office applications. The data lake could hold trillions of datasets, with a single file up to a petabyte in size. The size of the current study's cleaned big data was estimated to be 8.8 gigabytes.
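The company's API is proprietary, but the redaction logic described above can be sketched as follows; the function, registries, and sample text are hypothetical illustrations, not the actual implementation.

```python
# Illustrative sketch (not the company's API) of the redaction step: text blobs are
# scanned against lookup registries of personal identifiers and trade-secret phrases,
# and matches are removed before the text is returned.
import re

def redact(text_blob: str, identifier_registry: set[str], trade_secret_registry: set[str]) -> str:
    """Return the text blob with registry matches replaced by a neutral token."""
    for term in identifier_registry | trade_secret_registry:
        # Whole-phrase, case-insensitive replacement of each registered term
        text_blob = re.sub(rf"\b{re.escape(term)}\b", "[REDACTED]", text_blob, flags=re.IGNORECASE)
    return text_blob

sample = "jsmith42 asked about Project Falcon deadlines."
print(redact(sample, {"jsmith42"}, {"Project Falcon"}))
# -> "[REDACTED] asked about [REDACTED] deadlines."
```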

Once the API was operational, the next task was to integrate the FFPT construct into ML. First, the FFPT items were adapted from the work of Goldberg [21] and Costa et al. [22]. Table 1 contains the item keywords adapted for each FFPT factor. Reverse-coded items were not used because they are intended to check social desirability; the current study used no self-reported survey data, only retrospective big data. The FFPT items were loaded into a Python lexicon array, structured by factor.
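A minimal sketch of how such a lexicon could be structured by factor is shown below; the keyword lists are hypothetical stand-ins for the item keywords adapted in Table 1 from Goldberg [21] and Costa et al. [22].

```python
# Illustrative FFPT lexicon structure, keyed by factor (keywords are hypothetical stand-ins)
FFPT_LEXICON = {
    "neuroticism":       ["stressed out", "anxious", "moody"],
    "extraversion":      ["outgoing", "talkative", "energetic"],
    "openness":          ["curious", "imaginative", "inventive"],
    "agreeableness":     ["cooperative", "considerate", "trusting"],
    "conscientiousness": ["organized", "dependable", "thorough"],
}
```
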
The VADER (Valence Aware Dictionary and sEntiment Reasoner) English library was selected for the tokenization because it is rule-based and was effective in previous ML studies [15,18,19]. The WordNet Lemmatizer Python library was leveraged for further normalization because it draws on a large pre-trained dictionary of English words that can provide synonyms. As a result, unimportant connecting words, called stop words, were removed (e.g., 'but', 'and', 'in', 'or', etc.), and modern American slang synonyms were added, such as 'pissed' to represent 'outraged'. To facilitate further processing, all words were forced to lowercase. Thus, the normalization condensed each text blob into a succinct group of lowercase words without punctuation, in the hope of capturing its essence. The sentiment analysis was performed using natural language processing by comparing the normalized big data text blob records to each FFPT item per factor.
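The normalization steps described above could be sketched as follows, assuming the NLTK implementations of VADER, the WordNet Lemmatizer, and the English stop-word list; the slang mapping is a hypothetical example of the synonym substitution mentioned in the text.

```python
# Sketch of the normalization pipeline: lowercase, strip punctuation, drop stop words,
# map slang to standard synonyms, lemmatize, then apply rule-based sentiment scoring.
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

for pkg in ("punkt", "stopwords", "wordnet", "vader_lexicon"):
    nltk.download(pkg, quiet=True)

STOP_WORDS = set(stopwords.words("english"))
SLANG_MAP = {"pissed": "outraged"}          # modern slang -> standard synonym (example)
lemmatizer = WordNetLemmatizer()
vader = SentimentIntensityAnalyzer()

def normalize(text_blob: str) -> list[str]:
    """Condense a text blob into a succinct group of lowercase, lemmatized words."""
    tokens = nltk.word_tokenize(text_blob.lower())
    tokens = [t.strip(string.punctuation) for t in tokens]
    tokens = [SLANG_MAP.get(t, t) for t in tokens if t and t not in STOP_WORDS]
    return [lemmatizer.lemmatize(t) for t in tokens]

blob = "I am pissed and stressed out, but the audit is done."
print(normalize(blob))                                      # condensed word group
print(vader.polarity_scores(" ".join(normalize(blob))))     # rule-based sentiment scores
```
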
FFPT coefficient scores were calculated on a 0 to 1 continuous scale per item, and the item scores were averaged to obtain a factor coefficient score. For example, the first item in the FFPT neuroticism factor from Table 1, 'stressed out', was compared to each big data record to generate the item-level score. If the big data record contained the tokenized keyword 'stress' in the past, present, present-perfect, or future verb form, it was scored 1. If the big data record contained the noun form with adjacent keywords (e.g., 'stressed, wow am I ever…'), it was also scored 1. Averaging the item coefficient scores per factor resulted in all factor coefficients ranging from 0.1 to 0.95 for every FFPT factor.
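A simplified sketch of the item-level scoring and factor averaging might look like the following; stemming stands in for the verb-form matching described above, and the keywords are illustrative.

```python
# Simplified sketch: a record scores 1 on an item when any tokenized form of the item
# keyword appears; the factor coefficient is the mean of its item scores (0 to 1 scale).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def item_score(record_tokens: list[str], item_keyword: str) -> int:
    """1 if any stemmed token in the record matches a stemmed item keyword, else 0."""
    item_stems = {stemmer.stem(w) for w in item_keyword.split()}
    record_stems = {stemmer.stem(t) for t in record_tokens}
    return int(bool(item_stems & record_stems))

def factor_coefficient(record_tokens: list[str], item_keywords: list[str]) -> float:
    """Average of the item scores for one FFPT factor."""
    scores = [item_score(record_tokens, kw) for kw in item_keywords]
    return sum(scores) / len(scores)

tokens = ["stressed", "wow", "ever", "outraged"]
print(factor_coefficient(tokens, ["stressed out", "anxious", "moody"]))  # ~0.33
```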

2.8. Big Data Analysis, Training Model Development, and Evaluation with ML

Step 7 in the research design of Figure 1 was ‘ML develop models’. The first task was to develop a training model from the normalized and tokenized big data. After that, the quality, validity, and reliability of the model were measured.

Credibility in step 7 was achieved partly through how the research was designed and partly through excellent quality scores judged against established benchmarks. Credibility was supported in the current study by fully describing the methods so that another researcher could recreate or replicate the study and understand how it was accomplished scientifically. Another concern for credibility is how well the population was identified and how well the sample represented the targeted population. In the current study, the targeted population was made clear, and the sample was taken from the industry. This allows readers and other researchers to generalize from the findings of the current study with confidence.

In step 7, ML validity checking was applied to the model. In ML, validity checking is performed by separating the data used to test the model from the data used to train the model. The best practice is to divide the data into three groups using the 80:10:10 rule: the first 80% for training model data, the next 10% for model testing validity/reliability, and the final 10% to validate the final model after tuning in the future [19,20]. It is acceptable to use a 90:10 rule in exploratory studies, with 90% of the data allocated for building a training model and the remaining 10% allocated for validity/reliability testing of the current model as well as for future fine-tuned models [20]. The key issue is how the test dataset is created from the data because random pulls may not be representative of the big data due to the limitations of statistical probability sampling with replacement [19].
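For illustration, the 80:10:10 allocation could be implemented with two passes of scikit-learn's train_test_split, as sketched below with synthetic stand-in data rather than the study's dataset.

```python
# Sketch of the 80:10:10 allocation using two stratified splits.
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the engineered FFPT features (X) and cybercrime-risk target (y)
rng = np.random.default_rng(0)
X = rng.random((1000, 5))             # five FFPT factor coefficients per record
y = rng.integers(0, 2, size=1000)     # dichotomous risk label

# First pass: reserve 20% of the records
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
# Second pass: split the reserved 20% evenly into test (10%) and future validation (10%)
X_test, X_valid, y_test, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42
)
print(len(X_train), len(X_test), len(X_valid))   # 800, 100, 100
```
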
There are four common approaches in ML for allocating a test dataset to perform validity/reliability checking on a training model: sampling, holdout, folding, and ensemble. The first three involve allocating a proportion of the entire data using the 80:10:10 or 90:10 rule discussed above. The ensemble approach involves using multiple ML methods to create multiple models, combined with one of the validity/reliability checking approaches (e.g., sampling, holdout, or folding) applied to every model, with the results averaged or the overall best result selected [20]. Another approach, not listed above because it is not practical for big data, is leave-one-out, where iterative testing is performed using all but one withheld record, selecting a new record each time; it is reliable because it eliminates errors by chance, but it is extremely time consuming even with powerful computers [19]. The sampling approach is the least robust because it takes random records to create the test (and optionally the validation) datasets, with or without replacement; there is no assurance that the test record distribution will match the distribution of the training data unless stratified selection takes place, and even then validity/reliability can suffer from pure chance [20]. However, sampling is convenient and may be practical for small datasets. Folding is also known as cross-validation, a variation of the holdout approach in which 10% or another proportion of the records is held out from the training model data, and the process is repeated 10 to 20 times to test validity/reliability [19,20]. This is considered more robust than sampling and less time/resource intensive than the ensemble approach [20]. A decisive point is that many ML techniques generate quality score coefficients while building the training model, but these are intended only as rough progress indicators; the quality measures must be taken after the training model is finalized, using the selected validation approach.
The folding approach was selected for the current study because it was considered a robust accuracy validation/reliability technique by experts [19,20], and it works well for models with dichotomous target variables such as cybersecurity risk. A combined folding and stratified sampling approach was used, with 20-fold cross-validation of the training model against the test data, reporting the average over all classes as the final accuracy coefficient score. The author asserts that this approach is robust because it surpasses the 10-fold cross-validation resampling technique recommended by Lantz [19].
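A sketch of the combined 20-fold stratified cross-validation follows, assuming scikit-learn's StratifiedKFold and an illustrative classifier that stands in for the study's actual model.

```python
# Sketch of 20-fold stratified cross-validation with the mean accuracy as the final score.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 5))             # FFPT factor coefficients (synthetic stand-in)
y = rng.integers(0, 2, size=1000)     # dichotomous cybercrime-risk label

cv = StratifiedKFold(n_splits=20, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=cv, scoring="accuracy")
print(f"20-fold mean accuracy: {scores.mean():.3f}")
```
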
Step 7 of the ML training model validity/reliability checking involved calculating several score coefficients and then evaluating the scores against a priori benchmarks from the literature review or from expert ML practitioners [19,20]. Here, these are called quality measures in general, or a coefficient when referring to a specific score. The task started by calculating a confusion matrix from the ML training model results. Some coefficients are conditional probabilities, and others are formed through regression residual calculations. The confusion matrix is a contingency matrix or table listing the frequency counts or proportions of true versus false comparisons of the training model predictions on the test data (e.g., on the folding test records). The terms true and false here do not mean cybercrime risk yes or no, but instead whether a predicted value was correct (true) or not (false). Some ML practitioners suggest displaying a confusion matrix to increase credibility so that other researchers may clearly see the values underlying the coefficient scoring, which was performed in the current study (see the discussion section below).
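A minimal sketch of producing the confusion matrix (and the classification accuracy discussed next) with scikit-learn is shown below, using illustrative labels rather than the study's data.

```python
# Sketch of building the confusion matrix and classification accuracy from predictions.
from sklearn.metrics import confusion_matrix, accuracy_score

y_test = [1, 0, 1, 1, 0, 0, 1, 0]     # actual risk labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]     # predicted risk labels (illustrative)

cm = confusion_matrix(y_test, y_pred)
print(cm)                              # rows = actual, columns = predicted
print(accuracy_score(y_test, y_pred))  # proportion on the diagonal (CA)
```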

The first ML quality measure calculated from the training model confusion matrix was the classification accuracy (CA). CA is the proportion of correctly classified records, that is, correct positives plus correct negatives. It is found as the cumulative proportion along the diagonal of the confusion matrix, from top left to bottom right. The mean absolute error (MAE) was approximated using the formula 1 − CA.

The sensitivity quality measure shows the proportion of actual positives that were correctly classified. In the current study, sensitivity was calculated as the proportion of positive records in the folded test data that the training model algorithm classified correctly [19]. The reader can think of this as the number of true positives divided by the total number of actual positives in the folded test data (records correctly classified as positives plus positive records incorrectly classified as negative, known as false negatives). In formula format: true positives/(true positives + false negatives). Sensitivity must be balanced as a tradeoff with its counterpart, specificity, discussed next.
The specificity quality measure encapsulates the total negatives correctly classified, the true negative rate [19]. The reader can think of specificity as the proportion of negative records that were correctly classified by the training model out of all negative records in the test dataset. The formula is approximated as: true negatives/all negatives in the test data. As noted above, specificity is balanced with sensitivity; changing one impacts the other.
The recall ML quality measure indicates how complete the training model test results are, reflecting the number of true positives over the total number of actual positives [19]; this is the same formula as sensitivity, and the same interpretation applies. The precision ML quality score is an indication of training model trustworthiness. The precision coefficient is the positive predictive value [20]. It is calculated as the proportion of records predicted positive by the training model algorithm on the test data that were truly positive. This is an important coefficient for business decision-makers because it shows how often an ML training model is correct or trustworthy; untrustworthy models could result in many lost customers over time [19]. The pseudo formula for ML precision is: true positives/(true positives + false positives).
The F-measure is an overall score for evaluating all the above quality measures in one index. F-measure evaluates the ML training performance against the test data by combining the precision and recall coefficients into a single index by calculating the harmonic mean. The harmonic mean is used rather than the more common arithmetic mean since both precision and recall are expressed as proportions between zero and one [20]. The pseudo formula for F-measure is: (2 × precision × recall)/(recall + precision).
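The quality measures above can be computed directly from the confusion matrix counts, as sketched below with the same illustrative labels used earlier.

```python
# Sketch of sensitivity, specificity, precision, and F-measure from confusion matrix counts.
from sklearn.metrics import confusion_matrix

y_test = [1, 0, 1, 1, 0, 0, 1, 0]     # actual risk labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]     # predicted risk labels (illustrative)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

sensitivity = tp / (tp + fn)          # recall / true positive rate
specificity = tn / (tn + fp)          # true negative rate
precision   = tp / (tp + fp)          # positive predictive value
f_measure   = (2 * precision * sensitivity) / (precision + sensitivity)  # harmonic mean
print(sensitivity, specificity, precision, f_measure)
```
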
The next ML training model quality measure calculated in step 7 was the area under the receiver operating characteristic curve (AROC), sometimes abbreviated as AUC. AROC is a plot of the false positive rate (1 − specificity) on the x-axis versus the true positive rate (sensitivity) on the y-axis, drawn against a superimposed diagonal reference line from (0, 0) to (1, 1) representing zero quality. According to Lantz [19], the AROC curve should be used to evaluate the quality tradeoff between detecting true positives and avoiding false positives; in other words, the curve should lie above the zero-quality diagonal, with the perfect maximum being an imaginary line from (0, 0) up to (0, 1) and across to (1, 1) (running up the left edge and along the top of the chart). To interpret the AROC, the closer the curve is to that perfect line, the better the training model is at identifying true positives. The coefficient score for AROC is measured as the two-dimensional area under the plotted curve, where the zero-quality diagonal corresponds to 0.5. The AROC benchmarks, according to and adapted from ML practitioner Lantz [19], are listed below (a computation sketch follows the list):
  • 0.9–1.0 = outstanding quality;

  • 0.8–0.9 = excellent/superior quality;

  • 0.7–0.8 = acceptable/fair quality;

  • 0.6–0.7 = poor quality;

  • 0.5–0.6 = zero quality.
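
The AROC computation and plot could be sketched as follows with scikit-learn and matplotlib, using illustrative predicted probabilities rather than the study's model outputs.

```python
# Sketch of plotting the ROC curve and computing the AROC/AUC coefficient.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_test  = [1, 0, 1, 1, 0, 0, 1, 0]                  # actual risk labels (illustrative)
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]  # predicted probabilities (illustrative)

fpr, tpr, _ = roc_curve(y_test, y_score)            # 1 - specificity vs. sensitivity
auc = roc_auc_score(y_test, y_score)
print(f"AROC = {auc:.2f}")

plt.plot(fpr, tpr, label=f"model (AROC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="zero-quality diagonal")
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```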

ML subject matter experts Ramasubramanian and Singh [20], as well as Lantz [19], argued that Jacob Cohen's inter-rater agreement Kappa coefficient is an important error metric to consider when evaluating training models developed to solve business problems because the formula adjusts for chance. For example, in statistical theory, it can be shown that an ML model for a university exam with 80% yes values (positives) and 20% no values (negatives) can be faked by answering yes (positive) to every question, achieving 80% accuracy by chance alone [19]. The Kappa or an equivalent formula adjusts for the expected probability of inter-rater agreement due to pure chance.
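A minimal sketch of the chance-adjusted Kappa calculation with scikit-learn, using the illustrative labels from earlier, follows.

```python
# Sketch of Cohen's Kappa: agreement between actual and predicted labels, adjusted for chance.
from sklearn.metrics import cohen_kappa_score

y_test = [1, 0, 1, 1, 0, 0, 1, 0]     # actual risk labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]     # predicted risk labels (illustrative)
print(cohen_kappa_score(y_test, y_pred))   # 0 = chance-level agreement, 1 = perfect
```
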
In the current study, the Log Loss estimate was calculated. Like the Kappa coefficient, the Log Loss shows how well the training model classified the values in the folded test data as compared to a null model, where prediction would be by chance [20]. Therefore, a lower Log Loss coefficient is desired. A quality score could be generated by taking the reciprocal of the Log Loss index. The following Log Loss benchmarks were adapted from Lantz's [19] Kappa interpretation guidelines for use in the current study (a computation sketch follows the list):
  • 0.00–0.20 = low agreement with the null model, benchmark acceptable score;

  • 0.20–0.40 = weak agreement with the null model, baseline acceptance score;

  • 0.40–0.60 = moderate agreement with the null model, borderline to poor score;

  • 0.60–0.80 = good agreement with the null model, poor score;

  • 0.80–1.00 = high agreement with the null model, unacceptable score.
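
The Log Loss calculation could be sketched as follows with scikit-learn, using illustrative predicted probabilities rather than the study's model outputs.

```python
# Sketch of the Log Loss calculation from predicted class probabilities.
from sklearn.metrics import log_loss

y_test  = [1, 0, 1, 1, 0, 0, 1, 0]                  # actual risk labels (illustrative)
y_proba = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]  # predicted probability of risk (illustrative)
print(log_loss(y_test, y_proba))    # lower values indicate better fit to the test data
```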

Once the ML model validity/reliability checking in step 7 was completed and the quality scores were considered acceptable, the results were visualized and interpreted. Consequently, the final task in step 7 was to create a word diagram, a node diagram, a scatter plot, and a heat map to visually interpret the results.

The word diagram was created by selecting 50 keywords from the big data associated with cybercrime risk. The learning tree analysis node diagram was created to explain how the big data were classified by the FFPT factors. The scatter plot was developed to contrast the two most important FFPT factors using three parameters. The most important FFPT factors were aligned to the axes, namely, neuroticism on the x-axis and openness on the y-axis. The color was scaled to match the coefficient score of the cybercrime risk association (lighter yellow shades represented minimal risk coefficients, and darker shades of blue represented higher risk coefficients). The plot symbol differentiated no cyber risk (shown as a '0') from a potential cyber risk (shown as an 'x'). A regression slope trend line was superimposed on the plot.
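An illustrative sketch of such a scatter plot with matplotlib might look like the following; the data values are synthetic stand-ins for the study's results, and the 'o' marker approximates the '0' symbol.

```python
# Illustrative sketch: neuroticism vs. openness, color-scaled risk, 'o'/'x' markers, trend line.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
neuroticism = rng.uniform(0.1, 0.95, 200)
openness    = rng.uniform(0.1, 0.95, 200)
risk_coef   = (neuroticism + openness) / 2           # synthetic risk coefficient
risk_flag   = (risk_coef > 0.6).astype(int)          # dichotomous risk label

for flag, marker in ((0, "o"), (1, "x")):            # 'o' = no risk, 'x' = potential risk
    idx = risk_flag == flag
    plt.scatter(neuroticism[idx], openness[idx], c=risk_coef[idx],
                cmap="YlGnBu", marker=marker, vmin=0, vmax=1)

# Superimposed regression trend line
slope, intercept = np.polyfit(neuroticism, openness, 1)
xs = np.linspace(0.1, 0.95, 50)
plt.plot(xs, slope * xs + intercept)
plt.xlabel("Neuroticism coefficient")
plt.ylabel("Openness coefficient")
plt.colorbar(label="Cybercrime risk coefficient")
plt.show()
```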

A cluster analysis heatmap was used to contrast and highlight the most frequent instances of the FFPT features related to cybercrime risk, particularly neuroticism and openness. This can simplify and condense scatter plot data into a smaller diagram by using cluster analysis to group similar data values together (whereas in the scatter plot, each sampled data point is shown). The heatmap was created using the coefficients from the learning tree analysis, but in a unique way, by clustering or grouping incoming big data features and the target variable. A dendrogram was created within the heatmap to show the feature cluster relationships in the context of high versus low cybercrime risk.
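An illustrative sketch of a cluster analysis heatmap with a dendrogram, assuming seaborn's clustermap and synthetic FFPT coefficients rather than the study's data, follows.

```python
# Illustrative sketch: clustered heatmap of FFPT factor coefficients plus the risk target,
# with dendrograms showing which features group together in high- vs. low-risk clusters.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "neuroticism":       rng.uniform(0.1, 0.95, 100),
    "extraversion":      rng.uniform(0.1, 0.95, 100),
    "openness":          rng.uniform(0.1, 0.95, 100),
    "agreeableness":     rng.uniform(0.1, 0.95, 100),
    "conscientiousness": rng.uniform(0.1, 0.95, 100),
})
df["cyber_risk"] = ((df["neuroticism"] + df["openness"]) / 2 > 0.6).astype(int)

sns.clustermap(df, cmap="YlGnBu", standard_scale=1)
plt.show()
```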
