Detection of Malicious Websites Using Symbolic Classifier
In this section, the research methodology is presented, along with the dataset used (description and preparation), the GPSC algorithm, the random hyperparameter search method, 5-fold cross-validation, and the evaluation methodology.
2.2. Dataset Description and Preparation
2.2.1. Dataset Transformation
To be usable by the GP symbolic classifier, the data must be in numeric format. However, the initial dataset consists of both numbers and strings. To transform the dataset into numeric format, the following modifications were made:
The url variable was omitted from further analysis; instead, the url_len variable is used, which is the length of each url;
The ip_add variable was replaced with the net_type variable, created by classifying the IP addresses into classes A, B, and C, which were later transformed into the values 0, 1, and 2;
The geo_location variable was transformed into numeric format using the ISO 3166-1 numeric code format;
The tld variable was transformed from string to numeric format using LabelEncoder, which encodes the labels with values between 0 and n_classes − 1. In this case, the tld variable has 1247 different types of tlds, i.e., the range of possible numeric values is 0–1246;
The who_is variable was transformed from complete/incomplete to the binary values 0 and 1;
The values “yes” and “no” of the initial https column were transformed into the binary values 1 and 0;
The js_len and js_obf_len variables are already in numeric format and remain unchanged;
The content variable was used to develop two additional variables: content_len and special_char. The content_len variable is the length of the content variable value, while special_char is the number of special characters in the string;
The labels (output of the symbolic classifier) were replaced with 1 and 0: 1 for a malicious website and 0 for a benign website.
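The transformations above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: the raw string values assumed here (e.g., "complete", "yes", "malicious") and the helper names are assumptions, and the label encoding mimics scikit-learn's LabelEncoder without requiring the library.

```python
def encode_labels(values):
    """Encode string labels with integers 0..n_classes-1 (as LabelEncoder does)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]

def net_type(ip_addr):
    """Classify an IPv4 address into class A/B/C and map it to 0/1/2."""
    first_octet = int(ip_addr.split(".")[0])
    if first_octet < 128:   # class A: 0.0.0.0-127.255.255.255
        return 0
    if first_octet < 192:   # class B: 128.0.0.0-191.255.255.255
        return 1
    return 2                # class C (per the text, only A/B/C occur)

def special_char(content):
    """Count special (non-alphanumeric, non-space) characters in a string."""
    return sum(1 for c in content if not c.isalnum() and not c.isspace())

def transform_row(row):
    """Apply the per-row transformations described in the text (illustrative)."""
    return {
        "url_len": len(row["url"]),
        "net_type": net_type(row["ip_add"]),
        "who_is": 1 if row["who_is"] == "complete" else 0,   # assumed mapping
        "https": 1 if row["https"] == "yes" else 0,
        "js_len": row["js_len"],
        "js_obf_len": row["js_obf_len"],
        "content_len": len(row["content"]),
        "special_char": special_char(row["content"]),
        "label": 1 if row["label"] == "malicious" else 0,
    }
```

The tld and geo_location columns would be encoded over the whole column at once (with encode_labels and the ISO 3166-1 numeric codes, respectively) rather than row by row.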
2.2.2. Statistical Data Analysis
where i is in the range from 0 to 9. The results of the statistical analysis show that each dataset variable has different mean and standard deviation values, with different ranges between the minimum and maximum values, so scaling/normalizing techniques should, in principle, be applied. However, an initial investigation without any scaling/normalizing techniques showed that symbolic expressions with high classification accuracy were obtained, so scaling/normalizing was omitted from further investigation.
The highest positive correlation values with the target (output) variable “label” (y) are achieved with the “js_len”, “js_obf_len”, “special_char”, and “content_len” input variables. The highest negative correlation value with the target (output) variable “label” (y) is achieved with the “https” input variable. The correlation analysis also showed that the output (target) variable “label” (y) does not have any notable correlation with the “url_len”, “geo_loc”, “tld”, and “net_type” input variables. However, in the GPSC algorithm, all input variables will be used.
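Such a correlation analysis can be reproduced by computing the Pearson correlation coefficient between each input column and the label column. Below is a dependency-free sketch (in practice, pandas' DataFrame.corr() does the same in one call); the variable names in the commented usage line are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical usage: correlation of every feature column with the target:
# correlations = {name: pearson(col, label_col) for name, col in features.items()}
```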
2.3. Dataset Balancing Methods
In this paper, undersampling and oversampling methods were used to investigate their influence on classification accuracy. First, classic random undersampling and oversampling were applied. The majority of undersampling methods, such as Condensed Nearest Neighbour, Edited Nearest Neighbours, Repeated Edited Nearest Neighbours, All KNN, Instance Hardness Threshold, Near Miss, Neighborhood Cleaning Rule, One-Sided Selection, and Tomek Links, did not produce a balanced dataset, i.e., they drastically lowered the number of samples of the benign website class, so they were omitted from further investigation. The application of oversampling methods, however, did achieve balanced datasets. In this investigation, the following oversampling methods were used: SMOTE, ADASYN, BorderlineSMOTE, and KMeansSMOTE.
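The SMOTE-family methods come from the imbalanced-learn library; the classic random methods, however, are simple enough to sketch directly. The following is an illustrative implementation (not the authors' code) of random undersampling and random oversampling of two class partitions:

```python
import random

def random_undersample(majority, minority, seed=42):
    """Randomly drop majority-class samples until both classes are equal in size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

def random_oversample(majority, minority, seed=42):
    """Randomly duplicate minority-class samples until both classes are equal in size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra
```

Random undersampling discards (possibly useful) majority-class information, while random oversampling repeats existing minority samples verbatim; the SMOTE variants instead synthesize new minority samples by interpolation.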
Random Undersampling and Oversampling Methods
2.5. Genetic Programming—Symbolic Classifier
An example of a symbolic expression represented in tree structure form is shown in Figure 4. Another important property of the tree structure is that the size of the tree is measured by its depth. The functions are defined and stored in the function_set, which is also a hyperparameter defined before the execution of GP. The list of functions used in these investigations is shown in Table 6.
In gplearn, the logarithm is used in protected form: the output is the natural logarithm of the absolute value of the argument. However, if the absolute value of the argument is equal to or lower than 0.001, the output of the protected logarithm is equal to 0. The function for calculating the logarithm with base 10 can be written as log10(x) = log(x)/log(10), and the logarithm with base 2 has a similar definition as the log10 function. The cube root function can be written as x^(1/3).
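The protected definitions above can be written out in Python as follows. This is a sketch of the idea; the 0.001 threshold matches gplearn's built-in protected log, but log10, log2, and the sign-preserving cube root are assumed custom additions:

```python
import math

def protected_log(x):
    """Natural log of |x|; returns 0 when |x| <= 0.001 (as in gplearn's 'log')."""
    return math.log(abs(x)) if abs(x) > 0.001 else 0.0

def log10(x):
    """Base-10 logarithm built from the protected natural logarithm."""
    return protected_log(x) / math.log(10)

def log2(x):
    """Base-2 logarithm, defined analogously to log10."""
    return protected_log(x) / math.log(2)

def cbrt(x):
    """Cube root that preserves the sign of the argument."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)
```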
After the evaluation of each population member in one generation, the tournament selection process is performed, and the winners of the tournament selections are used as parents of the next generation, i.e., the genetic operations are performed on these winners. In the tournament selection process, members of the population are randomly selected; the randomly selected members are then compared with each other, and the best among them is the winner of the tournament selection. The tournament selection size in gplearn is defined using the tournament_size hyperparameter.
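The mechanism can be sketched in a few lines; this is an illustrative implementation under the assumption that a lower fitness value is better (as with a classification error measure), not gplearn's internal code:

```python
import random

def tournament_select(population, fitness, tournament_size, rng=random):
    """Randomly draw tournament_size members and return the one with the best
    (here: lowest) fitness -- the winner of the tournament selection."""
    contestants = rng.sample(population, tournament_size)
    return min(contestants, key=fitness)
```

A small tournament_size keeps selection pressure low and preserves diversity; at the extreme, setting it equal to the population size always returns the single best member.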
In this investigation, four different genetic operations were used, i.e., crossover and three types of mutation (subtree, hoist, and point).

In crossover, the first winner of the tournament selection is taken and the subtree that will be replaced is randomly selected. Then, on the second winner of the tournament selection, a subtree is randomly selected and inserted into the first winner to form a new population member of the next generation. The proportion of crossover operations in the gplearn library is defined by setting the value of the p_crossover hyperparameter.

In subtree mutation, the winner of the tournament selection is taken and a random subtree is selected, which is replaced with a randomly generated subtree built from elements of the primitive set. After the subtree replacement, a new population member of the next generation is created. The proportion of subtree mutation operations is defined by the p_subtree_mutation hyperparameter.

In hoist mutation, a random subtree is selected on the winner of the tournament selection, and inside that subtree another subtree is randomly selected. The second randomly selected subtree replaces the original subtree, creating a new population member of the next generation. The proportion of hoist mutation operations is defined by the p_hoist_mutation hyperparameter.

In point mutation, random nodes are selected on the winner of the tournament selection. Constants and variables are replaced with randomly chosen constants and variables from the primitive set; functions are also replaced with randomly chosen functions, with the restriction that a newly chosen function must have the same number of arguments as the original function. The proportion of point mutation operations is defined by the p_point_mutation hyperparameter.

The sum of the probabilities of all these genetic operations must be at most 1. If the sum is less than 1, the remaining balance of genetic operations falls back on reproduction, i.e., the tournament winners are cloned and enter the next generation unmodified.
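The probability budget across the four operations and the implicit reproduction remainder can be made explicit with a small helper (illustrative, not part of gplearn):

```python
def reproduction_probability(p_crossover, p_subtree_mutation,
                             p_hoist_mutation, p_point_mutation):
    """Probability mass left over for reproduction (cloning the tournament
    winner). The four genetic-operation probabilities must not exceed 1."""
    total = p_crossover + p_subtree_mutation + p_hoist_mutation + p_point_mutation
    if total > 1:
        raise ValueError("genetic operation probabilities must sum to at most 1")
    return 1 - total
```

For example, with p_crossover = 0.9 and each mutation probability at 0.03, the remaining 0.01 of new population members are unmodified clones.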
The evolution of symbolic expressions continues until the value of the stopping_criteria hyperparameter is reached or the maximum number of generations is reached. The maximum number of generations is self-explanatory, and it is usually the dominating criterion for stopping GP execution. The stopping_criteria is the minimum value of the fitness function; if one of the population members reaches this value, the execution of GP is terminated.
The size of the sub-samples drawn from the training dataset can be defined with the max_samples hyperparameter, which provides more diverse looks at individual symbolic expressions evaluated on smaller portions of the data. If the max_samples value is set to 1, no subsampling is performed. If the value is set below 1, the out-of-bag (OOB) fitness value is shown during the execution of GP. For a good evolution process, the OOB fitness value of the best symbolic expression should be near its true fitness value.
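gplearn performs this subsampling internally; the idea can be sketched as drawing a fraction max_samples of the training indices for fitness evaluation, with the left-out indices forming the OOB set (illustrative code, assumed names):

```python
import random

def draw_subsample(n_samples, max_samples, rng=random):
    """Return (in_bag, oob): indices of the training sub-sample used to evaluate
    one program, and the remaining out-of-bag indices for the OOB fitness."""
    k = max(1, round(max_samples * n_samples))
    in_bag = set(rng.sample(range(n_samples), k))
    oob = [i for i in range(n_samples) if i not in in_bag]
    return sorted(in_bag), oob
```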
The Advantages and Disadvantages of the GPSC Algorithm
For any dataset with defined input variables and a target variable, the GPSC will, during its execution, try to connect the input variables with the target variable in the form of a symbolic expression (mathematical equation);
The obtained symbolic expressions are sometimes easier to understand and use than complex ML models;
It is not necessary for the user to have complete prior knowledge of the problem and its solutions.
The disadvantages of the GPSC algorithm are:
The dataset size has some influence on GPSC performance: the larger the dataset, the more memory is required to compute the output of each population member;
The correlation between at least some input variables and the target (output) variable has to be high (Pearson's or Spearman's correlation value in the range of −1.0 to −0.5 or 0.5 to 1.0). If all input variables have a low correlation with the output variable (in the range of −0.5 to 0.5), the bloat phenomenon can occur during the training process (the growth of the size of population members without any benefit to the fitness value), and the obtained symbolic expression will have low accuracy;
The choice of GPSC hyperparameters has a great influence on the training time of the GPSC algorithm, as well as on the performance of the obtained symbolic expression in terms of its accuracy;
The most sensitive hyperparameter in the GPSC algorithm is the parsimony_coefficient. If its value is too low, the average size of population members can grow rapidly within a few generations, which can result in a long training process or a memory overflow error. If the value is too high (for example, 10), it can choke the evolution process, i.e., result in poor performance of the obtained symbolic expression.