
In this section, the research methodology is presented, together with the dataset used (its description and preparation), the GPSC algorithm, the random hyperparameter search method, 5-fold cross-validation, and the evaluation methodology.

2.2. Dataset Description and Preparation

The publicly available dataset used in this research can be downloaded from [23]. As already stated, the dataset had to be transformed into numeric format before it could be used with the GPSC algorithm. The initial form of the dataset is shown in Table 2.
In Table 2, the website classification dataset is presented in its original form. The dataset consists of url, url_len, ip_add, geo_loc, tld, who_is, https, js_len, js_obf_len, content, and the label. The url is a variable containing web addresses of benign and malicious websites; the variable name stands for Uniform Resource Locator, also known as web address [24]. The url_len is the length of the url address, i.e., the number of characters in the web address. The ip_add is the internet protocol address of the website. The geo_loc is the country from which the website originates. The tld stands for top-level domain (com, de, net, …). The who_is variable describes whether the website has complete or incomplete information available at the WhoIs website [25]. The https variable has two values, "yes" and "no", which indicate whether or not the website uses hypertext transfer protocol secure. The js_len variable represents the length of JavaScript code on the webpage. The js_obf_len variable represents the length of the obfuscated JavaScript code. The content variable represents the raw webpage content, including the JavaScript code. Finally, the label variable represents the class label for the benign or malicious website. All variables except the label variable will eventually end up as the input to the GP symbolic classifier, and the label will be the output variable used for training and testing the GP symbolic classifier. However, the main problem of the dataset is that the majority of variables are of string type, which means that some kind of data transformation is needed to represent all the variables in numeric format.
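As a minimal sketch, the raw dataset can be loaded and inspected with pandas; the file name below is an assumption, since the exact file provided at [23] is not specified here.

```python
import pandas as pd

# File name is a placeholder for the dataset downloaded from [23]
df = pd.read_csv("website_classification_dataset.csv")

# The raw columns from Table 2: url, url_len, ip_add, geo_loc, tld, who_is,
# https, js_len, js_obf_len, content, label
print(df.dtypes)
print(df["label"].value_counts())
```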

2.2.1. Dataset Transformation

To obtain a dataset which can be used in the GP symbolic classifier, the data must be in numeric format. However, the initial dataset consists of both numbers and strings. To transform the dataset into numeric format, the following modifications were made (a code sketch of these transformations is given after the list):

  • The url variable was omitted from further analysis; instead, url_len, the length of each url, is used;

  • The ip_add was replaced with the net_type variable, which was created by classifying the IP addresses into classes A, B, and C, later transformed into the values 0, 1, and 2;

  • The geo_loc is transformed into numeric format using the ISO 3166-1 numeric code format [26];
  • The tld was transformed from string to number format using LabelEncoder [27]. The LabelEncoder encodes the labels with values between 0 and n_classes − 1. In this case, the tld variable has 1247 different types of tlds, i.e., the range of possible numeric values is 0–1246.
  • The who_is was transformed from complete/incomplete to binary values 0 and 1.

  • The initial https column values “yes” and “no” were transformed into binary values 1 and 0.

  • The js_len represents the total length of JavaScript code embedded in the HTML code of a website;

  • The js_len and js_obf_len variables are already in numeric format and will remain unchanged;

  • The content variable will be used to develop two additional variables, i.e., content_len and special_char. The content_len is the length of the content variable value. The special_char represents the number of special characters in the content string.

  • Labels (the output of the symbolic classifier) were replaced with 1 and 0: 1 for a malicious website and 0 for a benign website.
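A minimal sketch of these transformations in Python follows, assuming a pandas DataFrame df with the original columns from Table 2. The raw string values assumed for who_is and label, as well as the use of the pycountry package for the ISO 3166-1 lookup, are illustrative choices, not details confirmed by the paper.

```python
import re
import pycountry
from sklearn.preprocessing import LabelEncoder

# tld: string -> integer codes in the range 0..n_classes-1
df["tld"] = LabelEncoder().fit_transform(df["tld"])

# who_is: complete/incomplete -> binary (mapping direction is an assumption)
df["who_is"] = df["who_is"].map({"complete": 1, "incomplete": 0})

# https: "yes"/"no" -> 1/0
df["https"] = df["https"].map({"yes": 1, "no": 0})

# geo_loc: country name -> ISO 3166-1 numeric code; some country names may
# need manual normalization before the lookup succeeds
df["geo_loc"] = df["geo_loc"].apply(
    lambda c: int(pycountry.countries.lookup(c).numeric))

# content -> two derived numeric variables
df["content_len"] = df["content"].str.len()
df["special_char"] = df["content"].apply(
    lambda s: len(re.findall(r"[^A-Za-z0-9\s]", s)))

# label: malicious -> 1, benign -> 0 (the raw values "bad"/"good" are an assumption)
df["label"] = df["label"].map({"bad": 1, "good": 0})

# url is dropped in favor of the existing url_len column
df = df.drop(columns=["url"])
```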

The geo_loc was initially a string (the name of the country), which was later transformed into a numeric value using ISO 3166-1 country codes in numeric format. According to [28], an IP address can be categorized into one of three classes, A, B, and C, which were eventually transformed into 0, 1, and 2, respectively. Each class of IP address has its range, network address, and host address. It should be noted that each byte of the IP address is represented in xxx form. The range of an IP address is specified by the range of decimal values of the first byte of the network number. The number of bytes of the IP address dedicated to the network part of the address is labeled as the network address. The number of bytes dedicated to the host part of the address is labeled as the host address. The division of IP address spaces is shown in Table 3.
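The classification of IP addresses into net_type can be sketched as below; the first-octet boundaries follow the conventional classful ranges, while the exact ranges used by the authors are those given in Table 3.

```python
def ip_to_net_type(ip_address: str) -> int:
    """Encode an IPv4 address as class A/B/C -> 0/1/2 based on its first byte."""
    first_octet = int(ip_address.split(".")[0])
    if first_octet <= 127:   # conventional class A range of network numbers
        return 0
    if first_octet <= 191:   # conventional class B range
        return 1
    return 2                 # conventional class C range

# replace the raw IP address with the numeric net_type variable
df["net_type"] = df["ip_add"].apply(ip_to_net_type)
df = df.drop(columns=["ip_add"])
```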

2.2.2. Statistical Data Analysis

After the necessary transformations were made, the dataset used in the later investigations was obtained. Here, the results of the statistical and correlation analyses are presented. The results of the initial statistical analysis are shown in Table 4.
In Table 4, besides the results of the statistical analysis, the symbols of the variables that will be used in GPSC are also presented. Since the GPSC algorithm is used to detect malicious websites, in all these investigations the output (target) variable will be "label" ($y$). The remaining 10 variables in the dataset are labeled $X_i$, where $i$ ranges from 0 to 9. Regarding the results of the statistical analysis, every dataset variable has different mean and standard deviation values, with different ranges between the minimum and maximum values, so scaling/normalizing techniques should be applied. However, an initial investigation without any scaling/normalizing techniques showed that symbolic expressions with high classification accuracy were obtained, so the implementation of scaling/normalizing techniques was omitted from further investigation.

To investigate the relationship between the input and output variables, Pearson's correlation analysis [29] was performed. Pearson's correlation values lie between −1 and 1. If the correlation value between an input variable and the output variable is equal to −1, then as the value of the input variable increases, the value of the output variable decreases, and vice versa. If the correlation value between an input variable and the output variable is equal to 1, then as the value of the input variable increases, the value of the output variable also increases, and vice versa. The worst possible correlation value between an input variable and the output variable is 0, which means that increasing or decreasing the value of the input variable has no effect on the output variable. The best ranges of Pearson's correlation value are between −1.0 and −0.5 and between 0.5 and 1.0. The worst range of correlation values is between −0.5 and 0.5. The result of Pearson's correlation analysis performed on the original dataset is shown in Figure 2.
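Both the statistical analysis behind Table 4 and the correlation analysis behind Figure 2 can be reproduced with pandas; a minimal sketch follows, assuming the transformed DataFrame df from the earlier sketches.

```python
# Per-variable mean, standard deviation, minimum and maximum (cf. Table 4)
print(df.describe().T[["mean", "std", "min", "max"]])

# Pearson's correlation of every input variable with the target "label" (cf. Figure 2)
correlations = df.corr(method="pearson")["label"].drop("label")
print(correlations.sort_values(ascending=False))
```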
From Figure 2 it can be noticed that the highest positive correlation values with the target (output) variable "label" ($y$) are achieved by the input variables "who_is" ($X_3$), "js_len" ($X_5$), "js_obf_len" ($X_6$), "special_char" ($X_8$), and "content_len" ($X_9$). The highest negative correlation value with the target (output) variable "label" ($y$) is achieved with "https" ($X_4$) as the input variable. The correlation analysis showed that the output (target) variable "label" ($y$) does not have any correlation with the "url_len" ($X_0$), "geo_loc" ($X_1$), "tld" ($X_2$), and "net_type" ($X_7$) input variables. However, in the GPSC algorithm, all input variables will be used.

From the conducted correlation analysis it can be concluded that the most influential indicators of malicious/benign websites are: the information about the website provided on the WhoIs website, the JavaScript code embedded in the HTML of the website, the obfuscated JavaScript code in the HTML of the website, the number of special characters and the content length of the website description, and whether the website URL uses the HTTPS protocol. The length of the URL, the geographic location, the top-level domain, and the net type extracted from the website IP address do not have any influence on the target variable.

2.3. Dataset Balancing Methods

The main problem with this dataset is that it is imbalanced, i.e., the dataset contains 1,526,619 benign websites and 35,315 malicious websites, as shown in Figure 3a. This type of dataset can yield deceptively high accuracy just by predicting the majority class, while the ML algorithm fails to capture the minority class, which is usually the point of creating the model in the first place. The problem of an imbalanced dataset can be solved using undersampling and oversampling methods.

In this paper, the undersampling and oversampling methods were used to investigate their influence on classification accuracy. First, the classic random undersampling and oversampling were applied. The majority of undersampling methods, such as Condensed Nearest Neighbour, Edited Nearest Neighbours, Repeated Edited Nearest Neighbors, All KNN, Instance Hardness Threshold, Near Miss, Neighborhood Cleaning Rule, One-Sided Selection, and Tomek Links, did not balance the dataset, i.e., they drastically lowered the number of samples of the benign websites class, so they were omitted from further investigation. However, the application of oversampling methods achieved balanced datasets. In this investigation, the following oversampling methods were used: SMOTE, ADASYN, BorderlineSMOTE, and KMeansSMOTE.

Random Undersampling and Oversampling Methods

Both random undersampling and random oversampling are the most basic balancing methods that can be applied to imbalanced datasets. In the random undersampling method, the majority class is under-sampled by randomly picking samples from the majority class to reduce its number of samples to match the number of samples in the minority class. In the random oversampling method, randomly picked samples of the minority class are duplicated until their number matches the number of samples in the majority class. The results of the random undersampling and oversampling methods are shown in Figure 3b,c and Table 5.
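A hedged sketch of how these balancing methods can be applied with the imbalanced-learn package is given below; X and y denote the input matrix and the "label" vector, and the default sampler settings shown are not necessarily those used in the paper.

```python
from collections import Counter

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import (
    RandomOverSampler, SMOTE, ADASYN, BorderlineSMOTE, KMeansSMOTE)

samplers = {
    "random undersampling": RandomUnderSampler(random_state=0),
    "random oversampling": RandomOverSampler(random_state=0),
    "SMOTE": SMOTE(random_state=0),
    "ADASYN": ADASYN(random_state=0),
    "BorderlineSMOTE": BorderlineSMOTE(random_state=0),
    "KMeansSMOTE": KMeansSMOTE(random_state=0),
}

# Each sampler returns a resampled copy of the dataset; the class counts
# before and after show whether the dataset was balanced.
print("original:", Counter(y))
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))
```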

2.5. Genetic Programming—Symbolic Classifier

In this paper, all investigations with the genetic programming symbolic classifier (GPSC) were performed using the gplearn library [34] in the Python programming language. GPSC starts by creating an initial population of random, naive symbolic expressions that are unfit for the particular task and, using genetic operations (crossover and mutation) over a number of generations, makes them fit for that task. The population members in GP are represented as tree structures. For example, the equation

$$y = \min(X_1 + X_2,\; X_1 + 3 \cdot X_3)$$

represented in tree-structure form is shown in Figure 4.

As seen in Figure 4, min is the root node, and the lines connecting the add functions are called branches. All other elements below the root node are called nodes, and the last elements in the tree structure are called leaves ($X_1$, $X_2$, 3, $X_3$). Another important point regarding the tree structure is that the size of the tree is measured by its depth.

To build the initial population of symbolic expressions, constant values, functions, and input variables are required. The constant values are specified with the hyperparameter const_range, i.e., the range of constant values defined before executing GP. During the execution, GP randomly takes constants from that range to create symbolic expressions. The variables are the input columns of the dataset used for training, and in the symbolic expression they are represented as $X_1, X_2, \ldots, X_n$. The functions are defined and stored in function_set, which is also a hyperparameter defined before the execution of GP. The list of functions used in these investigations is shown in Table 6.

In Table 6, the numbers in the arity column indicate the number of arguments taken by each mathematical function. The majority of functions shown in Table 6 are already integrated into the gplearn library, with the exception of the logarithms with base 2 and base 10 and the cube root. The definitions of these functions had to be modified to avoid errors during execution.
The logarithm with base 2 can be written as:

$$y = \begin{cases} \log_2(x) & \text{if } |x| > 0.001,\\ 0 & \text{if } |x| \le 0.001. \end{cases} \quad (14)$$
So, if the absolute value of the argument ($x$) is larger than 0.001, the function will calculate $\log_2(x)$. However, if the absolute value of the argument is equal to or lower than 0.001, the output of $\log_2(x)$ is set to 0. The function for calculating the logarithm with base 10 can be written as:

$$y = \begin{cases} \log_{10}(x) & \text{if } |x| > 0.001,\\ 0 & \text{if } |x| \le 0.001. \end{cases} \quad (15)$$
The $\log_{10}$ function has a similar definition to the $\log_2(x)$ function. The cube root function can be written as:

$$y = \operatorname{sgn}(x)\,\sqrt[3]{|x|}, \quad (16)$$

which is defined for negative arguments as well.

It should be noted that the $y$ used in Equations (14)–(16) does not have any connection with the output (target) variable used in this investigation. In GP, the primitive set consists of terminals and mathematical functions, while the terminals consist of the dataset variables and a range of constants. The primitive set is used to build the initial population and later in the process of genetic operations (crossover and mutation).
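A sketch of how the modified functions in Equations (14)–(16) can be registered as custom gplearn functions is shown below; taking the absolute value inside the logarithms is an extra guard against negative arguments (mirroring gplearn's built-in protected logarithm) and is an assumption on our part, not a detail stated in the paper.

```python
import numpy as np
from gplearn.functions import make_function

def _protected_log2(x):
    # Equation (14): log2 where |x| > 0.001, otherwise 0
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(np.abs(x) > 0.001, np.log2(np.abs(x)), 0.0)

def _protected_log10(x):
    # Equation (15): log10 where |x| > 0.001, otherwise 0
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(np.abs(x) > 0.001, np.log10(np.abs(x)), 0.0)

def _cube_root(x):
    # Equation (16): sign-preserving cube root, defined for negative arguments
    return np.cbrt(x)

log2 = make_function(function=_protected_log2, name="log2", arity=1)
log10 = make_function(function=_protected_log10, name="log10", arity=1)
cbrt = make_function(function=_cube_root, name="cbrt", arity=1)
```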
The size of the initial population is specified with the hyperparameter population_size. To build the initial population, two more hyperparameters are required, i.e., init_method and init_depth. The init_method used in all these investigations is ramped half-and-half. According to [35], ramped half-and-half is the most commonly used method. In this method, half of the initial population is created using the full method and the other half using the grow method. In the full method, tree nodes are taken from the function set until the maximum tree depth is reached; after that, only members of the terminal set can be chosen. The main disadvantage of this method is that all population members have the same tree depth. In the grow method, nodes are selected from the entire primitive set until the depth limit is reached, after which only terminals are chosen. The term ramped ensures diversity in tree depths, which means that a tree-depth range has to be specified initially. In gplearn, this is done with the hyperparameter init_depth.
After the initial population is created, the fitness function must be applied to evaluate each population member. The evaluation of each population member is done using the decision function and the fitness function. Since each population member is a symbolic expression, its output is determined by providing it with instances of the training dataset. Using these instances, each population member generates an output, i.e., a positive or negative number. The output of the symbolic expression is used as the argument of the decision function, which in this case is the Sigmoid decision function and can be written as:

$$y(x) = \frac{1}{1 + e^{-x}}.$$
The output of a population member is transformed through the decision function into the probability of each class, which is used in the fitness function, i.e., the log loss [36] function, to calculate how close the predicted probability is to the corresponding actual class (0/1 in the case of binary classification). In gplearn, the decision function is defined through the hyperparameter transformer, while the fitness function is defined with the hyperparameter metric. In this paper, all GPSC executions were conducted with the Sigmoid transformer, while the fitness function (metric) was the log loss.
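The evaluation step can be illustrated with a short numeric sketch; the raw outputs below are made-up values standing in for the outputs of a symbolic expression on three training samples.

```python
import numpy as np

def sigmoid(raw_output):
    # Sigmoid decision function: y(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-raw_output))

def log_loss(y_true, y_prob, eps=1e-15):
    # Binary cross-entropy; probabilities are clipped to avoid log(0)
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_prob)
                    + (1.0 - y_true) * np.log(1.0 - y_prob))

raw = np.array([-2.3, 0.4, 5.1])        # symbolic-expression outputs (made up)
y_true = np.array([0, 1, 1])            # actual class labels
probabilities = sigmoid(raw)            # class-1 probability per sample
print(log_loss(y_true, probabilities))  # fitness value to be minimized
```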

After the evaluation of each population member in one generation, the tournament selection process is performed, and the winners of the tournament selections are used as parents of the next generation, i.e., the genetic operations are performed on these winners. In the tournament selection process, members of the population are randomly selected. The randomly selected members are then compared with each other, and the best among them is the winner of the tournament selection. The tournament selection size in gplearn is defined using the tournament_size hyperparameter.

In this investigation, four different genetic operations were used, i.e., crossover and three types of mutation (subtree, hoist, and point). In crossover, the first winner of the tournament selection is taken and the subtree that will be replaced is randomly selected. Then, on the second winner of the tournament selection, a subtree is randomly selected and inserted into the first winner to form a new population member of the next generation. The share of crossover operations in the gplearn library is defined by setting the value of the hyperparameter p_crossover. In subtree mutation, the winner of the tournament selection is taken and a random subtree is selected, which is replaced with a randomly generated subtree built from elements of the primitive set. After the subtree replacement, a new population member of the next generation is created. The share of subtree mutation operations in the gplearn library is defined by setting the value of the hyperparameter p_subtree_mutation. In hoist mutation, a random subtree is selected on the winner of the tournament selection, and inside that subtree another subtree is randomly selected. The second randomly selected subtree replaces the original subtree, creating a new population member of the next generation. The share of hoist mutation operations in the gplearn library is defined by setting the value of the hyperparameter p_hoist_mutation. In point mutation, random nodes are selected on the winner of the tournament selection. Constants and variables are replaced with randomly chosen terminals from the primitive set. Functions are also replaced with randomly chosen functions; however, the newly chosen function must have the same number of arguments as the original function. The share of point mutation is defined by setting the value of the hyperparameter p_point_mutation. The sum of all these genetic operation probabilities must not exceed 1. If the sum is less than 1, the balance of genetic operations falls back on reproduction, i.e., tournament winners are cloned and enter the next generation unmodified.

The evolution of symbolic expressions is propagated until the value of the hyperparameter stopping_criteria is reached or the maximum number of generations is reached. The maximum number of generations is self-explanatory, and usually this is the dominating hyperparameter for stopping GP execution. The stopping_criteria is the minimum value of the fitness function; if one of the population members reaches this value, the execution of GP is terminated.

The size of sub-samples drawn from the training dataset can be defined with the hyperparameter max_samples, to get more diverse looks at individual symbolic expressions from smaller portions of the data. If the max_samples value is set to 1, no subsampling is performed. If the value is set below 1, the out-of-bag (OOB) fitness value is shown during the execution of GP. For a good evolution process, the OOB value of the best symbolic expression should be near the true fitness function value.

Another important hyperparameter in the GP symbolic classifier is the parsimony_coefficient value, which is responsible for preventing the bloat phenomenon. During the execution of GP, the size of symbolic expressions can rapidly increase from generation to generation without any benefit in lowering the fitness value. This can result in a time-consuming GP execution with low classification accuracy, or in a memory overflow that terminates GP execution without any solution. This growth of population members without any benefit to the fitness function value is called the bloat phenomenon. However, this problem can be addressed by adding a parsimony-coefficient penalty to the fitness function value, thereby making large programs unfavorable for winning the tournament selection. The value of this parameter has to be finely tuned: if the value is too large, it will choke the evolution process (population members will not evolve), and if the value is too small, bloat can happen very quickly. The list of the previously described hyperparameters and the ranges used in this investigation is shown in the following subsection (Table 7).
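To tie the hyperparameters together, a sketch of a gplearn SymbolicClassifier configured with them is shown below; the numeric values are illustrative placeholders rather than the tuned values from Table 7, and X_train/y_train stand for the training split of the balanced dataset.

```python
from gplearn.genetic import SymbolicClassifier

model = SymbolicClassifier(
    population_size=1000,
    generations=200,              # usually the dominating stopping criterion
    tournament_size=20,
    stopping_criteria=0.001,      # minimum fitness value that terminates evolution
    const_range=(-1.0, 1.0),
    init_depth=(3, 12),           # tree-depth range for ramped half-and-half
    init_method="half and half",
    # built-in functions; the custom log2/log10/cbrt from the earlier sketch
    # could be appended here
    function_set=("add", "sub", "mul", "div", "sqrt", "log"),
    transformer="sigmoid",        # Sigmoid decision function
    metric="log loss",            # fitness function
    p_crossover=0.9,
    p_subtree_mutation=0.01,
    p_hoist_mutation=0.01,
    p_point_mutation=0.01,        # remainder (0.07) falls back on reproduction
    max_samples=0.99,             # enables the out-of-bag (OOB) fitness display
    parsimony_coefficient=0.001,  # penalizes bloat
    verbose=1,
)
model.fit(X_train, y_train)
```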

The Advantages and Disadvantages of GPSC Algorithm

Every ML or deep learning algorithm has its advantages and disadvantages, and the GPSC algorithm is no exception. According to [37], the advantages of GPSC, as well as of the GP algorithm in general, are:
  • For any dataset with defined input variables and a target variable, GPSC will try during its execution to connect the input variables with the target variable in the form of a symbolic expression (mathematical equation);

  • The obtained symbolic expressions are sometimes easier to understand and use than complex ML models;

  • It is not necessary for the user to have complete knowledge of the problem and its solutions.

The disadvantages of the GPSC algorithm are:

  • The dataset size has some influence on GPSC performance: the larger the dataset, the more memory is required to calculate the output of each population member;

  • The correlation between at least some input variables and the target (output) variable has to be high (Pearson's or Spearman's correlation value in the range −1.0 to −0.5 or 0.5 to 1.0). If all input variables have a low correlation value with the output variable (in the range −0.5 to 0.5), the bloat phenomenon can occur during the training process (growth of the size of population members without any benefit to the fitness value) and the obtained symbolic expression will have low accuracy;

  • The choice of GPSC hyperparameters has a great influence on the training time of the GPSC algorithm as well as the performance of the obtained symbolic expression in terms of its accuracy;

  • The most sensitive hyperparameter in the GPSC algorithm is the parsimony_coefficient value. If the value is too low, the average size of population members can rapidly grow within a few generations, which can result in a long training process or end in a memory overflow. If the value is too high (for example, 10), it can choke the evolution process, i.e., result in poor performance of the obtained symbolic expression.
