Financial Technical Indicator and Algorithmic Trading Strategy Based on Machine Learning and Alternative Data
In this section, we describe the technical analysis indicators that will be used in the empirical part of this paper. We also briefly review the mathematical aspects of the two machine learning algorithms that we consider in our analysis: extreme gradient boosting (XGBoost) and light gradient boosted machine (LightGBM).

2.1. Technical Analysis Indicators

We classify the technical analysis indicators into two groups. The first one contains the indicators that, combined with the Sentiment and Popularity metrics, construct the Trend Indicator for each company in our dataset. By means of the XGBoost or LightGBM algorithms, the Trend Indicator aims to recognize the future trend of market prices. More precisely, it returns three labels ("0", "1", and "2"), where "2", "1", and "0" denote upward, neutral, and downward trends, respectively. These technical indicators are:

1. The Simple Moving Average (SMA) is the sample mean of the daily closing prices over a specific time interval (see Ellis and Parbery 2005, for more details). To assess the short-, medium-, and long-term price directions, we include in our analysis the SMA over 7, 80, and 160 days, respectively.
2. The Average True Range (ATR) measures the price variation over a specified time interval. We refer to Achelis (2001) for its formal definition. In the empirical analysis, we focus on the effect of variability over a short time interval, i.e., 7 days.
3. The Relative Strength Index (RSI) measures the magnitude of price changes (Levy 1967). This indicator ranges from 0 to 100, with two thresholds indicating oversold conditions at level 30 and overbought conditions at level 70. In our analysis, we include this index with a 7-day time horizon.
4. The Donchian Channel (DC) is used to detect strong price movements, looking for either a candlestick breaking the upper ceiling (for bullish movements) or the lower floor (for bearish ones).

All the aforementioned indicators were used as inputs in the supervised machine learning models, except the DC indicator, whose variations represent the output of the models. If the predicted DC variation is positive, the Trend Indicator assumes label "2"; it assumes label "0" if the DC variation is negative, and label "1" if the DC line is flat. For DC, we consider a 5-day time horizon.
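As an illustration, the following Python sketch shows how the first group of indicators and the Trend Indicator labels could be computed with pandas. It uses standard textbook formulas for SMA, ATR, RSI, and the Donchian Channel and derives the label from the future variation of the DC midline; the exact formulas, data columns, and the DC line used by the authors are assumptions made here for concreteness.

```python
import pandas as pd

def trend_indicator_features(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the first-group indicators; df is assumed to have 'high', 'low', 'close' columns."""
    out = pd.DataFrame(index=df.index)

    # Simple Moving Averages over short, medium and long horizons (7, 80, 160 days).
    for window in (7, 80, 160):
        out[f"sma_{window}"] = df["close"].rolling(window).mean()

    # Average True Range over 7 days (textbook definition of the true range).
    prev_close = df["close"].shift(1)
    true_range = pd.concat(
        [df["high"] - df["low"],
         (df["high"] - prev_close).abs(),
         (df["low"] - prev_close).abs()],
        axis=1,
    ).max(axis=1)
    out["atr_7"] = true_range.rolling(7).mean()

    # Relative Strength Index over 7 days (simple-average variant).
    delta = df["close"].diff()
    gain = delta.clip(lower=0).rolling(7).mean()
    loss = (-delta.clip(upper=0)).rolling(7).mean()
    out["rsi_7"] = 100 - 100 / (1 + gain / loss)
    return out

def trend_labels(df: pd.DataFrame, horizon: int = 5) -> pd.Series:
    """Label '2'/'1'/'0' from the future variation of the Donchian Channel midline (an assumption)."""
    upper = df["high"].rolling(horizon).max()
    lower = df["low"].rolling(horizon).min()
    midline = (upper + lower) / 2
    future_change = midline.shift(-horizon) - midline
    # Positive variation -> 2 (upward), negative -> 0 (downward), flat or undefined -> 1 (neutral).
    return future_change.apply(lambda x: 2 if x > 0 else (0 if x < 0 else 1))
```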

The second family contains the indicators that, combined with the Trend Indicator, will be used to generate buy/sell signals. More precisely, they are used as inputs in our stock-picking/trading algorithm. In this second group, we include the SMA indicator and the technical analysis indicators reported below:

1. The Average Directional Index (ADI) measures the strength of a trend, without telling us whether the movement is up or down. The ADI ranges from 0 to 100, and a value above 25 indicates that the trend of the company's price is strong enough to be traded.
2. The Momentum Indicator (MOM) is an anticipatory indicator that measures the rate of rise or fall of stock prices (see Thomsett 2019, and references therein).

In the empirical analysis, we use these indexes with a 7-day time horizon. Our classification is motivated by the fact that SMA, ATR, RSI, and DC are widely used for capturing price movements, while the remaining indicators are able to capture the buying/selling time. In fact, the ADI indicates the existence of a trend movement, while MOM tells us the direction of such a movement; therefore, a change in the sign of MOM can be translated into a buying/selling signal, as sketched in the example below.
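A minimal sketch of the resulting buy/sell rule is given below, assuming the ADI series is already available (e.g., from a technical-analysis library) and defining MOM as the 7-day price difference; the thresholds and the exact rule used in the paper may differ.

```python
import pandas as pd

def momentum_signals(close: pd.Series, adi: pd.Series, window: int = 7) -> pd.Series:
    """Return +1 (buy), -1 (sell) or 0 (hold) from MOM sign changes filtered by the ADI level."""
    mom = close - close.shift(window)          # Momentum over a 7-day horizon
    sign = mom.apply(lambda x: 1 if x > 0 else (-1 if x < 0 else 0))
    crossed = sign != sign.shift(1)            # MOM changed sign with respect to the previous day
    strong_trend = adi > 25                    # ADI above 25: trend strong enough to be traded
    signal = sign.where(crossed & strong_trend, other=0)
    return signal.fillna(0).astype(int)
```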

2.2. Extreme Gradient Boosting and Light Gradient Boosted Machine Algorithms

In this section, we present the basic theory behind the selected models: the XGBoost and LightGBM algorithms. Both numerical procedures belong to the same macro-family of decision-tree-based procedures, but they are among its most sophisticated and best-performing members, since they represent the most advanced level of boosting. In particular, XGBoost has the following main characteristics:

  • It computes second-order gradients to understand the direction of the gradients themselves.

  • It uses $L_1$ and $L_2$ regularizations and tree pruning to prevent overfitting.

  • It parallelizes computations on a single machine to improve training speed.
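For concreteness, the following hedged configuration shows how these characteristics map onto real xgboost parameters (reg_alpha, reg_lambda, gamma, n_jobs); the numeric values are placeholders and are not taken from the paper.

```python
from xgboost import XGBClassifier

# Illustrative configuration: parameter names map to the features listed above;
# the values are placeholders, not those used by the authors.
model = XGBClassifier(
    objective="multi:softprob",  # three-class Trend Indicator problem
    reg_alpha=0.1,               # L1 regularization on leaf weights
    reg_lambda=1.0,              # L2 regularization on leaf weights (lambda in Eq. (3))
    gamma=0.5,                   # minimum loss reduction (gain) required to make a split
    max_depth=6,
    n_jobs=-1,                   # parallelize tree construction on the local machine
    eval_metric="mlogloss",
)
```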

Letting $x_1, \ldots, x_n$ be the $n$ input feature vectors and $y_1, \ldots, y_n$ the corresponding observed outputs, a tree ensemble model combines $K$ trees to obtain the estimated outputs, that is:





$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i) \qquad (1)$$

where each $f_k$ is the prediction of the $k$-th decision tree. With this model construction, the training phase is carried out by minimizing a regularized loss function between the observed and predicted outputs. For a multiclass classification problem, which is our case, the multiclass logistic loss function (mlogloss) can be used. Let the true labels for a set of samples be encoded as a 1-of-$J$ binary indicator matrix $Y$, i.e.,



$y_{i,j} = 1$ if sample $i$ has label $j$, taken from a set of $J$ labels. Let $P$ be a matrix of probability estimates, with $p_{i,j} = \Pr(y_{i,j} = 1)$. Then, the mlogloss $L$ is defined as:



$$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{J} y_{i,j}\,\log\!\left(p_{i,j}\right). \qquad (2)$$
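Equation (2) can be transcribed directly into NumPy; the clipping constant below is only a numerical safeguard added here.

```python
import numpy as np

def mlogloss(Y: np.ndarray, P: np.ndarray, eps: float = 1e-15) -> float:
    """Multiclass logistic loss of Eq. (2): Y is an N x J one-hot matrix, P the N x J probabilities."""
    P = np.clip(P, eps, 1.0)                     # avoid log(0)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

# Example with N = 3 samples and J = 3 trend labels ("0", "1", "2").
Y = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
P = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]])
print(mlogloss(Y, P))  # average negative log-probability of the true labels
```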

Moreover, another important aspect is the regularization phase, in which the model controls its complexity, preventing the overfitting problem. The XGBoost algorithm uses the following regularizing function:



$$\Omega = \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T}\omega_j^2 \qquad (3)$$

where $T$ is the number of leaves on a tree, $\omega_j$ is the score on the $j$-th leaf of that tree (the vector of leaf scores belongs to $\mathbb{R}^T$), while $\gamma$ and $\lambda$ are the parameters used to control overfitting, setting the minimum gain required to split a node and the degree of $L_1$ or $L_2$ regularization, respectively. Combining (2) and (3), we have the objective function used in the minimization problem:



$$\mathrm{Obj} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k) \qquad (4)$$
where the first sum is used for controlling the predictive power; indeed, $l(\hat{y}_i, y_i)$ is a differentiable convex loss function that measures the difference between the prediction $\hat{y}_i$ and the target $y_i$, while the remaining term in (4) is used for controlling the complexity of the model itself. The XGBoost procedure exploits the gradient descent algorithm to minimize the quantity in (4): it is an iterative technique that computes the following quantity at each iteration (given an objective function):






$$\frac{\partial\,\mathrm{Obj}(y, \hat{y})}{\partial \hat{y}}$$
Then, the prediction $\hat{y}$ is improved along the direction of the gradient, so as to minimize the objective (actually, in order to make XGBoost converge faster, it also takes into consideration the second-order gradient through a Taylor approximation, since not all objective functions have derivatives). Therefore, in the end, removing all the constant terms, the resulting objective function is




$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n}\left[\,g_i f_t(x_i) + \frac{1}{2}\, h_i f_t^2(x_i)\right] + \Omega(f_t) \qquad (5)$$
where $g_i = \partial_{\hat{y}^{(t-1)}}\, l\!\left(y_i, \hat{y}^{(t-1)}\right)$ and $h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\!\left(y_i, \hat{y}^{(t-1)}\right)$.

Therefore, (5) is the objective function at the $t$-th step, and the goal of the model is to find an $f_t$ that optimizes this quantity.
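As a sketch of what $g_i$ and $h_i$ look like in the multiclass case, the standard per-class derivatives of the mlogloss with respect to the raw scores are the softmax probability minus the one-hot target and its diagonal second-order term; actual library implementations may scale or bound the second-order term differently.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    z = scores - scores.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mlogloss_grad_hess(scores: np.ndarray, Y: np.ndarray):
    """First (g) and second (h) order derivatives of the mlogloss w.r.t. the raw class scores."""
    P = softmax(scores)
    g = P - Y              # gradient: predicted probability minus one-hot target
    h = P * (1.0 - P)      # diagonal second-order term
    return g, h
```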

The main problem is to obtain a tree that improves the predictions along the gradient. To find such a tree, it is necessary to answer two further questions:

1. How can we find a good tree structure?
2. How can we assign prediction scores?

First, let us assume we already have the answer to the first question and let us try to answer the second one. It is possible to define a tree as $f_t(x) = \omega_{q(x)}$, where $q: \mathbb{R}^m \to \{1, \ldots, T\}$ is a "directing" function which assigns every data point to the $q(x)$-th leaf, so that $\mathcal{F} = \{\, f(x) = \omega_{q(x)} \,\}$. Therefore, it is possible to describe the prediction process as follows: $q$ maps each data point to a leaf, and $\omega$ assigns the corresponding prediction score to that leaf.
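A toy NumPy illustration of $f_t(x) = \omega_{q(x)}$, with $q$ precomputed as an array of leaf indices, may help fix ideas:

```python
import numpy as np

# Leaf scores of a toy tree with T = 3 leaves.
omega = np.array([-0.4, 0.1, 0.7])

# q(x): the "directing" function, here precomputed as the leaf index of each data point.
q_of_x = np.array([0, 2, 2, 1, 0])

# f_t(x_i) = omega[q(x_i)] for every data point.
predictions = omega[q_of_x]
print(predictions)  # [-0.4  0.7  0.7  0.1 -0.4]
```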

Then, it is necessary to define the set that contains the indices of the data points assigned to the $j$-th leaf as follows:

$$I_j = \{\, i \mid q(x_i) = j \,\} \qquad (6)$$

Thus, it is now possible to rewrite the objective function as

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n}\left[\,g_i f_t(x_i) + \frac{1}{2}\, h_i f_t^2(x_i)\right] + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T}\omega_j^2 = \sum_{j=1}^{T}\left[\left(\sum_{i \in I_j} g_i\right)\omega_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right)\omega_j^2\right] + \gamma T \qquad (7)$$




In (7), the first part contains a summation over the data points, while in the second one the summation is performed leaf by leaf over all the $T$ leaves. Since it is a quadratic problem in $\omega_j$, for a fixed structure $q(x)$ the optimal value is

$$\omega_j^* = -\,\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$

and, therefore, simply substituting, the corresponding value of the objective function is




$$\mathrm{Obj}^{(t)} = -\,\frac{1}{2}\sum_{j=1}^{T}\frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T \qquad (8)$$

where the leaf score $\omega_j$ is always related to the first- and second-order derivatives of the loss function, $g$ and $h$, and to the regularization parameter $\lambda$. This is how it is possible to find the score associated with a leaf, assuming the structure of the tree is known.
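The optimal leaf scores and the objective value (8) can be computed directly from the per-leaf sums of $g_i$ and $h_i$, as in the following sketch (a hypothetical helper written for illustration, not part of any library):

```python
import numpy as np

def leaf_weights_and_objective(g_per_leaf, h_per_leaf, lam: float, gamma: float):
    """Optimal leaf scores and the corresponding objective value of Eq. (8).

    g_per_leaf, h_per_leaf: lists with one array per leaf, holding the g_i and h_i
    of the data points assigned to that leaf.
    """
    T = len(g_per_leaf)
    weights, obj = [], gamma * T
    for g_j, h_j in zip(g_per_leaf, h_per_leaf):
        G, H = np.sum(g_j), np.sum(h_j)
        weights.append(-G / (H + lam))      # omega_j* = -sum(g_i) / (sum(h_i) + lambda)
        obj -= 0.5 * G ** 2 / (H + lam)     # contribution of leaf j to Obj^(t)
    return np.array(weights), obj
```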

Now, we move back to the first question: how can we find a good tree structure? Since this is a difficult question to answer, a good strategy is to split it into two sub-questions:

1. How do we choose the feature split?
2. When do we stop the split?

Starting from the first question, in any split the goal is, of course, to find the best split-point, i.e., the one that optimizes the objective function; therefore, for each feature it is necessary to first sort the values, then scan for the best split-point, and finally choose the best feature.

Every time a split is performed, a leaf is transformed into an internal node whose new leaves have scores different from the initial one.

Clearly, the principal aim is to calculate the gain (or the eventual loss) obtained from such a split. In other tree-based algorithms, this gain is generally computed through the Gini index or the entropy metric, but in XGBoost the calculation is based on the objective function. In particular, XGBoost exploits the set of indices $I$ of the data points assigned to the node, where $I_L$ and $I_R$ are the subsets of indices of the data points assigned to the two new leaves. Now, recalling that the best value of the objective function on the $j$-th leaf is (8) without the first summation and without the $T$ factor in the last term, the gain of the split is:



$$\mathrm{gain} = \frac{1}{2}\left[\,\frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda}\,\right] - \gamma \qquad (9)$$

where:

  • $\dfrac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda}$ is the value of the left leaf;

  • $\dfrac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda}$ is the value of the right leaf;

  • $\dfrac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda}$ is the objective of the previous leaf;

  • $\gamma$ is the parameter which controls the number of leaves (i.e., the complexity of the algorithm).

To understand whether transforming one leaf into two new leaves improves the objective or not, it is enough to look at the sign (positive or negative) of this gain. Therefore, in conclusion, to build a tree the XGBoost algorithm first finds the best split-points recursively until the maximum depth (specifiable by the user) is reached, and then it prunes out the nodes with negative gain in a bottom-up order.
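A didactic sketch of the split gain (9) and of the exhaustive split-point scan on a single feature is given below; the real XGBoost implementation relies on histogram-based and approximate variants of this search rather than the naive loop shown here.

```python
import numpy as np

def split_gain(g, h, left_mask, lam: float, gamma: float) -> float:
    """Gain of Eq. (9) for splitting a node whose points have gradients g and hessians h."""
    def leaf_value(mask):
        return np.sum(g[mask]) ** 2 / (np.sum(h[mask]) + lam)
    full = np.ones_like(left_mask, dtype=bool)
    return 0.5 * (leaf_value(left_mask) + leaf_value(~left_mask) - leaf_value(full)) - gamma

def best_split(x, g, h, lam: float = 1.0, gamma: float = 0.0):
    """Scan the sorted values of one feature and return the best split-point and its gain."""
    order = np.argsort(x)
    best = (None, -np.inf)
    for pos in range(1, len(x)):
        threshold = x[order[pos]]
        gain = split_gain(g, h, x < threshold, lam, gamma)
        if gain > best[1]:
            best = (threshold, gain)
    return best
```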

LightGBM is a fast, distributed, high-performance tree-based gradient boosting framework developed in Ke et al. (2017). The most important features that differentiate it from XGBoost are its faster training speed, its support for parallel, distributed, and GPU learning, and its ability to handle large-scale data. Another main difference between LightGBM and XGBoost is the way in which they grow a tree: the former uses leaf-wise tree growth, expanding only those leaves that bring a real benefit to the model, while the latter uses level-wise tree growth, expanding the tree one level at a time and then cutting off the unnecessary branches at the end of the process.
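As a hedged illustration of this difference, the two classifiers below expose the growth behavior through their native parameters (max_depth for XGBoost's level-wise growth, num_leaves for LightGBM's leaf-wise growth); the values and the training data X_train, y_train are placeholders, not the paper's settings.

```python
import lightgbm as lgb
import xgboost as xgb

# X_train: technical-indicator features, y_train: Trend Indicator labels {0, 1, 2} (placeholders).
xgb_model = xgb.XGBClassifier(
    objective="multi:softprob",
    max_depth=6,            # level-wise growth is bounded mainly by the tree depth
    n_estimators=300,
    learning_rate=0.05,
    eval_metric="mlogloss",
)

lgb_model = lgb.LGBMClassifier(
    objective="multiclass",
    num_leaves=31,          # leaf-wise growth is bounded mainly by the number of leaves
    n_estimators=300,
    learning_rate=0.05,
)

# xgb_model.fit(X_train, y_train)
# lgb_model.fit(X_train, y_train)
```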

The first thing that makes LightGBM faster in the training phase is the way it sorts the numbers: the algorithm takes the inputs and divides them into bins, greatly reducing the computational effort needed to test all the possible split combinations. This process, called the histogram (or bin) way of splitting, clearly makes the computations much faster than those of XGBoost. The second improving characteristic of this algorithm is called exclusive feature bundling (EFB), which reduces the dimension of features that are mutually exclusive. For example, if there are two features, green and red, which correspond to the color of a financial candle, each taking value 1 or 0 based on the corresponding candlestick's color, then these features are mutually exclusive, since a candle cannot be green and red at the same time. Thus, this process creates a new bundled feature with a lower dimension, using new values to identify the two cases, which in this situation are the number 11 for the green candle and the number 10 for the red one. Therefore, by reducing the dimensionality of some of the input features, the algorithm is able to run faster, since it has fewer features to evaluate. The third characteristic that differentiates LightGBM from XGBoost is the so-called gradient-based one-side sampling (GOSS), which helps the LightGBM algorithm iteratively choose the sample to use for the computations. Suppose that the dataset used has 100 training records; then the algorithm computes 100 gradients $G_1, G_2, \ldots, G_{100}$ and sorts them in descending order, for example, $G_{73}, G_{24}, \ldots, G_{8}$. Then, the first 20% of these records are taken out and an additional 10% is also randomly taken from the remaining 80% of gradient records. Therefore, since the gradients are sorted in descending order, the algorithm keeps the 20% of the records on which it performs poorly (a high gradient means a high error), and thus on which it still has a lot to learn, together with a random 10% of the remaining ones. Afterwards, these two subsets are combined to create the sample on which LightGBM trains; it then recomputes the gradients and applies GOSS again in an iterative way.
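A simplified NumPy sketch of one GOSS step is reported below; in the actual algorithm described in Ke et al. (2017), the randomly sampled low-gradient records are also re-weighted by a constant factor so that the estimated information gain is not distorted, a detail omitted here.

```python
import numpy as np

def goss_sample(gradients: np.ndarray, top_rate: float = 0.2, other_rate: float = 0.1):
    """Return the indices of the records kept by a single GOSS step."""
    n = len(gradients)
    order = np.argsort(np.abs(gradients))[::-1]          # sort by gradient magnitude, descending
    n_top = int(top_rate * n)
    top_idx = order[:n_top]                              # records with the largest gradients
    rest = order[n_top:]
    n_other = int(other_rate * n)
    rng = np.random.default_rng(0)
    other_idx = rng.choice(rest, size=n_other, replace=False)  # random share of the remaining records
    return np.concatenate([top_idx, other_idx])
```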
