Machine Learning in Finance Using Python
https://financetrain.com
Introduction
This ebook provides a conceptual framework as well as practical insights into working in the
Machine Learning field with Python. The book's contents combine theoretical concepts with
programming examples of how to apply machine learning algorithms using Python's Scikit-learn
library. All of the examples are related to the use of machine learning in finance.
We begin by outlining the initial stages of a machine learning project, such as data
preprocessing and the exploration of dataset features to determine how they relate to the
variable they are intended to predict. Then we explain the process of training and testing a
dataset for supervised learning problems, as well as the significance of the loss function when
evaluating the ability of a particular model. Following that, we discuss some issues in the
training step, such as the bias-variance trade-off and the importance of maintaining a good
balance between these elements.
After introducing the key ideas of a machine learning project's workflow, the following sections
explain popular algorithms from Supervised Learning as well as useful approaches to avoid
overfitting, such as the K-Fold Cross Validation technique.
Finally, we explain the basic concepts of Neural Networks and their structure. This advanced
technique is suitable for financial applications as it can understand complex relationships
among the data and work with different distributions of financial time series.
Machine learning has numerous applications in quantitative finance, and its techniques have advanced rapidly in recent years. Machine learning methods are commonly grouped into three broad paradigms:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
All three paradigms have their own procedures and are used for different tasks such as prediction, finding patterns in data, and categorizing data, among others.
In this e-book, we'll focus on supervised and unsupervised learning. We'll explain both domains
using Python examples to give a practical overview and theoretical explanations to help readers
grasp how these models function.
Machine Learning Steps
Working in machine learning requires a variety of skills, drawing on statistics, mathematics, and data cleaning and preprocessing. Throughout this book, we will focus on data preprocessing, training and testing models, model selection, performance evaluation, and hyper-parameter tuning [1]. Together these stages make up the end-to-end machine learning workflow.
These steps initially occur in this order, but it is very common that once we have the model's
predictions, we begin processing the data again to create new features, remove non-significant
features, modify model parameters (hyperparameter tuning), and change the model's algorithm
(model selection).
In the next section, we will explain important steps in the first stages of a machine learning
project. We will provide code examples using the Python Scikit Learn library, which has already
implemented a large number of machine learning algorithms.
Scikit-learn is the most widely used machine learning library in Python, with an extensive API for running machine learning algorithms.
[1] That is, changing the default parameter values of the models.
Data Preprocessing
Data preprocessing is where data scientists spend most of their time. These tasks entail selecting the appropriate features and cleaning and preparing them to become the inputs, or independent variables, of a machine learning model.
Model performance is strictly related to the selection and cleaning of the features (variables).
Below we describe common tasks which are necessary to conduct before fitting and evaluating
a model. These tasks improve the accuracy of the model due to the increase in the quality of
the inputs.
While conducting data analysis in python, each column of a dataset is loaded into the
environment with a specific data type. The most common data types are float, string, integer,
and datetime. In many cases, numerical data is loaded as a character or string type because it
may contain a specific character, causing the entire column to be interpreted as a character
column.
Similarly, datetime fields that do not have the correct format are interpreted as strings or
character objects. These variables must be transformed into a datetime object before
performing any date-based analysis. The same is true for numerical data.
One of the first steps in ensuring data integrity is to examine the data types of a dataset to see
if they have the correct type.
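As a quick illustration (a minimal sketch with hypothetical data, not the book's dataset), pandas makes it easy to inspect and correct the data types:
import pandas as pd

# Hypothetical raw data: prices read as strings, dates not yet parsed
df = pd.DataFrame({'date': ['2020-01-02', '2020-01-03'],
                   'close': ['300.35', '297.43']})
print(df.dtypes)  # both columns are reported as 'object' (strings)

# Convert each column to the proper type before any analysis
df['date'] = pd.to_datetime(df['date'])
df['close'] = pd.to_numeric(df['close'], errors='coerce')
print(df.dtypes)  # now datetime64[ns] and float64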
Missing values are a common issue in machine learning models, and their causes include
human errors, interruptions in the data flow, and incorrect measurements, among others. Most
machine learning algorithms do not accept missing values and throw errors if a dataset contains
them.
As a result, some mechanism to deal with or remove them is required to solve this problem.
Removing rows or columns with missing data can have an impact on model performance as the
data size decreases.
A more elegant method is to use a simple imputation approach, such as replacing missing values with the mean or median of the variable. The median is often preferable to the mean because it is not affected by outliers. This solution preserves the size of the data. In the case of categorical variables, missing values can be replaced with the most common category in the column.
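For example, a simple imputation in pandas could look like the following sketch (the small dataframe is hypothetical):
import pandas as pd
import numpy as np

df = pd.DataFrame({'price': [10.0, np.nan, 12.5, 11.0],
                   'rating': ['AAA', 'BBB', np.nan, 'AAA']})

# Numerical column: impute with the median, which is robust to outliers
df['price'] = df['price'].fillna(df['price'].median())
# Categorical column: impute with the most frequent category
df['rating'] = df['rating'].fillna(df['rating'].mode()[0])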
Handling Outliers
In financial time series data, outliers can be identified by plotting the distribution of the returns
and observing if the distribution has extreme fat tails, indicating an anomaly in the data.
Outliers can also be detected by using the percentiles of the data. Researchers can define percentage thresholds in the upper and lower tails of the distribution, and values beyond these limits are considered outliers.
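A minimal sketch of percentile-based detection, assuming a pandas Series of returns (the thresholds below are illustrative):
import pandas as pd

# Hypothetical daily returns; in practice these would come from price data
returns = pd.Series([0.01, -0.02, 0.015, 0.30, -0.01, 0.005, -0.25, 0.02])

# Values outside the 1st and 99th percentiles are flagged as outliers
lower, upper = returns.quantile(0.01), returns.quantile(0.99)
outliers = returns[(returns < lower) | (returns > upper)]
# One option is to cap (winsorize) the values at the thresholds instead of dropping rows
returns_clipped = returns.clip(lower=lower, upper=upper)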
Binning
Binning means grouping a numerical or categorical variable into bins. In cases where
categorical values have low frequency in a huge dataset, they can be binned into a category
called “others”, which makes the model more robust.
Value   Bin
30      Low
50      Mid
20      Low
100     High
80      High
55      Mid

Rating  Bin
AAA     Conservative
BBB     Moderate
C+      Risky
A+      Conservative
C       Risky
B+      Moderate
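The groupings shown in the table above can be reproduced in pandas, for example (the bin edges and rating mapping are illustrative):
import pandas as pd

# Numerical binning into Low / Mid / High
values = pd.Series([30, 50, 20, 100, 80, 55])
value_bins = pd.cut(values, bins=[0, 40, 70, 200], labels=['Low', 'Mid', 'High'])

# Categorical binning of credit ratings into broader risk buckets
ratings = pd.Series(['AAA', 'BBB', 'C+', 'A+', 'C', 'B+'])
rating_map = {'AAA': 'Conservative', 'A+': 'Conservative',
              'BBB': 'Moderate', 'B+': 'Moderate',
              'C+': 'Risky', 'C': 'Risky'}
rating_bins = ratings.map(rating_map)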
Logarithmic Transformation
This technique is used in many statistical analyses and machine learning models, as the log reduces the skewness of the data and brings its distribution closer to a normal distribution. Log transformation of the variables also decreases the effect of outliers.
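For instance, applying the transformation to a heavily skewed variable such as traded volume (hypothetical values):
import numpy as np
import pandas as pd

volume = pd.Series([1_000, 5_000, 20_000, 1_000_000])
# log1p (log of 1 + x) handles zero values gracefully and compresses large outliers
log_volume = np.log1p(volume)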
Encoding Data
One of the most common encoding methods in machine learning is called One Hot Encoding.
The method spreads the values in a column across multiple columns, where the values in the original column are used to name the new columns.
After the transformation, the new columns take two possible values which are 1 or 0. This
method is mainly used with categorical variables and is similar to creating dummy variables for
each of the categorical values on a specific column.
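A short sketch with pandas' get_dummies on a hypothetical categorical column:
import pandas as pd

df = pd.DataFrame({'sector': ['Tech', 'Energy', 'Tech', 'Finance']})
# One new column per category, holding 1 where the row belongs to that category and 0 otherwise
dummies = pd.get_dummies(df['sector'], prefix='sector')
df = df.join(dummies)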
Scaling
After scaling a dataset, all the continuous variables have comparable ranges. This process is not critical for some algorithms, but algorithms such as K-means (an unsupervised technique) work with distance measures, so all the inputs must be scaled to values that can be compared.
Normalization
$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
Normalization scales all values into the range 0 to 1. From each value of the variable, we subtract the minimum value and divide by the difference between the maximum and the minimum. This procedure does not change the shape of the feature's distribution. Outliers should be handled before normalization.
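The same formula is implemented by scikit-learn's MinMaxScaler; a minimal sketch:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [15.0], [40.0]])
# Rescales each column to the [0, 1] range using (X - Xmin) / (Xmax - Xmin)
X_norm = MinMaxScaler().fit_transform(X)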
Standardization
$$z = \frac{x - \mu}{\sigma}$$
This is also called the z-score. We scale each value of a column by subtracting the mean of the column and dividing by its standard deviation. This technique reduces the effect of outliers in each feature.
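Equivalently, scikit-learn's StandardScaler computes the z-score column by column; a minimal sketch:
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0], [20.0], [15.0], [40.0]])
# Subtracts each column's mean and divides by its standard deviation
X_std = StandardScaler().fit_transform(X)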
Feature Selection
Feature Selection is one of the core concepts in machine learning and has a high impact on
the performance of the model. Irrelevant or partially irrelevant features can negatively impact
the model's performance.
In this process, those features which contribute most to the prediction variable are selected. To
get an idea about which features could have more predictive power in a machine learning
model, we will load Open, High, Low, Close, and Volume (OHLCV) data for AMZ stock, and
create some new features using python.
Afterwards, we will use data visualizations and other common approaches for a smart selection
of the features.
• Make a Heat Map to show the correlation between each of the features and the target
variable.
• Fit a Random Forest Model to extract the feature importance of the independent
variables (we will explain this algorithm in future sections but now will use the readily
available Random Forest function in the scikit-learn library).
Python Setup
Before you proceed further, make sure that you have Python set up on your computer and
know how to install Python packages.
Before you import these libraries in your Python scripts, you need to install them using the pip command in your terminal.
On your computer, move to your project directory. This directory is where we will create all our Python files and store our data. In this directory, create a new file and save it as feature-selection.py. We've provided this file along with this ebook for your reference.
Add the following lines to the script to import the libraries we need.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
To run Python scripts with the python command, you need to open a command-line and type
in the word python , or python3 if you have both versions, followed by the path to your
script, just like this: $ python3 hello.py. If everything works okay, after you press Enter, your
script will be executed.
I’ve created another directory called data inside my working directory and stored the AMZ.csv
file in it. The data for AMZ stock is loaded into the environment using the read_csv() method of
pandas library.
amz = pd.read_csv("../data/AMZ.csv")
This will load the data from the AMZ.csv file into your environment. You can preview the data by adding print(amz) to your script and re-executing it.
We use the amz object dataframe which has OHLCV data from the AMZ ticker and pass it into
the get_technical_indicators() function. This function generates technical indicators to be used
as features for a machine learning model.
def get_technical_indicators(dataset):
    '''
    params:
        dataset: OHLCV data for the AMZ ticker from 1999-09-01 to 2019-09-20
    returns:
        features: dataframe with the calculations of technical indicators such as
        MACD, 20-period standard deviation, ROC, CCI, EMA
    '''
    # Sort values by date, oldest dates at the top
    dataset.sort_index(inplace=True)
    # Dataframe that will hold the engineered features
    features = pd.DataFrame(index=dataset.index)
    # Create MACD
    features['26ema'] = dataset['AdjClose'].ewm(span=26).mean()
    features['12ema'] = dataset['AdjClose'].ewm(span=12).mean()
    features['MACD'] = features['12ema'] - features['26ema']
    # The remaining indicators (20sd, TR, ma21, ROC, CCI) used later are computed
    # similarly; their code is not reproduced in this extract
    return features
features = get_technical_indicators(amz)
The features object stores all the calculations of the get_technical_indicators() function.
We will now create a new dataframe called vars_ and we will retrieve and store selected
features from the features object into it.
We will now write a function called correlation() that will make scatterplots between each of the
features and the target variable AdjClose. This will help us visualize the correlation between
AdjClose and each of the technical features ('MACD', '20sd', 'TR', 'ma21', 'ROC', 'CCI').
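Neither the construction of vars_ nor the correlation() helper is reproduced in this extract; the sketch below is one plausible version, assuming the full indicator set was computed by get_technical_indicators() (a later call in the book passes a slightly different set of arguments, so the exact signature may differ).
# Assumed selection of the engineered indicators into vars_
vars_ = features[['MACD', '20sd', 'TR', 'ma21', 'ROC', 'CCI']]

def correlation(df, variables, columns, n_rows, n_cols):
    # Scatterplot of each feature against the target variable AdjClose
    fig = plt.figure(figsize=(12, 6))
    for i, col in enumerate(columns):
        ax = fig.add_subplot(n_rows, n_cols, i + 1)
        ax.scatter(variables[col], df['AdjClose'], s=5)
        ax.set_title('AdjClose vs %s' % col)
    fig.tight_layout()
    plt.show()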
jet= plt.get_cmap('jet')
colors = iter(jet(np.linspace(0,1,10)))
columns = vars_.columns
correlation(amz,vars_,columns,2,3)
Figure 1: Feature Selection Scatterplots
The scatterplots show that there is a strong positive correlation between the ma21 variable and AdjClose. We can also see a positive correlation of 20sd and TR with the target variable AdjClose. On the other hand, AdjClose has a weak correlation with CCI and MACD.
Heat maps are another tool that we can use to explore the relevance of each feature with
respect to the target variable. Using the following code, we make a Heat Map between the
features and the target variable:
df = vars_.copy()
df['AdjClose'] = amz['AdjClose']
colormap = plt.cm.inferno
plt.figure(figsize=(10,5))
corr = df.corr()
sns.heatmap(corr[corr.index == 'AdjClose'], linewidths=0.1, vmax=1.0, square=True,
            cmap=colormap, linecolor='white', annot=True)
plt.show()
From the Heat Map, we observe that ma21 has an almost perfect correlation with the AdjClose variable. Also, 20sd and TR have a positive correlation with AdjClose, while MACD and CCI don't show a significant correlation with AdjClose.
Lastly, we will use a feature provided by the Random Forest algorithm that gives information about the importance of the different features. First, we need to fit the model and then read the attribute called feature_importances_ on the model object (these steps are explained in the next sections, but here we only want to inspect feature_importances_).
In the following lines of code, we fit a Random Forest Regressor model between the features and the target variable of the model and extract the feature_importances_ attribute. Finally, we make a bar plot of the features' importance measure.
# Fit Random Forest Regressor and extract feature importance
prices = pd.DataFrame(amz['AdjClose'])
vars_model = prices.join(vars_)
vars_model= vars_model.dropna()
X = vars_model[['MACD','20sd','TR','ROC','CCI']].values
y = vars_model['AdjClose'].values
forest = RandomForestRegressor(n_estimators=1000)
forest = forest.fit(X, y)
importances = forest.feature_importances_
# Pair each feature name with its importance score (same columns as used in X)
values = list(zip(['MACD','20sd','TR','ROC','CCI'], importances))
headers = ['feature','score']
values_df = pd.DataFrame(values, columns=headers)
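The plotting code for the importance chart is not reproduced in the extract; a minimal sketch using the values_df dataframe built above:
# Bar plot of the feature importance scores
values_df.sort_values('score', ascending=False).plot(
    kind='bar', x='feature', y='score', legend=False, figsize=(8, 5))
plt.ylabel('importance')
plt.tight_layout()
plt.show()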
The bar plot of the feature importance shows that 20sd has the highest importance in the prediction of the target variable AdjClose. In second place comes the TR feature, with significantly lower importance.
In this section, we provided some tools to analyze the predictive power of the features before starting to train a machine learning model. It is important to inspect the independent variables first in order to select the best features.
After the inspection of the features, the next step consists of the training and testing steps
where we split the data into the train and test data.
Train-Test Data
Training
Once the features of the model are selected and the data is cleaned, the next step is to
generate the train and test dataset. This step applies to Supervised Learning problems.
Unsupervised Learning models don’t require a train and a test dataset.
Classification and Regression problems would supervise or “train” a model with specific data
in order to provide predictions of the target variable y. The process of training a dataset is
conducted by choosing a set of relevant features or independent variables and combining these
with a response y (labelled data) which is the observed value of the target variable.
In this phase, the algorithm is trained on the data and will determine the influence of each
feature on the response y. Finally, we can make predictions for out-of-sample or unseen data
based on prior training experience.
This process has two main stages which are called training and testing the model. In the
training phase as we described above, we fit the model with the data and afterwards, we use
the test data to assess the model performance.
The training dataset includes both features and the target variable while the testing dataset
includes only features that are used to run the model and get the predictions of the target
variable. The training dataset usually represents 70%-80% of the total data and the test dataset
is the remaining portion of the data which is preserved to test the model's accuracy.
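In scikit-learn, this split is usually done with train_test_split; a minimal sketch, assuming a feature matrix X and a target vector y (for time-series data, shuffling is disabled to preserve chronological order):
from sklearn.model_selection import train_test_split

# Hold out 20% of the observations for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)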
Testing - Evaluate Model Performance
The predictions made by the model will be compared with the real observation of the variable to
obtain a measure of model performance. In order to evaluate the model performance, both
classification and regression problems need to define a function called loss function that will
compute the error of the model.
This loss function is a measure of the accuracy of the model. It calculates the difference between the true value of the response $y$ and the estimate from the model, $\hat{y}$.
The accuracy of the model is higher when the loss function is at a minimum. This means that
the difference between the true values and the estimated values is small. There are many
factors that contribute to the minimization of the loss function such as the quality of the data,
the number of features used to train the model as well as the size of the data used.
Researchers and machine learning engineers will work with the model in order to minimize the loss function. However, if the loss function is minimized too aggressively, the model can get good results on the training data but fail to predict new data well.
[2] A very basic loss function that assigns 1 to correct predictions and 0 to incorrect predictions; it does not track how the errors are made.
[3] Cross entropy measures the performance of a classification model whose output is a probability between 0 and 1; it increases as the predicted probability diverges from the actual label.
The above issue is generated when the model is “overfitted” to the data used in the training
phase but has not learned how to generalize to new, unseen data. In machine learning, this
situation is called overfitting.
Overfitting happens when a model captures the noise in the data rather than the underlying pattern. Such models have low bias and high variance.
• Bias is the difference between the average prediction of the model and the correct value
that we are trying to predict. A model with high bias pays very little attention to the
training data and oversimplifies the model.
• Variance is the variability of model prediction for a data point. A model with high
variance pays a lot of attention to training data and is not good at generalizing its
predictions on the data that it hasn’t seen before. The result is that the model performs
very well on training data but has high error rates for new data.
Both bias and variance lead to a common situation in machine learning that is called the bias-
variance tradeoff, which we will explain next.
Model Selection
Model selection refers to selecting the best statistical machine-learning model for a particular
problem. For this task, we need to compare the relative performance between models.
Therefore, the loss function and the metric that represents it becomes fundamental for
selecting the right and non-overfitted model.
We can state a machine learning supervised problem with the following equation:
$$y = f(x) + \epsilon$$
• x is a matrix that contains the predictor factors $x_1, x_2, x_3, \ldots, x_n$. These factors can be the lagged prices/returns of a time series or other factors such as volume, foreign exchange rates, etc.
• y is the response vector, which depends on the function f and the predictors x.
• f contains the underlying relationship between the x features and the y response and
can be modeled with a linear regression if the underlying relationship is linear or with a
Random Forest or Support Vector Machine algorithm if the underlying relationship is
non-linear.
• $\epsilon$ represents the error term, which is often assumed to have a mean of zero and constant variance.
Once we fit a particular model for a certain dataset, we need to define the loss function that
we will use to assess model performance. Many measures can be used for the loss function.
Some common measures for the loss function are the Absolute Error and the Squared Error between the predicted values and the real values:
$$\text{Loss Function as Absolute Error: } |y - \hat{y}| \qquad\qquad \text{Loss Function as Squared Error: } (y - \hat{y})^2$$
Both choices are non-negative, so the best value for the loss function is zero. The Absolute Error and Squared Error compute the difference between the true value $y$ and the prediction $\hat{y}$ for each observation of the dataset.
Both Absolute Error and Squared Error are vectors or arrays of n x 1 dimension, reflecting
the error term for each of the observations. In order to aggregate the error term of a certain
model between all the predicted and real values of a variable, a popular measure is the Mean
Squared Error which is simply the average of the squared loss:
$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
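For illustration, the same aggregation can be computed with scikit-learn's metrics module (the values below are made up):
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([10.0, 12.0, 11.5, 13.0])
y_pred = np.array([10.5, 11.0, 12.0, 12.5])

print(mean_squared_error(y_true, y_pred))   # average of the squared errors
print(mean_absolute_error(y_true, y_pred))  # average of the absolute errors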
Bias-Variance Tradeoff
The most important part of a machine learning model is its capacity to predict or categorize new
unseen data (data that was not used in training the model). For this reason, the important
measure is the MSE error with test data, which is denominated as test MSE. The goal is to
select a model with the lowest test MSE among all other models.
The bias-variance tradeoff refers to the model’s ability to minimize bias and variance. Bias is
the difference between a model's average prediction and the actual value we are trying to
predict. Variance is the variability of a model's prediction for a given data point, which indicates
how spread our data is. If we decrease the bias of the model by using more features, the
variance of the model increases (overfitting), and on the other hand, if the model is too simple
(has very few parameters), it will have high bias and low variance (underfitting).
It is necessary to find the right balance between bias and variance without overfitting and
underfitting the data. The prediction error in a Supervised machine-learning algorithm can be
divided into three different parts:
• Bias Error
• Variance Error
• Irreducible Error
First, we will write the equation which breaks down these three factors:
$$E\left(y_0 - \hat{f}(x_0)\right)^2 = Var\left(\hat{f}(x_0)\right) + \left[Bias\left(\hat{f}(x_0)\right)\right]^2 + Var(\epsilon)$$
Variance Error
$$Var\left(\hat{f}(x_0)\right)$$
The first term on the right-hand side is the variance of the estimate across many testing sets. It measures the average deviation of the model across different testing data. A model with high variance suggests overfitting to the training data: the model is capturing the noise of the training dataset and performs poorly on new data.
Bias Error
$$\left[Bias\left(\hat{f}(x_0)\right)\right]^2$$
The squared bias characterizes the difference between the average of the estimates and the true values. A model with high bias does not capture the underlying behavior of the true functional form well. One example of this case is when a linear regression is used to fit data whose features have a non-linear relationship with the target.
Irreducible error
$$Var(\epsilon)$$
This term represents the minimum lower bound of the test MSE. This error cannot be reduced regardless of which algorithm is used; it reflects the influence of unknown variables on the interaction between the features and the target variable.
Bias and variance move in different directions. The relative rate of change between these two
factors determines whether the expected test MSE increases or decreases.
Plotting the trade-off between bias and variance against model flexibility shows that, at first, as flexibility increases, the bias tends to drop quickly, faster than the increase in variance, generating a reduction in the test MSE of the model (Total Error). However, as flexibility increases further, there is less reduction in bias while the variance rapidly increases, causing the model to overfit.
The goal of any supervised machine learning algorithm is to achieve low bias and low variance, and all models need a good balance between these factors. For this reason, it is necessary to understand both factors in order to reach the optimal point where the total error is lowest.
Note: In the following section, we will provide some theory behind the various machine learning models along with the formulas. You may want to read it in detail or just skim it to understand what each model does while ignoring the formulas; it's up to you. In the section after that, we will build the models using Python. If you haven't reviewed the formulas, it will not affect your understanding of the Python code, but knowing what each algorithm is doing does help in understanding what's happening behind the scenes.
As we mentioned earlier, both classification and regression models fall under the umbrella of
supervised learning. These models are characterized to have a group of features or
independent variables and a target variable that the model aims to predict.
This target variable is called the labeled data and is the main property of Supervised Learning
models.
Classification problems have the goal of estimating membership for a set of features into a
particular group. A common classification problem in the financial sector is to determine the
price direction for the next day based on N days of asset price history.
Regression problems involve estimating a real value response with a set of features or
predictors as the independent variables. In the financial field, an example of such problem is to
estimate the next day’s asset price based on the historical prices (or other features) of the price.
The regression problem would estimate the real value of the price and not just its direction.
Common regression techniques include Linear Regression, Support Vector Regression and
Random Forest.
In the next section, we explain relevant concepts of the most popular algorithms used for
Supervised Learning. These algorithms are the following:
▪ Multiple Linear Regression
▪ Logistic Regression
▪ Decision Tree
▪ Random Forest
▪ Support Vector Machine
▪ Linear Discriminant Analysis
Multiple Linear Regression
$$y(x) = \beta^T X + \epsilon = \sum_{j=0}^{p} \beta_j x_j + \epsilon$$
$\beta^T X$ is the matrix notation of the equation, with $\beta^T, X \in \mathbb{R}^{p+1}$ and $\epsilon \sim N(\mu, \sigma^2)$.
$\beta^T$ (the transpose of $\beta$) and $X$ are both real-valued vectors of dimension p+1, and $\epsilon$ is the residual term, which represents the difference between the predictions of the model and the true observations of the variable y.
The vector $\beta^T = (\beta_0, \beta_1, \ldots, \beta_p)$ stores all the beta coefficients of the model. These coefficients measure how a change in one of the independent variables impacts the dependent or target variable.
The vector $X = (1, x_1, x_2, \ldots, x_p)$ holds all the values of the independent variables. Both vectors ($\beta^T$ and $X$) are p+1 dimensional because of the need to include an intercept term.
The model for multiple linear regression, given n observations, can also be written as follows:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i, \qquad i = 1, \ldots, n$$
The goal of the linear regression model is to minimize the difference between the predictions and the real observations of the target variable. For this purpose, a method called Ordinary Least Squares (OLS) is used, which derives the optimal set of $\beta$ coefficients for fitting the model.
Ordinary Least Squares (OLS)
Formally the OLS model will minimize the Residual Sum of Squares (RSS) between the
observations of the target variable and the predictions of the model. The RSS is the loss
function metric to assess model performance in the linear regression model and has the
following formulation:
$$RSS(\beta) = \sum_{i=1}^{N}(y_i - \beta^T x_i)^2$$
The Residual Sum of Squares, also known as the Sum of Squared Errors (SSE), measures the difference between the predictions $\beta^T x_i$ and the observations $y_i$. By minimizing this function, it is possible to obtain the optimal parameter estimate of the vector $\beta$.
To minimize the RSS, we differentiate it with respect to $\beta$:
$$\frac{\partial RSS}{\partial \beta} = -2X^T(y - X\beta)$$
Remember that $X$ is a matrix with all the independent variables, with N observations and p features. Therefore, the dimension of $X$ is N (rows) x p+1 (columns).
One assumption of this model is that the matrix $X^T X$ should be positive-definite. This means that the model is valid only when there are more observations than dimensions. In cases of high-dimensional data (e.g., text document classification), this assumption does not hold.
Under the assumption of a positive-definite $X^T X$, the differentiated equation is set to zero and the $\beta$ parameters are calculated:
$$X^T(y - X\beta) = 0$$
$$\hat{\beta}_{OLS} = (X^T X)^{-1} X^T y$$
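As a quick numerical illustration of the closed-form solution (synthetic data, not from the book):
import numpy as np

# 100 observations, an intercept column plus 2 features
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=100)

# Closed-form OLS estimate: beta_hat = (X'X)^(-1) X'y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)  # close to the true coefficients [1.0, 2.0, -0.5]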
Later, we will show an example using a dataset of Open, High, Low, Close and Volume of the
S&P 500 to fit and evaluate a multiple linear regression algorithm using Scikit learn library.
Logistic Regression
The Logistic Regression algorithm is used for classification problems. It provides an output
that we can interpret as a probability that a new observation belongs to a certain class.
Generally, logistic regression is used to classify binary classes but works on multiple and
ordinal classes too.
Logistic regression estimates a continuous quantity which is the probability that an event
occurs. This probability is compared with a certain threshold that allows taking the decision
about the classification of the new data.
If the threshold of the probability is equal to 0.5, we can classify the new data to each of the
classes by comparing the probability value with that threshold.
Instead of fitting a straight line or hyperplane as in a linear regression model, the logistic regression model uses the logistic function [4] to map the output of a linear equation into a value between 0 and 1. With this function, it is possible to map the real-valued predictions into probabilities.
$$logistic(\eta) = \frac{1}{1 + e^{-\eta}}$$
[4] Logistic regression uses the sigmoid function for two-class logistic regression and the softmax function for multiclass logistic regression.
Figure 5: Logistic Regression Sigmoid Function
$$\hat{y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_P X_P$$
$$P(y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_P X_P)}}$$
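A tiny numerical illustration of this mapping:
import numpy as np

def sigmoid(eta):
    # Maps any real-valued linear combination into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-eta))

print(sigmoid(0.0))  # 0.5, the usual classification threshold
print(sigmoid(3.0))  # close to 1, classified as the positive class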
• Binary Logistic Regression: The target variable has only two possible outcomes.
• Multiple Logistic Regression: The target variable has three or more nominal
categories.
• Ordinal Logistic Regression: The target variable has three or more ordinal categories, such as assigning movie scores from 1 to 5.
Logistic Regression in Python
Using the scikit-learn package from Python, we can fit and evaluate a logistic regression algorithm with a few lines of code. For binary classification problems, the library provides useful metrics to evaluate model performance, such as the Confusion Matrix, the Receiver Operating Characteristic (ROC) curve [5], and the Area Under the Curve (AUC) [6].
In the Logistic Regression model (as well as in the rest of the models), we can change the
default parameters from the scikit-learn implementation, with the aim of avoiding model
overfitting or changing any other default behavior of the algorithm.
For Logistic Regression, some of the parameters that could be changed are the following:
• Penalty: (string) specifies the norm used to penalize the model when its complexity increases, in order to avoid overfitting. The possible values are "l1", "l2", and "none". "l1" corresponds to Lasso regression [7] and "l2" to Ridge regression [8], which represent two different ways of adding a penalty term to the loss function. The default value is "l2"; "none" means no regularization.
• C: (float): the default value is 1. With this parameter we manage the λ value of
regularization as C = 1/λ. Smaller values of C mean strong regularization as we penalize
the model hard.
• multi_class: (string) the default value is “ovr” that will fit a binary problem. To fit a
multiple classification, we should pass “multinomial”.
• solver: (string) the algorithm to use in the optimization problem: "newton-cg", "lbfgs", "liblinear", "sag", and "saga". The default is "liblinear". These algorithms relate to how the optimization problem reaches the global minimum of the loss function.
▪ "liblinear": a good choice for small datasets
▪ "sag" or "saga": useful for large datasets
▪ "lbfgs", "sag", "saga", and "newton-cg": handle the multinomial loss, so they are suitable for multinomial problems
▪ "liblinear" and "saga": handle both the "l1" and "l2" penalties
[5] A curve that plots the True Positive Rate of the model on the y-axis against the False Positive Rate on the x-axis.
[6] The AUC indicates how well the model can distinguish between two classes; a value of 1 means the model has the best possible separability between classes.
[7] Lasso adds the "absolute value of magnitude" of the coefficients as a penalty term to the loss function.
[8] Ridge adds the "squared magnitude" of the coefficients as a penalty term to the loss function.
The penalty has the objective of introducing regularization which is a method to penalize
complexity in models with large number of features by adding new terms to the loss function.
It is possible to tune the model with the regularization parameter lambda (λ) and handle
collinearity (high correlation among features), filter out noise and prevent overfitting.
By increasing the value of lambda (λ) in the loss function we can control how well the model
fits to the training data. As stated above, the value of λ in the logistic regression algorithm of
scikit-learn is given by the value of the parameter C, which is 1/λ.
To show these concepts mathematically, we write the loss function without regularization and with the two forms of regularization, "l1" and "l2", where the term $\sum_{j=1}^{p} x_{ij}\beta_j$ represents the predictions of the model.
No regularization:
$$\sum_{i=1}^{n}\left(Y_i - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + 0$$
"l1" (Lasso) regularization:
$$\sum_{i=1}^{n}\left(Y_i - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j|$$
"l2" (Ridge) regularization:
$$\sum_{i=1}^{n}\left(Y_i - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$
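A brief sketch of how these parameters might be passed when instantiating the scikit-learn classifier (the values are illustrative only, not a recommendation):
from sklearn.linear_model import LogisticRegression

# Small C means strong l2 (Ridge) regularization; lbfgs also supports multinomial problems
clf = LogisticRegression(penalty='l2', C=0.1, solver='lbfgs')
# clf.fit(X_train, y_train) would then train the classifier on the labeled data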
Decision Trees
A decision tree is a popular supervised learning algorithm which can handle classification
and regression problems. For both problems, the algorithm breaks down a dataset into smaller
subsets by using if-then-else decision rules within the features of the data.
The general idea of a decision tree is that each feature is evaluated by the algorithm and used
to split the tree based on the capacity that they have to explain the target variable. The features
could be categorical or continuous variables.
In the process of building the tree, the most important features are selected by the algorithm in
a top-down approach creating decision nodes and branches of the tree and making
predictions at points when the tree cannot be expanded anymore.
The tree is composed of a root node, decision nodes and leaf nodes. The root node is the
topmost decision on the tree and is the first time where the tree is split based on the best
predictor of the dataset.
The decision nodes are the intermediate steps in the construction of the tree and are the
nodes used to split the tree based on different values of the independent variables or features
of the model. The leaf nodes represent the end points of the tree and hold the prediction of the
model.
Figure 6: Decision Tree Structure
The methods used to split the tree differ depending on whether it is a classification or a regression problem. We provide an overview of the different measures that are used to split the tree in both types of problems.
Classification Problem
In a classification problem, the algorithm decides which is the best feature to split the tree
based on two measures that are Entropy and Information Gain.
Entropy
Let’s say we have a dataset with N classes. The mathematical formula of entropy is as follows:
$$E(S) = -\sum_{i=1}^{N} p_i \log_2 p_i$$
The goal of a decision tree problem is to reduce the measure of entropy while building the
tree. As we stated above, entropy is a measure of the randomness of data, and this measure
needs to be reduced. For this purpose, entropy is combined with Information Gain in order to
select the feature that has a higher value of Information Gain.
Information Gain
The Information Gain is based on the decrease in Entropy after a dataset is split on an attribute or feature. The attribute with the highest value of Information Gain will be used to split the tree into nodes in a top-down approach.
$$IG(Y, X) = E(Y) - E(Y \mid X)$$
where E is the entropy, Y is the dependent variable, and X is an independent variable.
In the equation above, we subtract the entropy of the target variable Y given X, $E(Y \mid X)$, from the entropy of the target variable Y, $E(Y)$, to calculate the reduction in the uncertainty of Y given the additional piece of information X.
The aim of the decision tree in this type of problem is to reduce the entropy of the target
variable. For this, the decision tree algorithm would use the Entropy and the Information
Gain of each feature to decide what attribute will provide more information (or reduce the
uncertainty of the target variable).
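A short numerical sketch of both measures on a toy label series (not the book's dataset):
import numpy as np
import pandas as pd

def entropy(labels):
    # E(S) = -sum(p_i * log2(p_i)) over the class proportions
    p = labels.value_counts(normalize=True)
    return -(p * np.log2(p)).sum()

# Toy example: direction of price moves (Y) and a binary feature (X)
y = pd.Series(['up', 'up', 'down', 'up', 'down', 'down', 'up', 'up'])
x = pd.Series([1, 1, 0, 1, 0, 0, 0, 1])

# Conditional entropy E(Y|X): weighted average of the entropy within each value of X
cond_entropy = sum((x == v).mean() * entropy(y[x == v]) for v in x.unique())
info_gain = entropy(y) - cond_entropy
print(entropy(y), cond_entropy, info_gain)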
Decision Tree Regression Problem
In a regression problem, the tree is composed (like in the classification problem) of nodes,
branches and leaf nodes, where each node represents a feature or attribute, each branch
represents a decision rule and the leaf nodes represent an estimation of the target variable.
Contrary to the classification problem, the estimation of the target variable in a regression problem is a continuous value and not a categorical value. The ID3 [9] algorithm can be used to construct a decision tree for regression, which replaces the Information Gain metric with the Standard Deviation Reduction (SDR).
The Standard Deviation Reduction is based on the decrease in standard deviation after a
dataset is split on an attribute. The tree will be split by the attribute with the higher Standard
Deviation Reduction of the target variable.
The standard deviation is used as a measure of the homogeneity of the data. The tree is built
top-down from the root node and data is partitioned into subsets that contain homogeneous
instances of the target variable. If the samples are homogeneous, their standard deviation is
zero. The mathematical formula of the Standard Deviation Reduction is the following:
$$SDR(Y, X) = S(Y) - S(Y, X)$$
• Y is the target variable
• X is a specific attribute of the dataset
• SDR(Y, X) is the reduction in the standard deviation of the target variable Y given the information of X
• S(Y) is the standard deviation of the target variable
• S(Y, X) is the standard deviation of the target variable given the information of X
[9] Iterative Dichotomiser 3.
The calculation of S(Y,X) can be broken down with the following formula that is important to
understand the process:
$$S(Y, X) = \sum_{c \in X} P(c)\,S(c)$$
• S(c): the standard deviation of the target variable within the subset of rows where the independent variable takes the value c
• P(c): the proportion (probability) of rows in which the independent variable takes the value c, out of all values of the independent variable
So, the S(Y,X) is the sum of the probability of each of the values of the independent variable
multiplied by the standard deviation of the target variable based on those values of the
independent variable.
The standard deviation S(Y,X) is calculated for each feature or independent variable of the dataset. The feature with the largest Standard Deviation Reduction (that is, the lowest S(Y,X)) is the one with the greatest capability to explain the variation in the target variable and, therefore, the feature that the decision tree algorithm will use to split the tree.
The idea is that the feature used to split the tree is the one that has the greatest Standard Deviation Reduction. The overall process of splitting a decision tree under the Standard Deviation Reduction can be summarized as follows: the standard deviation of the target variable is computed, the SDR is computed for every attribute, the attribute with the largest SDR is chosen for the split, and the process is repeated recursively on each branch, using the coefficient of variation (CV) [10] of each resulting subset as the stopping criterion:
• If the CV of a subset is less than a certain threshold, such as 10%, that subset does not need further splitting and the node becomes a leaf node.
[10] $CV = \frac{S}{\bar{x}} \times 100$, i.e. the standard deviation of the target variable divided by its mean, multiplied by 100.
One of the most common problems with decision trees is that they tend to overfit. There are situations where the tree is built around specific features of the training data, and the predictive power for unseen data is reduced. This is because the branches of the tree follow strict and specific rules of the training dataset, but for samples that are not part of the training set, the accuracy of the model is poor.
There are two main ways to address over-fitting in decision trees. One is by using the
Random Forest algorithm and the second is by pruning the tree.
Random Forest
The random forest algorithm avoids overfitting because it is an ensemble of n decision trees (not just one), producing n results in the end. If the target variable is categorical, the final result of the random forest is the most frequent response among the n results of the n decision trees. If the target variable is continuous, the final result is the average of the predictions of the n trees.
Pruning
Pruning is done after the initial training is complete, and involves the removal of nodes and
branches starting from the leaf nodes such that the overall accuracy is not affected. The
pruning technique will reduce the size of the tree by removing nodes that provide little power to
classify or predict new instances.
The construction of a random forest can be summarized in the following steps:
• Select k features randomly from the dataset and build a decision tree from those features, where k < m (the total number of features).
• Repeat this n times in order to have n decision trees from different random combinations
of k features.
• Take each of the n decision trees, pass the new observation through it to predict the outcome, and store this outcome to get a total of n outcomes from the n decision trees.
• If the target variable is a categorical variable, each tree in the forest would predict the
category to which the new record belongs and the new record is assigned to the
category that has the majority vote.
• If the target variable is a continuous variable, each tree in the forest predicts a value for
the target variable and the final value is calculated by taking the average of all the values
predicted by the trees that are part of the forest.
Using the scikit learn package from Python, it is possible to use and tune a Random Forest
model based on predefined conditions that will give instructions to the algorithm regarding the
construction of these trees that are part of the forest.
Hyperparameter Tuning
The scikit-learn library allows tuning of some important parameters in the tree construction that
could increase the predictive power of the model or make the model faster.
• n_estimators: represents the number of trees in the forest. In general, a higher number
of trees increases the performance and makes the predictions more stable. The default
value of this parameter is 10.
• max_features: this parameter reflects the maximum number of features from the dataset that the Random Forest is allowed to use in an individual tree when considering the best split. The default value of the parameter is sqrt(n_features) [11].
• min_samples_leaf: the minimum number of observations in each leaf node. This
parameter would prevent further splitting when the number of observations in the node is
less than the value of this parameter. The default value of this parameter is 1.
[11] The square root of the number of features of the model.
• max_depth: the maximum depth of the tree. The depth of a decision tree is the length
of the longest path from the root to a leaf. The default value of this parameter is None. In
this case, the tree is split until nodes contain less than min_samples_split samples.
• n_jobs: tells the engine the number of jobs to run in parallel for fit and predict.
• random_state: this parameter makes the model reproducible, as it will output the same result every time the model runs if it has a definite value of the random state.
• oob_score: a Boolean value that allows testing model performance on the out-of-bag samples, i.e. the portion of the data not used to build a given tree.
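A brief sketch of how these parameters might be passed when building the model (the values are illustrative only):
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(
    n_estimators=500,      # number of trees in the forest
    max_features='sqrt',   # features considered when looking for the best split
    min_samples_leaf=5,    # minimum observations required in a leaf node
    max_depth=10,          # limit the depth of each tree to control overfitting
    n_jobs=-1,             # use all available cores for fit and predict
    oob_score=True,        # evaluate on the out-of-bag samples
    random_state=42)       # reproducible results
# forest.fit(X, y); forest.oob_score_ would then report the out-of-bag R^2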
Support Vector Machine
The Support Vector Machine (SVM) algorithm is widely used for classification tasks such as:
• Classify new documents into positive and negative sentiment categories based on other documents that have already been classified.
• Classify new emails into spam or not spam based on a large set of emails that have
already been marked as spam or not spam.
• Bioinformatics (Protein Classification, Cancer Classification)
• Other tasks that are characterized by a high-dimensional feature space.
The algorithm creates a partition of the feature space into two subspaces. Once these
subspaces are created, new unseen data can be classified in some of these locations.
To address non-linear relationships, the algorithm follows a technique called the kernel trick [12] to transform the data and find an optimal boundary between the possible outputs.
The separation between the two zones is given by a boundary which is called the hyperplane.
Next, we will explain key concepts about how the hyperplane works in the Support Vector
Machine algorithm.
Hyperplane
An interesting feature of the Support Vector Machine algorithm is that this hyperplane does
not need to be a linear hyperplane. The hyperplane can introduce various types of non-linear
decision boundaries. This feature increases the usability of the Support Vector Machine
algorithm as it is capable of addressing non-linear relationships between features and target
variables.
The algorithm will find the hyperplane that best separates the classes of the target variable by
training a set of n observations of the features. Each observation is a pair of the p-dimensional
vector of features $(x_1, x_2, x_3, \ldots, x_p)$ and the class $y_i$, which is the label.
Usually, the classes of the target variable are defined as +1 and -1, where each of these
classes has its own meaning (negative vs positive, spam vs non-spam, malign vs benign).
$y_i = 1$ implies that the sample with the feature vector $x_i$ belongs to class 1, and $y_i = -1$ implies that the sample belongs to class -1.
The class +1 could be “non-spam” or “positive” sentiment and the class -1 could be “spam” and
“negative” sentiment. Depending on which side of the hyperplane a new observation is located,
we assign it a particular class (-1, +1).
[12] Basically, these are techniques to project the data into a higher dimension such that a linear separator is sufficient to split the feature space.
Deriving the Hyperplane
The dimension of the hyperplane depends on the number of features. If the number of input features is two (in $\mathbb{R}^2$), the hyperplane can be drawn as a line. If the number of input features is three, then the hyperplane becomes a two-dimensional plane.
In order to optimize the hyperplane or the separation boundary between the two classes, there
are two important concepts that need to be described. These are the concepts of Maximal
Margin Hyperplane (MMH) and the Maximal Margin Classifier (MMC).
The classes of a target variable can be well segregated by more than one hyperplane as we
can observe in Figure 7. However, we should select the hyperplane with the maximum
distance between the nearest observations data points and the hyperplane, and this condition is
achieved by the Maximal Margin Hyperplane.
Therefore, maximizing the distance between the nearest points of each class and the
hyperplane would result in an optimal separating hyperplane.
Figure 8: Maximum Separating Hyperplane
The continuous grey line in Figure 8 is the Maximal Margin Hyperplane, as it is the hyperplane that maximizes the distance between itself and the nearest data points of the two classes. To find the Maximal Margin Hyperplane, it is necessary to compute the perpendicular distance from each of the training observations $x_i$ to a given separating hyperplane.
The smallest perpendicular distance to a training observation from the hyperplane is known as
the margin. The Maximum Margin Hyperplane is the separating hyperplane where the margin
is the largest.
Maximal Margin Classifier (MMC) is a linear binary classifier algorithm that seeks to find the
hyperplane with maximum margin to separate the two classes in a dataset. The margin is
defined as the distance between the hyperplane and the closest data points from each class,
known as support vectors. The goal is to maximize the margin to minimize the chance of
misclassification and improve the generalization of the model.
Non-linear Relationships
Support Vector Machine handles situations of non-linear relations in the data by using a
kernel function which maps the data into a higher dimensional space where a linear
hyperplane can be used to separate classes.
If the data can’t be separated by linear discrimination, it is possible to use some trick to project
the data in a higher dimension and apply a linear hyperplane. Figure 9 shows a scenario
where the data cannot be separated using a linear hyperplane.
In situations such as those in Figure 9, Support Vector Machine becomes extremely powerful
because using the kernel trick technique we can project the data into a higher-dimensional
space and thereby we’re able to fit nonlinear relationships with a linear classifier.
Making a transformation of the space, and adding one more dimension to the data, will allow
the data to be linearly separable. The following plot shows how the previous classes can be
linearly separated with the increment of the dimension of space from 2 dimensions to 3
dimensions.
Figure 10: Kernel Transformation to 3D
The kernel trick represents a real strength in the Support Vector Machine algorithm as we can
solve non-linear classification problems with efficiency.
For the SVC implementation in scikit-learn, some of the parameters that can be tuned are the following:
• Kernel: (string) selects the type of hyperplane used to separate the data. The options include "linear", "rbf", and "poly". "linear" uses a linear hyperplane, while "rbf" and "poly" use a non-linear hyperplane. The default argument is "rbf".
• Gamma: (float) is a parameter for non-linear hyperplanes. The higher the gamma value,
the more it tries to exactly fit the training data set. The default argument is “auto” which
uses 1 / n features. Low gamma values mean we’re considering a faraway point when
deciding the decision boundary.
• C: (float) is the penalty parameter of the error term. It regulates overfitting by controlling
the trade-off between smooth decision boundaries and classifying the training points
correctly. Greater values of C lead to overfitting of the training data.
• Degree: (integer) is a parameter used when the kernel is set to “poly”. This is the degree
of the polynomial used to find the hyperplane to split the data. The default value is 3.
Using degree=1 is the same as using a “linear” kernel.
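A minimal sketch of instantiating the scikit-learn classifier with these parameters (values chosen only for illustration):
from sklearn.svm import SVC

# Non-linear (RBF) decision boundary
svc = SVC(kernel='rbf', gamma=0.1, C=1.0)
# A polynomial boundary could instead be requested with:
# svc = SVC(kernel='poly', degree=3, C=1.0)
# svc.fit(X_train, y_train) would then train the classifier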
Python Examples
In this section, we show two examples of Machine Learning problems using the scikit-learn
library from python. With these examples, we hope to highlight the important steps in working
with a Machine Learning problem and explain key concepts in each of the stages.
import pandas as pd
from datetime import datetime
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
Multiple Linear Regression in Python
Step 1 - Load the Data
We will work with SPY data between the dates 2010-01-04 and 2015-12-07. First, we use the read_csv() method to load the csv file into the environment:
SPY_data = pd.read_csv("../data/SPY_regression.csv")
SPY_data.head(10)
# Reverse the order of the dataframe in order to have the oldest values at the top
SPY_data = SPY_data.sort_values('Date', ascending=True)
Step 2 - Create Features for the Model
SPY_data['High-Low_pct'] = (SPY_data['High'] - SPY_data['Low']).pct_change()
SPY_data['ewm_5'] = SPY_data["Close"].ewm(span=5).mean().shift(periods=1)
SPY_data['price_std_5'] = SPY_data["Close"].rolling(center=False, window=30).std().shift(periods=1)
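The construction of the remaining volume-based features used below (volume_avg_5, volume Change, and volume Close) is not reproduced in this extract; the sketch below shows one plausible set of definitions, with the exact formulas assumed:
# Assumed definitions of the volume features referenced later in this example
SPY_data['volume_avg_5'] = SPY_data['Volume'].rolling(window=5).mean().shift(periods=1)
SPY_data['volume Change'] = SPY_data['Volume'].pct_change()
SPY_data['volume Close'] = (SPY_data['Volume'] /
                            SPY_data['Close']).rolling(window=5).mean().shift(periods=1)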
Step 3: Visualization of the relationship between features and the target variable
Before training the dataset, we will make some plots to observe the correlations between the features and the target variable.
jet= plt.get_cmap('jet')
colors = iter(jet(np.linspace(0,1,10)))
# Take the name of the last 6 columns of the SPY_data which are the model features
variables = SPY_data.columns[-6:]
correlation(SPY_data,variables,3,3)
Figure 11: Correlations between Features and Target Variable (Adj Close)
The correlation matrix between the features and the target variable has the following values:
SPY_data.corr()['Adj Close'].loc[variables]
High-Low_pct -0.010328
ewm_5 0.998513
price_std_5 0.100524
volume Change -0.005446
volume_avg_5 -0.485734
volume Close -0.241898
Both the scatterplot and the correlation matrix reflect that the Exponential Moving Average for 5
periods is highly correlated with the Adj Close variable. We also observe a negative correlation
between Adj Close and the volume average for 5 days and with the Volume to Close ratio.
Step 4 - Train the Dataset and Fit the model
When we performed the feature calculations, our SPY_data also got some NaN values in some
of these columns. We will review how many NaN values there are in each column and then
remove these rows.
SPY_data.isnull().sum().loc[variables]
High-Low_pct 1
ewm_5 1
price_std_5 30
volume Change 1
volume_avg_5 5
volume Close 5
To train the model, it is important to drop any missing value in the dataset.
SPY_data = SPY_data.dropna(axis=0)
In the next step, we will fit the model with the LinearRegression estimator. We are trying to predict the Adj Close value of the Standard and Poor's index, so the target of the model is the "Adj Close" column.
lr = LinearRegression()
X_train = train[["High-Low_pct", "ewm_5", "price_std_5", "volume_avg_5",
                 "volume Change", "volume Close"]]
lr.fit(X_train, Y_train)
Create the test features dataset (X_test) which will be used to make the predictions
X_test = test[["High-Low_pct", "ewm_5", "price_std_5", "volume_avg_5",
               "volume Change", "volume Close"]].values
Predict the Adj Close values using the X_test dataset and compute the Mean Absolute Error between the predictions and the real observations.
close_predictions = lr.predict(X_test)
mae = sum(abs(close_predictions - test["Adj Close"].values)) / test.shape[0]
print(mae)
18.0904
The Mean Absolute Error of the model is 18.0904. This metric is more intuitive than others
such as the Mean Squared Error, in terms of how close the predictions were to the real price.
Finally, we will plot the error term for the last 25 days of the test dataset. This allows us to observe the size of the error term on each day and assess the performance of the model by date.
# Create a dataframe with the Date, the actual and the predicted values
# (the construction of df1 is assumed here; it was not shown in the original extract)
df1 = pd.DataFrame({'Date': pd.to_datetime(test['Date']),
                    'Actual': test['Adj Close'].values,
                    'Predicted': close_predictions}).tail(25)
df1['Date'] = df1['Date'].dt.strftime('%Y-%m-%d')
df1.set_index('Date', inplace=True)
error = df1['Actual'] - df1['Predicted']
# Plot the error term between the actual and predicted values for the last 25 days
error.plot(kind='bar', figsize=(8,6))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.xticks(rotation=45)
plt.show()
Machine Learning Classifier for Forecasting
In the second example, we will create a supervised classifier model that will train a dataset with
a set of features and then use test data to predict price direction at day t with information only
known at day t-1. Price direction can be up when the closing price at t is higher than the price at
t-1, and down when the closing price at t is lower than at t-1.
For this task, we will create a set of features that are the lagged returns for the previous 2 days
and volume percent change in each day. Then we train the dataset and fit different models with
a set of algorithms that are the Logistic Regression, Support Vector Machine, Support
Vector Classifier, Random Forest and Linear Discriminant Analysis.
For each model, we will output two metrics that are used in classification problems to assess
model performance. These metrics are the Hit Rate and the Confusion Matrix.
Hit Rate
The Hit Rate provides a measure of the percentage of the number of times the classifier makes
correct predictions (up and down). This indicator can be expressed with the following formula:
$$\frac{1}{n}\sum_{j=1}^{n} I(y_j = \hat{y}_j)$$
where $I(y_j = \hat{y}_j)$ is the indicator function, equal to 1 if $y_j = \hat{y}_j$ and 0 if $y_j \neq \hat{y}_j$.
Confusion Matrix
The Confusion Matrix gives a measure of how many times the classifier predicts ‘up’ correctly
and how many times it predicts ‘down’ correctly.
In a binary classification problem, the confusion matrix is a 2 x 2 contingency table that
determines the False Positive Rate13 (Type I error) and the False Negative Rate14 (Type II
error).
\[ \begin{pmatrix} U_T & U_F \\ D_F & D_T \end{pmatrix} \]
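As a small illustration (an assumed example, not from the original code), scikit-learn's confusion_matrix function builds this 2 x 2 table from the predicted and actual directions:
from sklearn.metrics import confusion_matrix

y_true = [1, -1, 1, 1, -1]   # actual direction: 1 = up, -1 = down
y_pred = [1, -1, -1, 1, 1]   # predicted direction
# labels=[1, -1] orders rows (predictions) and columns (actuals) as (up, down),
# matching the (U_T U_F / D_F D_T) layout shown above
print(confusion_matrix(y_pred, y_true, labels=[1, -1]))
# [[2 1]
#  [1 1]]   ->  U_T = 2, U_F = 1, D_F = 1, D_T = 1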
Scikit-learn library provides methods to calculate the Hit Rate and the Confusion Matrix for a
classifier. The dataset of the model is the SPY data between 2015-01-01 to 2019-09-18.
▪ Import libraries and load the SPY data into the environment using the read_csv() function.
▪ Create features and target variable with the model_variables() function
▪ Fit different models: Logistic Regression, Random Forest, Support Vector Machine,
Linear Discriminant Analysis
▪ Obtain metrics such as Confusion Matrix and Hit Rate for each of the models.
13 Incorrectly rejecting a true null hypothesis; $U_F$ in the contingency table.
14 Failing to reject a false null hypothesis; $D_F$ in the contingency table.
Import Libraries and Load Data
import pandas as pd
import numpy as np
import datetime
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC, SVC
# parse_dates=True so that the Date index can later be compared with date_split
data = pd.read_csv("../data/spy_data.csv", index_col='Date', parse_dates=True)
def model_variables(prices, lags):
    '''
    Parameters:
        prices: dataframe with the historical SPY data, containing the Close
                and Volume columns.
        lags:   number of lags of the closing price that will be created by the
                function to make the lagged returns features of the model.
    Output:
        tsret: dataframe with a date index, the independent variables (X) and
               the dependent variable (y) of the machine learning model.
    '''
    inputs = pd.DataFrame(index=prices.index)
    inputs["Close"] = prices["Close"]
    inputs["Volume"] = prices["Volume"]

    # Create the shifted lag series of prior trading period close values
    for i in range(0, lags):
        inputs["Lag%s" % str(i+1)] = prices["Close"].shift(i+1)

    # Dataframe with the model variables: returns, volume change, lagged
    # returns and price direction (this part of the function is an assumed
    # completion of the partial code)
    tsret = pd.DataFrame(index=inputs.index)
    tsret["returns"] = inputs["Close"].pct_change() * 100.0
    tsret["VolumeChange"] = inputs["Volume"].pct_change()

    # Replace exact-zero returns with a small number so that the sign
    # (direction) of the return is always defined
    tsret.loc[tsret["returns"].abs() < 0.0001, "returns"] = 0.0001

    # Lagged percentage returns as features
    for i in range(0, lags):
        tsret["Lag%s" % str(i+1)] = inputs["Lag%s" % str(i+1)].pct_change() * 100.0

    # Direction of the price: +1 if the price went up, -1 if it went down
    tsret["Direction"] = np.sign(tsret["returns"])
    return tsret
# Pass the dataset(data) and the number of lags 2 as the inputs of the
# model_variables function
variables_data = model_variables(data,2)
# Use the prior two days of returns and the volume change as predictors
# values, with direction as the response
dataset = variables_data[["Lag1","Lag2","VolumeChange","Direction"]]
dataset = dataset.dropna()
# Create the dataset with independent variables (X) and dependent variable y
X = dataset[["Lag1","Lag2","VolumeChange"]]
y = dataset["Direction"]
# Split the train and test dataset using the date in the date_split variable.
# This creates a train dataset with 4 years of data and a test dataset with
# around nine months of data.
date_split = datetime.datetime(2019, 1, 1)
X_train = X[X.index < date_split]
X_test = X[X.index >= date_split]
y_train = y[y.index < date_split]
y_test = y[y.index >= date_split]
Create the (Parametrised) Models
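A list of parametrised models is needed for the loop below; a minimal version consistent with the labels printed in the results (LR, LDA, LSVC, RSVM, RF) and with the min_samples_leaf = 30 setting discussed afterwards could be:
# Hypothetical model list: apart from min_samples_leaf=30 for the Random
# Forest, all parameters are left at their scikit-learn defaults
models = [("LR", LogisticRegression()),
          ("LDA", LDA()),
          ("LSVC", LinearSVC()),
          ("RSVM", SVC()),   # SVC uses the radial (RBF) kernel by default
          ("RF", RandomForestClassifier(min_samples_leaf=30))]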
for m in models:
    # Train each of the models on the training set
    m[1].fit(X_train, y_train)
    # Make an array of predictions on the test set
    pred = m[1].predict(X_test)
    # Output the hit-rate and the confusion matrix for each model
    print("%s:\n%0.3f" % (m[0], m[1].score(X_test, y_test)))
    print("%s\n" % confusion_matrix(pred, y_test))
The results of the code are the Hit Rate and the Confusion Matrix for each of the trained
models. The diagonal of the matrix contains the correct predictions (up and down), while the
off-diagonal elements contain the incorrect predictions (the prediction was down and the price
went up, or the prediction was up and the price went down).
LR:
0.583
[[28 29]
[46 77]]
LDA:
0.589
[[28 28]
[46 78]]
LSVC:
0.583
[[28 29]
[46 77]]
RSVM:
0.572
[[20 23]
[54 83]]
RF:
0.600
[[34 32]
[40 74]]
The model with the best prediction score is the Random Forest, with a Hit Rate of 60%. We
changed the min_samples_leaf parameter of the Random Forest Classifier from its default
value of 1 to 30. This means that each leaf node has at least 30 observations, and that a split
is only considered if it leaves at least 30 training samples in each of the left and right branches.
All the models are better at predicting down periods than up periods, as the true positive rate
for the "down"15 days is significantly higher than the true positive rate for the "up"16 days.
Cross Validation
Basically, to perform Cross Validation we need to keep aside a portion of the data that is not
used to train the model. The goal of Cross Validation is to estimate the test error of the model
by holding out a subset of the dataset to use as test observations. This approach gives a more
accurate estimate of the test error.
The classical method for training and testing a dataset is called the Validation Set approach.
We used this approach in both the Multivariate Linear Regression and the Classifier
Forecasting examples. It consists of splitting the dataset into a train set and a test set:
commonly, around 80% of the data is used to train the model and the remaining 20% is used
as the test set.
For financial time series the split is done in chronological order, so the training set contains
the earliest observations and the test set the most recent ones. One drawback of this method
is that the estimate of model performance can vary significantly depending on the lengths
chosen for the train and test sets.
15 $D_T / (D_T + U_F)$
16 $U_T / (U_T + D_F)$
Likewise, if we have a limited amount of data, there is a risk of high bias, because the model
misses the information in the data that was not used for training. If the amount of data is large
and the train and test data have the same distribution, this approach is acceptable.
The second approach to address overfitting is to train and test the model using a method called
K-Fold Cross Validation.
K-Fold Cross Validation is a more sophisticated approach that generally results in a less
biased model compared to other methods. This method consists of the following steps:
1. Divide the n observations of the dataset into k mutually exclusive and equal or close-to-
equal sized subsets known as “folds”.
2. Fit the model using k-1 folds as the training set and one fold (kth) as the test set. After
each iteration has been finished, store the error of the model.
3. Repeat this process k times using one different fold every time as a test set and the
remaining folds (k-1) as the training set.
4. Once all the iterations have finished, take the average of the k error estimates (for
example, the Mean Squared Error of each fitted model). This average is the cross-validation
estimate of the test error.
The cross-validation estimate of the test error under the K-Fold method is given by the
following formula:
\[ CV_k = \frac{1}{k} \sum_{i=1}^{k} MSE_i \]
An important consideration in this approach is the selection of the number of folds. Each fold
needs enough data points to provide a fair estimate of the model performance; on the other
hand, k should not be too small (such as 2), so that there are enough trained models to
assess the model performance.
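In scikit-learn, this procedure can be run directly with cross_val_score. The following is an illustrative sketch on simulated data (the full worked example with our FX dataset follows in the next section):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                                   # three simulated features
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=200)

# 5-fold cross validation; this scoring option returns the negative MSE of each fold
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error")
cv_error = -scores.mean()                                       # CV_k = average MSE over the k folds
print(cv_error)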
K-Fold Cross Validation Example
We will now work through an example of Cross Validation with the K-Fold method using the
Python scikit-learn library. The example uses k = 5. With a for loop, we will fit the model using
4 folds as training data and 1 fold as testing data in each iteration, and then call the
accuracy_score function from scikit-learn to determine the accuracy of the model.
• Import the libraries and load data into the environment. Our data is the Open, High, Low,
and Close data for EURUSD.
• Create features with the create_features() function.
• Run the model with the Validation Set approach.
• Run the model with the K-Fold cross Validation approach.
Import Libraries and Load Data
import pandas as pd
from sklearn.model_selection import KFold
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn import tree
import matplotlib.pyplot as plt
import seaborn as sns
# Read csv file to load into the environment OHLC data from EURUSD.
eurusd_ohlc = pd.read_csv("../data/eurusd.csv")
Create Features
# The create_features function receives the eurusd_ohlc dataframe and creates
# the features used in the machine learning model
def create_features(fx_data):
    '''
    Parameters:
        fx_data: Open-High-Low-Close data for the currency pair EURUSD between
                 2001-08-21 and 2019-09-21
    Return:
        fx_data: dataframe with the original data plus the features of the model
        target:  target variable to predict, which contains the direction of the
                 price. The values can be 1 for up direction and -1 for down direction.
    '''
    # Daily price change used to build the RSI (the 14-period window below is
    # an assumption; the exact feature construction is not shown in the text)
    delta = fx_data['Close'].diff()
    # Make the positive gains (up) and negative gains (down) Series
    up, down = delta.copy(), delta.copy()
    up[up < 0] = 0
    down[down > 0] = 0
    RSI = 100 - 100 / (1 + up.rolling(14).mean() / down.abs().rolling(14).mean())
    fx_data['RSI'] = RSI
    # Additional features used below: daily range, daily return and 5-day return
    fx_data['High-Low'] = fx_data['High'] - fx_data['Low']
    fx_data['pct_change'] = fx_data['Close'].pct_change()
    fx_data['ret_5'] = fx_data['Close'].pct_change(5)
    fx_data.dropna(inplace=True)
    # Create the target variable that takes the value 1 if the price goes up
    # and -1 if the price goes down
    target = np.where(fx_data['Close'].shift(-1) > fx_data['Close'], 1, -1)
    return fx_data, target

features, target = create_features(eurusd_ohlc)

# Validation Set approach: take 80% of the data as the training set and 20%
# as the test set. X is a dataframe with the input variables
X = features[['High-Low', 'pct_change', 'ret_5', 'RSI']]
y = target
clf = tree.DecisionTreeClassifier(random_state=20)
# Validation Set approach: assumed chronological 80/20 split into train and test sets
split = int(0.8 * len(X))
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y[:split], y[split:]
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
# Create the model on the train dataset
model = clf.fit(X_train, y_train)
# Initialize the list that will store the accuracy of each model/fold
accuracy_model = []
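A minimal sketch of the validation-set accuracy and of the k = 5 loop that produce the output below (it assumes X is the features dataframe and y the target array created above):
# Accuracy with the Validation Set approach
predictions = clf.predict(X_test)
print(round(accuracy_score(y_test, predictions) * 100, 4))

# K-Fold Cross Validation with k = 5
kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    accuracy_model.append(round(accuracy_score(y_test, predictions) * 100, 4))
print(accuracy_model)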
(3989, 4) (998, 4)
(3989,) (998,)
51.4028
[50.501, 52.004, 48.9468, 46.1384, 51.3541]
These 4 lines above are the outputs of the print() messages. (3989, 4) (998, 4) are the size of
the X_train and X_test dataset where 3989 is the number of observations in the train dataset
and 4 is the number of features in the train dataset. 998 is the number of observations in the
test dataset, and 4 is the number of features in the test dataset.
(3989,) (998,) are the size of y_train and y_test. 51.4028 is the accuracy score with the
Validation set approach and [50.501, 52.004, 48.9468, 46.1384, 51.3541] is the
accuracy_model list which shows the accuracy in each iteration using the K-Fold Cross
Validation method.
K-Fold Cross Validation gives a better idea of how the model will perform with new or live data
because we have used 5 different testing sets to obtain measures of the model performance.
Finally, we use a bar plot to visualize the score measure in each iteration:
scores = pd.DataFrame(accuracy_model,columns=['Scores'])
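A minimal sketch of the bar plot using the seaborn library imported earlier:
sns.barplot(x=scores.index, y=scores['Scores'])
plt.xlabel("K-Fold iteration")
plt.ylabel("Accuracy (%)")
plt.show()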
Unsupervised Learning
Unsupervised learning models are composed of features that are not associated with a
response. This means that these machine learning algorithms do not work with labelled data;
their interest lies in the attributes of the features themselves.
Many machine learning problems use unsupervised techniques because it can be very
expensive or time-consuming to label all the data, or because the data has a high
dimensionality that is not suitable for supervised learning techniques and could lead to
overfitting and poor test performance.
Some examples of unsupervised learning problems in quantitative finance are the following:
• Portfolio/asset clustering,
• Market regime identification
• Trading signal generation with natural language processing.
There are two important families of unsupervised techniques: Dimensionality Reduction
and Clustering.
Dimensionality Reduction
Some machine learning problems have a large number of features that are correlated in a high-
dimensional space. Dimensionality reduction is the process of reducing the number of features
under consideration, by obtaining a set of principal features.
The most common method for Dimensionality Reduction is Principal Component Analysis
(PCA). This method reduces the dimension of the data by transforming the original feature
space into a set of linearly uncorrelated variables called principal components.
The principal components can be found as the eigenvectors17 of the covariance matrix of the
data. The procedure preserves the components of the data with the highest variation and
removes the components with the least variation.
In quantitative finance, this technique can be applied to a large number of correlated stocks in
order to reduce their dimensionality to a smaller set of uncorrelated factors.
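As an illustration (not from the original text), scikit-learn's PCA can compress a matrix of correlated stock returns into a handful of uncorrelated factors; a minimal sketch on simulated data:
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
market = rng.normal(size=(250, 1))                    # a common market factor
returns = market + 0.5 * rng.normal(size=(250, 50))   # 50 correlated stock return series

# Keep the principal components that explain 95% of the variance of the returns
pca = PCA(n_components=0.95)
factors = pca.fit_transform(returns)
print(factors.shape, pca.explained_variance_ratio_[:3])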
Cluster Analysis
Another important unsupervised learning technique is Cluster Analysis. Its goal is to assign a
cluster label to each observation of a dataset in order to group the observations into clusters
with similar properties. In quantitative finance, clustering is used to identify assets with similar
characteristics, which is useful for constructing diversified portfolios.
A key element of the cluster technique is to define the number of clusters K in which the data
will be partitioned. The scikit-learn library provides a straightforward procedure to perform this
task using the K-means algorithm. First, we describe the inner steps of the algorithm and then
we demonstrate it with an example. The algorithm performs the following steps:
1. Pick k cluster centers (centroids) randomly. The number of clusters is given by the k
parameter.
2. Assign each observation of the dataset to the nearest cluster by using the Euclidean
Distance between the observation and each centroid.
3. Find the new centroid or new cluster mean corresponding to each observation by taking
the average of all the points assigned to that cluster.
4. Repeat steps 2 and 3 until none of the cluster assignments change. This means until the
clusters remain stable.
17 Eigenvectors capture the key information of a matrix. They are used when the feature
dimension of the data is high and it is necessary to reduce it to make the dataset manageable.
This algorithm can be used for more than one purpose. One good use is to identify clusters of
stocks that can support a portfolio diversification strategy based on holding stocks from
different clusters.
We will provide an example of the implementation of the K-Means algorithm in Python. This
example involves clustering a dataset that contains information on all the stocks that compose
the Standard & Poor's 500 index.
• Obtain the tickers of the S&P 500 constituents by scraping the ticker symbols from Wikipedia. The function obtain_parse_wiki_snp500() is used to perform this task.
• Obtain close prices from last year for each of the symbols using the Quandl API. (Note:
You will need your own free Quandl API key).
• Calculate the mean and variance of the returns for each stock.
• Choose the best k value for the optimal number of clusters.
• Fit the model with the k number of clusters.
import pandas as pd
import numpy as np
from math import ceil
import bs4
import requests
import quandl # need to do pip install quandl
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
def obtain_parse_wiki_snp500():
    """
    Download and parse the Wikipedia list of S&P 500 constituents using
    requests and Beautiful Soup.
    """
    response = requests.get("http://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
    soup = bs4.BeautifulSoup(response.text, "html.parser")  # explicit parser to avoid a bs4 warning
    # Select the first table, using CSS Selector syntax, and ignore the header row ([1:])
    symbolslist = soup.select('table')[0].select('tr')[1:]
    # Obtain the symbol information for each row in the S&P 500 constituent table
    symbols = []
    for i, symbol in enumerate(symbolslist):
        tds = symbol.select('td')
        symbols.append((tds[0].select('a')[0].text,  # Ticker
                        tds[1].select('a')[0].text,  # Name
                        tds[3].text))                # Sector
    return symbols
tickers = obtain_parse_wiki_snp500()
The tickers object is a list of tuples with the ticker, Company Name and Sector of each
company. To observe the structure of the tickers object we show the first 10 elements:
tickers[:10]
We will use this list in the following function to loop over each symbol in tickers and get the
close price from the Quandl API. The loop uses a try-except block to handle errors for symbols
that have changed or are named differently in the Quandl database.
def get_quandl_data(symbols, start_date="2018-01-01", end_date="2018-12-31"):
    """
    Loop over all the symbols from the S&P 500 and retrieve the Close price
    column from the WIKI database in Quandl between the dates given by the
    start_date and end_date parameters.
    """
    # quandl.ApiConfig.api_key = "YOUR_API_KEY"   # set your own free Quandl API key
    # NOTE: the default dates above are assumptions ("last year" in the text)
    symbols = [symbol[0] for symbol in symbols]
    stocks_info = []
    for symbol in symbols:
        try:
            # Naming the series after the ticker makes it the column name when
            # the close prices are assembled into a dataframe below
            prices = quandl.get("WIKI/" + symbol, start_date=start_date,
                                end_date=end_date)["Close"].rename(symbol)
            stocks_info.append((symbol, prices))
        except Exception:
            # Handle symbols that have changed or are missing in the Quandl database
            print("Symbol %s not found in the WIKI database" % symbol)
    return stocks_info

data = get_quandl_data(tickers)
The data object is a list of tuples where each tuple has 2 elements. The first element is the
ticker and the second element is the Date and Close Price for each of the stocks. We need to
parse this information in order to make a dataframe that has the date as index and the close
prices for each stock as columns. The following lines do this job:
# Get the date and prices from the data object (second element of the tuple)
closes = [data[i][1] for i in range(0,len(data))]
# Store closes object in a dataframe and obtain the transpose of the dataframe.
#With this we will have the Date as index and the Prices of each stock as columns.
closes = pd.DataFrame(closes).T
We will now calculate the annualized mean return and the annualized standard deviation of the
returns of these stocks.
returns = closes.pct_change().mean() * 252
std = closes.pct_change().std() * np.sqrt(252)
# Dataframe with one row per stock: its annualized return and standard deviation
# (the "Returns" column name is an assumption consistent with the plots below)
ret_var = pd.concat([returns, std], axis=1).dropna()
ret_var.columns = ["Returns", "Standard Deviation"]
ret_var.head(10)
In order to determine the optimal number of clusters k for the ret_var dataset, we will fit
different K-means models, varying the k parameter over the range 2 to 14. For each model we
calculate the Sum of Squared Errors (SSE) using the inertia_ attribute18 of the fitted model
and append it to the sse list. We then look for the "elbow" of the resulting curve, the value of k
beyond which the SSE decreases only slowly.
X = ret_var.values
18 Inertia measures how far the points within a cluster are from its centre; the smaller the
inertia, the more compact the cluster.
sse = []
for k in range(2, 15):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    sse.append(kmeans.inertia_)   # store the SSE (inertia) of each fitted model
plt.plot(range(2, 15), sse)
plt.title("Elbow Curve")
plt.show()
The resulting graph is known as the Elbow Curve and shows that the optimal value of k is 5.
Figure 15: Elbow Curve: determining the optimal value of k for the K-means clustering
We choose k = 5 and fit the model with this parameter value.
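A minimal sketch of this step (the plotting details are assumptions consistent with the description of the figure below):
kmeans = KMeans(n_clusters=5).fit(X)

# Scatter plot of the stocks coloured by their assigned cluster
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap="rainbow")
plt.xlabel("Returns")
plt.ylabel("Standard Deviation")
plt.show()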
The different groups or clusters of the dataset are reflected in the following graph:
We can see an outlier: a single point on the upper-right side of the graph that forms its own
cluster. In order to obtain a better categorization of the stocks within the SPY index, we
remove this stock and fit the model one more time.
# Find the stock with the highest value in the Standard Deviation variable
stdOrder = ret_var.sort_values('Standard Deviation', ascending=False)
first_symbol = stdOrder.index[0]
# Remove the outlier and refit the K-Means model with k = 5
ret_var.drop(first_symbol, inplace=True)
kmeans = KMeans(n_clusters=5).fit(ret_var.values)
Figure 17: Clusters of ret_var dataset without Outliers. k=5
The x-axis of Figure 17 shows the returns of the stocks and the y-axis their standard
deviation. Stocks in the upper-right cluster are those with the highest values of returns and
standard deviation.
Finally, we assign to each stock its corresponding cluster number (1, 2, 3, 4 or 5) and build a
dataframe with this information. Once we have the cluster number for each stock, we can
construct a diversified long-term portfolio by selecting stocks from different clusters.
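A minimal sketch of this step (it assumes ret_var has the tickers as its index and kmeans is the model refitted after removing the outlier):
stockClusters = pd.DataFrame({"Ticker": ret_var.index,
                              "Cluster": kmeans.labels_ + 1})   # report clusters as 1-5 instead of 0-4
print(stockClusters.head())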
The stockClusters dataframe pairs each ticker with its cluster number. We conclude this
section with a categorization of each stock in the S&P 500 in terms of returns and risk, which
can be an important tool for portfolio diversification.
Neural Networks Overview
Neural networks are an advanced concept within the field of Deep Learning. As we saw in the
previous sections of this e-book, machine learning involves working with algorithms that try to
predict a target variable or that segment data to find relevant patterns without human
intervention. In contrast, a deep learning architecture stacks more than one layer of these
algorithms; such systems represent a network of algorithms that do not necessarily need
labelled data to make predictions. Below we can visualize the main components of a Neural
Network:
A Neural Network is composed of input layers, hidden layers and output layers.
The input layers are the beginning of the workflow in a neural network and receive information
about the initial data or the features of a dataset.
Hidden layers identify the important information from the inputs, leaving out the redundant
information. They make the network more efficient and pass the important patterns of data to
the next layer which is the output layer.
Between the hidden layers and the output layer, the network has an activation function. This
activation function transforms the inputs into the desired output and can take many forms,
such as a sigmoid function (as in the binary classification of logistic regression) or other types
of functions (ReLU, Tanh and softmax, among others that are beyond the scope of this book).
The goal of the activation function is to capture the non-linear relationships between the
inputs, and it also helps to convert the inputs into a more useful output.
Figure 16 shows a network architecture called a "feed-forward network", as the input signals
flow in only one direction (from inputs to outputs)19. The key purpose of the model is to find
the optimal weights (W) that minimize the prediction error. Just like other machine learning
algorithms, a Neural Network has a loss function that is minimized.
A common optimization algorithm used to minimize the loss function in Neural Networks is
"gradient descent", which iterates to find the weights that give the smallest value of the error
term. It works broadly as follows:
1. Assign random values for the weights and pass them as parameters of the function.
2. Calculate the change in Sum Squared Error (SSE) when the weights are changed by
small amounts from their original random values.
3. Adjust the weights to reduce the impact of the prediction errors.
4. Use the new weights for predictions and to calculate the new SSE.
5. Repeat steps 2 to 4 until the adjustments in the weights no longer significantly reduce the
error term.
19 There are other types of flow, such as backpropagation, where the error is propagated
backwards through the network to examine how the output would change in response to a
change in each weight.
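As a concrete illustration (not from the original text), scikit-learn's MLPClassifier implements a small feed-forward network whose weights are adjusted iteratively in the spirit of the steps above; a minimal sketch on simulated data:
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy binary classification data standing in for an up/down prediction problem
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer with 10 units, ReLU activation and gradient-based training
nn = MLPClassifier(hidden_layer_sizes=(10,), activation="relu", solver="sgd",
                   learning_rate_init=0.01, max_iter=2000, random_state=0)
nn.fit(X_train, y_train)
print("Test accuracy:", nn.score(X_test, y_test))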
There are some important advantages of Neural Networks that are pointed out below:
• They are suitable for representing and dealing with non-linear relationships in the data.
• These models are better at avoiding overfitting: after learning from their initial inputs and
the relationships between factors, they are able to generalize and make predictions on
unseen data.
• They do not impose restrictions on the input variables, such as requiring a particular
distribution of the data.
• They are able to model heteroscedastic20 data, with high volatility and non-constant
variance. This represents a breakthrough in financial time series prediction, where the
data often has a high and time-dependent variance.
Due to their ability to model complex relationships and to capture the underlying factors of the
data, these models are a good tool for forecasting stock prices. Recent advances in the field
include the use of Recurrent Neural Networks and Long Short-Term Memory (LSTM) models
for forecasting.
20
The variability or variance of the time series is not constant over time. The opposite is called homoscedasticity.
Note from Finance Train
I hope you found this book useful. Please also check out our other ebooks.
Quantitative Trading Strategies with R
Link: https://financetrain.gumroad.com/l/quantitative-trading-strategies-r
Derivatives with R
Link: https://financetrain.gumroad.com/l/derivatives-with-r
Credit Risk Modelling with R
Link: https://financetrain.gumroad.com/l/credit-risk-modelling-with-r
Financial Time Series with R
Link: https://financetrain.gumroad.com/l/financial-time-series-r
Data Visualization with R
Link: https://financetrain.gumroad.com/l/data-visualization-with-r
END