Article Review 9 Eng


Machine Learning

Ensemble Methods and Reinforcement Learning
Table of Contents
Ensemble Methods: Elegant Techniques to Produce Improved Machine Learning Results
Voting and Averaging Based Ensemble Methods
Majority Voting
Weighted Voting
Simple Averaging
Weighted Averaging
Stacking Multiple Machine Learning Models
Bootstrap Aggregating
Boosting: Converting Weak Models to Strong Ones
What is reinforcement learning?
Examples of reinforcement learning
Challenges with reinforcement learning
What distinguishes reinforcement learning from deep learning and machine learning?
Use Case
References
Other Reading Sources

Ensemble Methods: Elegant Techniques to Produce Improved
Machine Learning Results
Machine learning, in computing, is where art meets science. Perfecting a machine
learning tool is largely a matter of understanding the data and choosing the right
algorithm. But why choose one algorithm when you can choose many and make them all
work toward one goal: improved results? In this article, Toptal engineer Necati Demir
walks us through some elegant ensemble techniques, in which a combination of data
splits and multiple algorithms is used to produce machine learning results with
higher accuracy.

Ensemble methods are techniques that create multiple models and then combine
them to produce improved results. They usually produce more accurate solutions
than a single model would. This has been the case in a number of machine learning
competitions, where the winning solutions used ensemble methods. In the popular
Netflix Competition, the winner used an ensemble method to implement a powerful
collaborative filtering algorithm. Another example is KDD 2009, where the winner
also used ensemble methods. Winners of Kaggle competitions have used these methods
as well; the winner of the CrowdFlower competition, for example, discussed them in
an interview.

It is important to understand a few terms before we continue with this article.
Throughout the article I use the term “model” to describe the output of an
algorithm that has been trained with data. This model is then used for making
predictions. The algorithm can be any machine learning algorithm, such as logistic
regression or a decision tree. These models, when used as inputs to ensemble
methods, are called “base models”.

In this blog post I will cover ensemble methods for classification and describe some
widely known ensemble methods: voting, stacking, bagging and boosting.

Voting and Averaging Based Ensemble Methods


Voting and averaging are two of the easiest ensemble methods. They are both easy
to understand and implement. Voting is used for classification and averaging is used
for regression.

In both methods, the first step is to create multiple classification/regression models
using some training dataset. Each base model can be created using different splits
of the same training dataset and same algorithm, or using the same dataset with
different algorithms, or any other method. The following Python-esque pseudocode
shows the use of the same training dataset with different algorithms.

train = load_csv("train.csv")
target = train["target"]
train = train.drop("target")
test = load_csv("test.csv")

# pick one of the two lists, depending on the task
algorithms = [logistic_regression, decision_tree_classification, ...]  # for classification
algorithms = [linear_regression, decision_tree_regressor, ...]  # for regression

# one row per test instance, one column per model
predictions = matrix(row_length=len(test), column_length=len(algorithms))

for i, algorithm in enumerate(algorithms):
    predictions[:, i] = algorithm.fit(train, target).predict(test)

According to the above pseudocode, we created predictions for each model and
saved them in a matrix called predictions where each column contains predictions
from one model.

Majority Voting
Every model makes a prediction (votes) for each test instance and the final output
prediction is the one that receives more than half of the votes. If none of the
predictions get more than half of the votes, we may say that the ensemble method
could not make a stable prediction for this instance. Although this is a widely used
technique, you may try the most voted prediction (even if that is less than half of
the votes) as the final prediction. In some articles, you may see this method being
called “plurality voting”.
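
The difference between the two rules can be sketched in a few lines of Python. This is a minimal illustration, not part of the original article: the function names and the tiny vote table are made up for the example, and ties in the plurality rule are broken by whichever label was seen first.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label with more than half of the votes, or None if there is none."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def plurality_vote(votes):
    """Return the most voted label, even without an absolute majority."""
    return Counter(votes).most_common(1)[0][0]


# Each row holds the votes of three hypothetical base models for one test instance.
predictions = [
    ["spam", "spam", "ham"],   # clear majority
    ["ham", "spam", "eggs"],   # no label reaches more than half of the votes
]

print([majority_vote(row) for row in predictions])   # ['spam', None]
print([plurality_vote(row) for row in predictions])  # ['spam', 'ham']
```
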

Weighted Voting
Unlike majority voting, where each model has an equal say, weighted voting lets us
increase the importance of one or more models. In weighted voting you count the
prediction of the better models multiple times. Finding a reasonable set of weights
is up to you.
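
As a rough sketch of this idea, the snippet below sums a weight per label instead of counting votes. The helper name and the weights are illustrative assumptions; in practice the weights might come from validation scores.

```python
from collections import defaultdict


def weighted_vote(votes, weights):
    """Return the label whose supporting models have the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(votes, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


votes = ["ham", "spam", "spam"]       # predictions of three hypothetical models
print(weighted_vote(votes, [1.0, 1.0, 1.0]))  # spam (plain vote count)
print(weighted_vote(votes, [3.0, 1.0, 1.0]))  # ham (the first model counts three times)
```
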

Simple Averaging
In the simple averaging method, the predictions of all models are averaged for every
instance of the test dataset. This method often reduces overfitting and creates a
smoother regression model. The following pseudocode shows this simple averaging
method:

final_predictions = []
for row_number in range(len(predictions)):
    final_predictions.append(
        mean(predictions[row_number, :])
    )

Weighted Averaging
Weighted averaging is a slightly modified version of simple averaging, where the
prediction of each model is multiplied by its weight before the average is
calculated, so the result is the weighted sum divided by the sum of the weights.
The following pseudocode shows weighted averaging:

weights = [..., ..., ...]  # length is equal to len(algorithms)
final_predictions = []
for row_number in range(len(predictions)):
    final_predictions.append(
        sum(predictions[row_number, :] * weights) / sum(weights)
    )
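
Both averaging rules can be checked on a small worked example. The sketch below uses plain Python lists; the prediction values and weights are invented for illustration.

```python
from statistics import mean


def simple_average(row):
    """Average the predictions of all models for one test instance."""
    return mean(row)


def weighted_average(row, weights):
    """Weighted sum of the predictions divided by the sum of the weights."""
    return sum(p * w for p, w in zip(row, weights)) / sum(weights)


# Each row holds the predictions of three hypothetical regression models.
predictions = [[10.0, 12.0, 14.0], [1.0, 2.0, 6.0]]
weights = [1.0, 1.0, 2.0]  # the third model is trusted twice as much

print([simple_average(row) for row in predictions])             # [12.0, 3.0]
print([weighted_average(row, weights) for row in predictions])  # [12.5, 3.75]
```
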

Stacking Multiple Machine Learning Models

Stacking, also known as stacked generalization, is an ensemble method where the
models are combined using another machine learning algorithm. The basic idea is to
train several machine learning algorithms on the training dataset and then generate
a new dataset from the predictions of these models. This new dataset is then used as
input for the combiner machine learning algorithm.

The pseudocode of a stacking procedure is summarized below:

base_algorithms = [logistic_regression, decision_tree_classification, ...]  # for classification

stacking_train_dataset = matrix(row_length=len(target), column_length=len(base_algorithms))
stacking_test_dataset = matrix(row_length=len(test), column_length=len(base_algorithms))

for i, base_algorithm in enumerate(base_algorithms):
    stacking_train_dataset[:, i] = base_algorithm.fit(train, target).predict(train)
    stacking_test_dataset[:, i] = base_algorithm.predict(test)

final_predictions = combiner_algorithm.fit(stacking_train_dataset, target).predict(stacking_test_dataset)

As you can see in the above pseudocode, the training dataset for the combiner
algorithm is generated using the outputs of the base algorithms. In the pseudocode,
each base model is fitted on the training dataset and then the same dataset is used
again to make predictions. But as we know, in the real world we do not use the same
training dataset for prediction, so to overcome this problem you may see some
implementations of stacking where the training dataset is split. Below you can see
pseudocode where the training dataset is split before training the base algorithms:

base_algorithms = [logistic_regression, decision_tree_classification, ...]  # for classification

stacking_train_dataset = matrix(row_length=len(target), column_length=len(base_algorithms))
stacking_test_dataset = matrix(row_length=len(test), column_length=len(base_algorithms))

for i, base_algorithm in enumerate(base_algorithms):
    # out-of-fold predictions; you may use KFold from sklearn.model_selection
    for trainix, testix in split(train, k=10):
        stacking_train_dataset[testix, i] = base_algorithm.fit(train[trainix], target[trainix]).predict(train[testix])
    # refit on the full training set before predicting the test set
    stacking_test_dataset[:, i] = base_algorithm.fit(train, target).predict(test)

final_predictions = combiner_algorithm.fit(stacking_train_dataset, target).predict(stacking_test_dataset)
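
To make the split-based variant concrete, here is a minimal runnable sketch of how the combiner's training matrix can be filled with out-of-fold predictions. Everything in it is an assumption made for illustration: the two toy base models, the contiguous k-fold helper and the synthetic data are not from the article, and the combiner itself is left out so the focus stays on how the stacking dataset is built.

```python
import random


class MeanModel:
    """Hypothetical base model: always predicts the mean of the training targets."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class NearestNeighborModel:
    """Hypothetical base model: predicts the target of the closest training point (1-D inputs)."""
    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        return [self.y_[min(range(len(self.X_)), key=lambda j: abs(self.X_[j] - x))] for x in X]


def k_fold_indices(n, k):
    """Yield (train_indices, holdout_indices) for k contiguous folds over n rows."""
    fold_sizes = [n // k + (1 if f < n % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        holdout = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, holdout
        start += size


def out_of_fold_predictions(model_factory, X, y, k):
    """Predict every training row with a model that never saw that row."""
    oof = [None] * len(X)
    for train_idx, holdout_idx in k_fold_indices(len(X), k):
        model = model_factory().fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i, p in zip(holdout_idx, model.predict([X[i] for i in holdout_idx])):
            oof[i] = p
    return oof


random.seed(0)
X = [i / 10 for i in range(50)]
random.shuffle(X)
y = [2 * x + random.gauss(0, 0.1) for x in X]   # synthetic regression target
X_new = [0.5, 2.5]                               # stand-in for the test set

base_factories = [MeanModel, NearestNeighborModel]
# One column per base model, one row per training instance.
stacking_train = list(zip(*[out_of_fold_predictions(f, X, y, k=5) for f in base_factories]))
# For the test set, each base model is refitted on all training rows.
stacking_test = list(zip(*[f().fit(X, y).predict(X_new) for f in base_factories]))
print(len(stacking_train), len(stacking_test))
```

The resulting `stacking_train` rows, together with `y`, would then be passed to whatever combiner algorithm is chosen.
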

Bootstrap Aggregating

The name Bootstrap Aggregating, also known as “Bagging”, summarizes the key
elements of this strategy. In the bagging algorithm, the first step involves creating
multiple models. These models are generated using the same algorithm with
random sub-samples of the dataset, which are drawn from the original dataset
with the bootstrap sampling method. In bootstrap sampling, elements are drawn with
replacement, so some original examples appear more than once and some original
examples are not present in the sample. If you want to create a sub-dataset with m
elements, you select a random element from the original dataset m times. And if the
goal is to generate n datasets, you repeat this procedure n times.

At the end, we have n datasets where the number of elements in each dataset is m.
The following Python-esque pseudocode shows bootstrap sampling:

def bootstrap_sample(original_dataset, m):
    sub_dataset = []
    for i in range(m):
        sub_dataset.append(
            random_one_element(original_dataset)
        )
    return sub_dataset
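
A runnable version of this sampling step can lean on `random.choices` from the Python standard library, which draws with replacement. The sketch below is only an illustration; the extra `rng` parameter and the toy dataset are assumptions added so the run is repeatable.

```python
import random


def bootstrap_sample(original_dataset, m, rng):
    """Draw m elements from the dataset with replacement."""
    return rng.choices(original_dataset, k=m)


rng = random.Random(42)
data = list(range(10))
sample = bootstrap_sample(data, m=10, rng=rng)

duplicates = len(sample) - len(set(sample))   # draws that repeat an earlier element
missing = sorted(set(data) - set(sample))     # original elements that were never drawn
print(sample, duplicates, missing)
```

When m equals the size of the original dataset, every repeated draw leaves one original element out, so the two counts printed above always match.
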
The second step in bagging is aggregating the generated models. Well known
methods, such as voting and averaging, are used for this purpose.

The overall pseudocode looks like this:

def bagging(n, m, base_algorithm, train_dataset, target, test_dataset):
    predictions = matrix(row_length=len(test_dataset), column_length=n)
    for i in range(n):
        # sample row indices so that features and targets stay aligned
        ix = bootstrap_sample(range(len(train_dataset)), m)
        predictions[:, i] = base_algorithm.fit(train_dataset[ix], target[ix]).predict(test_dataset)

    final_predictions = voting(predictions)     # for classification
    final_predictions = averaging(predictions)  # for regression

    return final_predictions

In bagging, each sub-sample can be generated independently of the others, so
generation and training can be done in parallel.
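
Putting the two steps together, the following sketch runs bagging end to end on a toy problem. It is an illustration under stated assumptions rather than a reference implementation: the one-threshold `Stump` learner, the synthetic 1-D data with about 5% label noise, and the parameter values are all invented for the example.

```python
import random
from collections import Counter


class Stump:
    """Hypothetical weak learner: a single threshold on a 1-D feature."""
    def fit(self, X, y):
        best = None
        for t in sorted(set(X)):
            for left, right in ((0, 1), (1, 0)):
                errors = sum((left if x <= t else right) != label for x, label in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, t, left, right)
        _, self.threshold, self.left, self.right = best
        return self

    def predict(self, X):
        return [self.left if x <= self.threshold else self.right for x in X]


def bagging(n, m, X, y, X_test, rng):
    """Train n stumps on bootstrap samples of size m and combine them by voting."""
    all_predictions = []
    for _ in range(n):
        idx = rng.choices(range(len(X)), k=m)   # bootstrap sample of row indices
        model = Stump().fit([X[i] for i in idx], [y[i] for i in idx])
        all_predictions.append(model.predict(X_test))
    # Most voted label per test instance.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*all_predictions)]


rng = random.Random(0)
X = [rng.uniform(0, 10) for _ in range(200)]
# True rule: class 1 when x > 5, with roughly 5% of the labels flipped as noise.
y = [int(x > 5) if rng.random() > 0.05 else int(x <= 5) for x in X]
X_test = [1.0, 2.5, 7.5, 9.0]

print(bagging(15, len(X), X, y, X_test, rng))
```

Because each iteration of the loop depends only on its own bootstrap sample, the iterations could be distributed across workers without changing the result of the vote.
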

You can also find implementations of the bagging strategy in some algorithms. For
example, the Random Forest algorithm uses the bagging technique with some
differences: Random Forest uses random feature selection, and its base algorithm
is a decision tree.

Boosting: Converting Weak Models to Strong Ones

The term “boosting” is used to describe a family of algorithms which are able to
convert weak models to strong models. A model is weak if it has a substantial
error rate but still performs better than random guessing (which would give an
error rate of 0.5 for binary classification). Boosting incrementally builds an
ensemble by training each model with the same dataset but where the weights of
instances are adjusted according to the error of the last prediction. The main idea
is forcing the models to focus on the instances which are hard. Unlike bagging,
boosting is a sequential method, so the models cannot be trained in parallel.

The general procedure of the boosting algorithm is defined as follows:

def adjust_dataset(_train, errors):
    # create a new dataset by using the hardest instances
    ix = get_highest_errors_index(errors)
    return concat(_train[ix], random_select(train))

models = []
_train = random_select(train)
for i in range(n):  # n rounds
    model = base_algorithm.fit(_train)
    predictions = model.predict(_train)
    models.append(model)
    errors = calculate_error(predictions)
    _train = adjust_dataset(_train, errors)

final_predictions = combine(models, test)

The adjust_dataset function returns a new dataset containing the hardest instances,
which forces the base algorithm to learn from them.

AdaBoost is a widely known boosting algorithm. Its creators won the Gödel Prize for
their work. A decision tree is usually preferred as the base algorithm for AdaBoost,
and in the sklearn library the default base algorithm for AdaBoost is a decision tree
(AdaBoostRegressor and AdaBoostClassifier). As discussed in the previous paragraph,
the same incremental method applies to AdaBoost. Information gathered at each step
of the AdaBoost algorithm about the ‘hardness’ of each training sample is fed into
the model. The ‘adjusting dataset’ step is different from the one described above,
and the ‘combining models’ step is calculated by using weighted voting.
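
The reweighting and weighted-vote steps can be sketched as follows. This is a compact textbook-style AdaBoost with one-threshold stumps on a nine-point toy dataset, written for illustration; it is not sklearn's implementation, and the helper names and data are assumptions.

```python
import math


def stump_predict(x, threshold, polarity):
    """Predict `polarity` at or below the threshold and `-polarity` above it."""
    return polarity if x <= threshold else -polarity


def fit_stump(X, y, w):
    """Return (weighted_error, threshold, polarity) of the best stump under weights w."""
    best = None
    for t in sorted(set(X)):
        for polarity in (1, -1):
            err = sum(wi for wi, xi, yi in zip(w, X, y) if stump_predict(xi, t, polarity) != yi)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best


def adaboost(X, y, rounds):
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform instance weights
    ensemble = []
    for _ in range(rounds):
        err, t, polarity = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)      # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)    # vote weight of this stump
        ensemble.append((alpha, t, polarity))
        # Misclassified instances gain weight, correctly classified ones lose weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, t, polarity))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, polarity) for alpha, t, polarity in ensemble)
    return 1 if score >= 0 else -1


X = list(range(1, 10))
y = [1, 1, 1, -1, -1, -1, 1, 1, 1]   # no single threshold separates these labels

one_round = adaboost(X, y, rounds=1)
three_rounds = adaboost(X, y, rounds=3)
print([predict(one_round, x) for x in X])
print([predict(three_rounds, x) for x in X])
```

On this toy data a single stump necessarily misclassifies some points, while the weighted vote of three stumps reproduces the training labels.
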

What is reinforcement learning?

Reinforcement learning is the training of machine learning models to make a
sequence of decisions. The agent learns to achieve a goal in an uncertain, potentially
complex environment. In reinforcement learning, an artificial intelligence faces a
game-like situation. The computer employs trial and error to come up with a
solution to the problem. To get the machine to do what the programmer wants, the
artificial intelligence gets either rewards or penalties for the actions it performs. Its
goal is to maximize the total reward.
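
The trial-and-error loop can be illustrated with tabular Q-learning, one common reinforcement learning algorithm (the article does not name a specific one). The sketch below is a toy: the five-cell corridor, the reward values and the hyperparameters are all assumptions chosen for the example. The agent starts at the left end, receives a small penalty for every step and a reward for reaching the right end.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal
ACTIONS = (-1, +1)    # move left or move right


def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else -0.01   # reward at the goal, small penalty otherwise
    return next_state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(100):                      # cap the episode length
            if rng.random() < epsilon:            # explore: random trial
                action = rng.choice(ACTIONS)
            else:                                 # exploit: best known action
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate toward the reward plus the discounted future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
            if done:
                break
    return q


q = train()
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

No rule ever tells the agent to walk right; the preference for moving toward the goal emerges only from the rewards and penalties it collects.
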

Although the designer sets the reward policy (that is, the rules of the game), they give
the model no hints or suggestions for how to solve the game. It’s up to the model to
figure out how to perform the task to maximize the reward, starting from totally
random trials and finishing with sophisticated tactics and superhuman skills. By
leveraging the power of search and many trials, reinforcement learning is currently
the most effective way to hint at a machine's creativity. In contrast to human beings,
artificial intelligence can gather experience from thousands of parallel gameplays if
a reinforcement learning algorithm is run on a sufficiently powerful computer
infrastructure.

Examples of reinforcement learning


Applications of reinforcement learning were in the past limited by weak computer
infrastructure. However, as Gerard Tesauro’s backgammon AI superplayer, developed
in the 1990s, shows, progress did happen. That early progress is now rapidly
accelerating, with powerful new computational technologies opening the way to
completely new and inspiring applications.

Training the models that control autonomous cars is an excellent example of a
potential application of reinforcement learning. In an ideal situation, the computer
should get no instructions on driving the car. The programmer would avoid
hard-wiring anything connected with the task and allow the machine to learn from
its own errors. In a perfect situation, the only hard-wired element would be the
reward function.

● For example, in usual circumstances we would require an autonomous
vehicle to put safety first, minimize ride time, reduce pollution, offer
passengers comfort and obey traffic law. With an autonomous race
car, on the other hand, we would emphasize speed much more than the
driver’s comfort. The programmer cannot predict everything that could
happen on the road. Instead of building lengthy “if-then” instructions, the
programmer prepares the reinforcement learning agent to be capable of
learning from the system of rewards and penalties. The agent (another
name for the reinforcement learning algorithm performing the task) gets
rewards for reaching specific goals.

● Another example: deepsense.ai took part in the “Learning to run” project,
which aimed to train a virtual runner from scratch. The runner is an
advanced and precise musculoskeletal model designed by the Stanford
Neuromuscular Biomechanics Laboratory. Learning how to run is a first
step in building a new generation of prosthetic legs, ones that
automatically recognize people’s walking patterns and tweak themselves
to make moving easier and more effective. While it is possible and has
been done in Stanford’s labs, hard-wiring all the commands and predicting
all possible patterns of walking requires a lot of work from highly skilled
programmers.
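
To illustrate how the same reward function can encode different priorities for the road car and the race car in the first example above, here is a hypothetical sketch. The outcome fields, the weights and the two sample drives are all invented for illustration; a real reward function would be designed and tuned for the specific simulator.

```python
from dataclasses import dataclass


@dataclass
class DriveOutcome:
    """Hypothetical summary of one simulated drive."""
    collisions: int
    ride_minutes: float
    emissions_kg: float
    discomfort: float          # e.g. accumulated harsh braking, arbitrary units
    traffic_violations: int


def reward(outcome, weights):
    """Negative weighted sum of costs: the agent is rewarded for keeping them low."""
    return -(
        weights["collisions"] * outcome.collisions
        + weights["ride_minutes"] * outcome.ride_minutes
        + weights["emissions_kg"] * outcome.emissions_kg
        + weights["discomfort"] * outcome.discomfort
        + weights["traffic_violations"] * outcome.traffic_violations
    )


# Illustrative weights only: safety dominates in both, the race car mostly cares about time.
ROAD_CAR = {"collisions": 1000.0, "ride_minutes": 1.0, "emissions_kg": 5.0,
            "discomfort": 2.0, "traffic_violations": 50.0}
RACE_CAR = {"collisions": 1000.0, "ride_minutes": 20.0, "emissions_kg": 0.0,
            "discomfort": 0.0, "traffic_violations": 0.0}

calm = DriveOutcome(collisions=0, ride_minutes=30.0, emissions_kg=2.0,
                    discomfort=1.0, traffic_violations=0)
fast = DriveOutcome(collisions=0, ride_minutes=20.0, emissions_kg=4.0,
                    discomfort=10.0, traffic_violations=1)

print(reward(calm, ROAD_CAR), reward(fast, ROAD_CAR))   # -42.0 -110.0
print(reward(calm, RACE_CAR), reward(fast, RACE_CAR))   # -600.0 -400.0
```

With these weights the road car is rewarded more for the calm drive and the race car for the fast one, even though the driving behavior itself is never hard-wired.
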

Challenges with reinforcement learning


The main challenge in reinforcement learning lies in preparing the simulation
environment, which is highly dependent on the task to be performed. When the
model has to go superhuman in Chess, Go or Atari games, preparing the simulation
environment is relatively simple. When it comes to building a model capable of
driving an autonomous car, building a realistic simulator is crucial before letting the
car ride on the street. The model has to figure out how to brake or avoid a collision
in a safe environment, where sacrificing even a thousand cars comes at a minimal
cost. Transferring the model out of the training environment and into the real world
is where things get tricky.

Scaling and tweaking the neural network controlling the agent is another challenge.
There is no way to communicate with the network other than through the system of
rewards and penalties. This in particular may lead to catastrophic forgetting, where
acquiring new knowledge causes some of the old to be erased from the network (to
read up on this issue, see the paper on it published at the International Conference
on Machine Learning).

Yet another challenge is reaching a local optimum: the agent performs the task,
but not in the optimal or required way. A “jumper” that hops like a kangaroo
instead of doing what was expected of it, walking, is a great example, and one
that is described in a deepsense.ai blog post.

Finally, there are agents that optimize the reward without performing the task they
were designed for. An interesting example can be found in an OpenAI video,
where the agent learned to gain rewards, but not to complete the race.

What distinguishes reinforcement learning from deep learning and machine learning?
In fact, there is no clear divide between machine learning, deep learning and
reinforcement learning. The relation is like that of a parallelogram, a rectangle
and a square, where machine learning is the broadest category and deep
reinforcement learning the narrowest one. In the same way, reinforcement learning
is a specialized application of machine and deep learning techniques, designed to
solve problems in a particular way.

Figure 1: Matrix Method and Problem

Although the ideas seem to differ, there is no sharp divide between these subtypes.
Moreover, they merge within projects, as the models are designed not to stick to a
“pure type” but to perform the task in the most effective way possible. So “what
precisely distinguishes machine learning, deep learning and reinforcement learning”
is actually a tricky question to answer.

● Machine learning is a form of AI in which computers are given the ability
to progressively improve their performance on a specific task with data,
without being directly programmed (this is the definition of Arthur Lee
Samuel, who coined the term “machine learning”). There are two types of
machine learning: supervised and unsupervised.

Supervised machine learning happens when a programmer can provide a
label for every training input into the machine learning system.

● Example: by analyzing the historical data taken from coal mines,
deepsense.ai prepared an automated system for predicting dangerous
seismic events up to 8 hours before they occur. The records of seismic
events were taken from 24 coal mines that had collected data for several
months. The model was able to recognize the likelihood of an explosion by
analyzing the readings from the previous 24 hours.

Figure 2: Coal Mines Data

From the AI point of view, a single model was performing a single task on a clarified
and normalized dataset.

Unsupervised learning takes place when the model is provided only with the input
data, but no explicit labels. It has to dig through the data and find the hidden
structure or relationships within. The designer might not know what the structure is
or what the machine learning model is going to find.

An example we employed was for churn prediction. We analyzed customer data and
designed an algorithm to group similar customers. However, we didn’t choose the
groups ourselves. Later on, we could identify high-risk groups (those with a high
churn rate) and our client knew which customers they should approach first.

Another example of unsupervised learning is anomaly detection, where the
algorithm has to spot the element that doesn’t fit in with the group. It may be a
flawed product, potentially fraudulent transaction or any other event associated
with breaking the norm.
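
A very simple way to spot “the element that doesn’t fit” is to flag values that lie far from the mean, measured in standard deviations. The sketch below is only one of many possible approaches; the threshold of 2.0 and the sample readings are assumptions made for the example.

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds `threshold` standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []          # all values are identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0]   # no labels are given
print(find_anomalies(readings))  # [25.0]
```
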

Deep learning relies on neural networks with several layers, designed to perform
more sophisticated tasks. The construction of deep learning models was inspired by
the design of the human brain, but simplified. Deep learning models consist of a few
neural network layers which are in principle responsible for gradually learning more
abstract features about particular data.

Although deep learning solutions are able to provide marvelous results, in terms of
scale they are no match for the human brain. Each layer uses the outcome of the
previous one as an input and the whole network is trained as a single whole. The
core concept of creating an artificial neural network is not new, but only recently
has modern hardware provided enough computational power to effectively train
such networks by exposing them to a sufficient number of examples. Extended
adoption has brought about frameworks like TensorFlow, Keras and PyTorch, all of
which have made building machine learning models much more convenient.

Example: deepsense.ai designed a deep learning-based model for the National
Oceanic and Atmospheric Administration (NOAA). It was designed to recognize Right
whales from aerial photos taken by researchers. For further information about this
endangered species and deepsense.ai’s work with the NOAA, see the deepsense.ai blog
post. From a technical point of view, recognizing a particular specimen of whale from
aerial photos is pure deep learning. The solution consists of a few machine learning
models performing separate tasks. The first one was in charge of finding the head
of the whale in the photograph while the second normalized the photo by cutting
and turning it, which ultimately provided a unified view (a passport photo) of a
single whale.

Figure 3: Whale Recognition

The third model was responsible for recognizing particular whales from photos that
had been prepared and processed earlier. A network composed of 5 million neurons
located the bowhead bonnet-tip. Over 941,000 neurons looked for the head and
more than 3 million neurons were used to classify the particular whale. That’s over
9 million neurons performing the task, which may seem like a lot, but pales in
comparison to the more than 100 billion neurons at work in the human brain. We
later used a similar deep learning-based solution to diagnose diabetic retinopathy
using images of patients’ retinas.

Reinforcement learning, as stated above, employs a system of rewards and penalties
to compel the computer to solve a problem by itself. Human involvement is limited
to changing the environment and tweaking the system of rewards and penalties. As
the computer maximizes the reward, it is prone to seeking unexpected ways of
doing it. Human involvement is focused on preventing it from exploiting the system
and motivating the machine to perform the task in the way expected.
Reinforcement learning is useful when there is no “proper way” to perform a task,
yet there are rules the model has to follow to perform its duties correctly. Take the
road code, for example.

Example: By tweaking and seeking the optimal policy for deep reinforcement
learning, we built an agent that in just 20 minutes reached a superhuman level in
playing Atari games. Similar algorithms in principle can be used to build AI for an
autonomous car or a prosthetic leg. In fact, one of the best ways to evaluate the
reinforcement learning approach is to give the model an Atari video game to play,
such as Arkanoid or Space Invaders. According to Google Brain’s Marc G. Bellemare,
who introduced Atari video games as a reinforcement learning benchmark,
“although challenging, these environments remain simple enough that we can hope
to achieve measurable progress as we attempt to solve them”.

In particular, if artificial intelligence is going to drive a car, learning to play some
Atari classics can be considered a meaningful intermediate milestone. A potential
application of reinforcement learning in autonomous vehicles is the following
interesting case. A developer is unable to predict all future road situations, so letting
the model train itself with a system of penalties and rewards in a varied
environment is possibly the most effective way for the AI to broaden the experience
it both has and collects.
Use Case
Background & Problem Statement:

You are a data scientist at IDX Partners, currently helping the finance team of
the biggest banking corporation in Indonesia with machine learning models. The
finance team faces challenges in accurately assessing the creditworthiness of its
customers. This process involves analyzing a large number of factors, including
credit history, income, liabilities, and more. A single predictive model may not be
sufficient to handle the complexity of this data and produce consistent predictions.

As a data scientist, you propose using ensemble methods to handle this problem.
How would you implement ensemble methods step by step?

Solution:

Implement ensemble methods in machine learning to enhance the accuracy and
stability of creditworthiness predictions. Ensemble methods combine multiple
models to improve overall performance and reduce the risk of overfitting
or underfitting.

Here's a step-by-step guide to implementing ensemble methods, focusing on
Random Forest as an example:

● Step 1: Gather and Prepare Data
○ Ensure you have a well-prepared dataset with features and target
variables. Split the data into training and testing sets.
● Step 2: Choose Base Models
○ Select diverse base models that capture different aspects of the data.
For example, in the case of Random Forest, the base models are
individual decision trees.
● Step 3: Build Base Models
○ Train each base model on a subset of the training data. In the case of
Random Forest, train multiple decision trees with random subsets of
features and observations.
● Step 4: Combine Base Models
○ For Random Forest:
○ Aggregate the predictions of each decision tree.
○ Use a majority vote for classification problems or average the
predictions for regression problems.
● Step 5: Evaluate Ensemble Model
○ Assess the performance of the ensemble model on the testing set
using appropriate metrics (accuracy, precision, recall, F1-score for
classification; mean squared error for regression).
● Step 6: Tune Hyperparameters if needed
○ Optimize hyperparameters such as the number of base models, the
depth of individual trees, and other relevant parameters to improve
the ensemble's performance.
● Step 7: Feature Importance Analysis
○ Examine feature importance scores generated by the ensemble.
○ Identify key features contributing to predictions.
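
The steps above can be sketched with scikit-learn, assuming it is installed. This is an illustrative outline rather than the bank's actual pipeline: the data is a synthetic stand-in generated with `make_classification` (real features such as income or credit history are not available here), and the parameter grid is deliberately tiny.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Step 1: synthetic stand-in for a prepared credit dataset, split into train and test.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Steps 2-4 and 6: a Random Forest builds and combines the decision trees itself;
# the grid search tries a few values for the number and depth of the trees.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [4, None]},
    cv=3,
    scoring="f1",
)
search.fit(X_train, y_train)
model = search.best_estimator_

# Step 5: evaluate on the held-out test set.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"accuracy={accuracy:.3f} f1={f1:.3f} best_params={search.best_params_}")

# Step 7: rank features by the importance scores of the fitted forest.
importances = model.feature_importances_
ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
for index, score in ranked[:3]:
    print(f"feature {index}: importance {score:.3f}")
```

On a real credit dataset, the synthetic data in Step 1 would be replaced by the prepared customer features and the target variable, and the metric used for tuning would be chosen to match the cost of wrong credit decisions.
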

References

https://www.toptal.com/machine-learning/ensemble-methods-machine-learning#:~:text=Ensemble%20methods%20are%20techniques%20that,winning%20solutions%20used%20ensemble%20methods

https://deepsense.ai/what-is-reinforcement-learning-the-complete-guide/

https://towardsdatascience.com/various-ways-to-evaluate-a-machine-learning-models-performance-230449055f15

Other Reading Sources:

https://www.jeremyjordan.me/evaluating-a-machine-learning-model/

https://www.analyticsvidhya.com/blog/2019/08/11-important-model-evaluation-error-metrics/
