Article Review 9 Eng
Ensemble Methods: Elegant Techniques to Produce Improved
Machine Learning Results
Machine Learning, in computing, is where art meets science. Perfecting a machine
learning tool is a lot about understanding data and choosing the right algorithm. But
why choose one algorithm when you can choose many and make them all work to
achieve one thing: improved results? In this article, Toptal Engineer Necati Demir
walks us through some elegant techniques of ensemble methods where a
combination of data splits and multiple algorithms is used to produce machine
learning results with higher accuracy.
Ensemble methods are techniques that create multiple models and then combine
them to produce improved results. Ensemble methods usually produce more
accurate solutions than a single model would. This has been the case in a number of
machine learning competitions, where the winning solutions used ensemble
methods. In the popular Netflix Competition, the winner used an ensemble method
to implement a powerful collaborative filtering algorithm. Another example is KDD
2009 where the winner also used ensemble methods. You can also find winners
who used these methods in Kaggle competitions; one example is the winner of the
CrowdFlower competition, who described the approach in an interview.
In this blog post I will cover ensemble methods for classification and describe some
widely known techniques: voting, stacking, bagging and boosting.
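train = load_csv("train.csv")
target = train["target"]
train = train.drop("target")
test = load_csv("test.csv")

algorithms = [logistic_regression, decision_tree_classification, ...]  # for classification
algorithms = [linear_regression, decision_tree_regressor, ...]  # for regression

# one row per test instance, one column per model
predictions = matrix(row_length=len(test), column_length=len(algorithms))

# fit each model on the training data and store its test predictions in column i
for i, algorithm in enumerate(algorithms):
    predictions[, i] = algorithm.fit(train, target).predict(test)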
According to the above pseudocode, we created predictions for each model and
saved them in a matrix called predictions where each column contains predictions
from one model.
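To make the layout of the predictions matrix concrete, here is a minimal runnable sketch in plain Python. The toy model fit_mean_threshold and the sample data are hypothetical stand-ins for real learners such as logistic regression; only the layout (one row per test instance, one column per model) is taken from the pseudocode above.

```python
from statistics import mean

def fit_mean_threshold(rows, labels, feature):
    """Hypothetical toy classifier: split one feature at its training mean."""
    cut = mean(row[feature] for row in rows)
    # Labels of the training rows that fall above the cut.
    above = [label for row, label in zip(rows, labels) if row[feature] > cut]
    # The side above the cut predicts 1 if at least half of those labels are 1.
    positive_above = 2 * sum(above) >= len(above)

    def predict(row):
        return int((row[feature] > cut) == positive_above)

    return predict

def prediction_matrix(fitters, train_rows, train_labels, test_rows):
    """Return one row per test instance and one column per fitted model."""
    models = [fit(train_rows, train_labels) for fit in fitters]
    return [[model(row) for model in models] for row in test_rows]

train_rows = [[1, 10], [2, 9], [8, 2], [9, 1]]
train_labels = [0, 0, 1, 1]
test_rows = [[7, 3], [3, 8]]
fitters = [
    lambda rows, labels: fit_mean_threshold(rows, labels, feature=0),
    lambda rows, labels: fit_mean_threshold(rows, labels, feature=1),
]
predictions = prediction_matrix(fitters, train_rows, train_labels, test_rows)
```

Each inner list of predictions holds the votes of all models for one test instance, which is the form the voting and averaging methods below consume.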
Majority Voting
Every model makes a prediction (votes) for each test instance and the final output
prediction is the one that receives more than half of the votes. If none of the
predictions gets more than half of the votes, we may say that the ensemble method
could not make a stable prediction for this instance. Although majority voting is
widely used, you may instead take the most voted prediction (even if it has less
than half of the votes) as the final prediction. In some articles, you may see this
variant called “plurality voting”.
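The difference between the two rules can be sketched in a few lines of Python. The function names are illustrative, and returning None for "no stable prediction" is an assumption of this sketch rather than something the article prescribes.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label with more than half of the votes, or None if there is none."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None

def plurality_vote(votes):
    """Return the most voted label, even if it has less than half of the votes.

    Ties are resolved in favour of the label that was seen first.
    """
    return Counter(votes).most_common(1)[0][0]
```

With four models voting "a", "b", "b" and "c", majority voting gives no answer because "b" holds exactly half of the votes, while plurality voting returns "b".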
Weighted Voting
Unlike majority voting, where each model has the same rights, we can increase the
importance of one or more models. In weighted voting you count the prediction of
the better models multiple times. Finding a reasonable set of weights is up to you.
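A minimal sketch of weighted voting, assuming one numeric weight per model. An integer weight of 3 is equivalent to counting that model's vote three times; the function name and the example weights are hypothetical.

```python
from collections import defaultdict

def weighted_vote(votes, weights):
    """Sum the weight behind each label and return the label with the largest total."""
    if len(votes) != len(weights):
        raise ValueError("one weight per model is required")
    totals = defaultdict(float)
    for label, weight in zip(votes, weights):
        totals[label] += weight
    return max(totals, key=totals.get)
```

For example, weighted_vote(["spam", "ham", "ham"], [3, 1, 1]) returns "spam" because the first model outweighs the other two combined, whereas equal weights would return "ham".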
Simple Averaging
In the simple averaging method, the predictions of all models are averaged for
every instance of the test dataset. This method often reduces overfitting and creates a
smoother regression model. The following pseudocode shows this simple
averaging method:
final_predictions = []
for row_number in range(len(predictions)):
    final_predictions.append(
        mean(predictions[row_number, ])
    )
Weighted Averaging
Weighted averaging is a slightly modified version of simple averaging, where the
prediction of each model is multiplied by its weight and then their average is
calculated. The following pseudocode shows weighted averaging:
weights = [..., ..., ...]  # length is equal to len(algorithms)
final_predictions = []
for row_number in range(len(predictions)):
    final_predictions.append(
        mean(predictions[row_number, ] * weights)
    )
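Both averaging rules can be written as short runnable Python functions. One detail here is an assumption of this sketch: the weighted version divides by the sum of the weights rather than by the number of models, which keeps the result on the scale of the original predictions. The two conventions coincide when the weights sum to the number of models.

```python
def simple_average(row):
    """Average the predictions that all models made for one instance."""
    return sum(row) / len(row)

def weighted_average(row, weights):
    """Weight each model's prediction and normalise by the total weight."""
    if len(row) != len(weights):
        raise ValueError("one weight per model is required")
    return sum(p * w for p, w in zip(row, weights)) / sum(weights)

row = [10.0, 12.0, 20.0]   # hypothetical predictions of three regressors
weights = [1, 1, 2]        # the third model is trusted twice as much
plain = simple_average(row)
weighted = weighted_average(row, weights)
```

Here the plain average is 14.0, and doubling the weight of the third model pulls the result up to 15.5.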
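Stacking
base_algorithms = [logistic_regression, decision_tree_classification, ...]  # for classification

stacking_train_dataset = matrix(row_length=len(target), column_length=len(base_algorithms))
stacking_test_dataset = matrix(row_length=len(test), column_length=len(base_algorithms))

# each base algorithm is fitted on the training data and then predicts that same data
for i, base_algorithm in enumerate(base_algorithms):
    stacking_train_dataset[, i] = base_algorithm.fit(train, target).predict(train)
    stacking_test_dataset[, i] = base_algorithm.predict(test)

final_predictions = combiner_algorithm.fit(stacking_train_dataset, target).predict(stacking_test_dataset)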
As you can see in the above pseudocode, the training dataset for the combiner
algorithm is generated using the outputs of the base algorithms. In the pseudocode,
each base algorithm is fitted on the training dataset and then the same dataset
is used again to make predictions. But as we know, in the real world we do not
predict on the data we trained on, so to overcome this problem you may see
some implementations of stacking where the training dataset is split. Below you
can see pseudocode where the training dataset is split before training the base
algorithms:
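base_algorithms = [logistic_regression, decision_tree_classification, ...]  # for classification

stacking_train_dataset = matrix(row_length=len(target), column_length=len(base_algorithms))
stacking_test_dataset = matrix(row_length=len(test), column_length=len(base_algorithms))

for i, base_algorithm in enumerate(base_algorithms):
    # split is a hypothetical k-fold helper: each held-out part is predicted
    # by a model that was fitted on the remaining parts only
    for train_ix, holdout_ix in split(train, k):
        model = base_algorithm.fit(train[train_ix], target[train_ix])
        stacking_train_dataset[holdout_ix, i] = model.predict(train[holdout_ix])
    stacking_test_dataset[, i] = base_algorithm.fit(train, target).predict(test)

final_predictions = combiner_algorithm.fit(stacking_train_dataset, target).predict(stacking_test_dataset)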
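The split-based variant can be sketched in runnable Python as follows. It covers only the construction of the combiner's datasets, not the combiner itself. The helper names, the contiguous unshuffled folds and the toy base learner fit_mean are assumptions made for illustration; a real project would normally shuffle the rows and use proper models.

```python
def k_fold_indices(n_rows, k):
    """Split range(n_rows) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for fold in range(k):
        size = n_rows // k + (1 if fold < n_rows % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def out_of_fold_column(fit, rows, target, k):
    """Predict every training row with a model that never saw that row."""
    if k < 2:
        raise ValueError("k must be at least 2")
    column = [None] * len(rows)
    for holdout in k_fold_indices(len(rows), k):
        held = set(holdout)
        fit_rows = [r for i, r in enumerate(rows) if i not in held]
        fit_target = [t for i, t in enumerate(target) if i not in held]
        model = fit(fit_rows, fit_target)
        for i in holdout:
            column[i] = model(rows[i])
    return column

def stacking_datasets(fitters, rows, target, test_rows, k):
    """Build the combiner's training and test matrices, one column per base model."""
    train_columns = [out_of_fold_column(fit, rows, target, k) for fit in fitters]
    test_columns = []
    for fit in fitters:
        model = fit(rows, target)  # refit on all training rows for the test set
        test_columns.append([model(r) for r in test_rows])
    # Transpose so that each row is an instance and each column a base model.
    stacking_train = [list(col) for col in zip(*train_columns)]
    stacking_test = [list(col) for col in zip(*test_columns)]
    return stacking_train, stacking_test

def fit_mean(rows, target):
    """Hypothetical base learner: always predict the mean training target."""
    average = sum(target) / len(target)
    return lambda row: average

stacking_train, stacking_test = stacking_datasets(
    [fit_mean], rows=[[0], [1], [2], [3]], target=[1.0, 2.0, 3.0, 4.0],
    test_rows=[[9]], k=2)
```

In this toy run the first two training rows are predicted as 3.5 and the last two as 1.5, because each pair is predicted by a model that saw only the other pair. A model fitted on all four rows would have predicted 2.5 everywhere, which is what the test row receives.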
Bootstrap Aggregating
The name Bootstrap Aggregating, also known as “Bagging”, summarizes the key
elements of this strategy. In the bagging algorithm, the first step involves creating
multiple models. These models are generated using the same algorithm on
random sub-samples drawn from the original dataset with the bootstrap sampling
method, that is, sampling with replacement. In a bootstrap sample, some original
examples appear more than once and some are not present at all. If you want to
create a sub-dataset with m elements, you select a random element from the
original dataset m times. And if the goal is to generate n datasets, you repeat this
procedure n times.
At the end, we have n datasets where the number of elements in each dataset is m.
The following Python-esque pseudocode shows bootstrap sampling:
return final_predictions
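The sampling and aggregation steps described above can also be written as a small runnable Python sketch. The 1-nearest-neighbour base learner, the toy data and the fixed random seed are assumptions chosen to keep the example short; aggregation uses the most voted label, as in the voting section.

```python
import random
from collections import Counter

def bootstrap_sample(dataset, m, rng):
    """Draw m elements from dataset with replacement."""
    return [rng.choice(dataset) for _ in range(m)]

def fit_nearest_neighbour(pairs):
    """Hypothetical base learner: 1-nearest-neighbour on (x, label) pairs."""
    def predict(x):
        return min(pairs, key=lambda pair: abs(pair[0] - x))[1]
    return predict

def bagging(n, m, fit, train_pairs, test_xs, seed=0):
    """Fit n models on bootstrap samples of size m, then take the most voted label."""
    rng = random.Random(seed)
    models = [fit(bootstrap_sample(train_pairs, m, rng)) for _ in range(n)]
    final_predictions = []
    for x in test_xs:
        votes = Counter(model(x) for model in models)
        final_predictions.append(votes.most_common(1)[0][0])
    return final_predictions

train_pairs = [(0.0, "low"), (0.1, "low"), (0.2, "low"),
               (9.8, "high"), (9.9, "high"), (10.0, "high")]
final_predictions = bagging(n=25, m=6, fit=fit_nearest_neighbour,
                            train_pairs=train_pairs, test_xs=[0.05, 9.95])
```

Because every model sees a different resample, an individual model may miss one class entirely, but the vote across all n models smooths such accidents out.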
Some algorithms have the bagging strategy built in. For
example, the Random Forest algorithm uses the bagging technique with some
differences: it adds random feature selection, and its base algorithm
is a decision tree.
Boosting
The term “boosting” is used to describe a family of algorithms which are able to
convert weak models to strong models. A model is weak if it has a substantial
error rate but still performs better than random guessing (which would give an
error rate of 0.5 for binary classification). Boosting incrementally builds an ensemble
by training each model with the same dataset but where the weights of instances are
adjusted according to the error of the last prediction. The main idea is forcing the
models to focus on the instances which are hard. Unlike bagging, boosting is a
sequential method, so the models cannot be trained in parallel.
models = []
_train = random_select(train)
for i in range(n):  # n rounds
    model = base_algorithm.fit(_train)
    predictions = model.predict(_train)
    models.append(model)
    errors = calculate_error(predictions)
    _train = adjust_dataset(_train, errors)
The adjust_dataset function returns a new dataset containing the hardest instances,
which the base algorithm is then forced to learn from in the next round.
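The pseudocode above resamples the hardest instances. The other common formulation keeps every instance in the dataset and attaches a weight to it. AdaBoost is a well-known member of that family, and the following is a minimal sketch of it, assuming labels of +1 and -1, a single numeric feature, and decision stumps as the weak model. None of those choices come from the article.

```python
import math

def stump_predict(x, threshold, polarity):
    """A decision stump: predict polarity above the threshold, -polarity otherwise."""
    return polarity if x > threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    points = sorted(set(xs))
    thresholds = [points[0] - 1.0] + [(a + b) / 2 for a, b in zip(points, points[1:])]
    best = None
    for threshold in thresholds:
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Train `rounds` stumps sequentially, reweighting the instances after each one."""
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []  # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified instances gain weight, correctly classified ones lose it.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble, weights

def ensemble_predict(ensemble, x):
    """Combine the stumps by the sign of their alpha-weighted vote."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score > 0 else -1

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]   # no single stump separates this pattern
ensemble, final_weights = adaboost(xs, ys, rounds=3)
```

On this toy data the first stump misclassifies the two middle points, their weights double in the next round, and after three rounds the weighted vote classifies all six points correctly.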
Although the designer sets the reward policy, that is, the rules of the game, he gives
the model no hints or suggestions for how to solve the game. It’s up to the model to
figure out how to perform the task to maximize the reward, starting from totally
random trials and finishing with sophisticated tactics and superhuman skills. By
leveraging the power of search and many trials, reinforcement learning is currently
the most effective way to hint at a machine's creativity. In contrast to human beings,
artificial intelligence can gather experience from thousands of parallel gameplays if
a reinforcement learning algorithm is run on a sufficiently powerful computer
infrastructure.
car, on the other hand, we would emphasize speed much more than the
driver’s comfort. The programmer cannot predict everything that could
happen on the road. Instead of building lengthy “if-then” instructions, the
programmer prepares the reinforcement learning agent to be capable of
learning from the system of rewards and penalties. The agent (another
name for reinforcement learning algorithms performing the task) gets
rewards for reaching specific goals.
Scaling and tweaking the neural network controlling the agent is another challenge.
There is no way to communicate with the network other than through the system of
rewards and penalties. This in particular may lead to catastrophic forgetting, where
acquiring new knowledge causes some of the old to be erased from the network (to
read up on this issue, see this paper, published during the International Conference
on Machine Learning).
Yet another challenge is reaching a local optimum, that is, the agent performs the
task, but not in the optimal or required way. A “jumper” jumping like a
kangaroo instead of doing the thing that was expected of it (walking) is a great
example, and is also one that can be found in our recent blog post.
Finally, there are agents that will optimize the prize without performing the task they
were designed for. An interesting example is an OpenAI video in which
the agent learned to gain rewards, but not to complete the race.
Figure 1: Matrix Method and Problem
Although the ideas seem to differ, there is no sharp divide between these subtypes.
Moreover, they merge within projects, as the models are designed not to stick to a
“pure type” but to perform the task in the most effective way possible. So “what
precisely distinguishes machine learning, deep learning and reinforcement learning”
is actually a tricky question to answer.
Supervised machine learning happens when a programmer can provide a
label for every training input into the machine learning system.
From the AI point of view, a single model was performing a single task on a clarified
and normalized dataset.
Unsupervised learning takes place when the model is provided only with the input
data, but no explicit labels. It has to dig through the data and find the hidden
structure or relationships within. The designer might not know what the structure is
or what the machine learning model is going to find.
An example we employed was for churn prediction. We analyzed customer data and
designed an algorithm to group similar customers. However, we didn’t choose the
groups ourselves. Later on, we could identify high-risk groups (those with a high
churn rate) and our client knew which customers they should approach first.
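Grouping customers without predefined labels can be illustrated with k-means clustering. This is only a sketch: the source does not say which algorithm was used, so k-means, the single feature (monthly logins) and the starting centroids are all assumptions.

```python
def k_means_1d(values, centroids, iterations=10):
    """Lloyd's algorithm on scalar values, starting from the given centroids."""
    centroids = list(centroids)
    groups = [[] for _ in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for value in values:
            # Assign each value to the index of its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            groups[nearest].append(value)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids, groups

monthly_logins = [1, 2, 3, 20, 21, 22]  # hypothetical customer activity
centroids, groups = k_means_1d(monthly_logins, centroids=[0, 10])
```

The algorithm is never told which customers are at risk. It only finds that the data falls into a low-activity and a high-activity group, and the analyst then checks which group has the higher churn rate.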
Although deep learning solutions are able to provide marvelous results, in terms of
scale they are no match for the human brain. Each layer uses the outcome of a
previous one as an input and the whole network is trained as a single whole. The
core concept of creating an artificial neural network is not new, but only recently
has modern hardware provided enough computational power to effectively train
such networks by exposing them to a sufficient number of examples. Extended adoption has
brought about frameworks like TensorFlow, Keras and PyTorch, all of which have
made building machine learning models much more convenient.
endangered species and deepsense.ai’s work with the NOAA, read our blog post.
From a technical point of view, recognizing a particular whale from
aerial photos is pure deep learning. The solution consisted of a few machine learning
models performing separate tasks. The first one was in charge of finding the head
of the whale in the photograph while the second normalized the photo by cutting
and turning it, which ultimately provided a unified view (a passport photo) of a
single whale.
The third model was responsible for recognizing particular whales from photos that
had been prepared and processed earlier. A network composed of 5 million neurons
located the bowhead bonnet-tip. Over 941,000 neurons looked for the head and
more than 3 million neurons were used to classify the particular whale. That’s over
9 million neurons performing the task, which may seem like a lot, but pales in
comparison to the more than 100 billion neurons at work in the human brain. We
later used a similar deep learning-based solution to diagnose diabetic retinopathy
using images of patients’ retinas.
Reinforcement learning, as stated above, employs a system of rewards and penalties
to compel the computer to solve a problem by itself. Human involvement is limited
to changing the environment and tweaking the system of rewards and penalties. As
the computer maximizes the reward, it is prone to seeking unexpected ways of
doing it. Human involvement is focused on preventing it from exploiting the system
and motivating the machine to perform the task in the way expected.
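How an agent learns from rewards alone can be illustrated with tabular Q-learning on a toy problem. The corridor environment, the single reward of +1 at the right end, the hyperparameters and the purely random exploration are all assumptions of this sketch and are far simpler than the deep reinforcement learning systems discussed here.

```python
import random

def train_corridor_agent(n_states=5, episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning on a 1-D corridor; the only reward is +1 at the right end."""
    rng = random.Random(seed)
    actions = (-1, 1)  # step left or step right
    q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        while state != goal:
            action = rng.choice(actions)  # explore with purely random moves
            next_state = min(max(state + action, 0), goal)
            reward = 1.0 if next_state == goal else 0.0
            # The goal is terminal, so nothing can be earned after reaching it.
            best_next = 0.0 if next_state == goal else max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

def greedy_policy(q, n_states=5):
    """Pick the action with the highest learned value in every non-goal state."""
    return [max((-1, 1), key=lambda a: q[(s, a)]) for s in range(n_states - 1)]

q = train_corridor_agent()
policy = greedy_policy(q)
```

The agent is never told that moving right is correct. It discovers this because only paths that end at the goal are rewarded, and the discount factor makes shorter paths more valuable than longer ones.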
Reinforcement learning is useful when there is no “proper way” to perform a task,
yet there are rules the model has to follow to perform its duties correctly. Take the
road code, for example.
Example: By tweaking and seeking the optimal policy for deep reinforcement
learning, we built an agent that in just 20 minutes reached a superhuman level in
playing Atari games. Similar algorithms in principle can be used to build AI for an
autonomous car or a prosthetic leg. In fact, one of the best ways to evaluate the
reinforcement learning approach is to give the model an Atari video game to play,
such as Arkanoid or Space Invaders. According to Google Brain’s Marc G. Bellemare,
who introduced Atari video games as a reinforcement learning benchmark,
“although challenging, these environments remain simple enough that we can hope
to achieve measurable progress as we attempt to solve them”.
You are a data scientist at IDX Partners and are currently helping the finance team of
the biggest banking corporation in Indonesia with machine learning models. The finance
team faces challenges in accurately assessing the creditworthiness of its customers.
This process involves analyzing a large number of factors, including credit history,
income, liabilities, and more. A single predictive model may not be sufficient to
handle the complexity of this data and produce consistent predictions.
As a data scientist, you propose to use ensemble methods to handle this problem.
How would you implement ensemble methods step by step?
Solution:
○ Select diverse base models that capture different aspects of the data.
For example, in the case of Random Forest, the base models are
individual decision trees.
● Step 3: Build Base Models
○ Train each base model on a subset of the training data. In the case of
Random Forest, train multiple decision trees with random subsets of
features and observations.
● Step 4: Combine Base Models
○ For Random Forest:
○ Aggregate the predictions of each decision tree.
○ Use a majority vote for classification problems or average the
predictions for regression problems.
● Step 5: Evaluate Ensemble Model
○ Assess the performance of the ensemble model on the testing set
using appropriate metrics (accuracy, precision, recall, F1-score for
classification; mean squared error for regression).
● Step 6: Tune Hyperparameters if needed
○ Optimize hyperparameters such as the number of base models, the
depth of individual trees, and other relevant parameters to improve
the ensemble's performance.
● Step 7: Feature Importance Analysis
○ Examine feature importance scores generated by the ensemble.
○ Identify key features contributing to predictions.
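The metrics named in Step 5 can be computed with a few lines of plain Python. This is a sketch under stated assumptions: the label 1 is treated as the positive class (for instance "customer defaults"), the sample labels are hypothetical, and returning 0.0 when a ratio is undefined is a convention chosen here.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # hypothetical test labels
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]   # hypothetical ensemble predictions
metrics = classification_metrics(y_true, y_pred)
mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])
```

In a credit-scoring setting precision and recall usually matter more than accuracy, because defaults are rare and the two kinds of error have different costs.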
References
https://www.toptal.com/machine-learning/ensemble-methods-machine-learning#:~:text=Ensemble%20methods%20are%20techniques%20that,winning%20solutions%20used%20ensemble%20methods
https://deepsense.ai/what-is-reinforcement-learning-the-complete-guide/
https://towardsdatascience.com/various-ways-to-evaluate-a-machine-learning-models-performance-230449055f15
https://www.jeremyjordan.me/evaluating-a-machine-learning-model/
https://www.analyticsvidhya.com/blog/2019/08/11-important-model-evaluation-error-metrics/