Random Forest

Random Forest
Random Forest is an extension over bagging. This algorithm takes the random selection of features rather
than using all features to grow trees along with taking the random subset of data. Decision trees, being prone
to overfit, have been transformed to random forest by training many trees over various subsamples of the
data. Random forest works well on large datasets, but might be time consuming and can be parallelized
across multicore processor. Random forest works pretty well for classification problems.
Classification problems: use RF Algorithm as RandomForestClassifier() in Python
DATASET:
The objective is to identify each of a large number of black-and-white rectangular pixel displays as one of the
26 capital letters in the English alphabet. The character images were based on 20 different fonts and each
letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus
was converted into 16 primitive numerical attributes (statistical moments and edge counts) which were then
scaled to fit into a range of integer values from 0 through 15. We typically train on the first 16000 items and
then use the resulting model to predict the letter category for the remaining 4000. See the article cited above
for more details.
Attribute Information:
1. letter capital letter (26 values from A to Z)

2. x-box horizontal position of box (integer)
3. y-box vertical position of box (integer)
4. width width of box (integer)
5. high height of box (integer)
6. onpix total # on pixels (integer)
7. x-bar mean x of on pixels in box (integer)
8. y-bar mean y of on pixels in box (integer)
9. x2bar mean x variance (integer)
10. y2bar mean y variance (integer)
11. xybar mean x y correlation (integer)
12. x2ybr mean of x * x * y (integer)
13. xy2br mean of x * y * y (integer)
14. x-ege mean edge count left to right (integer)
15. xegvy correlation of x-ege with y (integer)
16. y-ege mean edge count bottom to top (integer)
17. yegvx correlation of y-ege with x (integer)
PYTHON CODE:
# importing libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import numpy
import pandas
# Reading dataset
letterdata = pandas.read_csv(‘ …….csv’)
print(“Dimension of dataset: “, letterdata.shape)
print(“Names of the variables:\n”, letterdata.columns)
#Using random seed for generating the same dataset

numpy.random.seed(3000)
training, test = train_test_split(letterdata, test_size=0.3)
x_trg = training.drop(‘letter’, axis=1)
y_trg = training[‘letter’]
x_test = test.drop(‘letter’, axis=1)
y_test = test[‘letter’]
#Creating a random forest model

print(“-----RANDOM FOREST MODEL-------“)
letter_forest = RandomForestClassifier()
letter_forest.fit(x_trg, y_trg)
print(“Accuracy of random forest model on training dataset:”, letter_forest.score(x_trg, y_trg))
forest_pred = letter_forest.predict(x_test)
forest_acc_score = accuracy_score(y_test, forest_pred)
print(“Accuracy of random forest model on test dataset:”, forest_acc_score)
#Creating a new random forest model with grid search

print(“-----RANDOM FOREST MODEL WITH BEST PARAMETERS-------“)
from sklearn.model_selection import GridSearchCV
x = RandomForestClassifier()
param_grid = {‘max_features’:[“auto”, “sqrt”, “log2”], ‘criterion’:[“gini”, “entropy”]}
letter_forest_grid = RandomForestClassifier()
letter_forest_CV = GridSearchCV(estimator = letter_forest_grid, param_grid = param_grid, cv =5)
letter_forest_result = letter_forest_CV.fit(x_trg, y_trg)
print(“Best Parameters:\n”, letter_forest_CV.best_params_)
#Creating the model with best scores

letter_forest_best = RandomForestClassifier(max_features =
letter_forest_result.best_params_[‘max_features’], criterion =
letter_forest_result.best_params_[‘criterion’])
#Evaluating the model considering best parameters

letter_forest_best.fit(x_trg, y_trg)
print(“Accuracy on training set with best parameters:”, letter_forest_best.score(x_trg, y_trg))
#Predicting the new model

letter_forest_pred2 = letter_forest_best.predict(x_test)
print(“Classification Report:\n”, classification_report(y_test, letter_forest_pred2))
#Accuracy of the new model

letter_forest_acc_score2 = accuracy_score(y_test, letter_forest_pred2)
print(“Accuracy of the new random forest model:”, letter_forest_acc_score2)

Random Forest

Uploaded by

Copyright:

Available Formats

Random Forest

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Random Forest

Uploaded by

Copyright:

Available Formats

Random Forest

Classification problems: use RF Algorithm as RandomForestClassifier() in Python

1. letter capital letter (26 values from A to Z)

#Using random seed for generating the same dataset

#Creating a random forest model

#Creating a new random forest model with grid search

#Creating the model with best scores

#Evaluating the model considering best parameters

#Predicting the new model

#Accuracy of the new model

You might also like