Commit f76ca87

Merge pull request animator#652 from Soubeer/main

Added Random Forest

2 parents 0595830 + 4df2cb9
2 files changed, +172 -0 lines


contrib/machine-learning/index.md (+1)
@@ -4,6 +4,7 @@
  - [Regression in Machine Learning](Regression.md)
  - [Confusion Matrix](confusion-matrix.md)
  - [Decision Tree Learning](Decision-Tree.md)
+ - [Random Forest](random-forest.md)
  - [Support Vector Machine Algorithm](support-vector-machine.md)
  - [Artificial Neural Network from the Ground Up](ArtificialNeuralNetwork.md)
  - [Introduction To Convolutional Neural Networks (CNNs)](intro-to-cnn.md)
contrib/machine-learning/random-forest.md (+171)

@@ -0,0 +1,171 @@

# Random Forest

Random Forest is a versatile machine learning algorithm capable of performing both regression and classification tasks. It is an ensemble method that constructs a multitude of decision trees during training and outputs the average prediction of the individual trees (for regression) or the mode of the predicted classes (for classification).

## Introduction

Random Forest is an ensemble learning method used for classification and regression tasks. It is built from multiple decision trees and combines their outputs to improve the model's accuracy and control overfitting.

## How Random Forest Works

### 1. Bootstrap Sampling
* Random subsets of the training dataset are created with replacement. Each subset is used to train an individual tree.

### 2. Decision Trees
* Multiple decision trees are trained on these subsets.

### 3. Feature Selection
* At each split in a decision tree, a random subset of features is considered. This randomness helps create diverse trees.

### 4. Voting/Averaging
* For classification, the final prediction is the mode of the classes predicted by the individual trees (majority vote).
* For regression, the final prediction is the average of the outputs of the individual trees.
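
A minimal sketch of the voting/averaging step (point 4), assuming we already have each tree's predictions for a handful of samples:

```python
import numpy as np

# Hypothetical predictions from 3 trees: classification (5 samples) and regression (3 samples)
tree_class_preds = np.array([[0, 1, 1, 2, 0],
                             [0, 1, 2, 2, 0],
                             [1, 1, 1, 2, 0]])   # shape: (n_trees, n_samples)
tree_reg_preds = np.array([[2.1, 3.0, 4.2],
                           [1.9, 3.4, 4.0],
                           [2.0, 2.8, 4.1]])

# Classification: majority vote (mode) across trees for each sample
majority_vote = np.apply_along_axis(lambda votes: np.bincount(votes).argmax(), 0, tree_class_preds)
print("Majority vote:", majority_vote)               # [0 1 1 2 0]

# Regression: average across trees for each sample
print("Average prediction:", tree_reg_preds.mean(axis=0))
```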

### Detailed Working Mechanism

#### Step 1: Bootstrap Sampling
Each tree is trained on a random sample of the original data, drawn with replacement (a bootstrap sample). This means some data points may appear multiple times in a sample while others may not appear at all.
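
A quick sketch of drawing a bootstrap sample (toy size chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_samples = 10

# Draw indices with replacement: some indices repeat, others never appear
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
left_out_idx = np.setdiff1d(np.arange(n_samples), bootstrap_idx)

print("Bootstrap sample indices:", bootstrap_idx)
print("Indices not drawn at all:", left_out_idx)
```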

#### Step 2: Tree Construction
Each node in the tree is split using the best split among a random subset of the features. This process adds an additional layer of randomness, contributing to the robustness of the model.

#### Step 3: Aggregation
For classification tasks, the final prediction is based on the majority vote from all the trees. For regression tasks, the final prediction is the average of all the tree predictions.
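
The three steps can be tied together in a small from-scratch bagging sketch (an illustration only, not how scikit-learn implements Random Forest internally; here `max_features="sqrt"` supplies the per-split feature randomness of Step 2):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(seed=42)
n_trees = 25

trees = []
for _ in range(n_trees):
    # Step 1: bootstrap sample of the training data
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: each tree considers a random subset of features at every split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X[idx], y[idx]))

# Step 3: aggregate the individual trees by majority vote
all_preds = np.stack([t.predict(X) for t in trees])   # shape: (n_trees, n_samples)
majority = np.apply_along_axis(lambda votes: np.bincount(votes).argmax(), 0, all_preds)
print("Accuracy of the hand-rolled forest on its own training data:", (majority == y).mean())
```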

### Advantages and Disadvantages

#### Advantages
* Robustness: Reduces overfitting and generalizes well, since averaging many trees stabilizes the predictions (a law-of-large-numbers effect).
* Accuracy: Often provides high accuracy because of the ensemble method.
* Versatility: Can be used for both classification and regression tasks.
* Handles Missing Values: Can tolerate missing data better than many other algorithms, depending on the implementation.
* Feature Importance: Provides estimates of feature importance, which can be valuable for understanding the model.

#### Disadvantages
* Complexity: More complex than an individual decision tree, making interpretation difficult.
* Computational Cost: Requires more computational resources because multiple trees are built.
* Training Time: Can be slow to train compared to simpler models, especially with large datasets.

### Hyperparameters

#### Key Hyperparameters
* n_estimators: The number of trees in the forest.
* max_features: The number of features to consider when looking for the best split.
* max_depth: The maximum depth of each tree.
* min_samples_split: The minimum number of samples required to split an internal node.
* min_samples_leaf: The minimum number of samples required to be at a leaf node.
* bootstrap: Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
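
As a small illustration (the values below are arbitrary placeholders, not recommendations), these hyperparameters map directly onto the constructor arguments of scikit-learn's `RandomForestClassifier`:

```python
from sklearn.ensemble import RandomForestClassifier

# Arbitrary example values, chosen only to show where each hyperparameter goes
model = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    max_features="sqrt",   # features considered when looking for the best split
    max_depth=10,          # maximum depth of each tree
    min_samples_split=4,   # minimum samples required to split an internal node
    min_samples_leaf=2,    # minimum samples required at a leaf node
    bootstrap=True,        # train each tree on a bootstrap sample
    random_state=42,
)
print(model)
```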

##### Tuning Hyperparameters
Hyperparameter tuning can significantly improve the performance of a Random Forest model. Common techniques include Grid Search and Random Search; both are shown in the code examples below.

### Code Examples

#### Classification Example
Below is a simple example of using Random Forest for a classification task with the Iris dataset.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Random Forest model
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print("Classification Report:\n", classification_report(y_test, y_pred))
```

#### Feature Importance
Random Forest provides a way to measure the importance of each feature in making predictions.

```python
import matplotlib.pyplot as plt

# Get feature importances from the trained classifier
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

# Print feature ranking
print("Feature ranking:")
for f in range(X.shape[1]):
    print(f"{f + 1}. Feature {indices[f]} ({importances[indices[f]]:.4f})")

# Plot the feature importances
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices], align='center')
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()
```

#### Hyperparameter Tuning
Using Grid Search for hyperparameter tuning.

```python
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
# (note: 'auto' is no longer a valid max_features value in recent scikit-learn releases)
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [4, 6, 8, 10, 12],
    'criterion': ['gini', 'entropy']
}

# Initialize the Grid Search model
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)

# Fit the model
grid_search.fit(X_train, y_train)

# Print the best parameters
print("Best parameters found: ", grid_search.best_params_)
```
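
Random Search, mentioned above alongside Grid Search, can be sketched with `RandomizedSearchCV`; the parameter ranges below are illustrative assumptions rather than tuned recommendations.

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Illustrative parameter distributions (assumed ranges, not tuned values)
param_dist = {
    'n_estimators': randint(100, 500),
    'max_features': ['sqrt', 'log2'],
    'max_depth': randint(3, 15),
    'min_samples_split': randint(2, 11),
}

# Sample a fixed number of random parameter combinations instead of the full grid
random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=3,
    n_jobs=-1,
    random_state=42,
)
random_search.fit(X_train, y_train)
print("Best parameters found: ", random_search.best_params_)
```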

#### Regression Example
Below is a simple example of using Random Forest for a regression task with the California housing dataset (the Boston housing dataset has been removed from recent versions of scikit-learn).

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Random Forest model
regr = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
regr.fit(X_train, y_train)

# Make predictions
y_pred = regr.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R^2 Score: {r2:.2f}")
```

## Conclusion
Random Forest is a powerful and flexible machine learning algorithm that can handle both classification and regression tasks. Its ability to create an ensemble of decision trees leads to robust and accurate models. However, it is important to be mindful of the computational cost associated with training multiple trees.

## References
* Scikit-learn Random Forest Documentation
* Wikipedia: Random Forest
* Machine Learning Mastery: Introduction to Random Forest
* Kaggle: Random Forest Guide
* Towards Data Science: Understanding Random Forests
