
Commit 9e1a20d

committed: second
1 parent a32635c commit 9e1a20d

File tree

1 file changed: +52 additions, -45 deletions


contrib/machine-learning/Random_Forest.md

Lines changed: 52 additions & 45 deletions
@@ -1,7 +1,8 @@
# Random Forest

Random Forest is a versatile machine learning algorithm capable of performing both regression and classification tasks. It is an ensemble method that operates by constructing a multitude of decision trees during training and outputting the average prediction of the individual trees (for regression) or the mode of the classes (for classification).

## Table of Contents

Introduction
How Random Forest Works
Advantages and Disadvantages
@@ -13,51 +14,54 @@ Hyperparameter Tuning
Regression Example
Conclusion
References

## Introduction

Random Forest is an ensemble learning method used for classification and regression tasks. It is built from multiple decision trees and combines their outputs to improve the model's accuracy and control over-fitting.

## How Random Forest Works

### 1. Bootstrap Sampling

* Random subsets of the training dataset are created with replacement. Each subset is used to train an individual tree.

### 2. Decision Trees

* Multiple decision trees are trained on these subsets.

### 3. Feature Selection

* At each split in the decision tree, a random selection of features is chosen. This randomness helps create diverse trees.

### 4. Voting/Averaging

* For classification, the mode of the classes predicted by the individual trees is taken (majority vote).
* For regression, the average of the outputs of the individual trees is taken.
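A minimal sketch of these two aggregation modes, using scikit-learn's RandomForestClassifier and RandomForestRegressor on arbitrary toy data (the data shapes and parameter values below are illustrative assumptions, not part of the original example):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.RandomState(42)
X = rng.rand(100, 4)                               # 100 samples, 4 features (toy data)
y_class = (X[:, 0] + X[:, 1] > 1).astype(int)      # binary labels for classification
y_reg = X @ np.array([1.0, 2.0, 0.5, 0.0])         # continuous target for regression

# Classification: each tree votes for a class; the forest returns the majority vote.
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y_class)
print(clf.predict(X[:3]))

# Regression: each tree predicts a value; the forest returns the average.
reg = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y_reg)
print(reg.predict(X[:3]))
```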
### Detailed Working Mechanism

#### Step 1: Bootstrap Sampling

Each tree is trained on a random sample of the original data, drawn with replacement (a bootstrap sample). This means some data points may appear multiple times in a sample while others may not appear at all.

#### Step 2: Tree Construction

Each node in the tree is split using the best split among a random subset of the features. This process adds an additional layer of randomness, contributing to the robustness of the model.

#### Step 3: Aggregation

For classification tasks, the final prediction is based on the majority vote from all the trees. For regression tasks, the final prediction is the average of all the tree predictions.
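To make the three steps concrete, here is a small from-scratch sketch of the bagging-and-vote mechanism (a simplification: per-split feature selection is delegated to each tree's max_features setting, and integer class labels are assumed):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_simple_forest(X, y, n_trees=25, random_state=0):
    """Train n_trees decision trees, each on its own bootstrap sample (X, y as NumPy arrays)."""
    rng = np.random.RandomState(random_state)
    trees = []
    for _ in range(n_trees):
        idx = rng.randint(0, len(X), size=len(X))           # Step 1: sample rows with replacement
        tree = DecisionTreeClassifier(max_features="sqrt",   # Step 2: random feature subset per split
                                      random_state=rng)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_simple_forest(trees, X):
    """Step 3: aggregate by majority vote across the trees (assumes integer labels)."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)   # shape: (n_trees, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```

For regression, the same structure would use DecisionTreeRegressor and replace the majority vote with a mean over the stacked predictions.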
## Advantages and Disadvantages

### Advantages

* Robustness: Reduces overfitting and generalizes well due to the law of large numbers.
* Accuracy: Often provides high accuracy because of the ensemble method.
* Versatility: Can be used for both classification and regression tasks.
* Handles Missing Values: Can handle missing data better than many other algorithms.
* Feature Importance: Provides estimates of feature importance, which can be valuable for understanding the model.

### Disadvantages

* Complexity: More complex than individual decision trees, making interpretation difficult.
* Computational Cost: Requires more computational resources due to multiple trees.
* Training Time: Can be slow to train compared to simpler models, especially with large datasets.

## Hyperparameters

### Key Hyperparameters

* n_estimators: The number of trees in the forest.
* max_features: The number of features to consider when looking for the best split.
* max_depth: The maximum depth of each tree.
* min_samples_split: The minimum number of samples required to split an internal node.
* min_samples_leaf: The minimum number of samples required to be at a leaf node.
* bootstrap: Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
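As a quick orientation, these hyperparameters map directly onto scikit-learn's RandomForestClassifier constructor; the values below are illustrative placeholders, not recommendations:

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative settings only; good values depend on the dataset.
model = RandomForestClassifier(
    n_estimators=200,       # number of trees in the forest
    max_features="sqrt",    # features considered when looking for the best split
    max_depth=None,         # no depth cap: grow each tree until leaves are pure
    min_samples_split=2,    # minimum samples required to split an internal node
    min_samples_leaf=1,     # minimum samples required at a leaf node
    bootstrap=True,         # train each tree on a bootstrap sample
    random_state=42,
)
```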
### Tuning Hyperparameters

Hyperparameter tuning can significantly improve the performance of a Random Forest model. Common techniques include Grid Search and Random Search.
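A Grid Search example appears in the code section below; for Random Search, a hedged sketch with RandomizedSearchCV might look like the following (the search ranges are assumptions chosen for illustration):

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Illustrative search space; the ranges are assumptions, not tuned recommendations.
param_distributions = {
    "n_estimators": randint(100, 500),
    "max_depth": [None, 5, 10, 20],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": randint(2, 11),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=20,      # number of random configurations to evaluate
    cv=5,           # 5-fold cross-validation
    random_state=42,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```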

## Code Examples

### Classification Example

Below is a simple example of using Random Forest for a classification task with the Iris dataset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
```
@@ -85,11 +89,14 @@ y_pred = clf.predict(X_test)
```python
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print("Classification Report:\n", classification_report(y_test, y_pred))
```

### Feature Importance

Random Forest provides a way to measure the importance of each feature in making predictions.

```python
import matplotlib.pyplot as plt

# Get feature importances
```
@@ -108,11 +115,11 @@ plt.bar(range(X.shape[1]), importances[indices], align='center')
```python
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()
```

### Hyperparameter Tuning

Using Grid Search for hyperparameter tuning.

```python
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
```
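The diff ends at the grid definition. A self-contained sketch of how such a Grid Search is typically completed (the parameter values are illustrative assumptions, not the file's actual grid) might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Illustrative grid; the values here are assumptions, not the committed file's grid.
param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 5, 10],
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,           # 5-fold cross-validation
    n_jobs=-1,      # use all available cores
)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Test accuracy:", grid_search.score(X_test, y_test))
```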
