Decision Tree Using Sci-Kit Learn
1. Data Analyst
2. Data Scientist
3. Data Engineer
4. Business Analyst
Data is typically stored in an RDBMS and queried with SQL.
Python provides different libraries for ML tasks.
Common algorithms: Linear Regression, Logistic Regression, etc.
Two kinds of ML:
1. Supervised ML Algo: labeled data / target variable is present.
a. Regression – target variable is continuous. Algorithm example: Linear Regression.
b. Classification – target variable is a discrete class. Algorithm example: Logistic Regression.
2. Unsupervised ML Algo: no labeled data / target variable is present.
Linear Regression: find the best-fit line with minimized errors.
MLR (Multiple Linear Regression): y = m1*x1 + m2*x2 + … + c
Calculate m and c to get y_predicted.
RSS (Residual Sum of Squares) = Σ(y − y_pred)²
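As a small sketch of these ideas (the data below is synthetic and purely illustrative), m and c can be fitted with scikit-learn's LinearRegression and the RSS computed directly:

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, for illustration only
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Fit y = m*x + c
lin_reg = LinearRegression()
lin_reg.fit(X, y)
print("m:", lin_reg.coef_[0], "c:", lin_reg.intercept_)

# RSS = sum of squared differences between y and y_pred
y_pred = lin_reg.predict(X)
print("RSS:", np.sum((y - y_pred) ** 2))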
Assumptions of Linear Regression:
Evaluation Metrics:
Logistic Regression (sigmoid): y = 1 / (1 + e^(-(m*x + c)))
Log-odds (logit): log(p / (1 − p)) = m*x + c
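A quick sketch of the sigmoid and a scikit-learn LogisticRegression fit on made-up data (the values are only illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Sigmoid maps m*x + c to a probability between 0 and 1
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary-classification data, for illustration only
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

log_reg = LogisticRegression()
log_reg.fit(X, y)

m, c = log_reg.coef_[0][0], log_reg.intercept_[0]
print("P(y=1 | x=3.5):", sigmoid(m * 3.5 + c))                    # manual sigmoid
print("predict_proba :", log_reg.predict_proba([[3.5]])[0][1])    # sklearn probability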
Information Gain
Information gain is used to decide which feature to split on at each step when building the tree. Simplicity is
best, so we want to keep the tree small. To do so, at each step we should choose the split that results in
the purest child nodes. A commonly used measure of purity is called information. For each node of
the tree, the information value measures how much information a feature gives us about the class. The
split with the highest information gain is taken as the first split, and the process continues
until all child nodes are pure or until the information gain is 0.
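As a small illustration of this idea (the split below is made up), entropy and information gain for a candidate split can be computed like this:

import numpy as np

def entropy(labels):
    # Shannon entropy of a label array, in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Parent entropy minus the weighted entropy of the child nodes
    n = len(parent)
    child_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child_entropy

# Hypothetical split: a mixed parent node divided into two purer child nodes
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left = np.array([0, 0, 0, 1])
right = np.array([0, 1, 1, 1])
print("Information gain:", information_gain(parent, left, right))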
Hyperparameters
max_depth: int or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all
leaves contain less than min_samples_split samples.
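A short sketch of how max_depth (together with a couple of other common hyperparameters) would be passed to the classifier; the values below are arbitrary examples, not recommendations:

from sklearn.tree import DecisionTreeClassifier

# Example hyperparameter values chosen only for illustration
dt_clf = DecisionTreeClassifier(
    criterion="entropy",    # split using information gain (entropy)
    max_depth=4,            # limit tree depth to reduce overfitting
    min_samples_split=10,   # a node needs at least 10 samples to be split
    random_state=42,        # reproducible results
)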
Download the CSV file and load it into the Jupyter environment.
Data Dictionary:
1. Let’s import the data:
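Assuming the downloaded file is named diabetes.csv (the Kaggle diabetes dataset from the references; the filename is an assumption), loading it with pandas might look like this:

import pandas as pd

# "diabetes.csv" is the assumed filename of the downloaded dataset
df = pd.read_csv("diabetes.csv")
print(df.shape)
print(df.head())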
2. Feature Selection:
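A hedged sketch of the steps leading up to the accuracy check below: splitting features from the target, holding out a test set, fitting a default decision tree, and predicting. The target column name "Outcome" follows the Kaggle diabetes dataset and is an assumption about the file loaded above:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# "Outcome" is the assumed target column of the Kaggle diabetes dataset
X = df.drop(columns=["Outcome"])   # feature columns
y = df["Outcome"]                  # target: 1 = diabetic, 0 = non-diabetic

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Fit a decision tree with default hyperparameters and predict on the test set
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)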
print("Accuracy:",metrics.accuracy_score(y_test, pred))
We have achieved an accuracy of about 70%, which can be improved by tuning some parameters.
6. Let’s visualize the decision tree:
First, let's fix the depth of the decision tree classifier.
# Creating Decision Tree classifier object
dt_clf = DecisionTreeClassifier(max_depth=3)
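To draw the tree, one option is scikit-learn's plot_tree; the sketch below refits the depth-limited classifier on the training split assumed earlier and plots it:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Refit the depth-limited tree (uses X_train/y_train from the earlier sketch)
dt_clf.fit(X_train, y_train)

plt.figure(figsize=(16, 8))
plot_tree(
    dt_clf,
    feature_names=X.columns,   # feature names from the assumed dataframe
    class_names=["0", "1"],    # non-diabetic / diabetic
    filled=True,
)
plt.show()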
Cons:
Decision trees carry a high risk of overfitting.
Compared to other machine learning techniques, they often have lower prediction accuracy.
With categorical variables, information gain is biased towards attributes with more categories.
When there are many class labels, the calculations can become complicated.
Summary
Decision trees are easy to comprehend and use, and they work well with large datasets. There
are three main aspects to decision trees: decision nodes, chance nodes (which denote
probability), and end nodes (which denote conclusions). Trees can also be pruned to avoid
overfitting if needed.
Despite their many advantages, decision trees are not appropriate for all forms of data, such as
imbalanced datasets or continuous variables.
References
Jupyter Book Online
Kaggle - Diabetes dataset
Official documentation of Decision Tree