Machine Learning
Power Ahead
SHREYA PRAKASH
MARCH ‘22
DATE – 03/08/2022
Table of Contents
INTRODUCTION
Q1 Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations, outliers and missing values treatment (if necessary) and check the basic descriptive statistics of the dataset.
Q2 Split the data into train and test in the ratio 70:30. Is scaling necessary or not?
Q3 Build the following models on the 70% training data and check the performance of these models on the Training as well as the 30% Test data using the various inferences from the Confusion Matrix and plotting an AUC-ROC curve along with the AUC values. Tune the models wherever required for optimum performance:
a. Logistic Regression Model
b. Linear Discriminant Analysis
c. Decision Tree Classifier – CART model
d. Naïve Bayes Model
e. KNN Model
f. Random Forest Model
g. Boosting Classifier Model using Gradient Boost
Part 2:
A dataset of Shark Tank episodes is made available. It contains 495 entrepreneurs making their pitch to the VC sharks. You will ONLY use the “Description” column for the initial text mining exercise.
Q1 Pick out the Deal (Dependent Variable) and Description columns into a separate data frame.
Q2 Create two corpora, one with those who secured a Deal, the other with those who did not secure a deal.
Q3 For both corpora: find the number of characters, remove stop words, find the top 3 most frequently occurring words, and plot the word cloud.
Q4 Refer to both the word clouds. What do you infer?
Q5 Looking at the word clouds, is it true that the entrepreneurs who introduced devices are less likely to secure a deal based on your analysis?
List of Figures
Scatter Plot
Pairplot
Correlation Heatmap
Boxplot
Scree Plot
List of Tables
Dataset Sample
Contingency Table
Problem 1
INTRODUCTION
Here we are analysing which mode of transport is chosen by the employees of ABC Company. The decision is based on parameters like age, salary and work experience in the dataset ‘Transport.csv’. In this project we build several Machine Learning models and compare them so that we can pick the best model.
SAMPLE OF DATASET
Checking duplicate values
Question 1
Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations, outliers and missing values treatment (if necessary) and check the basic descriptive statistics of the dataset.
Answer:
As checked with the info function, there are 9 attributes in the dataset: 2 of float, 5 of int and 2 of object data type.
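A minimal sketch of this inspection step, assuming a pandas workflow; the small data frame below is an illustrative stand-in for Transport.csv, not the actual data:

```python
import pandas as pd

# Illustrative stand-in for Transport.csv (hypothetical values); the real
# dataset has 9 attributes: 2 float, 5 int and 2 object columns.
df = pd.DataFrame({
    "Age": [27, 31, 24],
    "Gender": ["Male", "Female", "Male"],
    "Salary": [13.5, 14.8, 9.6],
    "license": [0, 1, 0],
    "Transport": ["Public Transport", "Private Transport", "Public Transport"],
})

df.info()                      # dtypes and non-null counts per column
print(df.describe())           # basic descriptive statistics
print(df.duplicated().sum())   # count of duplicate rows
print(df.isnull().sum())       # missing values per column
```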
We can see that the lowest mean is that of license.
Checking Outliers
As checked, there are outliers. Now we treat the outliers.
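One common way to treat outliers is IQR capping; a small sketch with a hypothetical skewed column (the report does not state which capping rule was used, so the 1.5 × IQR whiskers here are an assumption):

```python
import pandas as pd

# Hypothetical skewed column standing in for e.g. Salary
s = pd.Series([8, 9, 10, 11, 12, 13, 14, 57])

# Standard IQR rule: cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
treated = s.clip(lower, upper)

print(treated.max())  # the outlier 57 is capped at the upper whisker
```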
Univariate Analysis
Most of the employees are between roughly 23 and 28 years of age, and the male population is larger than the female population.
The number of Engineers is also greater than the number of Non-Engineers, and the number of MBAs is smaller than the number of non-MBAs.
Work Experience ranges from 0 to a maximum of 8 years. Salary mostly lies between 8 lakhs and 17 to 18 lakhs.
Distance follows a normal distribution. There are more employees with a license than without one.
Boxplot
Bivariate Analysis
Age, Work Exp and Salary are positively correlated.
Question 2
Split the data into train and test in the ratio 70:30. Is scaling necessary or not?
Yes, scaling is necessary: there is a large difference in the ranges of the different variables, so to bring them onto one scale we need to scale the data.
Now we need to split the data 70:30. Here Transport is the dependent variable, so we split the data accordingly.
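The scaling and 70:30 split described above can be sketched as follows, using synthetic stand-in features rather than the actual Transport columns; fitting the scaler on the training data only avoids leaking test information:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features on very different scales (e.g. Age vs Salary)
X = np.column_stack([rng.uniform(20, 40, 100), rng.uniform(6, 24, 100) * 1e5])
y = rng.integers(0, 2, 100)  # stand-in for the Transport target

# 70:30 split, stratified on the dependent variable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

# Fit the scaler on the training set only, then transform both sets
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.mean(axis=0).round(6))  # ~0 for each scaled feature
```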
Question 3
Build the following models on the 70% training data and check the performance of these models on the Training as well as the 30% Test data using the various inferences from the Confusion Matrix and plotting an AUC-ROC curve along with the AUC values. Tune the models wherever required for optimum performance:
a. Logistic Regression Model
b. Linear Discriminant Analysis
c. Decision Tree Classifier – CART model
d. Naïve Bayes Model
e. KNN Model
f. Random Forest Model
g. Boosting Classifier Model using Gradient Boost
Answer
Here we are building different models to find the best accuracy among all the models on the training and test sets.
Logistic regression
We can see the accuracy is 0.7611; through the classification report we can also see an accuracy of 0.76.
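A hedged sketch of this step with scikit-learn, using synthetic data in place of the Transport features (the reported 0.76 accuracy applies to the actual dataset, not this toy example):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled Transport features
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

lr = LogisticRegression().fit(X_tr, y_tr)
acc = accuracy_score(y_te, lr.predict(X_te))
print(round(acc, 4))                              # test-set accuracy
print(classification_report(y_te, lr.predict(X_te)))
```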
We can also check the coefficient of variation.
Here we are using Linear Discriminant Analysis to find the accuracy. We can see the accuracy is almost the same as that of logistic regression.
Here we can see the accuracy on the training set is 76% and on the test set is 78%.
The accuracy through the Decision Tree Classifier (DTC) is 0.73. We can check the same through the classification report.
The accuracy on the training and test data is 1.0 and 0.73 respectively, so there is a chance of overfitting.
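The overfitting pattern above (training accuracy 1.0, lower test accuracy) can be reproduced and mitigated like this; the max_depth and min_samples_leaf values below are illustrative, not the tuned values from the report:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

# An unconstrained tree memorises the training set (accuracy 1.0)
full = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
print(full.score(X_tr, y_tr))  # 1.0

# Limiting depth / leaf size is one common way to reduce overfitting
pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                                random_state=42).fit(X_tr, y_tr)
print(round(pruned.score(X_te, y_te), 2))
```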
GaussianNB
The accuracy on the test and training sets is 0.76 and 0.81 respectively.
KNeighborsClassifier
Now, through the confusion matrix, we can check the true positives, false positives and true negatives.
Next we look at the classification reports of the training and test data: the accuracy on the training set is 1.0 and on the test data it is 0.79.
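A sketch of the KNN step on synthetic stand-in data; unpacking the confusion matrix with ravel() gives the true/false positives and negatives discussed above (n_neighbors=5 is scikit-learn's default, not necessarily the tuned value):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

# KNN is distance-based, so it should be fit on scaled features
scaler = StandardScaler().fit(X_tr)
knn = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_tr), y_tr)

cm = confusion_matrix(y_te, knn.predict(scaler.transform(X_te)))
tn, fp, fn, tp = cm.ravel()  # true/false negatives and positives
print(cm)
```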
Gradient Boosting
The accuracy from the classification reports on the training and test sets is 0.95 and 0.73 respectively.
Confusion Matrix of GB
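A sketch of the gradient boosting step on synthetic stand-in data; the scores printed here will differ from the report's 0.95/0.73:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

gb = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
print(round(gb.score(X_tr, y_tr), 2), round(gb.score(X_te, y_te), 2))
print(confusion_matrix(y_te, gb.predict(X_te)))  # confusion matrix of GB
```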
AUC ROC Curve
We can find the AUC-ROC score for all the models.
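Computing the AUC-ROC score needs predicted probabilities rather than class labels; a sketch with logistic regression standing in for any of the models, again on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

model = LogisticRegression().fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]  # probability of the positive class

auc = roc_auc_score(y_te, probs)
fpr, tpr, _ = roc_curve(y_te, probs)  # points for plotting the ROC curve
print(round(auc, 3))
```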
Question 4
Answer:
The models perform differently on the training and test sets. The performance on both sets can be changed by changing the features used; we should know how to extract features and which features to use together to obtain the maximum business benefit.
Here we can see that GNB has the best accuracy, followed by RFC and then the KNN model.
Question 5
Answer:
The business insight is that Age, Work Exp and Salary contribute the most, so these parameters should be monitored.
Work should also be done to encourage non-MBA employees, who are currently using private transport, to choose the new transport medium.
Part 2
A dataset of Shark Tank episodes is made available. It contains 495 entrepreneurs making their pitch to the VC sharks. You will ONLY use the “Description” column for the initial text mining exercise.
Question 1
Pick out the Deal (Dependent Variable) and Description columns into a separate data frame.
Data description
Removing Null values
There are no duplicate rows in the data frame.
Here we make a new data frame named data and place the deal and description columns into it.
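A sketch of this column selection, using a tiny hypothetical frame in place of the Shark Tank file (the real column names may be cased differently):

```python
import pandas as pd

# Illustrative stand-in for the Shark Tank dataset; the real file has
# many more columns, of which only deal and description are kept here.
shark = pd.DataFrame({
    "deal": [True, False, True],
    "description": ["Bluetooth device tag", "Premium water bottle", "Kids toy design"],
    "episode": [1, 1, 2],
})

data = shark[["deal", "description"]].copy()  # the new two-column frame
print(data.shape)
```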
Question 2
Create two corpora, one with those who secured a Deal, the other with those who did not secure a deal.
Here we create 2 corpora: one for pitches that secured a deal and one for those that did not.
First we make two data frames, df_true for those who secured a deal and df_false for those who did not. We group the data frame on the deal column: rows whose deal value is True go into df_true, and rows whose deal value is False go into df_false.
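The grouping described above can be done with boolean masks; a minimal sketch on a toy frame:

```python
import pandas as pd

data = pd.DataFrame({
    "deal": [True, False, True, False],
    "description": ["a", "b", "c", "d"],
})

# Boolean masks split the frame on the deal column
df_true = data[data["deal"]]     # secured a deal
df_false = data[~data["deal"]]   # did not secure a deal
print(len(df_true), len(df_false))
```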
After making the corpora:
After making the corpora we can see there are 251 rows in df_true and 244 rows in df_false.
The data frame df_true, for pitches that secured a deal, has True for every value of the deal column.
Similarly, the data frame df_false, for pitches that did not secure a deal, has False for every value of the deal column.
Question 3
The following exercise is to be done for both the corpora:
a) Find the number of characters for both the corpora.
b) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’ and ‘company’ are to be removed)
c) What were the top 3 most frequently occurring words in both corpora (after removing stop words)?
d) Plot the Word Cloud for both the corpora.
Answer
Here we first drop the deal column from both the df_true and df_false data frames and calculate the total number of characters.
df_false after dropping the deal column.
Now, using the sum function, we can calculate the total number of characters in each corpus.
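A sketch of the character count, assuming the descriptions are held in a pandas column; str.len() gives the per-row character count and sum() the corpus total:

```python
import pandas as pd

# Toy stand-in for one corpus of descriptions
df_true = pd.DataFrame({"description": ["easy online design", "free kids offer"]})

# Characters per description, then the total for the corpus
char_counts = df_true["description"].str.len()
total_chars = char_counts.sum()
print(total_chars)
```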
b) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’ and
‘company’ are to be removed)
Answer
Here we remove stop words from both corpora, the one that cracked the deal and the one that did not.
First we remove stop words from df_true, the pitches that cracked a deal, using the description column. Below is the data frame after removing the stop words.
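A sketch of the stop-word removal; the stop list below is a small illustrative subset (the report's version would combine a standard English stop-word list, e.g. NLTK's, with the extra words named in the question):

```python
import pandas as pd

# Illustrative stop list: in the report the standard English list is
# extended with 'also', 'made', 'makes', 'like', 'this', 'even', 'company'.
stop_words = {"a", "the", "is", "to", "and",
              "also", "made", "makes", "like", "this", "even", "company"}

def remove_stop_words(text):
    # Lowercase, split on whitespace, drop stop words, rejoin
    return " ".join(w for w in text.lower().split() if w not in stop_words)

df_true = pd.DataFrame({"description": ["This company made a device like this"]})
df_true["description"] = df_true["description"].apply(remove_stop_words)
print(df_true["description"].iloc[0])  # -> "device"
```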
Next we check df_false, the pitches that did not crack a deal, and remove stop words from its description column as well.
c) What were the top 3 most frequently occurring words in both corpuses (after removing stop
words)?
Answer
Here we check the 3 most frequently used words in each corpus after removing stop words.
We use the nltk.FreqDist() function to count word frequencies in both corpora. After finding the frequencies we use freq.most_common(3) to find the 3 most common words in each of the new data frames, nsw_false and nsw_true.
The first row shows the 3 most common words in nsw_false and the second row shows those in nsw_true.
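nltk.FreqDist exposes the same counting interface as the standard library's collections.Counter; a self-contained sketch of the frequency count on a toy corpus string:

```python
from collections import Counter

# Toy stand-in for one corpus after stop-word removal
nsw_true = "easy design free design online easy design"

# nltk.FreqDist(words) offers the same most_common() interface
freq = Counter(nsw_true.split())
print(freq.most_common(3))
```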
Answer
Word Cloud is a data visualization technique used for representing text data, where the size of each word indicates its frequency or importance.
Here we plot word clouds for both corpora, the one that cracked the deal and the one that did not. First we check the word cloud for the pitches that cracked the deal.
We combine all the words from the pitches that cracked the deal using the join function, use the WordCloud function to visualize the data, and use the generate function to build the cloud of words.
Word cloud of the true corpus
We combine all the words from the pitches that did not crack the deal using the join function on the data frame, then use the WordCloud function and the generate function in the same way.
Question 4
Refer to both the word clouds. What do you infer?
Answer
The word cloud for pitches that secured a deal contains words like 'one', 'design', 'free', 'children', 'offer', 'easy', 'online' and 'use'. These words indicate that pitches aimed at catering to the liking of children, pitches that provided offers or a free sample/product, and products that are easy to use, well designed and creative are more likely to secure a deal.
The word cloud for pitches that did not secure a deal contains words such as 'one', 'designed', 'help', 'device', 'bottle', 'premium' and 'use'. These words indicate that products with a mediocre design, a poor fit to the problem they claim to solve, products involving water bottles, a higher or premium price, and low usability have less chance of securing a deal.
It is observed that words such as 'one', 'designed', 'system' and 'use' carry a high weight in both word clouds. This indicates that either these were not deciding factors in whether a deal was cracked, or they were used in a different context in the descriptions in each case.
Question 5
Looking at the word clouds, is it true that the entrepreneurs who introduced devices are less likely to
secure a deal based on your analysis?
Answer
The word 'device' is hard to find in the word cloud for secured deals, while it appears prominently in the word cloud for deals that were not secured. This indicates that the word 'device' occurred frequently when a deal was rejected or not made, which supports the statement given in the question.