Machine Learning
Power Ahead
SHREYA PRAKASH
MARCH ‘22
DATE – 03/08/2022
Table of Contents
INTRODUCTION
Q1 Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations, outliers and missing values treatment (if necessary) and check the basic descriptive statistics of the dataset.
Q2 Split the data into train and test in the ratio 70:30. Is scaling necessary or not?
Q3 Build the following models on the 70% training data and check the performance of these models on the Training as well as the 30% Test data using the various inferences from the Confusion Matrix and plotting an AUC-ROC curve along with the AUC values. Tune the models wherever required for optimum performance:
a. Logistic Regression Model
b. Linear Discriminant Analysis
c. Decision Tree Classifier – CART model
d. Naïve Bayes Model
e. KNN Model
f. Random Forest Model
g. Boosting Classifier Model using Gradient Boost
Part 2:
A dataset of Shark Tank episodes is made available. It contains 495 entrepreneurs making their pitch to the VC sharks. You will ONLY use the “Description” column for the initial text mining exercise.
Q1 Pick out the Deal (Dependent Variable) and Description columns into a separate data frame.
Q2 Create two corpora, one with those who secured a Deal, the other with those who did not secure a deal.
Q3 For both corpora: find the number of characters, remove stop words, find the top 3 most frequently occurring words, and plot the word cloud.
Q4 Refer to both the word clouds. What do you infer?
Q5 Looking at the word clouds, is it true that the entrepreneurs who introduced devices are less likely to secure a deal based on your analysis?
List of Figures
Scatter Plot
Pairplot
Correlation Heatmap
Boxplot
Scree Plot
List of Tables
Dataset Sample
Contingency Table
Problem 1
INTRODUCTION
Here we are analysing which mode of transport is chosen by the employees of ABC Company. The decision is based on parameters like age, salary and work experience in the dataset ‘Transport.csv’. In this project we build several Machine Learning models and compare them so that we can pick the best model.
SAMPLE OF DATASET
Checking duplicate values
Question 1
Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations, outliers and missing values treatment (if necessary) and check the basic descriptive statistics of the dataset.
Answer:
As checked with the info function, there are 9 attributes in the dataset: 2 of float, 5 of int and 2 of object data type.
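A minimal sketch of this inspection step, assuming a pandas workflow; the small data frame below is an illustrative stand-in for Transport.csv, not the actual data:

```python
import pandas as pd

# Illustrative stand-in for Transport.csv (hypothetical values); the real
# dataset has 9 attributes: 2 float, 5 int and 2 object columns.
df = pd.DataFrame({
    "Age": [27, 31, 24],
    "Gender": ["Male", "Female", "Male"],
    "Salary": [13.5, 14.8, 9.6],
    "license": [0, 1, 0],
    "Transport": ["Public Transport", "Private Transport", "Public Transport"],
})

df.info()                      # dtypes and non-null counts per column
print(df.describe())           # basic descriptive statistics
print(df.duplicated().sum())   # count of duplicate rows
print(df.isnull().sum())       # missing values per column
```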
We can see that the lowest mean is that of license.
Checking Outliers
As checked, there are outliers. Now we treat the outliers.
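One common way to treat outliers is IQR capping; a small sketch with a hypothetical skewed column (the report does not state which capping rule was used, so the 1.5 × IQR whiskers here are an assumption):

```python
import pandas as pd

# Hypothetical skewed column standing in for e.g. Salary
s = pd.Series([8, 9, 10, 11, 12, 13, 14, 57])

# Standard IQR rule: cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
treated = s.clip(lower, upper)

print(treated.max())  # the outlier 57 is capped at the upper whisker
```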
Univariate Analysis
Most of the employees are between roughly 23 and 28 years of age, and the male population is larger than the female population.
The number of Engineers is also greater than the number of Non-Engineers, and the number of MBAs is smaller than the number of non-MBAs.
Work Experience ranges from 0 to a maximum of 8 years. Salary mostly lies between 8 lakhs and 17 to 18 lakhs.
Distance follows a normal distribution. There are more employees with a license than without one.
Boxplot
Bivariate Analysis
Age, Work Exp and Salary are positively correlated.
Question 2
Split the data into train and test in the ratio 70:30. Is scaling necessary or not?
Yes, scaling is necessary: there is a large difference in the ranges of the different variables, so to bring them onto one scale we need to scale the data.
Now we need to split the data 70:30. Here Transport is the dependent variable, so we split the data accordingly.
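The scaling and 70:30 split described above can be sketched as follows, using synthetic stand-in features rather than the actual Transport columns; fitting the scaler on the training data only avoids leaking test information:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features on very different scales (e.g. Age vs Salary)
X = np.column_stack([rng.uniform(20, 40, 100), rng.uniform(6, 24, 100) * 1e5])
y = rng.integers(0, 2, 100)  # stand-in for the Transport target

# 70:30 split, stratified on the dependent variable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

# Fit the scaler on the training set only, then transform both sets
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.mean(axis=0).round(6))  # ~0 for each scaled feature
```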
Question 3
Build the following models on the 70% training data and check the performance of these models on the Training as well as the 30% Test data using the various inferences from the Confusion Matrix and plotting an AUC-ROC curve along with the AUC values. Tune the models wherever required for optimum performance:
a. Logistic Regression Model
b. Linear Discriminant Analysis
c. Decision Tree Classifier – CART model
d. Naïve Bayes Model
e. KNN Model
f. Random Forest Model
g. Boosting Classifier Model using Gradient Boost
Answer
Here we are building different models to find the best accuracy among all the models on the training and test sets.
Logistic regression
We can see the accuracy is 0.7611; through the classification report we can also see an accuracy of 0.76.
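A hedged sketch of this step with scikit-learn, using synthetic data in place of the Transport features (the reported 0.76 accuracy applies to the actual dataset, not this toy example):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled Transport features
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

lr = LogisticRegression().fit(X_tr, y_tr)
acc = accuracy_score(y_te, lr.predict(X_te))
print(round(acc, 4))                              # test-set accuracy
print(classification_report(y_te, lr.predict(X_te)))
```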
We can also check the coefficient of variation.
Here we are using Linear Discriminant Analysis to find the accuracy. We can see the accuracy is almost the same as that of logistic regression.
Here we can see the accuracy on the training set is 76% and on the test set is 78%.
The accuracy through the Decision Tree Classifier (DTC) is 0.73. We can check the same through the classification report.
The accuracy on the training and test data is 1.0 and 0.73 respectively, so there is a chance of overfitting.
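The overfitting pattern above (training accuracy 1.0, lower test accuracy) can be reproduced and mitigated like this; the max_depth and min_samples_leaf values below are illustrative, not the tuned values from the report:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

# An unconstrained tree memorises the training set (accuracy 1.0)
full = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
print(full.score(X_tr, y_tr))  # 1.0

# Limiting depth / leaf size is one common way to reduce overfitting
pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                                random_state=42).fit(X_tr, y_tr)
print(round(pruned.score(X_te, y_te), 2))
```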
GaussianNB
The accuracy on the test and training sets is 0.76 and 0.81 respectively.
KNeighborsClassifier
Now, through the confusion matrix, we can check the true positives, false positives and true negatives.
Next we look at the classification reports of the training and test data: the accuracy on the training set is 1.0 and on the test data it is 0.79.
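A sketch of the KNN step on synthetic stand-in data; unpacking the confusion matrix with ravel() gives the true/false positives and negatives discussed above (n_neighbors=5 is scikit-learn's default, not necessarily the tuned value):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

# KNN is distance-based, so it should be fit on scaled features
scaler = StandardScaler().fit(X_tr)
knn = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_tr), y_tr)

cm = confusion_matrix(y_te, knn.predict(scaler.transform(X_te)))
tn, fp, fn, tp = cm.ravel()  # true/false negatives and positives
print(cm)
```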
Gradient Boosting
The accuracy from the classification reports on the training and test sets is 0.95 and 0.73 respectively.
Confusion Matrix of GB
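A sketch of the gradient boosting step on synthetic stand-in data; the scores printed here will differ from the report's 0.95/0.73:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

gb = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
print(round(gb.score(X_tr, y_tr), 2), round(gb.score(X_te, y_te), 2))
print(confusion_matrix(y_te, gb.predict(X_te)))  # confusion matrix of GB
```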
AUC ROC Curve
We can find the AUC-ROC score for all the models.
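Computing the AUC-ROC score needs predicted probabilities rather than class labels; a sketch with logistic regression standing in for any of the models, again on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

model = LogisticRegression().fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]  # probability of the positive class

auc = roc_auc_score(y_te, probs)
fpr, tpr, _ = roc_curve(y_te, probs)  # points for plotting the ROC curve
print(round(auc, 3))
```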
Question 4
Answer:
The models perform differently on the training and test sets. The performance on both sets can be changed by changing the features used; we should know how to extract features and which features to use together to obtain the maximum business benefit.
Here we can see that GNB has the best accuracy, followed by RFC and then the KNN model.
Question 5
Answer:
The business insight is that Age, Work Exp and Salary contribute the most, so these parameters should be monitored.
Work should also be done to encourage non-MBA employees, who are currently using private transport, to choose the new transport medium.
Part 2
A dataset of Shark Tank episodes is made available. It contains 495 entrepreneurs making their pitch to the VC sharks. You will ONLY use the “Description” column for the initial text mining exercise.
Question 1
Pick out the Deal (Dependent Variable) and Description columns into a separate data frame.
Data description
Removing Null values
There are no duplicate rows in the data frame.
Here we make a new data frame named data and place the deal and description columns into it.
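A sketch of this column selection, using a tiny hypothetical frame in place of the Shark Tank file (the real column names may be cased differently):

```python
import pandas as pd

# Illustrative stand-in for the Shark Tank dataset; the real file has
# many more columns, of which only deal and description are kept here.
shark = pd.DataFrame({
    "deal": [True, False, True],
    "description": ["Bluetooth device tag", "Premium water bottle", "Kids toy design"],
    "episode": [1, 1, 2],
})

data = shark[["deal", "description"]].copy()  # the new two-column frame
print(data.shape)
```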
Question 2
Create two corpora, one with those who secured a Deal, the other with those who did not secure a deal.
Here we create 2 corpora: one for pitches that secured a deal and one for those that did not.
First we make two data frames, df_true for those who secured a deal and df_false for those who did not. We group the data frame on the deal column: rows whose deal value is True go into df_true, and rows whose deal value is False go into df_false.
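The grouping described above can be done with boolean masks; a minimal sketch on a toy frame:

```python
import pandas as pd

data = pd.DataFrame({
    "deal": [True, False, True, False],
    "description": ["a", "b", "c", "d"],
})

# Boolean masks split the frame on the deal column
df_true = data[data["deal"]]     # secured a deal
df_false = data[~data["deal"]]   # did not secure a deal
print(len(df_true), len(df_false))
```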
After making the corpora:
After making the corpora we can see there are 251 rows in df_true and 244 rows in df_false.
The data frame df_true, for pitches that secured a deal, has True for every value of the deal column.
Similarly, the data frame df_false, for pitches that did not secure a deal, has False for every value of the deal column.
Question 3
The following exercise is to be done for both the corpora:
a) Find the number of characters for both the corpora.
b) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’ and ‘company’ are to be removed)
c) What were the top 3 most frequently occurring words in both corpora (after removing stop words)?
d) Plot the Word Cloud for both the corpora.
Answer
Here we first drop the deal column from both the df_true and df_false data frames and calculate the total number of characters.
df_false after dropping the deal column.
Now, using the sum function, we can calculate the total number of characters in each corpus.
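A sketch of the character count, assuming the descriptions are held in a pandas column; str.len() gives the per-row character count and sum() the corpus total:

```python
import pandas as pd

# Toy stand-in for one corpus of descriptions
df_true = pd.DataFrame({"description": ["easy online design", "free kids offer"]})

# Characters per description, then the total for the corpus
char_counts = df_true["description"].str.len()
total_chars = char_counts.sum()
print(total_chars)
```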
b) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’ and
‘company’ are to be removed)
Answer
Here we remove stop words from both corpora, the one that cracked the deal and the one that did not.
First we remove stop words from df_true, the pitches that cracked a deal, using the description column. Below is the data frame after removing the stop words.
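A sketch of the stop-word removal; the stop list below is a small illustrative subset (the report's version would combine a standard English stop-word list, e.g. NLTK's, with the extra words named in the question):

```python
import pandas as pd

# Illustrative stop list: in the report the standard English list is
# extended with 'also', 'made', 'makes', 'like', 'this', 'even', 'company'.
stop_words = {"a", "the", "is", "to", "and",
              "also", "made", "makes", "like", "this", "even", "company"}

def remove_stop_words(text):
    # Lowercase, split on whitespace, drop stop words, rejoin
    return " ".join(w for w in text.lower().split() if w not in stop_words)

df_true = pd.DataFrame({"description": ["This company made a device like this"]})
df_true["description"] = df_true["description"].apply(remove_stop_words)
print(df_true["description"].iloc[0])  # -> "device"
```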
Next we check df_false, the pitches that did not crack a deal, and remove stop words from its description column as well.
c) What were the top 3 most frequently occurring words in both corpuses (after removing stop
words)?
Answer
Here we check the 3 most frequently used words in each corpus after removing stop words.
We use the nltk.FreqDist() function to count word frequencies in both corpora. After finding the frequencies we use freq.most_common(3) to find the 3 most common words in each of the new data frames, nsw_false and nsw_true.
The first row shows the 3 most common words in nsw_false and the second row shows those in nsw_true.
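nltk.FreqDist exposes the same counting interface as the standard library's collections.Counter; a self-contained sketch of the frequency count on a toy corpus string:

```python
from collections import Counter

# Toy stand-in for one corpus after stop-word removal
nsw_true = "easy design free design online easy design"

# nltk.FreqDist(words) offers the same most_common() interface
freq = Counter(nsw_true.split())
print(freq.most_common(3))
```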
Answer
Word Cloud is a data visualization technique used for representing text data, where the size of each word indicates its frequency or importance.
Here we plot word clouds for both corpora, the one that cracked the deal and the one that did not. First we check the word cloud for the pitches that cracked the deal.
We combine all the words from the pitches that cracked the deal using the join function, use the WordCloud function to visualize the data, and use the generate function to build the cloud of words.
Word cloud of the true corpus
We combine all the words from the pitches that did not crack the deal using the join function on the data frame, then use the WordCloud function and the generate function in the same way.
Question 4
Refer to both the word clouds. What do you infer?
Answer
The word cloud for pitches that secured a deal contains words like 'one', 'design', 'free', 'children', 'offer', 'easy', 'online' and 'use'. These words indicate that pitches aimed at catering to the liking of children, pitches that provided offers or a free sample/product, and products that are easy to use, well designed and creative are more likely to secure a deal.
The word cloud for pitches that did not secure a deal contains words such as 'one', 'designed', 'help', 'device', 'bottle', 'premium' and 'use'. These words indicate that products with a mediocre design, a poor fit to the problem they claim to solve, products involving water bottles, a higher or premium price, and low usability have less chance of securing a deal.
It is observed that words such as 'one', 'designed', 'system' and 'use' carry a high weight in both word clouds. This indicates that either these were not deciding factors in whether a deal was cracked, or they were used in a different context in the descriptions in each case.
Question 5
Looking at the word clouds, is it true that the entrepreneurs who introduced devices are less likely to
secure a deal based on your analysis?
Answer
The word 'device' is hard to find in the word cloud for secured deals, while it appears prominently in the word cloud for deals that were not secured. This indicates that the word 'device' occurred frequently when a deal was rejected or not made, which supports the statement given in the question.