Answer Report (Predictive Modelling)

The document discusses two problems involving predicting prices and employee behavior using datasets. For Problem 1, the goal is to predict stone prices using a dataset of 27,000 cubic zirconia to help the company distinguish more and less profitable stones. For Problem 2, the goal is to predict whether employees will opt into a package using data on 872 employees. The document then describes the datasets and variables for each problem.

Uploaded by

Shweta Lakhera

Executive Summary

In Problem 1, we are provided with a dataset containing the prices and other
attributes of almost 27,000 cubic zirconia stones. We are required to help the
company predict the price of a stone on the basis of the details given in the
dataset, so that it can distinguish between more profitable and less
profitable stones and improve its profit share.

In Problem 2, we are provided with details of 872 employees of a company.
Among these employees, some opted for the package and some didn't. We are
required to help the company predict whether an employee will opt for the
package on the basis of the information given in the dataset. We also
need to identify the factors the company should use to decide which
employees to target when selling its packages.

Introduction
The purpose of this exercise is to explore the two datasets. We will use
exploratory data analysis (EDA), linear regression, linear discriminant analysis (LDA)
and logistic regression to arrive at the required solutions.

Data Description

• Of Problem 1

1. Carat: Carat weight of the cubic zirconia.
2. Cut: Describes the cut quality of the cubic zirconia. Quality in increasing order: Fair,
Good, Very Good, Premium, Ideal.
3. Color: Colour of the cubic zirconia, with D being the best and J the worst.
4. Clarity: Clarity refers to the absence of inclusions and blemishes. (In order
from worst to best in terms of average price) IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1
5. Depth: The height of the cubic zirconia, measured from the culet to the table, divided
by its average girdle diameter.
6. Table: The width of the cubic zirconia's table expressed as a percentage of its
average diameter.
7. Price: The price of the cubic zirconia.
8. X: Length of the cubic zirconia in mm.
9. Y: Width of the cubic zirconia in mm.
10. Z: Height of the cubic zirconia in mm.
• Of Problem 2
1. Holiday_Package: Opted for the holiday package (yes/no)
2. Salary: Employee salary
3. Age: Age in years
4. educ: Years of formal education
5. no_young_children: Number of young children (younger than 7 years)
6. no_older_children: Number of older children
7. foreign: Foreigner (yes/no)

Problem Statement 1:
You are hired by a company, Gem Stones Co Ltd, which is a cubic zirconia manufacturer.
You are provided with a dataset containing the prices and other attributes of almost
27,000 cubic zirconia (an inexpensive diamond alternative with many of the same
qualities as a diamond). The company earns different profits in different price slots.
You have to help the company predict the price of a stone on the basis of the
details given in the dataset, so that it can distinguish between more profitable and
less profitable stones and improve its profit share. Also, provide them with the
5 most important attributes.

Q1. Read the data and do exploratory data analysis. Describe the
data briefly. (Check the null values, Data types, shape, EDA,
duplicate values). Perform Univariate and Bivariate Analysis.

Answer:

Table 1: Summary of Dataset 1


Table 2: Description of Dataset

Summary Of Dataset:
The dataset contains 26,967 rows and 11 columns. There
are 2 integer-type features, 6 float-type features and 3 object-type features. Price is the target
variable and the others are predictor variables. The first column, “Unnamed: 0”,
is only a serial-number index, so we can remove it.
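
As a rough illustration of this first inspection step, the sketch below builds a tiny stand-in frame (the values are made up, and only a few of the columns from the data description are included), prints its shape and data types, and drops the serial-number column.

```python
import pandas as pd

# Tiny stand-in for the real file: column names follow the data description,
# the values are invented for illustration only.
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "carat": [0.30, 0.33, 0.90],
    "cut": ["Ideal", "Premium", "Very Good"],
    "price": [499, 984, 6289],
})

print(df.shape)   # (rows, columns)
print(df.dtypes)  # data type of every column

# The serial-number column carries no information, so drop it.
df = df.drop(columns=["Unnamed: 0"])
```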

Exploratory Data Analysis:

Step 1: We will first check whether there are any duplicates in our dataset. As the
output shows, there are 34 duplicates. Therefore, before proceeding, we
remove the duplicates from our dataset.

Figure 1: Treatment of Duplicates
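
A minimal sketch of the duplicate check, assuming pandas and a small made-up frame in place of the real data:

```python
import pandas as pd

# Made-up rows; the second row repeats the first one exactly.
df = pd.DataFrame({
    "carat": [0.30, 0.30, 0.90, 1.10],
    "cut": ["Ideal", "Ideal", "Premium", "Good"],
    "price": [499, 499, 6289, 5000],
})

# Count fully duplicated rows, then keep only the first copy of each.
n_dupes = int(df.duplicated().sum())
df_clean = df.drop_duplicates().reset_index(drop=True)

print(f"duplicates found: {n_dupes}, rows kept: {len(df_clean)}")
```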

Step 2: We will then check for any missing values in our dataset. As the
output shows, there are 697 missing values. Therefore, before proceeding, we will
impute the median in place of the missing values.
Figure 2: Treatment of Missing Values
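
The median imputation described in Step 2 could look roughly like this. The frame and its values are illustrative assumptions, not the real data:

```python
import numpy as np
import pandas as pd

# Illustrative values with two missing 'depth' entries.
df = pd.DataFrame({
    "depth": [61.0, np.nan, 62.5, 60.0, np.nan],
    "table": [55.0, 57.0, 58.0, 56.0, 59.0],
})

# Missing values per column before treatment.
missing_before = df.isnull().sum()

# Replace missing 'depth' values with the median of the observed ones.
depth_median = df["depth"].median()
df["depth"] = df["depth"].fillna(depth_median)

print(missing_before)
print(df)
```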

Step 3: We will then use the boxplot method to detect and visualize the
outliers in our dataset. After that, we will treat the outliers
present in our dataset.
Figure 3: Data Visualization of the Presence of Outliers
After detecting the outliers, we will treat the outliers and remove them from our dataset.
Figure 4: Data Visualization of the Treatment of Outliers
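
The report does not state which treatment was applied. One common option, assumed here, is to cap values at the boxplot whiskers (1.5 times the interquartile range beyond the quartiles). The helper name `cap_outliers_iqr` and the sample values are hypothetical.

```python
import pandas as pd


def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot whiskers."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Made-up prices with one extreme value.
prices = pd.Series([500, 600, 700, 800, 900, 20000], dtype=float)
capped = cap_outliers_iqr(prices)

print(capped.tolist())
```

Capping keeps every row, whereas dropping the flagged rows would shrink the dataset; which of the two is preferable depends on how much data can be spared.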

Step 4: We will analyze the data using Univariate Analysis.


Figure 4: Data Visualization of the Univariate Analysis

Inference: After performing univariate analysis, we noticed that the distributions of some
quantitative features, such as ‘carat’, and of the target feature ‘price’ are heavily right-skewed.

Step 5: We will then analyze the data using Bi-Variate analysis.


• It involves the analysis of two variables to determine the empirical
relationship between them.
• We observed that most features correlate with the price of the diamond. The
notable exception is ‘depth’, which has a negligible correlation with price (<1%).
Figure 5: Data Visualization of the Bi-Variate Analysis

Figure 6: Heat Map Depicting Correlation Between Attributes
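
The numbers behind such a heat map come from a correlation matrix. A small sketch, assuming pandas and made-up values in which ‘carat’ tracks ‘price’ closely and ‘depth’ does not:

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 1.5, 2.0],
    "depth": [61.5, 62.0, 61.0, 62.5, 61.2, 61.8],
    "price": [400, 1200, 2500, 5000, 9000, 15000],
})

# Pearson correlation of every numeric feature with the target.
corr = df.corr()
price_corr = corr["price"].drop("price").sort_values(ascending=False)

print(price_corr)
```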


Figure 7: EDA on Categorical Columns

Inferences from Exploratory Data Analysis:

1) ‘Price’ is the target variable and all other features are
predictors. The dataset contains 26,967 rows and 11 columns. There are 2 integer-type
features, 6 float-type features and 3 object-type features. The first column,
“Unnamed: 0”, is only a serial-number index, so we can remove it.
2) The mean and median values in the dataset do not differ much. We observed that the
minimum values of ‘x’, ‘y’ and ‘z’ are zero. These are faulty values, since
dimensionless or two-dimensional diamonds are not possible. Hence, we
have to filter out those rows as clearly faulty data entries. Furthermore, there are
three object-type features: ‘cut’, ‘colour’ and ‘clarity’.
3) There are 697 missing values in the ‘depth’ column. There are also some
duplicate rows, which make up nearly 0.12% of the total data. We have
dropped the duplicate rows.
4) There is a significant number of outliers in some variables, so we have
treated them. We also observed that the distributions of some quantitative
features, such as ‘carat’, and of the target feature ‘price’ are heavily right-skewed.
5) Overall, most features correlate with the price of the
diamond. The notable exception is ‘depth’, whose correlation with price is negligible.
Furthermore, diamonds with the Premium cut are the most expensive, followed by
those with the Very Good cut.
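
The filter for zero dimensions mentioned in point 2 can be sketched as follows, assuming pandas and a few made-up rows:

```python
import pandas as pd

# Made-up rows; two of them have an impossible zero dimension.
df = pd.DataFrame({
    "x": [4.3, 0.0, 6.1],
    "y": [4.3, 5.2, 0.0],
    "z": [2.7, 3.1, 3.8],
    "price": [499, 3000, 6289],
})

# Flag rows where any of the three dimensions is exactly zero.
zero_mask = (df[["x", "y", "z"]] == 0).any(axis=1)
df_valid = df.loc[~zero_mask].reset_index(drop=True)

print(f"rows with a zero dimension: {int(zero_mask.sum())}")
```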
Q2. Impute null values if present, also check for the values which are
equal to zero. Do they have any meaning or do we need to change
them or drop them? Check for the possibility of combining the sub
levels of the ordinal variables and take actions accordingly. Explain
why you are combining these sub levels with appropriate reasoning.

Answer:
• We start by checking the dataset for null values. Our analysis shows
that there are 697 null values in the ‘depth’ column.
• The median was then computed for each attribute so that it can be used
to replace the null values present in the dataset.
• We then replace the null values with the median.

Figure 8: Listing of Null Values and Replacing them with Median values

Further Inference:
Scaling is not required in this scenario. For regression techniques, scaling is generally
recommended because it helps gradient descent converge quickly and reach the global minimum. When
the number of features becomes large, scaling helps the model run quickly; without it, the
starting point can be very far from the minimum. We have fitted the
model without scaling for now.
Q3. Encode the data (having string values) for Modelling. Split the
data into train and test (70:30). Apply Linear regression using scikit
learn. Perform checks for significant variables using appropriate
method from statsmodel. Create multiple models and check the
performance of Predictions on Train and Test sets using Rsquare,
RMSE & Adj Rsquare. Compare these models and select the best one
with appropriate reasoning.

Answer: Inference from Train-Test Split

• We copy all the predictor variables into the X data frame and the target variable into the y data frame.
We then split X and y into a training set and a test set.
• For this, we used the scikit-learn package and split X and y in a 70:30 ratio. We then invoke
the linear regression function and find the best-fit model on the training data.
• The intercept is the expected mean value of y when all predictors equal 0. If the predictors
can never all be zero, the intercept has no intrinsic meaning.
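
The steps above can be sketched as follows. This is only an outline under stated assumptions: the data are synthetic, ‘cut’ is encoded with an assumed ordinal mapping, and the helper `report` is a hypothetical name for computing R-square, RMSE and adjusted R-square.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: price depends mostly on carat.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, n),
    "cut": rng.choice(["Fair", "Good", "Very Good", "Premium", "Ideal"], n),
})
df["price"] = 4000 * df["carat"] + rng.normal(0, 300, n)

# Encode the ordinal string column as integers (assumed quality order).
cut_order = {"Fair": 0, "Good": 1, "Very Good": 2, "Premium": 3, "Ideal": 4}
df["cut"] = df["cut"].map(cut_order)

X = df.drop(columns="price")
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LinearRegression().fit(X_train, y_train)


def report(X_part, y_part):
    """Return (R-square, RMSE, adjusted R-square) for one data split."""
    pred = model.predict(X_part)
    r2 = r2_score(y_part, pred)
    rmse = float(np.sqrt(mean_squared_error(y_part, pred)))
    n_obs, n_feat = X_part.shape
    adj_r2 = 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_feat - 1)
    return r2, rmse, adj_r2


train_scores = report(X_train, y_train)
test_scores = report(X_test, y_test)
print("train:", train_scores)
print("test: ", test_scores)
```

Similar train and test scores are what the report calls a right-fit model; a large gap between them would point to overfitting.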

Figure 9: Regression Plot Between Predicted y and Actual

• The scatter plot shows a linear relationship with a very strong correlation
between the predicted y and the actual y.
• There is also a lot of spread, which indicates some unexplained variance in
the output.
• As the training and test scores are closely in line, we can conclude that this model is a
right-fit model, i.e. it neither overfits nor underfits.
Inference from Linear Regression Using Statsmodels
• Under the null hypothesis H0 that all regression coefficients are 0, we observed that the
overall p-value is less than alpha, so we reject H0 and accept Ha: at least one regression
coefficient is not 0. Here, none of the regression coefficients is 0.
• We can also conclude that the attributes with a p-value greater than 0.05 are
poor predictors of price.

Figure 10: Regression Results

Q4. Based on these predictions, what are the business insights and recommendations?
Answer: We observed a very strong correlation between the predicted y and the actual y.
However, there is a lot of spread, which indicates some noise in the dataset, i.e.
unexplained variance in the output.
Inference from Linear Regression Performance Metrics:
We observed that the training and test scores are closely in line. Therefore, we can
conclude that this model is a right-fit model.

Inference from Multi Collinearity:


From our analysis, we observed there is strong multi collinearity in our data set.
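
The report does not say how multicollinearity was measured. One common diagnostic, assumed here, is the variance inflation factor (VIF), where VIF = 1 / (1 - R²) and R² comes from regressing one predictor on all the others. The helper `vif` and the synthetic columns are hypothetical.

```python
import numpy as np


def vif(X):
    """Variance inflation factor of each column of X."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)


# Synthetic example: 'y' is almost a copy of 'x', 'depth' is unrelated.
rng = np.random.default_rng(0)
x = rng.uniform(4, 8, 200)
y = x + rng.normal(0, 0.05, 200)
depth = rng.normal(61.7, 1.4, 200)

vifs = vif(np.column_stack([x, y, depth]))
print(dict(zip(["x", "y", "depth"], vifs.round(1))))
```

A VIF well above 10 is usually read as strong multicollinearity, while a value near 1 means the predictor is nearly independent of the others.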

Inference from Statsmodels:


From our analysis, we concluded the following:
• The five most important attributes for predicting the price are ‘carat’, ‘cut’, ‘color’,
‘clarity’ and width, i.e. ‘y’.
• When ‘carat’, ‘cut’, ‘colour’, ‘y’ or ‘clarity’ increases by 1 unit, the diamond’s price also
increases, keeping all other predictors constant.
• There are also some negative coefficients. For example, ‘x’, i.e. the length of the cubic
zirconia in mm, has a negative coefficient with a p-value less than 0.05. Hence, we can
conclude that longer stones are less profitable.

Recommendations To Business:
• The Gem Stones company should treat ‘carat’, ‘cut’, ‘colour’, ‘clarity’ and width,
i.e. ‘y’, as the most important features for predicting price, in order to distinguish between
more profitable and less profitable stones and improve its profit share.
• The model shows that the greater the width (‘y’) of the stone, the higher the price.
Hence, stones with greater width should be counted among the more profitable stones.
• Diamonds with the ‘Premium’ cut are the most expensive, followed by the ‘Very Good’ cut;
these should be counted among the more profitable stones.
• Diamonds with clarity ‘VS1’ and ‘VS2’ are the most expensive. Hence, these two
categories should also be counted among the more profitable stones.
• For ‘x’, the greater the length of the stone, the lower the price. Likewise, the greater the
‘z’, the lower the price. This is because if a diamond’s height is too large, the diamond becomes
‘dark’ in appearance, as it no longer returns an attractive amount of light.
• Stones with higher ‘z’ are therefore also less profitable.

Problem Statement 2:
You are hired by a tour and travel agency which deals in selling holiday packages. You are
provided details of 872 employees of a company. Among these employees, some opted for the
package and some didn't. You have to help the company in predicting whether an employee will
opt for the package or not on the basis of the information given in the data set. Also, find out the
important factors on the basis of which the company will focus on particular employees to sell
their packages.

Q1. Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and
Bivariate Analysis. Do exploratory data analysis.
Answer:

Figure 11: Summary of Problem 2

Shape of dataset:

Figure 12: Shape of Problem 2

Info of dataset:
• We have no null values in the dataset.
• We have variables of integer and object data types.

Figure 13: Info of Problem 2


Description of dataset

Figure 13: Description of Problem 2


• The data that we have are of integer and continuous types. Here, the holiday package is
our target variable.

Null Values In Our dataset

Figure 14: Null Values In Problem 2

Unique Values For Categorical Variables


Figure 15: Unique Values For Categorical Variables

Univariate Analysis
Figure 16: Univariate Analysis

Inference:
• Most of the distributions are right-skewed, except for educ.
• The salary distribution has the largest number of outliers.
• There are some outliers in educ, no_young_children and no_older_children.

Categorical Univariate Analysis

Figure 16: Categorical Univariate Analysis


Inference:
• Most of the employees are not foreigners.
• Slightly fewer employees opted for the holiday package than did not.

Bivariate Analysis Data Distribution

Figure 17: Bivariate Analysis Data Distribution

Inference:
• There is hardly any correlation between the variables, and the data seem to be roughly normally
distributed. The distributions do not differ much between employees who opted for the holiday
package and those who did not.

Checking For Correlation


Figure 18: Check Correlation

Inference:
There is hardly any correlation between the predictors, so multicollinearity is not a concern.

Q2. Do not scale the data. Encode the data (having string values) for
Modelling. Data Split: Split the data into train and test (70:30). Apply
Logistic Regression and LDA (linear discriminant analysis).

Answer:
• We used one-hot encoding to create dummy variables; in the output, all the
values shown for foreign_yes are 0.
• The logistic regression model gives better results when the string variables are encoded.

We have then split the data in 70:30 ratio. Below is the output:
Figure 19: Data Splitting
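
A sketch of the encoding and the 70:30 split, assuming pandas and scikit-learn. The ten rows are made up, and the column names `Holiday_Package` and `foreign` follow the data description rather than the actual file header, which may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows; both string columns contain 'yes' and 'no'.
df = pd.DataFrame({
    "Holiday_Package": ["yes", "no", "no", "yes", "no",
                        "yes", "no", "yes", "no", "no"],
    "Salary": [48000, 37000, 58000, 66000, 41000,
               35000, 52000, 44000, 90000, 61000],
    "age": [30, 45, 46, 31, 44, 42, 51, 35, 57, 49],
    "foreign": ["no", "yes", "no", "no", "no",
                "yes", "no", "no", "no", "yes"],
})

# One-hot encode the string columns; drop_first keeps only the '_yes' dummies.
encoded = pd.get_dummies(
    df, columns=["Holiday_Package", "foreign"], drop_first=True
).astype(int)

X = encoded.drop(columns="Holiday_Package_yes")
y = encoded["Holiday_Package_yes"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)
```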

Logistic Regression

• We will then proceed with prediction on the training set.
• After that, we will get the probabilities on the test set.

Figure 20: Logistic Regression
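
The two steps can be sketched with scikit-learn as below. The data are synthetic (younger employees are made more likely to opt in), so the numbers carry no meaning beyond showing the calls.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the employee data.
rng = np.random.default_rng(0)
n = 400
age = rng.integers(20, 63, n)
educ = rng.integers(1, 22, n)
X = np.column_stack([age, educ])
p_opt_in = 1 / (1 + np.exp(0.1 * (age - 40)))
y = (rng.uniform(size=n) < p_opt_in).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Class predictions on the training set.
train_pred = clf.predict(X_train)

# Probabilities on the test set: column 0 is P(no), column 1 is P(yes).
test_proba = clf.predict_proba(X_test)
print(test_proba[:5])
```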

Linear Discriminant Analysis

We will build LDA Model. Below is the output:


Probability Prediction

Figure 21: Linear Discriminant Analysis
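
An equivalent sketch for LDA, again on synthetic data and assuming scikit-learn's `LinearDiscriminantAnalysis`:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, generated the same way as for logistic regression.
rng = np.random.default_rng(0)
n = 400
age = rng.integers(20, 63, n)
educ = rng.integers(1, 22, n)
X = np.column_stack([age, educ])
p_opt_in = 1 / (1 + np.exp(0.1 * (age - 40)))
y = (rng.uniform(size=n) < p_opt_in).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# Probability of class 1 (opted for the package) for each test row.
lda_proba = lda.predict_proba(X_test)[:, 1]
print(lda_proba[:5])
```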

Q3. Performance Metrics: Check the performance of Predictions on
Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve
and get ROC_AUC score for each model. Final Model: Compare both
the models and write an inference on which model is best/optimized.
Answer:

Performance Metrics For Logistic Regression

Confusion Matrix On the Training Data


Figure 22: Confusion Matrix on Training Data

Inference:
We observed that, for class 1, precision is 0.63, recall is 0.45, accuracy is 0.63 and
F1 score is 0.63.

Confusion Matrix On the Test Data

Figure 23: Confusion Matrix on Test Data


Inference:
We observed that, for class 1, precision is 0.69, recall is 0.45, accuracy is 0.66 and
F1 score is 0.55.
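
For reference, the sketch below shows how these metrics are derived from a confusion matrix, using scikit-learn and ten made-up labels (not the report's predictions):

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # 2 * P * R / (P + R)

print(cm)
print(accuracy, precision, recall, f1)
```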

AUC and ROC for the training data

Accuracy – Training Data:

AUC and ROC for the testing data
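
A minimal sketch of how the ROC curve points and the AUC score can be obtained, assuming scikit-learn and four made-up predicted probabilities:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# False-positive rate and true-positive rate at each candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Area under the ROC curve: 1.0 is perfect, 0.5 is no better than chance.
auc = roc_auc_score(y_true, scores)

print(list(zip(fpr, tpr)))
print(auc)
```

Running the same calls on the training and test probabilities of each model gives the figures to compare.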


Changing the Cut-off Value to Find the Optimal Value that
Gives Better Accuracy and F1 Score
Figure 24
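
A cut-off search of this kind can be sketched as a simple loop. The labels and probabilities below are made up, and picking the cut-off by F1 score is an assumption about the selection rule.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up true labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
proba = np.array([0.1, 0.2, 0.3, 0.45, 0.4, 0.6, 0.55, 0.7, 0.8, 0.9])

results = []
for cutoff in [i / 10 for i in range(1, 10)]:  # 0.1, 0.2, ..., 0.9
    # Predict class 1 whenever the probability reaches the cut-off.
    pred = (proba >= cutoff).astype(int)
    acc = accuracy_score(y_true, pred)
    f1 = f1_score(y_true, pred, zero_division=0)
    results.append((cutoff, acc, f1))

# Choose the cut-off with the highest F1 score.
best_cutoff, best_acc, best_f1 = max(results, key=lambda r: r[2])
print(best_cutoff, best_acc, best_f1)
```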

AUC and ROC for the training data

Inference:
• LDA works better when the target variable is categorical.

Q4. Based on these predictions, what are the insights and recommendations?

Answer: We draw the following inferences from the whole analysis:
• Most of the employees who are above 50 don’t opt for holiday
packages. They do not seem to be interested in holiday packages at all.
• Employees in the age range of 30 to 50 opt for holiday packages.
Younger people seem more willing to spend on holiday packages, so age
plays a very important role in deciding whether an employee will opt for the
package.
• Employees with a salary below 50,000 also opt for holiday packages.
Hence, salary is also a deciding factor for the holiday package.
• Education also plays an important role in the decision.
• To grow its customer base, the company needs to look into these
factors.

Recommendations To Business:

• We observed that most older people prefer to visit
religious places, so it would be better to target such places and
offer them packages that include visits to religious places.
• We can also look into the family dynamics of older people, for example
whether they have older children.
• People who earn more than 150,000 don’t spend much on the holiday
packages; they tend to go for lavish holidays. The company can offer
them customized packages according to their wishes, such as fancy
hotels, longer vacations and personal cars during the holiday.
