Answer Report (Predictive Modelling)
In Problem 1, we are provided with a dataset containing the prices and other
attributes of almost 27,000 cubic zirconia stones. We are required to help the company
predict the price of a stone on the basis of the details given in the
dataset, so that it can distinguish between more profitable and less
profitable stones and improve its profit share.
Introduction
The purpose of this exercise is to explore the datasets using EDA, linear
regression, LDA and logistic regression, and to use these techniques to arrive at the
required solutions.
Data Description of Problem 1
Problem Statement 1:
You are hired by a company, Gem Stones co ltd, which is a cubic zirconia manufacturer.
You are provided with the dataset containing the prices and other attributes of almost
27,000 cubic zirconia (which is an inexpensive diamond alternative with many of the same
qualities as a diamond). The company is earning different profits on different price slots.
You have to help the company in predicting the price for the stone on the basis of the
details given in the dataset so it can distinguish between higher profitable stones and
lower profitable stones so as to have better profit share. Also, provide them with the best
5 attributes that are most important.
Q1. Read the data and do exploratory data analysis. Describe the
data briefly. (Check the null values, Data types, shape, EDA,
duplicate values). Perform Univariate and Bivariate Analysis.
Answer:
Summary Of Dataset:
The dataset contains 26,967 rows and 11 columns. There are 2 integer-type features,
6 float-type features and 3 object-type features. ‘Price’ is the target
variable and the others are the predictor variables. The first column, “Unnamed: 0”,
is only a serial-number index, so we can remove it.
Step 1: We first check whether there are any duplicates in our dataset. The output
shows 34 duplicate rows. Therefore, before proceeding, we
remove the duplicates from our dataset.
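The duplicate check above can be sketched with pandas as follows. This is a minimal illustration on a toy data frame, not the report's actual code; the values are made up, and only the column name “Unnamed: 0” comes from the report. Note that the serial-number column is dropped first, since otherwise no two rows would compare as equal.

```python
import pandas as pd

# Toy stand-in for the cubic zirconia data; values are illustrative only.
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 4],
    "carat": [0.30, 0.33, 0.30, 0.90],
    "cut": ["Ideal", "Premium", "Ideal", "Very Good"],
    "price": [499, 984, 499, 6289],
})

# Drop the serial-number column before looking for duplicates.
df = df.drop(columns=["Unnamed: 0"])

# Count rows that are identical to an earlier row in every column.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
df_clean = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates, df_clean.shape)
```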
Step 2: We then check for missing values in our dataset. The output
shows 697 missing values. Therefore, before proceeding, we
impute the median in place of the missing values.
Figure 2: Treatment of Missing Values
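A minimal sketch of the median imputation described in Step 2, assuming pandas and a toy data frame (the values are illustrative, not taken from the dataset). Only numeric columns are imputed, because a median is not defined for object-type columns such as ‘cut’.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in 'depth'; values are illustrative only.
df = pd.DataFrame({
    "carat": [0.30, 0.33, 0.90, 1.10],
    "depth": [62.1, np.nan, 60.8, np.nan],
    "cut": ["Ideal", "Premium", "Very Good", "Ideal"],
})

# Number of missing values in each column before treatment.
missing_per_column = df.isnull().sum()

# Replace missing numeric values with the median of their own column.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

print(missing_per_column)
print(df)
```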
Step 3: We then use box plots to detect and visualize the outliers
in our dataset. After that, we treat the outliers
present in our dataset.
Figure 3: Data Visualization of the Presence of Outliers
After detecting the outliers, we will treat the outliers and remove them from our dataset.
Figure 4: Data Visualization of the Treatment of Outliers
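The report does not state which treatment was applied. One common choice that matches the box-plot rule is to cap values at the whisker limits (1.5 times the interquartile range beyond the quartiles). The sketch below assumes that approach; the function name cap_outliers_iqr and the sample values are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR], the box-plot whisker limits."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Illustrative values only: one extreme observation among typical ones.
prices = pd.Series([300.0, 320.0, 350.0, 400.0, 420.0, 9000.0])
capped = cap_outliers_iqr(prices)

print(capped.tolist())
```

Capping keeps every row, whereas dropping the outlying rows would shrink the dataset; which is preferable depends on whether the extreme values are errors or genuine expensive stones.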
Inference: After performing univariate analysis, we noticed that the distributions of some
quantitative features, such as ‘Carat’, and of the target feature ‘Price’ are heavily right-skewed.
1) As we can infer from the data, ‘Price’ is the target variable while all others are the
predictors. The dataset contains 26,967 rows and 11 columns. There are 2 integer-type
features, 6 float-type features and 3 object-type features. The first column,
“Unnamed: 0”, is only a serial-number index, so we can remove it.
2) In the dataset, the mean and median values do not differ much. We observed that the
minimum values of ‘x’, ‘y’ and ‘z’ are zero. These are faulty values, since
dimensionless or 2-dimensional diamonds are not possible. Hence, we
have to filter out those rows, as they are clearly faulty data entries. Furthermore, there are
three object-type features, i.e. ‘cut’, ‘colour’ and ‘clarity’.
3) We observed 697 missing values in the ‘depth’ column. There are also some
duplicate rows, which make up about 0.13% of the total data (34 of 26,967 rows).
We have therefore dropped the duplicate rows.
4) There is a significant number of outliers in some variables, so we have
treated them. We also observed that the distributions of some quantitative
features, such as ‘carat’, and of the target feature ‘price’ are heavily right-skewed.
5) From the whole analysis, most features appear to correlate with the price of the
diamond. The notable exception is ‘depth’, which has negligible correlation with price.
Furthermore, diamonds with the premium cut are the most expensive, followed by
those with the very good cut.
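The zero-dimension filter mentioned in point 2 can be sketched as below, assuming pandas. The column names ‘x’, ‘y’ and ‘z’ come from the report; the rows are illustrative only.

```python
import pandas as pd

# Illustrative rows: three of the four have a zero in x, y or z.
df = pd.DataFrame({
    "x": [4.27, 0.0, 6.04, 5.10],
    "y": [4.29, 4.50, 0.0, 5.12],
    "z": [2.66, 2.70, 3.72, 0.0],
    "price": [499, 984, 6289, 1500],
})

dims = ["x", "y", "z"]

# Flag rows where any physical dimension is recorded as zero.
has_zero_dim = (df[dims] == 0).any(axis=1)

# Keep only the physically plausible rows.
df_valid = df.loc[~has_zero_dim].reset_index(drop=True)

print(int(has_zero_dim.sum()), len(df_valid))
```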
Q2. Impute null values if present, also check for the values which are
equal to zero. Do they have any meaning or do we need to change
them or drop them? Check for the possibility of combining the sub
levels of a ordinal variables and take actions accordingly. Explain
why you are combining these sub levels with appropriate reasoning.
Answer:
We start by checking the dataset for null values. Our analysis shows
697 null values in the ‘depth’ column.
We then compute the median of each attribute and use it
to replace the null values present in the dataset.
Further Inference:
Scaling is not strictly required in this scenario. It is generally recommended for regression
techniques because it helps gradient descent converge faster and reach the global minimum. When
the number of features becomes large, scaling helps the model run quickly; without it, the
starting point can be very far from the minimum. We have built the
model without scaling for now.
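For reference, standardization (z-score scaling) can be sketched with NumPy as below; the helper name standardize and the sample values are hypothetical. As far as we are aware, scikit-learn's LinearRegression solves the least-squares problem directly rather than by gradient descent, which is consistent with skipping this step here.

```python
import numpy as np

def standardize(X):
    """Rescale each column to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unchanged instead of dividing by zero
    return (X - mean) / std

# Illustrative values only: two columns on very different scales.
X = np.array([[0.3, 61.0], [0.9, 62.5], [1.5, 60.1]])
Z = standardize(X)

print(Z.mean(axis=0), Z.std(axis=0))
```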
Q3. Encode the data (having string values) for Modelling. Split the
data into train and test (70:30). Apply Linear regression using scikit
learn. Perform checks for significant variables using appropriate
method from statsmodel. Create multiple models and check the
performance of Predictions on Train and Test sets using Rsquare,
RMSE & Adj Rsquare. Compare these models and select the best one
with appropriate reasoning.
We copy all the predictor variables into the X data frame and the target into the y data frame.
We then split X and y into a training set and a test set in a 70:30 ratio
using the scikit-learn package. Finally, we invoke
the linear regression function and find the best-fit model on the training data.
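The steps above can be sketched as follows, assuming scikit-learn and pandas. The data is synthetic (price is generated from carat plus noise), one-hot encoding of ‘cut’ is an assumption since the report does not name the encoding used for Problem 1, and random_state=1 is an arbitrary choice for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data; not the report's dataset.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, n),
    "cut": rng.choice(["Fair", "Good", "Very Good", "Premium", "Ideal"], n),
})
df["price"] = 5000 * df["carat"] + rng.normal(0, 200, n)

# Encode the string column; drop_first avoids a redundant dummy column.
X = pd.get_dummies(df.drop(columns="price"), columns=["cut"],
                   drop_first=True, dtype=float)
y = df["price"]

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Fit on the training data only, then score both sets with R-squared.
model = LinearRegression().fit(X_train, y_train)
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)

print(round(r2_train, 3), round(r2_test, 3))
```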
The intercept is the expected mean value of Y when all X = 0. When X = 0 does not occur
in the data, the intercept has no intrinsic meaning.
From the scatter plot of predicted y against actual y, we see that the relationship is linear
and that there is a very strong correlation between the two.
The plot also shows a lot of spread, which indicates some unexplained variance in
the output.
As the results on the training data and the test data are almost in line, we can conclude
that this model is a right-fit model, i.e. neither overfitted nor underfitted.
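The three metrics named in the question can be computed directly from their definitions. The sketch below uses NumPy only; the function name regression_metrics and the example numbers are hypothetical, and n_features is the number of predictors in the fitted model.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Return R-squared, adjusted R-squared and RMSE for a fitted regression."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1 - ss_res / ss_tot
    # Adjusted R-squared penalizes predictors that add no explanatory power.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    rmse = np.sqrt(ss_res / n)
    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse}

# Illustrative values only.
metrics = regression_metrics([1, 2, 3, 4], [1, 2, 3, 5], n_features=1)
print(metrics)
```

Computing these on both the training and test sets gives a direct check for overfitting: a large gap between train and test R-squared or RMSE suggests the model does not generalize.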
Inference from Linear Regression Using statsmodels
The overall p-value is less than alpha, so we
reject H0 and accept Ha, i.e. at least one regression coefficient is not 0. Here, none of the
regression coefficients is 0.
Hence, we can conclude that the attributes with a p-value greater than 0.05 are
poor predictors of price.
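For reference, the hypotheses behind the overall test and the per-attribute test can be written out as below. The symbols are our own notation, not the report's: k is the number of predictors, n the number of observations, and R^2 the coefficient of determination of the fitted model.

```latex
% Overall F-test: does the model explain anything at all?
H_0:\ \beta_1 = \beta_2 = \dots = \beta_k = 0
\qquad
H_a:\ \beta_j \neq 0 \ \text{for at least one } j

F = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)}

% Per-coefficient t-test: is attribute j a significant predictor?
t_j = \frac{\hat{\beta}_j}{\operatorname{SE}(\hat{\beta}_j)}
```

A small p-value for F rejects H_0 for the model as a whole, while a p-value above 0.05 for an individual t_j marks that attribute as a poor predictor.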
Recommendations To Business:
Gem Stones company should consider the features ‘Carat’, ‘Cut’, ‘colour’, ‘clarity’ and width,
i.e. ‘y’, as the most important for predicting the price, so that it can distinguish between
more profitable and less profitable stones and improve its profit share.
As we can see from the model, the higher the width (y) of the stone, the higher the price.
Hence, stones with a higher width should be considered among the more profitable stones.
Diamonds with the ‘Premium’ cut are the most expensive, followed by the ‘Very Good’ cut; these
should be considered among the more profitable stones.
Diamonds with clarity ‘VS1’ and ‘VS2’ are the most expensive. Hence, these two
categories should also be considered among the more profitable stones.
We observed that for ‘x’, the higher the length of the stone, the lower the price. Likewise, the
higher the ‘z’, the lower the price. This is because if a diamond’s height is too large, the
diamond becomes ‘dark’ in appearance, as it no longer returns an attractive amount of light.
Stones with a higher ‘z’ are therefore also lower in profitability.
Problem Statement 2:
You are hired by a tour and travel agency which deals in selling holiday packages. You are
provided details of 872 employees of a company. Among these employees, some opted for the
package and some didn't. You have to help the company in predicting whether an employee will
opt for the package or not on the basis of the information given in the data set. Also, find out the
important factors on the basis of which the company will focus on particular employees to sell
their packages.
Q1. Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and
Bivariate Analysis. Do exploratory data analysis.
Answer:
Shape of dataset:
Info of dataset:
We have no null values in the dataset
We have variables of integer and object data type.
Univariate Analysis
Figure 16: Univariate Analysis
Inference:
We can see that most of the distributions are right-skewed, except for educ.
The salary distribution has the largest number of outliers.
There are some outliers in educ, number of young children and number of older children.
Inference:
There is hardly any correlation between the variables, and the data seems to be normal. There is
no huge difference in the data distribution between the holiday package classes.
Inference:
There is hardly any correlation between the variables, so there is no multicollinearity.
Q2. Do not scale the data. Encode the data (having string values) for
Modelling. Data Split: Split the data into train and test (70:30). Apply
Logistic Regression and LDA (linear discriminant analysis).
Answer:
We have done one-hot encoding to create dummy variables, and we can see that all
values for foreign_yes are 0.
The logistic regression model predicts better results if encoding is done.
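A minimal sketch of the one-hot encoding step, assuming pandas. The dummy name foreign_yes matches the report; the target column name holiday_package and all values are hypothetical. Using drop_first keeps a single 0/1 column per binary variable.

```python
import pandas as pd

# Illustrative rows only; 'holiday_package' is a hypothetical column name.
df = pd.DataFrame({
    "holiday_package": ["yes", "no", "no", "yes"],
    "salary": [48000, 37000, 58000, 66000],
    "foreign": ["no", "yes", "no", "yes"],
})

# One-hot encode the string columns, keeping one dummy per binary variable.
encoded = pd.get_dummies(
    df, columns=["holiday_package", "foreign"], drop_first=True, dtype=int
)

print(encoded)
```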
We have then split the data in 70:30 ratio. Below is the output:
Figure 19: Data Splitting
Logistic Regression
Inference:
We observed that precision for class 1 is 0.63, recall is 0.45, accuracy is 0.63 and the F1
score is 0.63.
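For reference, these metrics follow from the confusion-matrix counts as sketched below. The function name classification_scores and the counts are hypothetical and are not the report's results.

```python
def classification_scores(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Illustrative counts only.
scores = classification_scores(tp=45, fp=5, fn=15, tn=35)
print(scores)
```

Because F1 is a harmonic mean, the F1 score for a single class always lies between that class's precision and recall; a reported F1 outside that range is usually an average over both classes.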
Inference:
LDA works better when there is a categorical target variable.
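Fitting both classifiers on the same 70:30 split can be sketched as below, assuming scikit-learn. The data is synthetic, and random_state and max_iter are arbitrary choices; the accuracies it prints say nothing about the holiday package data.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data; not the report's dataset.
rng = np.random.default_rng(42)
n = 300
X = rng.normal(size=(n, 3))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# 70:30 split, stratified so both sets keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "lda": LinearDiscriminantAnalysis(),
}

# Fit each model on the training set and record its test accuracy.
test_accuracy = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}

print(test_accuracy)
```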
Answer: The following are the inferences we draw from the whole analysis:
Most of the employees who are above 50 don’t opt for holiday
packages. It seems that they are not interested in holiday packages at all.
Employees in the age range of 30 to 50 opt for holiday packages.
It seems that younger people believe in spending on holiday packages, so age
plays a very important role in deciding whether they will opt for the
package or not.
Also, people who have a salary of less than 50000 opt for holiday packages.
Hence, salary is also a deciding factor for the holiday package.
Education also plays an important role in deciding on holiday packages.
To improve its customer base, the company needs to look into these
factors.
Recommendations To Business:
We observed that most older people prefer to visit
religious places, so it would be better to target such places and
offer packages built around visiting them.
We can also look into the family dynamics of the older people,
for instance whether they have elder children.
People who earn more than 150000 don’t spend much on the holiday
packages. They tend to go for lavish holidays, so the company can offer
them customized packages according to their wishes, such as fancy
hotels, longer vacations and personal cars during the holiday, along with
extra facilities on request.