Assignment Report - Predictive Modelling - Rahul Dubey


A project report on identifying the best model among Linear Regression, Logistic Regression and LDA, based on Python.

Predictive Modelling

Rahul Dubey
PGPDSBA – O – July 2022 – C
January 2, 2023
Table of Contents
Problem Statement
Solution
Problem 1
1.1 Read the data and do exploratory data analysis. Describe the data briefly (check the null values, data types, shape, EDA). Perform Univariate and Bivariate Analysis.
1.2 Impute null values if present. Do you think scaling is necessary in this case?
1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into test and train (30:70). Apply Linear regression. Performance Metrics: Check the performance of Predictions on Train and Test sets using R-square, RMSE.
1.4 Inference: Based on these predictions, what are the business insights and recommendations?
Problem 2
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and the null value condition check, and write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
2.2 Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, plot the ROC curve and get the ROC_AUC score for each model. Compare both models and write inferences on which model is best/optimized.
2.4 Inference: Based on these predictions, what are the insights and recommendations?
Problem Statement
Problem 1 – Linear Regression:
You are a part of an investment firm, and your work is to do research on these 759 firms. You are provided with a dataset containing the sales and other attributes of these 759 firms. Predict the sales of these firms on the basis of the details given in the dataset, so as to help your company invest consciously. Also, provide the 5 attributes that are most important.

Questions for Problem 1:

1.1) Read the data and do exploratory data analysis. Describe the data briefly (check the null values,
data types, shape, EDA). Perform Univariate and Bivariate Analysis.
1.2) Impute null values if present. Do you think scaling is necessary in this case?
1.3) Encode the data (having string values) for Modelling. Data Split: Split the data into test and train
(30:70). Apply linear regression. Performance Metrics: Check the performance of Predictions on
Train and Test sets using R-square, RMSE.
1.4) Inference: Based on these predictions, what are the business insights and recommendations?

Data Dictionary for Firm_level_data:

1) sales: Sales (in millions of dollars).


2) capital: Net stock of property, plant, and equipment.
3) patents: Granted patents.
4) randd: R&D stock (in millions of dollars).
5) employment: Employment (in 1000s).
6) sp500: Membership of firms in the S&P 500 index. The S&P 500 is a stock market index that measures the
stock performance of 500 large companies listed on stock exchanges in the United States.
7) tobinq: Tobin's q (also known as the q ratio and Kaldor's v) is the ratio between a physical asset's
market value and its replacement value.
8) value: Stock market value.
9) institutions: Proportion of stock owned by institutions.

Problem 2 – Logistic Regression and Linear Discriminant Analysis:


You are hired by the Government to do an analysis of car crashes. You are provided with details of car crashes,
among which some people survived and some didn't. You have to help the government predict whether a
person will survive or not on the basis of the information given in the dataset, so as to provide insights that will
help the government make stronger laws for car manufacturers to ensure safety measures. Also, find out the
important factors on the basis of which you made your predictions.

2.1) Data Ingestion: Read the dataset. Do the descriptive statistics and the null value condition check,
and write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
2.2) Encode the data (having string values) for Modelling. Data Split: Split the data into train and test
(70:30). Apply Logistic Regression and LDA (linear discriminant analysis).
2.3) Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Compare both the
models and write inferences, which model is best/optimized.
2.4) Inference: Based on these predictions, what are the business insights and recommendations?
Data Dictionary for Car_Crash

1) dvcat: factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54, 55+
2) weight: Observation weights, albeit of uncertain accuracy, designed to account for varying sampling
probabilities. (The inverse-probability weighting estimator can be used to demonstrate causality
when the researcher cannot conduct a controlled experiment but has observed data to model.)
3) Survived: factor with levels Survived or not_survived
4) airbag: a factor with levels none or airbag
5) seatbelt: a factor with levels none or belted
6) frontal: a numeric vector; 0 = non-frontal, 1=frontal impact
7) sex: a factor with levels f: Female or m: Male
8) ageOFocc: age of occupant in years
9) yearacc: year of accident
10) yearVeh: Year of model of vehicle; a numeric vector
11) abcat: Did one or more (driver or passenger) airbag(s) deploy? This factor has levels deploy,
nodeploy and unavail
12) occRole: a factor with levels driver or pass: passenger
13) deploy: a numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or more bags
deployed.
14) injSeverity: a numeric vector; 0: none, 1: possible injury, 2: no incapacity, 3: incapacity, 4: killed; 5:
unknown, 6: prior death
15) caseid: character, created by pasting together the population sampling unit, the case number, and
the vehicle number. Within each year, use this to uniquely identify the vehicle.
Solution
Problem 1

1.1 Read the data and do exploratory data analysis. Describe the data briefly (check the null values, data types, shape,
EDA). Perform Univariate and Bivariate Analysis.
Head of the data

Shape of the Data

Data Information

Data Description
Duplicate Check

Uni-variate Analysis
[Univariate distribution plots for sales, capital, patents, randd, employment, tobinq, value and institutions]


Bi-variate Analysis

Multi-variate Analysis
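The EDA steps above (head, shape, info, description, duplicate and null checks) can be sketched in Python as below. Since the actual Firm_level_data file is not bundled with this report, the sketch builds a small synthetic DataFrame with the same column names as a stand-in, so the printed values are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for Firm_level_data (the real file is not included here)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sales": rng.gamma(2.0, 500.0, 100),      # sales in $M
    "capital": rng.gamma(2.0, 300.0, 100),    # net stock of PP&E
    "patents": rng.poisson(3, 100),           # granted patents
    "sp500": rng.choice(["yes", "no"], 100),  # S&P 500 membership
})

print(df.head())                  # head of the data: first five rows
print(df.shape)                   # shape: (rows, columns)
df.info()                         # data information: dtypes, non-null counts
print(df.describe())              # data description: summary statistics
print("duplicates:", df.duplicated().sum())  # duplicate check
print(df.isnull().sum())          # null count per column
```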
1.2 Impute null values if present. Do you think scaling is necessary in this case?
Null values check before treatment        Null values check after treatment

Scaling is not necessary in this case because all the major features are on the same scale, as per the data dictionary. Outliers, however, do need to be treated: the EDA showed outliers in almost every numerical column, and this treatment is carried out below.
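A minimal sketch of this treatment, assuming pandas is available: impute nulls with the column median, then cap outliers at the usual 1.5×IQR whiskers (one common treatment; the report's own median-based capping may differ in detail). The `tobinq` values here are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"tobinq": rng.normal(2.0, 0.5, 200)})
df.loc[:9, "tobinq"] = np.nan   # hypothetical missing values
df.loc[10, "tobinq"] = 50.0     # hypothetical outlier

# 1) Impute nulls with the column median
df["tobinq"] = df["tobinq"].fillna(df["tobinq"].median())

# 2) Cap outliers to the 1.5*IQR whiskers (illustrative choice of treatment)
q1, q3 = df["tobinq"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["tobinq"] = df["tobinq"].clip(lower, upper)

print("nulls after treatment:", df["tobinq"].isnull().sum())
```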
1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into test and train (30:70). Apply
Linear regression. Performance Metrics: Check the performance of Predictions on Train and Test sets using R-square,
RMSE.
Data encoding for the column (sp500) having string values

We encoded the data for modelling and split it into test and train sets at a 30:70 ratio (30% test, 70% train), separating the predictors and target into X_train, X_test, y_train and y_test, which were then fitted to the model below.

Performance metrics (R-square and RMSE scores) for the linear regression are as below:

The final linear regression equation is as below:

1.4 Inference: Based on these predictions, what are the business insights and recommendations?
There were a lot of outliers, which were treated using the median of the top 50% of values in each column. The model explains around 94% of the variance on the train dataset and 92% on the test dataset. It is advised that the organisation consider the following features in all investing decisions:

1) capital - Will help in identifying the net worth of the firm.
2) randd - Will help in identifying how much a firm is investing in future growth by way of research and development.
3) employment - Will help in identifying the employee base of the firm.
4) tobinq - Will help in identifying the current valuation of the firm.
5) sp500 - Will help in identifying whether the firm is a member of the S&P 500 index.

Problem 2

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and the null value condition check, and write an
inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
Head of the data

Shape of the Data

Data Information

Description of Data
Duplicate Check

Duplicate rows removed

Null Check
Null values check before treatment        Null values check after treatment
Checking Outliers

Uni-variate Analysis
[Univariate distribution plots for weight, frontal, ageOFocc, yearacc, yearVeh, deploy and injSeverity]

Bi-variate Analysis
Multi-variate Analysis
Conversion of the categorical column ‘dvcat’ to categorical codes, changing its data type to numeric.
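This dvcat conversion can be sketched with pandas ordered categoricals; the speed-band labels come from the data dictionary, and the rows here are a hypothetical slice of the Car_Crash data.

```python
import pandas as pd

# Hypothetical slice of Car_Crash: dvcat is an ordered impact-speed band
df = pd.DataFrame({"dvcat": ["1-9km/h", "10-24", "25-39", "40-54", "55+", "10-24"]})

# Fix the category order explicitly so the codes follow increasing speed
speed_order = ["1-9km/h", "10-24", "25-39", "40-54", "55+"]
df["dvcat"] = pd.Categorical(df["dvcat"], categories=speed_order, ordered=True).codes
print(df["dvcat"].tolist())  # [0, 1, 2, 3, 4, 1]
```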

2.2 Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply
Logistic Regression and LDA (linear discriminant analysis).
Taking "Survived" as the target variable, we split the data into train and test sets and applied the models below.

Logistic Regression Model

Train - Probabilities Test - Probabilities

Linear Discriminant Analysis


Train - Probabilities Test - Probabilities
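Fitting both models and extracting the train/test class probabilities shown above can be sketched as follows, using a synthetic binary target in place of Survived (the crash data itself is not bundled with this report).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem standing in for "Survived"
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# Class-membership probabilities on train and test, as reported above
train_probs = logit.predict_proba(X_train)
test_probs = logit.predict_proba(X_test)
print(train_probs[:3])           # logistic regression, train
print(lda.predict_proba(X_test)[:3])  # LDA, test
```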
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix,
plot the ROC curve and get the ROC_AUC score for each model. Compare both the models and write inferences on which
model is best/optimized.
The accuracy of both models on both train and test is almost the same, between 98% and 99%, and the confusion matrices show similar behaviour. Hence, logistic regression can be considered the better model, as it holds a slightly higher accuracy of 99%.

Logistic Regression Model


Accuracy - Train Accuracy - Test

Confusion Matrix - Train Confusion Matrix – Test

AUC Score & ROC Curve - Train AUC Score & ROC Curve - Test
Linear Discriminant Analysis
Accuracy - Train        Accuracy - Test

Confusion Matrix - Train Confusion Matrix – Test

AUC Score & ROC Curve - Train AUC Score & ROC Curve - Test
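A sketch of the metric computations in this section (accuracy, confusion matrix and ROC_AUC for each model), again on synthetic data, so the numbers will not match the report's 98-99%.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("LDA", LinearDiscriminantAnalysis())]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    acc = accuracy_score(y_test, pred)          # accuracy
    cm = confusion_matrix(y_test, pred)         # confusion matrix
    auc = roc_auc_score(y_test, proba)          # ROC_AUC score
    fpr, tpr, _ = roc_curve(y_test, proba)      # points for the ROC curve plot
    print(f"{name}: accuracy={acc:.3f}  AUC={auc:.3f}")
    print(cm)
```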

2.4 Inference: Based on these predictions, what are the insights and recommendations?
On the basis of the information given in the dataset, the below inferences can be made:
1) In the sampled accidents, 7,024 people survived and 826 did not, i.e. roughly 89% of occupants survived.
2) The accuracy of the logistic regression model on both the training and testing data is almost the same, i.e. 99%.
3) The AUC and accuracy for both training and testing data are almost identical.
4) The other parameters of the confusion matrix for logistic regression are also similar across train and test; since the model performs equally well on unseen data, it does not appear to be over-fitted.
5) In the case of LDA, the AUC for testing and training data is likewise the same at 98%, and the other confusion matrix parameters are also similar, indicating that this model is not over-fitted either.
6) The logistic regression model gives better recall and precision in comparison to LDA. Hence, the logistic regression model can be considered for further use and upgrading.
