Predictive - Modelling - Project - PDF 1

Predictive Modelling Project
Problem 1: Linear Regression

The comp-activ database is a collection of computer system activity measures.
The data was collected from a Sun Sparcstation 20/712 with 128 Mbytes of memory running in
a multi-user university department. Users would typically be doing a large variety of tasks
ranging from accessing the internet, editing files or running very cpu-bound programs.
As a budding data scientist, you want to fit a linear equation that predicts 'usr' (the portion
of time (%) that CPUs run in user mode) from a list of system attributes, and to find out how
each attribute affects the time the system spends in 'usr' mode.
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the Data
types, shape, EDA, 5 point summary). Perform Univariate, Bivariate Analysis, Multivariate
Analysis.
Sample of dataset
Exploratory data analysis
Checking the shape and data types
There are a total of 8192 rows and 22 columns in the dataset. Of the 22 columns, 13 are float
type, 8 are integer type and 1 is object type.
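The shape, data type and 5 point summary checks can be reproduced with a few pandas calls. This is a minimal sketch, not the report's own code: the tiny data frame and the column name category_col are made-up stand-ins for the real dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values and column names).
df = pd.DataFrame({
    "scall": [2147, 170, 3245, 1234],           # integer column
    "rchar": [40671.0, 448.0, None, 2100.0],    # float column with one gap
    "category_col": ["a", "b", "a", "b"],       # string column
})

n_rows, n_cols = df.shape                             # (rows, columns)
dtype_counts = df.dtypes.astype(str).value_counts()   # number of columns per dtype
summary = df.describe().T   # count, mean, std, min, 25%, 50%, 75%, max per numeric column

print(n_rows, n_cols)
print(dtype_counts)
print(summary[["min", "25%", "50%", "75%", "max"]])   # the 5 point summary
```

By default describe() only covers the numeric columns, so the string column has to be inspected separately, for example with value_counts().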
Data Description
Unique values in categorical data

Univariate Analysis – Continuous Variables
Distribution plot of scall
Distribution plot of sread

Distribution plot of fork
Distribution plot of exec

Distribution plot of pflt
Distribution plot of freemem

Distribution plot of freeswap
Distribution plot of usr

1.2 Impute null values if present, also check for the values which are equal to zero. Do they
have any meaning or do we need to change them or drop them? Check for the possibility of
creating new features if required. Also check for outliers and duplicates if there.
There are 104 missing values in rchar and 15 missing values in wchar.
As both are continuous variables, the missing values can be imputed with the mean or the median.
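One way to carry out the imputation is sketched below. The report does not say whether the mean or the median was finally used, so the choice of the median here is an assumption, and impute_median is a hypothetical helper written for this illustration.

```python
import pandas as pd

def impute_median(df, columns):
    """Return a copy of df with NaNs in the given columns replaced by that column's median."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].median())
    return out

# Toy data with gaps in rchar and wchar (values are made up).
toy = pd.DataFrame({
    "rchar": [10.0, None, 30.0, 1000.0],
    "wchar": [5.0, 7.0, None, 9.0],
})

missing_before = toy.isnull().sum()          # missing values per column
filled = impute_median(toy, ["rchar", "wchar"])
print(missing_before)
print(filled)
```

The median is less sensitive than the mean to extreme values such as the 1000.0 in the toy rchar column, which is a common reason to prefer it for skewed variables.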
Checking for duplicate rows
There are no duplicate rows

Checking for outliers
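The report does not state which rule was used to check for outliers. A common choice, assumed here, is the 1.5 x IQR rule that a box plot also uses. The helper iqr_outlier_mask is hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (series < lower) | (series > upper)

values = pd.Series([1, 2, 3, 4, 5, 100])   # made-up values with one extreme point
mask = iqr_outlier_mask(values)
print(values[mask])                        # the flagged observations
```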
1.3 Encode the data (having string values) for Modelling. Split the data into train
and test (70:30). Apply Linear regression using scikit learn. Perform checks for
significant variables using appropriate method from statsmodel. Create multiple
models and check the performance of Predictions on Train and Test sets using
Rsquare, RMSE & Adj Rsquare. Compare these models and select the best one
with appropriate reasoning.
Encoding the String Values by Get Dummies
Sample of dataset after encoding
Data Split: Split the data into test and train
X and y are formed from the data and split 70:30 into train and test sets with random_state = 1.
X_train and y_train are then fit to the Linear Regression model.
The coefficients for each of the independent attributes are as below.
Intercept for the model
The intercept of the model is 84.13143842096603
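The encoding, split and fit steps can be sketched as follows. The data is synthetic and the column names (x1, x2, group, target) are hypothetical; only get_dummies, the 70:30 split and random_state = 1 come from the report. Using drop_first=True and dtype=float is an assumption made to keep the design matrix numeric and free of a redundant dummy column.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with one string column to encode.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "group": rng.choice(["low", "high"], size=n),
})
df["target"] = 3.0 * df["x1"] - 2.0 * df["x2"] + rng.normal(scale=0.1, size=n)

# Encode the string column as dummy variables.
encoded = pd.get_dummies(df, columns=["group"], drop_first=True, dtype=float)
X = encoded.drop(columns="target")
y = encoded["target"]

# 70:30 split with a fixed seed, as in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LinearRegression().fit(X_train, y_train)
coefficients = dict(zip(X.columns, model.coef_))   # one coefficient per attribute
intercept = model.intercept_
print(coefficients)
print(intercept)
```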
R square on training data

The R square on training data is 0.7961565330395104
R square on test data

The R square on test data is 0.7676695029858404
Root Mean Square Error (RMSE) on Training data

The RMSE on training data is 0.20690072466418796
Root Mean Square Error (RMSE) on test data

The RMSE on test data is 0.21647817772382874
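For reference, the three metrics named in the task can be computed as in the sketch below. The adjusted R square uses the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The function regression_report and the five data points are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def regression_report(y_true, y_pred, n_features):
    """Return R square, RMSE and adjusted R square for a set of predictions."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"r2": r2, "rmse": rmse, "adj_r2": adj_r2}

# Made-up observations and predictions from a one-predictor model.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.0])
report = regression_report(y_true, y_pred, n_features=1)
print(report)
```

Calling the same function on the train and test predictions gives the pair of values to compare; a test R square close to the train R square suggests the model is not badly overfit.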
We can drop the scall and fork variables, as they have high p-values.
Problem 2: Logistic Regression, LDA and CART
You are a statistician at the Republic of Indonesia Ministry of Health and you are
provided with data on 1473 females collected from a Contraceptive Prevalence
Survey. The samples are married women who were either not pregnant or did not know
if they were at the time of the survey.
The problem is to predict whether or not they use a contraceptive method of choice based on
their demographic and socio-economic characteristics.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition
check, check for duplicates and outliers and write an inference on it. Perform Univariate and
Bivariate Analysis and Multivariate Analysis.
Sample of the dataset
Checking the types of variables in the data frame

The dataset has 10 variables: 7 are object type, 2 are float type and 1 is integer type.
Contraceptive_method_used is the dependent variable
Check for missing values in the dataset
Check for duplicate values in the dataset
Now we go ahead and remove the null values and duplicate rows
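The null and duplicate checks, followed by the removal step, can be sketched as below. The rows are made up, and the column name Wife_age is an assumed spelling of the "Wife age" variable; the other two names appear in the report.

```python
import pandas as pd

# Toy rows: one exact duplicate and two rows with a missing value.
raw = pd.DataFrame({
    "Wife_age": [24.0, 45.0, 45.0, None, 30.0],
    "No_of_children_born": [3.0, 10.0, 10.0, 2.0, None],
    "Contraceptive_method_used": ["No", "Yes", "Yes", "No", "Yes"],
})

null_counts = raw.isnull().sum()                 # missing values per column
duplicate_count = int(raw.duplicated().sum())    # rows identical to an earlier row

# Remove rows with nulls, then remove duplicate rows.
clean = raw.dropna().drop_duplicates().reset_index(drop=True)
print(null_counts)
print(duplicate_count, len(clean))
```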
Unique values in the categorical data
Percentage of target variable
The split indicates 52.9% have used contraceptive methods

Univariate Analysis – Categorical Variables
There are more uneducated wives than uneducated husbands, and most of the husbands have
completed their tertiary education.
Working Yes = 0
Working No = 1
Box plot
Correlation Matrix Plot
Wife age and No_of_children_born are slightly correlated

Pairplot
Boxplot before treating outliers

Boxplot after treating outlier in No of children variable
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data
Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA
(linear discriminant analysis) and CART.
Encoding the Categorical Values by Get Dummies
This is the new data frame with additional columns
Data Split: Split the data into test and train

X and y are formed from the data and split 70:30 into train and test sets with
random_state = 1.
The above is the value count of y train dataset.
Getting the probabilities on the test set
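The logistic regression steps (fit, probabilities, confusion matrix, accuracy and AUC) can be sketched as follows. The data is synthetic with hypothetical feature names, and max_iter=1000 is an assumed setting, as the report does not list the model parameters.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data.
rng = np.random.default_rng(42)
n = 400
X = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
})
y = (X["feature_a"] + 0.5 * rng.normal(size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

test_proba = clf.predict_proba(X_test)[:, 1]     # probability of Label 1
test_pred = clf.predict(X_test)                  # class labels
test_cm = confusion_matrix(y_test, test_pred)    # rows: actual, columns: predicted
test_accuracy = accuracy_score(y_test, test_pred)
test_auc = roc_auc_score(y_test, test_proba)     # AUC needs probabilities, not labels
print(test_cm)
print(test_accuracy, test_auc)
```

Repeating the last five lines with X_train and y_train gives the training-side figures for comparison.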
Confusion matrix on the training data

Confusion matrix on the test data
Accuracy - Training Data
Accuracy of the training data is 0.6582694414019715
AUC and ROC for the training data

Accuracy - test Data
Accuracy of the test data is 0.6454081632653061
AUC and ROC for the test data
Classification Report on train data
For predicting that they used a contraceptive method (Label 1):

Precision = 0.66
Recall = 0.73
f1 score = 0.69
Classification Report on test data
Precision = 0.64
Recall = 0.77
f1 score = 0.70
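As a quick consistency check, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it from the reported precision and recall values; the function name f1_score_from is hypothetical.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

train_f1 = f1_score_from(0.66, 0.73)   # reported train precision and recall
test_f1 = f1_score_from(0.64, 0.77)    # reported test precision and recall
print(round(train_f1, 2), round(test_f1, 2))
```

Rounded to two decimals, the results agree with the reported F1 scores of 0.69 (train) and 0.70 (test).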
We fit our data into LDA model.
Training Data Class Prediction with a cut-off value of 0.5

Test Data Class Prediction with a cut-off value of 0.5
Confusion matrix
AUC and ROC for training and test data
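The LDA steps can be sketched in the same way. The data is synthetic; the 0.5 cut-off, the 70:30 split and random_state = 1 follow the report, while everything else is an assumption for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data.
rng = np.random.default_rng(7)
n = 400
X = rng.normal(size=(n, 2))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

cutoff = 0.5
test_proba = lda.predict_proba(X_test)[:, 1]       # probability of Label 1
test_pred = (test_proba > cutoff).astype(int)      # apply the cut-off explicitly
test_cm = confusion_matrix(y_test, test_pred)
test_auc = roc_auc_score(y_test, test_proba)
print(test_cm)
print(test_auc)
```

Writing the cut-off out explicitly makes it easy to try other thresholds and see how precision and recall trade off.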
Comparing both the models
            LR Train   LR Test   LDA Train   LDA Test
Accuracy    0.66       0.65      0.66        0.65
Precision   0.66       0.64      0.66        0.64
Recall      0.73       0.77      0.75        0.79
F1-score    0.69       0.70      0.70        0.71
AUC         0.72       0.72      0.72        0.68
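Comparing the two models, the results are almost the same: LDA has slightly higher
recall and F1-score, while logistic regression has the higher test AUC (0.72
against 0.68). LDA works well when the target variable is categorical.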
The EDA clearly indicates that women with a tertiary education and a very
high standard of living used contraceptive methods.
Women aged 21 to 38 generally use contraceptive methods more.
I believe the usage of contraceptive methods need not depend on
demographic or socioeconomic background, since the use of contraceptive
methods was almost the same for both working and non-working women.
The use of contraceptive methods was high for both Scientology and
Non-Scientology women.
