Predictive - Modelling - Project - PDF 1
As a budding data scientist, you decide to fit a linear equation that predicts 'usr' (the portion of
time (%) that CPUs run in user mode) from a list of system attributes, and to find out how each
attribute affects the time the system spends in 'usr' mode.
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the Data
types, shape, EDA, 5 point summary). Perform Univariate, Bivariate Analysis, Multivariate
Analysis.
Sample of dataset
Exploratory data analysis
Checking the shape and data types
There are a total of 8192 rows and 22 columns in the dataset. Of the 22 columns, 13 are float,
8 are integer, and 1 is an object-type variable.
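The shape, data-type, and five-point-summary checks can be sketched as follows. This is a minimal illustration on a hypothetical three-row frame. Only the column names `usr` and `rchar` come from the report, while `label` and all values are invented stand-ins for the real data.

```python
import pandas as pd

# Hypothetical toy frame; the real dataset has 8192 rows and 22 columns.
df = pd.DataFrame({
    "usr": [90, 88, 95],          # integer column
    "rchar": [1.5, 2.5, 3.5],     # float column
    "label": ["a", "b", "a"],     # string column (invented name)
})

# Shape: (number of rows, number of columns).
n_rows, n_cols = df.shape

# Count of columns per data type.
dtype_counts = df.dtypes.astype(str).value_counts()

# describe() summarises the numeric columns; the five-point summary is the
# min, the three quartiles, and the max.
five_point = df.describe().loc[["min", "25%", "50%", "75%", "max"]]
```

On the real data the same calls would be applied to the frame returned by the file reader.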
Data Description
There are 104 missing values in rchar and 15 missing values in wchar.
As both are continuous variables, the missing values can be imputed with the mean or median.
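A minimal sketch of median imputation for these two columns, assuming pandas and a hypothetical toy frame (the values are invented; only the column names rchar and wchar come from the report). The median is used here as one of the two options mentioned, since it is less sensitive to extreme values than the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
df = pd.DataFrame({
    "rchar": [10.0, np.nan, 30.0, 1000.0],
    "wchar": [5.0, 7.0, np.nan, 9.0],
})

# Replace each missing value with the median of its own column.
for col in ["rchar", "wchar"]:
    df[col] = df[col].fillna(df[col].median())

# Total missing values left in the two columns after imputation.
missing_after = int(df[["rchar", "wchar"]].isnull().sum().sum())
```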
Data Split: Split the data into test and train. The X and y for the data are formulated and
split 70:30 with random_state = 1.
The X_train and y_train data are then fit to the Linear Regression model.
The coefficients for each of the independent attributes are as below.
Intercept for the model
The intercept of the model is 84.13143842096603
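The split-and-fit step can be sketched with scikit-learn as below. The data here is synthetic and noise-free (the feature names `x1` and `x2` and the generating equation are assumptions for illustration), so the fitted coefficients and intercept recover the known values rather than the ones reported above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictors and the target (hypothetical names).
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
y = 3.0 * X["x1"] - 2.0 * X["x2"] + 5.0

# 70:30 split with random_state = 1, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Fit ordinary least squares on the training portion only.
model = LinearRegression()
model.fit(X_train, y_train)

# One coefficient per independent attribute, plus the intercept.
coefficients = dict(zip(X_train.columns, model.coef_))
intercept = model.intercept_

# R-squared on the held-out 30%.
r2_test = model.score(X_test, y_test)
```

Each coefficient is the expected change in the target for a one-unit increase in that attribute with the others held fixed, which is how the effect of each attribute on 'usr' would be read.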
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition
check, check for duplicates and outliers and write an inference on it. Perform Univariate and
Bivariate Analysis and Multivariate Analysis.
Now we go ahead and remove the null values and duplicate rows
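A minimal sketch of this cleaning step with pandas, on a hypothetical four-row frame (column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 duplicates row 0, row 2 has a missing value.
df = pd.DataFrame({
    "age": [24.0, 24.0, np.nan, 31.0],
    "children": [2, 2, 1, 3],
})

# Count the problems before removing them.
n_nulls = int(df.isnull().sum().sum())
n_dupes = int(df.duplicated().sum())

# Drop rows with any null value, then drop exact duplicate rows.
clean = df.dropna().drop_duplicates().reset_index(drop=True)
```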
Unique values in the categorical data
There are more uneducated wives than uneducated husbands, and most of the husbands have
completed their tertiary education.
Working Yes = 0
Working No = 1
Box plot
Correlation Matrix Plot
Do not scale the data. Encode the data (having string values) for Modelling. Data
Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA
(linear discriminant analysis) and CART.
Encoding the Categorical Values by Get Dummies
This is the new data frame with the additional columns
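A sketch of dummy encoding with `pandas.get_dummies`, assuming hypothetical column names `wife_education` and `wife_religion` and invented rows. The `drop_first=True` argument is an assumption of this sketch, not something the report states. It drops one level per variable to avoid redundant columns.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration.
df = pd.DataFrame({
    "age": [24, 31, 40],
    "wife_education": ["Primary", "Tertiary", "Secondary"],
    "wife_religion": ["Scientology", "Non-Scientology", "Scientology"],
})

# One indicator column per category level, minus the first level of each
# variable (levels are taken in sorted order).
encoded = pd.get_dummies(
    df, columns=["wife_education", "wife_religion"], drop_first=True
)
```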
Precision = 0.64
Recall = 0.77
f1 score = 0.70
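As a quick consistency check, the F1 score is the harmonic mean of precision and recall, so the third figure can be recomputed from the first two:

```python
# Values reported above.
precision = 0.64
recall = 0.77

# F1 = 2 * P * R / (P + R), the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
f1_rounded = round(f1, 2)
```

Rounded to two decimals this reproduces the reported 0.70.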
We then fit the data to the LDA model.
Confusion matrix
AUC and ROC for training and test data
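The LDA fit, confusion matrix, and ROC/AUC steps can be sketched with scikit-learn as below. The data comes from `make_classification`, a synthetic stand-in for the encoded dataset, so the numbers it produces are not the report's results.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Fit LDA on the training split.
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Hard predictions feed the confusion matrix; positive-class probabilities
# feed the ROC curve and the AUC.
test_pred = lda.predict(X_test)
test_prob = lda.predict_proba(X_test)[:, 1]

cm = confusion_matrix(y_test, test_pred)            # rows: actual, cols: predicted
auc_test = roc_auc_score(y_test, test_prob)
fpr, tpr, thresholds = roc_curve(y_test, test_prob)  # points of the ROC curve
```

Repeating the last four lines with `X_train` and `y_train` gives the training-side figures for comparison.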
Comparing both the models
            LR Train   LR Test   LDA Train   LDA Test
Accuracy    0.66       0.65      0.66        0.65
Precision   0.66       0.64      0.66        0.64
Recall      0.73       0.77      0.75        0.79
F1-score    0.69       0.70      0.70        0.71
AUC         0.72       0.72      0.72        0.68
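Comparing the two models, we find the results are nearly the same: accuracy and precision
match, LDA has slightly higher recall and F1-score, and logistic regression has the higher
test AUC (0.72 versus 0.68). LDA works better when the target variable is categorical.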
The EDA indicates that women with a tertiary education and a very high standard of
living used contraceptive methods
Women aged 21 to 38 generally use contraceptive methods more
The use of contraceptive methods was high for both Scientology and Non-Scientology
women