1. The document applies logistic regression to predict red wine quality from a dataset of physicochemical properties of red wines.

2. Exploratory data analysis covers univariate analysis of each variable and bivariate correlation analysis. The data is then prepared by checking for missing values and imputing outliers.

3. A logistic regression model is fitted on a training set and evaluated on a test set. Overall accuracy is high, but the AUC is only 0.511, because the classifier almost never predicts the positive class.

CSE3506 Essentials of Data Analytics

Name : Shivam Batra

Reg. No. : 19BPS1131
Lab Exercise: 5

Objective: Applying logistic regression to predict red wine quality.

Methods -

1. Exploratory data analysis (EDA)

2. Data preparation

3. Modeling -> Logistic regression

4. ROC Curve

STEPS:

#Importing the dataset

data <- read.csv('winequality-red.csv', sep = ';')
str(data)

## 'data.frame': 1599 obs. of 12 variables:

## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...

#Format outcome variable

data$quality <- ifelse(data$quality >= 7, 1, 0)
data$quality <- factor(data$quality, levels = c(0, 1))

#Descriptive statistics
summary(data)

## fixed.acidity  volatile.acidity  citric.acid    residual.sugar
## Min.   : 4.60  Min.   :0.1200    Min.   :0.000  Min.   : 0.900
## 1st Qu.: 7.10  1st Qu.:0.3900    1st Qu.:0.090  1st Qu.: 1.900
## Median : 7.90  Median :0.5200    Median :0.260  Median : 2.200
## Mean   : 8.32  Mean   :0.5278    Mean   :0.271  Mean   : 2.539
## 3rd Qu.: 9.20  3rd Qu.:0.6400    3rd Qu.:0.420  3rd Qu.: 2.600
## Max.   :15.90  Max.   :1.5800    Max.   :1.000  Max.   :15.500
## chlorides        free.sulfur.dioxide  total.sulfur.dioxide  density
## Min.   :0.01200  Min.   : 1.00        Min.   :  6.00        Min.   :0.9901
## 1st Qu.:0.07000  1st Qu.: 7.00        1st Qu.: 22.00        1st Qu.:0.9956
## Median :0.07900  Median :14.00        Median : 38.00        Median :0.9968
## Mean   :0.08747  Mean   :15.87        Mean   : 46.47        Mean   :0.9967
## 3rd Qu.:0.09000  3rd Qu.:21.00        3rd Qu.: 62.00        3rd Qu.:0.9978
## Max.   :0.61100  Max.   :72.00        Max.   :289.00        Max.   :1.0037
## pH             sulphates       alcohol        quality
## Min.   :2.740  Min.   :0.3300  Min.   : 8.40  0:1382
## 1st Qu.:3.210  1st Qu.:0.5500  1st Qu.: 9.50  1: 217
## Median :3.310  Median :0.6200  Median :10.20
## Mean   :3.311  Mean   :0.6581  Mean   :10.42
## 3rd Qu.:3.400  3rd Qu.:0.7300  3rd Qu.:11.10
## Max.   :4.010  Max.   :2.0000  Max.   :14.90

Univariate analysis

#Dependent variable
#Frequency plot
par(mfrow=c(1,1))
barplot(table(data[[12]]),
        main = sprintf('Frequency plot of the variable: %s',
                       colnames(data[12])),
        xlab = colnames(data[12]),
        ylab = 'Frequency')

#Check class bias
table(data$quality)

##
## 0 1
## 1382 217

round(prop.table((table(data$quality))),2)

##
## 0 1
## 0.86 0.14
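
With 86% of the wines in class 0, accuracy on its own says little about a classifier on this data. As a minimal sketch, the snippet below hard-codes the class counts printed above (so it runs without the CSV) and computes the accuracy of a trivial rule that always predicts the majority class. The names `counts` and `baseline_accuracy` are illustrative and do not appear in the original script.

```r
# Class counts copied from the table(data$quality) output above
counts <- c(`0` = 1382, `1` = 217)

# Accuracy of a trivial classifier that always predicts the majority class
baseline_accuracy <- max(counts) / sum(counts)
round(baseline_accuracy, 2)
## [1] 0.86
```

Any model evaluated on data with this class ratio should be compared against that baseline rather than against 50%.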

#Independent variable
#Boxplots
par(mfrow=c(3,4))
for (i in 1:(length(data)-1)){
  boxplot(x = data[i],
          horizontal = TRUE,
          main = sprintf('Boxplot of the variable: %s',
                         colnames(data[i])),
          xlab = colnames(data[i]))
}

#Histograms
par(mfrow=c(3,4))
for (i in 1:(length(data)-1)){
  hist(x = data[[i]],
       main = sprintf('Histogram of the variable: %s',
                      colnames(data[i])),
       xlab = colnames(data[i]))
}

Bivariate analysis

#Correlation matrix
library(ggcorrplot)

## Loading required package: ggplot2

ggcorrplot(round(cor(data[-12]), 2),
           type = "lower",
           lab = TRUE,
           title = 'Correlation matrix of the red wine quality dataset')

Data preparation

#Missing values
sum(is.na(data))

## [1] 0

#Outliers
#Identifying outliers with the 1.5 * IQR rule
is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) |
         x > quantile(x, 0.75) + 1.5 * IQR(x))
}
outlier <- data.frame(variable = character(),
                      sum_outliers = integer(),
                      stringsAsFactors = FALSE)
#Count the outliers in each predictor (every column except quality)
for (j in 1:(length(data)-1)){
  variable <- colnames(data[j])
  sum_outliers <- sum(is_outlier(data[[j]]))
  row <- data.frame(variable, sum_outliers)
  outlier <- rbind(outlier, row)
}
outlier

## variable sum_outliers
## 1 fixed.acidity 49
## 2 volatile.acidity 19
## 3 citric.acid 1
## 4 residual.sugar 155
## 5 chlorides 112
## 6 free.sulfur.dioxide 30
## 7 total.sulfur.dioxide 55
## 8 density 45
## 9 pH 35
## 10 sulphates 59
## 11 alcohol 13
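
To make the 1.5 * IQR rule concrete, here is a small worked sketch on a hypothetical toy vector (not taken from the wine data). The function restates the same rule so the snippet runs on its own. For this vector the quartiles are 3.25 and 7.75, so the fences are -3.5 and 14.5 and only the value 100 is flagged.

```r
# Same 1.5 * IQR rule as in the script, restated so the snippet is self-contained
is_outlier <- function(x) {
  q1 <- quantile(x, 0.25, names = FALSE)
  q3 <- quantile(x, 0.75, names = FALSE)
  x < q1 - 1.5 * IQR(x) | x > q3 + 1.5 * IQR(x)
}

# Hypothetical toy vector with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

x[is_outlier(x)]
## [1] 100
```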

#Identifying the percentage of outliers

for (i in 1:nrow(outlier)){
  if (outlier[i,2]/nrow(data) * 100 >= 5){
    print(paste(outlier[i,1],
                '=',
                round(outlier[i,2]/nrow(data) * 100, digits = 2),
                '%'))
  }
}

## [1] "residual.sugar = 9.69 %"

## [1] "chlorides = 7 %"

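#Imputing outlier values
#High outliers in residual.sugar (column 4) and chlorides (column 5)
#are replaced with the column mean. The threshold and the mean are
#recomputed on every iteration, so they shift slightly as values are replaced.

for (i in 4:5){
  for (j in 1:nrow(data)){
    if (data[[j, i]] > as.numeric(quantile(data[[i]], 0.75) +
                                  1.5 * IQR(data[[i]]))){
      if (i == 4){
        data[[j, i]] <- round(mean(data[[i]]), digits = 2)
      } else{
        data[[j, i]] <- round(mean(data[[i]]), digits = 3)
      }
    }
  }
}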
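
The same idea can be written without explicit loops. The sketch below is a vectorized variant on a hypothetical toy vector, not a drop-in replacement for the loop: it assumes the fence and the mean are computed once from the original values, so on real data its results can differ slightly from the row-by-row version. The names `upper_fence` and `x_imputed` are illustrative.

```r
# Hypothetical toy vector (not from the wine data) with one high outlier
x <- c(2, 2, 3, 3, 4, 4, 5, 5, 6, 50)

# Upper fence of the 1.5 * IQR rule, computed once on the original values
upper_fence <- quantile(x, 0.75, names = FALSE) + 1.5 * IQR(x)

# Replace every value above the fence with the mean of the original vector
x_imputed <- replace(x, x > upper_fence, round(mean(x), digits = 2))
x_imputed
```

Here the upper fence is 8 and the mean is 8.4, so only the last element changes, from 50 to 8.4.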

Modeling

#Splitting the dataset into the Training set and Test set
#The majority class (0) is undersampled so that the training set is balanced
data_ones <- data[which(data$quality == 1), ]
data_zeros <- data[which(data$quality == 0), ]
#Train data: 80% of the ones and an equal number of zeros
set.seed(123)
train_ones_rows <- sample(1:nrow(data_ones), 0.8*nrow(data_ones))
#Using nrow(data_ones) here draws as many zeros as ones
train_zeros_rows <- sample(1:nrow(data_zeros), 0.8*nrow(data_ones))
train_ones <- data_ones[train_ones_rows, ]
train_zeros <- data_zeros[train_zeros_rows, ]
train_set <- rbind(train_ones, train_zeros)
table(train_set$quality)

##
## 0 1
## 173 173

#Test Data
test_ones <- data_ones[-train_ones_rows, ]
test_zeros <- data_zeros[-train_zeros_rows, ]
test_set <- rbind(test_ones, test_zeros)
table(test_set$quality)

##
## 0 1
## 1209 44
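
The split above balances the training set and leaves the remaining rows for testing, so the two sets have very different class ratios. For comparison, a proportional stratified split keeps the same ratio in both. The following is a sketch of that alternative on hypothetical simulated data; it is not the method used in this exercise, and `toy`, `train_idx`, `train_toy` and `test_toy` are illustrative names.

```r
set.seed(123)

# Hypothetical toy data: 80 negatives and 20 positives
toy <- data.frame(x = rnorm(100),
                  quality = factor(rep(c(0, 1), times = c(80, 20))))

# Draw 80% of the row indices within each class
rows_by_class <- split(seq_len(nrow(toy)), toy$quality)
train_idx <- unlist(lapply(rows_by_class,
                           function(idx) sample(idx, floor(0.8 * length(idx)))),
                    use.names = FALSE)

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

table(train_toy$quality)  # 64 zeros and 16 ones
table(test_toy$quality)   # 16 zeros and 4 ones
```

Both subsets keep the original 80/20 class ratio, at the cost of training on imbalanced data.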

Logistic regression

#Logistic Regression
lr = glm(formula = quality ~.,
         data = train_set,
         family = binomial)
#Predictions: predicted probability of class 1 for each test row
prob_pred = predict(lr,
                    type = 'response',
                    newdata = test_set[-12])
library(InformationValue)
#The cutoff is tuned on the test set itself, which makes the test metrics optimistic
optCutOff <- optimalCutoff(test_set$quality, prob_pred)[1]
y_pred = ifelse(prob_pred > optCutOff, 1, 0)
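
As a minimal illustration of what `type = 'response'` returns, the sketch below fits a binomial `glm` on hypothetical simulated data (not the wine dataset) and checks that the predicted probabilities are the inverse logit of the linear predictor. The 0.5 cutoff is an arbitrary assumption for the example and is not the cutoff chosen by `optimalCutoff`.

```r
set.seed(1)

# Hypothetical simulated data: one predictor and a binary outcome
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-1 + 2 * x))

fit <- glm(y ~ x, family = binomial)

# Linear predictor (log-odds) and probabilities for the same observations
log_odds <- predict(fit, type = "link")
prob     <- predict(fit, type = "response")

# type = "response" is the inverse logit of the linear predictor
all.equal(prob, plogis(log_odds))

# Turning probabilities into class labels with a hypothetical 0.5 cutoff
pred_class <- ifelse(prob > 0.5, 1, 0)
table(actual = y, predicted = pred_class)
```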

#Making the confusion matrix

cm_lr = table(test_set[, 12], y_pred)
cm_lr

## y_pred
## 0 1
## 0 1207 2
## 1 43 1

#Accuracy: correct predictions (diagonal) over all predictions
accuracy_lr = (cm_lr[1,1] + cm_lr[2,2])/
  (cm_lr[1,1] + cm_lr[2,2] + cm_lr[2,1] + cm_lr[1,2])
accuracy_lr

## [1] 0.9640862

#ROC curve
library(ROSE)

## Loaded ROSE 0.0-4

par(mfrow = c(1, 1))

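#y_pred holds hard 0/1 labels, so the curve has a single operating point;
#passing prob_pred instead would trace the full ROC curve
roc.curve(test_set$quality, y_pred)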

## Area under the curve (AUC): 0.511
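
The AUC of 0.511 can be reproduced from the confusion matrix. When a ROC curve is built from hard 0/1 predictions it has a single operating point, and, assuming the usual piecewise-linear curve through (0, 0), that point, and (1, 1), its area equals the mean of sensitivity and specificity. The sketch below hard-codes the confusion matrix printed earlier; `cm`, `sensitivity`, `specificity` and `balanced_accuracy` are illustrative names.

```r
# Confusion matrix copied from the cm_lr output (rows = actual, columns = predicted)
cm <- matrix(c(1207, 43, 2, 1), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Share of actual positives that were predicted positive
sensitivity <- cm["1", "1"] / sum(cm["1", ])

# Share of actual negatives that were predicted negative
specificity <- cm["0", "0"] / sum(cm["0", ])

# Area under a single-point ROC curve
balanced_accuracy <- (sensitivity + specificity) / 2
round(c(sensitivity = sensitivity,
        specificity = specificity,
        balanced_accuracy = balanced_accuracy), 3)
```

Only 1 of the 44 good wines in the test set is detected (sensitivity of about 0.023), while specificity is about 0.998, which gives the 0.511 reported by `roc.curve`.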
