FinalProject STAT4444
STAT 4444
Prof. Hamdy Mahmood
10th April, 2022
Final Project - Analysis of Real Estate Data using simple linear regression and
Bayesian simple linear regression.
- Introduction
We have a Real Estate dataset with 5 columns: Sale Price, Age of the house, Percent College,
Bedrooms and Lot Size. The original dataset has 350 observations, but the project
requirements call for only 100 observations, so we draw a random sample of 100 using sample_n().
Our goal for this analysis is to build one model using the frequentist approach and one using the
Bayesian approach to predict the sale price of a house. The sale price will be our response variable, while
the other variables will be explanatory variables. After establishing the predictive models we will
assess their fit and robustness using several diagnostic measures. Finally, we will compare and
contrast the frequentist model and the Bayesian model to reach a
conclusion. The software tools used in this project are RStudio and Minitab.
We compute the mean, median and interquartile range, which depict the center and dispersion of each variable.
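The sampling and summary steps described above can be sketched in base R as follows. This is only an illustration: the data frame `re_full` is simulated (the real RealEstate.xls file is not reproduced here), its column names and values are assumptions, and `set.seed()` is added so that the 100-row subsample can be reproduced.

```r
# Simulated stand-in for the 350-row dataset; values and column names are hypothetical.
set.seed(4444)
re_full <- data.frame(
  SalePrice = rnorm(350, mean = 200000, sd = 40000),
  Age       = sample(1:80, 350, replace = TRUE),
  Bedrooms  = sample(1:7, 350, replace = TRUE)
)

# Draw 100 rows without replacement (base-R counterpart of dplyr::sample_n(re_full, 100)).
re100 <- re_full[sample(nrow(re_full), 100), ]

# Mean, median and interquartile range for each column (one column of `stats` per variable).
stats <- sapply(re100, function(x) c(mean = mean(x), median = median(x), IQR = IQR(x)))
print(round(stats, 1))
```

Fixing the seed matters here because every later estimate depends on which 100 houses happen to be drawn.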
We graph histograms to examine the distribution of each variable in the dataset. It should be
noted that the distribution of Sale Price is shaped like a bell curve, which suggests that it is
approximately normally distributed. This aligns with our goal of predicting the sale price, and we can assume
We check for outliers in the dataset using scatter plots. We find outlying points in all of the graphs, and
upon further investigation we find that one of the houses listed for sale had 7 bedrooms,
had the largest lot size and was one of the oldest houses. So we have one outlier in our dataset.
Fitting the Regression Model
Regression equation:
Sale Price = -29298 - 804 Age + 1706 Pct College + 52706 Bedrooms + 2.265 Lot Size
The model summary gives us the diagnostic measures to check model fit and predictive
capability. The R-sq is 27.55%, which suggests that both the fit and the predictive capability of this
model are poor. We will use other measures such as AIC/BIC and PRESS to compare this
model with another one. The p-values for the regressors suggest that all the variables are
The analysis of variance indicates that the number of bedrooms has the most impact on the
regression model, while Pct College and the age of the house have similar effects on the
regression.
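The fit measures mentioned in this section (R-sq, AIC/BIC, PRESS and the analysis of variance) can also be obtained in R. The sketch below is a minimal illustration on simulated data, not the project's Minitab output: the data frame `d`, its column names and the coefficients used to generate it are all assumptions.

```r
# Simulated data; coefficients and column names are hypothetical, not the project's results.
set.seed(1)
n <- 100
d <- data.frame(
  Age      = runif(n, 1, 80),
  Bedrooms = sample(2:6, n, replace = TRUE)
)
d$SalePrice <- 50000 - 800 * d$Age + 50000 * d$Bedrooms + rnorm(n, sd = 60000)

# Ordinary least squares fit.
fit <- lm(SalePrice ~ Age + Bedrooms, data = d)

r_squared <- summary(fit)$r.squared
# PRESS: sum of squared leave-one-out prediction errors, e_i / (1 - h_ii).
press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)
# Externally studentized (deleted) residuals.
del_res <- rstudent(fit)

cat("R-sq:", round(r_squared, 4), " AIC:", AIC(fit), " BIC:", BIC(fit), " PRESS:", press, "\n")
anova(fit)   # sequential analysis-of-variance table for the regressors
```

PRESS is never smaller than the residual sum of squares, so a PRESS value far above it is a sign that the model predicts new observations much worse than it fits the sample.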
of estimating a linear model. However, the normal probability plot is S-shaped, which suggests
that the residuals are not normally distributed and the model fit is not accurate. The deleted residuals indicate
roughly constant variance, but there are some outliers which we can spot in versus
Bayesian Simple Linear Regression
We implement a Bayesian approach using R. Under the Bayesian framework, we can
interpret credible intervals as the probabilities of the coefficients lying in such intervals.
The median estimate and MAD_SD (a scaled median absolute deviation) are computed by the Bayesian
model, which uses MCMC (Markov chain Monte Carlo) as a sampling tool. We plot the posterior density for each
predictor, with the median estimate marked on each graph.
Evaluating the model parameters (Posterior)
We evaluate the model parameters by analyzing the posteriors using some specific
statistical measures. We use the describe_posterior() function from the bayestestR package for the
following output.
We have the 95% credible set for the model, which gives information about the uncertainty of the
regression coefficients. We will also use an equal-tailed interval and the highest
density interval.
The interpretation of any such credible interval in the Bayesian approach is that, with 95%
probability (given the data), a coefficient lies above the low value and below the high value.
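The difference between the equal-tailed interval and the highest density interval can be illustrated directly on a vector of posterior draws. The sketch below uses base R only; the draws are simulated from a skewed distribution chosen for illustration, and `hdi_95()` is a hypothetical helper written for this example, not the bayestestR implementation.

```r
# Simulated, right-skewed "posterior draws" for one coefficient (hypothetical values).
set.seed(2022)
draws <- rgamma(4000, shape = 3, rate = 0.002)

# Equal-tailed interval: cut 2.5% of the draws from each tail.
eti_95 <- quantile(draws, c(0.025, 0.975), names = FALSE)

# Highest density interval: the shortest window that contains 95% of the sorted draws.
hdi_95 <- function(x, prob = 0.95) {
  x <- sort(x)
  m <- ceiling(prob * length(x))          # number of draws inside the interval
  starts <- seq_len(length(x) - m + 1)    # candidate left endpoints
  widths <- x[starts + m - 1] - x[starts]
  best <- which.min(widths)
  c(x[best], x[best + m - 1])
}
hdi <- hdi_95(draws)

cat("ETI:", round(eti_95, 1), "\n")
cat("HDI:", round(hdi, 1), "\n")
```

For a symmetric posterior the two intervals nearly coincide; for a skewed one the highest density interval is shorter and shifted toward the mode.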
a p-value in the Bayesian framework. It is the probability that the effect goes in the positive or
in the negative direction. The Rhat value checks for convergence; we have a value close to 1 for
every variable, so we do not have any convergence problem with MCMC. The ESS is
the effective sample size generated by the MCMC for each parameter; generally, the higher the ESS the
better.
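The direction probability described above (reported as pd, the probability of direction, by bayestestR) is simple to compute from posterior draws. The following is a minimal base-R sketch on simulated draws; the mean and standard deviation used to generate them are hypothetical and are not taken from the fitted model.

```r
# Hypothetical posterior draws for a single coefficient.
set.seed(7)
draws <- rnorm(4000, mean = 1700, sd = 900)

# Share of draws on the dominant side of zero; it always lies between 0.5 and 1.
pd <- max(mean(draws > 0), mean(draws < 0))
cat("Probability of direction:", round(pd, 3), "\n")
```

A value close to 1 means almost all of the posterior mass lies on one side of zero, while a value near 0.5 means the sign of the effect is uncertain.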
Sale Price = -29298 - 804 Age + 1706 Pct College + 52706 Bedrooms + 2.265 Lot Size
The coefficient estimates for our regression model and Bayesian model are nearly the same, with
the Bayesian coefficients being on the higher side. The 95% intervals for both our models
are compared below. The frequentist method gives us narrower intervals, and on that basis we
conclude that the linear model is better than the Bayesian model.
Appendix
library(readxl)  # read_excel()
library(dplyr)   # sample_n()
RealEstate <- read_excel("D:/STAT 4444/RealEstate.xls")
RealEstate
# Take a random sample of 100 observations from the main dataset
RE100 <- sample_n(RealEstate, 100)
RE100
# Histograms of each variable
hist(RE100$Age)
hist(RE100$Pct.College)
hist(RE100$Bedrooms)
hist(RE100$Lot.Size)
hist(RE100$Sale.Price)
summary(RE100)
install.packages("mlbench")
library(mlbench)
install.packages("rstanarm")
library(rstanarm)
install.packages("bayestestR")
library(bayestestR)
library(bayesplot)
library(insight)
library(broom)
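# The model fit is not shown in the original listing; the call below is an assumed
# reconstruction, with column names taken from the parameter names plotted below.
model_bayes <- stan_glm(SalePrice ~ Age + PctCollege + Bedrooms + LotSize, data = RE100)
print(model_bayes, digits = 3)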
mcmc_dens(model_bayes, pars=c("PctCollege"))+
vline_at(1688.493, col="red")
mcmc_dens(model_bayes, pars=c("Bedrooms"))+
vline_at(52794.568, col="red")
mcmc_dens(model_bayes, pars=c("LotSize"))+
vline_at(2.260, col="red")
BIC(model_bayes)
post <- get_parameters(model_bayes)  # posterior draws, via the insight package
print(purrr::map_dbl(post, map_estimate), digits = 3)
hdi(model_bayes)
eti(model_bayes)