ACKNOWLEDGEMENT
We are greatly indebted to our Mini Project guide Ms. Manjramkar M.A. (Assistant Professor) for her valuable guidance throughout this work. It has been an altogether different experience to work with her, and we would like to thank her for her help, suggestions and numerous discussions.
We gladly take this opportunity to thank Prof. Hashmi S.A. (Head of Information Technology, MGM's College of Engineering, Nanded).
Last but not least, we are also thankful to all those who helped us directly or indirectly to develop this project and complete it successfully.
Vikas Jondhale[01]
Akshay Narsanne[15]
Renuka Opalkar[17]
ABSTRACT
The House Price Prediction project is often called the "Hello World" of the machine learning world. It is a simple project that uses Linear Regression to predict house prices. We work on a dataset which contains information about the location of each house, its price, and other aspects such as square footage.
When working on this sort of data, we need to see which columns are important for us and which are not. Our main aim is to build a model which gives a good prediction of the price of a house based on the other variables.
We use Linear Regression on this dataset and see whether or not it gives good accuracy. The dataset is stored in a CSV file for a detailed view and easier access; the CSV file should be saved in the same directory as the Python file.
We go through how to implement the entire machine learning pipeline and build an intuitive understanding of the machine learning algorithms involved.
CONTENTS
TITLE
ACKNOWLEDGEMENT
ABSTRACT
CONTENTS
LIST OF ABBREVIATIONS
LIST OF FIGURES
1 INTRODUCTION
2 PROJECT PIPELINE
2.1 Pandas
2.2 Scikit-learn
2.3 NumPy
2.4 Seaborn
2.5 Project Pipeline
3 METHODS
3.1 Cleaning data
3.2 Prediction and evaluation
3.3 Error metrics
4 ALGORITHMS
4.1 Machine Learning algorithms
4.2 The Ames Housing data
5 HYPERPARAMETERS
5.1 Grid search over hyperparameters
5.2 k-NN hyperparameters
5.3 Random Forest hyperparameters
5.4 Algorithm comparison
CONCLUSION
REFERENCES
LIST OF ABBREVIATIONS
LIST OF FIGURES
2.5 Land
3.2 Cross-Validation
4.1 K-Nearest Neighbor
5.2 MAE for k-NN with values for k from 1 to 10
5.3 MAE for Random Forests with estimators from 1 to 50
5.3.2 MAE for Random Forests with max_features from 1 to 221
5.4 Errors of the two methods
Chapter 1
INTRODUCTION
This project aims to build an end-to-end application capable of predicting house prices better than individual estimates. Another motivation for this project is to implement a similar solution at our workplace and help the Investments and Residential teams make data-driven decisions. One reason for the increasing demand for property is population growth.
Census results indicate that the younger generation will need to buy a house in the future. Based on preliminary research, there are two standards of house price used in the buying and selling of a house: the price set by the developer and the market selling price.
We will be building models to predict house prices using Census data, which consists of metrics such as population, median income and median house price for each block group, where a block group typically has a population of 600 to 3,000.
The ultimate goal of the project is to build a prediction engine capable of predicting a district's median housing price. We know that this is a supervised learning problem, as our data set consists of labelled observations.
It looks like multivariate regression should be our go-to option, but we will explore multiple ways of building the model and finally pick the one with the lowest error, measured by RMSE (Root Mean Square Error), MAE (Mean Absolute Error) or any other metric we choose.
Python is commonly used for developing websites and software, task automation, data analysis, and data visualization. Since it is relatively easy to learn, Python has been adopted by many non-programmers, such as accountants and scientists, for a variety of everyday tasks like organizing finances.
Python has become a staple in data science, allowing data analysts and other professionals to use
the language to conduct complex statistical calculations, create data visualizations, build machine
learning algorithms, manipulate and analyze data, and complete other data-related tasks.
Python can build a wide range of different data visualizations, like line and bar graphs, pie charts,
histograms, and 3D plots. Python also has a number of libraries that enable coders to write
programs for data analysis and machine learning more quickly and efficiently, like TensorFlow
and Keras.
The Jupyter Notebook is an open-source web application that allows you to create and share
documents that contain live code, equations, visualizations, and narrative text. Its uses include
data cleaning and transformation, numerical simulation, statistical modeling, data visualization,
machine learning, and much more. The name Jupyter comes from the core programming languages it supports: Julia, Python, and R. Jupyter ships with the IPython kernel, which allows you to write your programs in Python. Jupyter Notebook (formerly IPython Notebooks) is a web-based interactive computational environment for creating Jupyter notebook documents. The term "notebook" can colloquially refer to many different entities, mainly the Jupyter web application, the Jupyter Python web server, or the Jupyter document format, depending on context.
These are frameworks in Python that handle commonly required tasks. We implore any budding data scientist to familiarize themselves with these libraries.
Chapter 2
PROJECT PIPELINE
2.1 Pandas
pandas is a Python library for data analysis. Started by Wes McKinney in 2008 out of a need for
a powerful and flexible quantitative analysis tool, pandas has grown into one of the most popular
Python libraries. It has an extremely active community of contributors.
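As a minimal sketch of what this looks like in practice, the snippet below loads and inspects a housing data set with pandas; the file name "housing.csv" is a placeholder, not the actual data set used in this project.

    import pandas as pd

    # Load the data set from a CSV file (placeholder path) and take a first look.
    df = pd.read_csv("housing.csv")
    print(df.shape)       # number of rows and columns
    print(df.head())      # first five rows
    print(df.describe())  # summary statistics for the numeric columns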
2.2 Scikit-learn
Scikit-learn is probably the most useful library for machine learning in Python. The sklearn library contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction.
2.2.1 Components
Cross-validation:
Cross-validation is a technique for evaluating ML models by training several models on subsets of the available input data and evaluating them on the complementary subsets. Cross-validation is used to detect overfitting, i.e., failing to generalize a pattern.
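As an illustrative sketch, cross-validation can be run in a few lines with scikit-learn's cross_val_score; the synthetic data and the Linear Regression model below are stand-ins, not the exact setup used later in this report.

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic regression data standing in for a real housing data set.
    X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
    model = LinearRegression()

    # Five-fold cross-validation: each fold in turn serves as the held-out test set.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    print("MAE per fold:", -scores)
    print("Mean MAE:", -scores.mean())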
Scikit-learn also offers a wide range of machine learning algorithms, from clustering, factor analysis and principal component analysis to unsupervised neural networks.
It also ships with various well-known academic datasets (e.g. the Iris dataset and the Boston House Prices dataset). Having these at hand while learning the library helps a lot.
Feature extraction:
Scikit-learn provides tools for extracting features from images and text (e.g. bag of words).
2.3 NumPy
NumPy is the fundamental package for scientific computing in Python. It is a Python library that
provides a multidimensional array object, various derived objects (such as masked arrays and
matrices), and an assortment of routines for fast operations on arrays, including mathematical,
logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear
algebra, basic statistical operations, random simulation and much more.
At the core of the NumPy package is the ndarray object. This encapsulates n-dimensional arrays of homogeneous data types, with many operations performed in compiled code for performance.
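A small sketch of the ndarray in action:

    import numpy as np

    # A homogeneous 2-D array; operations on it run in compiled code.
    a = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
    print(a.shape)         # (2, 3)
    print(a.mean(axis=0))  # mean of each column
    print(a * 2)           # elementwise arithmetic without a Python loop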
2.4 Seaborn
Seaborn is an open-source Python library built on top of matplotlib. It is used for data
visualization and exploratory data analysis. Seaborn works easily with dataframes and the
Pandas library. The graphs created can also be customized easily. Below are a few benefits of
Data Visualization.
Graphs can help us find data trends that are useful in any machine learning or forecasting project.
Visually attractive graphs can make presentations and reports much more appealing to the
reader.
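The sketch below shows two typical exploratory plots with seaborn on a small hand-made DataFrame; the column names "sqft" and "price" are illustrative, not columns from the actual data set.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # A tiny illustrative DataFrame; a real project would load this from a file.
    df = pd.DataFrame({"sqft": [800, 950, 1100, 1400, 1700],
                       "price": [90000, 105000, 130000, 155000, 185000]})

    sns.scatterplot(data=df, x="sqft", y="price")  # trend between area and price
    plt.show()

    sns.histplot(df["price"], bins=5)              # distribution of prices
    plt.show()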
2.5 Project Pipeline
Generally speaking, machine learning projects follow the same process: data ingestion, data cleaning, exploratory data analysis, feature engineering and finally machine learning.
The pipeline is not linear and you might find you have to jump back and forth between different stages. It is important to mention this because tutorials often make the process look much cleaner than it is in reality. So please keep this in mind: your first machine learning project might be a mess.
Before we begin, we will reiterate: machine learning is an iterative process and it is rarely straightforward! Please do not be discouraged if you find yourself lost in an ML project. Keep reading, keep experimenting, keep asking questions and one day it will click.
The rest of this report walks through the stages of the project pipeline. Where useful, we drop in Python code examples.
Chapter 3
METHODS
For the house price problem, the machine learning algorithms k-NN and Random Forest are well suited. Instead of implementing the algorithms from scratch for this study, implementations from the scikit-learn library have been used. Scikit-learn is a state-of-the-art library, part of the scikit suite of scientific toolkits for Python.
We have also used the Python data analysis library Pandas. Prior to comparing the algorithms, the data set has been pre-processed and cleaned so that the algorithms can take the data as input. Moreover, a method for evaluating the predictions has been established, and finally the machine learning algorithms have been executed on the cleaned data set and tested with different values for the relevant hyperparameters.
3.1 Cleaning data
Machine learning algorithms are largely implemented to take only numeric data as input. More than half of the columns in the Ames Housing data set are non-numerical and need to be encoded, in this case using one-hot encoding and labeling. Additionally, various columns contain empty values that have been dealt with in different ways, as described in this section.
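A minimal sketch of these cleaning steps with pandas, assuming illustrative column names rather than the actual Ames columns:

    import pandas as pd

    # Toy data with a nominal column and a discrete count column (illustrative names).
    df = pd.DataFrame({
        "RoofType": ["Gable", "Hip", "Gable", None],
        "Fireplaces": [1, 0, None, 2],
    })

    # Deal with empty values: a constant for the categorical column,
    # the median for the count column.
    df["RoofType"] = df["RoofType"].fillna("Unknown")
    df["Fireplaces"] = df["Fireplaces"].fillna(df["Fireplaces"].median())

    # One-hot encode the nominal column so the algorithms receive numeric input.
    df = pd.get_dummies(df, columns=["RoofType"])
    print(df)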
The numerical variables of the data set take on a large range of values, and depending on the column these ranges can be quite different. In order not to introduce bias, normalization has been applied, scaling the numbers to the range [0, 1]. The Euclidean norm has been used for normalizing the data in this study, although there are other options.
The Euclidean norm is denoted $\|x\|_2$ and is calculated with the following formula:
$$\|x\|_2 = \sqrt{x_1^2 + \cdots + x_n^2}$$
where $x_1, \ldots, x_n$ are all of the values of a feature in the data set. Each value of the feature is then normalized by dividing it by the Euclidean norm. [12] [13] [14] Normalization with the Euclidean norm has been used for all features in the data.
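A sketch of this per-feature normalization in NumPy, dividing each column by its Euclidean norm (toy values, assuming two features):

    import numpy as np

    # Two features per row, e.g. living area and number of bathrooms (toy values).
    X = np.array([[1500.0, 3.0],
                  [2100.0, 4.0],
                  [ 900.0, 2.0]])

    norms = np.linalg.norm(X, axis=0)  # Euclidean norm of each feature column
    X_normalized = X / norms           # every value divided by its column's norm
    print(X_normalized)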
3.2 Prediction and evaluation
In order to compare the machine learning models, they have been tested on data where the price of the properties is known. Therefore, the data set is split into training data and testing data. Training data is, as the name suggests, used to train the machine learning model and constitutes the larger share of the split data set. When the model is fit with the training data, the sales prices of the properties are known to the model in order for it to learn from the data.
The testing data is used to get a measurement of how well the machine learning model predicts
house prices. The machine learning model is given the test data but without the price of the
properties in order to predict the price for them given the various features for the properties. The
predicted price is then compared to the actual price in the test data.
The data set is used in two ways: first to train the algorithm, and then to test it, and for these purposes we have split the set in two. The ratio between the number of rows in the training data and the test data needs to be carefully selected. If the test data is too small, the result is less convincing since the model is not tested on a large variety of rows.
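A sketch of such a split with scikit-learn, assuming an 80/20 ratio and synthetic stand-in data:

    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

    # Hold out 20 % of the rows as test data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    print(len(X_train), "training rows,", len(X_test), "test rows")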
Increasing the test data size improves reliability but reduces the number of rows in the training
data which causes the model to predict worse. A way of mitigating the effects of a larger test set
is to use cross-validation, which is used for this experiment. [17] Cross-validation means running
the model on the data set multiple times, alternating the part of the data that is used as test data.
Using five-fold cross-validation is common practice, meaning 20 % of the data set is used as testing data, alternated over five runs. On the first run the first 20 % of the data set is used as testing data and the remaining 80 % is used as training data. On the second run the subsequent 20 % is used as testing data, and so on.
Since the Ames Housing data set is relatively small, cross-validation is an important part of ensuring that a small error for a model is not merely the effect of one particular set of test data working well with the model while another set might not.
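The alternation of the test fold can be sketched with scikit-learn's KFold (the feature matrix here is a stand-in for the cleaned data):

    import numpy as np
    from sklearn.model_selection import KFold

    X = np.arange(20).reshape(10, 2)  # stand-in for the cleaned feature matrix
    kf = KFold(n_splits=5)

    # Each run holds out a different 20 % of the rows as test data.
    for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
        print(f"Run {fold}: test rows {test_idx}")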
Both the k-Nearest neighbor and Random Forest algorithms have a variety of hyperparameters directly affecting their functioning and, by extension, their predictions. A hyperparameter is a parameter to the algorithm that is set prior to computation, as opposed to parameters that are derived through training. This study has focused on the particular hyperparameters introduced in chapter 5.
3.3 Error metrics
To measure how good the model's predictions are, four error metrics have been used: mean absolute error (MAE), mean squared error (MSE), median absolute error (MedAE) and the coefficient of determination (R²).
Mean absolute error measures the prediction error by taking the mean of the absolute values of all errors, that is:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$$
where $n$ is the number of samples, $y_i$ are the target values and $\hat{y}_i$ are the predicted values. An MAE closer to 0 means that the model predicts with lower error; the closer the MAE is to 0, the better the prediction.
Mean squared error is similar to MAE, but the impact of each term is quadratically proportional to its size. It measures the prediction error by taking the mean of the squares of all errors, that is:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
The median absolute error (MedAE) is the median of all absolute differences between the predicted values and the target values. In contrast to MAE and MSE, the median absolute error is more robust to outliers by virtue of using the median instead of the mean.
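All four metrics are available in scikit-learn; the sketch below computes them on toy target and predicted prices:

    from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                                 median_absolute_error, r2_score)

    # Toy actual and predicted sale prices.
    y_true = [200000, 150000, 310000, 120000]
    y_pred = [195000, 160000, 290000, 125000]

    print("MAE:  ", mean_absolute_error(y_true, y_pred))
    print("MSE:  ", mean_squared_error(y_true, y_pred))
    print("MedAE:", median_absolute_error(y_true, y_pred))
    print("R2:   ", r2_score(y_true, y_pred))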
Chapter 4
ALGORITHMS
The house pricing problem was approached by Baldominos et al. from the viewpoint of finding investment opportunities. They formulated it as a regression problem and used several machine learning algorithms such as k-Nearest neighbors, variations of neural networks and decision trees. Another study, by Oxenstierna, investigated the problem for the purpose of house valuation.
Their data set included 5,000 entries. Again, the k-Nearest neighbor method was used, as well as Artificial Neural Networks, to minimize the median absolute percentage error of the prediction. The methods performed similarly, at around 8-9 % median absolute percentage error.
4.1 Machine Learning algorithms
In this study two machine learning algorithms were compared against each other in order to investigate which one is more successful in predicting housing prices. As mentioned in the previous section, Baldominos et al. performed a similar study in which they compared four machine learning algorithms for housing prices. In their study they found that the Random Forest regression algorithm predicted with the smallest error, followed by k-Nearest neighbors regression. In the study performed by Oxenstierna [4], k-Nearest neighbors regression and Artificial neural networks are suggested as methods for predicting house prices. Even though Oxenstierna finds Artificial neural networks to perform better in many cases, this study has excluded them in order to limit the scope of the report and the time frame of the project. Since both reports study the performance of k-Nearest neighbors regression, the algorithm will be studied in this report as well.
4.1.1 k-Nearest neighbors regression
k-Nearest neighbors (k-NN) is a non-parametric algorithm that can be used for both classification and regression problems. The algorithm relies on the assumption that items in the data set should have similar values for the prediction target if they share similar values for other features. In our particular case, the house price is the target variable, predicted from the k neighbors that are most similar in their features. The data can be conceptualized as points in an n-dimensional Cartesian space. As an example, in the 2-dimensional case, we have two features whose values are represented as points on a plane. Figure 4.1 illustrates the case when k = 3.
Predicting the value of a sample by the single closest neighbor from the training data does not capture much information, so multiple data items are used for prediction. If the prediction is based on the k nearest neighboring items (hence the name of the algorithm), more information underlies the estimation and a more accurate prediction can be obtained.
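A minimal k-NN regression sketch with scikit-learn, using k = 3 as in the figure and synthetic stand-in data rather than the Ames data set:

    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsRegressor

    X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Predict each test price from its 3 nearest training neighbors,
    # weighted by inverse distance.
    knn = KNeighborsRegressor(n_neighbors=3, weights="distance")
    knn.fit(X_train, y_train)
    print(knn.predict(X_test[:5]))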
4.1.2 Random Forest regression
Random Forest is an algorithm which can be used both for classification and regression. Random forest models are constructed from a collection of decision trees built on the training data. Instead of taking the target value from a single tree, the Random Forest algorithm bases its prediction on the average prediction of the collection of trees. The decision trees themselves are constructed by fitting to randomly drawn groups of rows and columns in the training data. This method is called bagging, and it results in a reduction of variance as each tree is built on different, randomly chosen parts of the input. Averaging the predictions of the decision trees reduces the overfitting that can occur when using single decision trees. [7, pp. 587-588] [8]
The number of trees in the Random Forest is an important hyperparameter of the algorithm, called 'n_estimators': the more trees used, the more overfitting is prevented. The tradeoff, however, is an increase in the computation time needed. 'n_estimators' will be tested with different values.
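A corresponding Random Forest regression sketch, again on synthetic stand-in data, with 'n_estimators' set to 50 trees:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Each prediction is the average over the predictions of 50 decision trees.
    forest = RandomForestRegressor(n_estimators=50, random_state=0)
    forest.fit(X_train, y_train)
    print(forest.predict(X_test[:5]))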
4.2 The Ames Housing data
In order to train a reliable machine learning model, a sufficiently large data set was required. The Ames Housing data set, compiled by De Cock, is a fairly complete data set consisting of almost 3,000 entries for houses in the city of Ames, Iowa, with as many as 80 different variables describing various property details. Because of the number of entries and features, the Ames Housing data set was selected as a good fit for this project, with enough data for both training and testing the machine learning model. Further details about the data set can be found in its Data Documentation.
The data set includes a large number of variables describing almost every aspect of the
properties that might be of interest when evaluating a house. There are 23 nominal, 23 ordinal,
13 discrete and 20 continuous variables for each property. The nominal variables are for example
the roof type or the material on the exterior facade of the property.
The nominal variables have about 5 to 10 possible values each, although three of the variables have about 20 different possible values. These have been "one-hot" encoded for this study, which is described in more detail in section 3.1. The 23 ordinal variables refer to various types of qualities of the properties, including the overall quality of the property, the quality of the basement, etc. Some of the ordinal variables take on numerical values while others use categorical values. The categorical values have been "one-hot" encoded while the numerical values have been left as is.
Almost all of the 13 discrete variables describe a count of some kind, for example the number of fireplaces or bathrooms. Since these variables are numerical by default, they have only been treated for missing values. The 20 continuous variables mostly describe areas of various attributes of the house or of the entire lot. Like the discrete variables, these do not require any further encoding to use them in a prediction model.
Chapter 5
HYPERPARAMETERS
5.1 Grid search over hyperparameters
The grid search algorithm was used to find the best set of values for the selected hyperparameters of each algorithm, presented in this section. In short, the following was found: for the k-Nearest neighbor algorithm the best value for the 'k' hyperparameter is 9 neighbors, and for the 'weights' hyperparameter the best value was 'distance' (inversely proportional).
5.2 k-NN hyperparameters
Two hyperparameters have been tested for k-NN with different values. The k number of neighbors to use for the algorithm has been tested for the values 1 through 10, comparing by MAE.
As can be seen in the figure, the MAE is lowest when k is equal to 9, which provides the best prediction with the current setup for the k-NN algorithm when k ranges between 1 and 10.
The value for the ’weights’ hyperparameter was set to ’distance’ when testing the different
values for the k parameter since this value produced the lowest error in the grid search over the
hyperparameters.
Fig 5.2: MAE for k-NN with values for k from 1 to 10
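A sketch of the grid search over the k-NN hyperparameters with scikit-learn's GridSearchCV; the synthetic data stands in for the cleaned Ames features:

    from sklearn.datasets import make_regression
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsRegressor

    X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

    param_grid = {
        "n_neighbors": list(range(1, 11)),    # k from 1 to 10
        "weights": ["uniform", "distance"],   # plain vs inverse-distance averaging
    }

    # Exhaustively try every combination, scored by MAE under 5-fold CV.
    search = GridSearchCV(KNeighborsRegressor(), param_grid,
                          scoring="neg_mean_absolute_error", cv=5)
    search.fit(X, y)
    print(search.best_params_)  # best combination found on this data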
5.3 Random Forest hyperparameters
The Random Forest algorithm has been tested with three different hyperparameters: 'n_estimators', 'max_features' and 'criterion'. 'n_estimators', which represents the number of decision trees in the random forest, has been tested with values ranging between 1 and 50, and the results are presented in figure 5.3, where the values are compared by MAE.
When testing the 'n_estimators' parameter, the 'criterion' hyperparameter was set to 'mse' and the 'max_features' hyperparameter was set to 63, since those values performed best in the grid search. The figure shows the 'n_estimators' value that yields the lowest MAE.
Fig 5.3: MAE for Random forests with estimators from 1 to 50
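A sketch of how a curve like the one in the figure above could be produced: the MAE on a held-out set is computed for forests with 1 to 50 trees (synthetic stand-in data):

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit one forest per tree count and record its test-set MAE.
    for n in range(1, 51):
        forest = RandomForestRegressor(n_estimators=n, random_state=0)
        forest.fit(X_train, y_train)
        mae = mean_absolute_error(y_test, forest.predict(X_test))
        print(n, round(mae, 1))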
Secondly, the 'max_features' hyperparameter was tested with values from 1 to 221, which is the total number of features in the cleaned data set (including one-hot encoded features). The results are presented in figure 5.3.2, where values of 'max_features' are compared by MAE. The value with the lowest MAE is 63.
Fig 5.3.2: MAE for Random forests with max_features from 1 to 221.
5.4 Algorithm comparison
Finally, the errors of the k-Nearest neighbor algorithm and the Random Forest algorithm are compared, which is the focus of this study, with the respective algorithms running on their selected optimal hyperparameters. On all four error metrics investigated, it is clear that the Random Forest algorithm performs considerably better than the k-Nearest neighbor algorithm with regard to predicting with the smallest error.
CONCLUSION
In this project, we developed and evaluated the performance and the predictive power of a model trained and tested on data collected from houses in Boston's suburbs. Once we obtained a good fit, we used this model to predict the monetary value of a house located in the Boston area.
Thousands of houses are sold every day, and there are questions every buyer asks: What is the actual price this house deserves? Am I paying a fair price? In this report, a machine learning model was proposed to predict a house price based on data related to the house (its size, the year it was built, etc.). During the development and evaluation of our model, we showed the code used for each step followed by its output, which facilitates the reproducibility of our work. In this study, the Python programming language with a number of Python packages was used.
The research question for this study is how well house prices can be predicted using k-Nearest neighbor and Random Forest regression. In this study we have found that the Random Forest regression algorithm performs better at predicting house prices than the k-Nearest neighbor algorithm.
However, there is still a difference between the actual prices in our testing data and the prices predicted by the Random Forest regression algorithm. The lowest error achieved was for the Random Forest, with an MAE of 16,208.5 dollars, about 9 % of the mean price.
REFERENCES
[1] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow.
[2] https://www.coursera.org/articles/machine-learning-algorithms