Machine Learning Approach for House Price Prediction

Article in Asian Journal of Research in Computer Science · June 2023

DOI: 10.9734/ajrcos/2023/v16i2339


Asian Journal of Research in Computer Science

Volume 16, Issue 2, Page 54-61, 2023; Article no.AJRCOS.101262

ISSN: 2581-8260

M. Jagan Chowhaan a, D. Nitish a, G. Akash a, Nelli Sreevidya a* and Subhani Shaik a
a Department of IT, Sreenidhi Institute of Science and Technology, Yamnampet, Ghatkesar, Hyderabad, India.

Received: 03/04/2023
Accepted: 05/06/2023
Published: 15/06/2023


Real estate is a distinct sector of the economy. The ability to extract essential information from raw data makes it considerably easier to predict house prices and to identify the housing characteristics that matter most. Housing costs fluctuate daily and occasionally rise in ways that valuation calculations do not anticipate. According to research, changes in property prices frequently affect both homeowners and the real estate market. A literature review is conducted to analyze the key factors and the best predictive models for home prices. Its findings support artificial neural networks, support vector regression, and linear regression as the most effective modeling techniques. Our results also imply that real estate agents and geography play important roles in determining property prices. This study should help housing developers and researchers in particular to identify the most crucial factors affecting housing prices and the best machine learning model to use.


*Corresponding author: E-mail: sreevidya1509@gmail.com;

Chowhaan et al.; Asian J. Res. Com. Sci., vol. 16, no. 2, pp. 54-61, 2023; Article no.AJRCOS.101262

Keywords: House price prediction; linear regression; machine learning.

1. INTRODUCTION

In this report, we propose our system, “House price prediction using Machine Learning”. Along with other fundamental requirements such as food and water, a place to call home is one of a person's most basic needs. In the real estate sector, predicting house prices is essential because it helps buyers and sellers make informed choices [1,2]. Advances in machine learning have produced numerous algorithms for accurately predicting property prices. In this research, we use a dataset of real estate properties along with XGBoost, an advanced gradient boosting technique, to forecast house values [3-5]. XGBoost is a powerful algorithm that handles structured datasets effectively [6-8]. It has been found to perform well on complex datasets and has been used in a number of machine-learning competitions. In this experiment, we used XGBoost to predict housing prices and assessed its effectiveness.

The aim of house price prediction is to create a model that can precisely estimate the price of a new house from its attributes, using previous data on house features (such as square footage, number of bedrooms and bathrooms, location, etc.) and their corresponding prices. In this project, we applied five algorithms, namely linear regression, support vector machine, Lasso regression, Random Forest, and XGBoost, to predict house prices using a dataset of real estate properties. XGBoost is an effective algorithm for this purpose because it can handle a large number of features and capture intricate relationships between the features and the target variable (price).

1.1 Explanation

Input: The input section represents the initial stage of the house price prediction process. Here, one gathers the relevant data that could influence house prices; this forms the dataset. The data may include factors such as the size of the house, number of bedrooms, location, neighborhood amenities, historical sales data, and other relevant features.

Preprocessing: In the preprocessing stage, the collected data goes through various cleaning and transformation steps to ensure its quality and suitability for analysis. This involves tasks like handling missing values, removing outliers, normalizing or scaling the data, and encoding categorical variables. Preprocessing prepares the data for effective modeling [9-13].
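The cleaning steps named in the preprocessing stage can be illustrated with a minimal pandas sketch. The column names (`location`, `total_sqft`, `bath`, `balcony`, `price`), the toy values, the 1.5 x IQR outlier rule, and min-max scaling are all assumptions chosen for illustration; the paper does not state which specific rules were applied.

```python
import pandas as pd

# Hypothetical raw listings; None marks a missing entry.
raw = pd.DataFrame({
    "location": ["Whitefield", "Hebbal", None, "Whitefield", "Hebbal", "Whitefield"],
    "total_sqft": [1000.0, 1200.0, 1100.0, None, 1300.0, 20000.0],
    "bath": [2, 2, None, 3, 2, 4],
    "balcony": [1, 2, 1, None, 0, 3],
    "price": [50.0, 62.0, 55.0, 70.0, 66.0, 900.0],
})

NUMERIC_COLS = ["total_sqft", "bath", "balcony"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Impute missing values, drop area outliers, scale, and encode."""
    df = df.copy()

    # 1. Missing values: median for numeric columns, most frequent value for location.
    for col in NUMERIC_COLS:
        df[col] = df[col].fillna(df[col].median())
    df["location"] = df["location"].fillna(df["location"].mode()[0])

    # 2. Outliers: keep rows whose area lies within 1.5 * IQR of the quartiles.
    q1, q3 = df["total_sqft"].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = df["total_sqft"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    df = df[keep].reset_index(drop=True)

    # 3. Scaling: min-max scale each numeric feature to the range [0, 1].
    for col in NUMERIC_COLS:
        lo, hi = df[col].min(), df[col].max()
        df[col] = (df[col] - lo) / (hi - lo) if hi > lo else 0.0

    # 4. Encoding: one-hot encode the categorical location column.
    return pd.get_dummies(df, columns=["location"])


clean = preprocess(raw)
print(clean)
```

The target column `price` is deliberately left unscaled so that prediction errors stay in the original price units.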

Fig. 1. Flow of execution


Model: The model section represents the core of the house price prediction process. Here, one selects an appropriate machine learning algorithm or ensemble of algorithms to build a predictive model. Commonly used algorithms for house price prediction include linear regression, decision trees, random forests, support vector machines, and neural networks. The model takes the preprocessed data as input and learns patterns and relationships within the data to make predictions on house prices.

Ensembling: Ensembling refers to the practice of combining multiple predictive models to improve the accuracy and robustness of the predictions. In this stage, one employs techniques such as averaging, bagging, boosting, or stacking to create an ensemble model. By leveraging the strengths of different models, ensembling aims to achieve more accurate and reliable predictions by reducing bias and variance.

Output: The output section represents the final stage of the house price prediction process. Here, the trained model or ensemble provides predictions on house prices based on the given input data. The predictions can be in the form of specific price values or in percentage.

2. PROBLEM STATEMENT

The asking price and general description are frequently presented separately from the generic, standardized real estate attributes. Because these attributes are given separately and in a systematic manner, they can easily be compared across the entire range of potential houses. Because every house also has distinctive elements, such as a particular view or style of washbasin, sellers may summarize the key aspects of the house in the description. Potential purchasers can take all the provided real estate features into account, but owing to their great diversity, it is almost impossible to provide an automatic comparison of all variables. The same applies in the opposite direction: house sellers must evaluate the worth of a house from its attributes in relation to the current market price of comparable houses. The variety of features makes it difficult to determine a fair market price. In addition to outlining the property's essential features and capturing the reader's curiosity, the house description functions as a persuasive tool.

Housing prices are a significant indicator of the health of the economy, and both buyers and sellers are keenly interested in price points. In this study, explanatory variables that encompass a wide range of residential dwelling characteristics are used to forecast house values. The objective of this project is to develop a regression model that can precisely estimate a house's price given its attributes.

3. LITERATURE REVIEW

1. Sushant Kulkarni (2021). Testing the dataset using four distinct regression algorithms (Lasso Regression, Logistic Regression, Decision Tree, and Support Vector Regression) is one of the approaches suggested in the study by Neelam Shinde and Kiran Gawande. When compared on error metrics such as R-squared, Mean Absolute Error, Mean Squared Error, and Root Mean Squared Error, Decision Tree emerged as the best algorithm, with the highest accuracy score of 86.4 and the lowest error values, while Lasso Regression performed the worst, with an accuracy score of 60.32.
2. To predict the cost of resale homes, P. Durganjali suggested using classification algorithms. The selling price of a property is predicted in this study using a variety of classification methods, including Linear Regression, Decision Tree, K-Means, and Random Forest. A home's price is influenced by its physical attributes, its geographic location, and even the state of the economy. The authors apply these techniques, use RMSE as the performance metric for different datasets, and identify the most accurate model.
3. Manasa and Gupta chose Bengaluru as the case study city. The square footage of the property, its location, and its amenities are all significant determinants of price. Nine different attributes are used. For the experimental work, Multiple Linear Regression (Least Squares), Lasso/Ridge Regression, SVM, and XGBoost are employed.
4. According to Panjali and Vani, resale value matters especially to those who plan to live in a house for a long time before selling it again, and to people who want to take no risks when building their houses. To determine the house's resale value, the authors use a variety of classification techniques, including Logistic Regression, Decision Trees, Naive Bayes, and Random Forest. Additionally, the study uses the AdaBoost method to turn weak learners into strong ones. The resale price of a home is determined by its physical attributes, its location, and numerous economic factors prevailing at the time. Accuracy is used to gauge performance in order to identify the best-performing method for each dataset.

4. SYSTEM DESIGN AND ARCHITECTURE

4.1 Phase I: Collection of Data

In this phase, relevant data pertaining to house prices is gathered from reliable sources such as real estate websites and public datasets. The data may include features such as location, size, number of rooms, area_type, availability, and sale prices. Care should be taken to ensure the data is diverse and representative of the target market.

4.2 Phase II: Data Pre-processing

This phase involves cleaning and preparing the collected data for model training. Tasks such as handling missing values, removing outliers, normalizing numerical features, and encoding categorical variables are performed. Feature selection techniques can be applied to identify the most relevant attributes for predicting house prices. Additionally, data splitting techniques such as stratified sampling can be used to create training and testing datasets.

4.3 Phase III: Training the Model

In this phase, various machine learning algorithms are applied to train a predictive model using the pre-processed data. Common approaches include linear regression, decision trees, random forests, or more advanced techniques like gradient boosting or neural networks. The training process involves fitting the model to the training data, optimizing hyperparameters, and evaluating the model's performance using appropriate metrics such as mean squared error or R-squared.

The training set is used to train the machine learning model. It comprises the majority of the dataset and is used to teach the model the patterns and relationships between the input features (e.g., number of rooms, location, square footage) and the target variable (i.e., house prices). The model learns from the training set to make predictions.

4.4 Phase IV: Testing the Model

Once the model is trained, it is evaluated using the testing dataset to assess its predictive capabilities. The model's performance is measured by comparing its predictions with the actual house prices in the testing set. Evaluation metrics such as mean absolute error or root mean squared error can be used to quantify the accuracy of the predictions.

The testing set, on the other hand, is a separate subset of the dataset that is used to evaluate the performance and generalization ability of the trained model. It is unseen by the model during the training phase and is used to assess how well the model can predict house prices on new, unseen data.

The division of the dataset into training and testing sets is typically done randomly, ensuring that the two subsets have similar distributions and characteristics. A common practice is to allocate around 80% of the data to the training set and the remaining 20% to the testing set, and we followed the same practice.

The number of rounds of training depends on various factors such as the complexity of the dataset, the chosen machine learning algorithm, and the performance requirements. Generally, multiple rounds of training are effective in improving the model's accuracy and fine-tuning its performance.

To estimate housing values in this study, we used a number of well-known machine learning methods: support vector machines (SVM), random forest, XGBoost, Lasso regression, and linear regression.

Algorithms: In the process of developing this model, various machine learning algorithms were studied. The model is trained on support vector machines (SVM), random forest, XGBoost, Lasso regression, and linear regression. Of these, Random Forest gives the highest accuracy in predicting housing prices, and XGBoost gives the next highest. XGBoost is preferred due to its ability to handle complex, structured datasets and its ability to automatically handle missing values and outliers. Therefore, we recommend the use of XGBoost for house price prediction tasks in the real estate industry. The choice of algorithm depends on the dimensions and type of data used. XGBoost is the best fit for our dataset.

XGBoost: XGBoost observes the features of the attributes and trains the model by analyzing the given features. It analyzes the data from the graph, the attribute combinations, and the labels, including the features.

6. IMPLEMENTATION

Here are the steps that we followed in the implementation.

6.1 Data Collection

Gather a dataset, for example from GitHub or Kaggle, that includes relevant features of houses such as location, number of rooms, square footage, and sale prices. Ensure that the dataset contains the features you intend to consider.

6.2 Data Pre-processing

Clean and prepare the collected data for model training. Handle missing values, perform feature scaling to bring features to a similar range, encode categorical variables, and address outliers. Additionally, one can explore feature engineering techniques to create new meaningful features.

6.3 Model Selection

Choose a suitable machine learning algorithm for house price prediction, considering factors such as the dataset size, feature complexity, and interpretability requirements. Common algorithms include linear regression, decision trees, random forests, or more advanced techniques. We trained and tested five algorithms, of which Random Forest and XGBoost performed best.

6.4 Exploratory Data Analysis

In the exploratory data analysis (EDA) conducted for the house price prediction project, a plot was generated to visualize the relationship between the variables "balcony," "bath," and "price."

The findings from this EDA could be valuable for potential homebuyers, real estate agents, and property developers, as they shed light on the factors that influence house prices.

6.5 Correlation Heatmap

In our exploratory data analysis (EDA) for house price prediction, we created a correlation heatmap to examine the relationships between the variables bath, balcony, and price. The correlation heatmap visually represents the strength and direction of correlations between these variables.

The correlation heatmap reveals valuable insights regarding the influence of bath and balcony on house prices. We observed a positive correlation between the number of bathrooms and the price of the house, indicating that properties with more bathrooms tend to have higher prices. Additionally, we observed a positive correlation between the number of balconies and the house price, suggesting that houses with more balconies may command higher prices as well.
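As a minimal sketch of how such a correlation matrix might be computed, the example below uses pandas on a small made-up table with the three variables discussed. The values are hypothetical and only show the mechanics; they are not the study's data, and Pearson correlation is an assumption, since the paper does not name the coefficient used.

```python
import pandas as pd

# Hypothetical listings with the three variables from the heatmap.
houses = pd.DataFrame({
    "bath":    [1, 2, 2, 3, 3, 4],
    "balcony": [0, 1, 1, 2, 1, 3],
    "price":   [35.0, 52.0, 60.0, 95.0, 88.0, 150.0],
})

# Pairwise Pearson correlation: each entry lies in [-1, 1], and a
# positive value means the two variables tend to rise together.
corr = houses.corr(method="pearson")
print(corr.round(2))
```

The resulting matrix can then be rendered as a heatmap, for example with `seaborn.heatmap(corr, annot=True)`.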

Fig. 2. Data Pre-processing


Fig. 3. Exploratory Data Analysis

Fig. 4. Correlation heatmap
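The evaluation protocol described in Section 4.4 (a random 80/20 split, with the model scored on the held-out 20%) can be sketched with NumPy alone. The synthetic data, the helper names, and the plain least-squares model are assumptions for illustration. The sketch also takes the reported score to be the coefficient of determination (R-squared), which the paper does not state explicitly.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic stand-in data: price depends linearly on area and bathrooms, plus noise.
n = 200
sqft = rng.uniform(500, 3000, size=n)
bath = rng.integers(1, 5, size=n)
price = 0.05 * sqft + 10.0 * bath + rng.normal(0.0, 5.0, size=n)
X = np.column_stack([sqft, bath])


def split_80_20(X, y, rng):
    """Randomly assign 80% of the rows to training and 20% to testing."""
    idx = rng.permutation(len(y))
    cut = int(0.8 * len(y))
    train, test = idx[:cut], idx[cut:]
    return X[train], X[test], y[train], y[test]


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, weights...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    return coef[0] + X @ coef[1:]


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


X_train, X_test, y_train, y_test = split_80_20(X, price, rng)
coef = fit_linear(X_train, y_train)
y_pred = predict(coef, X_test)
print(f"R^2 = {r2_score(y_test, y_pred):.3f}, RMSE = {rmse(y_test, y_pred):.3f}")
```

Scoring only on rows the model never saw during fitting is what makes the two numbers an estimate of generalization rather than of memorization.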


6.6 Training and Testing the Model

In the training and testing phase of our house price prediction model, we trained and tested five different algorithms on the data. This approach allowed us to evaluate the performance and effectiveness of each algorithm in predicting house prices.

The five algorithms employed in our study are linear regression, Lasso regression, XGBoost, random forest, and support vector machines (SVM). Each algorithm was trained on a portion of the preprocessed dataset and then tested on a held-out portion to assess its predictive performance. By utilizing multiple algorithms, we aimed to cover a wide range of modeling techniques and identify the best-performing approach for our specific house price prediction task.

The training and testing phase involved tuning hyperparameters for each algorithm, using techniques such as cross-validation and grid search, to optimize their performance. Evaluation metrics such as mean squared error (MSE), root mean squared error (RMSE), and R-squared were used to compare and assess the accuracy and predictive power of the trained models [14-17].

7. RESULTS AND ANALYSIS

We used various machine learning algorithms to solve this problem.

Random Forest achieves a high accuracy score of 0.903 and a low root mean squared error (RMSE) of 44.032. This suggests that the Random Forest model captures the underlying patterns and relationships in the data effectively, resulting in accurate predictions of house prices. Similarly, XGBoost achieves a commendable accuracy score of 0.887 and a reasonably low RMSE of 47.733. XGBoost is a boosting algorithm that builds an ensemble of weak learners iteratively. It can handle complex feature interactions and effectively capture non-linear relationships, resulting in accurate predictions. The regularization techniques employed in XGBoost help prevent overfitting and improve generalization performance.

Table 1. Model outputs

S. No.  Model                          Score      RMSE
1       Linear Regression              0.790384   64.898435
2       Lasso Regression               0.803637   62.813243
3       Support Vector Machine (SVM)   0.206380   126.278064
4       Random Forest                  0.903507   44.032172
5       XGBoost                        0.886607   47.732530

The superior performance of Random Forest and XGBoost can be attributed to their ability to handle high-dimensional datasets, capture complex relationships, and effectively manage feature interactions. These algorithms are known for their robustness, scalability, and versatility in handling a wide range of machine learning tasks.

The goal of the project "House Price Prediction Using Machine Learning" is to forecast house prices based on various features in the provided data. Our best accuracy was around 90% after we trained and tested the models. To make this model distinct from other prediction systems, more parameters such as tax and air quality must be included. With such a system, people can purchase houses within a budget and minimize financial loss. Numerous algorithms are used to determine house values. The selling price was determined with greater precision and accuracy. People will benefit greatly from this. Numerous elements that influence housing prices must be taken into account and handled.

COMPETING INTERESTS

Authors have declared that no competing interests exist.

REFERENCES

1. Available: https://www.researchgate.net/publication/347584803_House_Price_Predicti Survey_of_Literature
2. Limsombunchai, Christopher Gan, and Minsoo Lee. House price prediction using a hedonic price model vs an artificial neural network. American Journal of Applied Sciences. 3:193–201.
3. Joep Steegmans and Wolter Hassink. An empirical investigation of how wealth and income affect one's financial status and ability to purchase a home. Journal of Housing Economics. 2017;36:8–24.


4. Ankit Mohokar, Nihar Baghat, and Shreyash Mane. House Price Forecasting Using Data Mining. International Journal of Computer Applications. 152:23–26.
5. Joao Gama and Luis Torgo. Logic regression using Classification Algorithms. Intelligent Data Analysis. 4:275–292.
6. Available: https://www.ijraset.com/research-paper/house-price-prediction-using-ml
7. Available: https://ieeexplore.ieee.org/document/8473231
8. Fabian Pedregosa et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 12:2825–2830.
9. Bork M. and Moller VS. House Price Forecast Ability: A Factor Analysis. Real Estate Economics. Heidelberg.
10. Hy Dang, Minh Nguyen, Bo Mei, and Quang Troung. Improvements to home price prediction methods using machine learning. Procedia Engineering. 174:433–442.
11. Atharva Chogle, Priyankakhaire, Akshata Gaud, and Jinal Jain. House Price Forecasting Using Data Mining Techniques. International Journal of Advanced Research in Computer and Communication Engineering. 6:24–28.
12. Kai-Hsuan Chu, Li, Li. Prediction of real estate price variation based on economic parameters. International Conference on Applied System Innovation (ICASI). IEEE; 2017.
13. Jae Kwon Bae, Byeonghwa Park. Housing Price Forecast Using Machine Learning Algorithms. 42:2928–2934.
14. Subhani Shaik, Uppu Ravibabu. Classification of EMG Signal Analysis based on Curvelet Transform and Random Forest tree Method. Paper selected for Journal of Theoretical and Applied Information Technology (JATIT). 95.
15. Shiva Keertan J, Subhani Shaik. Machine Learning Algorithms for Oil Price Prediction. International Journal of Innovative Technology and Exploring Engineering. 8(8).
16. KP Surya Teja, Vigneswar Reddy, and Subhani Shaik. Flight Delay Prediction Using Machine Learning Algorithm XGBoost. Jour of Adv Research in Dynamical & Control Systems. 11(5).
17. Subhani Shaik, Vijayalakshmi K, Ramakanth Reddy. Location based house prediction using data science techniques. Asian Journal of Advanced Research and Reports. 17(4).
