1822 B.E Ece Batchno 120

HOUSE PRICE PREDICTION USING MACHINE LEARNING
Guide: DR. S. LALITHAKUMARI
Batch members:
38130145 – NACHIAPPAN M
38130138 – MOHAMMED IMRAN
A Project Report on
HOUSE PRICE PREDICTION USING MACHINE LEARNING
Submitted in partial fulfillment of the requirements
for the degree of
Bachelor of Technology
in
Electronics and Communication Engineering
Under the guidance of
DR.S.LALITHAKUMARI
Department of Electronics and Communication Engineering

SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC

JEPPIAAR NAGAR, RAJIV GANDHI SALAI, CHENNAI - 600 119
DECLARATION
I, NACHIAPPAN M, hereby declare that the Project Report entitled "House Price Prediction
Using Machine Learning", done by me under the guidance of DR.S.LALITHAKUMARI, is submitted
in partial fulfillment of the requirements for the award of Bachelor of Engineering /
Technology degree in Electronics and Communication Engineering.
DATE OF VIVA: 05.05.2022
PLACE: Sathyabama University, Chennai
SIGNATURE OF THE CANDIDATE

TABLE OF CONTENTS
Sl. No. Title
1 Aim and Objective
2 Proposed System
3 Block Diagram
4 Proposed System Phases
5 Alternate Regressor
6 Factors that Affect House Pricing
7 Sample Code
8 Advantages of LSTM
9 Result Output and Dataset Explanation
10 Algorithm Brief Outline
11 Acknowledgement
12 Conclusion
13 Software Tools
14 References
AIM & OBJECTIVE
• People looking to buy a new home tend to be conservative with their budgets and market
strategies.
• This project aims to analyse parameters such as average income and average area, and to
predict the house price accordingly.
• This application will help customers invest in real estate without approaching an agent.
• To provide a better and faster way of performing these operations.
• To provide a fair house price to customers.
• To eliminate the need for a real estate agent to obtain information about house prices.
• To provide the best price to the user without the risk of being cheated.
• To enable the user to search for a home that fits the budget.
• The aim is to predict house prices efficiently for real estate customers with respect to
their budgets and priorities. Future prices are predicted by analysing previous market
trends and price ranges, as well as upcoming developments.
• House prices tend to increase over time, so there is a need for a system that predicts
future house prices.
• House price prediction can help the developer determine the selling price of a house and
can help the customer choose the right time to purchase a house.
• We use the linear regression algorithm in machine learning to predict house price trends.
PROPOSED SYSTEM
• Linear Regression is a supervised machine learning model that attempts to model a linear
relationship between a dependent variable (Y) and one or more independent variables (X).
For every observation evaluated with the model, the actual value of the target (Y) is
compared to its predicted value, and the differences between these values are called
residuals. The Linear Regression model aims to minimize the sum of all squared residuals.
Here is the mathematical representation of linear regression with a single independent
variable:
Y = a0 + a1X + ε
Here a0 is the intercept, a1 is the slope and ε is the error term. The values of X and Y
form the training dataset for the model. When a linear regression is fitted, the algorithm
searches for the best-fit line by choosing a0 and a1. The closer the line lies to the actual
data points, the more accurate the model; once the values of a0 and a1 are known, the model
can be used to predict the response.
• In the accompanying diagram, the red dots are the observed values of X and Y.
• The black line, called the line of best fit, minimizes the sum of squared errors.
• The blue lines represent the errors, i.e. the distances between the line of best fit and
the observed values.
• The value of a1 is the slope of the black line.
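As a minimal sketch of how a0 and a1 can be found, the snippet below fits the line by ordinary least squares using the closed-form formulas. The area and price values are made up for illustration and are not taken from the project dataset.

```python
import numpy as np

# Hypothetical toy data: house area (X) and price (Y); not from the project dataset.
X = np.array([50.0, 70.0, 90.0, 110.0, 130.0])
Y = np.array([152.0, 208.0, 271.0, 330.0, 389.0])

# Closed-form least-squares estimates for Y = a0 + a1*X + error.
a1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a0 = Y.mean() - a1 * X.mean()

predicted = a0 + a1 * X
residuals = Y - predicted        # actual minus predicted values
sse = np.sum(residuals ** 2)     # the quantity that the fit minimises

print(f"a0 = {a0:.3f}, a1 = {a1:.3f}, sum of squared residuals = {sse:.3f}")
```

Any other choice of a0 and a1 would give a larger sum of squared residuals on these five points, which is what "line of best fit" means here.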
BLOCK DIAGRAM
PROPOSED SYSTEM PHASES
Phase 1: Collection of data
There are numerous techniques and processes for collecting and processing data. We collected
data for USA/Mumbai real estate properties from various real estate websites. The data has
attributes such as location, carpet area, built-up area, age of the property, zip code, price,
number of bedrooms, etc. We must collect quantitative data that is structured and categorized.
Data collection is needed before any kind of machine learning research is carried out. Dataset
validity is a must; otherwise there is no point in analyzing the data.
Phase 2: Data preprocessing
Data preprocessing is the process of cleaning our dataset. There might be missing values or
outliers in the dataset, and these are handled by data cleaning. If a variable has missing
values, we either drop the affected rows or substitute the missing values with the average
value of that variable.
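The two options described above (dropping or substituting with the average) could look roughly like this in pandas. The column names and values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps; column names are illustrative only.
df = pd.DataFrame({
    "carpet_area": [650.0, np.nan, 820.0, 1000.0],
    "bedrooms": [2.0, 3.0, np.nan, 3.0],
    "price": [45.0, 60.0, 58.0, np.nan],
})

# Rows without a target value cannot be used for training, so drop them.
df = df.dropna(subset=["price"])

# Fill the remaining gaps in the feature columns with each column's mean.
feature_cols = ["carpet_area", "bedrooms"]
df[feature_cols] = df[feature_cols].fillna(df[feature_cols].mean())

print(df)
```

Whether to drop or to fill is a judgment call: dropping loses rows, while filling with the mean keeps them at the cost of slightly flattening the variable.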
Phase 3: Training the model
The data is split into two parts, a training set and a test set, and the model is first
trained on the training set, which includes the target variable. The decision tree regressor
algorithm is applied to the training dataset. The decision tree builds a regression model in
the form of a tree structure.
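A sketch of this phase using scikit-learn is shown below. The data is synthetic, and the split ratio and max_depth are arbitrary choices made for the example rather than the settings used in the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration: price depends on area and age plus noise.
rng = np.random.default_rng(0)
area = rng.uniform(500, 2000, size=200)
age = rng.uniform(0, 30, size=200)
price = 0.05 * area - 0.8 * age + rng.normal(0, 2, size=200)

X = np.column_stack([area, age])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

# max_depth limits tree growth to reduce overfitting; 4 is an arbitrary choice.
tree = DecisionTreeRegressor(max_depth=4, random_state=42)
tree.fit(X_train, y_train)    # the training set includes the target variable

print("R^2 on the held-out test set:", tree.score(X_test, y_test))
```

Scoring on the held-out test set, rather than on the training set, gives a fairer picture of how the tree behaves on unseen houses.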
Phase 4: Testing and Integrating with UI
The trained model is applied to the test dataset and house prices are predicted. The trained
model is then integrated with the front end using Flask in Python.
ALTERNATIVE REGRESSOR (XG BOOST REGRESSOR)
The results of regression problems are continuous or real values. Some commonly used
regression algorithms are Linear Regression and Decision Trees. Several metrics are used to
evaluate regression models, such as root-mean-squared error (RMSE) and mean absolute error
(MAE). Both are commonly used to evaluate XGBoost models, and each plays an important role.
• RMSE: the square root of the mean squared error (MSE).
• MAE: the mean of the absolute differences between actual and predicted values. Because the
absolute value is not differentiable at zero, it is less convenient as a training objective,
which is why it is used less often than the other metrics.
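The two metrics can be written out in a few lines of plain Python. The actual and predicted prices below are invented for the example.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error: the average of the squared errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root-mean-squared error: the square root of the MSE."""
    return math.sqrt(mse(actual, predicted))

# Hypothetical prices, for illustration only.
actual = [200.0, 250.0, 300.0, 350.0]
predicted = [210.0, 240.0, 330.0, 350.0]

print("MAE:", mae(actual, predicted))
print("MSE:", mse(actual, predicted))
print("RMSE:", rmse(actual, predicted))
```

RMSE comes out larger than MAE on this data because squaring gives the single 30-unit error more weight than the two 10-unit errors.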
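XGBoost is a powerful approach for building supervised regression models. The validity of
this statement can be inferred from its objective function and base learners: the objective
function combines a loss function with a regularization term, and the base learners are
typically decision trees that are combined into an ensemble.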
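To make the idea of base learners concrete, here is a simplified sketch of gradient boosting with a squared-error loss, using one-split trees (stumps) as base learners. It is not XGBoost itself: XGBoost adds regularization, second-order gradient information and many optimizations. The function names, data and hyperparameters are assumptions made for this illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Fit a one-split regression tree (stump) to the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:    # excluding the max keeps both sides non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]                        # (threshold, left value, right value)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def fit_boosted(x, y, n_rounds=30, learning_rate=0.3):
    """Each round fits a stump to the residuals of the current ensemble."""
    base = y.mean()
    prediction = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        residual = y - prediction          # negative gradient of the squared error
        stump = fit_stump(x, residual)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, learning_rate, stumps

def predict_boosted(model, x):
    base, learning_rate, stumps = model
    prediction = np.full(len(x), base)
    for stump in stumps:
        prediction = prediction + learning_rate * predict_stump(stump, x)
    return prediction

# Hypothetical data: price rises with area.
area = np.linspace(50.0, 250.0, 40)
price = 20.0 + 0.5 * area

model = fit_boosted(area, price)
fitted = predict_boosted(model, area)

baseline_mse = np.mean((price - price.mean()) ** 2)
boosted_mse = np.mean((price - fitted) ** 2)
print(f"MSE predicting the mean: {baseline_mse:.2f}, after boosting: {boosted_mse:.2f}")
```

Each stump on its own is a weak model; the ensemble becomes accurate because every new stump corrects what the previous ones left unexplained.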
FACTORS THAT AFFECT HOUSE PRICING
In order to predict house prices, we first have to understand the factors that affect house
pricing.
• Economic growth. Demand for housing depends on income. With higher economic growth and
rising incomes, people are able to spend more on houses; this increases demand and pushes
up prices. In fact, demand for housing is often noted to be income elastic (a luxury good),
with rising incomes leading to a bigger percentage of income being spent on houses.
Similarly, in a recession, falling incomes mean people can't afford to buy, and those who
lose their jobs may fall behind on their mortgage payments and end up with their homes
repossessed.
• Unemployment. Related to economic growth is unemployment. When unemployment is rising,
fewer people are able to afford a house. Even the fear of unemployment may discourage
people from entering the property market.
• Interest rates. Interest rates affect the cost of monthly mortgage payments. A period of
high interest rates increases the cost of mortgage payments and lowers demand for buying a
house. High interest rates make renting relatively more attractive compared to buying.
Interest rates have a bigger effect if homeowners have large variable-rate mortgages. For
example, in 1990-92, the sharp rise in interest rates caused a very steep fall in UK house
prices because many homeowners couldn't afford the higher mortgage payments.
• Consumer confidence. Confidence is important in determining whether people want to take
the risk of taking out a mortgage. Expectations about the housing market are particularly
important; if people fear house prices could fall, they will defer buying.
• Mortgage availability. In the boom years of 1996-2006, many banks were very keen to lend
mortgages. They allowed people to borrow large income multiples (e.g. five times income).
Banks also required very low deposits (e.g. 100% mortgages). This ease of getting a
mortgage meant that demand for housing increased, as more people were now able to buy.
However, since the credit crunch of 2007, banks and building societies have struggled to
raise funds for lending on the money markets. They have therefore tightened their lending
criteria, requiring a bigger deposit to buy a house. This has reduced the availability of
mortgages, and demand has fallen.
• Supply. A shortage of supply pushes up prices, and excess supply causes prices to fall.
For example, in the Irish property boom of 1996-2006, an estimated 700,000 new houses were
built. When the property market collapsed, the market was left with a fundamental
oversupply. Vacancy rates reached 15%, and with supply greater than demand, prices fell.
By contrast, in the UK, housing supply fell behind demand. With a shortage, UK house prices
didn't fall as much as in Ireland and soon recovered, despite the ongoing credit crunch.
The supply of housing depends on the existing stock and on new house builds. Supply of
housing tends to be quite inelastic, because getting planning permission and building
houses is a time-consuming process. Periods of rising house prices may not cause an
equivalent rise in supply, especially in countries like the UK, with limited land for
home-building.
• Affordability/house prices to earnings. The ratio of house prices to earnings influences
demand. As house prices rise relative to income, you would expect fewer people to be able
to afford a house. For example, in the 2007 boom, the ratio of house prices to income rose
to 5. At this level, house prices were relatively expensive, and a correction followed,
with house prices falling.
Another way of looking at the affordability of housing is to look at the percentage of
take-home pay that is spent on mortgages. This takes into account house prices, but mainly
interest rates and the cost of monthly mortgage payments. In late 1989, housing became very
unaffordable because of rising interest rates. This caused a sharp fall in prices in
1990-92.
• Geographical factors. Many housing markets are highly geographical. For example, national
house prices may be falling, but some areas (e.g. London, Oxford) may still see rising
prices. Desirable areas can buck market trends, as demand is high and supply is limited.
For example, houses near good schools or a good rail link may carry a significant premium
over other areas. First-time buyers in London face much more expensive house prices, over
9.0 times earnings, compared to the north, where house prices are only 3.3 times earnings.
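The two affordability measures mentioned above, the price-to-earnings ratio and the share of take-home pay spent on the mortgage, can be worked through with a small calculation. The household figures, loan size and interest rates below are hypothetical, and the payment uses the standard repayment (annuity) formula for a fixed-rate loan.

```python
def price_to_earnings(house_price, annual_earnings):
    """Ratio of house price to annual earnings."""
    return house_price / annual_earnings

def monthly_payment(loan, annual_rate, years):
    """Monthly repayment on a fixed-rate repayment mortgage (annuity formula)."""
    r = annual_rate / 12          # monthly interest rate
    n = years * 12                # number of monthly payments
    if r == 0:
        return loan / n
    return loan * r / (1 - (1 + r) ** -n)

# Hypothetical household: all figures are illustrative, not market data.
house_price = 250_000
annual_earnings = 50_000
monthly_take_home = 3_000

ratio = price_to_earnings(house_price, annual_earnings)
print(f"price-to-earnings ratio: {ratio:.1f}")

for rate in (0.03, 0.06, 0.09):
    payment = monthly_payment(0.9 * house_price, rate, 25)
    share = payment / monthly_take_home
    print(f"rate {rate:.0%}: payment {payment:,.0f}, {share:.0%} of take-home pay")
```

The house price is the same in all three cases; only the interest rate changes, which is why the second measure reacts to rate rises that the price-to-earnings ratio does not capture.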
SAMPLE CODE
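import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Jupyter notebook magic: render plots inline
%matplotlib inline

# Load the dataset and take a first look
HouseDF = pd.read_csv('USA_Housing.csv')
HouseDF = HouseDF.reset_index(drop=True)
HouseDF.head()
HouseDF.info()
HouseDF.describe()
HouseDF.columns

# Exploratory plots: pairwise relationships, price distribution, correlations
sns.pairplot(HouseDF)
sns.histplot(HouseDF['Price'], kde=True)  # replaces the deprecated distplot
sns.heatmap(HouseDF.corr(numeric_only=True), annot=True)  # skip non-numeric columns

# Features (X) and target (y)
X = HouseDF[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
             'Avg. Area Number of Bedrooms', 'Area Population']]
y = HouseDF['Price']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

# Scale the features to [0, 1]; the scaler is fitted on the training set only
scaler = MinMaxScaler(feature_range=(0, 1))
x_train = scaler.fit_transform(X_train)
x_test = scaler.transform(X_test)

# Linear regression model. MinMaxScaler has no intercept_ or coef_ attributes,
# so a separate LinearRegression estimator is fitted on the scaled features.
lm = LinearRegression()
lm.fit(x_train, y_train)
print(lm.intercept_)
coeff_df = pd.DataFrame(lm.coef_, X.columns, columns=['Coefficient'])
coeff_df

# Scale the target as well, so that the network trains on values in [0, 1]
y_scaler = MinMaxScaler(feature_range=(0, 1))
y_train_scaled = y_scaler.fit_transform(y_train.to_numpy().reshape(-1, 1))

# LSTM layers expect 3-D input: (samples, timesteps, features per step)
x_train_seq = x_train.reshape(x_train.shape[0], x_train.shape[1], 1)
x_test_seq = x_test.reshape(x_test.shape[0], x_test.shape[1], 1)

from keras.layers import Dense, Dropout, LSTM
from keras.models import Sequential
model = Sequential()
model.add(LSTM(units=50, activation='relu', return_sequences=True,
               input_shape=(x_train_seq.shape[1], 1)))
model.add(Dropout(0.2))
model.add(LSTM(units=60, activation='relu', return_sequences=True))
model.add(LSTM(units=80, activation='relu', return_sequences=True))
model.add(LSTM(units=120, activation='relu'))
model.add(Dense(units=1))
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(x_train_seq, y_train_scaled, epochs=50)

# Predictions from both models; the LSTM output is converted back to the
# original price scale (this replaces the hard-coded scale factor)
predictions = lm.predict(x_test)
y_predicted = y_scaler.inverse_transform(model.predict(x_test_seq)).ravel()

# Linear regression: actual vs. predicted prices and distribution of residuals
plt.scatter(y_test, predictions)
sns.histplot(y_test - predictions, bins=50, kde=True)

# LSTM: actual and predicted prices; the x-axis is the position in the test set
plt.figure(figsize=(12, 6))
plt.plot(y_test.to_numpy(), 'b', label='Original Price')
plt.plot(y_predicted, 'r', label='Predicted Price')
plt.xlabel('Test sample')
plt.ylabel('Price')
plt.legend()
plt.show()

# Error metrics for the linear regression predictions
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))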
ADVANTAGE OF LSTM OVER OTHER MODELS
The LSTM model can be tuned through various parameters, such as the number of LSTM layers,
the dropout value or the number of epochs.
Long Short Term Memory (LSTM)
LSTMs are widely used for sequence prediction problems and have proven to be extremely
effective. The reason they work so well is that an LSTM is able to store past information
that is important and forget the information that is not. An LSTM has three gates:
The input gate: adds information to the cell state.
The forget gate: removes information that is no longer required by the model.
The output gate: selects the information to be shown as output.
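The role of the three gates can be seen in a single LSTM time step written out in NumPy. This is a sketch of the standard LSTM equations with randomly initialised weights, not the trained Keras model; the stacking order of the gate parameters in W, U and b is a convention chosen for this example, and it uses the usual tanh activation rather than the relu used in the sample code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U and b hold the four parameter blocks stacked."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b       # shape (4n,)
    f = sigmoid(z[0:n])              # forget gate: how much old cell state to keep
    i = sigmoid(z[n:2 * n])          # input gate: how much new information to add
    g = np.tanh(z[2 * n:3 * n])      # candidate values that may be added
    o = sigmoid(z[3 * n:4 * n])      # output gate: what to expose as output
    c = f * c_prev + i * g           # updated cell state
    h = o * np.tanh(c)               # hidden state, i.e. the output of the step
    return h, c

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W = rng.normal(scale=0.5, size=(4 * n_hidden, n_in))
U = rng.normal(scale=0.5, size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)

h = np.zeros(n_hidden)
c = np.zeros(n_hidden)
for x in rng.normal(size=(5, n_in)):   # a sequence of 5 input vectors
    h, c = lstm_step(x, h, c, W, U, b)

print("final hidden state:", h)
```

The cell state c is carried from step to step, which is how information from early in the sequence can still influence the final output.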
EXPLANATION OF THE OUTPUT RESULTS
AND THE DATASET
First we import a sample dataset from the sklearn library; other sample datasets can be
obtained from Kaggle. The data taken here describes various parameters and the house prices
in the city of Boston for the years between 1970 and 2020.
Here the data parameters are explained as follows:
For ease of understanding, we print the first 5 rows (instances) of the data using the head()
function. In total the dataset has 506 rows and 14 columns: 13 columns contain data about the
place, and the 14th column is the target column, which contains the house prices.
Then we check whether our data has any null values, i.e. missing values. If the data is
incomplete, errors may occur during processing, which may reduce the accuracy of the
predictive model. In our data there are no missing values, so the program skips the dropping
phase of data processing, in which rows with missing values are dropped or filled so that the
data is suitable for modelling.
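A sketch of the inspection and null check is given below on a hypothetical five-row frame. Unlike the project data, one value is deliberately missing so that the dropping step actually runs.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset; the real one has 506 rows and 14 columns.
data = pd.DataFrame({
    "rooms": [6.5, 6.4, 7.2, np.nan, 7.1],
    "age": [65.2, 78.9, 61.1, 45.8, 54.2],
    "price": [24.0, 21.6, 34.7, 33.4, 36.2],
})

print(data.head())     # first five rows
print(data.shape)      # (rows, columns)

# Count the missing values in every column.
missing_per_column = data.isnull().sum()
print(missing_per_column)

# Only drop rows when something is actually missing.
if missing_per_column.sum() > 0:
    data = data.dropna()
```

When every entry of missing_per_column is zero, as reported for the project dataset, the final step is skipped and the frame is left unchanged.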
Next we summarize the data so that it is easy to understand. To do this we use the
describe() function.
Count refers to the number of non-null values in each column, i.e. 506, since there are 506
rows of data for each column.
Mean refers to the mean value of the data in the given column.
Std refers to the standard deviation, i.e. how widely the values in a column are spread
around the mean.
Min refers to the smallest value in each column.
Max refers to the largest value in each column.
25% means that 25 percent of the data in that column is equal to or below that value (the
25th percentile); 50% and 75% are read in the same way.
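The statistics reported by describe() can be checked by hand on a very small series. The five values below are arbitrary and chosen so that the arithmetic is easy to follow.

```python
import pandas as pd

# Arbitrary values, chosen so the statistics are easy to verify by hand.
prices = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], name="price")

summary = prices.describe()
print(summary)

# Individual entries can be read by label.
print("mean:", summary["mean"])
print("standard deviation:", summary["std"])
print("25th percentile:", summary["25%"])
```

Here the mean is 30 while the standard deviation is about 15.8, which shows that std measures spread and is not itself one of the values in the column.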
Next we try to understand the correlation between the different variables. The best way to
do that is by using a heat map. A heat map is a representation of data in the form of a map
or diagram in which data values are represented as colours.
Correlation is a statistical measure that expresses the extent to which two variables are
linearly related (meaning they change together at a constant rate).
There are two types of correlation:
1. Positive correlation: a relationship between two variables that move in tandem, that is,
in the same direction. A positive correlation exists when one variable decreases as the
other variable decreases, or one variable increases while the other increases.
2. Negative correlation: a relationship between two variables in which one variable
increases as the other decreases, and vice versa.
In statistics, a perfect negative correlation is represented by the value -1.0, a 0
indicates no correlation, and +1.0 indicates a perfect positive correlation. A perfect
negative correlation means the relationship that exists between two variables is exactly
opposite all of the time. These two types of correlation are represented both numerically
and by shade of colour in the heat map.
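The correlation values shown in a heat map are Pearson correlation coefficients, which can be computed directly. The three small lists below are invented so that one pair is perfectly positively correlated and the other perfectly negatively correlated.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

# Invented values for illustration.
rooms = [4, 5, 6, 7, 8]
price = [20, 24, 28, 32, 36]    # rises in step with rooms
crime = [9, 7, 5, 3, 1]         # falls as rooms rise

print("rooms vs price:", pearson(rooms, price))
print("rooms vs crime:", pearson(rooms, crime))
```

Real data rarely reaches the extremes of +1 or -1; values in between indicate how closely the points follow a straight line.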
HEATMAP – for a better understanding of which place is best suited to an individual's
personal preferences, based on the given dataset. This uses the concept of correlation.
Next we split our data into the variables x and y in order to train our model to predict
prices.
Here the variable x contains the values of the first 13 columns, i.e. the parameters that
are required for calculating and predicting the house prices. The variable y contains the
values of the 14th column, which are the house prices.
First we predict the values in y using the values in x. Then we compare the actual prices
and the predicted prices using a scatter plot, and compute the R squared score and the mean
squared error between them. If the error is small enough, the training phase is over and we
proceed to testing the model. If the error is large, we use an optimizer such as Adam and
repeat the dropout and fitting process for a set number of epochs to reduce the error.
The R squared score and the mean squared error are also reported numerically, as a measure
of how accurately the model predicts the data. An R squared score close to 1 and a small
mean squared error indicate a good fit.
In this project, a model is considered good if the error value is less than 5.
During the testing process we then predict future house prices using present and past data
parameters of houses in a location, and plot the result as a graph of house price over time.
For training the model, the error needs to be as small as possible for the model to be
accurate. The actual and predicted prices are plotted against each other using a scatter
plot. Here we can see that the error is small, since the data points of the actual and
predicted values lie close to each other.
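The two numbers used to judge the fit can be computed as follows. The actual and predicted prices are hypothetical; the function names r_squared and mean_squared are chosen for this sketch.

```python
import numpy as np

def r_squared(actual, predicted):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mean_squared(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean((actual - predicted) ** 2)

# Hypothetical prices, for illustration only.
actual = [24.0, 21.6, 34.7, 33.4, 36.2]
predicted = [25.0, 22.0, 33.0, 32.0, 35.0]

print("R squared:", r_squared(actual, predicted))
print("MSE:", mean_squared(actual, predicted))
```

Note that the mean squared error is expressed in squared price units, so a fixed threshold for it only makes sense relative to the scale of the prices, whereas R squared is unitless.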
PREDICTED VALUE OF HOUSE PRICE BASED ON TEST SAMPLE DATA
ALGORITHM BRIEF OUTLINE
1. Import the Python libraries that are required for house price prediction using linear
regression. Example: numpy is used to convert the data to the 2-D or 3-D array format
required by the model, matplotlib to plot graphs, pandas to read the data from the source
and manipulate it, etc.
2. Read the values from the source into a data frame and manipulate the data into the
required form using head(), indexing and drop().
3. Before training a model, it is best to split the data into training data and test data.
4. It is good practice to check the shape attribute and to look for null values, which
would otherwise cause errors during the modelling process.
5. It is good to normalize the values, since house prices are very large numbers. For this
we may use MinMaxScaler, which reduces the gap between prices so that comparing values is
easier and less time consuming. The range is usually specified as 0 to 1, and the scaling is
applied using fit_transform.
6. Then make a few imports from keras: Sequential for initializing the network, LSTM to add
an LSTM layer, Dropout to reduce overfitting of the LSTM layers, and Dense to add a densely
connected layer for the output unit.
7. In the LSTM layer declaration, specify units, activation and return_sequences.
8. To compile the model, the adam optimizer is a good default; set the loss as required for
the specific data.
9. Fit the model for a number of epochs. Epochs are the number of times the learning
algorithm works through the entire training set.
10. Then convert the values back to their original scale by inverting the min-max scaling
(for example with inverse_transform, or with the scale factor).
11. Then give test data (present data) to the trained model to get the predicted values
(future data).
12. Then use matplotlib to plot a graph comparing the test and predicted values, to see the
rate of increase or decrease of prices at each time of the year in a particular place. Based
on this, people will know when it is the best time to sell or buy a place in a given
location.
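Steps 5 and 10 are two halves of the same operation. The sketch below writes out the min-max arithmetic by hand on made-up prices; scikit-learn's MinMaxScaler performs the same calculation through fit_transform and inverse_transform.

```python
import numpy as np

# Made-up prices, for illustration only.
prices = np.array([120_000.0, 250_000.0, 400_000.0, 1_000_000.0])

# Step 5: scale to the range [0, 1].
lowest, highest = prices.min(), prices.max()
scaled = (prices - lowest) / (highest - lowest)

# Step 10: invert the scaling to get back to the original price scale.
restored = scaled * (highest - lowest) + lowest

print("scaled:", scaled)
print("restored:", restored)
```

Inverting the scaling needs both the multiplication by (highest - lowest) and the addition of the minimum; multiplying by a scale factor alone is only enough when the minimum is zero.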
ACKNOWLEDGEMENT
I am pleased to acknowledge my sincere thanks to the Board of Management of SATHYABAMA for
their kind encouragement in doing this project and for completing it successfully. I am
grateful to them.
I convey my thanks to Dr.N.M.Nandhitha, Dean, School of Electronics, and Dr.T.Ravi, Head of
the Department, Dept. of ECE, for providing me with the necessary support and details at the
right time during the progressive reviews.
I would like to express my sincere and deep sense of gratitude to my Project Guide
Dr.S.Lalithakumari, whose valuable guidance, suggestions and constant encouragement paved
the way for the successful completion of my project work. I wish to express my thanks to all
Teaching and Non-teaching staff members of the Department of ECE who were helpful in many
ways for the completion of the project.
CONCLUSION
Thus the machine learning model to predict the house price based on a given dataset was
executed successfully using the XGBoost regressor (a gradient-boosted ensemble of decision
trees, which gave a lower error than plain linear regression). The model further helps people
understand whether a place is suited to them, based on the heatmap of correlations. It also
helps people looking to sell a house to choose the best time for greater profit. Given an
appropriate dataset, the same approach can be used to predict house prices in other
locations with small error.
SOFTWARE TOOLS
• Keras
• Jupyter
• Visual Studio
• R Square
• Adjusted R Square
• MSE
• RMSE
• MAE
• Google Colab
