HOUSE SALES PRICE PREDICTION ANALYSIS

Presented by,
SANDHYA NAIDU E
CONTENTS

• Abstract
• Project Objective
• Introduction
• About Dataset
• Linear Regression
• Gradient Boosting Regression
• Snapshots
• Conclusion
ABSTRACT

• This project focuses on predicting the selling price of a house based on various
parameters such as year built, square feet, lot size, number of beds and baths, features,
walk score, etc.
• The data is taken from Kaggle.com.

What is Kaggle?
Kaggle is an online community of data scientists and machine learning practitioners owned
by Google. Kaggle allows users to find and publish datasets, explore and build models in a
web-based data-science environment, work with other data scientists and machine learning
engineers, and enter competitions to solve data science challenges.
PROJECT OBJECTIVE
• This project aims to construct a mathematical model using Linear Regression and
Gradient Boosting Regression to estimate the selling price of a house based on a set of
variables.
• Analysis software used: Python Jupyter Notebook (Anaconda environment).
INTRODUCTION
• Our goal for this project was to use regression techniques to estimate the sale price of
a house in King County, Washington, given feature and pricing data for around 21,000
houses sold within one year.
ABOUT DATASET
• Our dataset comes from a Kaggle competition.
• The dataset contains the sale prices and features of homes sold in King County,
Washington between May 2014 and May 2015.
• King County is the most populous county in Washington and is part of the
Seattle-Tacoma-Bellevue metropolitan statistical area. It is the 13th most populous
county in the United States.
• There are 21,613 observations in the dataset.
• There are 25 total attributes in the dataset, four of which we derived from existing
columns. We are using 22 attributes in our models: all attributes except date, latitude,
and longitude (a short loading sketch follows below).
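
A short sketch of loading and inspecting the dataset with pandas might look as follows; the file name kc_house_data.csv is an assumption based on the standard Kaggle download, not something stated in the slides.

```python
import pandas as pd

# Load the King County house sales dataset downloaded from Kaggle.
# The file name "kc_house_data.csv" is an assumption, not taken from the slides.
data = pd.read_csv("kc_house_data.csv")

print(data.shape)              # roughly (21613, 21) before the derived attributes are added
print(data.columns.tolist())   # id, date, price, bedrooms, bathrooms, sqft_living, ...
print(data["price"].describe())
```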
DATA PREPROCESSING
• During cleaning and preprocessing, we created four attributes derived from other
attributes: Age, Age_renovated, Sqft_living15_diff, and Sqft_lot15_diff (see the sketch
after the table below).
• We chose not to use the date variable, because it only shows us when the data was
entered into the database.
• We chose not to use latitude and longitude because the Zipcode attribute contains the
same information and was easier to work with in our models.
• We checked for missing values and the dataset didn't contain any.

Attribute            Correlation with price
Bedrooms              0.308349598
Bathroom              0.525137505
Sqft_living           0.702035055
Sqft_lot              0.089660861
Floors                0.256793888
Waterfront            0.266369434
View                  0.397293488
Condition             0.036361789
Grade                 0.667434256
Sqft_above            0.605567298
Sqft_basement         0.323816021
Yr_built              0.054011531
Age                  -0.054011531
Yr_renovated          0.126433793
Age_renovation       -0.105754631
Sqft_living15         0.585378904
Sqft_living15_diff    0.405391664
Sqft_lot15            0.082447153
Sqft_lot15_diff       0.050590661
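
A minimal pandas sketch of these preprocessing steps, assuming the Kaggle column names. The slides do not give the formulas for the four derived attributes, so the expressions below are illustrative guesses based on the attribute names.

```python
import pandas as pd

data = pd.read_csv("kc_house_data.csv")

# Four derived attributes (formulas are assumptions inferred from the names).
data["age"] = 2015 - data["yr_built"]
data["age_renovated"] = (2015 - data["yr_renovated"]).where(data["yr_renovated"] > 0, data["age"])
data["sqft_living15_diff"] = data["sqft_living"] - data["sqft_living15"]
data["sqft_lot15_diff"] = data["sqft_lot"] - data["sqft_lot15"]

# Latitude and longitude are dropped because zipcode carries the same information.
data = data.drop(columns=["lat", "long"])

# Check for missing values; the slides report that the dataset has none.
print(data.isnull().sum())

# Correlation of each attribute with the sale price, as in the table above.
print(data.corr(numeric_only=True)["price"].sort_values(ascending=False))
```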
LINEAR REGRESSION
So what did we do? Let's go step by step.
• We import our dependencies; for linear regression we use scikit-learn (a Python machine
learning library) and import LinearRegression from it.
• We then initialize LinearRegression into a variable, reg.
• Since prices are what we want to predict, we set the labels (output) to the price column. We
also convert the dates to 1's and 0's so that they don't influence our data much; we use 0 for
houses that are new, i.e., built after 2014.
• We import another dependency to split our data into training and test sets.
• I made 90% of the data my training data and 10% my test data, and randomized the split
using random_state.
• Now that we have training data, test data, and labels for both, we fit the training data to the
linear regression model.
• After fitting the data to the model, we can check its score, i.e., the prediction accuracy; in this
case it is 73% (see the sketch below).
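
Put together, the steps above might look like the following scikit-learn sketch. The exact columns dropped and the 1/0 date indicator are assumptions based on the description, not a copy of the original notebook.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv("kc_house_data.csv")

# Labels (output) are the sale prices.
labels = data["price"]

# Convert the sale date to a 1/0 indicator as described above
# (assumed here: 1 for 2014 dates, 0 otherwise).
data["date"] = [1 if str(d).startswith("2014") else 0 for d in data["date"]]

# Features: everything except the target and the unused id/lat/long columns (assumed).
features = data.drop(columns=["price", "id", "lat", "long"])

# 90% training data, 10% test data, with a fixed random_state for reproducibility.
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.10, random_state=2)

# Initialize linear regression in a variable reg and fit the training data.
reg = LinearRegression()
reg.fit(x_train, y_train)

# R^2 score on the held-out test data; the slides report roughly 0.73.
print(reg.score(x_test, y_test))
```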
THE ACCURACY OF THE MODEL IS LOWER
THAN OUR AIM OF 85%. SO HOW DO WE
ACHIEVE THAT TARGET?
GRADIENT BOOSTING REGRESSION

What is Gradient Boosting?

It is a machine learning technique for regression and
classification problems, which produces a prediction
model in the form of an ensemble of weak prediction
models, typically decision trees.
1. We first import the library from sklearn (it is one of the best libraries for
statistics-related models).
2. We create a variable where we define our gradient boosting regressor and set its
parameters:
• n_estimators — the number of boosting stages to perform. We should not set it too
high, which would overfit our model.
• max_depth — the maximum depth of each individual tree.
• learning_rate — the shrinkage applied to each tree's contribution, i.e., how strongly
each stage learns from the data.
• loss — the loss function to be optimized; 'ls' refers to least squares regression.
• min_samples_split — the minimum number of samples required to split a node.
3. We then fit our training data to the gradient boosting model and check the accuracy
(a code sketch follows below).
4. We got an accuracy of 91.94%, which is amazing!
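
A sketch of this step with scikit-learn's GradientBoostingRegressor, reusing the x_train/x_test split from the linear regression sketch above. The parameter values are illustrative rather than the project's exact settings, and recent scikit-learn releases spell the least-squares loss 'squared_error' instead of 'ls'.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Define the gradient boosting regressor and set its parameters.
# Values here are illustrative, not the project's exact settings.
clf = GradientBoostingRegressor(
    n_estimators=400,      # number of boosting stages; too many can overfit
    max_depth=5,           # maximum depth of each individual tree
    learning_rate=0.1,     # shrinkage applied to each tree's contribution
    min_samples_split=2,   # minimum samples required to split an internal node
    loss="squared_error",  # least squares; called 'ls' in older scikit-learn releases
)

# Fit the training data and check the R^2 score on the test data.
clf.fit(x_train, y_train)
print(clf.score(x_test, y_test))   # the slides report about 0.92
```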
SNAPSHOTS
CONCLUSION

In this project, we attempted to predict the price of houses in King County. We found that the best
we could achieve was a GRADIENT BOOSTING REGRESSION model with an accuracy of 91%,
which we consider to be very accurate. Predicting house prices is an extremely complex and
challenging problem because houses vary widely and house prices are based not only on the
physical properties of a house but also on the emotional, social and financial position of the parties
involved. Our results indicate that in order to provide accurate predictions of house prices, a very
large number of features must be used and that they most likely need to be combined with a
powerful, complex and non-linear model.
THANK YOU
