Faseeh Chap 2 Report


Department of Electrical Engineering

Summer Project 2023

Report on Median Housing Price Prediction through Neural Network

Submitted by

1) Faseeh Ahmed | 02-3-1-013-2021

Section: A

Submitted To: Dr. Sufi Tabassum Gul

Due: July 20, 2023


1 Median Housing Price Prediction Project
1.1 Fetching Data:
1.1.1 Importing Libraries and Functions:

[63]: import os
import tarfile
import urllib.request
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from zlib import crc32
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
from scipy import stats
from sklearn.svm import SVR

1.1.2 Fetching Data from the web and Extracting it to custom Directory:
This part of the code is commented out because the data is read directly from a locally stored file.

[2]: #DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
#HOUSING_PATH = os.path.join("datasets", "housing")
#HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

#def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
#    if not os.path.isdir(housing_path):
#        os.makedirs(housing_path)
#    tgz_path = os.path.join(housing_path, "housing.tgz")
#    urllib.request.urlretrieve(housing_url, tgz_path)
#    housing_tgz = tarfile.open(tgz_path)
#    housing_tgz.extractall(path=housing_path)
#    housing_tgz.close()

#fetch_housing_data()

1.1.3 Function to load the dataset:


Since we read the data from a locally stored file, we do not need a separate function to load it after
fetching it from the web.

[3]: #def load_housing_data(housing_path=HOUSING_PATH):
#    csv_path = os.path.join(housing_path, "housing.csv")
#    return pd.read_csv(csv_path)

1.1.4 Directly reading the Data from Custom Directory:


We read the data with the pandas function pd.read_csv() and display the first five entries of the
dataset with housing.head().

[4]: csv_path = 'E:/Uni Stuff/Summer Projects/Jupyter Notebook/Housing/datasets/housing/housing.csv'
housing = pd.read_csv(csv_path)
housing.head()

[4]: longitude latitude housing_median_age total_rooms total_bedrooms \


0 -122.23 37.88 41.0 880.0 129.0
1 -122.22 37.86 21.0 7099.0 1106.0
2 -122.24 37.85 52.0 1467.0 190.0
3 -122.25 37.85 52.0 1274.0 235.0
4 -122.25 37.85 52.0 1627.0 280.0

population households median_income median_house_value ocean_proximity


0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 496.0 177.0 7.2574 352100.0 NEAR BAY
3 558.0 219.0 5.6431 341300.0 NEAR BAY
4 565.0 259.0 3.8462 342200.0 NEAR BAY

The housing.info() function gives a brief overview of the dataset, including the number of columns, the
number of instances and the datatype of each column. It also shows that the “total_bedrooms” column has
only 20433 non-null values, so 207 instances are missing this value.

[5]: housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
To see all the categories of the “ocean_proximity” column we used the value_counts() function, which
also gives the number of instances in each category.

[6]: housing["ocean_proximity"].value_counts()

[6]: ocean_proximity
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: count, dtype: int64

The describe() function gives a statistical summary of the numerical columns. The count, max and min
rows give the number of non-null instances, the maximum value and the minimum value respectively. The
std row gives the standard deviation, and the 25%, 50% and 75% rows give the corresponding percentiles.

[7]: housing.describe()

[7]: longitude latitude housing_median_age total_rooms \


count 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081
std 2.003532 2.135952 12.585558 2181.615252
min -124.350000 32.540000 1.000000 2.000000
25% -121.800000 33.930000 18.000000 1447.750000
50% -118.490000 34.260000 29.000000 2127.000000
75% -118.010000 37.710000 37.000000 3148.000000
max -114.310000 41.950000 52.000000 39320.000000

total_bedrooms population households median_income \


count 20433.000000 20640.000000 20640.000000 20640.000000

mean 537.870553 1425.476744 499.539680 3.870671
std 421.385070 1132.462122 382.329753 1.899822
min 1.000000 3.000000 1.000000 0.499900
25% 296.000000 787.000000 280.000000 2.563400
50% 435.000000 1166.000000 409.000000 3.534800
75% 647.000000 1725.000000 605.000000 4.743250
max 6445.000000 35682.000000 6082.000000 15.000100

median_house_value
count 20640.000000
mean 206855.816909
std 115395.615874
min 14999.000000
25% 119600.000000
50% 179700.000000
75% 264725.000000
max 500001.000000

1.1.5 Visualizing Numerical Data


The hist() function plots a histogram of each numerical column. The ‘bins’ argument sets the number of
intervals (bars) into which the range of each column is divided along the horizontal axis, while
‘figsize’ sets the size of the figure.

[8]: housing.hist(bins=50, figsize=(20,15))


plt.show()

1.2 Creating a Test Set
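To create a test set we can split the dataset in a number of ways. The simplest is to pick instances at
random, but unless the random seed is fixed each run produces a different test set, so over many runs
the model would get to see the whole dataset, which is not recommended. An alternative is to compute a
hash of each instance's identifier and put the instance in the test set if the hash falls below 20% of
the maximum hash value. This ensures that the test set remains consistent across multiple runs, even if
the dataset is refreshed: the new test set will contain 20% of the new instances, but it will not contain
any instance that was previously in the training set. In the end we used train_test_split() with a fixed
random_state, which also gives a reproducible split as long as the dataset does not change.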

[9]: def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

#housing_with_id = housing.reset_index()
#housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
#train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

print(len(train_set))
print(len(test_set))

16512
4128
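
To illustrate why a hash-based split stays stable, the sketch below applies a crc32 test to a toy dataset before and after new rows are appended. The toy DataFrames, the helper names in_test_set and split_by_id, and the explicit 8-byte encoding of the identifier are assumptions made for this illustration; they are not part of the project code.

```python
from zlib import crc32

import numpy as np
import pandas as pd


def in_test_set(identifier, test_ratio):
    # Hash the identifier and keep it in the test set if the hash falls
    # in the lowest test_ratio fraction of the 32-bit range.
    raw = int(identifier).to_bytes(8, "little", signed=True)
    return crc32(raw) & 0xffffffff < test_ratio * 2**32


def split_by_id(data, test_ratio, id_column):
    mask = data[id_column].apply(lambda id_: in_test_set(id_, test_ratio))
    return data.loc[~mask], data.loc[mask]


# Toy data: 1000 rows at first, then the same rows plus 200 new ones.
old = pd.DataFrame({"id": np.arange(1000), "value": np.arange(1000) * 2.0})
new = pd.DataFrame({"id": np.arange(1200), "value": np.arange(1200) * 2.0})

old_train, old_test = split_by_id(old, 0.2, "id")
new_train, new_test = split_by_id(new, 0.2, "id")

# Rows that were in the old test set but ended up in the new training set.
moved_to_train = set(old_test["id"]) & set(new_train["id"])
print("old test size:", len(old_test))
print("new test size:", len(new_test))
print("test rows that moved to training:", len(moved_to_train))
```

Because membership depends only on each row's identifier, no row changes sides when the dataset grows. This only holds if the identifier of an existing row never changes.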
Next we create a new column, “income_cat”. The pandas cut() function assigns each instance to one of
five income categories according to the range in which its “median_income” falls, and we plot the
histogram of the categories.

[10]: housing["income_cat"] = pd.cut(housing["median_income"], bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5])
housing["income_cat"].hist()

[10]: <Axes: >
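
As a small illustration of how cut() bins values, the sketch below applies the same bin edges and labels to a handful of made-up income values (the values in incomes are assumptions chosen to sit on and around the bin edges). By default each bin includes its right edge and excludes its left edge.

```python
import numpy as np
import pandas as pd

# Made-up median incomes, including values exactly on a bin edge.
incomes = pd.Series([0.8, 1.5, 2.9, 3.0, 4.6, 7.2, 15.0])

income_cat = pd.cut(incomes,
                    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                    labels=[1, 2, 3, 4, 5])

# Bins are (0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, inf)
print(list(income_cat.astype(int)))  # [1, 1, 2, 2, 4, 5, 5]
```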

1.2.1 Stratification of data set to minimize error


StratifiedShuffleSplit is a utility class from the scikit-learn library that generates stratified
random train-test splits. It is particularly useful when the categories in the dataset are imbalanced.
This matters most in classification tasks, where the train and test sets should preserve the class
distribution of the original dataset. Here the same idea is applied to “income_cat”: the split shuffles
the data randomly while keeping the relative proportions of the income categories the same in each
split.

[11]: split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
strat_test_set["income_cat"].value_counts() / len(strat_test_set)

[11]: income_cat
3 0.350533
2 0.318798
4 0.176357
5 0.114341
1 0.039971
Name: count, dtype: float64
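
The effect of stratification can be checked on synthetic data. The sketch below is only an illustration: the DataFrame df, its column names and the category probabilities are assumptions, not the housing data. It compares the category proportions of the full data with those of the stratified test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic data with three imbalanced categories (assumed proportions).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=1000),
    "cat": rng.choice([1, 2, 3], size=1000, p=[0.1, 0.3, 0.6]),
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(df, df["cat"]))

overall = df["cat"].value_counts(normalize=True).sort_index()
test_props = df["cat"].iloc[test_idx].value_counts(normalize=True).sort_index()

# Largest difference between overall and test-set proportions.
max_gap = (overall - test_props).abs().max()
print(pd.DataFrame({"overall": overall, "stratified test": test_props}))
print("largest gap:", max_gap)
```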

Dropping the “income_cat” column to restore the original set of columns.

[12]: for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

1.3 Visualizing Geographical Data


Using the copy() function to make a copy of the training set and store it in the ‘housing’ variable, then
using the plot() function to visualize the data. The alpha parameter sets the transparency of each
point, so areas with many overlapping points appear darker.

[13]: housing = strat_train_set.copy()


housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)

[13]: <Axes: xlabel='longitude', ylabel='latitude'>

Plotting the population as the radius of each circle and using color to indicate the
‘median_house_value’, with red being the highest and blue being the lowest.

[14]: housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=housing["population"]/100, label="population", figsize=(10,7),
    c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
)
plt.legend()

[14]: <matplotlib.legend.Legend at 0x1be295b2a50>

1.4 Correlations
First we drop the “ocean_proximity” column so that we can use the corr() function, which works only on
numerical (int and float) columns and not on strings.

[15]: # for set2 in (housing):


# print (set2)
# # set2.drop("ocean_proximity", axis=1, inplace=True)
# #corr_matrix = housing.corr()
# #corr_matrix["median_house_value"].sort_values(ascending=False)
# housing.head()
housing.drop(['ocean_proximity'], axis=1, inplace=True)
housing.head()

[15]: longitude latitude housing_median_age total_rooms total_bedrooms \


12655 -121.46 38.52 29.0 3873.0 797.0
15502 -117.23 33.09 7.0 5320.0 855.0
2908 -119.04 35.37 44.0 1618.0 310.0
14053 -117.13 32.75 24.0 1877.0 519.0
20496 -118.70 34.28 27.0 3536.0 646.0

population households median_income median_house_value

12655 2237.0 706.0 2.1736 72100.0
15502 2015.0 768.0 6.3373 279600.0
2908 667.0 300.0 2.8750 82700.0
14053 898.0 483.0 2.2264 112500.0
20496 1837.0 580.0 4.4964 238300.0

Using the corr() function to find the correlations between all columns of the dataset, then displaying
the correlations with “median_house_value” sorted in descending order.

[16]: corr_matrix = housing.corr()


corr_matrix["median_house_value"].sort_values(ascending=False)

[16]: median_house_value 1.000000


median_income 0.687151
total_rooms 0.135140
housing_median_age 0.114146
households 0.064590
total_bedrooms 0.047781
population -0.026882
longitude -0.047466
latitude -0.142673
Name: median_house_value, dtype: float64
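
By default corr() computes the Pearson correlation coefficient, which measures only linear relationships. The sketch below, on synthetic data (the columns income and value and their noise levels are assumptions for illustration), computes the coefficient by hand and compares it with the pandas result.

```python
import numpy as np
import pandas as pd

# Synthetic, roughly linear relationship with noise (assumed values).
rng = np.random.default_rng(42)
income = rng.uniform(1.0, 10.0, size=500)
value = 40_000 * income + rng.normal(0, 60_000, size=500)
df = pd.DataFrame({"income": income, "value": value})

# Pearson's r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the column means.
dx = df["income"] - df["income"].mean()
dy = df["value"] - df["value"].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = df.corr().loc["income", "value"]
print(r_manual, r_pandas)
```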

1.4.1 Plotting Correlations


First we create a list named ‘attributes’ containing the relevant columns of the dataset and then use
the scatter_matrix() function to plot each of them against the others.

[17]: attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

[17]: array([[<Axes: xlabel='median_house_value', ylabel='median_house_value'>,


<Axes: xlabel='median_income', ylabel='median_house_value'>,
<Axes: xlabel='total_rooms', ylabel='median_house_value'>,
<Axes: xlabel='housing_median_age', ylabel='median_house_value'>],
[<Axes: xlabel='median_house_value', ylabel='median_income'>,
<Axes: xlabel='median_income', ylabel='median_income'>,
<Axes: xlabel='total_rooms', ylabel='median_income'>,
<Axes: xlabel='housing_median_age', ylabel='median_income'>],
[<Axes: xlabel='median_house_value', ylabel='total_rooms'>,
<Axes: xlabel='median_income', ylabel='total_rooms'>,
<Axes: xlabel='total_rooms', ylabel='total_rooms'>,
<Axes: xlabel='housing_median_age', ylabel='total_rooms'>],
[<Axes: xlabel='median_house_value', ylabel='housing_median_age'>,
<Axes: xlabel='median_income', ylabel='housing_median_age'>,
<Axes: xlabel='total_rooms', ylabel='housing_median_age'>,
<Axes: xlabel='housing_median_age', ylabel='housing_median_age'>]],

dtype=object)

Plotting only “median_income” against “median_house_value” shows a roughly linear upward trend: an
increase in income corresponds to an increase in house value.

[18]: housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)

[18]: <Axes: xlabel='median_income', ylabel='median_house_value'>

1.4.2 Combining Labels
We can also combine some columns and check whether the newly created attributes correlate better with
“median_house_value”.

[19]: housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]


housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)

[19]: median_house_value 1.000000


median_income 0.687151
rooms_per_household 0.146255
total_rooms 0.135140
housing_median_age 0.114146
households 0.064590
total_bedrooms 0.047781
population_per_household -0.021991
population -0.026882
longitude -0.047466

latitude -0.142673
bedrooms_per_room -0.259952
Name: median_house_value, dtype: float64

1.5 Preparing Data for Machine Learning Algorithms


[20]: housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

1.5.1 Cleaning the Data

[21]: sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()


sample_incomplete_rows

[21]: longitude latitude housing_median_age total_rooms total_bedrooms \


1606 -122.08 37.88 26.0 2947.0 NaN
10915 -117.87 33.73 45.0 2264.0 NaN
19150 -122.70 38.35 14.0 2313.0 NaN
4186 -118.23 34.13 48.0 1308.0 NaN
16885 -122.40 37.58 26.0 3281.0 NaN

population households median_income ocean_proximity


1606 825.0 626.0 2.9330 NEAR BAY
10915 1970.0 499.0 3.4193 <1H OCEAN
19150 954.0 397.0 3.7813 <1H OCEAN
4186 835.0 294.0 4.2891 <1H OCEAN
16885 1145.0 480.0 6.3580 NEAR OCEAN

Instead of removing all instances in which “total_bedrooms” is null, or dropping the “total_bedrooms”
column completely, we computed the median of this column and used it to replace the missing values.

[22]: median = housing["total_bedrooms"].median()
# assign the result instead of calling fillna(inplace=True) on a column selection
sample_incomplete_rows = sample_incomplete_rows.fillna({"total_bedrooms": median})

[23]: sample_incomplete_rows

[23]: longitude latitude housing_median_age total_rooms total_bedrooms \


1606 -122.08 37.88 26.0 2947.0 433.0
10915 -117.87 33.73 45.0 2264.0 433.0
19150 -122.70 38.35 14.0 2313.0 433.0
4186 -118.23 34.13 48.0 1308.0 433.0
16885 -122.40 37.58 26.0 3281.0 433.0

population households median_income ocean_proximity


1606 825.0 626.0 2.9330 NEAR BAY
10915 1970.0 499.0 3.4193 <1H OCEAN
19150 954.0 397.0 3.7813 <1H OCEAN

4186 835.0 294.0 4.2891 <1H OCEAN
16885 1145.0 480.0 6.3580 NEAR OCEAN

Using the SimpleImputer class with “median” strategy and naming it ‘imputer’

[24]: imputer = SimpleImputer(strategy="median")

Dropping the “ocean_proximity” column because the median can only be computed on numerical attributes.

[25]: housing_num = housing.drop("ocean_proximity", axis=1)

[26]: imputer.fit(housing_num)

[26]: SimpleImputer(strategy='median')

The fitted imputer stores the median of each column in its statistics_ attribute.

[27]: imputer.statistics_

[27]: array([-118.51 , 34.26 , 29. , 2119. , 433. ,


1164. , 408. , 3.54155])

The transform() method replaces the missing values with the learned medians and returns the result as a
NumPy array instead of a pandas DataFrame.

[28]: X = imputer.transform(housing_num)
X

[28]: array([[-1.2146e+02, 3.8520e+01, 2.9000e+01, ..., 2.2370e+03,


7.0600e+02, 2.1736e+00],
[-1.1723e+02, 3.3090e+01, 7.0000e+00, ..., 2.0150e+03,
7.6800e+02, 6.3373e+00],
[-1.1904e+02, 3.5370e+01, 4.4000e+01, ..., 6.6700e+02,
3.0000e+02, 2.8750e+00],
...,
[-1.2272e+02, 3.8440e+01, 4.8000e+01, ..., 4.5800e+02,
1.7200e+02, 3.1797e+00],
[-1.2270e+02, 3.8310e+01, 1.4000e+01, ..., 1.2080e+03,
5.0100e+02, 4.1964e+00],
[-1.2214e+02, 3.9970e+01, 2.7000e+01, ..., 6.2500e+02,
1.9700e+02, 3.1319e+00]])
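
The fit/transform behaviour of the imputer is easier to see on a tiny array. The sketch below uses a made-up 4x2 array (the values are assumptions for illustration) with one missing value per column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: one missing value in each column.
X_toy = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0],
                  [np.nan, 50.0]])

toy_imputer = SimpleImputer(strategy="median")
X_filled = toy_imputer.fit_transform(X_toy)

# Medians are computed per column, ignoring the missing entries.
print(toy_imputer.statistics_)  # [ 2. 30.]
print(X_filled)
```

Fitting on the training set and reusing the same fitted imputer on new data means the test set is filled with the training medians, not its own.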

Converting the array back to a DataFrame and saving it in the ‘housing_tr’ variable. We can now display
the instances that previously had null values by using the index of ‘sample_incomplete_rows’.

[29]: housing_tr = pd.DataFrame(X, columns=housing_num.columns,


index=housing.index)
housing_tr.loc[sample_incomplete_rows.index.values]

[29]: longitude latitude housing_median_age total_rooms total_bedrooms \
1606 -122.08 37.88 26.0 2947.0 433.0
10915 -117.87 33.73 45.0 2264.0 433.0
19150 -122.70 38.35 14.0 2313.0 433.0
4186 -118.23 34.13 48.0 1308.0 433.0
16885 -122.40 37.58 26.0 3281.0 433.0

population households median_income


1606 825.0 626.0 2.9330
10915 1970.0 499.0 3.4193
19150 954.0 397.0 3.7813
4186 835.0 294.0 4.2891
16885 1145.0 480.0 6.3580

[30]: housing_tr.head()

[30]: longitude latitude housing_median_age total_rooms total_bedrooms \


12655 -121.46 38.52 29.0 3873.0 797.0
15502 -117.23 33.09 7.0 5320.0 855.0
2908 -119.04 35.37 44.0 1618.0 310.0
14053 -117.13 32.75 24.0 1877.0 519.0
20496 -118.70 34.28 27.0 3536.0 646.0

population households median_income


12655 2237.0 706.0 2.1736
15502 2015.0 768.0 6.3373
2908 667.0 300.0 2.8750
14053 898.0 483.0 2.2264
20496 1837.0 580.0 4.4964

1.5.2 Handling Text based Attributes


Extracting only “ocean_proximity” label and saving it in ‘housing_cat’.

[31]: housing_cat = housing[["ocean_proximity"]]


housing_cat.head(10)

[31]: ocean_proximity
12655 INLAND
15502 NEAR OCEAN
2908 INLAND
14053 NEAR OCEAN
20496 <1H OCEAN
1481 NEAR BAY
18125 <1H OCEAN
5830 <1H OCEAN
17989 <1H OCEAN
4861 <1H OCEAN

Using the OrdinalEncoder class to encode each text category of the “ocean_proximity” column as a
number.

[32]: ordinal_encoder = OrdinalEncoder()


housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]

[32]: array([[1.],
[4.],
[1.],
[4.],
[0.],
[3.],
[0.],
[0.],
[0.],
[0.]])

Checking how the encoder has assigned these numbers to the categories. The first category in the list is
assigned the number 0, the next is assigned 1, and so on.

[33]: ordinal_encoder.categories_

[33]: [array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],


dtype=object)]

[34]: cat_encoder = OneHotEncoder()


housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

[34]: <16512x5 sparse matrix of type '<class 'numpy.float64'>'


with 16512 stored elements in Compressed Sparse Row format>

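An ordinal encoding implies an order between the categories that does not really exist. By using the
OneHotEncoder, we instead encode each category as a separate binary attribute, so that our model does
not infer any relation between the numbers assigned by the ordinal encoder.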

[35]: housing_cat_1hot.toarray()

[35]: array([[0., 1., 0., 0., 0.],


[0., 0., 0., 0., 1.],
[0., 1., 0., 0., 0.],
...,
[1., 0., 0., 0., 0.],
[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.]])

Alternatively, setting sparse_output=False makes the encoder return a dense NumPy array directly.

[36]: cat_encoder = OneHotEncoder(sparse_output=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

[36]: array([[0., 1., 0., 0., 0.],


[0., 0., 0., 0., 1.],
[0., 1., 0., 0., 0.],
...,
[1., 0., 0., 0., 0.],
[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.]])

[37]: cat_encoder.categories_

[37]: [array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],


dtype=object)]
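
To compare the two encodings side by side, the sketch below encodes four made-up category values (the array cats is an assumption for illustration) with both encoders.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Four made-up rows of a single categorical column.
cats = np.array([["INLAND"], ["NEAR BAY"], ["<1H OCEAN"], ["INLAND"]])

# Categories are sorted: '<1H OCEAN' -> 0, 'INLAND' -> 1, 'NEAR BAY' -> 2
ordinal = OrdinalEncoder().fit_transform(cats)
print(ordinal.ravel().tolist())  # [1.0, 2.0, 0.0, 1.0]

# One column per category, with a single 1 in each row.
onehot = OneHotEncoder().fit_transform(cats).toarray()
print(onehot.tolist())
# [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
```

With the ordinal codes a model could treat 'NEAR BAY' (2) as twice 'INLAND' (1); the one-hot columns carry no such ordering.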

1.5.3 Custom Transformers


We can also create our own transformers to add new attributes.

[38]: rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self # nothing else to do
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

Recovering the dataframe because in ‘housing_extra_attribs’ we lost the column names.

[39]: housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns)+["rooms_per_household", "population_per_household"],
    index=housing.index)
housing_extra_attribs.head()

[39]: longitude latitude housing_median_age total_rooms total_bedrooms \


12655 -121.46 38.52 29.0 3873.0 797.0
15502 -117.23 33.09 7.0 5320.0 855.0
2908 -119.04 35.37 44.0 1618.0 310.0
14053 -117.13 32.75 24.0 1877.0 519.0
20496 -118.7 34.28 27.0 3536.0 646.0

population households median_income ocean_proximity rooms_per_household \


12655 2237.0 706.0 2.1736 INLAND 5.485836
15502 2015.0 768.0 6.3373 NEAR OCEAN 6.927083
2908 667.0 300.0 2.875 INLAND 5.393333
14053 898.0 483.0 2.2264 NEAR OCEAN 3.886128
20496 1837.0 580.0 4.4964 <1H OCEAN 6.096552

population_per_household
12655 3.168555
15502 2.623698
2908 2.223333
14053 1.859213
20496 3.167241

1.5.4 Transformation Pipelines


We can create pipelines that handle the numerical and text-based attributes without human
intervention.

[40]: num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

housing_num_tr = num_pipeline.fit_transform(housing_num)
housing_num_tr

[40]: array([[-0.94135046, 1.34743822, 0.02756357, ..., 0.01739526,


0.00622264, -0.12112176],
[ 1.17178212, -1.19243966, -1.72201763, ..., 0.56925554,
-0.04081077, -0.81086696],
[ 0.26758118, -0.1259716 , 1.22045984, ..., -0.01802432,
-0.07537122, -0.33827252],
...,
[-1.5707942 , 1.31001828, 1.53856552, ..., -0.5092404 ,
-0.03743619, 0.32286937],
[-1.56080303, 1.2492109 , -1.1653327 , ..., 0.32814891,
-0.05915604, -0.45702273],

[-1.28105026, 2.02567448, -0.13148926, ..., 0.01407228,
0.00657083, -0.12169672]])
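
The last step of the pipeline, StandardScaler, subtracts the column mean and divides by the column standard deviation. The sketch below checks this on a made-up 3x2 array (the values are assumptions for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

scaled = StandardScaler().fit_transform(X_toy)

# Same result by hand: z = (x - mean) / std, computed per column.
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(scaled)
print(np.allclose(scaled, manual))  # True
```

After scaling, every column has zero mean and unit variance, so attributes with large raw values no longer dominate the others.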

Now we can add the text-based attribute to our pipeline.

[41]: num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared

[41]: array([[-0.94135046, 1.34743822, 0.02756357, ..., 0. ,


0. , 0. ],
[ 1.17178212, -1.19243966, -1.72201763, ..., 0. ,
0. , 1. ],
[ 0.26758118, -0.1259716 , 1.22045984, ..., 0. ,
0. , 0. ],
...,
[-1.5707942 , 1.31001828, 1.53856552, ..., 0. ,
0. , 0. ],
[-1.56080303, 1.2492109 , -1.1653327 , ..., 0. ,
0. , 0. ],
[-1.28105026, 2.02567448, -0.13148926, ..., 0. ,
0. , 0. ]])

[42]: housing_prepared.shape

[42]: (16512, 16)
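
The 16 columns are the 8 numerical attributes, the 3 attributes added by CombinedAttributesAdder, and the 5 one-hot columns for “ocean_proximity”. The sketch below shows the same bookkeeping on a toy DataFrame. The DataFrame toy, its column names and the simplified numerical pipeline (no attribute adder) are assumptions for illustration; sparse_threshold=0 is set so that the result is always a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: two numerical columns and one categorical column.
toy = pd.DataFrame({
    "rooms": [4.0, 6.0, np.nan, 8.0],
    "income": [2.0, 3.5, 5.0, 7.5],
    "proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

toy_num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

toy_full_pipeline = ColumnTransformer([
    ("num", toy_num_pipeline, ["rooms", "income"]),
    ("cat", OneHotEncoder(), ["proximity"]),
], sparse_threshold=0)  # always return a dense array

toy_prepared = toy_full_pipeline.fit_transform(toy)

# 2 numerical columns + 3 one-hot columns (one per category) = 5
print(toy_prepared.shape)  # (4, 5)
```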

1.6 Selecting and Training a Model


1.6.1 Training and Evaluating the Model

[43]: lin_reg = LinearRegression()


lin_reg.fit(housing_prepared, housing_labels)

[43]: LinearRegression()

Testing our preprocessing pipeline on some training instances.

[44]: some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", lin_reg.predict(some_data_prepared))

Predictions: [ 85657.90192014 305492.60737488 152056.46122456 186095.70946094


244550.67966089]
Comparing with original values.

[45]: print("Labels:", list(some_labels))

Labels: [72100.0, 279600.0, 82700.0, 112500.0, 238300.0]

[46]: some_data_prepared

[46]: array([[-0.94135046, 1.34743822, 0.02756357, 0.58477745, 0.64037127,


0.73260236, 0.55628602, -0.8936472 , 0.01739526, 0.00622264,
-0.12112176, 0. , 1. , 0. , 0. ,
0. ],
[ 1.17178212, -1.19243966, -1.72201763, 1.26146668, 0.78156132,
0.53361152, 0.72131799, 1.292168 , 0.56925554, -0.04081077,
-0.81086696, 0. , 0. , 0. , 0. ,
1. ],
[ 0.26758118, -0.1259716 , 1.22045984, -0.46977281, -0.54513828,
-0.67467519, -0.52440722, -0.52543365, -0.01802432, -0.07537122,
-0.33827252, 0. , 1. , 0. , 0. ,
0. ],
[ 1.22173797, -1.35147437, -0.37006852, -0.34865152, -0.03636724,
-0.46761716, -0.03729672, -0.86592882, -0.59513997, -0.10680295,
0.96120521, 0. , 0. , 0. , 0. ,
1. ],
[ 0.43743108, -0.63581817, -0.13148926, 0.42717947, 0.27279028,
0.37406031, 0.22089846, 0.32575178, 0.2512412 , 0.00610923,
-0.47451338, 1. , 0. , 0. , 0. ,
0. ]])

[47]: housing_predictions = lin_reg.predict(housing_prepared)


lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

[47]: 68627.87390018745

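Finding the mean absolute error using the mean_absolute_error() function. Both errors are large
compared with the house values in the dataset, which suggests that something is wrong: most likely the
linear model is underfitting the training data.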

[48]: lin_mae = mean_absolute_error(housing_labels, housing_predictions)


lin_mae

[48]: 49438.66860915801
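
RMSE is larger than MAE here because squaring gives more weight to large errors. The sketch below works through both metrics on four made-up predictions (the arrays y_true and y_pred are assumptions for illustration), one of which is far off.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions; the last prediction is far off.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 300.0])

# Absolute errors: 10, 10, 10, 100 -> MAE = 130 / 4 = 32.5
mae = mean_absolute_error(y_true, y_pred)

# Squared errors: 100, 100, 100, 10000 -> MSE = 10300 / 4 = 2575
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print("MAE:", mae)    # 32.5
print("RMSE:", rmse)  # about 50.74
```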

[49]: tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)

[49]: DecisionTreeRegressor(random_state=42)

We are getting zero error on the training set. This strongly suggests that the model has overfitted the
training data.

[50]: housing_predictions = tree_reg.predict(housing_prepared)


tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

[50]: 0.0
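
A zero training error says nothing about performance on unseen data. The sketch below reproduces the effect on synthetic data (the data, noise level and variable names are assumptions for illustration): an unrestricted decision tree memorizes the training points but makes clear errors on a held-out validation set.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy linear data.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(400, 1))
y_syn = 3.0 * X_syn[:, 0] + rng.normal(0, 2.0, size=400)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42)

# No depth limit: the tree can grow one leaf per training point.
tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)

train_rmse = np.sqrt(mean_squared_error(y_tr, tree.predict(X_tr)))
val_rmse = np.sqrt(mean_squared_error(y_val, tree.predict(X_val)))

print("training RMSE:", train_rmse)
print("validation RMSE:", val_rmse)
```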

1.6.2 Authentic Evaluation via Cross Validation


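The cross_val_score() function performs K-fold cross-validation: with cv=10 it splits the training set
into 10 folds, then trains the model 10 times, each time evaluating on a different fold and training on
the other 9. It returns negative MSE scores, so we negate them before taking the square root. The mean
and standard deviation of the resulting RMSE scores estimate the model's performance and how precise
that estimate is.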

[51]: scores = cross_val_score(tree_reg, housing_prepared, housing_labels,


scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

[52]: def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)

Scores: [72831.45749112 69973.18438322 69528.56551415 72517.78229792


69145.50006909 79094.74123727 68960.045444 73344.50225684
69826.02473916 71077.09753998]
Mean: 71629.89009727491
Standard deviation: 2914.035468468928
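
The sign convention of the "neg_mean_squared_error" scoring can be seen on a small synthetic problem. In the sketch below the data and coefficients are assumptions for illustration; the noise has standard deviation 1, so the RMSE of a well-fitted linear model should be close to 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic linear data with unit-variance noise.
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(200, 3))
y_syn = X_syn @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 1.0, size=200)

# scikit-learn scorers follow "greater is better", so MSE is returned negated.
neg_mse_scores = cross_val_score(LinearRegression(), X_syn, y_syn,
                                 scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-neg_mse_scores)

print("Scores:", rmse_scores)
print("Mean:", rmse_scores.mean())
print("Standard deviation:", rmse_scores.std())
```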

[53]: lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,


scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

Scores: [71762.76364394 64114.99166359 67771.17124356 68635.19072082


66846.14089488 72528.03725385 73997.08050233 68802.33629334
66443.28836884 70139.79923956]
Mean: 69104.07998247063
Standard deviation: 2880.3282098180666

[54]: forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)


forest_reg.fit(housing_prepared, housing_labels)

[54]: RandomForestRegressor(random_state=42)

[55]: housing_predictions = forest_reg.predict(housing_prepared)


forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

[55]: 18650.698705770003

[56]: forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,


scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

Scores: [51559.63379638 48737.57100062 47210.51269766 51875.21247297


47577.50470123 51863.27467888 52746.34645573 50065.1762751
48664.66818196 54055.90894609]
Mean: 50435.58092066179
Standard deviation: 2203.3381412764606

[57]: scores = cross_val_score(lin_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
pd.Series(np.sqrt(-scores)).describe()

[57]: count 10.000000


mean 69104.079982
std 3036.132517
min 64114.991664
25% 67077.398482
50% 68718.763507
75% 71357.022543
max 73997.080502
dtype: float64

[62]: svm_reg = SVR(kernel="linear")


svm_reg.fit(housing_prepared, housing_labels)
housing_predictions = svm_reg.predict(housing_prepared)
svm_mse = mean_squared_error(housing_labels, housing_predictions)
svm_rmse = np.sqrt(svm_mse)
svm_rmse

[62]: 111095.06635291968

1.7 Fine-Tuning the Model
1.7.1 Searching: By Grids

[64]: param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

[64]: GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),


param_grid=[{'max_features': [2, 4, 6, 8],
'n_estimators': [3, 10, 30]},
{'bootstrap': [False], 'max_features': [2, 3, 4],
'n_estimators': [3, 10]}],
return_train_score=True, scoring='neg_mean_squared_error')

[65]: grid_search.best_params_

[65]: {'max_features': 8, 'n_estimators': 30}

[66]: grid_search.best_estimator_

[66]: RandomForestRegressor(max_features=8, n_estimators=30, random_state=42)

[67]: cvres = grid_search.cv_results_


for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
print(np.sqrt(-mean_score), params)

63895.161577951665 {'max_features': 2, 'n_estimators': 3}


54916.32386349543 {'max_features': 2, 'n_estimators': 10}
52885.86715332332 {'max_features': 2, 'n_estimators': 30}
60075.3680329983 {'max_features': 4, 'n_estimators': 3}
52495.01284985185 {'max_features': 4, 'n_estimators': 10}
50187.24324926565 {'max_features': 4, 'n_estimators': 30}
58064.73529982314 {'max_features': 6, 'n_estimators': 3}
51519.32062366315 {'max_features': 6, 'n_estimators': 10}
49969.80441627874 {'max_features': 6, 'n_estimators': 30}
58895.824998155826 {'max_features': 8, 'n_estimators': 3}
52459.79624724529 {'max_features': 8, 'n_estimators': 10}
49898.98913455217 {'max_features': 8, 'n_estimators': 30}

62381.765106921855 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54476.57050944266 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59974.60028085155 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52754.5632813202 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
57831.136061214274 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51278.37877140253 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
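
The 18 rows above correspond to the 12 + 6 hyperparameter combinations in the grid. As a quick check, the sketch below counts them with scikit-learn's ParameterGrid and multiplies by the number of folds; the variable names n_combinations and n_fits are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

# Each dict is expanded separately: 3*4 + 1*2*3 = 18 combinations.
n_combinations = len(ParameterGrid(param_grid))

# With cv=5, every combination is trained once per fold.
n_fits = n_combinations * 5

print(n_combinations)  # 18
print(n_fits)          # 90
```

With the default refit=True, GridSearchCV also trains the best combination once more on the whole training set after the search.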

[68]: pd.DataFrame(grid_search.cv_results_)

[68]: mean_fit_time std_fit_time mean_score_time std_score_time \


0 0.434555 0.012872 0.007660 0.005015
1 1.459154 0.026740 0.013123 0.004191
2 4.203955 0.022398 0.040018 0.000665
3 0.778874 0.039577 0.009652 0.000805
4 2.373583 0.038118 0.019650 0.000789
5 7.109145 0.016872 0.031455 0.000967
6 0.997442 0.028853 0.002009 0.004019
7 3.307111 0.020813 0.013660 0.005255
8 10.074654 0.061280 0.036148 0.005023
9 1.318197 0.024051 0.002833 0.002719
10 4.385539 0.028792 0.013332 0.003516
11 13.200334 0.100890 0.037406 0.002714
12 0.523636 0.003927 0.003608 0.004462
13 1.745432 0.017430 0.014061 0.004903
14 0.698562 0.013071 0.006027 0.004200
15 2.331028 0.030022 0.012862 0.003942
16 0.892025 0.013723 0.004430 0.004654
17 2.955679 0.024702 0.015503 0.004075

param_max_features param_n_estimators param_bootstrap \


0 2 3 NaN
1 2 10 NaN
2 2 30 NaN
3 4 3 NaN
4 4 10 NaN
5 4 30 NaN
6 6 3 NaN
7 6 10 NaN
8 6 30 NaN
9 8 3 NaN
10 8 10 NaN
11 8 30 NaN
12 2 3 False
13 2 10 False
14 3 3 False
15 3 10 False
16 4 3 False
17 4 10 False

params split0_test_score \
0 {'max_features': 2, 'n_estimators': 3} -4.119912e+09
1 {'max_features': 2, 'n_estimators': 10} -2.973521e+09
2 {'max_features': 2, 'n_estimators': 30} -2.801229e+09
3 {'max_features': 4, 'n_estimators': 3} -3.528743e+09
4 {'max_features': 4, 'n_estimators': 10} -2.742620e+09
5 {'max_features': 4, 'n_estimators': 30} -2.522176e+09
6 {'max_features': 6, 'n_estimators': 3} -3.362127e+09
7 {'max_features': 6, 'n_estimators': 10} -2.622099e+09
8 {'max_features': 6, 'n_estimators': 30} -2.446142e+09
9 {'max_features': 8, 'n_estimators': 3} -3.590333e+09
10 {'max_features': 8, 'n_estimators': 10} -2.721311e+09
11 {'max_features': 8, 'n_estimators': 30} -2.492636e+09
12 {'bootstrap': False, 'max_features': 2, 'n_est... -4.020842e+09
13 {'bootstrap': False, 'max_features': 2, 'n_est... -2.901352e+09
14 {'bootstrap': False, 'max_features': 3, 'n_est... -3.687132e+09
15 {'bootstrap': False, 'max_features': 3, 'n_est... -2.837028e+09
16 {'bootstrap': False, 'max_features': 4, 'n_est... -3.549428e+09
17 {'bootstrap': False, 'max_features': 4, 'n_est... -2.692499e+09

split1_test_score ... mean_test_score std_test_score rank_test_score \


0 -3.723465e+09 ... -4.082592e+09 1.867375e+08 18
1 -2.810319e+09 ... -3.015803e+09 1.139808e+08 11
2 -2.671474e+09 ... -2.796915e+09 7.980892e+07 9
3 -3.490303e+09 ... -3.609050e+09 1.375683e+08 16
4 -2.609311e+09 ... -2.755726e+09 1.182604e+08 7
5 -2.440241e+09 ... -2.518759e+09 8.488084e+07 3
6 -3.311863e+09 ... -3.371513e+09 1.378086e+08 13
7 -2.669655e+09 ... -2.654240e+09 6.967978e+07 5
8 -2.446594e+09 ... -2.496981e+09 7.357046e+07 2
9 -3.232664e+09 ... -3.468718e+09 1.293758e+08 14
10 -2.675886e+09 ... -2.752030e+09 6.258030e+07 6
11 -2.444818e+09 ... -2.489909e+09 7.086483e+07 1
12 -3.951861e+09 ... -3.891485e+09 8.648595e+07 17
13 -3.036875e+09 ... -2.967697e+09 4.582448e+07 10
14 -3.446245e+09 ... -3.596953e+09 8.011960e+07 15
15 -2.619558e+09 ... -2.783044e+09 8.862580e+07 8
16 -3.318176e+09 ... -3.344440e+09 1.099355e+08 12
17 -2.542704e+09 ... -2.629472e+09 8.510266e+07 4

split0_train_score split1_train_score split2_train_score \


0 -1.155630e+09 -1.089726e+09 -1.153843e+09
1 -5.982947e+08 -5.904781e+08 -6.123850e+08
2 -4.412567e+08 -4.326398e+08 -4.553722e+08
3 -9.782368e+08 -9.806455e+08 -1.003780e+09
4 -5.063215e+08 -5.257983e+08 -5.081984e+08
5 -3.776568e+08 -3.902106e+08 -3.885042e+08
6 -8.909397e+08 -9.583733e+08 -9.000201e+08
7 -4.939906e+08 -5.145996e+08 -5.023512e+08
8 -3.760968e+08 -3.876636e+08 -3.875307e+08
9 -9.505012e+08 -9.166119e+08 -9.033910e+08
10 -4.998373e+08 -4.997970e+08 -5.099880e+08
11 -3.801679e+08 -3.832972e+08 -3.823818e+08
12 -0.000000e+00 -4.306828e+01 -1.051392e+04
13 -0.000000e+00 -3.876145e+00 -9.462528e+02
14 -0.000000e+00 -0.000000e+00 -0.000000e+00
15 -0.000000e+00 -0.000000e+00 -0.000000e+00
16 -0.000000e+00 -0.000000e+00 -0.000000e+00
17 -0.000000e+00 -0.000000e+00 -0.000000e+00

split3_train_score split4_train_score mean_train_score std_train_score


0 -1.118149e+09 -1.093446e+09 -1.122159e+09 2.834288e+07
1 -5.727681e+08 -5.905210e+08 -5.928894e+08 1.284978e+07
2 -4.320746e+08 -4.311606e+08 -4.385008e+08 9.184397e+06
3 -1.016515e+09 -1.011270e+09 -9.980896e+08 1.577372e+07
4 -5.174405e+08 -5.282066e+08 -5.171931e+08 8.882622e+06
5 -3.830866e+08 -3.894779e+08 -3.857872e+08 4.774229e+06
6 -8.964731e+08 -9.151927e+08 -9.121998e+08 2.444837e+07
7 -4.959467e+08 -5.147087e+08 -5.043194e+08 8.880106e+06
8 -3.760938e+08 -3.861056e+08 -3.826981e+08 5.418747e+06
9 -9.070642e+08 -9.459386e+08 -9.247014e+08 1.973471e+07
10 -5.047868e+08 -5.348043e+08 -5.098427e+08 1.303601e+07
11 -3.778452e+08 -3.817589e+08 -3.810902e+08 1.916605e+06
12 -0.000000e+00 -0.000000e+00 -2.111398e+03 4.201294e+03
13 -0.000000e+00 -0.000000e+00 -1.900258e+02 3.781165e+02
14 -0.000000e+00 -0.000000e+00 0.000000e+00 0.000000e+00
15 -0.000000e+00 -0.000000e+00 0.000000e+00 0.000000e+00
16 -0.000000e+00 -0.000000e+00 0.000000e+00 0.000000e+00
17 -0.000000e+00 -0.000000e+00 0.000000e+00 0.000000e+00

[18 rows x 23 columns]
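
The `mean_test_score` column holds negative mean squared errors (the scoring is `neg_mean_squared_error`), so the square root of the negated value gives the RMSE in dollars. A minimal sketch of that conversion, using only the standard library and the three top-ranked values copied (as rounded in the table) from rows 11, 8 and 5; the names `mean_test_scores` and `rmse_by_params` are illustrative and do not appear in the notebook:

```python
import math

# mean_test_score values (negative MSE) copied from rows 11, 8 and 5 of the
# table above, rounded as displayed there
mean_test_scores = {
    "max_features=8, n_estimators=30": -2.489909e+09,
    "max_features=6, n_estimators=30": -2.496981e+09,
    "max_features=4, n_estimators=30": -2.518759e+09,
}

# RMSE = sqrt(-negative_MSE)
rmse_by_params = {
    name: math.sqrt(-score) for name, score in mean_test_scores.items()
}

# The lowest RMSE corresponds to rank_test_score == 1
best_params = min(rmse_by_params, key=rmse_by_params.get)

for name, rmse in sorted(rmse_by_params.items(), key=lambda kv: kv[1]):
    print(f"{rmse:10.2f}  {name}")
```

The resulting values should agree, up to rounding, with the RMSE figures printed for the same parameter combinations in the grid search listing.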

1.8 Searching: Randomized


[69]: param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5,
                                scoring='neg_mean_squared_error', random_state=42)

rnd_search.fit(housing_prepared, housing_labels)

[69]: RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),
                         param_distributions={'max_features': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x000001BE2B8F0C90>,
                                              'n_estimators': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x000001BE3EEDAAD0>},
                         random_state=42, scoring='neg_mean_squared_error')

[70]: cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

49117.55344336652 {'max_features': 7, 'n_estimators': 180}
51450.63202856348 {'max_features': 5, 'n_estimators': 15}
50692.53588182537 {'max_features': 3, 'n_estimators': 72}
50783.614493515 {'max_features': 5, 'n_estimators': 21}
49162.89877456354 {'max_features': 7, 'n_estimators': 122}
50655.798471042704 {'max_features': 3, 'n_estimators': 75}
50513.856319990606 {'max_features': 3, 'n_estimators': 88}
49521.17201976928 {'max_features': 5, 'n_estimators': 100}
50302.90440763418 {'max_features': 3, 'n_estimators': 150}
65167.02018649492 {'max_features': 5, 'n_estimators': 2}
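
`scipy.stats.randint(low, high)` draws integers from `low` up to, but not including, `high`, so the search above samples `n_estimators` from 1 to 199 and `max_features` from 1 to 7. The sketch below illustrates that sampling idea with the standard library's `random.randrange`, which uses the same half-open convention. It is a conceptual stand-in only: the helper `sample_params` is hypothetical, and the draws will not match those made by `RandomizedSearchCV`.

```python
import random

# Seeded for reproducibility; these draws differ from scikit-learn's own sampler
rng = random.Random(42)


def sample_params(rng):
    """Draw one hypothetical candidate from the same half-open integer ranges."""
    return {
        "n_estimators": rng.randrange(1, 200),  # integers 1..199
        "max_features": rng.randrange(1, 8),    # integers 1..7
    }


# n_iter=10 in the search above means ten candidates are drawn and evaluated
candidates = [sample_params(rng) for _ in range(10)]

for params in candidates:
    print(params)
```

Unlike the grid search, the number of candidates is fixed by `n_iter` rather than by the size of the parameter grid, which is why values such as 180 or 122 estimators can be tried without enumerating every combination.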

1.8.1 Analyzing the Best Model

[71]: feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

[71]: array([6.96542523e-02, 6.04213840e-02, 4.21882202e-02, 1.52450557e-02,
             1.55545295e-02, 1.58491147e-02, 1.49346552e-02, 3.79009225e-01,
             5.47789150e-02, 1.07031322e-01, 4.82031213e-02, 6.79266007e-03,
             1.65706303e-01, 7.83480660e-05, 1.52473276e-03, 3.02816106e-03])

[72]: extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
#cat_encoder = cat_pipeline.named_steps["cat_encoder"] # old solution
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)

[72]: [(0.3790092248170967, 'median_income'),
(0.16570630316895876, 'INLAND'),
(0.10703132208204355, 'pop_per_hhold'),
(0.06965425227942929, 'longitude'),
(0.0604213840080722, 'latitude'),
(0.054778915018283726, 'rooms_per_hhold'),
(0.048203121338269206, 'bedrooms_per_room'),
(0.04218822024391753, 'housing_median_age'),
(0.015849114744428634, 'population'),
(0.015554529490469328, 'total_bedrooms'),
(0.01524505568840977, 'total_rooms'),
(0.014934655161887772, 'households'),
(0.006792660074259966, '<1H OCEAN'),
(0.0030281610628962747, 'NEAR OCEAN'),
(0.0015247327555504937, 'NEAR BAY'),
(7.834806602687504e-05, 'ISLAND')]
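
One way to read this ranking is to ask how many attributes are needed to account for most of the total importance. The sketch below copies the sorted pairs from the output above and accumulates them until a cutoff is reached. The 0.90 threshold and the names `ranked`, `selected` and `cumulative` are illustrative assumptions, not part of the report.

```python
# (importance, attribute) pairs copied from the sorted output above
ranked = [
    (0.3790092248170967, "median_income"),
    (0.16570630316895876, "INLAND"),
    (0.10703132208204355, "pop_per_hhold"),
    (0.06965425227942929, "longitude"),
    (0.0604213840080722, "latitude"),
    (0.054778915018283726, "rooms_per_hhold"),
    (0.048203121338269206, "bedrooms_per_room"),
    (0.04218822024391753, "housing_median_age"),
    (0.015849114744428634, "population"),
    (0.015554529490469328, "total_bedrooms"),
    (0.01524505568840977, "total_rooms"),
    (0.014934655161887772, "households"),
    (0.006792660074259966, "<1H OCEAN"),
    (0.0030281610628962747, "NEAR OCEAN"),
    (0.0015247327555504937, "NEAR BAY"),
    (7.834806602687504e-05, "ISLAND"),
]

threshold = 0.90  # illustrative cutoff, not taken from the report

# Walk down the ranking until the running total reaches the cutoff
selected = []
cumulative = 0.0
for importance, name in ranked:
    selected.append(name)
    cumulative += importance
    if cumulative >= threshold:
        break

total_importance = sum(importance for importance, _ in ranked)

print(selected)
print(round(cumulative, 4), round(total_importance, 4))
```

Because random forest importances sum to 1, the running total can be read directly as a share of the model's total importance.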

1.8.2 Evaluating the Model on a Test Set

[73]: final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse

[73]: 47873.26095812988

[74]: confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))

[74]: array([45893.36082829, 49774.46796717])

[75]: m = len(squared_errors)
mean = squared_errors.mean()
tscore = stats.t.ppf((1 + confidence) / 2, df=m - 1)
tmargin = tscore * squared_errors.std(ddof=1) / np.sqrt(m)
np.sqrt(mean - tmargin), np.sqrt(mean + tmargin)

[75]: (45893.360828285535, 49774.46796717361)

[76]: zscore = stats.norm.ppf((1 + confidence) / 2)
zmargin = zscore * squared_errors.std(ddof=1) / np.sqrt(m)
np.sqrt(mean - zmargin), np.sqrt(mean + zmargin)

[76]: (45893.9540110131, 49773.921030650374)
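
The normal-approximation interval in the last cell can be reproduced without SciPy. The sketch below follows the same steps (mean of the squared errors, a z-score margin, then a square root of each bound) using only the standard library. The `errors` list is hypothetical and is not drawn from the report's test set, and the `max(..., 0.0)` guard is an addition for very small samples that the notebook does not need.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical prediction errors in dollars (NOT the report's test-set errors)
errors = [12000.0, -35000.0, 48000.0, -5000.0, 61000.0, -27000.0, 9000.0, -52000.0]
squared_errors = [e ** 2 for e in errors]

confidence = 0.95
m = len(squared_errors)
mean_se = mean(squared_errors)

# Two-sided z-score for the chosen confidence level (about 1.96 for 95%)
zscore = NormalDist().inv_cdf((1 + confidence) / 2)

# statistics.stdev is the sample standard deviation, i.e. ddof=1
zmargin = zscore * stdev(squared_errors) / math.sqrt(m)

rmse = math.sqrt(mean_se)
# Guard against a negative lower bound, which can occur on tiny samples
lower = math.sqrt(max(mean_se - zmargin, 0.0))
upper = math.sqrt(mean_se + zmargin)

print(lower, rmse, upper)
```

On the report's test set the t-based and z-based intervals differ by less than a dollar, which is expected because the t distribution approaches the normal distribution as the number of samples grows.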
