Faseeh Chap 2 Report
Submitted by
Section: A
[63]: import os
import tarfile
from six.moves import urllib
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from zlib import crc32
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
from scipy import stats
from sklearn.svm import SVR
1.1.2 Fetching Data from the Web and Extracting It to a Custom Directory
This part of the code is commented out because the data is read directly from a file stored locally.
# HOUSING_URL and HOUSING_PATH were defined in an earlier cell (not shown);
# the conventional book value for the path is assumed here.
HOUSING_PATH = os.path.join("datasets", "housing")

#def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
#    if not os.path.isdir(housing_path):
#        os.makedirs(housing_path)
#    tgz_path = os.path.join(housing_path, "housing.tgz")
#    urllib.request.urlretrieve(housing_url, tgz_path)
#    housing_tgz = tarfile.open(tgz_path)
#    housing_tgz.extractall(path=housing_path)
#    housing_tgz.close()
#fetch_housing_data()

csv_path = os.path.join(HOUSING_PATH, "housing.csv")  # local copy of the dataset
housing = pd.read_csv(csv_path)
housing.head()
Using the housing.info() function gives a brief overview of the dataset, including the attributes, the total number of instances, and their datatypes. We also learn that the “total_bedrooms” attribute has 207 missing values.
[5]: housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
To see all the categories of the “ocean_proximity” attribute we used the value_counts() function, which also reports how many instances belong to each category.
[6]: housing["ocean_proximity"].value_counts()
[6]: ocean_proximity
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: count, dtype: int64
Using the describe() function we get a statistical summary of the numerical attributes: count is the number of non-null instances, max and min are the largest and smallest values, std is the standard deviation, and the 25%, 50%, and 75% rows are the corresponding percentiles.
[7]: housing.describe()
total_bedrooms population households median_income
mean 537.870553 1425.476744 499.539680 3.870671
std 421.385070 1132.462122 382.329753 1.899822
min 1.000000 3.000000 1.000000 0.499900
25% 296.000000 787.000000 280.000000 2.563400
50% 435.000000 1166.000000 409.000000 3.534800
75% 647.000000 1725.000000 605.000000 4.743250
max 6445.000000 35682.000000 6082.000000 15.000100
median_house_value
count 20640.000000
mean 206855.816909
std 115395.615874
min 14999.000000
25% 119600.000000
50% 179700.000000
75% 264725.000000
max 500001.000000
1.2 Creating a Test Set
To create a test set we can split the dataset in several ways. Selecting instances purely at random works, but if the split changes on every run the model will eventually get to see the whole dataset, which is exactly what we want to avoid. A better option is to use a hash of each instance’s identifier: the test set then stays consistent across multiple runs, even if the dataset is refreshed. The new test set will contain 20% of the instances, and it will never contain an instance that was previously in the training set.
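A sketch of the hash-based split described above, consistent with the crc32 import in the first cell:
def test_set_check(identifier, test_ratio):
    # keep the instance in the test set if the hash of its identifier falls
    # in the lowest test_ratio fraction of the 32-bit hash space
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]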
#housing_with_id = housing.reset_index()  # adds an `index` column to use as the identifier
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
print(len(train_set))
print(len(test_set))
16512
4128
Next we create a new attribute, “income_cat”. Using the Pandas cut() function we bin the median income values into groups and plot the histogram.
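The binning cell is not visible in the export; the usual five-bucket split (assumed here, following the book) is:
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])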
housing["income_cat"].hist()
To get a representative split, the train and test sets should preserve the distribution of the income categories found in the original dataset. StratifiedShuffleSplit() achieves this by shuffling the data randomly while maintaining the relative proportions of the categories in each split.
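The split cell itself is missing from the export; the standard pattern, whose test-set proportions appear as the [11] output below, is:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

strat_test_set["income_cat"].value_counts() / len(strat_test_set)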
[11]: income_cat
3 0.350533
2 0.318798
4 0.176357
5 0.114341
1 0.039971
Name: count, dtype: float64
Plotting the population as the radius of the circles while using color to indicate “median_house_value”, with red being the highest and blue the lowest.
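The plotting cell is not visible in the export; a sketch consistent with the description (the "jet" colormap gives the red-to-blue scale):
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"] / 100, label="population", figsize=(10, 7),
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True)
plt.legend()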
1.4 Correlations
First we drop the “ocean_proximity” attribute so that we can use the corr() function, which only accepts numerical types (int and float), not strings.
population households median_income median_house_value
12655 2237.0 706.0 2.1736 72100.0
15502 2015.0 768.0 6.3373 279600.0
2908 667.0 300.0 2.8750 82700.0
14053 898.0 483.0 2.2264 112500.0
20496 1837.0 580.0 4.4964 238300.0
Using the corr() function to compute correlations between all attributes in the dataset, then sorting the values for “median_house_value” in descending order and displaying them.
By plotting only the correlation between “median_income” and “median_house_value” we can see a fairly linear relationship: an increase in income corresponds to an increase in house value.
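A sketch of the plotting cell (not visible in the export):
housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)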
1.4.2 Combining Labels
We can also combine some attributes and check whether the newly created attributes correlate more strongly with “median_house_value” than the originals.
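The cell creating the combined attributes is not shown; the standard ratios, matching the names in the output below, would be:
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]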
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
latitude -0.142673
bedrooms_per_room -0.259952
Name: median_house_value, dtype: float64
Instead of removing all instances in which “total_bedrooms” is null, or dropping the “total_bedrooms” column entirely, we computed the median value of this column and used it to fill in the missing values.
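Before handling missing values, the predictors and labels are separated. This cell is missing from the export; the split below follows the book, and the definition of sample_incomplete_rows is an assumption consistent with its use in cell [23]:
housing = strat_train_set.drop("median_house_value", axis=1)   # predictors
housing_labels = strat_train_set["median_house_value"].copy()  # labels

# assumed: collect a few rows that contain nulls, to inspect before and after
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()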
[23]: sample_incomplete_rows
4186 835.0 294.0 4.2891 <1H OCEAN
16885 1145.0 480.0 6.3580 NEAR OCEAN
Using the SimpleImputer class with the “median” strategy and naming the instance ‘imputer’. We also drop the “ocean_proximity” attribute, because the imputer can only handle numerical attributes.
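A sketch of those two steps (the cell itself is not visible in the export):
imputer = SimpleImputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)  # numerical attributes only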
[26]: imputer.fit(housing_num)
[26]: SimpleImputer(strategy='median')
[27]: imputer.statistics_
Transforming the dataset from Pandas DataFrame format to NumPy array format:
[28]: X = imputer.transform(housing_num)
X
Converting the array back to DataFrame format and saving it in the ‘housing_tr’ variable. We can now display the instances that previously had null values using the ‘sample_incomplete_rows’ variable.
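A sketch of that cell, whose result appears as the [29] output below:
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)
housing_tr.loc[sample_incomplete_rows.index.values]  # the formerly null rows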
[29]: longitude latitude housing_median_age total_rooms total_bedrooms \
1606 -122.08 37.88 26.0 2947.0 433.0
10915 -117.87 33.73 45.0 2264.0 433.0
19150 -122.70 38.35 14.0 2313.0 433.0
4186 -118.23 34.13 48.0 1308.0 433.0
16885 -122.40 37.58 26.0 3281.0 433.0
[30]: housing_tr.head()
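The [31] output below shows the categorical attribute on its own; the selecting cell was presumably:
housing_cat = housing[["ocean_proximity"]]
housing_cat.head(10)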
[31]: ocean_proximity
12655 INLAND
15502 NEAR OCEAN
2908 INLAND
14053 NEAR OCEAN
20496 <1H OCEAN
1481 NEAR BAY
18125 <1H OCEAN
5830 <1H OCEAN
17989 <1H OCEAN
4861 <1H OCEAN
Using the OrdinalEncoder class to encode the text categories of the “ocean_proximity” attribute as numbers.
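The encoding cell itself is not visible; presumably:
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]  # first ten encoded values, shown as the [32] output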
[32]: array([[1.],
[4.],
[1.],
[4.],
[0.],
[3.],
[0.],
[0.],
[0.],
[0.]])
Checking how the encoder has assigned numbers to the categories: the first category is assigned 0, the next 1, and so on.
[33]: ordinal_encoder.categories_
By using the OneHotEncoder we can instead encode the categories in binary one-hot form, so that our model does not infer any ordering between the previously encoded categories.
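The first one-hot cell is missing from the export; presumably:
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot  # a SciPy sparse matrix by default, hence the .toarray() call below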
[35]: housing_cat_1hot.toarray()
[36]: cat_encoder = OneHotEncoder(sparse_output=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot
[37]: cat_encoder.categories_
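The next cell uses a custom transformer, CombinedAttributesAdder, whose definition was cut from the export. A minimal sketch, assuming the standard column order of the housing predictors:
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6  # column indices

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        # derive the ratio attributes from the raw counts
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]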
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns) + ["rooms_per_household", "population_per_household"],
    index=housing.index)
housing_extra_attribs.head()
population_per_household
12655 3.168555
15502 2.623698
2908 2.223333
14053 1.859213
20496 3.167241
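The pipeline definition is missing from the export; a sketch consistent with the imports and the transformer above:
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),   # fill nulls with the median
    ('attribs_adder', CombinedAttributesAdder()),    # add the ratio attributes
    ('std_scaler', StandardScaler()),                # standardize all features
])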
housing_num_tr = num_pipeline.fit_transform(housing_num)
housing_num_tr
[-1.28105026, 2.02567448, -0.13148926, ..., 0.01407228,
0.00657083, -0.12169672]])
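The column lists fed to the ColumnTransformer below are not visible in the export; the standard definitions would be:
num_attribs = list(housing_num)    # names of the numerical columns
cat_attribs = ["ocean_proximity"]  # the single categorical column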
full_pipeline = ColumnTransformer([
("num", num_pipeline, num_attribs),
("cat", OneHotEncoder(), cat_attribs),
])
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared
[42]: housing_prepared.shape
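Training a linear regression model on the prepared data; the fit cell itself is missing, but the [43] output below is its result. housing_labels was defined above when the predictors and labels were separated.
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)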
[43]: LinearRegression()
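The cell preparing a few instances for a sanity-check prediction is not shown; presumably:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)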
print("Predictions:", lin_reg.predict(some_data_prepared))
[46]: some_data_prepared
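The [47] value below is the RMSE of the linear model over the whole training set; the producing cell (not visible in the export) would be:
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)  # root of the mean squared error
lin_rmse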
[47]: 68627.87390018745
Finding the mean absolute error using the mean_absolute_error() function. The error is still very large, which suggests the model is underfitting the training data.
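A sketch of the MAE cell:
lin_mae = mean_absolute_error(housing_labels, housing_predictions)
lin_mae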
[48]: 49438.66860915801
[49]: tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)
[49]: DecisionTreeRegressor(random_state=42)
We are getting zero error, which confirms that the model has badly overfit the training data.
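The cell computing the tree's training RMSE is not shown; presumably:
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse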
[50]: 0.0
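The display_scores() call below assumes 10-fold cross-validation scores and a helper function defined in cells cut from the export; the standard versions are:
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)  # scores are negative MSEs

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())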
display_scores(tree_rmse_scores)
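The cell that trains the random forest is missing from the export; the [54] output below is its result. Presumably:
forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(housing_prepared, housing_labels)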
[54]: RandomForestRegressor(random_state=42)
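The [55] value below is the forest's RMSE on the training set; a sketch of the producing cell:
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse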
[55]: 18650.698705770003
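The describe() call below assumes a `scores` array from cross-validating the forest (that cell is not shown):
scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)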
pd.Series(np.sqrt(-scores)).describe()
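The [62] value below is consistent with a support-vector baseline; a sketch of that cell, with the kernel choice assumed:
svm_reg = SVR(kernel="linear")  # assumed: linear-kernel SVR baseline
svm_reg.fit(housing_prepared, housing_labels)
housing_predictions = svm_reg.predict(housing_prepared)
svm_mse = mean_squared_error(housing_labels, housing_predictions)
svm_rmse = np.sqrt(svm_mse)
svm_rmse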
[62]: 111095.06635291968
1.7 Fine-Tuning the Model
1.7.1 Searching: By Grids
[64]: param_grid = [
# try 12 (3×4) combinations of hyperparameters
{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
# then try 6 (2×3) combinations with bootstrap set as False
{'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
scoring='neg_mean_squared_error',
return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
[65]: grid_search.best_params_
[66]: grid_search.best_estimator_
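The per-combination RMSE printout below comes from looping over the grid-search CV results (the export keeps only the last six rows). The producing cell is presumably:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)  # convert negative MSE back to RMSE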
62381.765106921855 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54476.57050944266 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59974.60028085155 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52754.5632813202 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
57831.136061214274 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51278.37877140253 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
[68]: pd.DataFrame(grid_search.cv_results_)
params split0_test_score \
0 {'max_features': 2, 'n_estimators': 3} -4.119912e+09
1 {'max_features': 2, 'n_estimators': 10} -2.973521e+09
2 {'max_features': 2, 'n_estimators': 30} -2.801229e+09
3 {'max_features': 4, 'n_estimators': 3} -3.528743e+09
4 {'max_features': 4, 'n_estimators': 10} -2.742620e+09
5 {'max_features': 4, 'n_estimators': 30} -2.522176e+09
6 {'max_features': 6, 'n_estimators': 3} -3.362127e+09
7 {'max_features': 6, 'n_estimators': 10} -2.622099e+09
8 {'max_features': 6, 'n_estimators': 30} -2.446142e+09
9 {'max_features': 8, 'n_estimators': 3} -3.590333e+09
10 {'max_features': 8, 'n_estimators': 10} -2.721311e+09
11 {'max_features': 8, 'n_estimators': 30} -2.492636e+09
12 {'bootstrap': False, 'max_features': 2, 'n_est... -4.020842e+09
13 {'bootstrap': False, 'max_features': 2, 'n_est... -2.901352e+09
14 {'bootstrap': False, 'max_features': 3, 'n_est... -3.687132e+09
15 {'bootstrap': False, 'max_features': 3, 'n_est... -2.837028e+09
16 {'bootstrap': False, 'max_features': 4, 'n_est... -3.549428e+09
17 {'bootstrap': False, 'max_features': 4, 'n_est... -2.692499e+09
4 -5.063215e+08 -5.257983e+08 -5.081984e+08
5 -3.776568e+08 -3.902106e+08 -3.885042e+08
6 -8.909397e+08 -9.583733e+08 -9.000201e+08
7 -4.939906e+08 -5.145996e+08 -5.023512e+08
8 -3.760968e+08 -3.876636e+08 -3.875307e+08
9 -9.505012e+08 -9.166119e+08 -9.033910e+08
10 -4.998373e+08 -4.997970e+08 -5.099880e+08
11 -3.801679e+08 -3.832972e+08 -3.823818e+08
12 -0.000000e+00 -4.306828e+01 -1.051392e+04
13 -0.000000e+00 -3.876145e+00 -9.462528e+02
14 -0.000000e+00 -0.000000e+00 -0.000000e+00
15 -0.000000e+00 -0.000000e+00 -0.000000e+00
16 -0.000000e+00 -0.000000e+00 -0.000000e+00
17 -0.000000e+00 -0.000000e+00 -0.000000e+00
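For the randomized search below, the distributions cell is not visible in the export; the book's standard choice (assumed here, using the randint import) is:
param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}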
forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5,
                                scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
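The export cuts off the top entries of the feature-importance list below, along with the cell that produced it. A sketch of that cell, with the attribute-name lists assumed from the pipeline above:
feature_importances = grid_search.best_estimator_.feature_importances_
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)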
(0.06965425227942929, 'longitude'),
(0.0604213840080722, 'latitude'),
(0.054778915018283726, 'rooms_per_hhold'),
(0.048203121338269206, 'bedrooms_per_room'),
(0.04218822024391753, 'housing_median_age'),
(0.015849114744428634, 'population'),
(0.015554529490469328, 'total_bedrooms'),
(0.01524505568840977, 'total_rooms'),
(0.014934655161887772, 'households'),
(0.006792660074259966, '<1H OCEAN'),
(0.0030281610628962747, 'NEAR OCEAN'),
(0.0015247327555504937, 'NEAR BAY'),
(7.834806602687504e-05, 'ISLAND')]
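Evaluating on the test set: the cell selecting the final model and splitting off the test predictors and labels is missing; the standard steps would be:
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()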
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
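The [73] value below is the final RMSE on the test set; a sketch of the producing cell:
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse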
[73]: 47873.26095812988
Computing a 95% confidence interval for the test RMSE using a t-score:
[75]: confidence = 0.95  # assumed; this line was cut from the export
squared_errors = (final_predictions - y_test) ** 2  # assumed, as above
m = len(squared_errors)
mean = squared_errors.mean()
tscore = stats.t.ppf((1 + confidence) / 2, df=m - 1)
tmargin = tscore * squared_errors.std(ddof=1) / np.sqrt(m)
np.sqrt(mean - tmargin), np.sqrt(mean + tmargin)
The same interval computed with a z-score, which is appropriate here since the test set is large:
[76]: zscore = stats.norm.ppf((1 + confidence) / 2)
zmargin = zscore * squared_errors.std(ddof=1) / np.sqrt(m)
np.sqrt(mean - zmargin), np.sqrt(mean + zmargin)