
BSD 3101

LAB EXERCISE 1
TITLE: DEMONSTRATING LINEAR REGRESSION USING A HOME PRICE PREDICTION
CASE STUDY
Objectives
i. To identify which of the several attributes are required to accurately predict the median
price of a house.
ii. To build a multiple linear regression model to predict the median price using the most
important attributes.
To achieve the above objectives, the following activities will be involved:
i. Building a linear regression model
ii. Measuring the performance of the model
iii. Understanding the commonly used options for the Linear Regression operator
iv. Applying the model to predict MEDV prices for unseen data

THEORY/INTRODUCTION

Linear regression is not only one of the oldest data science methodologies, but it is also the most
easily explained method for demonstrating function fitting. The basic idea is to come up with a
function that explains and predicts the value of the target variable when given the values of the
predictor variables.
A linear regression model explains the relationship between a quantitative label and one or more
predictors (regular attributes) by fitting a linear equation to observed objects (with labels). The
fitted linear model can then predict the label for unlabeled objects.
A common goal that all businesses have to address in order to be successful is growth, in
revenues and profits. Customers are what will enable this to happen, so understanding and
increasing the likelihood that someone will buy again from the company is critical. Regression
techniques are used to address exactly such prediction problems. This case study applies one of
them, multiple linear regression, to predict the median home price in an urban region given the
characteristics of a home.
Fig. 1 shows a simple regression model. As seen in Fig. 1, assume one would like to know the
effect of the number of rooms in a house (the predictor) on its median sale price (the target).
Each data point on the chart corresponds to a house. It is evident that, on average, increasing the
number of rooms tends to also increase the median price. This general statement can be captured
by drawing a straight line through the data. The problem in linear regression is, therefore, finding
a line (or a curve) that best explains this tendency. If there are two predictors, then the problem is
to find a surface (in a three-dimensional space). With more than two predictors, visualization
becomes difficult and one has to revert to the general statement that the dependent variable is
expressed as a linear combination of the independent variables:

y = b_0 + b_1x_1 + b_2x_2 + … + b_nx_n    (1)
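
To see Eq. (1) in action outside RapidMiner, here is a minimal Python sketch that fits the one-predictor form of the equation by ordinary least squares. The room/price numbers are made up purely for illustration:

    import numpy as np

    # Hypothetical data: number of rooms vs. median price (in $1000s)
    rooms = np.array([4.0, 5.0, 5.0, 6.0, 6.0, 7.0, 8.0])
    price = np.array([12.0, 16.5, 18.0, 21.0, 23.5, 28.0, 34.0])

    # Fit y = b0 + b1*x by least squares: prepend a column of ones for b0
    X = np.column_stack([np.ones_like(rooms), rooms])
    b, *_ = np.linalg.lstsq(X, price, rcond=None)
    print(f"b0 = {b[0]:.2f}, b1 = {b[1]:.2f}")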

FIGURE 1. A simple regression model.


TOOLS/MATERIALS
RapidMiner Studio software; a laptop with an operating system and MS Office.
METHODOLOGY
1. Data set
The original data consist of thirteen predictors and one response variable, which is the
variable that needs to be predicted. The predictors include physical characteristics of the house
(such as number of rooms, age, tax, and location) and neighborhood features (schools, industries,
zoning), among others. The response variable is, of course, the median value (MEDV) of the house
in thousands of dollars. Table 1 shows a snapshot of the dataset, which altogether has 506
examples. Table 2 describes the features or attributes of the dataset.
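
Outside RapidMiner, the same dataset can be inspected with a few lines of Python. This sketch assumes the data has been exported to a local CSV file named BostonHousing.csv (a hypothetical filename) containing the 13 predictor columns plus MEDV:

    import pandas as pd

    # BostonHousing.csv is a hypothetical local export of the dataset
    df = pd.read_csv("BostonHousing.csv")
    print(df.shape)               # expected: (506, 14)
    print(df["MEDV"].describe())  # summary statistics of the label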
2. How to Implement
Here, you are required to build a multiple linear regression model for the Boston Housing
dataset.
The following activities will be involved:

i. Building a linear regression model
ii. Measuring the performance of the model
iii. Understanding the commonly used options for the Linear Regression operator
iv. Applying the model to predict MEDV prices for unseen data
Table 1. Sample View of the Classic Boston Housing Dataset

Table 2. Attributes of Boston Housing Dataset

Step 1: Data Preparation


As a first step, the data is separated into a training set and an unseen test set. The idea is to build
the model with the training data and test its performance on the unseen data. With the help of the
Retrieve operator, import the raw data (available on the companion website
www.IntroDataScience.com) into the RapidMiner process. Apply the Shuffle operator to
randomize the order of the data so that the two partitions, once separated, are statistically similar.
Next, using the Filter Examples Range operator, divide the data into two sets as shown in Fig. 2.
The raw data has 506 examples, which will be linearly split into a training set (rows 1 to 450)
and a test set (rows 451 to 506) using the two operators.
Insert the Set Role operator, change the role of MEDV to label, and connect the output to a Split
Validation operator's input training port as shown in Fig. 3. The training data will now be further
split into a training set and a validation set (keep the default Split Validation options as is, i.e.,
relative, 0.7, and shuffled). This is needed in order to measure the performance of the linear
regression model. It is also a good idea to set the local random seed (to the default value of
1992), which ensures that RapidMiner selects the same samples if this process is run later.
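
For comparison, the partitioning performed by the Shuffle, Filter Examples Range, and Split Validation operators could be reproduced in Python roughly as follows. This is a sketch only, reusing the hypothetical BostonHousing.csv from above; RapidMiner's internal sampling will not produce the identical rows:

    import pandas as pd

    df = pd.read_csv("BostonHousing.csv")  # hypothetical local copy

    # Shuffle, then split linearly: rows 1-450 for training, 451-506 unseen
    df = df.sample(frac=1, random_state=1992).reset_index(drop=True)
    train_full, unseen = df.iloc[:450], df.iloc[450:]

    # Split Validation: a further 70/30 split of the training data
    n = int(0.7 * len(train_full))
    train, validation = train_full.iloc[:n], train_full.iloc[n:]
    print(len(train), len(validation), len(unseen))  # 315 135 56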
After this step, double-click the Validation operator to enter the nested process.
Inside this process, insert the Linear Regression operator in the left window, and Apply Model
and Performance (Regression) in the right window, as shown in Fig. 4. Click on the Performance
operator and check squared error, correlation, and squared correlation in the Parameters options
selector on the right (Fig. 5).

FIGURE 2. Separating the data into training and testing samples.

FIGURE 3. Using the split validation operator.

FIGURE 4. Applying the linear regression operator and measuring performance.

FIGURE 5. Selecting performance criteria for the MLR.


Step 2: Model Building
Select the Linear Regression operator and change the feature selection option to none. Keep the
default eliminate collinear features checked, which will remove factors that are linearly
correlated from the modeling process. When two or more attributes are correlated with one
another, the resulting model will tend to have coefficients that cannot be intuitively interpreted
and, furthermore, the statistical significance of those coefficients also tends to be quite low. Also,
keep use bias checked to build a model with an intercept [the b_0 in Eq. (1)]. Keep the other
default options intact (Fig. 4). When this process is run, the results shown in Fig. 6 will be
generated.
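
As a rough Python equivalent of this step, here is a sketch that builds on the train split from the earlier sketch. Note that scikit-learn's LinearRegression does not eliminate collinear features automatically, so that option has no direct counterpart here:

    from sklearn.linear_model import LinearRegression

    # MEDV is the label; everything else is a predictor
    X_train, y_train = train.drop(columns="MEDV"), train["MEDV"]

    # fit_intercept=True plays the role of the "use bias" option (the b_0 term)
    model = LinearRegression(fit_intercept=True)
    model.fit(X_train, y_train)
    print(dict(zip(X_train.columns, model.coef_.round(3))))
    print("intercept (b_0):", round(model.intercept_, 3))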


FIGURE 6. (A) Description of the linear regression model. (B) Tabular view of the model. Sort
the table according to significance by double-clicking on the Code column.

Step 3: Execution and Interpretation
There are two views that can be examined in the Linear Regression output tab: the Description
view, which shows the actual function that was fitted (Fig. 6A), and the more useful Data view,
which not only shows the coefficients of the linear regression function but also gives information
about the significance of these coefficients (Fig. 6B). The best way to read this table is to sort it
by double-clicking on the column named Code, which will sort the different factors in order of
decreasing significance.
RapidMiner assigns four stars (****) to any factor that is highly significant. In this model, no
feature selection method was used, so all 13 factors appear in the model, including AGE and
INDUS, which have very low significance.
However, if the same model were run with any of the options available in the drop-down menu
of the feature selection parameter, RapidMiner would remove the least significant factors from
the model. In the next iteration, greedy feature selection is used, which removes the least
significant factors, INDUS and AGE, from the function (Fig. 7A and B).

FIGURE 7. (A) Model without any feature selection. (B) Model with greedy feature selection.

Feature selection in RapidMiner can be done automatically within the Linear Regression
operator as described or by using external wrapper functions such as forward selection and
backward elimination.
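
To make the wrapper idea concrete, here is a sketch of backward elimination in Python using the statsmodels library; it illustrates the general wrapper technique, not RapidMiner's exact algorithm. Predictors are dropped one at a time until every remaining coefficient is significant at the chosen level:

    import statsmodels.api as sm

    # Backward elimination: repeatedly drop the least significant predictor
    X, y = train.drop(columns="MEDV"), train["MEDV"]
    while True:
        fit = sm.OLS(y, sm.add_constant(X)).fit()
        pvals = fit.pvalues.drop("const")   # ignore the intercept's p-value
        worst = pvals.idxmax()
        if pvals[worst] < 0.05:             # all predictors significant: stop
            break
        X = X.drop(columns=worst)
    print(list(X.columns))                  # surviving predictors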

FIGURE 8. Generating the r² for the model.


The second output to pay attention to is the Performance. A handy check of the goodness of fit
in a regression model is the squared correlation. Conventionally, this is the same as the adjusted
r² for a model, which can take values between 0.0 and 1.0, with values closer to 1 indicating a
better model. For either of the models shown above, a value around 0.822 is obtained (Fig. 8).
The squared error output was also requested: the raw value in itself may not reveal much, but it
is useful for comparing two different models. In this case, it was around 25.
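
The same performance numbers can be approximated in Python. This sketch continues from the model fitted above; exact values will differ from RapidMiner's because the splits differ:

    import numpy as np
    from sklearn.metrics import mean_squared_error

    X_val, y_val = validation.drop(columns="MEDV"), validation["MEDV"]
    pred = model.predict(X_val)

    r = np.corrcoef(y_val, pred)[0, 1]               # correlation
    print("squared correlation:", round(r ** 2, 3))  # compare with ~0.822
    print("squared error:", round(mean_squared_error(y_val, pred), 3))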
One additional insight that can be extracted from the modeling process is a ranking of the
factors. The easiest way to check this is to rank by p-value. As seen in Fig. 9, RM, LSTAT, and
DIS appear to be the most significant factors.
This is also reflected in their absolute t-stat values. The t-stat and p-values are the results of the
hypothesis tests conducted on the regression coefficients. For the purposes of predictive analysis,
the key takeaway is that a higher t-stat signals that the null hypothesis (which assumes that the
coefficient is zero) can be safely rejected.

The corresponding p-value indicates the probability of wrongly rejecting the null hypothesis. It
has already been noted that the number of rooms (RM) was a good predictor of home prices, but
it could not explain all of the variation in median price. The r² and squared error for that
one-variable model were 0.405 and 45, respectively.
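
A statsmodels fit exposes the same t-stats and p-values directly, so a ranking like the one in Fig. 9 can be produced with a sketch such as the following (continuing from the training split above):

    import statsmodels.api as sm

    fit = sm.OLS(y_train, sm.add_constant(X_train)).fit()

    # Smallest p-value = most significant factor
    print(fit.pvalues.drop("const").sort_values().head())

    # The ranking by absolute t-stat tells the same story
    print(fit.tvalues.drop("const").abs().sort_values(ascending=False).head())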

FIGURE 9. Ranking variables by their p-values.

This can be verified by rerunning the model built so far using only one independent variable, the
number of rooms, RM. This is done by using the Select Attributes operator, which has to be
inserted in the process before the Set Role operator.
When this model is run, the equation shown in Eq. (2) will be obtained in the model Description.
Comparing the corresponding values from the MLR model (0.676 and 25) to the simple linear
regression model, it is evident that both of these quantities have improved, affirming the decision
to use multiple factors.
One now has a more comprehensive model that can account for much of the variability in the
response variable, MEDV.

Median price = 9.1 × (number of rooms) − 34.7    (2)

where b_1 = 9.1 and b_0 = −34.7.
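
For example, Eq. (2) predicts that a six-room house has a median price of 9.1 × 6 − 34.7 = 19.9, that is, about $19,900, since MEDV is expressed in thousands of dollars.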

Finally, a word about the sign of the coefficients: LSTAT refers to the percentage of low-income
households in the neighborhood. A lower LSTAT is correlated with higher median home price,
and this is the reason for the negative coefficient on LSTAT.

Step 4: Application to Unseen Test Data


This model is now ready to be deployed against the unseen data that was created at the beginning
of this section using the second Filter Examples operator (Fig. 2). A new Set Role operator will
need to be added: select MEDV under parameters and set its target role to prediction from the
pull-down menu.
Add another Apply Model operator and connect the output of Set Role to its unlabeled port;
additionally, connect the output model from the Validation process to the input model port of the
new Apply Model. In effect, the MEDV attribute of the 56 examples in the unseen set has been
changed to a prediction. When the model is applied to this example set, the prediction (MEDV)
values can be compared to the original MEDV values (which still exist in the set) to test how
well the model behaves on new data. The difference between prediction (MEDV) and MEDV is
termed the residual.

Fig. 10 shows one way to quickly check the residuals from the model's application. A Rename
operator is needed to change the name of "prediction (MEDV)" to predictedMEDV, to avoid
confusing RapidMiner when the next operator, Generate Attributes, is used to calculate the
residuals (try this without the Rename operator to understand the issue, as it can pop up in other
instances where Generate Attributes is used). Fig. 11 shows the statistics for the new residuals
attribute, which indicate that the mean is close to 0 (−0.27) but the standard deviation (and
hence, the variance), at 4.350, is not quite small. The histogram also seems to indicate that the
residuals are not quite normally distributed, which would be another motivation to continue
improving the model.
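
A Python sketch of the same residual check, continuing from the earlier splits and model (the residual is computed as prediction minus actual, matching the definition above):

    import matplotlib.pyplot as plt

    X_unseen, y_unseen = unseen.drop(columns="MEDV"), unseen["MEDV"]
    predicted_medv = model.predict(X_unseen)  # the prediction (MEDV) column

    residuals = predicted_medv - y_unseen
    print("mean:", round(residuals.mean(), 3), "std:", round(residuals.std(), 3))

    # For a well-behaved model the histogram should look roughly normal
    plt.hist(residuals, bins=15)
    plt.xlabel("residual (predicted MEDV - MEDV)")
    plt.show()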

FIGURE 10. Setting up a process to compare the unseen data with the model-predicted values.

FIGURE 11. Statistics of the residuals for the unseen data show that some model optimization
may be necessary.
RESULTS AND DISCUSSION

1. Using RapidMiner Studio, follow the steps above to reproduce the results, and present
screenshots of your figures from Fig. 1 to Fig. 11 as obtained in your RapidMiner software
environment. Do not copy-paste the figures already given here. Summarize the information
presented in each figure.
2. Provide and explain at least three different visualizations (e.g., histograms, scatter plots,
distribution plots) of the results obtained from applying linear regression to the Boston home
prices dataset; a starting sketch is given after this list.
3. Build your own RapidMiner process with a linear regression model and analyze the model
and results. For example, you can construct a simple predictive learning process in RapidMiner
Studio by using the linear regression model to predict a continuous value for a polynomial.
Sample processes are shown in Fig. 12 and Fig. 13. Add the operators that are included in the
models and connect the ports to enable data flow. Each operator requires specific parameter
settings.
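
As a starting point for task 2 above, the following Python sketch (reusing the variables from the earlier sketches) produces three of the suggested visualizations:

    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # (a) Scatter plot of a single predictor against the response
    axes[0].scatter(df["RM"], df["MEDV"], s=10)
    axes[0].set(xlabel="RM", ylabel="MEDV", title="Rooms vs. median price")

    # (b) Predicted vs. actual values on the unseen data
    axes[1].scatter(y_unseen, predicted_medv, s=10)
    axes[1].set(xlabel="actual MEDV", ylabel="predicted MEDV", title="Fit quality")

    # (c) Distribution of the residuals
    axes[2].hist(residuals, bins=15)
    axes[2].set(xlabel="residual", title="Residuals")

    plt.tight_layout()
    plt.show()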
Conclusions and Recommendations
Give your Conclusions and Recommendations
References
Provide the references used in the lab exercises.

FIGURE 12. Sample linear regression process No. 1

FIGURE 13. Sample linear regression process No. 2
