BSD 3101-Lab Exercise 1
LAB EXERCISE 1
TITLE: DEMONSTRATING LINEAR REGRESSION USING A HOME PRICE PREDICTION CASE STUDY
Objectives
i. To identify which of the several attributes are required to accurately predict the median
price of a house.
ii. To build a multiple linear regression model to predict the median price using the most
important attributes.
To achieve the above objectives, the following activities will be involved:
i. Building a linear regression model
ii. Measuring the performance of the model
iii. Understanding the commonly used options for the Linear Regression operator
iv. Applying the model to predict MEDV prices for unseen data
THEORY/INTRODUCTION
Linear regression is not only one of the oldest data science methodologies, but it is also the most
easily explained method for demonstrating function fitting. The basic idea is to come up with a
function that explains and predicts the value of the target variable when given the values of the
predictor variables.
A linear regression model explains the relationship between a quantitative label and one or more
predictors (regular attributes) by fitting a linear equation to the observed objects (those with
labels). The fitted linear model can then predict the label for unlabeled objects.
Accurately predicting a numeric outcome from known attributes is a common need across many
business domains, and real-estate pricing is a classic example. This case study addresses such a
problem, using multiple linear regression to predict the median home price in an urban region
given the characteristics of a home.
Fig. 1 shows a simple regression model. As seen in Fig. 1, assume one would like to know the
effect of the number of rooms in a house (the predictor) on its median sale price (the target).
Each data point on the chart corresponds to a house. It is evident that, on average, increasing the
number of rooms tends to also increase the median price. This general tendency can be captured
by drawing a straight line through the data. The problem in linear regression is, therefore,
finding the line (or curve) that best explains this tendency. If there are two predictors, the
problem is to find a surface (in a three-dimensional space). With more than two predictors,
visualization becomes difficult and one has to resort to the general statement that the dependent
variable is expressed as a linear combination of the independent variables:
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n    (1)
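For orientation, the fitting of Eq. (1) can be illustrated outside RapidMiner. The following is a minimal Python/scikit-learn sketch (not part of the lab's RapidMiner process) that fits Eq. (1) on synthetic data with known coefficients; the data, seed, and coefficient values are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 100 objects, 3 predictors, with known coefficients plus noise
rng = np.random.default_rng(1992)
X = rng.normal(size=(100, 3))
true_b = np.array([2.0, -1.5, 0.5])
y = 4.0 + X @ true_b + rng.normal(scale=0.1, size=100)

# Fit Eq. (1): y = b0 + b1*x1 + ... + bn*xn
model = LinearRegression().fit(X, y)
print("intercept b0:", model.intercept_)    # close to 4.0
print("coefficients b1..bn:", model.coef_)  # close to [2.0, -1.5, 0.5]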
Table 1. Sample View of the Classic Boston Housing Dataset
linear regression model. It is also a good idea to set the local random seed (to the default value
of 1992), which ensures that RapidMiner selects the same samples if this process is rerun later.
After this step, double-click the Validation operator to enter the nested process.
Inside this process, insert the Linear Regression operator in the left window, and Apply Model
and Performance (Regression) in the right window, as shown in Fig. 4. Click on the Performance
operator and check squared error, correlation, and squared correlation inside the Parameters
options selector on the right (Fig. 5).
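For readers who want to mirror this validation step outside RapidMiner, the sketch below computes the same three measures (squared error, correlation, and squared correlation) on a held-out split in Python/scikit-learn, using synthetic stand-in data; the split ratio and seed are assumptions, not the lab's required settings.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the Boston predictors and the MEDV label
rng = np.random.default_rng(1992)
X = rng.normal(size=(506, 13))
y = X @ rng.normal(size=13) + rng.normal(size=506)

# A fixed random_state plays the role of RapidMiner's local random seed
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1992)
pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)

squared_error = mean_squared_error(y_te, pred)     # "squared error"
correlation = np.corrcoef(y_te, pred)[0, 1]        # "correlation"
print(squared_error, correlation, correlation**2)  # last value: "squared correlation"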
FIGURE 4. Applying the linear regression operator and measuring performance.
FIGURE 6. (A) Description of the linear regression model. (B) Tabular view of the model. Sort
the table according to significance by double-clicking on the Code column.
Step 3: Execution and Interpretation
There are two views that one can examine in the Linear Regression output tab:
the Description view, which actually shows the function that is fitted (Fig. 6A) and the more
useful Data view, which not only shows the coefficients of the linear regression function, but
also gives information about the significance of these coefficients (Fig. 6B). The best way to
read this table is to sort it by double-clicking on the column named Code, which will sort the
different factors according to their decreasing level of significance.
RapidMiner assigns four stars (****) to any factor that is highly significant. In this model, no
feature selection method was used and, as a result, all 13 factors are in the model, including AGE
and INDUS, which have very low significance.
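The star codes can be thought of as a banding of p-values. A small sketch of this idea follows; the exact thresholds RapidMiner uses are not documented in this exercise, so the R-style cutoffs below are an assumption.

def significance_code(p):
    """Map a p-value to significance stars (assumed, R-style cutoffs)."""
    if p < 0.001:
        return "****"  # highly significant, as displayed by RapidMiner
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.1:
        return "*"
    return ""          # not significant

print(significance_code(0.0004))  # "****"
print(significance_code(0.2))     # ""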
However, if the same model is rerun with any of the options available in the drop-down menu of
the feature selection parameter, RapidMiner will remove the least significant factors from the
model. In the next iteration, greedy feature selection is used, which removes the two least
significant factors, INDUS and AGE, from the function (Fig. 7A and B).
FIGURE 7. (A) Model without any feature selection. (B) Model with greedy feature selection.
Feature selection in RapidMiner can be done automatically within the Linear Regression
operator as described or by using external wrapper functions such as forward selection and
backward elimination.
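A wrapper selector of this kind can also be sketched in Python with scikit-learn's SequentialFeatureSelector; this is an analog of forward selection/backward elimination around the learner, not RapidMiner's operators themselves, and the synthetic data below is an assumption for illustration.

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: 13 predictors, of which the last two are pure noise
rng = np.random.default_rng(1992)
X = rng.normal(size=(506, 13))
y = X[:, :11] @ rng.normal(size=11) + rng.normal(size=506)

# Backward elimination wrapped around the linear regression learner
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=11, direction="backward", cv=5)
selector.fit(X, y)
print(selector.get_support())  # boolean mask; the two noise columns should drop out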
The corresponding p-value indicates the probability of wrongly rejecting the null hypothesis
(here, that the coefficient is zero). It has already been noted that the number of rooms (RM) was
a good predictor of home prices, but it was unable to explain all of the variation in median price.
The r² and squared error for that one-variable model were 0.405 and 45, respectively.
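For reference, coefficient p-values of this kind can be computed in Python with statsmodels; the sketch below uses synthetic data in which the third predictor is deliberately irrelevant, so its p-value should be large. The data and coefficients are assumptions for illustration.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1992)
X = rng.normal(size=(506, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=506)  # column 3 is noise

# OLS with an intercept term; results.summary() also prints a significance table
results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.pvalues)   # per-coefficient p-values, as ranked in Fig. 9
print(results.rsquared)  # r-squared of the fit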
FIGURE 9. Ranking variables by their p-values.
This can be verified by rerunning the model built so far using only one independent variable, the
number of rooms, RM. This is done by using the Select Attributes operator, which has to be
inserted in the process before the Set Role operator.
When this model is run, the equation shown in Eq. (2) will be obtained in the model
Description. Comparing the corresponding values from the MLR model (0.676 and 25) with
those of the simple linear regression model shows that both quantities have improved, thus
affirming the decision to use multiple factors.
One now has a more comprehensive model that can account for much of the variability in the
response variable, MEDV.
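This comparison of the one-variable and multiple-variable models can also be sketched in Python; the column index used for RM below assumes the classic ordering of the Boston attributes, and the synthetic data is illustrative only.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the 13 Boston predictors and MEDV
rng = np.random.default_rng(1992)
X = rng.normal(size=(506, 13))
y = X @ rng.normal(size=13) + rng.normal(size=506)

# RM is column 5 (0-based) in the classic attribute ordering (assumed)
for name, feats in [("RM only", X[:, [5]]), ("all 13", X)]:
    pred = LinearRegression().fit(feats, y).predict(feats)
    print(name, "r^2 =", r2_score(y, pred),
          "squared error =", mean_squared_error(y, pred))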
Finally, a word about the sign of the coefficients: LSTAT refers to the percentage of low-income
households in the neighborhood. A lower LSTAT is correlated with higher median home price,
and this is the reason for the negative coefficient on LSTAT.
new Apply Model. What this does is change the attribute MEDV from the unseen set of 56
examples into a prediction. When the model is applied to this example set, the prediction
(MEDV) values can be compared to the original MEDV values (which exist in the set) to test
how well the model would behave on new data. The difference between prediction (MEDV) and
MEDV is termed the residual.
Fig. 10 shows one way to quickly check the residuals for the model's application. A Rename
operator is needed to change the name of “prediction (MEDV)” to predictedMEDV, to avoid
confusing RapidMiner when the next operator, Generate Attributes, is used to calculate the
residuals (try this without the Rename operator to understand the issue, as it can pop up in
other instances where Generate Attributes is used). Fig. 11 shows the statistics for the new
attribute, residuals, which indicate that the mean is close to 0 (−0.27) but the standard deviation
(and hence, the variance), at 4.350, is not quite small. The histogram also seems to indicate that
the residuals are not quite normally distributed, which would be another motivation to continue
to improve the model.
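The residual check in Figs. 10 and 11 has a direct Python analog. The sketch below holds back 56 examples as “unseen” data, computes the residuals as prediction minus actual, and inspects their mean, standard deviation, and histogram; the data here is synthetic, and the 450/56 split is an assumption mirroring the lab's example counts.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data; the last 56 rows act as the unseen example set
rng = np.random.default_rng(1992)
X = rng.normal(size=(506, 13))
y = X @ rng.normal(size=13) + rng.normal(size=506)
X_seen, y_seen, X_unseen, y_unseen = X[:450], y[:450], X[450:], y[450:]

model = LinearRegression().fit(X_seen, y_seen)
residuals = model.predict(X_unseen) - y_unseen  # prediction(MEDV) - MEDV

print("mean:", residuals.mean(), "std:", residuals.std())
plt.hist(residuals, bins=15)  # rough normality check, as in Fig. 11
plt.show()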
FIGURE 10. Setting up a process to compare the unseen data with the model-predicted values.
FIGURE 11. Statistics of the residuals for the unseen data show that some model optimization
may be necessary.
RESULTS AND DISCUSSION
1. Using RapidMiner Studio, follow the steps above to reproduce the results and present
screenshots of your figures from Fig. 1 to Fig. 11 as obtained in your RapidMiner software
environment. Do not copy-paste the figures already given here. Summarize the information
presented in each figure.
2. Provide and explain at least three different visualizations (e.g., histograms, scatter plots,
distribution plots) of the results obtained from applying linear regression to the Boston home
prices dataset.
3. Build your own typical RapidMiner process with a linear regression model and analyze the
model and results. For example, you can construct a simple predictive learning process in
RapidMiner Studio by using the linear regression model to predict a continuous value for a
polynomial; a minimal Python analog is sketched below for orientation. Sample models are
shown in Fig. 12 and Fig. 13. Add the operators that are included in the models and connect the
ports to enable data flow. Each operator requires specific parameter settings.
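As a rough Python analog of the polynomial exercise (not the RapidMiner process of Figs. 12 and 13 itself), the sketch below generates powers of x as extra attributes and fits a linear model to a cubic target; the degree, coefficients, and noise level are assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression

# Target is a cubic polynomial of x plus noise
rng = np.random.default_rng(1992)
x = rng.uniform(-3, 3, size=200)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(scale=0.2, size=200)

# Generated attributes x, x^2, x^3 make the model linear in its inputs
X = np.column_stack([x, x**2, x**3])
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # roughly 1.0 and [-2.0, 0.0, 0.5]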
Conclusions and Recommendations
Give your Conclusions and Recommendations
References
Provide the references used in the lab exercises.
FIGURE 12. Sample linear regression process No. 1.