Practical 1-3 SPSS
Practical 1
Aim: To carry out simple linear regression on the basis of given data.
Problem: The following data give the house price in lakhs (Y) and area in square yards (X) for a
realty firm. Fit a simple linear regression model to the data and carry out the analysis.
Y X Y X Y X Y X
186 175 182 167 162 156 179 160
180 168 162 160 192 180 170 149
160 154 169 165 185 167 170 160
186 166 176 167 163 157 165 148
163 162 180 175 185 167 165 154
172 152 157 157 170 157 169 171
192 179 170 172 176 168 171 165
170 163 186 181 176 167 192 175
174 172 180 166 160 145 176 161
191 170 188 181 167 156 168 162
182 170 153 148 157 153 169 162
178 147 179 169 180 162 184 176
181 165 175 170 172 156 171 160
168 162 165 157 184 174 161 158
162 154 156 162 185 160 185 175
188 166 185 174 165 152 184 174
168 167 172 168 181 175 179 168
183 174 166 162 170 169 184 177
188 175 179 159 161 149 175 158
166 164 181 155 188 176 173 161
180 163 176 171 181 165 164 146
176 163 170 159 156 143 181 168
185 171 165 164 161 158 187 178
169 161 183 175 152 141 181 170
Theory:
Simple Linear Regression:
Model: y = β0 + β1x + ε,
where the intercept β0 and slope β1 are unknown constants, called parameters, which are to be
estimated by the method of least squares, and ε is a random error component.
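For reference, the least-squares estimates of the parameters have the standard closed form (a textbook result, not taken from the SPSS output):
β̂1 = Sxy/Sxx = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²,  β̂0 = ȳ − β̂1x̄.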
Assumptions:
1. There is a linear relationship between the response (y) and regressor (x).
2. The errors are assumed to be Normally distributed with mean 0 and unknown variance σ²,
i.e., εi ~ N(0, σ²) for all i.
3. The error terms are uncorrelated, which implies the absence of autocorrelation, i.e.,
Cov(εi, εj) = 0 for all i ≠ j.
4. There is no multicollinearity among the regressors, and the errors are homoscedastic (constant variance).
The model along with the above assumptions is known as Classical Linear Regression Model (CLRM).
Coefficient of determination (R²): It gives the proportion (or percentage) of the variation in y that is
explained by the regressor x. The value of R² lies between 0 and 1. Values of R² that are close to 1
imply that most of the variability in y is explained by the regression model. Note that R² never
decreases when new regressor variables are added.
The Adjusted R² gives the percentage of variation explained by only those regressors that actually
affect the dependent variable y; unlike R², it is penalized for regressors that do not improve the model.
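For reference, with n observations and k regressors, the standard definitions are:
R² = SSR/SST = 1 − SSE/SST,  Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1),
where SST, SSR and SSE denote the total, regression and residual sums of squares respectively.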
Test criteria: If p-value < 0.05, we reject H0 at 5% level of significance and conclude on the basis of
the given data that the regressor is statistically significant.
Steps:
Analyze → Regression → Linear → Dependent: House_price → Independent(s): Area_yards →
Statistics → Estimates, Model fit, Descriptives → Continue → OK
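Equivalently, the analysis can be run from a Syntax window; the following is approximately the syntax SPSS pastes for these menu choices (assuming the variables are named House_price and Area_yards, as above):

* Simple linear regression of House_price on Area_yards, with descriptives and model fit.
REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT House_price
  /METHOD=ENTER Area_yards.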
Output:
Table 1.1: Case Summaries of Y (House_price) and X (Area_yards) for all cases, reproducing the data listed above. Total N = 96. (a. Limited to first 100 cases.)
Conclusion:
1. From table 1.2 of descriptive statistics, the mean and standard deviation of House_price are
174.32 lakhs and 9.960 lakhs respectively. Similarly, the mean and standard deviation of
Area_yards are 163.92 square yards and 9.152 square yards respectively.
2. From table 1.3, the value of the correlation between House_price and Area_yards is 0.765 and
the p-value ≈ 0.000 < 0.05. So, we reject the null hypothesis, implying that there is a significant
correlation between House_price and Area_yards.
3. From table 1.4, the value of the coefficient of determination is R² = 0.585, which implies that the
regression model explains 58.5% of the total variation in House_price. Also, Adjusted R² =
0.580 ≈ R². So, the model is a good fit.
4. From table 1.5: ANOVA, the p-value ≈ 0.000, which is less than 0.05. So, we reject H0 at 5% level
of significance, implying that the overall regression is significant.
5. From table 1.6, the estimated coefficients of the regression model are β̂0 = 37.922 and β̂1 = 0.832. So, the
fitted regression model is: House_price = 37.922 + 0.832 × Area_yards
We can also see that the p-value for testing the significance of β1 is ≈ 0.000 < 0.05. So, we reject the null
hypothesis at 5% level of significance, implying that the regressor (Area_yards) is significant.
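As a quick illustration of the fitted equation (using a hypothetical area of 160 square yards, not a value taken from the output), the predicted price would be:
House_price = 37.922 + 0.832 × 160 = 37.922 + 133.120 ≈ 171.04 lakhs.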
Name: RAGINI
Roll No: 21026765023
Group: A
Practical 2
Aim: To fit a multiple linear regression model on the basis of given data.
Problem: A recent survey of clerical employees of a large financial organization included questions
related to employee satisfaction with their supervisors. There was a question designed to measure
the overall performance of a supervisor as well as questions that were related to specific activities
involving interaction between employee and supervisor. An exploratory study was conducted to try
to explain the relationship between specific supervisor activities and overall satisfaction with supervisors
as perceived by employees. Y = Overall rating of job being done by supervisor
X1 = Handles employee complaints X2 = Doesn’t allow special privileges
X3 = Opportunity to learn new things X4 = Raises based on performance
X5 = Too critical of poor performance X6 = Rate of advancing to better jobs
Sl. No.  Y  X1  X2  X3  X4  X5  X6   (columns as defined above)
1 43 51 30 39 61 92 45
2 63 64 51 54 63 73 47
3 71 70 68 69 76 86 48
4 61 63 45 47 54 84 35
5 81 78 56 66 71 83 47
6 43 55 49 44 54 49 34
7 58 67 42 56 66 68 35
8 71 75 50 55 70 66 41
9 72 82 72 67 71 83 31
10 67 61 45 47 62 80 41
11 64 53 53 58 58 67 34
12 67 60 47 39 59 74 41
13 69 62 57 42 55 63 25
14 68 83 83 45 59 77 35
15 77 77 54 72 79 77 46
16 81 90 50 72 60 54 36
17 74 85 64 69 79 79 63
18 65 60 65 75 55 80 60
19 65 70 46 57 75 85 46
20 50 58 68 54 64 78 52
21 50 40 33 34 43 64 33
22 64 61 52 62 66 80 41
23 53 66 52 50 63 80 37
24 40 37 42 58 50 57 49
25 63 54 42 48 66 75 33
26 66 77 66 63 88 76 72
27 78 75 58 74 80 78 49
28 48 57 44 45 51 83 38
29 85 85 71 71 77 74 55
30 82 82 39 59 64 78 39
Total N = 30
Theory:
Multiple Linear Regression:
Model: y = β0 + β1x1 + β2x2 + … + βkxk + ε,
where β0, β1, …, βk are unknown parameters, estimated by the method of least squares (the matrix
form of the estimator is given after the assumptions below), and ε is a random error component; here k = 6.
Assumptions:
1. There is a linear relationship between the response and regressors.
2. The errors are assumed to be Normally distributed with mean 0 and unknown variance σ²,
i.e., εi ~ N(0, σ²) for all i.
3. The error terms are uncorrelated, i.e., absence of autocorrelation:
Cov(εi, εj) = 0 for all i ≠ j.
4. There is no multicollinearity among the regressors, and the errors are homoscedastic (constant variance).
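In matrix form (a standard result, not part of the SPSS output; y is the n×1 response vector and X the n×(k+1) matrix whose first column is all ones), the least-squares estimator of the coefficient vector is:
β̂ = (X′X)⁻¹X′y.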
The Adjusted R² gives the percentage of variation explained by only those regressors that actually
affect the dependent variable y.
Test criteria: If p-value < 0.05, we reject H0 at 5% level of significance and conclude on the basis of
the given data that the regressor is statistically significant.
Steps:
Analyze → Regression → Linear → Dependent: Y → Independent(s): X1 X2 X3 X4 X5 X6 → Statistics
→ Estimates, Model fit, Descriptives → Continue → OK
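As before, the equivalent pasted syntax would look roughly like this (assuming the variables are named Y and X1 to X6, as above):

* Multiple linear regression of Y on X1 to X6.
REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT Y
  /METHOD=ENTER X1 X2 X3 X4 X5 X6.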
Output:
Conclusion:
1. From table 2.2 of descriptives, we can gather information about the mean and standard
deviation of the given variables.
2. From table 2.3 of correlations, we can gather information about the individual correlations
between the response and the predictors, as well as the pairwise correlations among the
predictors.
Using the p-values from the table, we can say that there is a significant correlation between all
pairs except (Y, X5); (Y, X6); (X1, X5); (X2, X5); (X3, X5); (X1, X6) and (X5, X6), since the p-values for these
pairs are > 0.05, implying that we fail to reject the null hypothesis at 5% level of significance.
Also, we can say that the predictors X1, X3 and X4 have a significant effect on the response Y
as evident from their correlation coefficients as well as the p-values.
3. From table 2.4, we see that R² = 0.733, implying that the regression model explains 73.3% of the
total variation in the response and the model is a good fit.
Also, Adjusted R² = 0.663, implying that 66.3% of the total variation of the response is explained
by only those predictors that have a significant effect on the response.
4. From table 2.5 of ANOVA, the p-value for testing the null hypothesis H0: β1 = β2 = … = β6 = 0
is less than the level of significance α = 0.05. So, we reject the null hypothesis at 5%
level of significance and conclude that the overall regression is significant.
5. From table 2.6, the estimated coefficients of the regression model are β̂0 = 10.787, β̂1 = 0.613, β̂2 =
−0.073, β̂3 = 0.320, β̂4 = 0.082, β̂5 = 0.038 and β̂6 = −0.217, and the fitted regression equation
is: Y = 10.787 + 0.613 X1 − 0.073 X2 + 0.320 X3 + 0.082 X4 + 0.038 X5 − 0.217 X6
We can also see that the p-value for testing the significance of β1 is 0.001 < 0.05. So, we reject
the null hypothesis at 5% level of significance, implying that the regressor X1 (Handles employee
complaints) is statistically significant. However, the p-values for testing the significance of
βi = 0 for i = 2, 3, …, 6 are greater than 0.05, implying that individually X2, X3, X4, X5 and X6 do not have
a significant effect on the response.
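As an illustration of the fitted equation (using case 1 of the data above; this computation is not part of the SPSS output):
Ŷ = 10.787 + 0.613(51) − 0.073(30) + 0.320(39) + 0.082(61) + 0.038(92) − 0.217(45)
  = 10.787 + 31.263 − 2.190 + 12.480 + 5.002 + 3.496 − 9.765 ≈ 51.07,
so the residual for case 1 is 43 − 51.07 ≈ −8.07.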
Name: RAGINI
Roll No: 21026765023
Group: A
Practical 3
Aim: To test the significance of regression coefficient and check the presence of autocorrelation and
heteroscedasticity.
Problem: The following data are based on the income and consumption expenditure of 30 families in
a locality. Assuming that consumption is linearly related to income, propose a model and test for the
significance of regression. Also, compute the coefficient of determination and discuss its importance
as a model adequacy measure. Check for the presence of serial correlation and heteroscedasticity.
Theory:
Simple Linear Regression:
Model: y = β0 + β1x + ε,
where the intercept β0 and slope β1 are unknown constants, called parameters, which are to be
estimated by the method of least squares, and ε is a random error component.
Assumptions:
1. There is a linear relationship between the response (y) and regressor (x).
2. The errors are assumed to be Normally distributed with mean 0 and unknown variance σ²,
i.e., εi ~ N(0, σ²) for all i.
3. The error terms are uncorrelated, which implies the absence of autocorrelation, i.e.,
Cov(εi, εj) = 0 for all i ≠ j.
4. There is no multicollinearity among the regressors, and the errors are homoscedastic (constant variance).
The model along with the above assumptions is known as Classical Linear Regression Model (CLRM).
Test criteria: If p-value < 0.05, we reject H0 at 5% level of significance and conclude on the basis of
the given data that the regressor is statistically significant.
Hypothesis:
H0: Errors are serially uncorrelated
H1: Errors follow a first order autoregressive process
Test Criteria: The Durbin-Watson statistic d lies between 0 and 4, and values close to 2 support H0
(i.e., we fail to reject H0). As a rule of thumb, a value well below 2 (say, less than about 1) indicates
positive autocorrelation and a value well above 2 (say, greater than about 3) indicates negative
autocorrelation; in either case we reject H0.
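For reference, if e1, e2, …, en denote the residuals, the Durbin-Watson statistic is (standard definition):
d = Σ (et − et−1)² / Σ et²,
where the numerator sum runs over t = 2, …, n and the denominator over t = 1, …, n. A value of d near 2 corresponds to a first-order autocorrelation near 0.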
Steps:
1. Analyze → Regression → Linear → Dependent: Consumption → Independent(s): Income →
Statistics → Estimates, Model fit, Descriptives, Durbin-Watson → Continue → Save →
Residuals: Unstandardized → Continue → OK
2. Transform → Compute Variable → Target Variable: Absolute_Residuals → Numeric
Expression: ABS(RES_1) → OK
3. Analyze → Correlate → Bivariate → Variables: Income, Absolute_Residuals → Correlation
Coefficients: Pearson → OK
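The corresponding syntax sketch (assuming the variables are named Consumption and Income; SPSS names the saved unstandardized residual variable RES_1 by default):

* Regression with Durbin-Watson statistic and saved unstandardized residuals.
REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT Consumption
  /METHOD=ENTER Income
  /RESIDUALS DURBIN
  /SAVE RESID.

* Absolute residuals for the heteroscedasticity check.
COMPUTE Absolute_Residuals = ABS(RES_1).
EXECUTE.

* Pearson correlation between the regressor and the absolute residuals.
CORRELATIONS
  /VARIABLES=Income Absolute_Residuals
  /PRINT=TWOTAIL NOSIG.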
Output:
Conclusion:
1. The residuals and their absolute values have been tabulated in Table 3.1.
2. From table 3.2 of descriptives, we can gather information about the mean and standard
deviation of the given variables.
3. From table 3.3, the value of the correlation between Consumption and Income is 0.928 and
the p-value ≈ 0.000 < 0.05. So, we reject the null hypothesis, implying that there is a significant
correlation between Consumption and Income.
4. From table 3.4, the value of the coefficient of determination is R² = 0.862, which implies that the
regression model explains 86.2% of the total variation in Consumption. Also, Adjusted R² =
0.857 ≈ R². So, the model is a good fit.
5. From table 3.4, the value of the Durbin-Watson statistic is 1.537, which is close to 2. Therefore,
we fail to reject the null hypothesis that the errors are serially uncorrelated, implying that
autocorrelation is absent in the data.
6. From table 3.5: ANOVA, the p-value ≈ 0.000, which is less than 0.05. So, we reject H0 at 5% level
of significance, implying that the overall regression is significant.
7. From table 3.6, the estimated coefficients of the regression model are β̂0 = 7.196 and β̂1 = 0.634. So, the
fitted regression model is: Consumption = 7.196 + 0.634 × Income
We can also see that the p-value for testing the significance of β1 is ≈ 0.000 < 0.05. So, we reject the null
hypothesis at 5% level of significance, implying that the regressor (Income) is significant.
8. From table 3.7 of residual statistics, we can gather the minimum, maximum, mean and
standard deviation of the predicted values, residuals, standardized predicted values
and standardized residuals.
9. From table 3.8, the value of Pearson's correlation coefficient between the regressor
(Income) and the absolute values of the residuals is 0.275, and the p-value for testing the null
hypothesis of homoscedasticity is 0.141 > 0.05. So, we fail to reject the null
hypothesis at 5% level of significance, implying that the errors have a constant variance.