Unit 2

Data Analytics

Source: Business Statistics, A First Course (4e), © 2006 Prentice-Hall, Inc.
What is a Hypothesis?
▪ A hypothesis is a claim (assumption) about a population parameter:

▪ Population mean
  Example: The mean monthly cell phone bill of this city is μ = $42

▪ Population proportion
  Example: The proportion of adults in this city with cell phones is π = 0.68
The Null Hypothesis, H0

▪ States the claim or assertion to be tested

  Example: The average number of TV sets in U.S. homes is equal to three (H0: μ = 3)

▪ Is always about a population parameter, not about a sample statistic

  H0: μ = 3 (correct: μ is a population parameter)    H0: X̄ = 3 (incorrect: X̄ is a sample statistic)
The Null Hypothesis, H0
(continued)

▪ Begin with the assumption that the null hypothesis is true
▪ Similar to the notion of innocent until proven guilty
▪ Refers to the status quo
▪ Always contains an "=", "≤", or "≥" sign
▪ May or may not be rejected

The Alternative Hypothesis, H1
▪ Is the opposite of the null hypothesis
  e.g., The average number of TV sets in U.S. homes is not equal to 3 (H1: μ ≠ 3)
▪ Challenges the status quo
▪ Never contains the "=", "≤", or "≥" sign
▪ May or may not be proven
▪ Is generally the hypothesis that the researcher is trying to prove

Hypothesis Testing Process

Claim: the population mean age is 50 (null hypothesis H0: μ = 50).

▪ Select a random sample from the population; suppose the sample mean age is X̄ = 20.
▪ Ask: is X̄ = 20 likely if μ = 50 is true?
▪ If it is not likely, REJECT the null hypothesis.
Reason for Rejecting H0
(Figure: sampling distribution of X̄, centered at μ = 50 when H0 is true, with the observed value 20 far out in the tail.)

If it is unlikely that we would get a sample mean of this value (X̄ = 20) if in fact μ = 50 were the population mean, then we reject the null hypothesis that μ = 50.
Level of Significance, α

▪ Defines the unlikely values of the sample statistic if the null hypothesis is true
▪ Defines the rejection region of the sampling distribution
▪ Is designated by α (the level of significance)
▪ Typical values are 0.01, 0.05, or 0.10
▪ Is selected by the researcher at the beginning
▪ Provides the critical value(s) of the test
Level of Significance
and the Rejection Region
Level of significance = α (the shaded rejection region in each sketch; its boundary is the critical value)

▪ Two-tail test:   H0: μ = 3,  H1: μ ≠ 3 (rejection regions of area α/2 in each tail)
▪ Upper-tail test: H0: μ ≤ 3,  H1: μ > 3 (rejection region of area α in the upper tail)
▪ Lower-tail test: H0: μ ≥ 3,  H1: μ < 3 (rejection region of area α in the lower tail)
Errors in Making Decisions

▪ Type I Error
▪ Reject a true null hypothesis
▪ Considered a serious type of error
▪ The probability of a Type I error is α
▪ Called the level of significance of the test
▪ Set by the researcher in advance
Errors in Making Decisions
(continued)

▪ Type II Error
▪ Fail to reject a false null hypothesis
▪ The probability of a Type II error is β
Outcomes and Probabilities

Possible hypothesis test outcomes (key: outcome and its probability):

▪ Do not reject H0 when H0 is true:  no error (1 − α)
▪ Do not reject H0 when H0 is false: Type II error (β)
▪ Reject H0 when H0 is true:         Type I error (α)
▪ Reject H0 when H0 is false:        no error (1 − β)
Hypothesis Tests for the Mean

Hypothesis tests for μ:

▪ σ known:   Z test
▪ σ unknown: t test
One-Tail Tests

▪ In many cases, the alternative hypothesis focuses on a particular direction.

▪ H0: μ ≥ 3,  H1: μ < 3 is a lower-tail test, since the alternative hypothesis is focused on the lower tail, below the mean of 3.

▪ H0: μ ≤ 3,  H1: μ > 3 is an upper-tail test, since the alternative hypothesis is focused on the upper tail, above the mean of 3.
Lower-Tail Tests
H0: μ ≥ 3
H1: μ < 3

▪ There is only one critical value, since the rejection area is in only one tail.
▪ The rejection region of area α lies in the lower tail, to the left of the critical value −Zα: reject H0 if Z < −Zα.
Upper-Tail Tests

H0: μ ≤ 3
H1: μ > 3

▪ There is only one critical value, since the rejection area is in only one tail.
▪ The rejection region of area α lies in the upper tail, to the right of the critical value Zα: reject H0 if Z > Zα.
Example: Upper-Tail Z Test
for Mean (σ Known)

A phone industry manager thinks that customer monthly cell phone bills have increased, and now average over $52 per month. The company wishes to test this claim. (Assume σ = 10 is known.)

Form the hypothesis test:

H0: μ ≤ 52  (the average is not over $52 per month)
H1: μ > 52  (the average is greater than $52 per month, i.e., sufficient evidence exists to support the manager's claim)
Example: Find Rejection Region
(continued)
▪ Suppose that α = 0.10 is chosen for this test.

Find the rejection region: with α = 0.10, the rejection region is the upper tail to the right of the critical value 1.28.

Reject H0 if Z > 1.28
Review:
One-Tail Critical Value

What is Z given α = 0.10? Since the upper-tail area is 0.10, the cumulative area to the left of the critical value is 0.90.

Standardized normal distribution table (portion):

Z      .07     .08     .09
1.1   .8790   .8810   .8830
1.2   .8980   .8997   .9015
1.3   .9147   .9162   .9177

The cumulative probability closest to 0.90 is .8997, in row 1.2 and column .08, so the critical value is Z = 1.28.
Example: Test Statistic
(continued)

Obtain a sample and compute the test statistic.

▪ Suppose a sample is taken with the following results: n = 64, X̄ = 53.1 (σ = 10 was assumed known).
▪ Then the test statistic is:

Z = (X̄ − μ) / (σ / √n) = (53.1 − 52) / (10 / √64) = 0.88
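The test statistic and critical value above can be reproduced in a few lines of Python. This is a minimal sketch, assuming SciPy is available, using the sample values from the slide.

```python
# Minimal sketch of the upper-tail Z test above (values taken from the slide).
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, x_bar, alpha = 52, 10, 64, 53.1, 0.10

z = (x_bar - mu0) / (sigma / sqrt(n))   # test statistic: 0.88
z_crit = norm.ppf(1 - alpha)            # upper-tail critical value: about 1.28

print(f"Z = {z:.2f}, critical value = {z_crit:.2f}")
print("Reject H0" if z > z_crit else "Do not reject H0")
```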
Example: Decision
(continued)
Reach a decision and interpret the result:

With α = 0.10 the critical value is 1.28, and the test statistic Z = 0.88 falls in the "do not reject" region.

▪ Do not reject H0, since Z = 0.88 ≤ 1.28
▪ i.e., there is not sufficient evidence that the mean bill is over $52
p -Value Solution
(continued)
Calculate the p-value and compare it to α (assuming that μ = 52.0):

p-value = P(X̄ ≥ 53.1) = P(Z ≥ (53.1 − 52.0) / (10 / √64)) = P(Z ≥ 0.88) = 1 − 0.8106 = 0.1894

Do not reject H0, since p-value = 0.1894 > α = 0.10
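The same p-value can be computed directly; this is a minimal sketch, again assuming SciPy, where norm.sf gives the upper-tail area.

```python
# Minimal sketch of the p-value calculation above (assumes SciPy).
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, x_bar, alpha = 52.0, 10, 64, 53.1, 0.10

z = (x_bar - mu0) / (sigma / sqrt(n))   # 0.88
p_value = norm.sf(z)                    # upper-tail area P(Z >= 0.88), about 0.1894

print(f"p-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Do not reject H0")
```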
t Test of Hypothesis for the Mean
(σ Unknown)
▪ Convert the sample statistic (X̄) to a t test statistic.

Hypothesis tests for μ:
▪ σ known:   Z test
▪ σ unknown: t test

When σ is unknown, the test statistic is:

t(n−1) = (X̄ − μ) / (S / √n)
Example: Two-Tail Test
(σ Unknown)

The average cost of a hotel room in New York is said to be $168 per night. A random sample of 25 hotels resulted in X̄ = $172.50 and S = $15.40. Test at the α = 0.05 level. (Assume the population distribution is normal.)

H0: μ = 168
H1: μ ≠ 168
Example Solution:
Two-Tail Test

▪ H0: μ = 168,  H1: μ ≠ 168
▪ α = 0.05
▪ n = 25
▪ σ is unknown, so use a t statistic
▪ Critical values: ±t(24, 0.025) = ±2.0639

Test statistic:

t(n−1) = (X̄ − μ) / (S / √n) = (172.50 − 168) / (15.40 / √25) = 1.46

Since −2.0639 < 1.46 < 2.0639, the test statistic falls in the "do not reject" region.

Do not reject H0: there is not sufficient evidence that the true mean cost is different from $168.
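This two-tail t test can be sketched in Python from the summary statistics on the slide (assuming SciPy); if the 25 raw room rates were available, scipy.stats.ttest_1samp could be used instead.

```python
# Minimal sketch of the two-tail t test above (summary statistics from the slide).
from math import sqrt
from scipy.stats import t

mu0, x_bar, s, n, alpha = 168, 172.50, 15.40, 25, 0.05

t_stat = (x_bar - mu0) / (s / sqrt(n))      # 1.46
t_crit = t.ppf(1 - alpha / 2, df=n - 1)     # 2.0639 with 24 degrees of freedom

print(f"t = {t_stat:.2f}, critical values = ±{t_crit:.4f}")
print("Reject H0" if abs(t_stat) > t_crit else "Do not reject H0")
```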
What is an ANOVA?

Definition:
An ANOVA (analysis of variance) test is a statistical test used to determine whether there is a statistically significant difference between the means of two or more categorical groups, by using variance to test for differences of means.
What is an ANOVA? (contd.)

▪ Another key part of ANOVA is that it splits the independent variable into two or more groups.
For example: one or more groups might be expected to influence the dependent variable, while another group is used as a control group and is not expected to influence the dependent variable.
What is an ANOVA? (contd.)

▪ ANOVA tests for significance use the F-test. The F-test is a groupwise comparison test, which means it compares the variance in each group mean to the overall variance in the dependent variable.

The formula for the one-way ANOVA F statistic is:

F = MSB / MSW, where MSB = SSB / (k − 1) is the mean square between groups, MSW = SSW / (N − k) is the mean square within groups, k is the number of groups, and N is the total number of observations.
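To make the formula concrete, here is a minimal sketch that computes SSB, SSW, and F by hand with NumPy; the three small groups of numbers are made up purely for illustration.

```python
# Minimal by-hand computation of the one-way ANOVA F statistic (made-up data).
import numpy as np

groups = [np.array([3.1, 2.8, 3.3]),
          np.array([2.4, 2.1, 2.5]),
          np.array([1.0, 1.4, 1.2])]

all_data = np.concatenate(groups)
k, N = len(groups), all_data.size
grand_mean = all_data.mean()

ssb = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)  # between-group SS
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)            # within-group SS

msb = ssb / (k - 1)   # mean square between groups
msw = ssw / (N - k)   # mean square within groups
print(f"F = {msb / msw:.2f}")
```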


Assumptions of ANOVA

▪ Can only be conducted if there is no relationship between the subjects in each sample, i.e. subjects in the first group cannot also be in the second group (independent samples / between-groups).

▪ The different groups/levels must have equal sample sizes.

▪ Can only be conducted if the dependent variable is normally distributed, so that the middle scores are most frequent and extreme scores are least frequent.

▪ Population variances must be equal (i.e. homoscedastic). Homogeneity of variance means that the deviation of scores (measured by the range or standard deviation, for example) is similar between populations.
Types of ANOVA Tests
One way ANOVA

▪ Definition:
▪ A one-way ANOVA is used to compare the means of two or more independent (unrelated) groups using the F-distribution.
▪ The null hypothesis for the test is that the group means are equal.

Examples:
▪ Situation 1: You have a group of individuals randomly split into smaller groups completing different tasks. For example, you might be studying the effects of tea on weight loss and form three groups: green tea, black tea, and no tea.
▪ Situation 2: Similar to situation 1, but in this case the individuals are split into groups based on an attribute they possess. For example, you might be studying the leg strength of people according to weight. You could split participants into weight categories (obese, overweight, and normal) and measure their leg strength on a weight machine.
Limitation of one way ANOVA

▪ A one-way ANOVA will tell you that at least two groups were different from each other, but it won't tell you which groups were different.
Two way ANOVA

▪ Definition:
▪ A two-way ANOVA is used to estimate how
the mean of a quantitative variable changes
according to the levels of two categorical variables.
▪ Use a two-way ANOVA when you want to know how
two independent variables, in combination, affect a
dependent variable.
Assumptions for Two Way
ANOVA

▪ The population must be close to a normal distribution.
▪ Samples must be independent.
▪ Population variances must be equal (i.e. homoscedastic).
▪ Groups must have equal sample sizes.
Examples

▪ You are researching which type of fertilizer and planting density produces the greatest crop yield in a field experiment.
▪ You assign different plots in a field to a combination of fertilizer type (1, 2, or 3) and planting density (1 = low density, 2 = high density), and measure the final crop yield in bushels per acre at harvest time.
▪ You can use a two-way ANOVA to find out if fertilizer type and planting density have an effect on average crop yield.
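A minimal sketch of this two-way ANOVA using statsmodels' formula interface (assuming pandas and statsmodels are installed); the fertilizer, density, and yield values below are made up.

```python
# Minimal sketch of a two-way ANOVA for the crop-yield example (made-up data).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    "fertilizer": ["1", "1", "2", "2", "3", "3", "1", "1", "2", "2", "3", "3"],
    "density":    ["low", "high"] * 6,
    "yield_bu":   [48, 55, 50, 59, 46, 53, 47, 56, 52, 60, 45, 54],
})

# Model crop yield as a function of fertilizer, density, and their interaction.
model = ols("yield_bu ~ C(fertilizer) * C(density)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))
```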
Feature Selection

▪ In a real-life development procedure, the data given to any modeller has various features.

▪ Various features are present in the data that are not even required for building the models.

▪ The presence of those features can reduce the performance level of the model, so feature selection becomes an important data preprocessing step in modelling.
Sequential Feature Selection

▪ There are various algorithms to select features; one such family is the sequential feature selection algorithms.

▪ Sequential feature selection algorithms are basically part of the wrapper methods: they add and remove features from the dataset sequentially.
The sequential feature selection method has two components:

▪ An objective function (criterion): the method searches for the subset of features, drawn from the set of all features, that optimizes this criterion, typically reducing the overall number of features while enhancing the results.

▪ A sequential search algorithm: this search adds or removes a feature candidate from the candidate subset while evaluating the objective function or criterion.
Sequential Feature Selection (contd.)

Sequential searches follow only one direction: either they increase the number of features in the candidate subset or they reduce it.

▪ On the basis of this direction of movement, we can divide them into two variants (a code sketch follows this list):

▪ Sequential forward selection (SFS): features are sequentially added to an initially empty set of features until the addition of further features no longer improves the criterion.

▪ Sequential backward selection (SBS): this variant starts with all the features from the input data combined in one set and sequentially removes them until the removal of further features worsens the criterion.
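A minimal sketch of SFS and SBS using scikit-learn's SequentialFeatureSelector (available in scikit-learn 0.24 and later); the diabetes dataset and LinearRegression estimator are illustrative choices, not part of the original slides.

```python
# Minimal sketch of sequential forward (SFS) and backward (SBS) selection.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
estimator = LinearRegression()

# SFS: start from an empty set and add features one at a time.
sfs = SequentialFeatureSelector(estimator, n_features_to_select=4,
                                direction="forward", cv=5)
sfs.fit(X, y)
print("SFS selected columns:", sfs.get_support(indices=True))

# SBS: start from all features and remove them one at a time.
sbs = SequentialFeatureSelector(estimator, n_features_to_select=4,
                                direction="backward", cv=5)
sbs.fit(X, y)
print("SBS selected columns:", sbs.get_support(indices=True))
```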
Stepwise Regression

▪ Definition:
▪ Stepwise regression is the step-by-step iterative
construction of a regression model that involves the
selection of independent variables to be used in a
final model. It involves adding or removing potential
explanatory variables in succession and testing for
statistical significance after each iteration.
Stepwise Regression

There are three approaches to stepwise regression (a backward-elimination sketch follows this list):

▪ Forward selection begins with no variables in the model, tests each variable as it is added to the model, then keeps those that are deemed most statistically significant, repeating the process until the results are optimal.

▪ Backward elimination starts with a set of independent variables, deleting one at a time, then testing to see if the removed variable is statistically significant.

▪ Bidirectional elimination is a combination of the first two methods, testing at each step which variables should be included or excluded.
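A minimal sketch of p-value-driven backward elimination with statsmodels; real stepwise procedures often use criteria such as AIC instead, and the simulated data here is purely illustrative.

```python
# Minimal sketch of backward elimination based on p-values (simulated data).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["x1", "x2", "x3", "x4"])
y = 2.0 * X["x1"] - 1.5 * X["x3"] + rng.normal(size=200)   # x2 and x4 are noise

def backward_eliminate(X, y, threshold=0.05):
    """Repeatedly drop the least significant predictor until all p-values < threshold."""
    features = list(X.columns)
    while features:
        model = sm.OLS(y, sm.add_constant(X[features])).fit()
        pvals = model.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] < threshold:
            break
        features.remove(worst)          # remove the least significant variable
    return features, model

selected, final_model = backward_eliminate(X, y)
print("Selected predictors:", selected)   # expected to keep x1 and x3
```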
Example

▪ An example of a stepwise regression using the backward elimination method would be an attempt to understand energy usage at a factory using variables such as equipment run time, equipment age, staff size, outside temperature, and time of year.

▪ The model includes all of the variables; then each is removed, one at a time, to determine which is least statistically significant.

▪ In the end, the model might show that time of year and temperature are most significant, possibly suggesting that peak energy consumption at the factory occurs when air conditioner usage is at its highest.
Dummy Variables

▪ Definition:
▪ Numeric variables used in regression analysis to
represent categorical data that can only take on one
of two values: zero or one.
When to use dummy variables?
▪ When using categorical variables, it doesn't make sense to just assign values like 1, 2, and 3 to categories like "blue", "green", and "brown", because that would imply that green is twice as colorful as blue or that brown is three times as colorful as blue.

▪ Instead, the solution is to use dummy variables.

▪ These are variables that we create specifically for regression analysis and that take on one of two values: zero or one.

▪ The number of dummy variables we must create is equal to k − 1, where k is the number of different values that the categorical variable can take on.
Example

▪ Suppose we have the following dataset and we would like to use gender and age to predict income:

(Table: gender, age, and income for each observation.)
Example

▪ To use gender as a predictor variable in a regression model, we must convert it into a dummy variable.

▪ Since it is currently a categorical variable that can take on two different values ("Male" or "Female"), we only need to create k − 1 = 2 − 1 = 1 dummy variable.

▪ To create this dummy variable, we can choose one of the values ("Male" or "Female") to represent 0 and the other to represent 1.

▪ In general, we usually represent the most frequently occurring value with a 0, which would be "Male" in this dataset.

▪ Thus, we would convert gender into a dummy variable Gender_Dummy coded as 0 for "Male" and 1 for "Female".

▪ We could then use Age and Gender_Dummy as predictor variables in a regression model.
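A minimal sketch of this conversion with pandas; the small gender, age, and income table is invented to stand in for the dataset referenced above.

```python
# Minimal sketch of the dummy-variable conversion described above (made-up data).
import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Male"],
    "age":    [25, 31, 42, 29, 37],
    "income": [42000, 51000, 60000, 48000, 57000],
})

# Code the more frequent value ("Male") as 0 and "Female" as 1.
df["Gender_Dummy"] = (df["gender"] == "Female").astype(int)
print(df[["age", "Gender_Dummy", "income"]])

# For a categorical column with k levels, pd.get_dummies(..., drop_first=True)
# produces the k - 1 dummy columns in a single call.
```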
Logistic Regression

▪ Definition:
▪ Logistic regression is a process of modeling the probability of a discrete outcome given an input variable. The most common logistic regression models a binary outcome: something that can take two values such as true/false, yes/no, and so on.
▪ Logistic regression, despite its name, is a classification model rather than a regression model.
Linear VS Logistic Regression

▪ Linear regression models are used to identify the relationship between a continuous dependent variable and one or more independent variables.

▪ Similar to linear regression, logistic regression is also used to estimate the relationship between a dependent variable and one or more independent variables, but it is used to make a prediction about a categorical variable rather than a continuous one.
Types of logistic regression
▪ Binary logistic regression: In this approach, the response or dependent variable is dichotomous in nature, i.e. it has only two possible outcomes (e.g. 0 or 1).
▪ Example: predicting whether an e-mail is spam or not spam, or whether a tumor is malignant or not malignant.

▪ Multinomial logistic regression: In this type of logistic regression model, the dependent variable has three or more possible outcomes; however, these values have no specified order.
▪ Example: movie studios want to predict what genre of film a moviegoer is likely to see in order to market films more effectively. A multinomial logistic regression model can help the studio determine the strength of influence a person's age, gender, and dating status may have on the type of film that they prefer.

▪ Ordinal logistic regression: This type of logistic regression model is leveraged when the response variable has three or more possible outcomes, but in this case these values do have a defined order.
▪ Example: ordinal responses include grading scales from A to F or rating scales from 1 to 5.
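A minimal sketch of binary logistic regression with scikit-learn, using its built-in breast cancer dataset as a stand-in for the malignant/benign tumour example above.

```python
# Minimal sketch of binary logistic regression (malignant vs. benign tumours).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling plus (default L2) regularization helps limit the overfitting risk
# that arises when there are many predictor variables.
clf = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
print("P(benign) for the first test sample:", clf.predict_proba(X_test[:1])[0, 1])
```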
Logistic regression and machine learning

▪ Within machine learning, logistic regression belongs to the family of supervised machine learning models.

▪ Unlike a generative algorithm such as naïve Bayes, it cannot generate information, such as an image, of the class that it is trying to predict (e.g. a picture of a cat).

▪ Logistic regression can also be prone to overfitting, particularly when there is a high number of predictor variables in the model.
Use cases of logistic regression
▪ Logistic regression is commonly used for prediction and classification problems. Some of these use cases include:

▪ Fraud detection: Logistic regression models can help teams identify data anomalies that are predictive of fraud. Certain behaviors or characteristics may have a higher association with fraudulent activities, which is particularly helpful to banking and other financial institutions in protecting their clients.

▪ Disease prediction: In medicine, this analytics approach can be used to predict the likelihood of disease or illness for a given population. Healthcare organizations can set up preventative care for individuals that show a higher propensity for specific illnesses.