Unit 2

Data Analytics

Source: Business Statistics, A First Course (4e), © 2006 Prentice-Hall, Inc.
What is a Hypothesis?
▪ A hypothesis is a claim (assumption) about a population parameter:

▪ Population mean
  Example: The mean monthly cell phone bill of this city is μ = $42

▪ Population proportion
  Example: The proportion of adults in this city with cell phones is π = 0.68
The Null Hypothesis, H0

▪ States the claim or assertion to be tested

  Example: The average number of TV sets in U.S. homes is equal to three (H0: μ = 3)

▪ Is always about a population parameter, not about a sample statistic

  H0: μ = 3 (correct: μ is a population parameter)    H0: X̄ = 3 (incorrect: X̄ is a sample statistic)
The Null Hypothesis, H0
(continued)

▪ Begin with the assumption that the null hypothesis is true
▪ Similar to the notion of innocent until proven guilty
▪ Refers to the status quo
▪ Always contains an "=", "≤", or "≥" sign
▪ May or may not be rejected

The Alternative Hypothesis, H1
▪ Is the opposite of the null hypothesis
  e.g., The average number of TV sets in U.S. homes is not equal to 3 (H1: μ ≠ 3)
▪ Challenges the status quo
▪ Never contains the "=", "≤", or "≥" sign
▪ May or may not be proven
▪ Is generally the hypothesis that the researcher is trying to prove

Hypothesis Testing Process

Claim: the population mean age is 50 (null hypothesis H0: μ = 50).

▪ Select a random sample from the population; suppose the sample mean age is X̄ = 20.
▪ Ask: is X̄ = 20 likely if μ = 50 is true?
▪ If it is not likely, REJECT the null hypothesis.
Reason for Rejecting H0
(Figure: sampling distribution of X̄, centered at μ = 50 when H0 is true, with the observed value 20 far out in the tail.)

If it is unlikely that we would get a sample mean of this value (X̄ = 20) if in fact μ = 50 were the population mean, then we reject the null hypothesis that μ = 50.
Level of Significance, α

▪ Defines the unlikely values of the sample statistic if the null hypothesis is true
▪ Defines the rejection region of the sampling distribution
▪ Is designated by α (the level of significance)
▪ Typical values are 0.01, 0.05, or 0.10
▪ Is selected by the researcher at the beginning
▪ Provides the critical value(s) of the test
Level of Significance
and the Rejection Region
Level of significance = α (the shaded rejection region in each sketch; its boundary is the critical value)

▪ Two-tail test:   H0: μ = 3,  H1: μ ≠ 3 (rejection regions of area α/2 in each tail)
▪ Upper-tail test: H0: μ ≤ 3,  H1: μ > 3 (rejection region of area α in the upper tail)
▪ Lower-tail test: H0: μ ≥ 3,  H1: μ < 3 (rejection region of area α in the lower tail)
Errors in Making Decisions

▪ Type I Error
▪ Reject a true null hypothesis
▪ Considered a serious type of error
▪ The probability of a Type I error is α
▪ Called the level of significance of the test
▪ Set by the researcher in advance
Errors in Making Decisions
(continued)

▪ Type II Error
▪ Fail to reject a false null hypothesis
▪ The probability of a Type II error is β
Outcomes and Probabilities

Possible hypothesis test outcomes (key: outcome and its probability):

▪ Do not reject H0 when H0 is true:  no error (1 − α)
▪ Do not reject H0 when H0 is false: Type II error (β)
▪ Reject H0 when H0 is true:         Type I error (α)
▪ Reject H0 when H0 is false:        no error (1 − β)
Hypothesis Tests for the Mean

Hypothesis tests for μ:

▪ σ known:   Z test
▪ σ unknown: t test
One-Tail Tests

▪ In many cases, the alternative hypothesis focuses on a particular direction.

▪ H0: μ ≥ 3,  H1: μ < 3 is a lower-tail test, since the alternative hypothesis is focused on the lower tail, below the mean of 3.

▪ H0: μ ≤ 3,  H1: μ > 3 is an upper-tail test, since the alternative hypothesis is focused on the upper tail, above the mean of 3.
Lower-Tail Tests
H0: μ ≥ 3
H1: μ < 3

▪ There is only one critical value, since the rejection area is in only one tail.
▪ The rejection region of area α lies in the lower tail, to the left of the critical value −Zα: reject H0 if Z < −Zα.
Upper-Tail Tests

H0: μ ≤ 3
H1: μ > 3

▪ There is only one critical value, since the rejection area is in only one tail.
▪ The rejection region of area α lies in the upper tail, to the right of the critical value Zα: reject H0 if Z > Zα.
Example: Upper-Tail Z Test
for Mean (σ Known)

A phone industry manager thinks that customer monthly cell phone bills have increased, and now average over $52 per month. The company wishes to test this claim. (Assume σ = 10 is known.)

Form the hypothesis test:

H0: μ ≤ 52  (the average is not over $52 per month)
H1: μ > 52  (the average is greater than $52 per month, i.e., sufficient evidence exists to support the manager's claim)
Example: Find Rejection Region
(continued)
▪ Suppose that α = 0.10 is chosen for this test.

Find the rejection region: with α = 0.10, the rejection region is the upper tail to the right of the critical value 1.28.

Reject H0 if Z > 1.28
Review:
One-Tail Critical Value

What is Z given α = 0.10? Since the upper-tail area is 0.10, the cumulative area to the left of the critical value is 0.90.

Standardized normal distribution table (portion):

Z      .07     .08     .09
1.1   .8790   .8810   .8830
1.2   .8980   .8997   .9015
1.3   .9147   .9162   .9177

The cumulative probability closest to 0.90 is .8997, in row 1.2 and column .08, so the critical value is Z = 1.28.
Example: Test Statistic
(continued)

Obtain a sample and compute the test statistic.

▪ Suppose a sample is taken with the following results: n = 64, X̄ = 53.1 (σ = 10 was assumed known).
▪ Then the test statistic is:

Z = (X̄ − μ) / (σ / √n) = (53.1 − 52) / (10 / √64) = 0.88
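The test statistic and critical value above can be reproduced in a few lines of Python. This is a minimal sketch, assuming SciPy is available, using the sample values from the slide.

```python
# Minimal sketch of the upper-tail Z test above (values taken from the slide).
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, x_bar, alpha = 52, 10, 64, 53.1, 0.10

z = (x_bar - mu0) / (sigma / sqrt(n))   # test statistic: 0.88
z_crit = norm.ppf(1 - alpha)            # upper-tail critical value: about 1.28

print(f"Z = {z:.2f}, critical value = {z_crit:.2f}")
print("Reject H0" if z > z_crit else "Do not reject H0")
```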
Example: Decision
(continued)
Reach a decision and interpret the result:

With α = 0.10 the critical value is 1.28, and the test statistic Z = 0.88 falls in the "do not reject" region.

▪ Do not reject H0, since Z = 0.88 ≤ 1.28
▪ i.e., there is not sufficient evidence that the mean bill is over $52
p -Value Solution
(continued)
Calculate the p-value and compare it to α (assuming that μ = 52.0):

p-value = P(X̄ ≥ 53.1) = P(Z ≥ (53.1 − 52.0) / (10 / √64)) = P(Z ≥ 0.88) = 1 − 0.8106 = 0.1894

Do not reject H0, since p-value = 0.1894 > α = 0.10
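The same p-value can be computed directly; this is a minimal sketch, again assuming SciPy, where norm.sf gives the upper-tail area.

```python
# Minimal sketch of the p-value calculation above (assumes SciPy).
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, x_bar, alpha = 52.0, 10, 64, 53.1, 0.10

z = (x_bar - mu0) / (sigma / sqrt(n))   # 0.88
p_value = norm.sf(z)                    # upper-tail area P(Z >= 0.88), about 0.1894

print(f"p-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Do not reject H0")
```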
t Test of Hypothesis for the Mean
(σ Unknown)
▪ Convert the sample statistic (X̄) to a t test statistic.

Hypothesis tests for μ:
▪ σ known:   Z test
▪ σ unknown: t test

When σ is unknown, the test statistic is:

t(n−1) = (X̄ − μ) / (S / √n)
Example: Two-Tail Test
(σ Unknown)

The average cost of a hotel room in New York is said to be $168 per night. A random sample of 25 hotels resulted in X̄ = $172.50 and S = $15.40. Test at the α = 0.05 level. (Assume the population distribution is normal.)

H0: μ = 168
H1: μ ≠ 168
Example Solution:
Two-Tail Test

▪ H0: μ = 168,  H1: μ ≠ 168
▪ α = 0.05
▪ n = 25
▪ σ is unknown, so use a t statistic
▪ Critical values: ±t(24, 0.025) = ±2.0639

Test statistic:

t(n−1) = (X̄ − μ) / (S / √n) = (172.50 − 168) / (15.40 / √25) = 1.46

Since −2.0639 < 1.46 < 2.0639, the test statistic falls in the "do not reject" region.

Do not reject H0: there is not sufficient evidence that the true mean cost is different from $168.
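This two-tail t test can be sketched in Python from the summary statistics on the slide (assuming SciPy); if the 25 raw room rates were available, scipy.stats.ttest_1samp could be used instead.

```python
# Minimal sketch of the two-tail t test above (summary statistics from the slide).
from math import sqrt
from scipy.stats import t

mu0, x_bar, s, n, alpha = 168, 172.50, 15.40, 25, 0.05

t_stat = (x_bar - mu0) / (s / sqrt(n))      # 1.46
t_crit = t.ppf(1 - alpha / 2, df=n - 1)     # 2.0639 with 24 degrees of freedom

print(f"t = {t_stat:.2f}, critical values = ±{t_crit:.4f}")
print("Reject H0" if abs(t_stat) > t_crit else "Do not reject H0")
```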
What is an ANOVA?

Definition:
An ANOVA (analysis of variance) test is a statistical test used to determine whether there is a statistically significant difference between the means of two or more categorical groups, by using variance to test for differences of means.
What is an ANOVA? (contd.)

▪ Another key part of ANOVA is that it splits the independent variable into two or more groups.
For example: one or more groups might be expected to influence the dependent variable, while another group is used as a control group and is not expected to influence the dependent variable.
What is an ANOVA? (contd.)

▪ ANOVA tests for significance use the F-test. The F-test is a groupwise comparison test, which means it compares the variance in each group mean to the overall variance in the dependent variable.

The formula for the one-way ANOVA F statistic is:

F = MSB / MSW, where MSB = SSB / (k − 1) is the mean square between groups, MSW = SSW / (N − k) is the mean square within groups, k is the number of groups, and N is the total number of observations.
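To make the formula concrete, here is a minimal sketch that computes SSB, SSW, and F by hand with NumPy; the three small groups of numbers are made up purely for illustration.

```python
# Minimal by-hand computation of the one-way ANOVA F statistic (made-up data).
import numpy as np

groups = [np.array([3.1, 2.8, 3.3]),
          np.array([2.4, 2.1, 2.5]),
          np.array([1.0, 1.4, 1.2])]

all_data = np.concatenate(groups)
k, N = len(groups), all_data.size
grand_mean = all_data.mean()

ssb = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)  # between-group SS
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)            # within-group SS

msb = ssb / (k - 1)   # mean square between groups
msw = ssw / (N - k)   # mean square within groups
print(f"F = {msb / msw:.2f}")
```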


Assumptions of ANOVA

▪ Can only be conducted if there is no relationship between the subjects in each sample, i.e. subjects in the first group cannot also be in the second group (independent samples / between-groups).

▪ The different groups/levels must have equal sample sizes.

▪ Can only be conducted if the dependent variable is normally distributed, so that the middle scores are most frequent and extreme scores are least frequent.

▪ Population variances must be equal (i.e. homoscedastic). Homogeneity of variance means that the deviation of scores (measured by the range or standard deviation, for example) is similar between populations.
Types of ANOVA Tests
One way ANOVA

▪ Definition:
▪ A one-way ANOVA is used to compare the means of two or more independent (unrelated) groups using the F-distribution.
▪ The null hypothesis for the test is that the group means are equal.

Examples:
▪ Situation 1: You have a group of individuals randomly split into smaller groups completing different tasks. For example, you might be studying the effects of tea on weight loss and form three groups: green tea, black tea, and no tea.
▪ Situation 2: Similar to situation 1, but in this case the individuals are split into groups based on an attribute they possess. For example, you might be studying the leg strength of people according to weight. You could split participants into weight categories (obese, overweight, and normal) and measure their leg strength on a weight machine.
Limitation of one way ANOVA

▪ A one-way ANOVA will tell you that at least two groups were different from each other, but it won't tell you which groups were different.
Two way ANOVA

▪ Definition:
▪ A two-way ANOVA is used to estimate how
the mean of a quantitative variable changes
according to the levels of two categorical variables.
▪ Use a two-way ANOVA when you want to know how
two independent variables, in combination, affect a
dependent variable.
Assumptions for Two Way
ANOVA

▪ The population must be close to a normal distribution.
▪ Samples must be independent.
▪ Population variances must be equal (i.e. homoscedastic).
▪ Groups must have equal sample sizes.
Examples

▪ You are researching which type of fertilizer and planting density produces the greatest crop yield in a field experiment.
▪ You assign different plots in a field to a combination of fertilizer type (1, 2, or 3) and planting density (1 = low density, 2 = high density), and measure the final crop yield in bushels per acre at harvest time.
▪ You can use a two-way ANOVA to find out if fertilizer type and planting density have an effect on average crop yield.
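A minimal sketch of this two-way ANOVA using statsmodels' formula interface (assuming pandas and statsmodels are installed); the fertilizer, density, and yield values below are made up.

```python
# Minimal sketch of a two-way ANOVA for the crop-yield example (made-up data).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    "fertilizer": ["1", "1", "2", "2", "3", "3", "1", "1", "2", "2", "3", "3"],
    "density":    ["low", "high"] * 6,
    "yield_bu":   [48, 55, 50, 59, 46, 53, 47, 56, 52, 60, 45, 54],
})

# Model crop yield as a function of fertilizer, density, and their interaction.
model = ols("yield_bu ~ C(fertilizer) * C(density)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))
```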
Feature Selection

▪ In a real-life development procedure, the data given to any modeller has various features.

▪ Various features are present in the data that are not even required for building the models.

▪ The presence of those features can reduce the performance level of the model, so feature selection becomes an important data preprocessing step in modelling.
Sequential Feature Selection

▪ There are various algorithms to select features; one such family is the sequential feature selection algorithms.

▪ Sequential feature selection algorithms are basically part of the wrapper methods: they add and remove features from the dataset sequentially.
The sequential feature selection method has two components:

▪ An objective function (criterion): the method searches for the subset of features, drawn from the set of all features, that optimizes this criterion, typically reducing the overall number of features while enhancing the results.

▪ A sequential search algorithm: this search adds or removes a feature candidate from the candidate subset while evaluating the objective function or criterion.
Sequential Feature Selection (contd.)

Sequential searches follow only one direction: either they increase the number of features in the candidate subset or they reduce it.

▪ On the basis of this direction of movement, we can divide them into two variants (a code sketch follows this list):

▪ Sequential forward selection (SFS): features are sequentially added to an initially empty set of features until the addition of further features no longer improves the criterion.

▪ Sequential backward selection (SBS): this variant starts with all the features from the input data combined in one set and sequentially removes them until the removal of further features worsens the criterion.
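A minimal sketch of SFS and SBS using scikit-learn's SequentialFeatureSelector (available in scikit-learn 0.24 and later); the diabetes dataset and LinearRegression estimator are illustrative choices, not part of the original slides.

```python
# Minimal sketch of sequential forward (SFS) and backward (SBS) selection.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
estimator = LinearRegression()

# SFS: start from an empty set and add features one at a time.
sfs = SequentialFeatureSelector(estimator, n_features_to_select=4,
                                direction="forward", cv=5)
sfs.fit(X, y)
print("SFS selected columns:", sfs.get_support(indices=True))

# SBS: start from all features and remove them one at a time.
sbs = SequentialFeatureSelector(estimator, n_features_to_select=4,
                                direction="backward", cv=5)
sbs.fit(X, y)
print("SBS selected columns:", sbs.get_support(indices=True))
```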
Stepwise Regression

▪ Definition:
▪ Stepwise regression is the step-by-step iterative
construction of a regression model that involves the
selection of independent variables to be used in a
final model. It involves adding or removing potential
explanatory variables in succession and testing for
statistical significance after each iteration.
Stepwise Regression

There are three approaches to stepwise regression (a backward-elimination sketch follows this list):

▪ Forward selection begins with no variables in the model, tests each variable as it is added to the model, then keeps those that are deemed most statistically significant, repeating the process until the results are optimal.

▪ Backward elimination starts with a set of independent variables, deleting one at a time, then testing to see if the removed variable is statistically significant.

▪ Bidirectional elimination is a combination of the first two methods, testing at each step which variables should be included or excluded.
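A minimal sketch of p-value-driven backward elimination with statsmodels; real stepwise procedures often use criteria such as AIC instead, and the simulated data here is purely illustrative.

```python
# Minimal sketch of backward elimination based on p-values (simulated data).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["x1", "x2", "x3", "x4"])
y = 2.0 * X["x1"] - 1.5 * X["x3"] + rng.normal(size=200)   # x2 and x4 are noise

def backward_eliminate(X, y, threshold=0.05):
    """Repeatedly drop the least significant predictor until all p-values < threshold."""
    features = list(X.columns)
    while features:
        model = sm.OLS(y, sm.add_constant(X[features])).fit()
        pvals = model.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] < threshold:
            break
        features.remove(worst)          # remove the least significant variable
    return features, model

selected, final_model = backward_eliminate(X, y)
print("Selected predictors:", selected)   # expected to keep x1 and x3
```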
Example

▪ An example of a stepwise regression using the backward elimination method would be an attempt to understand energy usage at a factory using variables such as equipment run time, equipment age, staff size, outside temperature, and time of year.

▪ The model includes all of the variables; then each is removed, one at a time, to determine which is least statistically significant.

▪ In the end, the model might show that time of year and temperature are most significant, possibly suggesting that peak energy consumption at the factory occurs when air conditioner usage is at its highest.
Dummy Variables

▪ Definition:
▪ Numeric variables used in regression analysis to
represent categorical data that can only take on one
of two values: zero or one.
When to use dummy variables?
▪ When using categorical variables, it doesn't make sense to just assign values like 1, 2, and 3 to categories like "blue", "green", and "brown", because that would imply that green is twice as colorful as blue or that brown is three times as colorful as blue.

▪ Instead, the solution is to use dummy variables.

▪ These are variables that we create specifically for regression analysis and that take on one of two values: zero or one.

▪ The number of dummy variables we must create is equal to k − 1, where k is the number of different values that the categorical variable can take on.
Example

▪ Suppose we have the following dataset and we would like to use gender and age to predict income:

(Table: gender, age, and income for each observation.)
Example

▪ To use gender as a predictor variable in a regression model, we must convert it into a dummy variable.

▪ Since it is currently a categorical variable that can take on two different values ("Male" or "Female"), we only need to create k − 1 = 2 − 1 = 1 dummy variable.

▪ To create this dummy variable, we can choose one of the values ("Male" or "Female") to represent 0 and the other to represent 1.

▪ In general, we usually represent the most frequently occurring value with a 0, which would be "Male" in this dataset.

▪ Thus, we would convert gender into a dummy variable Gender_Dummy coded as 0 for "Male" and 1 for "Female".

▪ We could then use Age and Gender_Dummy as predictor variables in a regression model.
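A minimal sketch of this conversion with pandas; the small gender, age, and income table is invented to stand in for the dataset referenced above.

```python
# Minimal sketch of the dummy-variable conversion described above (made-up data).
import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Male"],
    "age":    [25, 31, 42, 29, 37],
    "income": [42000, 51000, 60000, 48000, 57000],
})

# Code the more frequent value ("Male") as 0 and "Female" as 1.
df["Gender_Dummy"] = (df["gender"] == "Female").astype(int)
print(df[["age", "Gender_Dummy", "income"]])

# For a categorical column with k levels, pd.get_dummies(..., drop_first=True)
# produces the k - 1 dummy columns in a single call.
```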
Logistic Regression

▪ Definition:
▪ Logistic regression is a process of modeling the probability of a discrete outcome given an input variable. The most common logistic regression models a binary outcome: something that can take two values such as true/false, yes/no, and so on.
▪ Logistic regression, despite its name, is a classification model rather than a regression model.
Linear VS Logistic Regression

▪ Linear regression models are used to identify the relationship between a continuous dependent variable and one or more independent variables.

▪ Similar to linear regression, logistic regression is also used to estimate the relationship between a dependent variable and one or more independent variables, but it is used to make a prediction about a categorical variable rather than a continuous one.
Types of logistic regression
▪ Binary logistic regression: In this approach, the response or dependent variable is dichotomous in nature, i.e. it has only two possible outcomes (e.g. 0 or 1).
▪ Example: predicting whether an e-mail is spam or not spam, or whether a tumor is malignant or not malignant.

▪ Multinomial logistic regression: In this type of logistic regression model, the dependent variable has three or more possible outcomes; however, these values have no specified order.
▪ Example: movie studios want to predict what genre of film a moviegoer is likely to see in order to market films more effectively. A multinomial logistic regression model can help the studio determine the strength of influence a person's age, gender, and dating status may have on the type of film that they prefer.

▪ Ordinal logistic regression: This type of logistic regression model is leveraged when the response variable has three or more possible outcomes, but in this case these values do have a defined order.
▪ Example: ordinal responses include grading scales from A to F or rating scales from 1 to 5.
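A minimal sketch of binary logistic regression with scikit-learn, using its built-in breast cancer dataset as a stand-in for the malignant/benign tumour example above.

```python
# Minimal sketch of binary logistic regression (malignant vs. benign tumours).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling plus (default L2) regularization helps limit the overfitting risk
# that arises when there are many predictor variables.
clf = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
print("P(benign) for the first test sample:", clf.predict_proba(X_test[:1])[0, 1])
```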
Logistic regression and machine learning

▪ Within machine learning, logistic regression belongs to the family of supervised machine learning models.

▪ Unlike a generative algorithm such as naïve Bayes, it cannot generate information, such as an image, of the class that it is trying to predict (e.g. a picture of a cat).

▪ Logistic regression can also be prone to overfitting, particularly when there is a high number of predictor variables in the model.
Use cases of logistic regression
▪ Logistic regression is commonly used for prediction and classification problems. Some of these use cases include:

▪ Fraud detection: Logistic regression models can help teams identify data anomalies that are predictive of fraud. Certain behaviors or characteristics may have a higher association with fraudulent activities, which is particularly helpful to banking and other financial institutions in protecting their clients.

▪ Disease prediction: In medicine, this analytics approach can be used to predict the likelihood of disease or illness for a given population. Healthcare organizations can set up preventative care for individuals that show a higher propensity for specific illnesses.