
ASSIGNMENT No. 2
Subject: Educational Statistics (8614)
(Units 1–4)
Name: Asia Noor
Roll #: BY627591
B.Ed 1.5 years, Spring 2020

Q.1: Define hypothesis testing and the logic behind hypothesis testing.

ANS: Hypothesis Testing:


Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a population
parameter. The methodology employed by the analyst depends on the nature of the data used and the reason for
the analysis.

Hypothesis testing is used to assess the plausibility of a hypothesis by using sample data. Such data may come
from a larger population or from a data-generating process. The word "population" will be used for both of these
cases in the following descriptions.

In hypothesis testing, an analyst tests a statistical sample, with the goal of providing evidence on the plausibility of
the null hypothesis.

Statistical analysts test a hypothesis by measuring and examining a random sample of the population being
analyzed. All analysts use a random population sample to test two different hypotheses: the null hypothesis and
the alternative hypothesis.

The null hypothesis is usually a hypothesis of equality between population parameters; e.g., a null hypothesis may
state that the population mean return is equal to zero. The alternative hypothesis is effectively the opposite of the
null hypothesis (e.g., the population mean return is not equal to zero). Thus, they are mutually exclusive, and only
one can be true. However, one of the two hypotheses will always be true.

All hypotheses are tested using a four-step process:

1. The first step is for the analyst to state the two hypotheses so that only one can be right.
2. The next step is to formulate an analysis plan, which outlines how the data will be evaluated.
3. The third step is to carry out the plan and physically analyze the sample data.
4. The fourth and final step is to analyze the results and either reject the null hypothesis, or state that the null
hypothesis is plausible, given the data.

Real-World Example of Hypothesis Testing


If, for example, a person wants to test that a penny has exactly a 50% chance of landing on heads, the null
hypothesis would be that 50% is correct, and the alternative hypothesis would be that 50% is not correct.
Mathematically, the null hypothesis would be represented as H0: P = 0.5. The alternative hypothesis, denoted
"Ha", is the same statement with the equal sign struck through, Ha: P ≠ 0.5, meaning that the proportion of heads
does not equal 50%.

A random sample of 100 coin flips is taken, and the null hypothesis is then tested. If it is found that the 100 coin
flips were distributed as 40 heads and 60 tails, the analyst would assume that a penny does not have a 50%
chance of landing on heads and would reject the null hypothesis and accept the alternative hypothesis.

If, on the other hand, there were 48 heads and 52 tails, then it is plausible that the coin could be fair and still
produce such a result. In cases such as this where the null hypothesis is "accepted," the analyst states that the
difference between the expected results (50 heads and 50 tails) and the observed results (48 heads and 52 tails)
is "explainable by chance alone."

Logic Behind Hypothesis Testing:

As just stated, the logic of hypothesis testing in statistics involves four steps.

1. State the Hypothesis: We state a hypothesis (guess) about a population. Usually the hypothesis concerns
the value of a population parameter.
2. Define the Decision Method: We define a method to make a decision about the hypothesis. The method
involves sample data.
3. Gather Data: We obtain a random sample from the population.
4. Make a Decision: We compare the sample data with the hypothesis about the population. Usually we
compare the value of a statistic computed from the sample data with the hypothesized value of the
population parameter.
o If the data are consistent with the hypothesis we conclude that the hypothesis is reasonable. NOTE:
We do not conclude it is right, but reasonable! AND: We actually do this by rejecting the opposite
hypothesis (called the NULL hypothesis). More on this later.
o If there is a big discrepancy between the data and the hypothesis we conclude that the hypothesis
was wrong.

Q.2: Explain the types of ANOVA. Describe possible situations in which each type should be used.

ANS: ANOVA: Analysis of variance (ANOVA) is an analysis tool used in statistics that splits an observed
aggregate variability found inside a data set into two parts: systematic factors and random factors. The systematic
factors have a statistical influence on the given data set, while the random factors do not. Analysts use the ANOVA
test to determine the influence that independent variables have on the dependent variable in a regression study.

Types of ANOVA:

There are two main types of ANOVA: one-way (or unidirectional) and two-way. "One-way" or "two-way" refers to
the number of independent variables in your analysis of variance test.

There are also variations of ANOVA. For example, MANOVA (multivariate ANOVA) differs from ANOVA in that the
former tests for multiple dependent variables simultaneously, while the latter assesses only one dependent
variable at a time.
A one-way ANOVA: evaluates the impact of a sole factor on a sole response variable. It determines whether all
the samples are the same. The one-way ANOVA is used to determine whether there are any statistically
significant differences between the means of three or more independent (unrelated) groups.

A two-way ANOVA: is an extension of the one-way ANOVA. With a one-way ANOVA, you have one independent
variable affecting a dependent variable. With a two-way ANOVA, there are two independent variables. For
example, a two-way ANOVA allows a company to compare worker productivity based on two independent
variables, such as salary and skill set. It is used to observe the interaction between the two factors and tests the
effect of the two factors at the same time.

When to use a one-way ANOVA


Use a one-way ANOVA when you have collected data about one categorical independent variable and
one quantitative dependent variable. The independent variable should have at least three levels (i.e. at least three
different groups or categories).
ANOVA tells you if the dependent variable changes according to the level of the independent variable. For
example:

• Your independent variable is social media use, and you assign groups to low, medium, and high levels of
social media use to find out if there is a difference in hours of sleep per night.
• Your independent variable is brand of soda, and you collect data on Coke, Pepsi, Sprite, and Fanta to find
out if there is a difference in the price per 100ml.
• Your independent variable is type of fertilizer, and you treat crop fields with mixtures 1, 2, and 3 to find out if
there is a difference in crop yield.
The null hypothesis (H0) of ANOVA is that there is no difference among group means. The alternate hypothesis
(Ha) is that at least one group differs significantly from the overall mean of the dependent variable.
If you only want to compare two groups, use a t-test instead.
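As a rough illustration of the first example above, a one-way ANOVA can be run with SciPy's f_oneway; the
hours-of-sleep values below are invented for the sketch:

```python
# One-way ANOVA: do mean hours of sleep differ across three levels of
# social media use? The data values are invented for illustration.
from scipy import stats

low    = [8.1, 7.9, 8.4, 7.6, 8.0]   # hours of sleep per night
medium = [7.4, 7.8, 7.1, 7.6, 7.3]
high   = [6.5, 6.9, 6.2, 7.0, 6.6]

f_stat, p_value = stats.f_oneway(low, medium, high)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. < 0.05) suggests at least one group mean differs.
```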

When to use a two-way ANOVA


You can use a two-way ANOVA when you have collected data on a quantitative dependent variable at multiple
levels of two categorical independent variables.
A quantitative variable represents amounts or counts of things. It can be divided to find a group mean.
Bushels per acre is a quantitative variable because it represents the amount of crop produced. It can be divided to
find the average bushels per acre.
A categorical variable represents types or categories of things. A level is an individual category within the
categorical variable.
Fertilizer types 1, 2, and 3 are levels within the categorical variable fertilizer type. Planting densities 1 and 2 are
levels within the categorical variable planting density.
You should have enough observations in your data set to be able to find the mean of the quantitative dependent
variable at each combination of levels of the independent variables.
Both of your independent variables should be categorical. If one of your independent variables is categorical and
one is quantitative, use an ANCOVA instead.
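The fertilizer-and-planting-density scenario above could be analyzed roughly as follows. This is a sketch assuming
the pandas and statsmodels libraries; the yield numbers are invented, and the column is named yield_ because
yield is a reserved word in Python:

```python
# Two-way ANOVA sketch: crop yield vs. fertilizer type and planting
# density, including their interaction. All data values are invented.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "fertilizer": ["1", "1", "2", "2", "3", "3"] * 2,
    "density":    ["1"] * 6 + ["2"] * 6,
    "yield_":     [44, 46, 48, 50, 52, 55, 47, 49, 51, 53, 56, 58],
})

# C(...) marks each variable as categorical; '*' adds both main effects
# and their interaction term.
model = ols("yield_ ~ C(fertilizer) * C(density)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # F and p for each factor + interaction
```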

Q.3: What is the range of the correlation coefficient? Explain strong, moderate, and weak relationships.
ANS: Range of the correlation coefficient:
Correlation coefficients are indicators of the strength of the linear relationship between two different variables, x
and y. A linear correlation coefficient that is greater than zero indicates a positive relationship. A value that is less
than zero signifies a negative relationship. Finally, a value of zero indicates no relationship between the two
variables x and y.
Strong, Moderate And Weak Relationship:
The correlation coefficient (ρ) is a measure that determines the degree to which the movement of two different
variables is associated. The most common correlation coefficient, generated by the Pearson product-moment
correlation, is used to measure the linear relationship between two variables. However, in a non-linear relationship,
this correlation coefficient may not always be a suitable measure of dependence. The possible range of values for
the correlation coefficient is -1.0 to 1.0. In other words, the values cannot exceed 1.0 or be less than -1.0.
A correlation of -1.0 indicates a perfect negative correlation, and a correlation of 1.0 indicates a perfect positive
correlation. If the correlation coefficient is greater than zero, it is a positive relationship. Conversely, if the value is
less than zero, it is a negative relationship. A value of zero indicates that there is no relationship between the two
variables.

For example, suppose that the prices of coffee and computers are observed and found to have a correlation of
+.0008. This means that there is no correlation, or relationship, between the two variables.

Positive Correlation
A positive correlation—when the correlation coefficient is greater than 0—signifies that both variables move in the
same direction. When ρ is +1, it signifies that the two variables being compared have a perfect positive
relationship; when one variable moves higher or lower, the other variable moves in the same direction with the
same magnitude.

The closer the value of ρ is to +1, the stronger the linear relationship. For example, suppose the value of oil
prices is directly related to the prices of airplane tickets, with a correlation coefficient of +0.95. The relationship
between oil prices and airfares has a very strong positive correlation since the value is close to +1. So, if the price
of oil decreases, airfares also decrease, and if the price of oil increases, so do the prices of airplane tickets.

As an illustration, compare one of the largest U.S. banks, JPMorgan Chase & Co. (JPM), with the Financial
Select Sector SPDR Fund (XLF), an exchange-traded fund (ETF) tracking the financial sector. As you can
imagine, JPMorgan Chase & Co. should have a positive correlation to the banking industry as a whole. The
correlation coefficient between the two was about 0.98 at the time of writing, signaling a strong positive
correlation. A reading above 0.50 typically signals a strong positive correlation.

Negative Correlation
A negative (inverse) correlation occurs when the correlation coefficient is less than 0. This is an indication that
both variables move in opposite directions. In short, any reading between 0 and -1 means that the two securities
move in opposite directions. When ρ is -1, the relationship is said to be perfectly negatively correlated. In short, if
one variable increases, the other variable decreases with the same magnitude (and vice versa). However, the
degree to which two securities are negatively correlated might vary over time (and they are almost never exactly
correlated all the time). 

Linear Correlation Coefficient


The linear correlation coefficient is a number calculated from given data that measures the strength of the linear
relationship between two variables, x and y. The sign of the linear correlation coefficient indicates the direction of
the linear relationship between x and y. When r (the correlation coefficient) is near 1 or −1, the linear relationship is
strong; when it is near 0, the linear relationship is weak.

Even for small datasets, the computations for the linear correlation coefficient can be too long to do manually.
Thus, data are often plugged into a calculator or, more likely, a computer or statistics program to find the
coefficient.
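A minimal sketch of that computation with SciPy's pearsonr (the x and y values below are invented for
illustration):

```python
# Computing the linear (Pearson) correlation coefficient r with SciPy.
# The data below are invented for illustration.
from scipy import stats

x = [2.1, 2.5, 3.6, 4.0, 4.4, 5.1]
y = [2.8, 3.1, 3.9, 4.6, 4.9, 5.5]

r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.4f}")
# r near +1 or -1 indicates a strong linear relationship; near 0, a weak one.
```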
Q.4: Explain the chi-square test of independence. In what situations should it be applied?
ANS: Chi-Square Test of Independence:

The Chi-square test of independence is a statistical hypothesis test used to determine whether two categorical or
nominal variables are likely to be related or not.

The Chi-Square Test of Independence determines whether there is an association between categorical variables
(i.e., whether the variables are independent or related). It is a nonparametric test.

This test is also known as:

• Chi-Square Test of Association.

This test utilizes a contingency table to analyze the data. A contingency table (also known as a cross-tabulation,
crosstab, or two-way table) is an arrangement in which data is classified according to two categorical variables.
The categories for one variable appear in the rows, and the categories for the other variable appear in the
columns. Each variable must have two or more categories. Each cell reflects the total count of cases for a
specific pair of categories.
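In practice such a table is usually built from raw records. Here is a brief sketch with pandas.crosstab, using
hypothetical records along the lines of the movie-theater example discussed later in this answer:

```python
# Building a contingency table (cross-tabulation) from two categorical
# variables. The records below are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "genre":  ["action", "comedy", "action", "drama", "comedy", "drama"],
    "snacks": ["yes", "no", "yes", "no", "yes", "no"],
})

table = pd.crosstab(df["genre"], df["snacks"])
print(table)  # rows: genres; columns: snack purchase; cells: counts
```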
In what situations should it be applied:

The Chi-Square Test of Independence is commonly used to test the following:


• Statistical independence or association between two or more categorical variables.
The Chi-Square Test of Independence can only compare categorical variables. It cannot make comparisons
between continuous variables or between categorical and continuous variables. Additionally, the Chi-Square Test
of Independence only assesses associations between categorical variables, and can not provide any inferences
about causation.
If your categorical variables represent "pre-test" and "post-test" observations, then the chi-square test of
independence is not appropriate. This is because the assumption of the independence of observations is
violated. In this situation, McNemar's Test is appropriate.

Conditions for Using the Chi-Square Test


Exercise caution when there are small expected counts. Minitab will give a count of the number of cells that have
expected frequencies less than five. Some statisticians hesitate to use the chi-square test if more than 20% of the
cells have expected frequencies below five, especially if the p-value is small and these cells give a large
contribution to the total chi-square value.
For the Chi-square test of independence, we need two variables. Our idea is that the variables are not related.
Here are a couple of examples:

• We have a list of movie genres; this is our first variable. Our second variable is whether or not the patrons
of those genres bought snacks at the theater. Our idea (or, in statistical terms, our null hypothesis) is that the type
of movie and whether or not people bought snacks are unrelated. The owner of the movie theater wants to
estimate how many snacks to buy. If movie type and snack purchases are unrelated, estimating will be simpler
than if the movie types impact snack sales.
• A veterinary clinic has a list of dog breeds they see as patients. The second variable is whether owners
feed dry food, canned food, or a mixture. Our idea is that the dog breed and types of food are unrelated. If this is
true, then the clinic can order food based only on the total number of dogs, without consideration for the breeds.
For a valid test, we need:

• Data values that are a simple random sample from the population of interest.
• Two categorical or nominal variables. Don't use the independence test with continuous variables that define
the category combinations. However, the counts for the combinations of the two categorical variables will be
continuous.
• For each combination of the levels of the two variables, we need at least five expected values. When we
have fewer than five for any one combination, the test results are not reliable.
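Putting these conditions together, the test itself can be sketched with SciPy's chi2_contingency. The
genre-by-snacks counts below are hypothetical, and the expected frequencies are printed so the
at-least-five condition can be checked:

```python
# Chi-square test of independence on a 3x2 table of observed counts
# (hypothetical movie-genre vs. snack-purchase data).
from scipy.stats import chi2_contingency

observed = [
    [50, 30],   # action: bought snacks / did not
    [45, 55],   # comedy
    [20, 40],   # drama
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
print("expected counts:")
print(expected)  # every cell should be >= 5 for a reliable test
```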

Q.5: Correlation is a prerequisite of regression analysis. Explain.

ANS: Correlation Analysis


Correlation analysis is applied in quantifying the association between two continuous variables, for example, a
dependent and an independent variable, or two independent variables.

Regression Analysis
Regression analysis refers to assessing the relationship between the outcome variable and one or more variables.
The outcome variable is known as the dependent or response variable, and the risk factors and confounders are
known as predictors or independent variables. The dependent variable is denoted by "y" and the independent
variables are denoted by "x" in regression analysis.
In correlation analysis, a sample correlation coefficient is estimated. It ranges between -1 and +1, is denoted by
r, and quantifies the strength and direction of the linear association between two variables. The correlation
between two variables can be either positive, i.e. a higher level of one variable is related to a higher level of the
other, or negative, i.e. a higher level of one variable is related to a lower level of the other.
The sign of the coefficient of correlation shows the direction of the association. The magnitude of the coefficient
shows the strength of the association.
For example, a correlation of r = 0.8 indicates a positive and strong association between two variables, while a
correlation of r = -0.3 shows a negative and weak association. A correlation near zero shows the absence of a
linear association between two continuous variables.

Linear Regression
Linear regression is a linear approach to modelling the relationship between a scalar response (the dependent
variable) and one or more independent variables. If the regression has one independent variable, then it is known
as simple linear regression. If it has more than one independent variable, then it is known as multiple linear
regression. Linear regression focuses on the conditional probability distribution of the response given the values
of the predictors, rather than on their joint probability distribution. In general, real-world regression models
involve multiple predictors, so the term linear regression often refers to multiple linear regression.
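The dependence of regression on correlation can be seen numerically: in simple linear regression, the fitted slope
equals r multiplied by the ratio of the standard deviations of y and x. A sketch with NumPy, using invented data:

```python
# Link between correlation and simple linear regression:
# slope = r * (std_y / std_x). The data values are invented.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 5.9, 6.4])

r = np.corrcoef(x, y)[0, 1]             # Pearson correlation coefficient
slope, intercept = np.polyfit(x, y, 1)  # least-squares line y = slope*x + intercept

print(f"r = {r:.4f}")
print(f"slope from polyfit  = {slope:.4f}")
print(f"r * (std_y / std_x) = {r * y.std() / x.std():.4f}")  # same value as the slope
```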

Correlation and Regression Differences


There are some differences between correlation and regression.

• Correlation quantifies the degree to which two variables are associated. It does not fit a line through the
data points. You compute a correlation coefficient (r) that tells you how much one variable tends to change
when the other one does. When r is 0.0, there is no linear relationship. When r is positive, one variable tends
to increase as the other increases. When r is negative, one variable tends to decrease as the other increases.

• Linear regression finds the best line that predicts y from x, but correlation does not fit a line.

• Correlation is used when you measure both variables, while linear regression is mostly applied when x is a
variable that is manipulated.

Comparison Between Correlation and Regression

Basis | Correlation | Regression
Meaning | A statistical measure that defines the co-relationship or association of two variables. | Describes how an independent variable is associated with the dependent variable.
Dependent and independent variables | No difference between the variables. | The two variables are different (dependent vs. independent).
Usage | To describe a linear relationship between two variables. | To fit the best line and estimate one variable based on another variable.
Objective | To find a value expressing the relationship between variables. | To estimate the values of a random variable based on the values of a fixed variable.

Correlation and Regression Statistics

The degree of association is measured by the correlation coefficient, denoted by r. It is sometimes called
Pearson's correlation coefficient after its originator and is a measure of linear association. Other, more
complicated measures are used if a curved line is needed to represent the relationship.
The coefficient of correlation is measured on a scale that varies from +1 through 0 to -1. Complete correlation
between two variables is represented by either +1 or -1. The correlation is positive when one variable increases
and so does the other; it is negative when one decreases as the other increases. The absence of correlation is
described by 0.

Regression analysis is a set of statistical methods used for the estimation of relationships between a dependent
variable and one or more independent variables. An independent variable is an input, assumption, or driver that is
changed in order to assess its impact on the dependent variable.
Regression Analysis – Linear model assumptions:

Linear regression analysis is based on six fundamental assumptions:

• The dependent and independent variables have a linear relationship.
• The independent variable is not random.
• The mean value of the residual (error) is zero.
• The variance of the residual (error) is constant across all observations.
• The value of the residual (error) is not correlated across observations.
• The residual (error) values follow the normal distribution.
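Several of these assumptions can be checked from the fitted residuals. A brief sketch assuming statsmodels and
SciPy, with simulated data:

```python
# Checking two residual assumptions of a linear model: zero mean and
# normality. The data are simulated for illustration.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=x.size)  # true line plus noise

model = sm.OLS(y, sm.add_constant(x)).fit()
resid = model.resid

print(f"mean residual = {resid.mean():.4f}")  # should be close to 0
w_stat, p = stats.shapiro(resid)              # Shapiro-Wilk normality test
print(f"Shapiro-Wilk p = {p:.4f}")            # large p: no evidence against normality
```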
