Stats Notes


Statistics

Statistics is a mathematical science pertaining to the collection, analysis, interpretation or
explanation, and presentation of data, as well as to prediction and forecasting based on data.
It is applicable to a wide variety of academic disciplines, from the natural and social
sciences to the humanities, government and business.

Types of Statistics
Descriptive Statistics are used to describe the data set

Examples: graphing, calculating averages, looking for extreme scores

Inferential Statistics allow you to infer something about the parameters of the
population based on the statistics of the sample, and on the various tests we perform on the
sample

Examples: Chi-Square, T-Tests, Correlations, ANOVA

Measures of Central Tendency


A way of summarising data using the value which is most typical. Three examples are the
Mean, Median and Mode.

The three most commonly-used measures of central tendency are the following.
Mean
The sum of the values divided by the number of values--often called the "average."

 Add all of the values together.


 Divide by the number of values to obtain the mean.

Example: The mean of 7, 12, 24, 20, 19 is (7 + 12 + 24 + 20 + 19) / 5 = 16.4.


Median
The value which divides the values into two equal halves, with half of the values being
lower than the median and half higher than the median.

 Sort the values into ascending order.


 If you have an odd number of values, the median is the middle value.
 If you have an even number of values, the median is the arithmetic mean (see above) of
the two middle values.
Example: The median of the same five numbers (7, 12, 24, 20, 19) is 19.
Mode
The most frequently-occurring value (or values).

 Calculate the frequencies for all of the values in the data.


 The mode is the value (or values) with the highest frequency.

Example: For individuals having the following ages -- 18, 18, 19, 20, 20, 20, 21, and 23, the
mode is 20.
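
These three measures can be computed directly with Python's standard library; a quick check of the examples above:

from statistics import mean, median, mode

values = [7, 12, 24, 20, 19]
print(mean(values))    # 16.4
print(median(values))  # 19 (middle value after sorting: 7, 12, 19, 20, 24)

ages = [18, 18, 19, 20, 20, 20, 21, 23]
print(mode(ages))      # 20 (the most frequently occurring value)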

Dispersion
Measures of central tendency (mean, mode and median) locate the distribution within the
range of possible values, while measures of dispersion describe the spread of values.

The dispersion of values within variables is especially important in social and political research
because:

 Dispersion or "variation" in observations is what we seek to explain.


 Researchers want to know WHY some cases lie above average and others below average
for a given variable:
o TURNOUT in voting: why do some states show higher rates than others?
o CRIMES in cities: why are there differences in crime rates?
o CIVIL STRIFE among countries: what accounts for differing amounts?
 Much of statistical explanation aims at explaining DIFFERENCES in observations -- also
known as
o VARIATION, or the more technical term, VARIANCE.

Range
The range is the simplest measure of dispersion. The range can be thought of in two ways.

As a quantity: the difference between the highest and lowest scores in a distribution.

As an interval: the lowest and highest scores taken together as the endpoints of the distribution.

Variance and Standard Deviation


By far the most commonly used measures of dispersion in the social sciences are
variance and standard deviation. Variance is the average squared difference of scores
from the mean score of a distribution. Standard deviation is the square root of the
variance.
In calculating the variance of data points, we square the difference between each point
and the mean because if we summed the differences directly, the result would always be
zero. For example, suppose three friends work on campus and earn $5.50, $7.50, and $8
per hour, respectively. The mean of these values is $(5.50 + 7.50 + 8)/3 = $7 per hour. If
we summed the differences of each wage from the mean, we would get (5.50 - 7) + (7.50 - 7)
+ (8 - 7) = -1.50 + .50 + 1 = 0. Instead, we square the terms to obtain a sum of squared
differences equal to 2.25 + .25 + 1 = 3.50; dividing by the number of scores gives a
variance of about 1.17. This figure is a measure of dispersion in the set of scores.
The mean is the value that minimizes this sum of squared differences. In other words, if we
used any number other than the mean as the value from which each score is subtracted, the
resulting sum of squared differences would be greater. (You can try it yourself -- see if any
number other than 7 can be plugged into the preceding calculation and yield a sum of
squared differences less than 3.50.)

The standard deviation is simply the square root of the variance. In some sense, taking the
square root of the variance "undoes" the squaring of the differences that we did when we
calculated the variance.
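
As a check, the wage example can be reproduced in Python; this sketch uses the population formulas (dividing by N) described above:

from math import sqrt

wages = [5.50, 7.50, 8.00]
n = len(wages)

avg = sum(wages) / n                             # 7.0
sum_sq = sum((x - avg) ** 2 for x in wages)      # 2.25 + 0.25 + 1.0 = 3.50
variance = sum_sq / n                            # population variance, about 1.17
std_dev = sqrt(variance)                         # about 1.08

print(avg, sum_sq, variance, std_dev)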

Variance and standard deviation of a population are designated by σ² and σ, respectively.
Variance and standard deviation of a sample are designated by s² and s, respectively.

              Variance                         Standard Deviation

Population    σ² = Σ(X − μ)² / N               σ = √[Σ(X − μ)² / N]

Sample        s² = Σ(x − x̄)² / (n − 1)         s = √[Σ(x − x̄)² / (n − 1)]

In these equations, μ is the population mean, x̄ is the sample mean, N is the total
number of scores in the population, and n is the number of scores in the sample.

Coefficient of Variation (C.V.)

This is the ratio of the standard deviation to the mean:

C.V. = (σ / x̄) × 100

The coefficient of variation describes the magnitude of the sample values and the variation
within them. To compare the variation (dispersion) of two different series, such a relative
measure of the standard deviation must be calculated; it is also known as the coefficient of s.d.

Remark: It is given as a percentage and is used to compare the consistency or variability of two or
more series. The higher the C.V., the higher the variability; the lower the C.V., the higher the
consistency of the data.
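
For example, the consistency of two series can be compared by computing each C.V.; a small sketch with made-up series (illustrative values only):

from statistics import pstdev, mean

series_a = [48, 50, 52, 49, 51]     # hypothetical data
series_b = [30, 70, 45, 60, 45]     # hypothetical data

def cv(data):
    return pstdev(data) / mean(data) * 100   # C.V. as a percentage

print(round(cv(series_a), 1))   # smaller C.V. -> more consistent series
print(round(cv(series_b), 1))   # larger C.V. -> more variable series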

What are variables? Variables are things that we measure, control, or manipulate in research. They
differ in many respects, most notably in the role they play in our research and in the type of
measures that can be applied to them.

Correlational vs. experimental research. Most empirical research belongs clearly to one of
those two general categories. In correlational research we do not (or at least try not to) influence
any variables but only measure them and look for relations (correlations) between some set of
variables, such as blood pressure and cholesterol level. In experimental research, we manipulate
some variables and then measure the effects of this manipulation on other variables; for example,
a researcher might artificially increase blood pressure and then record cholesterol level. Data
analysis in experimental research also comes down to calculating "correlations" between
variables, specifically, those manipulated and those affected by the manipulation. However,
experimental data may potentially provide qualitatively better information: Only experimental
data can conclusively demonstrate causal relations between variables. For example, if we found
that whenever we change variable A then variable B changes, then we can conclude that "A
influences B." Data from correlational research can only be "interpreted" in causal terms based
on some theories that we have, but correlational data cannot conclusively prove causality.

Dependent vs. independent variables. Independent variables are those that are manipulated
whereas dependent variables are only measured or registered. This distinction appears
terminologically confusing to many because, as some students say, "all variables depend on
something." However, once you get used to this distinction, it becomes indispensable. The terms
dependent and independent variable apply mostly to experimental research where some variables
are manipulated, and in this sense they are "independent" from the initial reaction patterns,
features, intentions, etc. of the subjects. Some other variables are expected to be "dependent" on
the manipulation or experimental conditions. That is to say, they depend on "what the subject
will do" in response. Somewhat contrary to the nature of this distinction, these terms are also
used in studies where we do not literally manipulate independent variables, but only assign
subjects to "experimental groups" based on some pre-existing properties of the subjects. For
example, if in an experiment, males are compared with females regarding their white cell count
(WCC), Gender could be called the independent variable and WCC the dependent variable.

Measurement scales. Variables differ in "how well" they can be measured, i.e., in how much
measurable information their measurement scale can provide. There is obviously some
measurement error involved in every measurement, which determines the "amount of
information" that we can obtain. Another factor that determines the amount of information that
can be provided by a variable is its "type of measurement scale." Specifically, variables are
classified as (a) nominal, (b) ordinal, (c) interval or (d) ratio.

a. Nominal variables allow for only qualitative classification. That is, they can be measured
only in terms of whether the individual items belong to some distinctively different
categories, but we cannot quantify or even rank order those categories. For example, all
we can say is that 2 individuals are different in terms of variable A (e.g., they are of
different race), but we cannot say which one "has more" of the quality represented by the
variable. Typical examples of nominal variables are gender, race, color, city, etc.
b. Ordinal variables allow us to rank order the items we measure in terms of which has less
and which has more of the quality represented by the variable, but still they do not allow
us to say "how much more." A typical example of an ordinal variable is the
socioeconomic status of families. For example, we know that upper-middle is higher than
middle but we cannot say that it is, for example, 18% higher. Also this very distinction
between nominal, ordinal, and interval scales itself represents a good example of an
ordinal variable. For example, we can say that nominal measurement provides less
information than ordinal measurement, but we cannot say "how much less" or how this
difference compares to the difference between ordinal and interval scales.
c. Interval variables allow us not only to rank order the items that are measured, but also to
quantify and compare the sizes of differences between them. For example, temperature,
as measured in degrees Fahrenheit or Celsius, constitutes an interval scale. We can say
that a temperature of 40 degrees is higher than a temperature of 30 degrees, and that an
increase from 20 to 40 degrees is twice as much as an increase from 30 to 40 degrees.
d. Ratio variables are very similar to interval variables; in addition to all the properties of
interval variables, they feature an identifiable absolute zero point, thus they allow for
statements such as x is two times more than y. Typical examples of ratio scales are
measures of time or space. For example, as the Kelvin temperature scale is a ratio scale,
not only can we say that a temperature of 200 degrees is higher than one of 100 degrees,
we can correctly state that it is twice as high. Interval scales do not have the ratio
property. Most statistical data analysis procedures do not distinguish between the interval
and ratio properties of the measurement scales.
Correlations
The correlation is one of the most common and most useful statistics. A correlation is a single
number that describes the degree of relationship between two variables.

Purpose (What is Correlation?) Correlation is a measure of the relation between two or more
variables. The measurement scales used should be at least interval scales, but other correlation
coefficients are available to handle other types of data. Correlation coefficients can range from
-1.00 to +1.00. The value of -1.00 represents a perfect negative correlation while a value of
+1.00 represents a perfect positive correlation. A value of 0.00 represents a lack of correlation.

How to Interpret the Values of Correlations. As mentioned before, the correlation


coefficient (r) represents the linear relationship between two variables. If the correlation
coefficient is squared, then the resulting value (r2, the coefficient of determination) will represent
the proportion of common variation in the two variables (i.e., the "strength" or "magnitude" of
the relationship). In order to evaluate the correlation between variables, it is important to know
this "magnitude" or "strength" as well as the significance of the correlation.

In statistics, Spearman's rank correlation coefficient named after Charles Spearman and often
denoted by the Greek letter ρ (rho) or as rs, is a non-parametric measure of correlation – that is, it
assesses how well an arbitrary monotonic function could describe the relationship between two
variables, without making any assumptions about the frequency distribution of the variables.
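
Both coefficients can be computed with SciPy; a minimal sketch (the data values are made up for illustration):

from scipy import stats

x = [1, 2, 3, 4, 5, 6]          # hypothetical variable
y = [2, 1, 4, 3, 7, 8]          # hypothetical variable

r, p_r = stats.pearsonr(x, y)       # linear (Pearson) correlation and its P-value
rho, p_rho = stats.spearmanr(x, y)  # rank-based (Spearman) correlation and its P-value

print(r, r ** 2)    # r and the coefficient of determination r^2
print(rho)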

Linear regression is a form of regression analysis in which the relationship between
one or more independent variables and another variable, called the dependent variable, is modeled
by a least squares function, called the linear regression equation. This function is a linear
combination of one or more model parameters, called regression coefficients. A linear regression
equation with one independent variable represents a straight line. The results are subject to
statistical analysis.
From algebra, any straight line can be described as:
Y = a + bX, where a is the intercept and b is the slope

Linear Regression

Correlation gives us an idea of the magnitude and direction of the relationship between correlated
variables. It is therefore natural to look for a method that helps us estimate the value of one
variable when the other is known. Note also that correlation does not imply causation. The fact that
the variables x and y are correlated does not necessarily mean that x causes y or vice versa. For
example, you would find that the number of schools in a town is correlated with the number of
accidents in the town. The schools are not the reason for these accidents; both simply increase
with what is known as the population. A statistical procedure called regression is concerned
with causation in a relationship among variables. It assesses the contribution of one or more
causing variables (independent variables) to the variable that is being caused (the dependent
variable). When there is only one independent variable, the relationship is expressed by a
straight line. This procedure is called simple linear regression.

Regression can be defined as a method that estimates the value of one variable when that of
other variable is known, provided the variables are correlated. The dictionary meaning of
regression is "to go backward." It was used for the first time by Sir Francis Galton in his research
paper "Regression towards mediocrity in hereditary stature."

Lines of Regression: In a scatter plot, we have seen that if the variables are highly correlated,
then the points (dots) lie in a narrow strip. If the strip is nearly straight, we can draw a straight
line such that all points are close to it from both sides. Such a line can be taken as an ideal
representation of the variation. This line is called the line of best fit if it minimizes the distances
of all data points from it.

This line is called the line of regression. Prediction then becomes easy, because all we need to do
is extend the line and read off the value. Thus, to obtain a line of regression, we need to have a line
of best fit. But statisticians do not measure the distances by dropping perpendiculars from points
onto the line. They measure deviations (or errors, or residuals, as they are called) (i) vertically
and (ii) horizontally. Thus we get two lines of regression, as shown in figures (1) and (2).

(1) Line of regression of y on x

Its form is y = a + b x

It is used to estimate y when x is given

(2) Line of regression of x on y

Its form is x = a + b y

It is used to estimate x when y is given.

They are obtained (i) graphically, by the scatter plot, or (ii) mathematically, by the method of
least squares.

Regression can be used for prediction (including forecasting of time-series data), inference,
hypothesis testing, and modeling of causal relationships.
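
A least-squares line of regression of y on x can be fitted in a few lines with NumPy's polyfit; a sketch with made-up data:

import numpy as np

x = np.array([1, 2, 3, 4, 5])               # independent variable (hypothetical)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])     # dependent variable (hypothetical)

b, a = np.polyfit(x, y, 1)   # degree-1 fit returns slope b and intercept a of y = a + b x
print(a, b)

x_new = 6
print(a + b * x_new)         # predicted y for a given x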

What is the difference between correlation and linear regression?   

Correlation and linear regression are not the same. Consider these differences:

 Correlation quantifies the degree to which two variables are related. Correlation does not
find a best-fit line (that is regression). You simply are computing a correlation coefficient
(r) that tells you how much one variable tends to change when the other one does.
 With correlation you don't have to think about cause and effect. You simply quantify how
well two variables relate to each other. With regression, you do have to think about cause
and effect as the regression line is determined as the best way to predict Y from X.
 With correlation, it doesn't matter which of the two variables you call "X" and which you
call "Y". You'll get the same correlation coefficient if you swap the two. With linear
regression, the decision of which variable you call "X" and which you call "Y" matters a
lot, as you'll get a different best-fit line if you swap the two. The line that best predicts Y
from X is not the same as the line that predicts X from Y.
 Correlation is almost always used when you measure both variables. It rarely is
appropriate when one variable is something you experimentally manipulate. With linear
regression, the X variable is often something you experimentally manipulate (time,
concentration...) and the Y variable is something you measure.

Probable Error

The probable error is used to help interpret Karl Pearson’s coefficient of correlation ‘ r ’. It
indicates how far ‘ r ’ can be relied on, but note that ‘ r ’ depends on the random sampling and its
conditions. It is given by

P. E. = 0.6745 × (1 − r²) / √n

i. If the value of r is less than the P. E., then there is no evidence of correlation, i.e. r is not
significant.

ii. If r is more than 6 times the P. E., then ‘ r ’ is practically certain, i.e. significant.

iii. By adding and subtracting the P. E. to and from ‘ r ’, we get the upper and lower limits within
which ‘ r ’ of the population can be expected to lie.

Symbolically, ρ = r ± P. E.

where ρ = correlation coefficient of the population.
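
A hedged sketch of these rules in Python, assuming r has already been computed from n pairs of observations (the values below are hypothetical):

from math import sqrt

r = 0.8      # hypothetical correlation coefficient
n = 64       # hypothetical number of pairs of observations

pe = 0.6745 * (1 - r ** 2) / sqrt(n)   # probable error of r: 0.6745 * 0.36 / 8, about 0.03
print(pe)

print(r - pe, r + pe)   # limits within which the population correlation is expected to lie
print(r > 6 * pe)       # True -> r is practically certain (significant)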

Probability basics
Sample space

The totality of all the outcomes or results of a random experiment is denoted by the Greek letter
Ω (or by an English letter such as S) and is called the sample space. Each outcome or element of
this sample space is known as a sample point.
Event

Any subset of a sample space is called an event. A sample space S serves as the universal set for
all questions related to an experiment, and an event A with respect to it is the set of all possible
outcomes favorable to the event A.

For example,

A random experiment :- flipping a coin twice

Sample space :- Ω or S = {(HH), (HT), (TH), (TT)}

The question : "both the flips show the same face"

Therefore, the event A : { (HH), (TT) }

Equally Likely Events

The possible results of a random experiment are called equally likely outcomes when we have no
reason to expect any one rather than another. For example, as the result of drawing a card from
a well shuffled pack, any card may appear in the draw, so that the 52 cards give rise to 52 different
outcomes which are equally likely.

Mutually Exclusive Events

Events are called mutually exclusive or disjoint or incompatible if the occurrence of one of them
precludes the occurrence of all the others. For example, in tossing a coin there are two mutually
exclusive events, viz. turning up of a head and turning up of a tail, since both these events cannot
happen simultaneously. But note that events are compatible if it is possible for them to happen
simultaneously. For instance, in rolling two dice, the cases of the face marked 5 appearing on
one die and the face 5 appearing on the other are compatible.

Exhaustive Events

Events are exhaustive when they include all the possibilities associated with the same trial. In
throwing a coin, the turning up of head and of a tail are exhaustive events assuming of course
that the coin cannot rest on its edge.

Independent Events

Two events are said to be independent if the occurrence of any event does not affect the
occurrence of the other event. For example in tossing of a coin, the events corresponding to the
two successive tosses of it are independent. The flip of one penny does not affect in any way the
flip of a nickel.

Dependent Events
If the occurrence or non-occurrence of one event affects the happening of the other, then the
events are said to be dependent events. For example, in drawing cards from a pack, let
the event A be the occurrence of a king in the first draw and B be the occurrence of a king in the
second draw. If the card drawn at the first trial is not replaced, then events A and B are
dependent events.

Note

(1) If an event contains a single sample point, i.e. it is a singleton set, then this event is
called an elementary or a simple event.

(2) An event corresponding to the empty set is an "impossible event."

(3) An event corresponding to the entire sample space is called a ‘certain event’.

Complementary Events

Let S be the sample space for an experiment and A be an event in S. Then A is a subset of S.
Hence, the complement of A in S, written A′, is also an event in S; it contains the outcomes which
are not favorable to the occurrence of A. That is, if A occurs, then the outcome of the experiment
belongs to A, but if A does not occur, then the outcome of the experiment belongs to A′.

It is obvious that A and A′ are mutually exclusive: A ∩ A′ = ∅ and A ∪ A′ = S.

If S contains n equally likely, mutually exclusive and exhaustive sample points and A contains m
out of these n points, then A′ contains (n − m) sample points.

Definitions of Probability

Mathematical (or A Priori or Classic) Definition

If there are ‘n’ exhaustive, mutually exclusive and equally likely cases and m of them are
favorable to an event A, the probability of A happening is defined as the ratio m/n.

Expressed as a formula :-

P(A) = m / n

This definition is due to ‘Laplace.’ Thus probability is a concept which measures numerically the
degree of certainty or uncertainty of the occurrence of an event.
For example, the probability of drawing a king from a well-shuffled deck of cards is
4/52, since 4 is the number of favorable outcomes (the kings of diamonds, spades, clubs and
hearts) and 52 is the number of total outcomes (the number of cards in a deck).

If A is any event of the sample space having probability p, then clearly p is a positive number
(expressed as a fraction or usually as a decimal) not greater than unity: 0 ≤ p ≤ 1, i.e. from 0 (no
chance, for an impossible event) to a high of 1 (certainty). Since the number of cases not
favorable to A is (n − m), the probability q that event A will not happen is q = (n − m)/n, or
q = 1 − m/n, or q = 1 − p.

Now note that the probability q is nothing but the probability of the complementary event A′, i.e.

p(A′) = 1 − p, or p(A′) = 1 − p(A),

so that p(A) + p(A′) = 1, i.e. p + q = 1.

Addition theorem

In general, if the letters A and B stand for any two events, then

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

This general form applies even when A and B are not mutually exclusive; if A and B are mutually
exclusive, then P(A ∩ B) = 0 and the rule reduces to P(A ∪ B) = P(A) + P(B).

Multiplication Law of Probability

If there are two independent events whose respective probabilities are known, then the
probability that both will happen is the product of their respective probabilities of happening:
P(A ∩ B) = P(A) × P(B)
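
Both the addition theorem and the multiplication law can be verified by enumerating a small sample space; a sketch using two tosses of a fair coin, with A = head on the first toss and B = head on the second toss:

from itertools import product

sample_space = list(product("HT", repeat=2))     # {HH, HT, TH, TT}
n = len(sample_space)

p_A = len([s for s in sample_space if s[0] == "H"]) / n                     # 0.5
p_B = len([s for s in sample_space if s[1] == "H"]) / n                     # 0.5
p_A_and_B = len([s for s in sample_space if s[0] == "H" and s[1] == "H"]) / n
p_A_or_B = len([s for s in sample_space if s[0] == "H" or s[1] == "H"]) / n

print(p_A_and_B == p_A * p_B)             # True: multiplication law (independent events)
print(p_A_or_B == p_A + p_B - p_A_and_B)  # True: addition theorem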

Two tailed and one tailed test.

The two-tailed test is a statistical test used in inference, in which a given statistical hypothesis,
H0 (the null hypothesis) will be rejected when the value of the test statistic is either sufficiently
small or sufficiently large. This contrasts with a one-tailed test, in which only one of the
rejection regions "sufficiently small" or "sufficiently large" is preselected according to the
alternative hypothesis being selected, and the hypothesis is rejected only if the test statistic
satisfies that criterion. Alternative names are one-sided and two-sided tests.

The test is named after the "tail" of data under the far left and far right of a bell-shaped normal
data distribution, or bell curve. However, the terminology is extended to tests relating to
distributions other than normal. In general a test is called two-tailed if the null hypothesis is
rejected for values of the test statistic falling into either tail of its sampling distribution, and it is
called one-sided or one-tailed if the null hypothesis is rejected only for values of the test statistic
falling into one specified tail of its sampling distribution.[1] For example, if the alternative
hypothesis is μ ≠ 42.5 and the null hypothesis of μ = 42.5 is rejected for small or for large values
of the sample mean, the test is called "two-tailed" or "two-sided". If the alternative hypothesis is
μ > 1.4 and the null hypothesis of μ = 1.4 is rejected only for large values of the sample mean, it is
then called "one-tailed" or "one-sided".

A General Procedure for Conducting Hypothesis Tests


All hypothesis tests are conducted the same way. The researcher states a hypothesis to be tested,
formulates an analysis plan, analyzes sample data according to the plan, and accepts or rejects
the null hypothesis, based on results of the analysis.

 State the hypotheses. Every hypothesis test requires the analyst to state a null hypothesis
and an alternative hypothesis. The hypotheses are stated in such a way that they are
mutually exclusive. That is, if one is true, the other must be false; and vice versa.

 Formulate an analysis plan. The analysis plan describes how to use sample data to accept
or reject the null hypothesis. It should specify the following elements.

o Significance level. Often, researchers choose significance levels equal to 0.01,


0.05, or 0.10; but any value between 0 and 1 can be used.

o Test method. Typically, the test method involves a test statistic and a sampling
distribution. Computed from sample data, the test statistic might be a mean score,
proportion, difference between means, difference between proportions, z-score, t-
score, chi-square, etc. Given a test statistic and its sampling distribution, a
researcher can assess probabilities associated with the test statistic. If the test
statistic probability is less than the significance level, the null hypothesis is
rejected.

 Analyze sample data. Using sample data, perform computations called for in the analysis
plan.
o Test statistic. When the null hypothesis involves a mean or proportion, use either
of the following equations to compute the test statistic.

Test statistic = (Statistic - Parameter) / (Standard deviation of statistic)


Test statistic = (Statistic - Parameter) / (Standard error of statistic)

where Parameter is the value appearing in the null hypothesis, and Statistic is the
point estimate of Parameter. As part of the analysis, you may need to compute the
standard deviation or standard error of the statistic. Previously, we presented
common formulas for the standard deviation and standard error.

When the parameter in the null hypothesis involves categorical data, you may use
a chi-square statistic as the test statistic. Instructions for computing a chi-square
test statistic are presented in the lesson on the chi-square goodness of fit test.

o P-value. The P-value is the probability of observing a sample statistic as extreme
as the test statistic, assuming the null hypothesis is true.

 Interpret the results. If the sample findings are unlikely, given the null hypothesis, the
researcher rejects the null hypothesis. Typically, this involves comparing the P-value to
the significance level, and rejecting the null hypothesis when the P-value is less than the
significance level.
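
As a concrete illustration of these steps, the sketch below tests a hypothesized mean using the (Statistic - Parameter) / (Standard error) formula above; the data and the hypothesized mean are made up, and for such a small sample a t distribution would usually replace the normal approximation used here:

from math import sqrt
from statistics import mean, stdev
from scipy.stats import norm

sample = [51.2, 49.8, 50.6, 52.1, 48.9, 50.4, 51.7, 49.5]   # hypothetical observations
mu_0 = 50.0        # Parameter: the value stated in the null hypothesis
alpha = 0.05       # chosen significance level

x_bar = mean(sample)                        # Statistic: the point estimate of the parameter
se = stdev(sample) / sqrt(len(sample))      # standard error of the statistic
z = (x_bar - mu_0) / se                     # Test statistic = (Statistic - Parameter) / (Standard error)

p_value = 2 * (1 - norm.cdf(abs(z)))        # two-tailed P-value
print("reject H0" if p_value < alpha else "fail to reject H0")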

Why Use Probability Sampling?

If school administrators wished to conduct a survey assessing the popularity of pizza on the
cafeteria menu, they could stop students on the way to the library and ask them the survey
questions. Although this non-probability sampling type is a convenient way to conduct a survey,
it’s not as accurate or rigorous as some probability sampling modalities.

In any field of scholarly research, researchers must set up a process that assures that the different
members of a population have an equal chance of selection. This allows researchers to draw
some general conclusions beyond those people included in the study. Another reason for
probability sampling is the need to eliminate any possible researcher bias. Returning to the pizza
survey example, the survey administrator might not be inclined to stop the troublemaker who
threw water balloons in the cafeteria last week.

Simple Random Sampling

Simple random sampling is akin to pulling a number out of a hat. However, in a large population,
it can be time-consuming to write down 3000 names on slips of paper to draw from a hat. An
easier way to draw the sample for the pizza survey is to utilize a random number table to choose
students. Administrators could use the last two digits of the students’ social security number to
identify the table column and the first two digits to identify the row. However, just using luck of
the draw may not provide the administrators with a good representation of subgroups in the
student population.

Stratified Random Sampling

This sampling method involves dividing the population into subgroups based on variables known
about those subgroups, and then taking a simple random sample of each subgroup. This would
assure the administrator that he was accurately representing not only the overall population, but
also key subgroups, such as students with low attendance or minority groups. This method can
be tricky for the uninitiated, as the researcher must decide what weights to assign to each
stratification variable.

Cluster Sampling

This stepwise process is useful for those who know little about the population they’re studying.
First, the researcher would divide the population into clusters (usually geographic boundaries).
Then, the researcher randomly samples the clusters. Finally, the researcher must measure all
units within the sampled clusters. Researchers use this method when economy of administration
is important. Because a school population is confined to a three-block area, the school
administrators wouldn’t need to get so elaborate.

Multistage Sampling

This is the most complex sampling strategy. The researcher combines simpler sampling methods
to address sampling needs in the most effective way possible. For example, the administrator
might begin with a cluster sample of all schools in the district. Then he might set up a stratified
sampling process within clusters. Within schools, the administrator could conduct a simple
random sample of classes or grades. By combining various methods, researchers achieve a rich
variety of results useful in different contexts.

NON-PROBABILITY SAMPLING
Non-probability sampling is a sampling technique where the samples are gathered in a process that does
not give all the individuals in the population equal chances of being selected.

by Joan Joseph Castillo (2009)

In any form of research, true random sampling is always difficult to achieve.

Most researchers are bound by time, money and workforce, and because of these limitations it
is almost impossible to randomly sample the entire population; it is often necessary to
employ another sampling technique, the non-probability sampling technique.

In contrast with probability sampling, a non-probability sample is not a product of a randomized
selection process. Subjects in a non-probability sample are usually selected on the basis of
their accessibility or by the purposive personal judgment of the researcher.

The downside of this is that an unknown proportion of the entire population was not sampled.
This entails that the sample may or may not represent the entire population accurately. Therefore,
the results of the research cannot be used in generalizations pertaining to the entire population.

TYPES OF NON-PROBABILITY SAMPLING


CONVENIENCE SAMPLING
Convenience sampling is probably the most common of all sampling techniques. With convenience
sampling, the samples are selected because they are accessible to the researcher. Subjects are chosen
simply because they are easy to recruit. This technique is considered easiest, cheapest and least time
consuming.

CONSECUTIVE SAMPLING
Consecutive sampling is very similar to convenience sampling except that it seeks to include ALL
accessible subjects as part of the sample. This non-probability sampling technique can be considered
the best of all non-probability samples because it includes all subjects that are available, which
makes the sample a better representation of the entire population.

QUOTA SAMPLING
Quota sampling is a non-probability sampling technique wherein the researcher ensures equal or
proportionate representation of subjects depending on which trait is considered as basis of the quota.

For example, if basis of the quota is college year level and the researcher needs equal
representation, with a sample size of 100, he must select 25 1st year students, another 25 2nd year
students, 25 3rd year and 25 4th year students. The bases of the quota are usually age, gender,
education, race, religion and socioeconomic status.

JUDGMENTAL SAMPLING
Judgmental sampling is more commonly known as purposive sampling. In this type of sampling, subjects
are chosen to be part of the sample with a specific purpose in mind. With judgmental sampling, the
researcher believes that some subjects are more fit for the research compared to other individuals. This
is the reason why they are purposively chosen as subjects.

SNOWBALL SAMPLING
Snowball sampling is usually done when there is a very small population size. In this type of sampling,
the researcher asks the initial subject to identify another potential subject who also meets the criteria of
the research. The downside of using a snowball sample is that it is hardly representative of the
population.

WHEN TO USE NON-PROBABILITY SAMPLING


 This type of sampling can be used when demonstrating that a particular trait exists in the
population.
 It can also be used when the researcher aims to do a qualitative, pilot or exploratory study.
 It can be used when randomization is impossible like when the population is almost limitless.
 It can be used when the research does not aim to generate results that will be used to create
generalizations pertaining to the entire population.
 It is also useful when the researcher has limited budget, time and workforce.
 This technique can also be used in an initial study which will be carried out again using a
randomized, probability sampling.

Statistical inference
• Population - a collection of objects having some common characteristic of interest that is under
consideration for a statistical investigation.

• Sample - a finite subset of the population.

• Sampling error - the inherent and unavoidable error that arises when a characteristic of the
population is approximated from a sample.

• Random sample - a sample of n objects selected from the population in such a way that each
object has an equal probability of being selected.

• Standard error - the standard deviation of a sampling distribution.

• Confidence interval and confidence limits - In order to estimate the population mean, we cannot
draw the large number of samples that occur in the entire population. So we set up certain
limits on both sides of the sample mean, on the basis that the means of samples are
normally distributed around the population mean. These limits are called confidence limits,
and the range between the two is called the confidence interval.

• The field of statistical inference consists of those methods used to make decisions or draw
conclusions about a population. These methods utilize the information contained in a
sample from the population in drawing conclusions.

Point Estimation
Hypothesis Testing

For example, suppose that we are interested in the burning rate of a solid propellant used to
power aircrew escape systems.

• Now burning rate is a random variable that can be described by a probability


distribution.

• Suppose that our interest focuses on the mean burning rate (a parameter of this
distribution).

• Specifically, we are interested in deciding whether or not the mean burning rate is 50
centimeters per second.

• Null hypothesis – the hypothesis which is being tested for possible rejection.

• Alternative hypothesis – the hypothesis which is accepted when the null hypothesis is
rejected.

• Critical region – the set of all those samples which lead to the rejection of the null
hypothesis.

• Level of significance – the probability of rejecting the null hypothesis when it is actually
true.

• Two-sided alternative hypothesis: H1: μ ≠ μ0

• One-sided alternative hypotheses: H1: μ > μ0 or H1: μ < μ0

Test of a Hypothesis
• A procedure leading to a decision about a particular hypothesis

• Hypothesis-testing procedures rely on using the information in a random sample from


the population of interest.

• If this information is consistent with the hypothesis, then we will conclude that the
hypothesis is true; if this information is inconsistent with the hypothesis, we will
conclude that the hypothesis is false.

• The power is computed as 1 − β, and power can be interpreted as the probability of
correctly rejecting a false null hypothesis. We often compare statistical tests by
comparing their power properties.

• For example, consider the propellant burning rate problem when we are testing H0:
μ = 50 centimeters per second against H1: μ ≠ 50 centimeters per second.
Suppose that the true value of the mean is μ = 52. When n = 10, we found that β =
0.2643, so the power of this test is 1 − β = 1 − 0.2643 = 0.7357 when μ = 52.
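
The β = 0.2643 figure can be reproduced under the assumptions usually attached to this textbook example (a population standard deviation of σ = 2.5 cm/s and an acceptance region of 48.5 ≤ x̄ ≤ 51.5, neither of which is stated above); a hedged sketch:

from math import sqrt
from scipy.stats import norm

sigma, n = 2.5, 10          # assumed population standard deviation and sample size
se = sigma / sqrt(n)        # standard error of the sample mean

lower, upper = 48.5, 51.5   # assumed acceptance region for H0: mu = 50
mu_true = 52                # true mean under the alternative

beta = norm.cdf(upper, mu_true, se) - norm.cdf(lower, mu_true, se)
print(round(beta, 4))       # close to the 0.2643 quoted above (which uses a rounded z value)
print(round(1 - beta, 4))   # power of the test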

General Procedure for Hypothesis Testing


T test
• In many real-life problems the population mean to be tested is known (hypothesized), but the
exact population standard deviation cannot be calculated. In such cases the t test is used.

• It is typically applied to modest sample sizes (roughly 30-40 or fewer).

Types
• One sample t test – used to compare the mean of a single sample with the
population mean.

• For example, an economist wants to know if the per capita income of a particular region is the
same as the national average.

• Independent sample t test – used for detecting differences between the means of two
independent groups.

• For example, an economist wants to compare the per capita income of two different regions.

Z test
• For a z test, the population mean and the population standard deviation should be known.

• It requires a large sample size.


Analysis of variance
• ANOVA is used to compare the means of more than two populations.

• It has extensive application in consumer behavior and marketing management related
problems.

• For example, a marketing manager wants to investigate the impact of different discount schemes
on the sales of three major brands of edible oil.

F statistic
• ANOVA uses the F statistic, which tests whether the means of the groups formed by one
independent variable or a combination of independent variables are significantly
different. It is based on the comparison of variances.

• Conditions: the dependent variable should be interval or ratio, and the populations should be
normally distributed.
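
A sketch of such a comparison using SciPy's one-way ANOVA, with made-up sales figures for three hypothetical discount schemes:

from scipy import stats

scheme_a = [120, 135, 128, 140, 132]    # hypothetical sales under scheme A
scheme_b = [150, 145, 160, 155, 148]    # hypothetical sales under scheme B
scheme_c = [130, 125, 138, 128, 135]    # hypothetical sales under scheme C

f_stat, p_value = stats.f_oneway(scheme_a, scheme_b, scheme_c)
print(f_stat, p_value)   # a small P-value suggests the group means are not all equal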

Chi square test


• One of the popular methods for testing hypotheses on discrete data.

• It is used to test the hypothesis that two categorical variables are independent of each
other.

• For example, an organization's researcher wants to determine whether employees' satisfaction
with the firm is dependent on their placements.
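
A hedged sketch of such an independence test, using a hypothetical contingency table of satisfaction level versus placement:

from scipy.stats import chi2_contingency

# rows: satisfied / not satisfied, columns: placement A / placement B (hypothetical counts)
observed = [[45, 30],
            [15, 40]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)   # a small P-value suggests satisfaction and placement are not independent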

One Sample t-test

A one sample t-test is a hypothesis test for answering questions about the mean where
the data are a random sample of independent observations from an underlying normal
distribution N(µ, σ²), where σ² is unknown.

The null hypothesis for the one sample t-test is:


H0: µ = µ0, where µ0 is known.

That is, the sample has been drawn from a population of given mean and unknown
variance (which therefore has to be estimated from the sample).

This null hypothesis, H0, is tested against one of the following alternative hypotheses,
depending on the question posed:
H1: µ is not equal to µ0
H1: µ > µ0
H1: µ < µ0
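
In recent SciPy versions these three alternatives map onto the alternative argument of scipy.stats.ttest_1samp; a minimal sketch with made-up data:

from scipy import stats

sample = [41.2, 39.8, 42.5, 40.1, 43.0, 38.7, 41.9, 40.6]   # hypothetical observations
mu_0 = 40.0                                                  # known value in H0

# "two-sided": mu != mu_0, "greater": mu > mu_0, "less": mu < mu_0
for alt in ("two-sided", "greater", "less"):
    t_stat, p_value = stats.ttest_1samp(sample, mu_0, alternative=alt)
    print(alt, round(float(t_stat), 3), round(float(p_value), 4))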

Two Sample t-test


A two sample t-test is a hypothesis test for answering questions about the mean where
the data are collected from two random samples of independent observations, each
from an underlying normal distribution: N(µ1, σ1²) and N(µ2, σ2²).

When carrying out a two sample t-test, it is usual to assume that the variances for the
two populations are equal, i.e. σ1² = σ2² = σ².

The null hypothesis for the two sample t-test is:

H0: µ1 = µ2

That is, the two samples have both been drawn from the same population. This null
hypothesis is tested against one of the following alternative hypotheses, depending on
the question posed.
H1: µ1 is not equal to µ2
H1: µ1 > µ2
H1: µ1 < µ2
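
A sketch of the two sample t-test with SciPy, where equal_var=True corresponds to the equal-variance assumption above (the data are made up):

from scipy import stats

group_1 = [12.1, 11.8, 13.0, 12.5, 11.9, 12.7]   # hypothetical sample from population 1
group_2 = [13.2, 13.5, 12.9, 14.0, 13.8, 13.1]   # hypothetical sample from population 2

t_stat, p_value = stats.ttest_ind(group_1, group_2, equal_var=True)
print(t_stat, p_value)   # a small P-value suggests rejecting H0: mu1 = mu2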

Definition of Time Series: An ordered sequence of values of a variable at equally spaced time
intervals. Time series occur frequently when looking at industrial data.
Applications: The usage of time series models is twofold:

 Obtain an understanding of the underlying forces and structure that produced the observed data
 Fit a model and proceed to forecasting, monitoring or even feedback and feed forward control.

Time Series Analysis is used for many applications such as:

 Economic Forecasting

 Sales Forecasting

 Budgetary Analysis

 Stock Market Analysis

 Yield Projections

 Process and Quality Control

 Inventory Studies

 Workload Projections

 Utility Studies

 Census Analysis.

Business benefits
What you do with time-series forecasting depends on your business activity.

Time-series for Retail

 The sales history of each product constitutes a time-series to be forecasted. The sales forecasts are
used to optimize the inventory levels. Too much inventory, and your expenses go up. Too little
inventory, and sales opportunities are lost in out-of-stock situations.
 Time-series for Manufacturers
The production history and/or the input consumptions constitute time-series to be forecasted.
The input consumptions are forecasted in order to minimize inventory levels. The production
history can be forecasted and used as an approximation of the future demand. Forecasting the
demand enables efficient capacity planning.

 Time-series for Customer Services


Customer activities can be represented as time-series (e.g., the hourly volume of customer calls in
call centers). Forecasting the customer activity can be used to optimize staff scheduling. Too
much staff, and money is wasted paying idle staff. Too little staff, and customer satisfaction
drops.

Forecasting in Business
Business leaders and economists are continually involved in the process of trying to forecast, or
predict, the future of business in the economy. Business leaders engage in this process because
much of what happens in businesses today depends on what is going to happen in the future.

Qualitative Forecasting Models

Qualitative forecasting models have often proven to be most effective for short-term projections.
In this method of forecasting, which works best when the scope is limited, experts in the
appropriate fields are asked to agree on a common forecast. Two methods are used frequently.

Delphi Method. This method involves asking various experts what they anticipate will happen in the
future relative to the subject under consideration. Experts in the automotive industry, for example, might
be asked to forecast likely innovative enhancements for cars five years from now. They are not expected
to be precise, but rather to provide general opinions.

Market Research Method. This method involves surveys and questionnaires about people's subjective
reactions to changes. For example, a company might develop a new way to launder clothes; after people
have had an opportunity to try the new method, they would be asked for feedback about how to improve
the processes or how it might be made more appealing for the general public. This method is difficult
because it is hard to identify an appropriate sample that is representative of the larger audience for whom
the product is intended.

Quantitative Forecasting Models


Three quantitative methods are in common use.

Time-Series Methods. This forecasting model uses historical data to try to predict future events. For
example, assume that you are interested in knowing how long a recession will last. You might look at all
past recessions and the events leading up to and surrounding them and then, from that data, try to predict
how long the current recession will last.

A specific variable in the time series is identified by the series name and date. If gross domestic product
(GDP) is the variable, it might be identified as GDP2000.1 for the first-quarter statistics for the year 2000.
This is just one example, and different groups might use different methods to identify variables in a time
period.

Many government agencies prepare and release time-series data. The Federal Reserve, for example,
collects data on monetary policy and financial institutions and publishes that data in the Federal Reserve
Bulletin. These data become the foundation for making decisions about regulating the growth of the
economy.

Time-series models provide accurate forecasts when the changes that occur in the variable's environment
are slow and consistent. When large-degree changes occur, the forecasts are not reliable for the long term.
Since time-series forecasts are relatively easy and inexpensive to construct, they are used quite
extensively.

The Indicator Approach. The U.S. government is a primary user of the indicator approach of
forecasting. The government uses such indicators as the Composite Index of Leading, Lagging, and
Coincident Indicators, often referred to as Composite Indexes. The indexes predict by assuming that past
trends and relationships will continue into the future. The government indexes are made by averaging the
behavior of the different indicator series that make up each composite series.

The timing and strength of each indicator series relationship with general business activity, reflected in
the business cycle, change over time. This relationship makes forecasting changes in the business cycle
difficult.

Econometric Models. Econometric models are causal models that statistically identify the relationships
between variables and how changes in one or more variables cause changes in another variable.
Econometric models then use the identified relationship to predict the future. Econometric models are
also called regression models.

There are two types of data used in regression analysis. Economic forecasting models predominantly use
time-series data, where the values of the variables change over time. Additionally, cross-section data,
which capture the relationship between variables at a single point in time, are used. A lending institution,
for example, might want to determine what influences the sale of homes. It might gather data on home
prices, interest rates, and statistics on the homes being sold, such as size and location. This is the cross-
section data that might be used with time-series data to try to determine such things as what size home
will sell best in which location.

An econometric model is a way of determining the strength and statistical significance of a hypothesized
relationship. These models are used extensively in economics to prove, disprove, or validate the existence
of a causal relationship between two or more variables. Such models are highly mathematical, using
different statistical equations.

For the sake of simplicity, mathematical analysis is not addressed here. Just as there are these qualitative
and quantitative forecasting models, there are others equally as sophisticated; however, the discussion
here should provide a general sense of the nature of forecasting models.

The Forecasting Process


When beginning the forecasting process, there are typical steps that must be followed. These steps follow
an acceptable decision-making process that includes the following elements:

1. Identification of the problem. Forecasters must identify what is going to be forecasted, or what is
of primary concern. There must be a timeline attached to the forecasting period. This will help the
forecasters to determine the methods to be used later.
2. Theoretical considerations. It is necessary to determine what forecasting has been done in the
past using the same variables and how relevant these data are to the problem that is currently
under consideration. It must also be determined what economic theory has to say about the
variables that might influence the forecast.
3. Data concerns. A significant issue is how easy it will be to collect the data needed to make the
forecasts.
4. Determination of the assumption set. The forecaster must identify the assumptions that will be
made about the data and the process.
5. Modeling methodology. After careful examination of the problem, the types of models most
appropriate for the problem must be determined.
6. Preparation of the forecast. This is the analysis part of the process. After the model to be used is
determined, the analysis can begin and the forecast can be prepared.
7. Forecast verification. Once the forecasts have been made, the analyst must determine whether
they are reasonable and how they can be compared against the actual behavior of the data.

Forecasting Concerns

Forecasting does present some problems. Even though very detailed and sophisticated
mathematical models might be used, they do not always predict correctly. There are some who
would argue that the future cannot be predicted at all— period!

Some of the concerns about forecasting the future are that (1) predictions are made using
historical data, (2) they fail to account for unique events, and (3) they ignore coevolution
(developments created by our own actions). Additionally, there are psychological challenges
implicit in forecasting. An example of a psychological challenge is when plans based on
forecasts that use historical data become so confining as to prohibit management freedom. It is
also a concern that many decision makers feel that because they have the forecasting data in hand
they have control over the future.

Binomial Distribution

To understand binomial distributions and binomial probability, it helps to understand binomial


experiments and some associated notation; so we cover those topics first.

Binomial Experiment

A binomial experiment (a series of independent Bernoulli trials) is a statistical experiment that has the
following properties:

 The experiment consists of n repeated trials.


 Each trial can result in just two possible outcomes. We call one of these outcomes a success and
the other, a failure.
 The probability of success, denoted by P, is the same on every trial.
 The trials are independent; that is, the outcome on one trial does not affect the outcome on
other trials.
Consider the following statistical experiment. You flip a coin 2 times and count the number of
times the coin lands on heads. This is a binomial experiment because:

 The experiment consists of repeated trials. We flip a coin 2 times.


 Each trial can result in just two possible outcomes - heads or tails.
 The probability of success is constant - 0.5 on every trial.
 The trials are independent; that is, getting heads on one trial does not affect whether we get
heads on other trials.

Notation
The following notation is helpful, when we talk about binomial probability.

 x: The number of successes that result from the binomial experiment.


 n: The number of trials in the binomial experiment.
 P: The probability of success on an individual trial.
 Q: The probability of failure on an individual trial. (This is equal to 1 - P.)
 b(x; n, P): Binomial probability - the probability that an n-trial binomial experiment results in
exactly x successes, when the probability of success on an individual trial is P.
 nCr: The number of combinations of n things, taken r at a time.

Binomial Distribution
A binomial random variable is the number of successes x in n repeated trials of a binomial
experiment. The probability distribution of a binomial random variable is called a binomial
distribution (the special case with a single trial, n = 1, is known as a Bernoulli distribution).

Suppose we flip a coin two times and count the number of heads (successes). The binomial
random variable is the number of heads, which can take on values of 0, 1, or 2. The binomial
distribution is presented below.

Number of heads Probability

0 0.25

1 0.50

2 0.25

The binomial distribution has the following properties:

 The mean of the distribution is equal to n * P.

 The variance (σ²x) is npq.
 The standard deviation (σx) is √(npq).

Binomial Probability
The binomial probability refers to the probability that a binomial experiment results in exactly
x successes. For example, in the above table, we see that the binomial probability of getting
exactly one head in two coin flips is 0.50.

Given x, n, and P, we can compute the binomial probability based on the following formula:

Binomial Formula. Suppose a binomial experiment consists of n trials and results in x successes. If the
probability of success on an individual trial is P (and Q = 1 − P), then the binomial probability is:

b(x; n, P) = nCx · P^x · Q^(n − x)
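
The coin-flip table above can be reproduced directly from this formula; a minimal sketch in Python:

from math import comb

n, p = 2, 0.5           # two flips, probability of heads on each flip
q = 1 - p

for x in range(n + 1):
    prob = comb(n, x) * p ** x * q ** (n - x)
    print(x, prob)      # 0 -> 0.25, 1 -> 0.5, 2 -> 0.25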

The Poisson Distribution

The Poisson distribution is another discrete probability distribution. It is named after Siméon-
Denis Poisson (1781-1840), a French mathematician. The Poisson distribution depends only on
the average number of occurrences per unit of time or space; there is no n and no p. The Poisson
probability distribution provides a close approximation to the binomial probability distribution
when n is large and p is quite small or quite large.

The Poisson distribution is most commonly used to model the number of random occurrences of
some phenomenon in a specified unit of space or time. For example,

 The number of phone calls received by a telephone operator in a 10-minute period.


 The number of flaws in a bolt of fabric.
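
For instance, using the standard Poisson formula P(x) = e^(−λ) λ^x / x! with an assumed average of 3 calls per 10-minute period, the probability of exactly x calls can be computed; a sketch:

from math import exp, factorial

lam = 3      # assumed average number of calls per 10-minute period

def poisson_pmf(x, lam):
    return exp(-lam) * lam ** x / factorial(x)

for x in range(6):
    print(x, round(poisson_pmf(x, lam), 4))   # P(exactly x calls in the period)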

Normal Distribution

The normal distribution, also called the Gaussian distribution, is an important family of
continuous probability distributions, applicable in many fields. The importance of the normal
distribution as a model of quantitative phenomena in the natural and behavioral sciences is due in
part to the central limit theorem.

Characteristics of a Normal Distribution

1)     Continuous Random Variable.


2)    Mound or Bell-shaped curve.

3)    The normal curve extends indefinitely in both directions, approaching, but never touching,
the horizontal axis as it does so.

4)    Unimodal

5)    Mean = Median = Mode

6)    Symmetrical with respect to the mean

That is, 50% of the area (data) under the curve lies to the left of the mean and 50% of the
area (data) under the curve lies to the right of the mean.

7)    The total area under the normal curve is equal to 1.
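
Several of these properties can be checked numerically; a small sketch using SciPy's normal distribution:

from scipy.stats import norm

mu, sigma = 0, 1                    # standard normal as an example

print(norm.cdf(mu, mu, sigma))      # 0.5 -> half the area lies to the left of the mean
print(norm.cdf(10, mu, sigma))      # approximately 1 -> total area under the curve is 1
print(norm.cdf(1, mu, sigma) - norm.cdf(-1, mu, sigma))   # about 0.68 within one standard deviation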
