Stats Notes
Types of Statistics
Descriptive Statistics are used to describe the data set.
Inferential Statistics allow you to infer something about the parameters of the
population based on the statistics of the sample and the various tests we perform on the
sample.
The three most commonly-used measures of central tendency are the following.
Mean
The sum of the values divided by the number of values--often called the "average."
Example: For individuals having the following ages -- 18, 18, 19, 20, 20, 20, 21, and 23, the
mean is (18 + 18 + 19 + 20 + 20 + 20 + 21 + 23) / 8 = 159 / 8 = 19.875.
Median
The middle value when the values are arranged in order of size (or the average of the two
middle values when there is an even number of values). For the same ages, the median is
(20 + 20) / 2 = 20.
Mode
The value that occurs most frequently. For the same ages, the mode is 20.
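As a quick check, these three measures can be computed with Python's standard statistics module (the ages are the ones from the example above):

# Central tendency of the example ages using the standard library
import statistics

ages = [18, 18, 19, 20, 20, 20, 21, 23]

print(statistics.mean(ages))    # 19.875
print(statistics.median(ages))  # 20
print(statistics.mode(ages))    # 20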
Dispersion
Measurements of central tendency (mean, mode and median) locate the distribution within the
range of possible values; measurements of dispersion describe the spread of values.
The dispersion of values within variables is especially important in social and political research.
Range
The range is the simplest measure of dispersion. The range can be thought of in two ways.
As a quantity: the difference between the highest and lowest scores in a distribution.
As an interval: the lowest and highest scores taken together.
The standard deviation is simply the square root of the variance. In some sense, taking the
square root of the variance "undoes" the squaring of the differences that we did when we
calculated the variance.
Variance and standard deviation of a population are designated by σ² and σ, respectively.
Variance and standard deviation of a sample are designated by s² and s, respectively.
Population:   σ² = Σ(X − μ)² / N    and    σ = √(σ²)
Sample:       s² = Σ(X − x̄)² / (n − 1)    and    s = √(s²)
In these equations, μ is the population mean, x̄ is the sample mean, N is the total
number of scores in the population, and n is the number of scores in the sample.
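A short sketch of these formulas in Python, reusing the example ages from above (the sample formula here uses the n − 1 divisor):

# Population vs. sample variance and standard deviation
import math

scores = [18, 18, 19, 20, 20, 20, 21, 23]
n = len(scores)
mean = sum(scores) / n

pop_var = sum((x - mean) ** 2 for x in scores) / n          # sigma squared
samp_var = sum((x - mean) ** 2 for x in scores) / (n - 1)   # s squared

print(pop_var, math.sqrt(pop_var))    # population variance and sigma
print(samp_var, math.sqrt(samp_var))  # sample variance and s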
Coefficient of Variation (C.V.)
To compare the variation (dispersion) of two different series, a relative measure of standard
deviation must be calculated. This is known as the coefficient of variation, or the coefficient of s.d.
Its formula is
C.V. = (σ / x̄) × 100
Remark: It is given as a percentage and is used to compare the consistency or variability of two
or more series. The higher the C.V., the higher the variability; the lower the C.V., the higher the
consistency of the data.
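A short sketch comparing the C.V. of two made-up series:

# Coefficient of variation for two hypothetical series
import statistics

def coefficient_of_variation(values):
    # C.V. = (standard deviation / mean) * 100
    return statistics.pstdev(values) / statistics.mean(values) * 100

series_a = [52, 55, 49, 60, 54]   # hypothetical data
series_b = [10, 25, 5, 40, 20]    # hypothetical data

print(coefficient_of_variation(series_a))  # smaller C.V. -> more consistent
print(coefficient_of_variation(series_b))  # larger C.V. -> more variable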
What are variables? Variables are things that we measure, control, or manipulate in research. They
differ in many respects, most notably in the role they are given in our research and in the type of
measures that can be applied to them.
Correlational vs. experimental research. Most empirical research belongs clearly to one of
those two general categories. In correlational research we do not (or at least try not to) influence
any variables but only measure them and look for relations (correlations) between some set of
variables, such as blood pressure and cholesterol level. In experimental research, we manipulate
some variables and then measure the effects of this manipulation on other variables; for example,
a researcher might artificially increase blood pressure and then record cholesterol level. Data
analysis in experimental research also comes down to calculating "correlations" between
variables, specifically, those manipulated and those affected by the manipulation. However,
experimental data may potentially provide qualitatively better information: Only experimental
data can conclusively demonstrate causal relations between variables. For example, if we found
that whenever we change variable A then variable B changes, then we can conclude that "A
influences B." Data from correlational research can only be "interpreted" in causal terms based
on some theories that we have, but correlational data cannot conclusively prove causality.
Dependent vs. independent variables. Independent variables are those that are manipulated
whereas dependent variables are only measured or registered. This distinction appears
terminologically confusing to many because, as some students say, "all variables depend on
something." However, once you get used to this distinction, it becomes indispensable. The terms
dependent and independent variable apply mostly to experimental research where some variables
are manipulated, and in this sense they are "independent" from the initial reaction patterns,
features, intentions, etc. of the subjects. Some other variables are expected to be "dependent" on
the manipulation or experimental conditions. That is to say, they depend on "what the subject
will do" in response. Somewhat contrary to the nature of this distinction, these terms are also
used in studies where we do not literally manipulate independent variables, but only assign
subjects to "experimental groups" based on some pre-existing properties of the subjects. For
example, if in an experiment, males are compared with females regarding their white cell count
(WCC), Gender could be called the independent variable and WCC the dependent variable.
Measurement scales. Variables differ in "how well" they can be measured, i.e., in how much
measurable information their measurement scale can provide. There is obviously some
measurement error involved in every measurement, which determines the "amount of
information" that we can obtain. Another factor that determines the amount of information that
can be provided by a variable is its "type of measurement scale." Specifically, variables are
classified as (a) nominal, (b) ordinal, (c) interval, or (d) ratio.
a. Nominal variables allow for only qualitative classification. That is, they can be measured
only in terms of whether the individual items belong to some distinctively different
categories, but we cannot quantify or even rank order those categories. For example, all
we can say is that 2 individuals are different in terms of variable A (e.g., they are of
different race), but we cannot say which one "has more" of the quality represented by the
variable. Typical examples of nominal variables are gender, race, color, city, etc.
b. Ordinal variables allow us to rank order the items we measure in terms of which has less
and which has more of the quality represented by the variable, but still they do not allow
us to say "how much more." A typical example of an ordinal variable is the
socioeconomic status of families. For example, we know that upper-middle is higher than
middle but we cannot say that it is, for example, 18% higher. Also this very distinction
between nominal, ordinal, and interval scales itself represents a good example of an
ordinal variable. For example, we can say that nominal measurement provides less
information than ordinal measurement, but we cannot say "how much less" or how this
difference compares to the difference between ordinal and interval scales.
c. Interval variables allow us not only to rank order the items that are measured, but also to
quantify and compare the sizes of differences between them. For example, temperature,
as measured in degrees Fahrenheit or Celsius, constitutes an interval scale. We can say
that a temperature of 40 degrees is higher than a temperature of 30 degrees, and that an
increase from 20 to 40 degrees is twice as much as an increase from 30 to 40 degrees.
d. Ratio variables are very similar to interval variables; in addition to all the properties of
interval variables, they feature an identifiable absolute zero point, thus they allow for
statements such as x is two times more than y. Typical examples of ratio scales are
measures of time or space. For example, as the Kelvin temperature scale is a ratio scale,
not only can we say that a temperature of 200 degrees is higher than one of 100 degrees,
we can correctly state that it is twice as high. Interval scales do not have the ratio
property. Most statistical data analysis procedures do not distinguish between the interval
and ratio properties of the measurement scales.
Correlations
The correlation is one of the most common and most useful statistics. A correlation is a single
number that describes the degree of relationship between two variables.
Purpose (What is Correlation?) Correlation is a measure of the relation between two or more
variables. The measurement scales used should be at least interval scales, but other correlation
coefficients are available to handle other types of data. Correlation coefficients can range from
-1.00 to +1.00. The value of -1.00 represents a perfect negative correlation while a value of
+1.00 represents a perfect positive correlation. A value of 0.00 represents a lack of correlation.
In statistics, Spearman's rank correlation coefficient, named after Charles Spearman and often
denoted by the Greek letter ρ (rho) or as rs, is a non-parametric measure of correlation – that is, it
assesses how well an arbitrary monotonic function could describe the relationship between two
variables, without making any assumptions about the frequency distribution of the variables.
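As an illustration, the sketch below computes both coefficients with scipy.stats on a small made-up data set:

# Pearson and Spearman correlation on made-up data
from scipy import stats

x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 7, 8]   # hypothetical paired measurements

r, p_r = stats.pearsonr(x, y)        # linear (interval-scale) correlation
rho, p_rho = stats.spearmanr(x, y)   # rank-based, non-parametric

print(r, rho)   # both lie between -1 and +1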
Linear Regression
Correlation gives us an idea of the magnitude and direction of the relationship between correlated
variables. It is natural, then, to look for a method that helps us estimate the value of one
variable when the other is known. Also, correlation does not imply causation. The fact that the
variables x and y are correlated does not necessarily mean that x causes y or vice versa. For
example, you would find that the number of schools in a town is correlated with the number of
accidents in the town. The schools do not cause the accidents; both simply increase with the
population of the town. A statistical procedure called regression is concerned with the causal
relationship among variables. It assesses the contribution of one or more causing variables
(independent variables) to the variable being caused (the dependent variable). When there is only
one independent variable and the relationship is expressed by a straight line, the procedure is
called simple linear regression.
Regression can be defined as a method that estimates the value of one variable when that of
other variable is known, provided the variables are correlated. The dictionary meaning of
regression is "to go backward." It was used for the first time by Sir Francis Galton in his research
paper "Regression towards mediocrity in hereditary stature."
Lines of Regression: In a scatter plot, we have seen that if the variables are highly correlated then
the points (dots) lie in a narrow strip. If the strip is nearly straight, we can draw a straight line
such that all points are close to it from both sides. Such a line can be taken as an ideal
representation of the variation. This line is called the line of best fit if it minimizes the distances of
all data points from it.
This line is called the line of regression. Prediction is now easy, because all we need to do is
extend the line and read off the value. Thus, to obtain a line of regression, we need a line
of best fit. But statisticians do not measure the distances by dropping perpendiculars from the points
onto the line. They measure deviations (or errors, or residuals, as they are called) (i) vertically
and (ii) horizontally. Thus we get two lines of regression, as shown in figures (1) and (2).
Regression of y on x (vertical deviations): its form is y = a + bx
Regression of x on y (horizontal deviations): its form is x = a′ + b′y
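A minimal sketch of the two regression lines using numpy on made-up data; fitting y on x and x on y gives two different lines:

# Two regression lines: y on x and x on y (made-up data)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

b_yx, a_yx = np.polyfit(x, y, 1)   # minimizes vertical deviations: y = a + b*x
b_xy, a_xy = np.polyfit(y, x, 1)   # minimizes horizontal deviations: x = a' + b'*y

print(f"y = {a_yx:.3f} + {b_yx:.3f} x")
print(f"x = {a_xy:.3f} + {b_xy:.3f} y")   # not simply the inverse of the first line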
Regression can be used for prediction (including forecasting of time-series data), inference,
hypothesis testing, and modeling of causal relationships
Correlation and linear regression are not the same. Consider these differences:
Correlation quantifies the degree to which two variables are related. Correlation does not
find a best-fit line (that is regression). You simply are computing a correlation coefficient
(r) that tells you how much one variable tends to change when the other one does.
With correlation you don't have to think about cause and effect. You simply quantify how
well two variables relate to each other. With regression, you do have to think about cause
and effect as the regression line is determined as the best way to predict Y from X.
With correlation, it doesn't matter which of the two variables you call "X" and which you
call "Y". You'll get the same correlation coefficient if you swap the two. With linear
regression, the decision of which variable you call "X" and which you call "Y" matters a
lot, as you'll get a different best-fit line if you swap the two. The line that best predicts Y
from X is not the same as the line that predicts X from Y.
Correlation is almost always used when you measure both variables. It rarely is
appropriate when one variable is something you experimentally manipulate. With linear
regression, the X variable is often something you experimentally manipulate (time,
concentration...) and the Y variable is something you measure.
Probable Error
It is used to help judge the significance of Karl Pearson's coefficient of correlation 'r'. It indicates
the extent to which 'r' can be relied upon, but note that it assumes a random sample drawn under
the usual conditions. It is given by
P.E. = 0.6745 × (1 − r²) / √N
where N is the number of pairs of observations.
i. If the value of r is less than the P.E., then there is no evidence of correlation, i.e. r is not
significant.
ii. If the value of r is more than six times the P.E., then r is significant.
iii. By adding the P.E. to 'r' and subtracting the P.E. from 'r', we get the upper and lower limits
within which the 'r' of the population can be expected to lie.
Symbolically: limits = r ± P.E.
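A small sketch of these rules for an assumed r and number of pairs N:

# Probable error of r (assumed example values)
import math

r = 0.8    # hypothetical correlation coefficient
N = 64     # hypothetical number of pairs

pe = 0.6745 * (1 - r**2) / math.sqrt(N)
print(pe)                 # probable error
print(r - pe, r + pe)     # limits for the population correlation
print("significant" if r > 6 * pe else "not significant")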
Probability basics
Sample space
The totality of all the outcomes or results of a random experiment is called the sample space, and
is usually denoted by S (or by the Greek letter Ω). Each outcome or element of this sample
space is known as a sample point.
Event
Any subset of a sample space is called an event. The sample space S serves as the universal set for
all questions related to the experiment, and an event A with respect to it is the set of all outcomes
favorable to the event A.
Equally Likely Events
Outcomes of a random experiment are called equally likely if we have no reason to expect any one
of them rather than another. For example, when a card is drawn from a well-shuffled pack, any of
the 52 cards may appear in the draw, so the 52 cards give 52 different events which are equally
likely.
Mutually Exclusive Events
Events are called mutually exclusive (or disjoint, or incompatible) if the occurrence of one of them
precludes the occurrence of all the others. For example, in tossing a coin there are two mutually
exclusive events, viz. turning up of a head and turning up of a tail, since both of these events cannot
happen simultaneously. Note that events are compatible if it is possible for them to happen
simultaneously. For instance, in rolling two dice, the cases of the face marked 5 appearing on
one die and the face 5 appearing on the other die are compatible.
Exhaustive Events
Events are exhaustive when they include all the possibilities associated with the same trial. In
tossing a coin, the turning up of a head and of a tail are exhaustive events, assuming of course
that the coin cannot rest on its edge.
Independent Events
Two events are said to be independent if the occurrence of any event does not affect the
occurrence of the other event. For example in tossing of a coin, the events corresponding to the
two successive tosses of it are independent. The flip of one penny does not affect in any way the
flip of a nickel.
Dependent Events
If the occurrence or non-occurrence of one event affects the happening of the other, then the
events are said to be dependent events. For example, in drawing cards from a pack, let event A be
the occurrence of a king in the first draw and event B be the occurrence of a king in the second
draw. If the card drawn at the first trial is not replaced, then events A and B are dependent events.
Note
(1) If an event contains a single sample point, i.e. it is a singleton set, then this event is
called an elementary or a simple event.
(2) An event corresponding to the empty set is called an 'impossible event'.
(3) An event corresponding to the entire sample space is called a 'certain event'.
Complementary Events
Let S be the sample space for an experiment and A be an event in S. Then A is a subset of S.
Hence A′, the complement of A in S, is also an event in S; it contains the outcomes which are
not favorable to the occurrence of A. That is, if A occurs, then the outcome of the experiment
belongs to A, but if A does not occur, then the outcome of the experiment belongs to A′.
If S contains n equally likely, mutually exclusive and exhaustive sample points and A contains m
of these n points, then A′ contains (n − m) sample points.
Definitions of Probability
If there are n exhaustive, mutually exclusive and equally likely cases and m of them are
favorable to an event A, the probability of A happening is defined as the ratio m/n.
Expressed as a formula:
P(A) = m / n
This definition is due to Laplace. Thus probability is a concept which measures numerically the
degree of certainty or uncertainty of the occurrence of an event.
For example, the probability of randomly drawing a king from a well-shuffled deck of cards is
4/52, since 4 is the number of favorable outcomes (the kings of diamonds, spades, clubs and
hearts) and 52 is the number of total outcomes (the number of cards in a deck).
If A is any event of the sample space having probability p, then clearly p is a positive number
(expressed as a fraction or usually as a decimal) not greater than unity: 0 ≤ p ≤ 1, i.e. from 0 (no
chance, for an impossible event) to a high of 1 (certainty). Since the number of cases not favorable
to A is (n − m), the probability q that event A will not happen is q = (n − m)/n = 1 − m/n = 1 − p.
Note that the probability q is nothing but the probability of the complementary event A′, i.e.
P(A′) = 1 − P(A).
Addition theorem
In general, if the letters A and B stand for any two events, then
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Multiplication theorem: If there are two independent events, the respective probabilities of which
are known, then the probability that both will happen is the product of the probabilities of their
happening respectively:
P(AB) = P(A) P(B)
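These rules can be checked by enumerating a standard 52-card deck; the sketch below is only an illustration and the helper names are made up:

# Complement, addition, and multiplication rules on a 52-card deck
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))

def prob(event):
    # classical definition: favorable outcomes / total outcomes
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

is_king = lambda c: c[0] == "K"
is_heart = lambda c: c[1] == "hearts"

p_king, p_heart = prob(is_king), prob(is_heart)
print(p_king)                                      # 4/52 = 1/13
print(1 - p_king)                                  # complement: P(not king)
# addition theorem: P(king or heart)
print(p_king + p_heart - prob(lambda c: is_king(c) and is_heart(c)))
print(prob(lambda c: is_king(c) or is_heart(c)))   # same value, 16/52 = 4/13
# multiplication theorem for independent events: P(king and heart) = P(king) * P(heart)
print(p_king * p_heart == prob(lambda c: is_king(c) and is_heart(c)))  # True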
The two-tailed test is a statistical test used in inference, in which a given statistical hypothesis,
H0 (the null hypothesis) will be rejected when the value of the test statistic is either sufficiently
small or sufficiently large. This contrasts with a one-tailed test, in which only one of the
rejection regions "sufficiently small" or "sufficiently large" is preselected according to the
alternative hypothesis being selected, and the hypothesis is rejected only if the test statistic
satisfies that criterion. Alternative names are one-sided and two-sided tests.
The test is named after the "tail" of data under the far left and far right of a bell-shaped normal
data distribution, or bell curve. However, the terminology is extended to tests relating to
distributions other than normal. In general a test is called two-tailed if the null hypothesis is
rejected for values of the test statistic falling into either tail of its sampling distribution, and it is
called one-sided or one-tailed if the null hypothesis is rejected only for values of the test statistic
falling into one specified tail of its sampling distribution.[1] For example, if the alternative
hypothesis is μ ≠ 42.5, rejecting the null hypothesis of μ = 42.5 for small or for large values
of the sample mean, the test is called "two-tailed" or "two-sided". If the alternative hypothesis is
μ > 1.4, rejecting the null hypothesis of μ ≤ 1.4 only for large values of the sample mean, it is
then called "one-tailed" or "one-sided".
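As a sketch of the difference, the snippet below computes two-tailed and one-tailed p-values for an assumed z statistic of 1.85:

# Two-tailed vs. one-tailed p-values for an assumed test statistic
from scipy.stats import norm

z = 1.85                             # hypothetical z statistic

p_two_tailed = 2 * norm.sf(abs(z))   # area in both tails
p_one_tailed = norm.sf(z)            # area in the upper tail only

print(p_two_tailed)  # ~0.064
print(p_one_tailed)  # ~0.032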
State the hypotheses. Every hypothesis test requires the analyst to state a null hypothesis
and an alternative hypothesis. The hypotheses are stated in such a way that they are
mutually exclusive. That is, if one is true, the other must be false; and vice versa.
Formulate an analysis plan. The analysis plan describes how to use sample data to accept
or reject the null hypothesis. It should specify the following elements.
o Test method. Typically, the test method involves a test statistic and a sampling
distribution. Computed from sample data, the test statistic might be a mean score,
proportion, difference between means, difference between proportions, z-score, t-
score, chi-square, etc. Given a test statistic and its sampling distribution, a
researcher can assess probabilities associated with the test statistic. If the test
statistic probability is less than the significance level, the null hypothesis is
rejected.
Analyze sample data. Using sample data, perform computations called for in the analysis
plan.
o Test statistic. When the null hypothesis involves a mean or proportion, use either
of the following equations to compute the test statistic:
Test statistic = (Statistic − Parameter) / (Standard deviation of statistic)
Test statistic = (Statistic − Parameter) / (Standard error of statistic)
where Parameter is the value appearing in the null hypothesis, and Statistic is the
point estimate of Parameter. As part of the analysis, you may need to compute the
standard deviation or standard error of the statistic. Previously, we presented
common formulas for the standard deviation and standard error.
When the parameter in the null hypothesis involves categorical data, you may use
a chi-square statistic as the test statistic. Instructions for computing a chi-square
test statistic are presented in the lesson on the chi-square goodness of fit test.
Interpret the results. If the sample findings are unlikely, given the null hypothesis, the
researcher rejects the null hypothesis. Typically, this involves comparing the P-value to
the significance level, and rejecting the null hypothesis when the P-value is less than the
significance level.
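Putting the steps together, here is a minimal sketch of a one-sample test of a mean in the generic (Statistic − Parameter) / standard error form; the data and hypothesized mean are made up, and with a sample this small a t distribution would normally be used (see the t-test section later in these notes):

# One-sample z-style test following the steps above (made-up data)
import math
from scipy.stats import norm

sample = [49.1, 50.4, 48.7, 51.2, 50.9, 49.5, 50.1, 48.9, 50.6, 49.8]
mu_0 = 50.0          # value stated in the null hypothesis
alpha = 0.05         # significance level from the analysis plan

n = len(sample)
mean = sum(sample) / n
s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
se = s / math.sqrt(n)                    # standard error of the mean

z = (mean - mu_0) / se                   # (Statistic - Parameter) / SE
p_value = 2 * norm.sf(abs(z))            # two-tailed P-value

print(z, p_value)
print("reject H0" if p_value < alpha else "fail to reject H0")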
If school administrators wished to conduct a survey assessing the popularity of pizza on the
cafeteria menu, they could stop students on the way to the library and ask them the survey
questions. Although this non-probability sampling type is a convenient way to conduct a survey,
it’s not as accurate or rigorous as some probability sampling modalities.
In any field of scholarly research, researchers must set up a process that assures that the different
members of a population have an equal chance of selection. This allows researchers to draw
some general conclusions beyond those people included in the study. Another reason for
probability sampling is the need to eliminate any possible researcher bias. Returning to the pizza
survey example, the survey administrator might not be inclined to stop the troublemaker who
threw water balloons in the cafeteria last week.
Simple Random Sampling
Simple random sampling is akin to pulling a number out of a hat. However, in a large population,
it can be time-consuming to write down 3000 names on slips of paper to draw from a hat. An
easier way to draw the sample for the pizza survey is to use a random number table to choose
students. Administrators could use the last two digits of the students' social security numbers to
identify the table column and the first two digits to identify the row. However, just using the luck
of the draw may not provide the administrators with a good representation of subgroups in the
student population.
Stratified Sampling
This sampling method involves dividing the population into subgroups based on variables known
about those subgroups, and then taking a simple random sample of each subgroup. This would
assure the administrator that he was accurately representing not only the overall population, but
also key subgroups, such as students with low attendance or minority groups. This method can
be tricky for the uninitiated, as the researcher must decide what weight to assign to each
stratification variable.
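A short sketch of simple random and stratified sampling with Python's random module, using a made-up student roster:

# Simple random vs. stratified sampling (made-up roster)
import random

random.seed(0)
students = [{"id": i, "grade": random.choice([9, 10, 11, 12])} for i in range(3000)]

# Simple random sample of 100 students
simple_sample = random.sample(students, 100)

# Stratified sample: 25 students from each grade level
stratified_sample = []
for grade in (9, 10, 11, 12):
    stratum = [s for s in students if s["grade"] == grade]
    stratified_sample.extend(random.sample(stratum, 25))

print(len(simple_sample), len(stratified_sample))  # 100 100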
Cluster Sampling
This stepwise process is useful for those who know little about the population they’re studying.
First, the researcher would divide the population into clusters (usually geographic boundaries).
Then, the researcher randomly samples the clusters. Finally, the researcher must measure all
units within the sampled clusters. Researchers use this method when economy of administration
is important. Because a school population is confined to a three-block area, the school
administrators wouldn’t need to get so elaborate.
Multistage Sampling
This is the most complex sampling strategy. The researcher combines simpler sampling methods
to address sampling needs in the most effective way possible. For example, the administrator
might begin with a cluster sample of all schools in the district. Then he might set up a stratified
sampling process within clusters. Within schools, the administrator could conduct a simple
random sample of classes or grades. By combining various methods, researchers achieve a rich
variety of results useful in different contexts.
NON-PROBABILITY SAMPLING
Non-probability sampling is a sampling technique where the samples are gathered in a process that does
not give all the individuals in the population equal chances of being selected.
The downside of this is that an unknown proportion of the entire population was not sampled.
This entails that the sample may or may not represent the entire population accurately. Therefore,
the results of the research cannot be used in generalizations pertaining to the entire population.
CONSECUTIVE SAMPLING
Consecutive sampling is very similar to convenience sampling except that it seeks to include ALL
accessible subjects as part of the sample. This non-probability sampling technique can be considered
the best of all non-probability samples because it includes all subjects that are available, which makes
the sample a better representation of the entire population.
QUOTA SAMPLING
Quota sampling is a non-probability sampling technique wherein the researcher ensures equal or
proportionate representation of subjects depending on which trait is considered as basis of the quota.
For example, if the basis of the quota is college year level and the researcher needs equal
representation, with a sample size of 100, he must select 25 1st year students, 25 2nd year
students, 25 3rd year students and 25 4th year students. The bases of the quota are usually age,
gender, education, race, religion and socioeconomic status.
JUDGMENTAL SAMPLING
Judgmental sampling is more commonly known as purposive sampling. In this type of sampling, subjects
are chosen to be part of the sample with a specific purpose in mind. With judgmental sampling, the
researcher believes that some subjects are more fit for the research compared to other individuals. This
is the reason why they are purposively chosen as subjects.
SNOWBALL SAMPLING
Snowball sampling is usually done when there is a very small population size. In this type of sampling,
the researcher asks the initial subject to identify another potential subject who also meets the criteria of
the research. The downside of using a snowball sample is that it is hardly representative of the
population.
Statistical inference
• Population – a collection of objects having some common characteristic of interest under
consideration for a statistical investigation.
• Sampling error – the inherent and unavoidable error caused by approximating a
characteristic of the population from a sample.
• Random sample – a sample of n objects selected from a population in such a way that each
object has an equal probability of being selected.
• Confidence interval and confidence limits – In order to estimate the population mean we
cannot draw every possible sample from the entire population. So we set up certain
limits on both sides of the sample mean, on the basis that the means of samples are
normally distributed around the population mean. These limits are called confidence limits,
and the range between the two is called the confidence interval.
• The field of statistical inference consists of those methods used to make decisions or draw
conclusions about a population. These methods utilize the information contained in a
sample from the population in drawing conclusions.
Point Estimation
Hypothesis Testing
For example, suppose that we are interested in the burning rate of a solid propellant used to
power aircrew escape systems.
• Suppose that our interest focuses on the mean burning rate (a parameter of this
distribution).
• Specifically, we are interested in deciding whether or not the mean burning rate is 50
centimeters per second.
• Critical region – the set of all those samples which lead to the rejection of the null
hypothesis.
Test of a Hypothesis
• A procedure leading to a decision about a particular hypothesis
• If this information is consistent with the hypothesis, then we will conclude that the
hypothesis is true; if this information is inconsistent with the hypothesis, we will
conclude that the hypothesis is false.
• For example, consider the propellant burning rate problem where we are testing H0:
μ = 50 centimeters per second against H1: μ ≠ 50 centimeters per second.
Suppose that the true value of the mean is μ = 52. When n = 10, we found that β =
0.2643, so the power of this test is 1 − β = 1 − 0.2643 = 0.7357 when μ = 52.
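The β quoted above can be reproduced numerically. The sketch below assumes σ = 2.5 cm/s and an acceptance region of 48.5 ≤ x̄ ≤ 51.5 (neither value is stated in these notes); with those assumptions β works out to about 0.2643:

# Type II error (beta) and power for the burning-rate test
# Assumed values (not given in the notes): sigma = 2.5, acceptance region 48.5..51.5
import math
from scipy.stats import norm

sigma, n = 2.5, 10
se = sigma / math.sqrt(n)
true_mu = 52.0

# beta = P(accept H0 | mu = 52) = P(48.5 <= xbar <= 51.5 | mu = 52)
beta = norm.cdf(51.5, loc=true_mu, scale=se) - norm.cdf(48.5, loc=true_mu, scale=se)
power = 1 - beta

print(round(beta, 4), round(power, 4))   # ~0.2643 and ~0.7357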
Types
• One sample t test – is used to compare the mean of a single sample with the
population mean. For example, an economist wants to know if the per capita income of a
particular region is the same as the national average.
• Two sample t test – is used to compare the means of two independent samples. For
example, an economist wants to compare the per capita income of two different regions.
Z test
• For a z test, the population mean and the population standard deviation should be known.
F statistic
• ANOVA uses the F statistic, which tests whether the means of the groups formed by one
independent variable (or a combination of independent variables) are significantly
different. It is based on a comparison of variances.
Chi-square statistic
• Is used to test the hypothesis that two categorical variables are independent of each
other.
A one sample t-test is a hypothesis test for answering questions about the mean where
the data are a random sample of independent observations from an underlying normal
distribution N(µ, σ²), where σ² is unknown.
That is, the sample has been drawn from a population of given mean and unknown
variance (which therefore has to be estimated from the sample).
The null hypothesis, H0: µ = µ0, is tested against one of the following alternative hypotheses,
depending on the question posed:
H1: µ is not equal to µ0
H1: µ > µ0
H1: µ < µ0
When carrying out a two sample t-test, it is usual to assume that the variances for the
two populations are equal, i.e. σ1² = σ2².
That is, the two samples have both been drawn from the same population. The null
hypothesis, H0: µ1 = µ2, is tested against one of the following alternative hypotheses,
depending on the question posed.
H1: µ1 is not equal to µ2
H1: µ1 > µ2
H1: µ1 < µ2
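A brief sketch of both tests using scipy.stats on made-up samples:

# One-sample and two-sample t-tests (made-up data)
from scipy import stats

sample1 = [50.2, 49.8, 51.1, 48.9, 50.5, 49.4, 50.9, 50.0]
sample2 = [52.1, 51.4, 50.8, 52.6, 51.9, 50.7, 52.3, 51.1]

# One-sample: is the mean of sample1 equal to a hypothesized mu0 = 50?
t1, p1 = stats.ttest_1samp(sample1, popmean=50.0)

# Two-sample: do sample1 and sample2 come from populations with equal means?
t2, p2 = stats.ttest_ind(sample1, sample2, equal_var=True)

print(t1, p1)
print(t2, p2)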
Definition of Time Series: An ordered sequence of values of a variable at equally spaced time
intervals. Time series occur frequently when looking at industrial data.
Applications: The usage of time series models is twofold:
Obtain an understanding of the underlying forces and structure that produced the observed data.
Fit a model and proceed to forecasting, monitoring or even feedback and feedforward control.
Economic Forecasting
Sales Forecasting
Budgetary Analysis
Yield Projections
Inventory Studies
Workload Projections
Utility Studies
Census Analysis.
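As a minimal illustration of fitting a model and forecasting, the sketch below fits a simple linear trend (one of many possible models) to a short made-up monthly series and projects it forward:

# Fit a linear trend to a made-up monthly series and forecast the next 3 points
import numpy as np

sales = np.array([112, 118, 121, 127, 133, 136, 141, 147, 150, 156, 161, 165])
t = np.arange(len(sales))

slope, intercept = np.polyfit(t, sales, 1)       # simple trend model
future_t = np.arange(len(sales), len(sales) + 3)
forecast = intercept + slope * future_t

print(np.round(forecast, 1))   # projected values for the next three periods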
Business benefits
What you do with time-series forecasting depends on your business activity.
The sales history of each product constitutes a time-series to be forecasted. The sales forecasts are
used to optimize inventory levels. Too much inventory, and your expenses go up. Too little
inventory, and sales opportunities are lost in out-of-stock situations.
Time-series for Manufacturers
The production history and/or the input consumptions constitute time-series to be forecasted.
The input consumptions are forecasted in order to minimize inventory levels. The production
history can be forecasted and used as an approximation of future demand. Forecasting the
demand enables efficient capacity planning.
Forecasting in Business
Business leaders and economists are continually involved in the process of trying to forecast, or
predict, the future of business in the economy. Business leaders engage in this process because
much of what happens in businesses today depends on what is going to happen in the future.
Qualitative forecasting models have often proven to be most effective for short-term projections.
In this method of forecasting, which works best when the scope is limited, experts in the
appropriate fields are asked to agree on a common forecast. Two methods are used frequently.
Delphi Method. This method involves asking various experts what they anticipate will happen in the
future relative to the subject under consideration. Experts in the automotive industry, for example, might
be asked to forecast likely innovative enhancements for cars five years from now. They are not expected
to be precise, but rather to provide general opinions.
Market Research Method. This method involves surveys and questionnaires about people's subjective
reactions to changes. For example, a company might develop a new way to launder clothes; after people
have had an opportunity to try the new method, they would be asked for feedback about how to improve
the processes or how it might be made more appealing for the general public. This method is difficult
because it is hard to identify an appropriate sample that is representative of the larger audience for whom
the product is intended.
Time-Series Methods. This forecasting model uses historical data to try to predict future events. For
example, assume that you are interested in knowing how long a recession will last. You might look at all
past recessions and the events leading up to and surrounding them and then, from that data, try to predict
how long the current recession will last.
A specific variable in the time series is identified by the series name and date. If gross domestic product
(GDP) is the variable, it might be identified as GDP2000.1 for the first-quarter statistics for the year 2000.
This is just one example, and different groups might use different methods to identify variables in a time
period.
Many government agencies prepare and release time-series data. The Federal Reserve, for example,
collects data on monetary policy and financial institutions and publishes that data in the Federal Reserve
Bulletin. These data become the foundation for making decisions about regulating the growth of the
economy.
Time-series models provide accurate forecasts when the changes that occur in the variable's environment
are slow and consistent. When large-degree changes occur, the forecasts are not reliable for the long term.
Since time-series forecasts are relatively easy and inexpensive to construct, they are used quite
extensively.
The Indicator Approach. The U.S. government is a primary user of the indicator approach of
forecasting. The government uses such indicators as the Composite Index of Leading, Lagging, and
Coincident Indicators, often referred to as Composite Indexes. The indexes predict by assuming that past
trends and relationships will continue into the future. The government indexes are made by averaging the
behavior of the different indicator series that make up each composite series.
The timing and strength of each indicator series relationship with general business activity, reflected in
the business cycle, change over time. This relationship makes forecasting changes in the business cycle
difficult.
Econometric Models. Econometric models are causal models that statistically identify the relationships
between variables and how changes in one or more variables cause changes in another variable.
Econometric models then use the identified relationship to predict the future. Econometric models are
also called regression models.
There are two types of data used in regression analysis. Economic forecasting models predominantly use
time-series data, where the values of the variables change over time. Additionally, cross-section data,
which capture the relationship between variables at a single point in time, are used. A lending institution,
for example, might want to determine what influences the sale of homes. It might gather data on home
prices, interest rates, and statistics on the homes being sold, such as size and location. This is the cross-
section data that might be used with time-series data to try to determine such things as what size home
will sell best in which location.
An econometric model is a way of determining the strength and statistical significance of a hypothesized
relationship. These models are used extensively in economics to prove, disprove, or validate the existence
of a causal relationship between two or more variables. It is obvious that this model is highly
mathematical, using different statistical equations.
For the sake of simplicity, mathematical analysis is not addressed here. Just as there are these qualitative
and quantitative forecasting models, there are others equally as sophisticated; however, the discussion
here should provide a general sense of the nature of forecasting models.
1. Identification of the problem. Forecasters must identify what is going to be forecasted, or what is
of primary concern. There must be a timeline attached to the forecasting period. This will help the
forecasters to determine the methods to be used later.
2. Theoretical considerations. It is necessary to determine what forecasting has been done in the
past using the same variables and how relevant these data are to the problem that is currently
under consideration. It must also be determined what economic theory has to say about the
variables that might influence the forecast.
3. Data concerns. How easy will it be to collect the data needed to be able to make the forecasts is a
significant issue.
4. Determination of the assumption set. The forecaster must identify the assumptions that will be
made about the data and the process.
5. Modeling methodology. After careful examination of the problem, the types of models most
appropriate for the problem must be determined.
6. Preparation of the forecast. This is the analysis part of the process. After the model to be used is
determined, the analysis can begin and the forecast can be prepared.
7. Forecast verification. Once the forecasts have been made, the analyst must determine whether
they are reasonable and how they can be compared against the actual behavior of the data.
Forecasting Concerns
Forecasting does present some problems. Even though very detailed and sophisticated
mathematical models might be used, they do not always predict correctly. There are some who
would argue that the future cannot be predicted at all— period!
Some of the concerns about forecasting the future are that (1) predictions are made using
historical data, (2) they fail to account for unique events, and (3) they ignore coevolution
(developments created by our own actions). Additionally, there are psychological challenges
implicit in forecasting. An example of a psychological challenge is when plans based on
forecasts that use historical data become so confining as to prohibit management freedom. It is
also a concern that many decision makers feel that because they have the forecasting data in hand
they have control over the future.
Binomial Distribution
Binomial Experiment
A binomial experiment (also known as a sequence of Bernoulli trials) is a statistical experiment that
has the following properties:
The experiment consists of n repeated trials.
Each trial can result in just two possible outcomes, called success and failure.
The probability of success, denoted by P, is the same on every trial.
The trials are independent; the outcome of one trial does not affect the outcome of the others.
Notation
The following notation is helpful when we talk about binomial probability.
Binomial Distribution
A binomial random variable is the number of successes x in n repeated trials of a binomial
experiment. The probability distribution of a binomial random variable is called a binomial
distribution (also known as a Bernoulli distribution).
Suppose we flip a coin two times and count the number of heads (successes). The binomial
random variable is the number of heads, which can take on values of 0, 1, or 2. The binomial
distribution is presented below.
Number of heads (x)    Probability P(x)
0                      0.25
1                      0.50
2                      0.25
Binomial Probability
The binomial probability refers to the probability that a binomial experiment results in exactly
x successes. For example, in the above table, we see that the binomial probability of getting
exactly one head in two coin flips is 0.50.
Given x, n, and P, we can compute the binomial probability based on the following formula:
Binomial Formula. Suppose a binomial experiment consists of n trials and results in x successes. If the
probability of success on an individual trial is P, then the binomial probability is:
P(x) = nCx · P^x · Q^(n − x),  where Q = 1 − P
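A quick sketch of this formula, reproducing the coin-flip table above:

# Binomial probability P(x) = nCx * P^x * Q^(n-x)
from math import comb

def binomial_prob(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Two coin flips, probability of success (heads) = 0.5
for x in range(3):
    print(x, binomial_prob(x, 2, 0.5))   # 0.25, 0.50, 0.25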
The Poisson distribution is another discrete probability distribution. It is named after Simeon-
Denis Poisson (1781-1840), a French mathematician. The Poisson distribution depends only on
the average number of occurrences per unit of time or space; there is no n and no p. The Poisson
probability distribution provides a close approximation to the binomial probability distribution
when n is large and p is quite small (or, by symmetry, quite close to 1).
The Poisson distribution is most commonly used to model the number of random occurrences of
some phenomenon in a specified unit of space or time. For example, the number of phone calls
received by an office in an hour, or the number of defects per metre of cloth.
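A brief sketch of Poisson probabilities for an assumed average of 3 occurrences per unit time:

# Poisson probabilities for an assumed average of 3 occurrences per hour
from math import exp, factorial

def poisson_prob(k, lam):
    # P(k occurrences) = e^(-lambda) * lambda^k / k!
    return exp(-lam) * lam**k / factorial(k)

lam = 3.0   # hypothetical average occurrences per unit time
for k in range(5):
    print(k, round(poisson_prob(k, lam), 4))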
Normal Distribution
The normal distribution, also called the Gaussian distribution, is an important family of
continuous probability distributions, applicable in many fields. The importance of the normal
distribution as a model of quantitative phenomena in the natural and behavioral sciences is due in
part to the central limit theorem.
3) The normal curve extends indefinitely in both directions, approaching, but never touching,
the horizontal axis as it does so.
4) The curve is unimodal and symmetric about the mean.
That is, 50% of the area (data) under the curve lies to the left of the mean and 50% of the
area (data) under the curve lies to the right of the mean.
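A small sketch of these properties of the normal curve using scipy.stats:

# Areas under the normal curve
from scipy.stats import norm

mu, sigma = 0, 1   # standard normal; any mu and sigma behave the same way
print(norm.cdf(mu, loc=mu, scale=sigma))   # 0.5: half the area lies left of the mean
print(norm.cdf(1) - norm.cdf(-1))          # ~0.683: area within one sigma of the mean
print(norm.sf(4))                          # tail area far from the mean is tiny, but never exactly 0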